Update README.md
README.md CHANGED
@@ -42,13 +42,13 @@ inference:

Pragna-1B is a decoder-only transformer model inspired by TinyLlama, featuring the following specifications:

-Layers: 22
-Attention Heads: 32
-Context Length: 2048
-Hidden Dimension: 2048
-Expansion Dimension: 5632
-Vocabulary Size: 69632
-This model incorporates Rotary Positional Encoding to infuse positional information into the embeddings, utilising a base of 10,000. It employs RMSNorm with an epsilon value of 1e-5 and the Sigmoid Linear Unit (SiLU) as the activation function. Additionally, Pragna-1B adopts Grouped Query Attention, an alternative to Multi-Head Attention, which enhances training and inference speed while reducing memory bandwidth. This also supports the use of lower-compute devices for inference tasks.
+- Layers: 22
+- Attention Heads: 32
+- Context Length: 2048
+- Hidden Dimension: 2048
+- Expansion Dimension: 5632
+- Vocabulary Size: 69632
+- This model incorporates Rotary Positional Encoding to infuse positional information into the embeddings, utilising a base of 10,000. It employs RMSNorm with an epsilon value of 1e-5 and the Sigmoid Linear Unit (SiLU) as the activation function. Additionally, Pragna-1B adopts Grouped Query Attention, an alternative to Multi-Head Attention, which enhances training and inference speed while reducing memory bandwidth. This also supports the use of lower-compute devices for inference tasks.

Pragna-1B is trained on our proprietary platform, GenAI Studio, a modular AI Developer Platform designed to support any GenAI model architecture. It is capable of scaling across thousands of GPUs or accelerators and is built to be fault-tolerant. The development of this model leveraged Triton, an open-source language from OpenAI, for crafting high-performance custom fused CUDA kernels for various operations. Furthermore, the model uses Fully Sharded Data Parallel (FSDP) for distributed and parallel training and incorporates the state-of-the-art FlashAttention2 to accelerate training and inference.
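For reference, here is a minimal sketch of how the hyperparameters listed in the diff map onto a Llama-style configuration, assuming a Hugging Face `LlamaConfig` and a TinyLlama-like layout. The grouped-query key/value head count is not stated in the README, so the value below is an illustrative assumption, not the released configuration.

```python
# Sketch only: maps the README's numbers onto a Llama-style config.
from transformers import LlamaConfig

config = LlamaConfig(
    num_hidden_layers=22,          # Layers
    num_attention_heads=32,        # Attention heads
    num_key_value_heads=4,         # assumed GQA grouping (not given in the README)
    hidden_size=2048,              # Hidden dimension
    intermediate_size=5632,        # Expansion dimension
    max_position_embeddings=2048,  # Context length
    vocab_size=69632,              # Vocabulary size
    rope_theta=10000.0,            # Rotary positional encoding base
    rms_norm_eps=1e-5,             # RMSNorm epsilon
    hidden_act="silu",             # SiLU activation
)
```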
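The training paragraph mentions FSDP and FlashAttention2; a bare-bones sketch of that combination using stock PyTorch and transformers is shown below. The repo id and the minimal FSDP wrapping are assumptions for illustration, not the GenAI Studio training pipeline.

```python
# Sketch only: FSDP sharding plus FlashAttention-2 via the transformers loader.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained(
    "soketlabs/pragna-1b",                    # assumed repo id
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)

# Shard parameters, gradients, and optimizer state across ranks.
model = FSDP(model.cuda())
```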