upperwal committed · Commit 3f98724 · verified · 1 parent: 7b1ab7c

Update README.md

Files changed (1): README.md (+7 -7)
README.md CHANGED
@@ -42,13 +42,13 @@ inference:
 
 Pragna-1B is a decoder-only transformer model inspired by TinyLlama, featuring the following specifications:
 
-Layers: 22
-Attention Heads: 32
-Context Length: 2048
-Hidden Dimension: 2048
-Expansion Dimension: 5632
-Vocabulary Size: 69632
-This model incorporates Rotary Positional Encoding to infuse positional information into the embeddings, utilising a base of 10,000. It employs RMSNorm with an epsilon value of 1e-5 and the Sigmoid Linear Unit (SiLU) as its activation function. Additionally, Pragna-1B adopts Grouped Query Attention, an alternative to Multi-Head Attention that improves training and inference speed while reducing memory bandwidth, making inference practical on lower-compute devices.
+- Layers: 22
+- Attention Heads: 32
+- Context Length: 2048
+- Hidden Dimension: 2048
+- Expansion Dimension: 5632
+- Vocabulary Size: 69632
+- This model incorporates Rotary Positional Encoding to infuse positional information into the embeddings, utilising a base of 10,000. It employs RMSNorm with an epsilon value of 1e-5 and the Sigmoid Linear Unit (SiLU) as its activation function. Additionally, Pragna-1B adopts Grouped Query Attention, an alternative to Multi-Head Attention that improves training and inference speed while reducing memory bandwidth, making inference practical on lower-compute devices.
 
 Pragna-1B is trained on our proprietary platform, GenAI Studio, a modular AI Developer Platform designed to support any GenAI model architecture. It can scale across thousands of GPUs or accelerators and is built to be fault-tolerant. Development of this model leveraged Triton, OpenAI's open-source language for writing high-performance custom fused GPU kernels, for various operations. The model also uses Fully Sharded Data Parallel (FSDP) for distributed and parallel training and incorporates FlashAttention-2 to accelerate training and inference.
 
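
For reference, here is a minimal sketch of how the specifications above map onto a Hugging Face `LlamaConfig` (Pragna-1B follows the TinyLlama/Llama recipe: RoPE, RMSNorm, SiLU, and Grouped Query Attention). The GQA group count (`num_key_value_heads`) is not stated in this README, so the value below is a placeholder assumption.

```python
# Minimal sketch, assuming the Llama/TinyLlama architecture described above.
from transformers import LlamaConfig

config = LlamaConfig(
    num_hidden_layers=22,          # Layers: 22
    num_attention_heads=32,        # Attention Heads: 32
    max_position_embeddings=2048,  # Context Length: 2048
    hidden_size=2048,              # Hidden Dimension: 2048
    intermediate_size=5632,        # Expansion Dimension: 5632
    vocab_size=69632,              # Vocabulary Size: 69632
    rope_theta=10000.0,            # Rotary Positional Encoding base
    rms_norm_eps=1e-5,             # RMSNorm epsilon
    hidden_act="silu",             # SiLU activation
    num_key_value_heads=4,         # hypothetical: GQA group count is not given in the README
)
print(config)
```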
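Likewise, a minimal sketch of the training-stack ingredients named above, FSDP sharding plus FlashAttention-2, assuming the model's Hub id is `soketlabs/pragna-1b` and that the `flash-attn` package is installed; GenAI Studio's actual training loop and its custom Triton kernels are proprietary and not shown here.

```python
# Illustrative only: FSDP + FlashAttention-2 wiring, not GenAI Studio's code.
# Launch with: torchrun --nproc_per_node=<num_gpus> train_sketch.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained(
    "soketlabs/pragna-1b",                    # assumed Hub id
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn
)
model = FSDP(model.cuda())  # shards parameters, gradients, and optimiser state
```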