Update README.md
README.md CHANGED
@@ -42,13 +42,13 @@ inference:

Pragna-1B is a decoder-only transformer model inspired by TinyLlama, featuring the following specifications:

-Layers: 22
-Attention Heads: 32
-Context Length: 2048
-Hidden Dimension: 2048
-Expansion Dimension: 5632
-Vocabulary Size: 69632
-This model incorporates Rotary Positional Encoding to infuse positional information into the embeddings, utilising a base of 10,000. It employs RMSNorm with an epsilon value of 1e-5 and the Sigmoid Linear Unit (SiLU) as the activation function. Additionally, Pragna-1B adopts Grouped Query Attention, an alternative to Multi-Head Attention, which enhances training and inference speed while reducing memory bandwidth. This also supports the use of lower-compute devices for inference tasks.
+- Layers: 22
+- Attention Heads: 32
+- Context Length: 2048
+- Hidden Dimension: 2048
+- Expansion Dimension: 5632
+- Vocabulary Size: 69632
+- This model incorporates Rotary Positional Encoding to infuse positional information into the embeddings, utilising a base of 10,000. It employs RMSNorm with an epsilon value of 1e-5 and the Sigmoid Linear Unit (SiLU) as the activation function. Additionally, Pragna-1B adopts Grouped Query Attention, an alternative to Multi-Head Attention, which enhances training and inference speed while reducing memory bandwidth. This also supports the use of lower-compute devices for inference tasks.

Pragna-1B is trained on our proprietary platform, GenAI Studio, a modular AI Developer Platform designed to support any GenAI model architecture. It is capable of scaling across thousands of GPUs or accelerators and is built to be fault-tolerant. The development of this model leveraged Triton, an open-source language from OpenAI, for crafting high-performance custom fused CUDA kernels for various operations. Furthermore, the model uses Fully Sharded Data Parallel (FSDP) for distributed and parallel training and incorporates the state-of-the-art FlashAttention2 to accelerate training and inference.
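For reference, here is a minimal sketch of how the hyperparameters listed in the diff map onto a Llama-style configuration, assuming a Hugging Face `LlamaConfig` and a TinyLlama-like layout. The grouped-query key/value head count is not stated in the README, so the value below is an illustrative assumption, not the released configuration.

```python
# Sketch only: maps the README's numbers onto a Llama-style config.
from transformers import LlamaConfig

config = LlamaConfig(
    num_hidden_layers=22,          # Layers
    num_attention_heads=32,        # Attention heads
    num_key_value_heads=4,         # assumed GQA grouping (not given in the README)
    hidden_size=2048,              # Hidden dimension
    intermediate_size=5632,        # Expansion dimension
    max_position_embeddings=2048,  # Context length
    vocab_size=69632,              # Vocabulary size
    rope_theta=10000.0,            # Rotary positional encoding base
    rms_norm_eps=1e-5,             # RMSNorm epsilon
    hidden_act="silu",             # SiLU activation
)
```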
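The training paragraph mentions FSDP and FlashAttention2; a bare-bones sketch of that combination using stock PyTorch and transformers is shown below. The repo id and the minimal FSDP wrapping are assumptions for illustration, not the GenAI Studio training pipeline.

```python
# Sketch only: FSDP sharding plus FlashAttention-2 via the transformers loader.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained(
    "soketlabs/pragna-1b",                    # assumed repo id
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)

# Shard parameters, gradients, and optimizer state across ranks.
model = FSDP(model.cuda())
```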