Overview
LightLM is a series of three small language models trained on open-access data (Cosmopedia v2). We present three configurations (two dense and one with Mixture-of-Experts) that explore how to distribute parameters between the Attention and Feed-Forward layers. Despite a relatively modest training corpus of ~28B tokens, these models approach or surpass the performance of other models in their parameter range (e.g., GPT-neo-125M, MobileLLM).
Model 1 (Model Attn)
- Layers: 34
- Attention dim: 832
- FFN dim: 556
- Context length: 1536
Model 2 (Model FFN)
- Layers: 32
- Attention dim: 512
- FFN dim: 512 × 4 = 2048
- Context length: 1536
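To make the Attention-vs-FFN trade-off in the two dense configurations concrete, the sketch below counts rough per-layer parameters for each. It is only an illustration under assumptions the card does not state: the residual width equals the attention dim, attention uses four square projections (Q, K, V, O), the FFN is an ungated two-matrix block, and biases and embeddings are ignored.

```python
# Rough per-layer parameter accounting for the two dense LightLM configurations.
# Assumptions (not stated in the card): residual width == attention dim,
# four square attention projections (Q, K, V, O), ungated two-matrix FFN,
# no biases; embeddings are excluded.

def attn_params(d_model: int) -> int:
    """Q, K, V and output projections, each d_model x d_model."""
    return 4 * d_model * d_model

def ffn_params(d_model: int, d_ffn: int) -> int:
    """Up-projection (d_model x d_ffn) plus down-projection (d_ffn x d_model)."""
    return 2 * d_model * d_ffn

configs = {
    "Model Attn": {"layers": 34, "d_model": 832, "d_ffn": 556},
    "Model FFN":  {"layers": 32, "d_model": 512, "d_ffn": 2048},
}

for name, c in configs.items():
    a = c["layers"] * attn_params(c["d_model"])
    f = c["layers"] * ffn_params(c["d_model"], c["d_ffn"])
    print(f"{name}: attention ~ {a / 1e6:.1f}M params, FFN ~ {f / 1e6:.1f}M params")
```

Under these assumptions, Model Attn concentrates most of its per-layer budget in attention, while Model FFN shifts it toward the feed-forward block.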
Model 3 (Model MoE 2+1)
- Layers: 32
- Attention dim: 384 (experimental setting)
- FFN: 2 routed experts + 1 shared expert
- Each expert has 512 × 2 = 1024 hidden units
- 100% of parameters are active for every token; the router assigns per-token mixing weights to the routed experts (see the code sketch below this list)
- Context length: 1024
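The routing scheme in Model 3 can be written down compactly. The PyTorch sketch below is a minimal interpretation, not the released implementation: it assumes a residual width of 384 (the card's attention dim, used here only as a placeholder), GELU experts with 1024 hidden units, and a softmax router that mixes both routed experts for every token while the shared expert is always added, so all parameters stay active.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    """Sketch of the MoE block: 2 routed experts + 1 shared expert.
    All experts run for every token; the router only reweights the
    routed experts' outputs per token."""

    def __init__(self, d_model: int = 384, d_hidden: int = 1024, n_routed: int = 2):
        super().__init__()

        def expert() -> nn.Sequential:
            return nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))

        self.routed = nn.ModuleList(expert() for _ in range(n_routed))
        self.shared = expert()                      # always applied, outside the router
        self.router = nn.Linear(d_model, n_routed)  # per-token logits over routed experts

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        weights = F.softmax(self.router(x), dim=-1)                 # (batch, seq, n_routed)
        routed = torch.stack([e(x) for e in self.routed], dim=-1)   # (batch, seq, d_model, n_routed)
        mixed = (routed * weights.unsqueeze(-2)).sum(dim=-1)        # per-token weighted sum
        return mixed + self.shared(x)

# Example forward pass at the placeholder dims.
y = MoEFFN()(torch.randn(2, 16, 384))
print(y.shape)  # torch.Size([2, 16, 384])
```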
Results
| Model | #Params | ARC-c (acc %) | WinoGrande (acc %) |
|---|---|---|---|
| GPT-neo-125M | 125M | 24.8 | 50.7 |
| Pythia-160M | 162M | 25.3 | 50.9 |
| RWKV-169M | 169M | 25.3 | 51.5 |
| MobileLLM-125M | 125M | 27.1 | 53.1 |
| LightLM (Attn) | 146M | 25.1 | 52.0 |
| LightLM (FFN) | 146M | 27.2 | 47.5 |
| LightLM (MoE) | 144M | 26.3 | 52.8 |
Example Output
Prompt: "Hello, I am a language model,"
Hello, I am a language model, and I can help you learn more about the language you are interested in.
Let's start with the basics.
Hello, I am a language model, and I can help you learn some new words and phrases. Maybe you could try
saying "hello" in English first, then move on to Spanish, ...