Overview

LightLM is a series of three small language models trained on open-access data (Cosmopedia v2). We present three configurations (one with a Mixture-of-Experts feed-forward block and two without) that explore how to distribute parameters between attention and feed-forward layers. Despite a relatively modest training corpus of ~28B tokens, these models approach or surpass the performance of other models in their parameter range (e.g., MobileLLM-125M, GPT-Neo-125M).

  1. Model 1 (Model Attn)

    • Layers: 34
    • Attention dim: 832
    • FFN dim: 556
    • Context length: 1536
  2. Model 2 (Model FFN)

    • Layers: 32
    • Attention dim: 512
    • FFN dim: 512 × 4 = 2048
    • Context length: 1536
  3. Model 3 (Model MoE 2+1)

    • Layers: 32
    • Attention dim: 384 (experimental setting)
    • FFN: 2 routed experts + 1 shared expert (see the sketch after this list)
      • Each expert has 512 × 2 = 1024 hidden units
      • 100% of parameters are active; the router assigns per-token mixing weights to the routed experts
    • Context length: 1024
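
To make the routing scheme above concrete, here is a minimal sketch (PyTorch) of one way a "2 routed + 1 shared expert" FFN with per-token mixing weights could be wired up. Only the dimensions (model width 384, 1024 hidden units per expert) come from the list above; the module layout, GELU activation, and softmax router are illustrative assumptions, not the released LightLM implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    """2 routed experts + 1 shared expert; every expert stays active (soft routing)."""

    def __init__(self, d_model: int = 384, d_hidden: int = 1024, n_routed: int = 2):
        super().__init__()
        # Routed experts: the router mixes their outputs with per-token weights.
        self.routed = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
             for _ in range(n_routed)]
        )
        # Shared expert: applied to every token unconditionally.
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )
        # Router: one logit per routed expert for each token.
        self.router = nn.Linear(d_model, n_routed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        weights = F.softmax(self.router(x), dim=-1)                      # (B, S, n_routed)
        routed = torch.stack([expert(x) for expert in self.routed], -1)  # (B, S, d_model, n_routed)
        mixed = (routed * weights.unsqueeze(-2)).sum(-1)                 # per-token weighted mix
        return mixed + self.shared(x)

layer = MoEFFN()
print(layer(torch.randn(2, 16, 384)).shape)  # torch.Size([2, 16, 384])
```

Because every expert contributes to every token, the layer is dense in compute: the MoE structure here organizes parameters rather than sparsifying activation.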

Results

| Model           | #Params | ARC-c | WinoGrande |
|-----------------|---------|-------|------------|
| GPT-Neo-125M    | 125M    | 24.8  | 50.7       |
| Pythia-160M     | 162M    | 25.3  | 50.9       |
| RWKV-169M       | 169M    | 25.3  | 51.5       |
| MobileLLM-125M  | 125M    | 27.1  | 53.1       |
| LightLM (Attn)  | 146M    | 25.1  | 52.0       |
| LightLM (FFN)   | 146M    | 27.2  | 47.5       |
| LightLM (MoE)   | 144M    | 26.3  | 52.8       |
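
ARC-c and WinoGrande scores like these are typically reported as zero-shot accuracy with EleutherAI's lm-evaluation-harness. The sketch below assumes the lm-eval 0.4+ Python API and uses a placeholder repository id (`your-org/LightLM-FFN`); it illustrates the general procedure, not the exact setup behind this table.

```python
# Illustrative only: evaluating a causal LM on ARC-c and WinoGrande with
# EleutherAI's lm-evaluation-harness (pip install lm-eval).
# "your-org/LightLM-FFN" is a placeholder repo id, not an official checkpoint.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/LightLM-FFN",
    tasks=["arc_challenge", "winogrande"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```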

Example Output
Prompt: "Hello, I am a language model,"

Hello, I am a language model, and I can help you learn more about the language you are interested in. Let's start with the basics.

Hello, I am a language model, and I can help you learn some new words and phrases. Maybe you could try saying "hello" in English first, then move on to Spanish, ...

🔗 View on GitHub
