KodaLite-1.3B (Koda-v0.1)

A 1.27B parameter LLaMA-style decoder-only language model, trained entirely from scratch on 2x NVIDIA L40S GPUs using JAX + Flax NNX, then converted to HuggingFace Transformers format.

TL;DR — KodaLite reaches ~37% average accuracy on standard LLM benchmarks. It is severely undertrained (only 1.64B tokens vs 40B–3T for comparable models), which places it just below GPT-2-124M despite having 10× more parameters. A nice illustration of the Chinchilla scaling law: tokens matter more than parameters at this budget.

Benchmark results (zero-shot, 8 standard tasks)

Evaluated against 8 reference models (the GPT-2 family plus ~1B-parameter peers) on the same 8 benchmarks (HellaSwag, ARC-E/C, WinoGrande, PIQA, BoolQ, OpenBookQA, LAMBADA-OpenAI).

| Rank | Model | Params | Train tokens | Avg accuracy |
|------|-------|--------|--------------|--------------|
| 1 | TinyLlama-1.1B | 1.10B | 3000B | 50.3% |
| 2 | Pythia-1.4B | 1.41B | 300B | 50.2% |
| 3 | GPT-2-XL | 1.56B | 40B | 49.4% |
| 4 | OPT-1.3B | 1.32B | 180B | 49.1% |
| 5 | Pythia-1B | 1.01B | 300B | 47.6% |
| 6 | GPT-2-large | 0.77B | 40B | 46.2% |
| 7 | GPT-2-medium | 0.35B | 40B | 44.2% |
| 8 | GPT-2-124M | 0.12B | 40B | 39.7% |
| 9 | KodaLite-1.3B | 1.27B | 1.64B | 36.8% |

Per-task breakdown

| Task | KodaLite-1.3B | GPT-2-124M | GPT-2-XL | Pythia-1.4B | TinyLlama-1.1B | Random |
|------|---------------|------------|----------|-------------|----------------|--------|
| HellaSwag | 25.65 | 29.22 | 47.94 | 49.21 | 56.2 | 25.0 |
| ARC-Easy | 32.79 | 38.30 | 50.80 | 51.73 | 43.9 | 25.0 |
| ARC-Challenge | 21.50 | 22.70 | 28.16 | 29.01 | 30.0 | 25.0 |
| WinoGrande | 49.57 | 49.49 | 51.93 | 52.88 | 52.2 | 50.0 |
| PIQA | 58.92 | 62.24 | 70.89 | 71.22 | 72.1 | 50.0 |
| BoolQ | 44.34 | 49.76 | 61.59 | 63.70 | 60.6 | 50.0 |
| OpenBookQA | 25.00 | 26.40 | 34.20 | 33.40 | 37.2 | 25.0 |
| LAMBADA (acc / ppl) | 18.22 / 93.8 | 30.84 / 17.5 | 50.79 / 6.4 | 61.03 / 3.8 | | |

Why KodaLite scores below GPT-2-124M (despite being 10× bigger)

The Chinchilla scaling law (DeepMind, 2022) says that a model with N parameters needs roughly 20×N training tokens to be trained compute-optimally:

| Model | Params | Chinchilla target (~20× params) | Actual tokens | Ratio |
|-------|--------|---------------------------------|---------------|-------|
| KodaLite-1.3B | 1.27B | ~25B | 1.64B | 6.5% 🔴 |
| GPT-2-XL | 1.5B | ~30B | 40B | 133% |
| Pythia-1.4B | 1.4B | ~28B | 300B | 1070% |
| TinyLlama-1.1B | 1.1B | ~22B | 3000B | 13600% |

KodaLite has seen only 6.5% of what it would need to be competitive. A bigger but undertrained model scores lower than a smaller but well-trained one. The LAMBADA perplexity (94 vs 17 for GPT-2-124M) is the clearest signal: the base language modeling is not converged.

On PIQA (physical commonsense) the gap is smallest — that kind of knowledge appears to be learned faster than factual knowledge or precise language modeling.
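The ratios above can be reproduced with a few lines of arithmetic (a minimal sketch; the 20× multiplier is the usual Chinchilla rule of thumb):

```python
# Chinchilla rule of thumb: compute-optimal token count ≈ 20 × parameter count
models = {
    "KodaLite-1.3B":  (1.27e9, 1.64e9),
    "GPT-2-XL":       (1.5e9,  40e9),
    "Pythia-1.4B":    (1.4e9,  300e9),
    "TinyLlama-1.1B": (1.1e9,  3000e9),
}

for name, (params, tokens) in models.items():
    target = 20 * params           # Chinchilla-optimal token budget
    ratio = 100 * tokens / target  # percentage of that budget actually seen
    print(f"{name:16s} target ~{target / 1e9:.0f}B  ratio {ratio:.0f}%")
```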

Chat Format

The model uses three plain-text markers (no special tokens): <|user|>, <|assistant|>, <|end|>.

```
<|user|>
Your question
<|assistant|>
Model response
<|end|>
```

Important: <|end|> is NOT a single token (it tokenizes to 5 BPE tokens). Always pass it via the `stop_strings` parameter of `generate()` (along with `tokenizer=`), otherwise the model will run past its natural end of turn.
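Because <|end|> is plain text rather than a single token, any runtime without stop-string support needs manual truncation. A minimal sketch (the helper names are illustrative, not part of the model's API; the prompt format matches the markers above):

```python
def build_prompt(question: str) -> str:
    """Assemble the chat format from the plain-text markers."""
    return f"<|user|>\n{question}\n<|assistant|>\n"

def truncate_at_end(generated: str, marker: str = "<|end|>") -> str:
    """Cut the model output at the first end-of-turn marker, if present."""
    return generated.split(marker)[0]

raw = "Paris is the capital of France.<|end|><|user|>\nAnother turn"
print(truncate_at_end(raw))  # -> Paris is the capital of France.
```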

Usage (Transformers)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("YoAbriel/KodaLite-1.3B")
model = AutoModelForCausalLM.from_pretrained(
    "YoAbriel/KodaLite-1.3B", dtype=torch.bfloat16, device_map="auto"
)

msg = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tok.apply_chat_template(msg, tokenize=False, add_generation_prompt=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs, max_new_tokens=150, do_sample=True, temperature=0.7, top_k=40,
    stop_strings=["<|end|>"], tokenizer=tok,  # tokenizer= is required for stop_strings
)
# Decode only the newly generated tokens, keeping the text markers visible
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=False))
```

Usage (MLX — Apple Silicon)

See YoAbriel/KodaLite-1.3B-mlx.

```python
from mlx_lm import load, stream_generate

model, tok = load("YoAbriel/KodaLite-1.3B-mlx-8bit")

def chat(q):
    prompt = tok.apply_chat_template([{"role": "user", "content": q}], tokenize=False)
    text = ""
    for resp in stream_generate(model, tok, prompt=prompt, max_tokens=150):
        text += resp.text
        if "<|end|>" in text:
            return text.split("<|end|>")[0]  # truncate at the end-of-turn marker
    return text

print(chat("What is the capital of France?"))
```

Usage (llama.cpp / Ollama / LM Studio)

See YoAbriel/KodaLite-1.3B-GGUF.

```
ollama run hf.co/YoAbriel/KodaLite-1.3B-GGUF:Q4_K_M
```

LM Studio note: the model was trained with <|end|> as a multi-token end marker. Since GGUF only supports single-token EOS, you need to manually add <|end|> as a Stop String in LM Studio's Advanced Settings.

Architecture (LLaMA-compatible)

| Component | Value |
|-----------|-------|
| Parameters | 1.27B |
| Layers | 24 |
| Hidden size | 2048 |
| Attention | GQA (32Q / 8KV heads) |
| Head dim | 64 |
| FFN | SwiGLU, intermediate 5504 |
| Normalization | RMSNorm (pre-norm) |
| Position | RoPE (theta=10000) |
| Context | 1024 tokens |
| Vocab | 50,257 (GPT-2 BPE) |
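As a sanity check, the 1.27B figure follows from these numbers (assuming untied input/output embeddings, which is what the arithmetic implies):

```python
vocab, hidden, layers = 50_257, 2048, 24
kv_heads, head_dim, ffn = 8, 64, 5504

# Attention (GQA): full-width Q and O projections, narrower K/V projections
attn = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)
# SwiGLU FFN: gate, up, and down projections
mlp = 3 * hidden * ffn
# Two RMSNorm scale vectors per layer
per_layer = attn + mlp + 2 * hidden

total = layers * per_layer   # transformer blocks
total += 2 * vocab * hidden  # untied token embedding + LM head
total += hidden              # final RMSNorm
print(f"{total / 1e9:.2f}B parameters")  # -> 1.27B parameters
```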

Training

Pre-training

  • Dataset: SlimPajama-6B (streaming)
  • Tokens seen: 1.64B
  • Hardware: 2x NVIDIA L40S (96GB VRAM total)
  • Precision: bfloat16
  • Framework: JAX + Flax NNX (trained from scratch, no base model)
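
With a streaming dataset and a 1024-token context, documents are typically concatenated and packed into fixed-length training sequences. A minimal pure-Python sketch of that packing step (illustrative only; real token IDs would come from the GPT-2 BPE tokenizer, whose end-of-text id is 50256):

```python
def pack_sequences(token_streams, seq_len=1024, eot_id=50256):
    """Concatenate tokenized documents (separated by an end-of-text id)
    and yield fixed-length training sequences."""
    buffer = []
    for tokens in token_streams:
        buffer.extend(tokens)
        buffer.append(eot_id)          # document separator
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]  # leftovers start the next sequence

# toy example with fake token ids and a tiny context
docs = [[1, 2, 3], [4, 5], [6] * 2050]
seqs = list(pack_sequences(docs, seq_len=8))
print(len(seqs), len(seqs[0]))  # -> 257 8
```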

SFT

  • Datasets: Databricks Dolly-15K + OpenAssistant OASST1
  • Method: LoRA (rank=16, alpha=32), then merged into base weights
  • End-of-turn marker: <|end|> (5 BPE tokens, NOT a special token)
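
The merge step folds the scaled low-rank update into each adapted weight matrix. A sketch with random matrices (the alpha/rank scaling and zero-initialized up-projection follow the standard LoRA formulation; shapes are illustrative):

```python
import numpy as np

rank, alpha = 16, 32
d_out, d_in = 2048, 2048
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in)).astype(np.float32)  # frozen base weight
A = rng.standard_normal((rank, d_in)).astype(np.float32)   # LoRA down-projection
B = np.zeros((d_out, rank), dtype=np.float32)              # LoRA up-projection (init 0)

# After training, the adapter is merged into the base weight:
W_merged = W + (alpha / rank) * (B @ A)

# With B still at its zero init, merging is a no-op:
assert np.allclose(W_merged, W)
```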

Limitations

  • Severely undertrained (6.5% of Chinchilla-optimal) — factual accuracy is low
  • May produce repetitive or inaccurate responses
  • English only
  • 1024 context window
  • Educational / research project — not production-ready

Lessons learned (for a potential v0.2)

  1. Train longer: aim for 20B+ tokens (Chinchilla-optimal for 1.3B would be ~25B).
  2. Use <|endoftext|> (single token) as end-of-turn marker for native GGUF/LM Studio stop support.
  3. Keep the architecture: SwiGLU + RMSNorm + GQA + RoPE works as intended; the scores track the expected accuracy-vs-tokens curve, so the deficit is training data, not architecture.

License

Apache 2.0
