KodaLite-1.3B (Koda-v0.1)

A 1.27B parameter LLaMA-style decoder-only language model, trained entirely from scratch on 2x NVIDIA L40S GPUs using JAX + Flax NNX, then converted to HuggingFace Transformers format.

TL;DR — KodaLite reaches ~37% average accuracy on standard LLM benchmarks. It is severely undertrained (only 1.64B tokens vs 40B–3T for comparable models), which places it just below GPT-2-124M despite having 10× more parameters. A nice illustration of the Chinchilla scaling law: tokens matter more than parameters at this budget.

Benchmark results (zero-shot, 8 standard tasks)

Evaluated against 8 reference models (the GPT-2 family plus ~1B-parameter peers) on the same 8 benchmarks (HellaSwag, ARC-E/C, WinoGrande, PIQA, BoolQ, OpenBookQA, LAMBADA-OpenAI).

| Rank | Model | Params | Train tokens | Avg accuracy |
|------|-------|--------|--------------|--------------|
| 1 | TinyLlama-1.1B | 1.10B | 3000B | 50.3% |
| 2 | Pythia-1.4B | 1.41B | 300B | 50.2% |
| 3 | GPT-2-XL | 1.56B | 40B | 49.4% |
| 4 | OPT-1.3B | 1.32B | 180B | 49.1% |
| 5 | Pythia-1B | 1.01B | 300B | 47.6% |
| 6 | GPT-2-large | 0.77B | 40B | 46.2% |
| 7 | GPT-2-medium | 0.35B | 40B | 44.2% |
| 8 | GPT-2-124M | 0.12B | 40B | 39.7% |
| 9 | KodaLite-1.3B | 1.27B | 1.64B | 36.8% |

Per-task breakdown

| Task | KodaLite-1.3B | GPT-2-124M | GPT-2-XL | Pythia-1.4B | TinyLlama-1.1B | Random |
|------|---------------|------------|----------|-------------|----------------|--------|
| HellaSwag | 25.65 | 29.22 | 47.94 | 49.21 | 56.2 | 25.0 |
| ARC-Easy | 32.79 | 38.30 | 50.80 | 51.73 | 43.9 | 25.0 |
| ARC-Challenge | 21.50 | 22.70 | 28.16 | 29.01 | 30.0 | 25.0 |
| WinoGrande | 49.57 | 49.49 | 51.93 | 52.88 | 52.2 | 50.0 |
| PIQA | 58.92 | 62.24 | 70.89 | 71.22 | 72.1 | 50.0 |
| BoolQ | 44.34 | 49.76 | 61.59 | 63.70 | 60.6 | 50.0 |
| OpenBookQA | 25.00 | 26.40 | 34.20 | 33.40 | 37.2 | 25.0 |
| LAMBADA (acc / ppl) | 18.22 / 93.8 | 30.84 / 17.5 | 50.79 / 6.4 | 61.03 / 3.8 | | |

Why KodaLite scores below GPT-2-124M (despite being 10× bigger)

The Chinchilla scaling law (DeepMind, 2022) says that a model with N parameters needs roughly 20×N training tokens to be trained compute-optimally:

| Model | Params | Chinchilla target (~20× params) | Actual tokens | Ratio |
|-------|--------|---------------------------------|---------------|-------|
| KodaLite-1.3B | 1.27B | ~25B | 1.64B | 6.5% 🔴 |
| GPT-2-XL | 1.5B | ~30B | 40B | 133% |
| Pythia-1.4B | 1.4B | ~28B | 300B | 1070% |
| TinyLlama-1.1B | 1.1B | ~22B | 3000B | 13600% |

KodaLite has seen only 6.5% of what it would need to be competitive. A bigger but undertrained model scores lower than a smaller but well-trained one. The LAMBADA perplexity (94 vs 17 for GPT-2-124M) is the clearest signal: the base language modeling is not converged.

On PIQA (physical commonsense) the gap is smallest — that kind of knowledge appears to be learned faster than factual knowledge or precise language modeling.
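The ratios above can be reproduced with a few lines of arithmetic (a minimal sketch; the 20× multiplier is the usual Chinchilla rule of thumb):

```python
# Chinchilla rule of thumb: compute-optimal token count ≈ 20 × parameter count
models = {
    "KodaLite-1.3B":  (1.27e9, 1.64e9),
    "GPT-2-XL":       (1.5e9,  40e9),
    "Pythia-1.4B":    (1.4e9,  300e9),
    "TinyLlama-1.1B": (1.1e9,  3000e9),
}

for name, (params, tokens) in models.items():
    target = 20 * params           # Chinchilla-optimal token budget
    ratio = 100 * tokens / target  # percentage of that budget actually seen
    print(f"{name:16s} target ~{target / 1e9:.0f}B  ratio {ratio:.0f}%")
```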

Chat Format

The model uses three plain-text markers (no special tokens): <|user|>, <|assistant|>, <|end|>.

```
<|user|>
Your question
<|assistant|>
Model response
<|end|>
```

Important: <|end|> is NOT a single token (it tokenizes to 5 BPE tokens). Always pass it via the `stop_strings` parameter of `generate()` (along with `tokenizer=`), otherwise the model will run past its natural end of turn.
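Because <|end|> is plain text rather than a single token, any runtime without stop-string support needs manual truncation. A minimal sketch (the helper names are illustrative, not part of the model's API; the prompt format matches the markers above):

```python
def build_prompt(question: str) -> str:
    """Assemble the chat format from the plain-text markers."""
    return f"<|user|>\n{question}\n<|assistant|>\n"

def truncate_at_end(generated: str, marker: str = "<|end|>") -> str:
    """Cut the model output at the first end-of-turn marker, if present."""
    return generated.split(marker)[0]

raw = "Paris is the capital of France.<|end|><|user|>\nAnother turn"
print(truncate_at_end(raw))  # -> Paris is the capital of France.
```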

Usage (Transformers)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("YoAbriel/KodaLite-1.3B")
model = AutoModelForCausalLM.from_pretrained(
    "YoAbriel/KodaLite-1.3B", dtype=torch.bfloat16, device_map="auto"
)

msg = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tok.apply_chat_template(msg, tokenize=False, add_generation_prompt=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs, max_new_tokens=150, do_sample=True, temperature=0.7, top_k=40,
    stop_strings=["<|end|>"], tokenizer=tok,  # tokenizer= is required for stop_strings
)
# Decode only the newly generated tokens, keeping the text markers visible
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=False))
```

Usage (MLX — Apple Silicon)

See YoAbriel/KodaLite-1.3B-mlx.

```python
from mlx_lm import load, stream_generate

model, tok = load("YoAbriel/KodaLite-1.3B-mlx-8bit")

def chat(q):
    prompt = tok.apply_chat_template([{"role": "user", "content": q}], tokenize=False)
    text = ""
    for resp in stream_generate(model, tok, prompt=prompt, max_tokens=150):
        text += resp.text
        if "<|end|>" in text:
            return text.split("<|end|>")[0]  # truncate at the end-of-turn marker
    return text

print(chat("What is the capital of France?"))
```

Usage (llama.cpp / Ollama / LM Studio)

See YoAbriel/KodaLite-1.3B-GGUF.

```
ollama run hf.co/YoAbriel/KodaLite-1.3B-GGUF:Q4_K_M
```

LM Studio note: the model was trained with <|end|> as a multi-token end marker. Since GGUF only supports single-token EOS, you need to manually add <|end|> as a Stop String in LM Studio's Advanced Settings.

Architecture (LLaMA-compatible)

| Component | Value |
|-----------|-------|
| Parameters | 1.27B |
| Layers | 24 |
| Hidden size | 2048 |
| Attention | GQA (32Q / 8KV heads) |
| Head dim | 64 |
| FFN | SwiGLU, intermediate 5504 |
| Normalization | RMSNorm (pre-norm) |
| Position | RoPE (theta=10000) |
| Context | 1024 tokens |
| Vocab | 50,257 (GPT-2 BPE) |
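As a sanity check, the 1.27B figure follows from these numbers (assuming untied input/output embeddings, which is what the arithmetic implies):

```python
vocab, hidden, layers = 50_257, 2048, 24
kv_heads, head_dim, ffn = 8, 64, 5504

# Attention (GQA): full-width Q and O projections, narrower K/V projections
attn = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)
# SwiGLU FFN: gate, up, and down projections
mlp = 3 * hidden * ffn
# Two RMSNorm scale vectors per layer
per_layer = attn + mlp + 2 * hidden

total = layers * per_layer   # transformer blocks
total += 2 * vocab * hidden  # untied token embedding + LM head
total += hidden              # final RMSNorm
print(f"{total / 1e9:.2f}B parameters")  # -> 1.27B parameters
```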

Training

Pre-training

  • Dataset: SlimPajama-6B (streaming)
  • Tokens seen: 1.64B
  • Hardware: 2x NVIDIA L40S (96GB VRAM total)
  • Precision: bfloat16
  • Framework: JAX + Flax NNX (trained from scratch, no base model)
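
With a streaming dataset and a 1024-token context, documents are typically concatenated and packed into fixed-length training sequences. A minimal pure-Python sketch of that packing step (illustrative only; real token IDs would come from the GPT-2 BPE tokenizer, whose end-of-text id is 50256):

```python
def pack_sequences(token_streams, seq_len=1024, eot_id=50256):
    """Concatenate tokenized documents (separated by an end-of-text id)
    and yield fixed-length training sequences."""
    buffer = []
    for tokens in token_streams:
        buffer.extend(tokens)
        buffer.append(eot_id)          # document separator
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]  # leftovers start the next sequence

# toy example with fake token ids and a tiny context
docs = [[1, 2, 3], [4, 5], [6] * 2050]
seqs = list(pack_sequences(docs, seq_len=8))
print(len(seqs), len(seqs[0]))  # -> 257 8
```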

SFT

  • Datasets: Databricks Dolly-15K + OpenAssistant OASST1
  • Method: LoRA (rank=16, alpha=32), then merged into base weights
  • End-of-turn marker: <|end|> (5 BPE tokens, NOT a special token)
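
The merge step folds the scaled low-rank update into each adapted weight matrix. A sketch with random matrices (the alpha/rank scaling and zero-initialized up-projection follow the standard LoRA formulation; shapes are illustrative):

```python
import numpy as np

rank, alpha = 16, 32
d_out, d_in = 2048, 2048
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in)).astype(np.float32)  # frozen base weight
A = rng.standard_normal((rank, d_in)).astype(np.float32)   # LoRA down-projection
B = np.zeros((d_out, rank), dtype=np.float32)              # LoRA up-projection (init 0)

# After training, the adapter is merged into the base weight:
W_merged = W + (alpha / rank) * (B @ A)

# With B still at its zero init, merging is a no-op:
assert np.allclose(W_merged, W)
```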

Limitations

  • Severely undertrained (6.5% of Chinchilla-optimal) — factual accuracy is low
  • May produce repetitive or inaccurate responses
  • English only
  • 1024 context window
  • Educational / research project — not production-ready

Lessons learned (for a potential v0.2)

  1. Train longer: aim for 20B+ tokens (Chinchilla-optimal for 1.3B would be ~25B).
  2. Use <|endoftext|> (single token) as end-of-turn marker for native GGUF/LM Studio stop support.
  3. Keep the architecture: SwiGLU + RMSNorm + GQA + RoPE works as intended; the scores track the expected accuracy-vs-tokens curve, so the deficit is training data, not architecture.

License

Apache 2.0
