RSCaLM-138M-core

RSCaLM (Research Scale Causal Language Model), Core Edition, is an experimental 138M-parameter decoder-only transformer trained for 20,000 steps. Unlike the LLaMA variant, this model is implemented entirely with a custom minimal GPT architecture (standalone_transformer_lm.GPT) and SentencePiece tokenization; there is no Hugging Face Transformers dependency.


📌 Experiment Summary

  • Architecture: Custom GPT-style causal decoder

    • Implemented in standalone_transformer_lm.py
    • Learned positional embeddings (absolute)
    • Multi-head self-attention with KV caching (see the attention sketch after this list)
    • GELU feed-forward layers
    • LayerNorm
  • Parameter Count: ~138M

  • Context Length: 2048 tokens

  • Tokenizer: SentencePiece (tokenizer.model)

  • Training Framework: Pure PyTorch (no Transformers)

  • Optimizer: AdamW (β1=0.9, β2=0.95, weight decay=0.1)

  • Scheduler: Cosine decay with warmup (see the optimizer sketch after this list)

  • Precision: Mixed FP16/BF16 training

  • Steps Completed: 20,000 (~32% of planned total)
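
A minimal sketch of how KV-cached attention works during incremental decoding, to make the attention bullet above concrete. This is illustrative only: the function name, tensor shapes, and cache layout are assumptions, not the actual standalone_transformer_lm.py implementation.

import math
import torch
import torch.nn.functional as F

def attend_with_cache(q, k_new, v_new, cache=None):
    # q, k_new, v_new: (batch, 1, head_dim) projections for the newest token.
    # cache holds the stacked keys/values of all previously seen tokens.
    if cache is not None:
        k = torch.cat([cache[0], k_new], dim=1)  # append new key
        v = torch.cat([cache[1], v_new], dim=1)  # append new value
    else:
        k, v = k_new, v_new
    # Decoding one token at a time, the new query may attend to every
    # cached position, so no explicit causal mask is required here.
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    out = F.softmax(scores, dim=-1) @ v
    return out, (k, v)  # the updated cache is passed back in at the next step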
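
And a sketch of the optimizer and scheduler setup named above, in pure PyTorch. The betas, weight decay, and step counts come from this card; the peak learning rate, warmup length, and the stand-in model are placeholder assumptions.

import math
import torch

model = torch.nn.Linear(8, 8)  # stand-in; the real run uses the GPT model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,                   # assumed peak learning rate (not stated)
    betas=(0.9, 0.95),         # from this card
    weight_decay=0.1,          # from this card
)

warmup_steps = 1_000           # assumed warmup length
total_steps = 62_500           # 20,000 completed steps is ~32% of this plan

def lr_lambda(step):
    # Linear warmup, then cosine decay toward zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)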


📉 Validation Loss Progress

Step      Val Loss
 1,000    5.6011
 2,000    4.8598
 5,000    4.2239
10,000    3.9756
15,000    3.8608
20,000    3.7984
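
Assuming these are per-token cross-entropy losses in nats, the final value of 3.7984 corresponds to a validation perplexity of exp(3.7984) ≈ 44.6.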

⚠️ Notes

  • Prototype only: repetition loops expected in longer generations.
  • Requires standalone_transformer_lm.py and SentencePiece to run.
  • Does not load with transformers.AutoModelForCausalLM.

🔧 Example Usage

import torch, sentencepiece as spm
from standalone_transformer_lm import GPT, GPTConfig

# Load checkpoint & config
ckpt = torch.load("ckpt_best.pt", map_location="cpu")
cfg  = GPTConfig(**ckpt["config"])

# Init model & load weights
model = GPT(cfg).eval()
model.load_state_dict(ckpt["model"])

# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

# Encode prompt
ids = torch.tensor([sp.encode("Dubai is", out_type=int)])

# Generate text (no gradients needed at inference time)
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=40)
print(sp.decode(out[0].tolist()))

🔧 Example Usage (with repetition control)

import torch, sentencepiece as spm
from standalone_transformer_lm import GPT, GPTConfig

ckpt = torch.load("ckpt_best.pt", map_location="cpu")
cfg  = GPTConfig(**ckpt["config"])
model = GPT(cfg).eval()
model.load_state_dict(ckpt["model"])

sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

prompt = "when a man goes to fishing"
ids = torch.tensor([sp.encode(prompt, out_type=int)])

# Manual repetition control (no gradients needed at inference time)
with torch.no_grad():
    out = model.generate(
        ids,
        max_new_tokens=100,
        temperature=0.7,         # Lower temperature = more focused sampling
        top_k=50,                # Top-k sampling
        top_p=0.9,               # Nucleus sampling
        repetition_penalty=1.2,  # Penalize already-generated tokens
        no_repeat_ngram_size=3,  # Block repeating trigrams
    )
print(sp.decode(out[0].tolist()))

💡 Tips to Reduce Loops

  • Increase repetition_penalty to 1.2–1.5
  • Use no_repeat_ngram_size=3 or higher (the sketch after this list shows how both controls work)
  • Combine top_k and top_p for better sampling variety
  • Lower temperature for more deterministic completions
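
For intuition, here is a minimal, hypothetical sketch of how the two repetition controls above are commonly implemented at each decoding step; the actual logic inside model.generate may differ.

import torch

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    # logits: 1-D tensor over the vocabulary for the next token.
    # Scale down the logits of tokens already present in the output so
    # they are less likely to be sampled again.
    for token_id in set(generated_ids):
        if logits[token_id] > 0:
            logits[token_id] /= penalty
        else:
            logits[token_id] *= penalty
    return logits

def banned_ngram_tokens(generated_ids, n=3):
    # Return the token ids that would complete an n-gram already present
    # in the output (this is what no_repeat_ngram_size=3 blocks).
    if len(generated_ids) < n - 1:
        return set()
    prefix = tuple(generated_ids[-(n - 1):])
    banned = set()
    for i in range(len(generated_ids) - n + 1):
        if tuple(generated_ids[i:i + n - 1]) == prefix:
            banned.add(generated_ids[i + n - 1])
    return banned

A decoding loop would apply the penalty to the logits, set the banned tokens' logits to -inf, and only then perform top-k/top-p sampling.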

📜 License

Apache-2.0

