WordLlama Detect
WordLlama Detect is a WordLlama-like library focused on language identification. It identifies 148 languages with high accuracy and fast, NumPy-only CPU inference, and was trained on static token embeddings extracted from Gemma3-series LLMs.
Overview
Features:
- NumPy-only inference with no PyTorch dependency
- Pre-trained model (148 languages), 103 of which exceed 95% accuracy
- Sparse lookup table (13MB)
- Fast inference: >70k texts/s single thread
- Simple interface
Installation
pip install wldetect
Or install from source:
git clone https://github.com/dleemiller/WordLlamaDetect.git
cd WordLlamaDetect
uv sync
Quick Start
Python API
from wldetect import WLDetect
# Load bundled model (no path needed)
wld = WLDetect.load()
# Detect language for single text
lang, confidence = wld.predict("Hello, how are you today?")
# ('eng_Latn', 0.9564036726951599)
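As a small usage sketch building only on the calls shown above, the loop below reuses one loaded model across several texts and filters out low-confidence results; the 0.9 cutoff is an arbitrary illustrative value, not a library default.

```python
# Usage sketch: reuse a single loaded model across many texts and filter
# low-confidence predictions. The 0.9 threshold is illustrative only.
from wldetect import WLDetect

wld = WLDetect.load()
texts = ["Hello, how are you today?", "Bonjour le monde", "Hola, ¿qué tal?"]
for text in texts:
    lang, confidence = wld.predict(text)
    if confidence >= 0.9:
        print(f"{lang}\t{confidence:.3f}\t{text}")
    else:
        print(f"uncertain ({confidence:.3f})\t{text}")
```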
CLI Usage
# Detect from text
uv run wldetect detect --text "Bonjour le monde"
# Detect from file
uv run wldetect detect --file input.txt
Included Model
WLDetect ships with a pre-trained model based on concatenated Gemma3-27B + Gemma3-4B token embeddings:
- Languages: 148 (from OpenLID-v2 dataset)
- Accuracy: 92.92% on FLORES+ dev set
- F1 (macro): 92.74%
- Language codes: ISO 639-3 + ISO 15924 script (e.g., eng_Latn, cmn_Hans, arb_Arab)
See docs/languages.md for the complete list of supported languages with performance metrics.
Gemma3 is a good choice for this application because it was trained on over 140 languages. Its tokenizer, large vocabulary (262k tokens), and multilingual training are critical to performance.
Architecture
Simple Inference Pipeline (NumPy-only)
- Tokenize: Use HuggingFace fast tokenizer (512-length truncation)
- Lookup: Index into pre-computed exponential lookup table (vocab_size × n_languages)
- Pool: LogSum pooling over token sequence
- Softmax: Calculate language probabilities
The lookup table is pre-computed as exp((embeddings * token_weights) @ projection.T + bias), where embeddings are frozen token embeddings from Gemma3 and the remaining parameters are trained with focal loss on OpenLID-v2.
During training, token vectors are aggregated using logsumexp pooling along the sequence dimension.
To optimize artifact size and compute, we apply exp(logits) before saving the lookup table, then threshold the result to make the table sparse. This reduces the artifact size by about 10x (~130 MB → ~13 MB) with negligible performance degradation.
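To make the pipeline concrete, here is a minimal NumPy sketch of the inference math, assuming the sparse table has already been densified; the names exp_table and token_ids are illustrative, not the library's internals. It shows why storing exp(logits) is convenient: logsumexp pooling collapses into a row gather, a sum, and a log.

```python
# Minimal NumPy sketch of the inference math. `exp_table` is assumed to be the
# densified (vocab_size x n_languages) exp lookup table and `token_ids` the
# tokenizer output for one text; names and shapes are illustrative.
import numpy as np

def language_probs(exp_table: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
    rows = exp_table[token_ids]                # lookup: (seq_len, n_languages)
    # Because the table stores exp(logits), logsumexp pooling over the
    # sequence reduces to a plain sum followed by a log.
    pooled = np.log(rows.sum(axis=0) + 1e-12)  # pooled logits per language
    z = pooled - pooled.max()                  # numerically stable softmax
    return np.exp(z) / np.exp(z).sum()
```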
Sparse Lookup Table
The lookup table uses sparse COO (Coordinate) format with configurable sparsification threshold:
- Sparsity: 97.15% (values below the threshold of 10 are set to zero)
- Format: COO (row, col, data) indices stored as int32, values as fp32
- Performance impact: Negligible (0.003% accuracy loss)
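For intuition, the sketch below rebuilds a dense table from COO triplets so that inference can do O(1) row lookups; the variable names and how the triplets are obtained are assumptions for illustration, not the library's actual loader.

```python
# Sketch: rebuild a dense (vocab_size x n_languages) array from COO triplets
# (int32 row/col indices, fp32 values) for fast row lookups at inference time.
# Variable names and the loading path are assumptions for illustration.
import numpy as np

def densify_coo(rows, cols, data, vocab_size, n_languages):
    table = np.zeros((vocab_size, n_languages), dtype=np.float32)
    table[rows, cols] = data  # scatter the ~2.85% surviving entries
    return table
```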
Performance
FLORES+ Benchmark Results
Evaluated on FLORES+ dataset (148 languages, ~1k sentences per language):
| Split | Accuracy | F1 (macro) | F1 (weighted) | Samples |
|---|---|---|---|---|
| dev | 92.92% | 92.74% | 92.75% | 150,547 |
| devtest | 92.86% | 92.71% | 92.69% | 153,824 |
See docs/languages.md for detailed results.
Inference Speed
Benchmarked on a 12th-gen Intel i9 (single thread):
- Single text: 71,500 texts/second (0.014 ms/text)
- Batch (1000): 82,500 texts/second (12.1 ms/batch)
Supported Languages
The bundled model supports 148 languages from the OpenLID-v2 dataset. Languages use ISO 639-3 language codes with ISO 15924 script codes (e.g., eng_Latn, cmn_Hans, arb_Arab).
See model_config.yaml for the complete list of supported languages.
Training
Installation for Training
# CPU or default CUDA version
uv sync --extra training
# With CUDA 12.8 (Blackwell)
uv sync --extra cu128
Training Pipeline
- Configure model in configs/models/custom-config.yaml:

  model:
    name: google/gemma-3-27b-pt
    hidden_dim: 5376
    shard_pattern: model-00001-of-00012.safetensors
    embedding_layer_name: language_model.model.embed_tokens.weight

  languages:
    eng_Latn: 0
    spa_Latn: 1
    fra_Latn: 2
    # ... add more languages

  inference:
    max_sequence_length: 512
    pooling: logsumexp
- Configure training in configs/training/custom-training.yaml:

  model_config_path: "configs/models/custom-model.yaml"

  dataset:
    name: "laurievb/OpenLID-v2"
    filter_languages: true

  training:
    batch_size: 1536
    learning_rate: 0.002
    epochs: 2
- Train:
uv run wldetect train --config configs/training/custom-training.yaml
Artifacts saved to artifacts/:
- lookup_table_exp.safetensors - Sparse exp lookup table (for inference)
- projection.safetensors - Projection matrix (fp32, for fine-tuning)
- model_config.yaml - Model configuration
- model.pt - Full PyTorch checkpoint
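As an illustration of how these artifacts might be inspected, the sketch below opens the sparse lookup table with the safetensors library; only the file name comes from the list above, and the tensor keys are printed at runtime rather than assumed.

```python
# Sketch: peek inside the saved sparse lookup artifact with safetensors.
# The file name comes from the artifacts list above; the tensor keys and
# shapes are discovered at runtime rather than assumed.
from safetensors.numpy import load_file

tensors = load_file("artifacts/lookup_table_exp.safetensors")
for name, arr in tensors.items():
    print(name, arr.dtype, arr.shape)
```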
Training Commands
# Train model
uv run wldetect train --config configs/training/gemma3-27b.yaml
# Evaluate on FLORES+
uv run wldetect eval --model-path artifacts/ --split dev
# Generate sparse lookup table from checkpoint (default: threshold=10.0)
uv run wldetect create-lookup \
--checkpoint artifacts/checkpoints/checkpoint_step_100000.pt \
--config configs/training/gemma3-27b.yaml \
--output-dir artifacts/
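For intuition about the sparsification step, the sketch below thresholds a dense exp table and keeps the surviving entries as COO triplets; it is an assumption-level illustration of the idea, not the create-lookup implementation.

```python
# Sketch: sparsify a dense exp lookup table by zeroing entries below the
# threshold and keeping the survivors as COO (row, col, data) triplets.
# This mirrors the idea behind create-lookup but is not its implementation.
import numpy as np

def sparsify(dense_exp_table: np.ndarray, threshold: float = 10.0):
    rows, cols = np.nonzero(dense_exp_table >= threshold)
    data = dense_exp_table[rows, cols].astype(np.float32)
    return rows.astype(np.int32), cols.astype(np.int32), data
```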
Training Details
- Embedding extraction: Downloads only embedding tensor shards from HuggingFace (not full models)
- Dataset: OpenLID-v2 with configurable language filtering and balancing
- Model: Simple linear projection (hidden_dim → n_languages) with dropout
- Pooling: LogSumExp or max pooling over token sequences (see the sketch after this list)
- Training time: ~2-4 hours on GPU for 2 epochs (150 languages, 5000 samples/language)
- Evaluation: Automatic FLORES+ evaluation after training
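To make the pooling choice concrete, here is a minimal masked logsumexp pooling sketch over the sequence dimension; tensor names, shapes, and the mask convention are assumptions for illustration rather than the project's training code.

```python
# Sketch of masked logsumexp pooling over the sequence dimension during
# training. The mask convention (1 = real token, 0 = padding) is assumed.
import torch

def logsumexp_pool(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, n_languages); mask: (batch, seq_len)
    masked = logits.masked_fill(mask.unsqueeze(-1) == 0, float("-inf"))
    return torch.logsumexp(masked, dim=1)  # (batch, n_languages)
```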
License
Apache 2.0 License
Citations
If you use WordLlama Detect in your research or project, please consider citing it as follows:
@software{miller2025wordllamadetect,
  author  = {Miller, D. Lee},
  title   = {WordLlama Detect: The Language of the Token},
  year    = {2025},
  url     = {https://github.com/dleemiller/WordLlamaDetect},
  version = {0.1.0}
}
Acknowledgments
- OpenLID-v2 dataset: laurievb/OpenLID-v2
- FLORES+ dataset: openlanguagedata/flores_plus
- HuggingFace transformers and tokenizers libraries
- Google Gemma model team