ogma-base · 13.3M efficient text embedding model · MTEB 57.04
High-quality English text embedding model for semantic search, RAG, vector search, retrieval, clustering, classification, STS, and agent memory — MTEB 57.04, 13.3M parameters, 1024-token context
Ogma Base is the recommended general-purpose model in the Ogma family. At 13.3M parameters it scores 57.04 MTEB — 0.95 points ahead of all-MiniLM-L6-v2 (56.09) with only 59% of its parameters, while handling 4× longer input sequences (1024 vs 256 tokens). It is the sweet spot for quality-first RAG pipelines and agent memory.
Why the name Ogma?
Ogma is named after Ogma (also written Oghma), the Irish god associated with eloquence and credited in myth with inventing Ogham, an early alphabet for encoding language into symbols. That is the core job of an embedding model: turn language into compact vectors that machines can search, compare, cluster, and reason over.
Use cases
ogma-base is the quality-first Ogma model for semantic search, RAG retrieval, agent memory, vector databases, document retrieval, text classification, clustering, STS / sentence similarity, and retrieval-heavy agent pipelines. It is aimed at users who want better quality than MiniLM-class models while keeping the model small enough for practical CPU deployment.
Good fits:
- Production RAG and enterprise search where retrieval quality matters but the model still needs to be lightweight.
- Local or private embedding services for teams that want to avoid external embedding APIs for sensitive text.
- Agent memory systems with long context chunks, asymmetric query/document encoding, and frequent retrieval.
- Efficient CPU deployments where 13.3M parameters is easier to host than larger embedding transformers.
- Classification, clustering, and routing features where embedding quality directly affects downstream decisions.
Choose ogma-base when you want the strongest general-purpose Ogma model before moving into larger, accuracy-first territory.
Highlights
- 🏆 MTEB avg 57.04 — beats all-MiniLM-L6-v2 (56.09) at 59% of its parameters
- 🏆 +5.38 points over Potion-32M (51.66) using less than half its parameters
- 📏 1024-token context — 4× longer than all-MiniLM-L6-v2 (256 tokens)
- 🔀 Asymmetric encoding via task tokens: [QRY], [DOC], [SYM]
- 📐 Matryoshka dims: [256, 128, 64, 32] — one model, any precision
- 🛡️ +4.0% F1 on prompt injection detection vs MiniLM (same architecture series)
Performance
MTEB English — 54/54 tasks (category-averaged)
Benchmarked with MTEB v2.10.7 on the standard 54-task English benchmark using category averaging (same methodology as the MTEB leaderboard).
| Category | ogma-base | all-MiniLM-L6-v2 | Δ vs MiniLM |
|---|---|---|---|
| Classification | 67.89 | 62.62 | +5.27 |
| Clustering | 41.49 | 41.94 | -0.45 |
| PairClassification | 83.73 | 82.37 | +1.36 |
| Reranking | 51.25 | 58.04 | -6.79 |
| Retrieval | 42.36 | 41.95 | +0.41 |
| STS | 82.84 | 78.90 | +3.94 |
| Summarization | 29.73 | 30.81 | -1.08 |
| Overall | 57.04 | 56.09 | +0.95 |
MiniLM and Potion reference scores from the Model2Vec results page.
Why choose Ogma Base?
ogma-base is the recommended choice when you need strong MTEB scores without going to full transformer scale (bge, e5). It outperforms all-MiniLM-L6-v2 on classification, pair classification, retrieval, and STS while being smaller and handling 4× longer context; MiniLM retains the lead on clustering, reranking, and summarization.
CPU Inference Benchmark
Benchmarked on AMD Ryzen Threadripper PRO 3955WX (16-core/32-thread), PyTorch 2.10, batch of 100 mixed-length documents.
| Model | Params | 1T·bs1 (docs/s) | 1T·bs1 latency | 1T·bs32 (docs/s) | 16T·bs32 (docs/s) |
|---|---|---|---|---|---|
| potion-base-8M | 7.6M | 6,892 | 0.14 ms | 18,021 | 17,040 |
| potion-base-32M | 32.3M | 6,826 | 0.15 ms | 17,984 | 17,328 |
| ogma-small | 8.6M | 92.9 | 10.8 ms | 60.9 | 255.6 |
| all-MiniLM-L6-v2 | 22.7M | 53.1 | 18.8 ms | 40.5 | 227.9 |
| ogma-base | 13.3M | 48.3 | 20.7 ms | 28.9 | 121.6 |
| bge-small-en-v1.5 | 33.4M | 26.8 | 37.3 ms | 19.8 | 115.3 |
| bge-base-en-v1.5 | 109.5M | 7.6 | 131.7 ms | 4.8 | 30.2 |
Potion models are static (lookup-based); their near-zero inference cost is the trade-off for no contextual understanding and a fixed 256-token context. Transformer models like Ogma and MiniLM understand context. ogma-small is 1.75× faster than MiniLM at single-threaded batch 1 (92.9 vs 53.1 docs/s) and 1.12× faster at 16 threads with batch 32 (255.6 vs 227.9 docs/s).
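The throughput numbers in the table follow the usual docs-per-second methodology; a minimal measurement sketch is shown below. The harness and the dummy workload are illustrative assumptions, not the actual benchmark script:

```python
import time

def measure_docs_per_sec(encode_fn, docs, repeats=3):
    """Return best-of-N throughput in documents/second for a batch encode call."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encode_fn(docs)  # e.g. model.encode(...) on a pre-tokenised batch
        best = min(best, time.perf_counter() - start)
    return len(docs) / best

# Dummy stand-in workload in place of a real model.encode call.
docs = ["some document"] * 100
throughput = measure_docs_per_sec(lambda batch: [d.upper() for d in batch], docs)
print(throughput > 0)  # True
```

Taking the best of several repeats reduces noise from OS scheduling, which matters at the millisecond latencies in the table.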
Safety — Toxicity & Prompt Injection Detection
Evaluated on the Ogma transformer architecture (same family). Embeddings are extracted and fed to a logistic regression (LR) or MLP classifier head — the embedding model itself is not fine-tuned. all-MiniLM-L6-v2 serves as the baseline.
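The frozen-embedding protocol described above can be sketched as follows. The embedding matrices and labels here are synthetic stand-ins for the real extracted embeddings; this is a minimal sketch of the classifier-head training, not the actual evaluation harness:

```python
import torch
import torch.nn.functional as F

# Synthetic stand-ins for precomputed sentence embeddings (encoder stays frozen).
torch.manual_seed(0)
n_train, n_test, dim = 512, 128, 256
X_train = F.normalize(torch.randn(n_train, dim), dim=-1)
y_train = torch.randint(0, 2, (n_train,)).float()
X_test = F.normalize(torch.randn(n_test, dim), dim=-1)

# Logistic-regression head: a single linear layer trained with BCE loss.
head = torch.nn.Linear(dim, 1)
opt = torch.optim.Adam(head.parameters(), lr=1e-2)
loss_fn = torch.nn.BCEWithLogitsLoss()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(head(X_train).squeeze(-1), y_train)
    loss.backward()
    opt.step()

# Inference: sigmoid probabilities, thresholded at 0.5.
with torch.no_grad():
    probs = torch.sigmoid(head(X_test).squeeze(-1))
    preds = (probs > 0.5).long()
print(preds.shape)  # torch.Size([128])
```

An MLP head works the same way, with one or two hidden layers in place of the single linear layer.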
1. Jigsaw Toxic Comment Classification
Dataset: Arsive/toxicity_classification_jigsaw — Binary toxicity classification
Train: 25,960 · Test: 6,490
| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| Ogma | LogReg | 89.12% | 88.26% | 89.09% | 87.44% | 95.74% |
| Ogma | MLP | 88.91% | 87.98% | 89.14% | 86.85% | 95.92% |
| MiniLM | LogReg | 87.32% | 86.25% | 87.46% | 85.07% | 94.96% |
| MiniLM | MLP | 91.71% | 91.24% | 90.13% | 92.39% | 97.16% |
Ogma (LR) leads MiniLM (LR) by +2.01% F1. MiniLM (MLP) leads on this dataset — the additional training data (25K samples) allows the MLP to compensate for MiniLM's slightly weaker base representations.
2. Prompt Injection Detection — deepset/prompt-injections
Dataset: deepset/prompt-injections — Binary injection detection
Train: 546 · Test: 116 (low-data regime)
| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| Ogma | LogReg | 86.21% | 84.62% | 100.0% | 73.33% | 97.77% |
| Ogma | MLP | 90.52% | 90.27% | 96.23% | 85.0% | 98.1% |
| MiniLM | LogReg | 82.76% | 80.39% | 97.62% | 68.33% | 94.52% |
| MiniLM | MLP | 87.07% | 86.24% | 95.92% | 78.33% | 93.96% |
Ogma leads across both classifiers: +4.03% F1 (MLP), +4.23% F1 (LogReg). Ogma's representations are better separated in the low-data regime — it achieves 100% precision with LogReg, meaning zero false positives.
3. Prompt Injection Detection — neuralchemy/Prompt-injection-dataset
Dataset: neuralchemy/Prompt-injection-dataset — Binary injection detection
Train: 4,391 · Test: 942
| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| Ogma | LogReg | 95.22% | 95.93% | 95.84% | 96.01% | 99.30% |
| Ogma | MLP | 95.44% | 96.16% | 94.89% | 97.46% | 99.37% |
| MiniLM | LogReg | 94.59% | 95.38% | 95.46% | 95.29% | 98.92% |
| MiniLM | MLP | 93.95% | 94.85% | 94.59% | 95.11% | 98.92% |
Ogma leads across all metrics: +1.31% F1 (MLP), +0.55% F1 (LR). Both models perform well at scale; Ogma maintains its edge and achieves higher AUC-ROC (99.37% vs 98.92%).
Summary
| Task | Ogma best F1 | MiniLM best F1 | Δ |
|---|---|---|---|
| Jigsaw Toxicity | 88.26% (LR) | 91.24% (MLP) | −2.98% |
| deepset Injection | 90.27% (MLP) | 86.24% (MLP) | +4.03% |
| neuralchemy Injection | 96.16% (MLP) | 95.38% (LR) | +0.78% |
Ogma is a stronger feature extractor for prompt injection detection — the safety-critical task for agent pipelines. MiniLM edges ahead on toxicity when given sufficient labelled data and a more powerful classifier head. For agentic use cases where detecting adversarial instructions is the priority, Ogma representations are the better choice.
Architecture
| Property | Value |
|---|---|
| Architecture | Custom Transformer |
| Internal dim (d_model) | 256 |
| Output dim (d_output) | 256 |
| Transformer layers | 12 |
| Attention heads | 4 |
| Vocabulary | 30,000 (SentencePiece / AlbertTokenizer) |
| Max sequence length | 1,024 tokens |
| Pooling | Mean pooling |
| Task tokens | [QRY] (query), [DOC] (document), [SYM] (symmetric) |
| Matryoshka dims | [32, 64, 128, 256] |
| Output normalisation | L2 (unit sphere) |
| Parameters | 13.3M |
| Model file | model.safetensors (51 MB) |
Key design choices:
- Task token prepend: A learnable task token ([QRY], [DOC], or [SYM]) is prepended to the input sequence before the transformer. This enables true asymmetric encoding in a single model with a single forward pass.
- Matryoshka training: The model is trained with Matryoshka Representation Learning, meaning embeddings truncated to any supported sub-dimension remain well-calibrated without retraining.
- Mean pooling: The average of all token outputs (excluding padding) produces the sentence embedding, which consistently outperforms CLS-token pooling in the Ogma architecture family.
- L2 normalisation: All outputs are unit-normalised, so cosine similarity equals the dot product and Euclidean distance is a monotonic function of both, which simplifies downstream usage.
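The L2-normalisation identity can be checked numerically. The random vectors below are stand-ins for Ogma embeddings; any unit vectors behave the same:

```python
import torch
import torch.nn.functional as F

# Two unit-normalised vectors (stand-ins for Ogma embeddings).
torch.manual_seed(0)
a = F.normalize(torch.randn(256), dim=-1)
b = F.normalize(torch.randn(256), dim=-1)

cos = F.cosine_similarity(a, b, dim=-1)  # cosine similarity
dot = a @ b                              # dot product
dist = torch.dist(a, b) ** 2             # squared Euclidean distance

# For unit vectors: cos == dot, and ||a - b||^2 == 2 - 2 * cos.
print(torch.allclose(cos, dot, atol=1e-6))           # True
print(torch.allclose(dist, 2 - 2 * cos, atol=1e-5))  # True
```

In practice this means a dot-product or cosine index in your vector database will rank results identically for Ogma embeddings.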
Usage
Installation
pip install torch tokenizers huggingface_hub pyyaml
Basic Encoding
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer
import sys, torch
# 1. Download model files
model_path = snapshot_download("axiotic/ogma-base")
# 2. Load model (bundled source code)
sys.path.insert(0, model_path)
from ogma_model import OgmaModel
model = OgmaModel.from_checkpoint(model_path, device="cpu")
model.eval()
# 3. Tokenizer
N_SPECIAL = 7
_tok = Tokenizer.from_file(f"{model_path}/tokenizer.json")
def encode(texts: list, max_length: int = 1024):
all_ids = []
for text in texts:
enc = _tok.encode(text)
ids, toks = enc.ids, enc.tokens
# Strip CLS/SEP added by tokenizer
if toks and toks[0] in ("[CLS]", "<s>"):
ids, toks = ids[1:], toks[1:]
if toks and toks[-1] in ("[SEP]", "</s>"):
ids = ids[:-1]
# Shift into Ogma's vocabulary space and add BOS/EOS
ogma_ids = [2] + [rid + N_SPECIAL for rid in ids] + [3]
all_ids.append(ogma_ids[:max_length])
ml = max(len(ids) for ids in all_ids)
token_ids = torch.zeros(len(texts), ml, dtype=torch.long)
attn_mask = torch.zeros(len(texts), ml, dtype=torch.long)
for i, ids in enumerate(all_ids):
token_ids[i, :len(ids)] = torch.tensor(ids)
attn_mask[i, :len(ids)] = 1
return token_ids, attn_mask
# 4. Encode (symmetric mode — good for clustering, classification, STS)
from config import TaskToken
sentences = [
"The quick brown fox jumps over the lazy dog",
"A fast auburn vulpine leaps over an idle canine",
]
with torch.no_grad():
token_ids, attn_mask = encode(sentences)
embeddings = model.encode(token_ids, attn_mask, task=TaskToken.SYM)
print(embeddings.shape) # (2, 256)
sim = (embeddings[0] @ embeddings[1]).item()
print(f"Cosine similarity: {sim:.4f}") # L2-normalised, dot product = cosine
Asymmetric Retrieval (Query / Document)
Use TaskToken.QRY for query embeddings and TaskToken.DOC for document embeddings in retrieval pipelines. This asymmetric encoding is a first-class feature of the Ogma architecture.
# Asymmetric retrieval — encode queries with QRY, passages with DOC
from config import TaskToken
queries = [
"What is knowledge distillation?",
"How does retrieval-augmented generation work?",
]
documents = [
"Knowledge distillation trains a smaller student model to mimic a larger teacher...",
"Retrieval-Augmented Generation (RAG) combines a dense retriever with a language model...",
]
with torch.no_grad():
q_ids, q_mask = encode(queries)
d_ids, d_mask = encode(documents)
q_emb = model.encode(q_ids, q_mask, task=TaskToken.QRY) # (N, 256)
d_emb = model.encode(d_ids, d_mask, task=TaskToken.DOC) # (M, 256)
# Dot product == cosine similarity (embeddings are L2-normalised)
scores = q_emb @ d_emb.T # (N, M)
print(scores)
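Ranking documents per query is then a top-k over the score matrix. The random normalised tensors below are stand-ins for the real q_emb / d_emb so the snippet runs standalone:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q_emb = F.normalize(torch.randn(2, 256), dim=-1)  # stand-in for query embeddings
d_emb = F.normalize(torch.randn(5, 256), dim=-1)  # stand-in for document embeddings

scores = q_emb @ d_emb.T                        # (2, 5) cosine similarities
top_scores, top_idx = scores.topk(k=3, dim=1)   # best 3 documents per query
print(top_idx.shape)  # torch.Size([2, 3])
```

torch.topk returns values in descending order, so `top_idx[i, 0]` is the best-matching document for query i.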
Matryoshka — Flexible Dimensionality
Ogma supports Matryoshka Representation Learning. Truncate and re-normalise to any supported sub-dimension for faster indexing or lower memory usage — no retraining required.
import torch.nn.functional as F
with torch.no_grad():
token_ids, attn_mask = encode(sentences)
emb_full = model.encode(token_ids, attn_mask) # (256d, full precision)
# Truncate to any supported sub-dimension and re-normalise — no retraining needed
# Supported dims: [32, 64, 128, 256]
emb_32 = F.normalize(emb_full[:, :32], dim=-1)
emb_64 = F.normalize(emb_full[:, :64], dim=-1)
emb_128 = F.normalize(emb_full[:, :128], dim=-1)
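As a back-of-the-envelope on why truncation helps: at float32, a flat vector index shrinks linearly with the kept dimensions. This is a generic sizing sketch, not specific to any particular vector database:

```python
# Index size for 1M float32 vectors at each supported Matryoshka dim.
n_vectors = 1_000_000
bytes_per_float = 4
for dim in (256, 128, 64, 32):
    mb = n_vectors * dim * bytes_per_float / 1e6
    print(f"{dim:>3}d: {mb:,.0f} MB")  # 256d: 1,024 MB ... 32d: 128 MB
```

A common pattern is to shortlist candidates with a small dim (e.g. 32d) and rescore the shortlist with the full 256d embeddings.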
LangChain Integration
# LangChain integration (custom embeddings class)
from langchain.embeddings.base import Embeddings
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer
from config import TaskToken
import sys, torch
class OgmaEmbeddings(Embeddings):
def __init__(self, model_name: str = "axiotic/ogma-base", device: str = "cpu"):
model_path = snapshot_download(model_name)
sys.path.insert(0, model_path)
from ogma_model import OgmaModel
self.model = OgmaModel.from_checkpoint(model_path, device=device)
self.model.eval()
self._tok = Tokenizer.from_file(f"{model_path}/tokenizer.json")
self._device = device
def _encode(self, texts, task=TaskToken.SYM):
# (encode function from Basic Usage above)
from your_module import encode # or inline the encode function
with torch.no_grad():
ids, mask = encode(texts)
return self.model.encode(ids.to(self._device), mask.to(self._device), task=task)
def embed_documents(self, texts):
return self._encode(texts, task=TaskToken.DOC).cpu().numpy().tolist()
def embed_query(self, text):
return self._encode([text], task=TaskToken.QRY).cpu().numpy()[0].tolist()
embeddings = OgmaEmbeddings()
Model Family
| Model | Params | Size | MTEB Avg | Class | Clust | PairClass | Rerank | Ret | STS | Summ | d_out | Context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ogma-large | 32.4M | 124 MB | 57.41 | 68.6 | 41.6 | 84.0 | 53.1 | 43.7 | 83.7 | 30.9 | 256 | 1024 |
| ogma-base | 13.3M | 51 MB | 57.04 | 67.89 | 41.49 | 83.73 | 51.25 | 42.36 | 82.84 | 29.73 | 256 | 1024 |
| ogma-small | 8.6M | 33 MB | 56.34 | 66.67 | 40.69 | 82.91 | 50.51 | 42.05 | 82.00 | 29.59 | 256 | 1024 |
| ogma-mini | 3.5M | 14 MB | 53.07 | 61.80 | 37.38 | 79.66 | 47.39 | 36.21 | 77.71 | 31.33 | 256 | 1024 |
| ogma-micro | 2.3M | 8.9 MB | 52.19 | 59.57 | 36.88 | 78.62 | 49.74 | 33.09 | 75.63 | 31.77 | 128 | 1024 |
| all-MiniLM-L6-v2 | 22.7M | 87 MB | 56.09 | 62.62 | 41.94 | 82.37 | 58.04 | 41.95 | 78.90 | 30.81 | 384 | 256 |
| potion-base-32M | 32.3M | 123 MB | 51.66 | 65.97 | 35.29 | 78.17 | 50.92 | 33.52 | 74.22 | 29.78 | 256 | inf |
| potion-base-8M | 7.6M | 29 MB | 50.03 | 64.44 | 32.93 | 76.62 | 49.73 | 31.71 | 73.24 | 29.28 | 256 | inf |
All Ogma: MTEB 2.10.7, 54-task standard English set, category-averaged. MiniLM/Potion: published scores from Model2Vec results page.
Training Details
| Property | Value |
|---|---|
| Teacher model | jinaai/jina-embeddings-v5-text-small (CC-BY-NC-4.0) |
| Training paradigm | Knowledge distillation from cached teacher embeddings |
| Training data | ~7M curated English sentence pairs |
| Tokenizer | AlbertTokenizer (SentencePiece, vocab=30,000) |
| Embedding initialisation | PCA of teacher embeddings (128d) projected to d_model |
| Loss | Distillation + contrastive (balanced schedule) |
| Evaluation framework | MTEB 2.10.7 |
Limitations
- No text generation. Ogma is an encoder-only embedding model.
- English only. Training data and evaluation are English-only.
- Slower than static models. Transformer inference is 40-100× slower than static models (Potion, Model2Vec) on CPU. The trade-off: contextual understanding and 4× longer sequences.
- Non-commercial licence. Due to distillation from a CC-BY-NC-4.0 teacher, Ogma inherits the NonCommercial restriction. Commercial use requires a separate Jina AI licence or retraining with a permissive teacher (Apache 2.0 compatible models like BGE or E5 can substitute at the cost of a full retraining run).
- Reranking gap. Ogma lags behind MiniLM-L6-v2 on reranking tasks (category avg delta: -6.8). This is an architectural characteristic: the model optimises for semantic similarity and classification over pairwise ranking.
Licence & Attribution
This model is released under CC-BY-NC-4.0 (Creative Commons Attribution-NonCommercial 4.0 International).
Required attribution (must be included in all uses):
This model was trained via knowledge distillation from
jina-embeddings-v5-text-small(https://huggingface.co/jinaai/jina-embeddings-v5-text-small) by Jina AI, licensed under CC-BY-NC-4.0.
Citation
@misc{ogma2026,
title = {Ogma: Efficient Dense Retrieval via Structured Embeddings},
author = {Axiotic AI},
year = {2026},
url = {https://huggingface.co/axiotic/ogma-base},
}
Evaluation results
- cosine_spearman on MTEB STSBenchmark (test set, self-reported): 82.84
- accuracy on MTEB AmazonPolarityClassification (test set, self-reported): 67.89
- v_measure on MTEB RedditClustering (test set, self-reported): 41.49
- cos_sim_ap on MTEB TwitterSemEval2015 (test set, self-reported): 83.73
- map on MTEB MindSmallReranking (validation set, self-reported): 51.25
- ndcg_at_10 on MTEB MSMARCO (self-reported): 42.36
- cos_sim_spearman on MTEB SummEval (test set, self-reported): 29.73