
SauerkrautLM-Multi-ColBERT-33m
This model is a compact Late Interaction retriever that leverages:
- Pretraining on over 8.2 billion tokens in a two-phase approach (4.6B multilingual + 3.6B English tokens)
- Knowledge distillation from state-of-the-art reranker models during pretraining
- An efficient 33M-parameter architecture, optimized for edge deployment while maintaining high performance
🎯 Core Features and Innovations:
Two-Phase Pretraining Strategy:
- Phase 1: 4,641,714,000 tokens of multilingual data covering 7 European languages
- Phase 2: 3,620,166,317 tokens of high-quality English data for enhanced performance
- Total: Over 8.2 billion tokens of pretrained knowledge
Advanced Knowledge Distillation: Learning from powerful reranker models throughout the pretraining process
Balanced Efficiency: With 33M parameters, achieving the sweet spot between performance and deployability
💪 The Foundation Model: Compact yet Powerful
With 33 million parameters – that's less than 1/200th the size of some competing models – SauerkrautLM-Multi-ColBERT-33m represents efficient pretraining at scale:
- 200× smaller than 7B+ parameter models
- Over 3× smaller than typical BERT-base models (110M parameters)
- 2× larger than the ultra-compact 15M variant
- Trained on 8.2 billion tokens - that's 248 tokens per parameter!
This balanced architecture combined with pretraining creates a powerful foundation for downstream applications, offering superior performance compared to the 15M variant while remaining highly efficient.
Model Overview
Model: VAGOsolutions/SauerkrautLM-Multi-ColBERT-33m
Type: Pretrained foundation model for Late Interaction retrieval
Architecture: PyLate / ColBERT (Late Interaction)
Languages: Multilingual (optimized for 7 European languages: German, English, Spanish, French, Italian, Dutch, Portuguese)
License: Apache 2.0
Model Size: 33M parameters
Training Data: 8.2B tokens (4.6B multilingual + 3.6B English)
Model Description
- Model Type: PyLate model with innovative Late Interaction architecture
- Document Length: 8192 tokens (16× the 512-token limit of traditional BERT models)
- Query Length: 256 tokens (optimized for complex, multi-part queries)
- Output Dimensionality: 128 dimensions per token (efficient vector representation)
- Similarity Function: MaxSim (enables precise token-level matching; see the sketch below)
- Training Method: Two-phase knowledge distillation from reranker models
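To make the MaxSim similarity concrete, here is a minimal, illustrative sketch of Late Interaction scoring over 128-dimensional token embeddings. It uses PyTorch with toy random tensors; the shapes and normalization here are assumptions for the example, not the exact PyLate implementation:

import torch

def maxsim_score(query_embeddings: torch.Tensor, document_embeddings: torch.Tensor) -> torch.Tensor:
    """Late Interaction (MaxSim) score between one query and one document.

    query_embeddings:    (num_query_tokens, 128)  L2-normalized token vectors
    document_embeddings: (num_doc_tokens, 128)    L2-normalized token vectors
    """
    # Token-level similarity matrix: (num_query_tokens, num_doc_tokens)
    similarity = query_embeddings @ document_embeddings.T
    # For each query token, keep its best-matching document token, then sum over query tokens
    return similarity.max(dim=1).values.sum()

# Toy example with random, normalized embeddings in the model's 128-dim output space
query = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
doc = torch.nn.functional.normalize(torch.randn(100, 128), dim=-1)
print(maxsim_score(query, doc))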
Architecture
ColBERT(
  (0): Transformer(CompressedModernBertModel)
  (1): Dense(384 -> 128 dim, no bias)
)
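If you want to verify this module stack locally, PyLate's ColBERT builds on Sentence Transformers, so printing the loaded model shows its components (a quick sketch; the exact repr formatting may differ from the summary above):

from pylate import models

# Load the model and print its module stack (Transformer backbone + Dense projection to 128 dims)
model = models.ColBERT(
    model_name_or_path="VAGOsolutions/SauerkrautLM-Multi-ColBERT-33m",
)
print(model)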
🔬 Technical Innovations in Detail
Two-Phase Pretraining: Building Multilingual then English Excellence
Our 33M parameter model undergoes sophisticated two-phase pretraining:
Phase 1: Multilingual Foundation (4.6B tokens)
- Data Volume: 4,641,714,000 tokens across 7 European languages
- Languages: Balanced representation of German, English, Spanish, French, Italian, Dutch, and Portuguese
- Objective: Build robust multilingual understanding and cross-lingual capabilities
Phase 2: English Enhancement (3.6B tokens)
- Data Volume: 3,620,166,317 high-quality English tokens
- Focus: Enhance English performance while maintaining multilingual capabilities
- Result: State-of-the-art English retrieval without sacrificing other languages
Knowledge Distillation Throughout Pretraining
Unlike typical pretraining, we leverage continuous knowledge distillation (an illustrative sketch of such an objective follows the list below):
- Teacher Models: State-of-the-art reranker models guide the learning process
- Distillation Objective: Learn optimal ranking patterns from the ground up
- Efficiency Gain: Achieves superior performance with 200× fewer parameters
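As an illustration of what a distillation objective of this kind can look like, here is a generic listwise KL-divergence sketch that matches a student's query-document score distribution to a teacher reranker's distribution. This is a common formulation for retrieval distillation, not the exact training recipe used for this model:

import torch
import torch.nn.functional as F

def listwise_distillation_loss(student_scores: torch.Tensor, teacher_scores: torch.Tensor) -> torch.Tensor:
    """Generic listwise distillation: align the student's score distribution
    over candidate documents with the teacher reranker's distribution.

    student_scores / teacher_scores: (batch_size, num_candidates)
    """
    student_log_probs = F.log_softmax(student_scores, dim=-1)
    teacher_probs = F.softmax(teacher_scores, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Toy example: 2 queries, 8 candidate documents each
student = torch.randn(2, 8)   # e.g. MaxSim scores from the ColBERT student
teacher = torch.randn(2, 8)   # e.g. relevance scores from a reranker teacher
print(listwise_distillation_loss(student, teacher))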
Compact Yet Capable Design
SauerkrautLM-Multi-ColBERT-33m achieves optimal balance through:
- Compact Architecture (~33 M params)
- Balanced BERT design — 12 layers, hidden_size = 384
- Multi-head attention — 24 attention heads (16-dim each) for nuanced understanding
- Production-ready — deployable on standard infrastructure
- Intermediate size — 1152 (3× hidden size) for sufficient expressiveness
This architecture enables Late Interaction Retrieval with significantly better performance than the 15M variant while maintaining excellent efficiency.
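For reference, the figures quoted above can be collected into a configuration sketch. This is a hypothetical Python summary based only on the numbers listed in this section; the model's actual configuration file on the Hub is authoritative:

# Hypothetical summary of the reported architecture, not the literal config file
architecture = {
    "num_hidden_layers": 12,
    "hidden_size": 384,
    "num_attention_heads": 24,   # 384 / 24 = 16 dims per head
    "intermediate_size": 1152,   # 3x hidden size
    "colbert_output_dim": 128,   # Dense projection 384 -> 128
    "max_document_length": 8192,
    "max_query_length": 256,
}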
🔬 Benchmarks: Foundation Model Performance
SauerkrautLM-Multi-ColBERT-33m delivers strong multilingual retrieval performance, demonstrating the effectiveness of our two-phase pretraining approach at this parameter scale.
NanoBEIR Europe (multilingual retrieval)
Average nDCG@10 across seven European languages, showing excellent multilingual capabilities from our two-phase pretraining:
Language | nDCG@10 | Performance Notes |
---|---|---|
en | 51.74 | Enhanced by Phase 2 English pretraining |
de | 38.46 | Strong German-language performance |
es | 43.10 | Excellent Spanish-language capabilities |
fr | 40.96 | Consistent cross-lingual transfer |
it | 40.44 | Balanced multilingual representation |
nl | 37.51 | Effective on lower-resource languages |
pt | 39.55 | Maintains quality across language families |
Key Observations:
- English Excellence: The two-phase training strategy yields exceptional English performance (51.74) while maintaining strong multilingual capabilities
- Significant Improvement over 15M: All languages show substantial gains compared to the 15M variant (5-7 points improvement on average)
- Balanced Multilingual: Non-English languages show strong performance (37-43 nDCG@10), demonstrating effective multilingual pretraining
- Token Efficiency: With 8.2B training tokens on 33M parameters, the model achieves excellent data efficiency (248 tokens per parameter)
Why SauerkrautLM-Multi-ColBERT-33m Matters as a Foundation Model
- Optimal Balance: Perfect sweet spot between the ultra-compact 15M and larger models
- Superior Performance: Significant improvements over 15M variant across all languages
- Production Ready: Deployable on standard GPUs and cloud infrastructure
- Long context length: Suitable for long documents of up to 8192 tokens
- True Multilingual Foundation: Native support for 7 European languages from pretraining
- Ideal for Fine-tuning: Strong base model for task-specific adaptations
- Cost-Effective: Train specialized models without massive compute requirements
This pretrained model serves as an ideal foundation for:
- High-performance retrieval systems
- Multilingual search applications
- Standard deployment scenarios
- Rapid prototyping with better accuracy
- Production systems requiring reliability
Real-World Applications
The combination of massive pretraining and balanced efficiency enables:
- Production Search Systems: Deploy on standard infrastructure with confidence
- Multilingual Products: Single model serving users across 7 languages with high quality
- Hybrid Deployments: Run on-premise or in cloud with reasonable resource requirements
- Enhanced Accuracy: Better performance for critical applications compared to 15M
- Scalable Solutions: Handle larger workloads without exponential resource growth
📈 Summary: The Power of Balanced Pretraining
SauerkrautLM-Multi-ColBERT-33m demonstrates that thoughtful parameter scaling combined with strong pretraining creates optimal foundation models. By training on 8.2 billion tokens across two phases, we've created a model that:
- Delivers superior performance compared to ultra-compact variants
- Maintains excellent efficiency with just 33M parameters (248 tokens per parameter!)
- Achieves strong multilingual results across 7 European languages
- Provides exceptional English retrieval (51.74 nDCG@10) through targeted enhancement
- Enables practical deployments on standard infrastructure
- Offers an ideal foundation for diverse downstream applications
This model represents the optimal balance between performance and efficiency for production-grade multilingual retrieval systems.
PyLate
This is a PyLate model. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
Usage
First install the PyLate library:
pip install -U pylate
Retrieval
PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.
Indexing documents
First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:
from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
model = models.ColBERT(
    model_name_or_path="VAGOsolutions/SauerkrautLM-Multi-ColBERT-33m",
)

# Step 2: Initialize the Voyager index
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:
# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
)
Retrieving top-k documents for queries
Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries. To do so, initialize the ColBERT retriever with the index you want to search in, encode the queries and then retrieve the top-k documents to get the top matches ids and relevance scores:
# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
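Each entry in scores corresponds to one query. As a quick way to inspect the results (assuming the PyLate retriever's documented output of per-query lists of id/score pairs):

# Illustrative inspection of the results (field names assume PyLate's documented output format)
for query, query_results in zip(["query for document 3", "query for document 1"], scores):
    print(query)
    for result in query_results:
        print("  ", result)  # e.g. {"id": "3", "score": ...}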
Reranking
If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use the rank.rerank function and pass the queries and documents to rerank:
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="VAGOsolutions/SauerkrautLM-Multi-ColBERT-33m",
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
Citation
BibTeX
SauerkrautLM-Multi-ColBERT-33m
@misc{SauerkrautLM-Multi-ColBERT-33m,
    title={SauerkrautLM-Multi-ColBERT-33m},
    author={David Golchinfar},
    url={https://huggingface.co/VAGOsolutions/SauerkrautLM-Multi-ColBERT-33m},
    year={2025}
}
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
    title = {Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
    author = {Reimers, Nils and Gurevych, Iryna},
    booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
    month = {11},
    year = {2019},
    publisher = {Association for Computational Linguistics},
    url = {https://arxiv.org/abs/1908.10084}
}
PyLate
@misc{PyLate,
    title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
    author={Chaffin, Antoine and Sourty, Raphaël},
    url={https://github.com/lightonai/pylate},
    year={2024}
}
Acknowledgements
We thank the PyLate team for providing the training framework that made this work possible.