|
--- |
|
language: de |
|
license: apache-2.0 |
|
tags: |
|
- sentence-transformers |
|
- feature-extraction |
|
- sentence-similarity |
|
- embeddings |
|
- german |
|
- text-embedding |
|
model-index: |
|
- name: smollm3-3b-embed-de |
|
results: [] |
|
--- |
|
|
|
# SmolLM3-3B German Embeddings |
|
|
|
Experimental German text embedding model based on [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B), trained using the LLM2Vec approach to transform a decoder-only LLM into a powerful text encoder. |
|
|
|
## Model Description |
|
|
|
This model represents German text as dense vectors suitable for semantic search, clustering, and similarity tasks. It was created by adapting SmolLM3-3B through a two-stage training process that enables bidirectional attention and teaches the model to generate meaningful text representations. |
|
|
|
### Key Features |
|
- **Architecture**: SmolLM3-3B with bidirectional attention |
|
- **Embedding Dimension**: 2048 |
|
- **Max Sequence Length**: 512 tokens |
|
- **Language**: German (primary); some cross-lingual ability may carry over from the base model, but it has not been evaluated
|
- **Training Method**: LLM2Vec (MNTP + Supervised Contrastive Learning) |
|
|
|
## Training Process |
|
|
|
### Stage 1: Bidirectional Conversion & MNTP (Masked Next Token Prediction) |
|
|
|
1. **Model Transformation**: Modified the SmolLM3-3B architecture to enable bidirectional attention by:

   - Removing the causal attention mask

   - Letting every token attend to context on both sides of the sequence, not only to earlier positions

   - Preserving the original model weights (a conceptual sketch of the resulting setup follows this list)
|
|
|
2. **MNTP Training**: |
|
- **Dataset**: 50,000 samples from German Wikipedia |
|
- **Task**: Predicting masked tokens using bidirectional context |
|
- **Training Steps**: 1,000 |
|
- **Batch Size**: 512 (64 per device × 8 gradient accumulation) |
|
- **LoRA Configuration**: rank=16, alpha=32 |
|
- **Learning Rate**: 1e-4 with warmup |
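
In LLM2Vec's MNTP formulation, the masked token at position *i* is predicted from the model output at position *i − 1*, using context from both sides once the causal mask is removed. The sketch below illustrates only the data preparation for this objective in plain PyTorch; the tokenizer repo, masking probability, and placeholder mask id are illustrative assumptions, not the exact training setup.

```python
# Conceptual sketch of MNTP data preparation (not the actual training script).
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

text = "Berlin ist die Hauptstadt von Deutschland."
input_ids = tokenizer(text, return_tensors="pt")["input_ids"]

# Randomly mask a fraction of tokens (the probability here is illustrative).
mask_prob = 0.2
mask = torch.rand(input_ids.shape) < mask_prob
labels = input_ids.clone()
labels[~mask] = -100  # only masked positions contribute to the loss

masked_ids = input_ids.clone()
masked_ids[mask] = 0  # placeholder id; decoder vocabularies usually lack a dedicated [MASK] token

# MNTP twist: the masked token at position i is predicted from the output at
# position i - 1, so the labels are shifted left by one position.
shifted_labels = torch.full_like(labels, -100)
shifted_labels[:, :-1] = labels[:, 1:]

# With the causal mask removed, the model sees context on both sides of each
# masked position when making these predictions.
```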
|
|
|
### Stage 2: Supervised Contrastive Learning |
|
|
|
3. **Supervised Fine-tuning**: |
|
- **Dataset**: German text pairs with positive/negative examples |
|
- **Training Format**: Contrastive learning using (query, positive, negative) triplets |
|
- **Training Steps**: 500
|
- **Batch Size**: 32 (16 per device × 2 gradient accumulation) |
|
- **Learning Rate**: 2e-4 with warmup |
|
- **Loss**: Contrastive loss that pulls queries toward their positives and pushes them away from negatives (a minimal loss sketch follows this list)
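
A common way to implement this objective is an InfoNCE-style loss over the pooled embeddings, treating each query's positive as the correct candidate and both the hard negatives and the other in-batch examples as incorrect ones. The sketch below is a minimal PyTorch version; the temperature and the candidate pooling are illustrative assumptions, not the exact training configuration.

```python
# Minimal sketch of a triplet contrastive loss over pooled embeddings.
# q, p, n are (batch, dim) tensors for queries, positives, and hard negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(q: torch.Tensor, p: torch.Tensor, n: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    q, p, n = F.normalize(q, dim=-1), F.normalize(p, dim=-1), F.normalize(n, dim=-1)
    # Candidate pool: all positives and all hard negatives in the batch.
    candidates = torch.cat([p, n], dim=0)                  # (2 * batch, dim)
    logits = q @ candidates.T / temperature                # (batch, 2 * batch)
    # The correct candidate for query i is positive i.
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

# Example with random embeddings of the model's dimension (2048):
q, p, n = torch.randn(4, 2048), torch.randn(4, 2048), torch.randn(4, 2048)
print(contrastive_loss(q, p, n))
```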
|
|
|
### Training Infrastructure |
|
- **Hardware**: NVIDIA RTX A6000 (48GB VRAM) |
|
- **Precision**: bfloat16 |
|
- **Framework**: Transformers + PEFT + LLM2Vec |
|
|
|
## Usage |
|
|
|
### Using with LLM2Vec Library |
|
|
|
```python |
|
from llm2vec import LLM2Vec |
|
import torch |
|
|
|
# Load model |
|
model = LLM2Vec.from_pretrained(
    "mayflowergmbh/smollm3-3b-embed-de",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
|
|
|
# Encode German texts |
|
texts = [ |
|
"Berlin ist die Hauptstadt von Deutschland.", |
|
"Die deutsche Hauptstadt ist Berlin.", |
|
"München ist eine Stadt in Bayern." |
|
] |
|
|
|
embeddings = model.encode(texts) |
|
|
|
# Calculate similarity |
|
from sklearn.metrics.pairwise import cosine_similarity |
|
similarity_matrix = cosine_similarity(embeddings) |
|
``` |
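
Since the first two sentences both state that Berlin is the German capital, their pairwise score should clearly exceed either sentence's score with the München example; printing the matrix makes that easy to verify:

```python
import numpy as np

# The two Berlin sentences should produce the highest off-diagonal score.
print(np.round(similarity_matrix, 3))
```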
|
|
|
### Using with Sentence Transformers |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
|
|
# Note: requires the checkpoint to be packaged in sentence-transformers format
# (Transformer + Pooling modules); one possible wiring is sketched after this block.
model = SentenceTransformer('path/to/smollm3-3b-embed-de')
embeddings = model.encode(texts)  # texts as defined in the example above
|
``` |
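
One possible way to package the checkpoint for sentence-transformers is a `Transformer` module followed by mean pooling, sketched below. Whether this wrapper actually preserves the bidirectional attention depends on how the checkpoint is exported, so treat it as an assumption to verify; the LLM2Vec loading path above is the safer default.

```python
from sentence_transformers import SentenceTransformer, models

# Sketch only: wrap the checkpoint as Transformer + mean pooling modules.
# Verify that the loaded architecture really applies bidirectional attention
# before relying on this path.
word = models.Transformer("mayflowergmbh/smollm3-3b-embed-de", max_seq_length=512)
pooling = models.Pooling(word.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word, pooling])

embeddings = model.encode(["Berlin ist die Hauptstadt von Deutschland."])
print(embeddings.shape)  # expected: (1, 2048)
```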
|
|
|
## Intended Uses |
|
|
|
### Primary Use Cases |
|
- **Semantic Search**: Find relevant documents in German text corpora |
|
- **Text Classification**: Use embeddings as features for downstream classifiers |
|
- **Clustering**: Group similar German texts together |
|
- **Duplicate Detection**: Identify semantically similar content |
|
- **Question Answering**: Match questions with relevant answers |
|
|
|
### Example: Semantic Search |
|
|
|
```python |
|
# Create document embeddings |
|
documents = [ |
|
"Die Katze sitzt auf dem Sofa.", |
|
"Der Hund spielt im Garten.", |
|
"Python ist eine Programmiersprache.", |
|
"Machine Learning revolutioniert die Technologie." |
|
] |
|
doc_embeddings = model.encode(documents) |
|
|
|
# Search with a query |
|
query = "Haustiere und ihre Aktivitäten" |
|
query_embedding = model.encode([query]) |
|
|
|
# Find most similar documents |
|
similarities = cosine_similarity(query_embedding, doc_embeddings)[0] |
|
top_indices = similarities.argsort()[-3:][::-1] |
|
|
|
for idx in top_indices:
    print(f"Score: {similarities[idx]:.3f} - {documents[idx]}")
|
``` |
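
Clustering follows the same pattern: embed once, then run any standard clustering algorithm on the vectors. A small sketch with scikit-learn's KMeans, reusing `doc_embeddings` from above (the cluster count is an arbitrary choice for these four documents):

```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(doc_embeddings)

for doc, cluster_id in zip(documents, cluster_ids):
    print(f"Cluster {cluster_id}: {doc}")
```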
|
|
|
## Performance Characteristics |
|
|
|
### Strengths |
|
- Builds on SmolLM3-3B's strong German language coverage

- Intended for semantic similarity and retrieval tasks; benchmark results have not been published yet

- Reasonably efficient inference in bfloat16, despite the 3B parameter size

- Benefits from SmolLM3-3B's strong multilingual foundation
|
|
|
### Limitations |
|
- Larger than typical embedding models (3B parameters) |
|
- Requires GPU for optimal performance |
|
- Limited to 512 token sequences (longer inputs must be truncated or chunked; see the sketch after this list)
|
- Primarily optimized for German (cross-lingual performance not evaluated) |
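
To work around the 512-token limit, longer documents can be split into chunks before encoding and the chunk embeddings aggregated afterwards. A minimal sketch, assuming the model repo ships its tokenizer and that simple mean aggregation is acceptable:

```python
from transformers import AutoTokenizer

# Assumes the tokenizer is available in the model repo and `model` is the
# LLM2Vec instance from the usage example above.
tokenizer = AutoTokenizer.from_pretrained("mayflowergmbh/smollm3-3b-embed-de")

def chunk_text(text: str, max_tokens: int = 512) -> list[str]:
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    windows = [token_ids[i:i + max_tokens] for i in range(0, len(token_ids), max_tokens)]
    return [tokenizer.decode(window) for window in windows]

long_document = "..."  # any German text longer than 512 tokens
chunk_embeddings = model.encode(chunk_text(long_document))
document_embedding = chunk_embeddings.mean(axis=0)  # simple mean over chunk embeddings
```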
|
|
|
## Model Architecture Details |
|
|
|
``` |
|
Base Model: SmolLM3-3B |
|
- Hidden Size: 2048 |
|
- Intermediate Size: 11008 |
|
- Number of Layers: 36 |
|
- Number of Attention Heads: 16 |
|
- Vocabulary Size: 128256 |
|
- Max Position Embeddings: 65536 (RoPE)
|
``` |
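
These values can be cross-checked against the base model's configuration on the Hub, using the standard transformers config field names:

```python
from transformers import AutoConfig

# Read the architecture values straight from the base model's config.
cfg = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM3-3B")
print(cfg.hidden_size, cfg.intermediate_size, cfg.num_hidden_layers,
      cfg.num_attention_heads, cfg.vocab_size, cfg.max_position_embeddings)
```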
|
|
|
## Training Hyperparameters |
|
|
|
**MNTP Stage:** |
|
- Learning Rate: 1e-4 |
|
- Batch Size: 512 |
|
- Max Sequence Length: 512 |
|
- Gradient Accumulation: 8 |
|
- LoRA r: 16 |
|
- LoRA alpha: 32 |
|
- Warmup Steps: 100 |
|
- Total Steps: 1000 |
|
|
|
**Supervised Stage:** |
|
- Learning Rate: 2e-4 |
|
- Batch Size: 32 |
|
- Max Sequence Length: 256 |
|
- Training Epochs: 3 |
|
- Warmup Steps: 100 |
|
- Weight Decay: 0.01 |
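
As a rough illustration, the MNTP-stage values map onto PEFT and transformers configuration objects as sketched below. This is a sketch under the assumption of a standard LoRA + Trainer setup, not the exact training script; the target modules and dropout value are common choices for Llama-style blocks, not confirmed for this run.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA setup matching the MNTP-stage hyperparameters listed above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not reported above
    lora_dropout=0.05,                                        # illustrative value, not reported above
    task_type="CAUSAL_LM",
)

# Trainer arguments reflecting the MNTP stage (effective batch size 64 x 8 = 512).
training_args = TrainingArguments(
    output_dir="smollm3-3b-embed-de-mntp",
    per_device_train_batch_size=64,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    warmup_steps=100,
    max_steps=1000,
    bf16=True,
)
```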
|
|
|
## Ethical Considerations |
|
|
|
- **Bias**: Model may reflect biases present in German Wikipedia and training data |
|
- **Use Cases**: Should not be used for making decisions about individuals |
|
- **Privacy**: Do not use with personally identifiable information |
|
|
|
## Citation |
|
|
|
If you use this model, please cite: |
|
|
|
```bibtex |
|
@misc{smollm3-embed-de, |
|
title={SmolLM3-3B German Embeddings}, |
|
author={Johann-Peter Hartmann}, |
|
year={2025}, |
|
publisher={Mayflower GmbH}, |
|
url={https://huggingface.co/mayflowergmbh/smollm3-3b-embed-de} |
|
} |
|
|
|
@article{llm2vec, |
|
title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders}, |
|
author={BehnamGhader, Parishad and others},
|
journal={arXiv preprint arXiv:2404.05961}, |
|
year={2024} |
|
} |
|
``` |
|
|
|
## Acknowledgments |
|
|
|
- Base model: [HuggingFaceTB/SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) |
|
- Training methodology: [McGill-NLP/LLM2Vec](https://github.com/McGill-NLP/llm2vec) |
|
- Training data: German Wikipedia |
|
|
|
## Contact |
|
|
|
For questions or issues, please open an issue on the [GitHub repository](https://github.com/johannhartmann/german-llm-embed). |
|
|