---
language: de
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- embeddings
- german
- text-embedding
model-index:
- name: smollm3-3b-embed-de
results: []
---
# SmolLM3-3B German Embeddings
Experimental German text embedding model based on [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B), trained using the LLM2Vec approach to transform a decoder-only LLM into a powerful text encoder.
## Model Description
This model represents German text as dense vectors suitable for semantic search, clustering, and similarity tasks. It was created by adapting SmolLM3-3B through a two-stage training process that enables bidirectional attention and teaches the model to generate meaningful text representations.
### Key Features
- **Architecture**: SmolLM3-3B with bidirectional attention
- **Embedding Dimension**: 2048
- **Max Sequence Length**: 512 tokens
- **Language**: German (primary), may have some cross-lingual capabilities
- **Training Method**: LLM2Vec (MNTP + Supervised Contrastive Learning)
## Training Process
### Stage 1: Bidirectional Conversion & MNTP (Masked Next Token Prediction)
1. **Model Transformation**: Modified SmolLM3-3B architecture to enable bidirectional attention by:
- Removing causal attention masks
- Enabling position-agnostic attention computation
- Preserving the original model weights
2. **MNTP Training** (a configuration sketch follows this list):
- **Dataset**: 50,000 samples from German Wikipedia
- **Task**: Predicting masked tokens using bidirectional context
- **Training Steps**: 1,000
- **Batch Size**: 512 (64 per device × 8 gradient accumulation)
- **LoRA Configuration**: rank=16, alpha=32
- **Learning Rate**: 1e-4 with warmup
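The hyperparameters above map onto a fairly standard PEFT + Transformers setup. The sketch below is for orientation only and is not the actual training script (the run presumably went through the LLM2Vec MNTP tooling); the target modules, dropout, and output directory are assumptions.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter configuration matching the values listed above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed projection layers
    lora_dropout=0.05,  # assumed value, not documented above
    bias="none",
)

# Trainer arguments mirroring the MNTP schedule (effective batch size 64 x 8 = 512).
training_args = TrainingArguments(
    output_dir="mntp-smollm3-de",  # placeholder path
    per_device_train_batch_size=64,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    warmup_steps=100,
    max_steps=1000,
    bf16=True,
    logging_steps=50,
)
```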
### Stage 2: Supervised Contrastive Learning
3. **Supervised Fine-tuning**:
- **Dataset**: German text pairs with positive/negative examples
- **Training Format**: Contrastive learning using (query, positive, negative) triplets
- **Training Steps**: 500
- **Batch Size**: 32 (16 per device × 2 gradient accumulation)
- **Learning Rate**: 2e-4 with warmup
- **Loss**: Contrastive loss to maximize similarity between semantically related texts (see the sketch below)
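At its core, the contrastive objective treats each query's positive as the correct "class" against its negative. The snippet below is a minimal, self-contained sketch of such a triplet loss over pre-computed embeddings; it illustrates the idea rather than reproducing the exact LLM2Vec loss, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def triplet_contrastive_loss(query_emb, pos_emb, neg_emb, temperature=0.05):
    """Cross-entropy over (positive, negative) similarity scores per query.

    All inputs are (batch, dim) embedding tensors; the temperature is assumed.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    n = F.normalize(neg_emb, dim=-1)
    pos_scores = (q * p).sum(dim=-1, keepdim=True) / temperature  # (batch, 1)
    neg_scores = (q * n).sum(dim=-1, keepdim=True) / temperature  # (batch, 1)
    logits = torch.cat([pos_scores, neg_scores], dim=1)           # (batch, 2)
    labels = torch.zeros(logits.size(0), dtype=torch.long)        # index 0 = positive
    return F.cross_entropy(logits, labels)

# Shape-only example with random 2048-dim embeddings:
loss = triplet_contrastive_loss(
    torch.randn(4, 2048), torch.randn(4, 2048), torch.randn(4, 2048)
)
```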
### Training Infrastructure
- **Hardware**: NVIDIA RTX A6000 (48GB VRAM)
- **Precision**: bfloat16
- **Framework**: Transformers + PEFT + LLM2Vec
## Usage
### Using with LLM2Vec Library
```python
import torch
from llm2vec import LLM2Vec
from sklearn.metrics.pairwise import cosine_similarity

# Load the model
model = LLM2Vec.from_pretrained(
    "mayflowergmbh/smollm3-3b-embed-de",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Encode German texts
texts = [
    "Berlin ist die Hauptstadt von Deutschland.",
    "Die deutsche Hauptstadt ist Berlin.",
    "München ist eine Stadt in Bayern."
]
embeddings = model.encode(texts)

# Calculate pairwise cosine similarity
similarity_matrix = cosine_similarity(embeddings)
```
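Since the first two example sentences are paraphrases of each other, their pairwise score in `similarity_matrix` should come out clearly higher than either sentence's score against the München sentence. To inspect the scores:

```python
import numpy as np

np.set_printoptions(precision=3)
print(similarity_matrix)  # 3x3 matrix; entry [0, 1] compares the two Berlin paraphrases
```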
### Using with Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

# Note: requires an adapter/conversion step for sentence-transformers compatibility
model = SentenceTransformer('path/to/smollm3-3b-embed-de')

texts = ["Berlin ist die Hauptstadt von Deutschland."]
embeddings = model.encode(texts)
```
## Intended Uses
### Primary Use Cases
- **Semantic Search**: Find relevant documents in German text corpora
- **Text Classification**: Use embeddings as features for downstream classifiers
- **Clustering**: Group similar German texts together
- **Duplicate Detection**: Identify semantically similar content
- **Question Answering**: Match questions with relevant answers
### Example: Semantic Search
```python
from sklearn.metrics.pairwise import cosine_similarity

# Create document embeddings (reusing the model loaded above)
documents = [
    "Die Katze sitzt auf dem Sofa.",
    "Der Hund spielt im Garten.",
    "Python ist eine Programmiersprache.",
    "Machine Learning revolutioniert die Technologie."
]
doc_embeddings = model.encode(documents)

# Search with a query
query = "Haustiere und ihre Aktivitäten"
query_embedding = model.encode([query])

# Rank documents by cosine similarity to the query
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
top_indices = similarities.argsort()[-3:][::-1]
for idx in top_indices:
    print(f"Score: {similarities[idx]:.3f} - {documents[idx]}")
```
## Performance Characteristics
### Strengths
- Excellent German language understanding
- Strong performance on semantic similarity tasks
- Efficient inference despite larger model size
- Benefits from SmolLM3's strong foundation
### Limitations
- Larger than typical embedding models (3B parameters)
- Requires GPU for optimal performance
- Limited to 512 token sequences
- Primarily optimized for German (cross-lingual performance not evaluated)
## Model Architecture Details
```
Base Model: SmolLM3-3B
- Hidden Size: 2048
- Intermediate Size: 11008
- Number of Layers: 36
- Number of Attention Heads: 16
- Vocabulary Size: 128256
- Position Embeddings: 65536 (RoPE)
```
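The 2048-dimensional embedding size is simply the model's hidden size: LLM2Vec pools the token-level hidden states of the bidirectionally attending model into a single vector, typically via mean pooling (other pooling modes exist). A minimal sketch of that pooling step, shown on dummy tensors so it runs stand-alone:

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors while ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

# Dummy tensors matching the architecture above (hidden size 2048):
hidden = torch.randn(2, 10, 2048)
mask = torch.ones(2, 10, dtype=torch.long)
print(mean_pool(hidden, mask).shape)  # torch.Size([2, 2048])
```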
## Training Hyperparameters
**MNTP Stage:**
- Learning Rate: 1e-4
- Batch Size: 512
- Max Sequence Length: 512
- Gradient Accumulation: 8
- LoRA r: 16
- LoRA alpha: 32
- Warmup Steps: 100
- Total Steps: 1000
**Supervised Stage:**
- Learning Rate: 2e-4
- Batch Size: 32
- Max Sequence Length: 256
- Training Epochs: 3
- Warmup Steps: 100
- Weight Decay: 0.01
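For orientation, the supervised-stage optimizer settings above correspond to a standard AdamW plus linear-warmup schedule. The sketch below only instantiates that schedule on a stand-in module; it is not the actual training code, and the 500-step horizon is taken from the step count stated earlier.

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(2048, 2048)  # stand-in for the PEFT-wrapped encoder

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,
    num_training_steps=500,  # matches the supervised step count stated above
)
```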
## Ethical Considerations
- **Bias**: Model may reflect biases present in German Wikipedia and training data
- **Use Cases**: Should not be used for making decisions about individuals
- **Privacy**: Do not use with personally identifiable information
## Citation
If you use this model, please cite:
```bibtex
@misc{smollm3-embed-de,
  title={SmolLM3-3B German Embeddings},
  author={Johann-Peter Hartmann},
  year={2025},
  publisher={Mayflower GmbH},
  url={https://huggingface.co/mayflowergmbh/smollm3-3b-embed-de}
}

@article{llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and others},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
```
## Acknowledgments
- Base model: [HuggingFaceTB/SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
- Training methodology: [McGill-NLP/LLM2Vec](https://github.com/McGill-NLP/llm2vec)
- Training data: German Wikipedia
## Contact
For questions or issues, please open an issue on the [GitHub repository](https://github.com/johannhartmann/german-llm-embed).