---
language: de
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- embeddings
- german
- text-embedding
model-index:
- name: smollm3-3b-embed-de
  results: []
---

# SmolLM3-3B German Embeddings

Experimental German text embedding model based on [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B), trained using the LLM2Vec approach to transform a decoder-only LLM into a powerful text encoder.

## Model Description

This model represents German text as dense vectors suitable for semantic search, clustering, and similarity tasks. It was created by adapting SmolLM3-3B through a two-stage training process that enables bidirectional attention and teaches the model to generate meaningful text representations.

### Key Features
- **Architecture**: SmolLM3-3B with bidirectional attention
- **Embedding Dimension**: 2048
- **Max Sequence Length**: 512 tokens
- **Language**: German (primary), may have some cross-lingual capabilities
- **Training Method**: LLM2Vec (MNTP + Supervised Contrastive Learning)

## Training Process

### Stage 1: Bidirectional Conversion & MNTP (Masked Next Token Prediction)

1. **Model Transformation**: Modified the SmolLM3-3B architecture to enable bidirectional attention by:
   - Removing the causal attention masks
   - Allowing each token to attend to both preceding and following context
   - Preserving the original model weights

2. **MNTP Training**:
   - **Dataset**: 50,000 samples from German Wikipedia
   - **Task**: Predicting masked tokens using bidirectional context
   - **Training Steps**: 1,000
   - **Batch Size**: 512 (64 per device × 8 gradient accumulation)
   - **LoRA Configuration**: rank=16, alpha=32
   - **Learning Rate**: 1e-4 with warmup
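
The original training scripts are not reproduced here, but the stage can be sketched with standard Transformers + PEFT components using the hyperparameters listed above. In the sketch below, the dataset identifier, mask-token handling, and masking probability are illustrative assumptions, and the bidirectional-attention patch applied by LLM2Vec is only noted in a comment:

```python
# Minimal, illustrative MNTP sketch (not the original training script).
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(base)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
if tokenizer.mask_token is None:  # MNTP needs a mask token; "<mask>" is an assumption
    tokenizer.add_special_tokens({"mask_token": "<mask>"})

model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
model.resize_token_embeddings(len(tokenizer))
# The real pipeline additionally patches the attention to be bidirectional
# (as in LLM2Vec) before this stage; that step is omitted here.

lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora)

# Assumed data source: 50,000 German Wikipedia articles, truncated to 512 tokens.
wiki = load_dataset("wikimedia/wikipedia", "20231101.de", split="train[:50000]")
tokenized = wiki.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                     batched=True, remove_columns=wiki.column_names)

args = TrainingArguments(
    output_dir="mntp-smollm3-de",
    per_device_train_batch_size=64,
    gradient_accumulation_steps=8,   # effective batch size 512
    learning_rate=1e-4,
    warmup_steps=100,
    max_steps=1000,
    bf16=True,
)

# With a causal LM head and shifted labels, each masked token at position i is
# predicted from position i-1, which matches the MNTP objective.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15),
)
trainer.train()
```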

### Stage 2: Supervised Contrastive Learning

3. **Supervised Fine-tuning**:
   - **Dataset**: German text pairs with positive/negative examples
   - **Training Format**: Contrastive learning using (query, positive, negative) triplets
   - **Training Steps**: 500
   - **Batch Size**: 32 (16 per device × 2 gradient accumulation)
   - **Learning Rate**: 2e-4 with warmup
   - **Loss**: Contrastive loss to maximize similarity between semantically related texts
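
As a rough illustration of this objective, the sketch below computes an InfoNCE-style contrastive loss over (query, positive, negative) triplets with in-batch negatives; the temperature value and the random stand-in embeddings are assumptions for demonstration, not details of the actual training run:

```python
# Simplified illustration of a triplet-based contrastive objective.
import torch
import torch.nn.functional as F

def info_nce_loss(q_emb, pos_emb, neg_emb, temperature=0.05):
    """InfoNCE over (query, positive, negative) triplets of shape (batch, dim).

    Each query is pulled toward its own positive and pushed away from its
    hard negative and from all other in-batch positives/negatives.
    """
    q = F.normalize(q_emb, dim=-1)
    candidates = torch.cat([F.normalize(pos_emb, dim=-1),
                            F.normalize(neg_emb, dim=-1)], dim=0)  # (2*batch, dim)
    logits = q @ candidates.T / temperature                        # (batch, 2*batch)
    labels = torch.arange(q.size(0), device=q.device)              # positive of query i is column i
    return F.cross_entropy(logits, labels)

# Example with random embeddings standing in for encoded German texts:
batch, dim = 4, 2048
loss = info_nce_loss(torch.randn(batch, dim), torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())
```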

### Training Infrastructure
- **Hardware**: NVIDIA RTX A6000 (48GB VRAM)
- **Precision**: bfloat16
- **Framework**: Transformers + PEFT + LLM2Vec

## Usage

### Using with LLM2Vec Library

```python
from llm2vec import LLM2Vec
import torch

# Load model
model = LLM2Vec.from_pretrained(
    "mayflowergmbh/smollm3-3b-embed-de",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Encode German texts
texts = [
    "Berlin ist die Hauptstadt von Deutschland.",
    "Die deutsche Hauptstadt ist Berlin.",
    "München ist eine Stadt in Bayern."
]

embeddings = model.encode(texts)

# Calculate pairwise cosine similarity
# (convert the embedding tensor to a float32 NumPy array for scikit-learn)
from sklearn.metrics.pairwise import cosine_similarity

embeddings = embeddings.float().cpu().numpy()
similarity_matrix = cosine_similarity(embeddings)
```

### Using with Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

# Note: Requires adapter for sentence-transformers compatibility
model = SentenceTransformer('path/to/smollm3-3b-embed-de')
embeddings = model.encode(texts)  # `texts` as defined in the LLM2Vec example above
```

## Intended Uses

### Primary Use Cases
- **Semantic Search**: Find relevant documents in German text corpora
- **Text Classification**: Use embeddings as features for downstream classifiers
- **Clustering**: Group similar German texts together
- **Duplicate Detection**: Identify semantically similar content
- **Question Answering**: Match questions with relevant answers

### Example: Semantic Search

```python
# Reuses `model` and `cosine_similarity` from the LLM2Vec example above.

# Create document embeddings
documents = [
    "Die Katze sitzt auf dem Sofa.",
    "Der Hund spielt im Garten.",
    "Python ist eine Programmiersprache.",
    "Machine Learning revolutioniert die Technologie."
]
doc_embeddings = model.encode(documents).float().cpu().numpy()

# Search with a query
query = "Haustiere und ihre Aktivitäten"
query_embedding = model.encode([query]).float().cpu().numpy()

# Find most similar documents
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
top_indices = similarities.argsort()[-3:][::-1]

for idx in top_indices:
    print(f"Score: {similarities[idx]:.3f} - {documents[idx]}")
```
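
### Example: Clustering

Embeddings can also be fed directly into standard clustering algorithms. The sketch below reuses `model` from the LLM2Vec usage example above; the sample sentences and the choice of two clusters are illustrative.

```python
from sklearn.cluster import KMeans

texts = [
    "Die Katze schläft auf dem Sofa.",
    "Der Hund spielt im Garten.",
    "Python ist eine Programmiersprache.",
    "Java wird für Unternehmenssoftware genutzt."
]

# Encode and convert to a float32 NumPy array for scikit-learn
embeddings = model.encode(texts).float().cpu().numpy()

labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(embeddings)
for text, label in zip(texts, labels):
    print(f"Cluster {label}: {text}")
```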

## Performance Characteristics

### Strengths
- Excellent German language understanding
- Strong performance on semantic similarity tasks
- Efficient inference despite larger model size
- Benefits from SmolLM3's strong foundation

### Limitations
- Larger than typical embedding models (3B parameters)
- Requires GPU for optimal performance
- Limited to 512 token sequences
- Primarily optimized for German (cross-lingual performance not evaluated)

## Model Architecture Details

```
Base Model: SmolLM3-3B
- Hidden Size: 2048
- Intermediate Size: 11008
- Number of Layers: 36
- Number of Attention Heads: 16
- Vocabulary Size: 128256
- Max Position Embeddings: 65536 (RoPE)
```
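
These values can be cross-checked against the base model's published configuration via the standard Transformers config attributes:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM3-3B")
print(cfg.hidden_size, cfg.num_hidden_layers, cfg.num_attention_heads,
      cfg.vocab_size, cfg.max_position_embeddings)
```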

## Training Hyperparameters

**MNTP Stage:**
- Learning Rate: 1e-4
- Batch Size: 512
- Max Sequence Length: 512
- Gradient Accumulation: 8
- LoRA r: 16
- LoRA alpha: 32
- Warmup Steps: 100
- Total Steps: 1000

**Supervised Stage:**
- Learning Rate: 2e-4
- Batch Size: 32
- Max Sequence Length: 256
- Training Epochs: 3
- Warmup Steps: 100
- Weight Decay: 0.01

## Ethical Considerations

- **Bias**: Model may reflect biases present in German Wikipedia and training data
- **Use Cases**: Should not be used for making decisions about individuals
- **Privacy**: Do not use with personally identifiable information

## Citation

If you use this model, please cite:

```bibtex
@misc{smollm3-embed-de,
  title={SmolLM3-3B German Embeddings},
  author={Johann-Peter Hartmann},
  year={2025},
  publisher={Mayflower GmbH},
  url={https://huggingface.co/mayflowergmbh/smollm3-3b-embed-de}
}

@article{llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={Behnamghader, Parishad and others},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
```

## Acknowledgments

- Base model: [HuggingFaceTB/SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
- Training methodology: [McGill-NLP/LLM2Vec](https://github.com/McGill-NLP/llm2vec)
- Training data: German Wikipedia

## Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/johannhartmann/german-llm-embed).