---
language: de
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- embeddings
- german
- text-embedding
model-index:
- name: smollm3-3b-embed-de
  results: []
---

# SmolLM3-3B German Embeddings

Experimental German text embedding model based on [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B), trained using the LLM2Vec approach to transform a decoder-only LLM into a powerful text encoder.

## Model Description

This model represents German text as dense vectors suitable for semantic search, clustering, and similarity tasks. It was created by adapting SmolLM3-3B through a two-stage training process that enables bidirectional attention and teaches the model to generate meaningful text representations.

### Key Features
- **Architecture**: SmolLM3-3B with bidirectional attention
- **Embedding Dimension**: 2048
- **Max Sequence Length**: 512 tokens
- **Language**: German (primary), may have some cross-lingual capabilities
- **Training Method**: LLM2Vec (MNTP + Supervised Contrastive Learning)

## Training Process

### Stage 1: Bidirectional Conversion & MNTP (Masked Next Token Prediction)

1. **Model Transformation**: Modified the SmolLM3-3B architecture to enable bidirectional attention by:
   - Removing the causal attention masks
   - Allowing each token to attend to both preceding and following context
   - Preserving the original model weights

2. **MNTP Training**:
   - **Dataset**: 50,000 samples from German Wikipedia
   - **Task**: Predicting masked tokens using bidirectional context
   - **Training Steps**: 1,000
   - **Batch Size**: 512 (64 per device × 8 gradient accumulation)
   - **LoRA Configuration**: rank=16, alpha=32
   - **Learning Rate**: 1e-4 with warmup
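
The original training scripts are not reproduced here, but the stage can be sketched with standard Transformers + PEFT components using the hyperparameters listed above. In the sketch below, the dataset identifier, mask-token handling, and masking probability are illustrative assumptions, and the bidirectional-attention patch applied by LLM2Vec is only noted in a comment:

```python
# Minimal, illustrative MNTP sketch (not the original training script).
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(base)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
if tokenizer.mask_token is None:  # MNTP needs a mask token; "<mask>" is an assumption
    tokenizer.add_special_tokens({"mask_token": "<mask>"})

model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
model.resize_token_embeddings(len(tokenizer))
# The real pipeline additionally patches the attention to be bidirectional
# (as in LLM2Vec) before this stage; that step is omitted here.

lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora)

# Assumed data source: 50,000 German Wikipedia articles, truncated to 512 tokens.
wiki = load_dataset("wikimedia/wikipedia", "20231101.de", split="train[:50000]")
tokenized = wiki.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                     batched=True, remove_columns=wiki.column_names)

args = TrainingArguments(
    output_dir="mntp-smollm3-de",
    per_device_train_batch_size=64,
    gradient_accumulation_steps=8,   # effective batch size 512
    learning_rate=1e-4,
    warmup_steps=100,
    max_steps=1000,
    bf16=True,
)

# With a causal LM head and shifted labels, each masked token at position i is
# predicted from position i-1, which matches the MNTP objective.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15),
)
trainer.train()
```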

### Stage 2: Supervised Contrastive Learning

3. **Supervised Fine-tuning**:
   - **Dataset**: German text pairs with positive/negative examples
   - **Training Format**: Contrastive learning using (query, positive, negative) triplets
   - **Training Steps**: 500
   - **Batch Size**: 32 (16 per device × 2 gradient accumulation)
   - **Learning Rate**: 2e-4 with warmup
   - **Loss**: Contrastive loss to maximize similarity between semantically related texts
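
As a rough illustration of this objective, the sketch below computes an InfoNCE-style contrastive loss over (query, positive, negative) triplets with in-batch negatives; the temperature value and the random stand-in embeddings are assumptions for demonstration, not details of the actual training run:

```python
# Simplified illustration of a triplet-based contrastive objective.
import torch
import torch.nn.functional as F

def info_nce_loss(q_emb, pos_emb, neg_emb, temperature=0.05):
    """InfoNCE over (query, positive, negative) triplets of shape (batch, dim).

    Each query is pulled toward its own positive and pushed away from its
    hard negative and from all other in-batch positives/negatives.
    """
    q = F.normalize(q_emb, dim=-1)
    candidates = torch.cat([F.normalize(pos_emb, dim=-1),
                            F.normalize(neg_emb, dim=-1)], dim=0)  # (2*batch, dim)
    logits = q @ candidates.T / temperature                        # (batch, 2*batch)
    labels = torch.arange(q.size(0), device=q.device)              # positive of query i is column i
    return F.cross_entropy(logits, labels)

# Example with random embeddings standing in for encoded German texts:
batch, dim = 4, 2048
loss = info_nce_loss(torch.randn(batch, dim), torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())
```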

### Training Infrastructure
- **Hardware**: NVIDIA RTX A6000 (48GB VRAM)
- **Precision**: bfloat16
- **Framework**: Transformers + PEFT + LLM2Vec

## Usage

### Using with LLM2Vec Library

```python
from llm2vec import LLM2Vec
import torch

# Load model
model = LLM2Vec.from_pretrained(
    "mayflowergmbh/smollm3-3b-embed-de",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Encode German texts
texts = [
    "Berlin ist die Hauptstadt von Deutschland.",
    "Die deutsche Hauptstadt ist Berlin.",
    "München ist eine Stadt in Bayern."
]

embeddings = model.encode(texts)

# Calculate pairwise cosine similarity
# (convert the embedding tensor to a float32 NumPy array for scikit-learn)
from sklearn.metrics.pairwise import cosine_similarity

embeddings = embeddings.float().cpu().numpy()
similarity_matrix = cosine_similarity(embeddings)
```

### Using with Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

# Note: Requires adapter for sentence-transformers compatibility
model = SentenceTransformer('path/to/smollm3-3b-embed-de')
embeddings = model.encode(texts)  # `texts` as defined in the LLM2Vec example above
```

## Intended Uses

### Primary Use Cases
- **Semantic Search**: Find relevant documents in German text corpora
- **Text Classification**: Use embeddings as features for downstream classifiers
- **Clustering**: Group similar German texts together
- **Duplicate Detection**: Identify semantically similar content
- **Question Answering**: Match questions with relevant answers

### Example: Semantic Search

```python
# Reuses `model` and `cosine_similarity` from the LLM2Vec example above.

# Create document embeddings
documents = [
    "Die Katze sitzt auf dem Sofa.",
    "Der Hund spielt im Garten.",
    "Python ist eine Programmiersprache.",
    "Machine Learning revolutioniert die Technologie."
]
doc_embeddings = model.encode(documents).float().cpu().numpy()

# Search with a query
query = "Haustiere und ihre Aktivitäten"
query_embedding = model.encode([query]).float().cpu().numpy()

# Find most similar documents
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
top_indices = similarities.argsort()[-3:][::-1]

for idx in top_indices:
    print(f"Score: {similarities[idx]:.3f} - {documents[idx]}")
```
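
### Example: Clustering

Embeddings can also be fed directly into standard clustering algorithms. The sketch below reuses `model` from the LLM2Vec usage example above; the sample sentences and the choice of two clusters are illustrative.

```python
from sklearn.cluster import KMeans

texts = [
    "Die Katze schläft auf dem Sofa.",
    "Der Hund spielt im Garten.",
    "Python ist eine Programmiersprache.",
    "Java wird für Unternehmenssoftware genutzt."
]

# Encode and convert to a float32 NumPy array for scikit-learn
embeddings = model.encode(texts).float().cpu().numpy()

labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(embeddings)
for text, label in zip(texts, labels):
    print(f"Cluster {label}: {text}")
```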

## Performance Characteristics

### Strengths
- Excellent German language understanding
- Strong performance on semantic similarity tasks
- Efficient inference despite larger model size
- Benefits from SmolLM3's strong foundation

### Limitations
- Larger than typical embedding models (3B parameters)
- Requires GPU for optimal performance
- Limited to 512 token sequences
- Primarily optimized for German (cross-lingual performance not evaluated)

## Model Architecture Details

```
Base Model: SmolLM3-3B
- Hidden Size: 2048
- Intermediate Size: 11008
- Number of Layers: 36
- Number of Attention Heads: 16
- Vocabulary Size: 128256
- Max Position Embeddings: 65536 (RoPE)
```
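
These values can be cross-checked against the base model's published configuration via the standard Transformers config attributes:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM3-3B")
print(cfg.hidden_size, cfg.num_hidden_layers, cfg.num_attention_heads,
      cfg.vocab_size, cfg.max_position_embeddings)
```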

## Training Hyperparameters

**MNTP Stage:**
- Learning Rate: 1e-4
- Batch Size: 512
- Max Sequence Length: 512
- Gradient Accumulation: 8
- LoRA r: 16
- LoRA alpha: 32
- Warmup Steps: 100
- Total Steps: 1000

**Supervised Stage:**
- Learning Rate: 2e-4
- Batch Size: 32
- Max Sequence Length: 256
- Training Epochs: 3
- Warmup Steps: 100
- Weight Decay: 0.01

## Ethical Considerations

- **Bias**: Model may reflect biases present in German Wikipedia and training data
- **Use Cases**: Should not be used for making decisions about individuals
- **Privacy**: Do not use with personally identifiable information

## Citation

If you use this model, please cite:

```bibtex
@misc{smollm3-embed-de,
  title={SmolLM3-3B German Embeddings},
  author={Johann-Peter Hartmann},
  year={2025},
  publisher={Mayflower GmbH},
  url={https://huggingface.co/mayflowergmbh/smollm3-3b-embed-de}
}

@article{llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={Behnamghader, Parishad and others},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
```

## Acknowledgments

- Base model: [HuggingFaceTB/SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
- Training methodology: [McGill-NLP/LLM2Vec](https://github.com/McGill-NLP/llm2vec)
- Training data: German Wikipedia

## Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/johannhartmann/german-llm-embed).