|
--- |
|
language: de |
|
license: apache-2.0 |
|
tags: |
|
- sentence-transformers |
|
- feature-extraction |
|
- sentence-similarity |
|
- embeddings |
|
- german |
|
- text-embedding |
|
model-index: |
|
- name: smollm3-3b-embed-de |
|
results: [] |
|
--- |
|
|
|
# SmolLM3-3B German Embeddings |
|
|
|
Experimental German text embedding model based on [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B), trained using the LLM2Vec approach to transform a decoder-only LLM into a powerful text encoder. |
|
|
|
## Model Description |
|
|
|
This model represents German text as dense vectors suitable for semantic search, clustering, and similarity tasks. It was created by adapting SmolLM3-3B through a two-stage training process that enables bidirectional attention and teaches the model to generate meaningful text representations. |
|
|
|
### Key Features |
|
- **Architecture**: SmolLM3-3B with bidirectional attention |
|
- **Embedding Dimension**: 2048 |
|
- **Max Sequence Length**: 512 tokens |
|
- **Language**: German (primary); some cross-lingual ability may carry over from the base model, but it has not been evaluated
|
- **Training Method**: LLM2Vec (MNTP + Supervised Contrastive Learning) |
|
|
|
## Training Process |
|
|
|
### Stage 1: Bidirectional Conversion & MNTP (Masked Next Token Prediction) |
|
|
|
1. **Model Transformation**: Modified the SmolLM3-3B architecture to enable bidirectional attention by:

   - Removing the causal attention mask

   - Letting every token attend to context on both sides of the sequence, not only to earlier positions

   - Preserving the original model weights (a conceptual sketch of the resulting setup follows this list)
|
|
|
2. **MNTP Training**: |
|
- **Dataset**: 50,000 samples from German Wikipedia |
|
- **Task**: Predicting masked tokens using bidirectional context |
|
- **Training Steps**: 1,000 |
|
- **Batch Size**: 512 (64 per device × 8 gradient accumulation) |
|
- **LoRA Configuration**: rank=16, alpha=32 |
|
- **Learning Rate**: 1e-4 with warmup |
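
In LLM2Vec's MNTP formulation, the masked token at position *i* is predicted from the model output at position *i − 1*, using context from both sides once the causal mask is removed. The sketch below illustrates only the data preparation for this objective in plain PyTorch; the tokenizer repo, masking probability, and placeholder mask id are illustrative assumptions, not the exact training setup.

```python
# Conceptual sketch of MNTP data preparation (not the actual training script).
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

text = "Berlin ist die Hauptstadt von Deutschland."
input_ids = tokenizer(text, return_tensors="pt")["input_ids"]

# Randomly mask a fraction of tokens (the probability here is illustrative).
mask_prob = 0.2
mask = torch.rand(input_ids.shape) < mask_prob
labels = input_ids.clone()
labels[~mask] = -100  # only masked positions contribute to the loss

masked_ids = input_ids.clone()
masked_ids[mask] = 0  # placeholder id; decoder vocabularies usually lack a dedicated [MASK] token

# MNTP twist: the masked token at position i is predicted from the output at
# position i - 1, so the labels are shifted left by one position.
shifted_labels = torch.full_like(labels, -100)
shifted_labels[:, :-1] = labels[:, 1:]

# With the causal mask removed, the model sees context on both sides of each
# masked position when making these predictions.
```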
|
|
|
### Stage 2: Supervised Contrastive Learning |
|
|
|
3. **Supervised Fine-tuning**: |
|
- **Dataset**: German text pairs with positive/negative examples |
|
- **Training Format**: Contrastive learning using (query, positive, negative) triplets |
|
- **Training Steps**: 500
|
- **Batch Size**: 32 (16 per device × 2 gradient accumulation) |
|
- **Learning Rate**: 2e-4 with warmup |
|
- **Loss**: Contrastive loss that pulls queries toward their positives and pushes them away from negatives (a minimal loss sketch follows this list)
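
A common way to implement this objective is an InfoNCE-style loss over the pooled embeddings, treating each query's positive as the correct candidate and both the hard negatives and the other in-batch examples as incorrect ones. The sketch below is a minimal PyTorch version; the temperature and the candidate pooling are illustrative assumptions, not the exact training configuration.

```python
# Minimal sketch of a triplet contrastive loss over pooled embeddings.
# q, p, n are (batch, dim) tensors for queries, positives, and hard negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(q: torch.Tensor, p: torch.Tensor, n: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    q, p, n = F.normalize(q, dim=-1), F.normalize(p, dim=-1), F.normalize(n, dim=-1)
    # Candidate pool: all positives and all hard negatives in the batch.
    candidates = torch.cat([p, n], dim=0)                  # (2 * batch, dim)
    logits = q @ candidates.T / temperature                # (batch, 2 * batch)
    # The correct candidate for query i is positive i.
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

# Example with random embeddings of the model's dimension (2048):
q, p, n = torch.randn(4, 2048), torch.randn(4, 2048), torch.randn(4, 2048)
print(contrastive_loss(q, p, n))
```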
|
|
|
### Training Infrastructure |
|
- **Hardware**: NVIDIA RTX A6000 (48GB VRAM) |
|
- **Precision**: bfloat16 |
|
- **Framework**: Transformers + PEFT + LLM2Vec |
|
|
|
## Usage |
|
|
|
### Using with LLM2Vec Library |
|
|
|
```python |
|
from llm2vec import LLM2Vec |
|
import torch |
|
|
|
# Load model |
|
model = LLM2Vec.from_pretrained(
    "mayflowergmbh/smollm3-3b-embed-de",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
|
|
|
# Encode German texts |
|
texts = [ |
|
"Berlin ist die Hauptstadt von Deutschland.", |
|
"Die deutsche Hauptstadt ist Berlin.", |
|
"München ist eine Stadt in Bayern." |
|
] |
|
|
|
embeddings = model.encode(texts) |
|
|
|
# Calculate similarity |
|
from sklearn.metrics.pairwise import cosine_similarity |
|
similarity_matrix = cosine_similarity(embeddings) |
|
``` |
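
Since the first two sentences both state that Berlin is the German capital, their pairwise score should clearly exceed either sentence's score with the München example; printing the matrix makes that easy to verify:

```python
import numpy as np

# The two Berlin sentences should produce the highest off-diagonal score.
print(np.round(similarity_matrix, 3))
```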
|
|
|
### Using with Sentence Transformers |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
|
|
# Note: requires the checkpoint to be packaged in sentence-transformers format
# (Transformer + Pooling modules); one possible wiring is sketched after this block.
model = SentenceTransformer('path/to/smollm3-3b-embed-de')
embeddings = model.encode(texts)  # texts as defined in the example above
|
``` |
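
One possible way to package the checkpoint for sentence-transformers is a `Transformer` module followed by mean pooling, sketched below. Whether this wrapper actually preserves the bidirectional attention depends on how the checkpoint is exported, so treat it as an assumption to verify; the LLM2Vec loading path above is the safer default.

```python
from sentence_transformers import SentenceTransformer, models

# Sketch only: wrap the checkpoint as Transformer + mean pooling modules.
# Verify that the loaded architecture really applies bidirectional attention
# before relying on this path.
word = models.Transformer("mayflowergmbh/smollm3-3b-embed-de", max_seq_length=512)
pooling = models.Pooling(word.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word, pooling])

embeddings = model.encode(["Berlin ist die Hauptstadt von Deutschland."])
print(embeddings.shape)  # expected: (1, 2048)
```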
|
|
|
## Intended Uses |
|
|
|
### Primary Use Cases |
|
- **Semantic Search**: Find relevant documents in German text corpora |
|
- **Text Classification**: Use embeddings as features for downstream classifiers |
|
- **Clustering**: Group similar German texts together |
|
- **Duplicate Detection**: Identify semantically similar content |
|
- **Question Answering**: Match questions with relevant answers |
|
|
|
### Example: Semantic Search |
|
|
|
```python |
|
# Create document embeddings |
|
documents = [ |
|
"Die Katze sitzt auf dem Sofa.", |
|
"Der Hund spielt im Garten.", |
|
"Python ist eine Programmiersprache.", |
|
"Machine Learning revolutioniert die Technologie." |
|
] |
|
doc_embeddings = model.encode(documents) |
|
|
|
# Search with a query |
|
query = "Haustiere und ihre Aktivitäten" |
|
query_embedding = model.encode([query]) |
|
|
|
# Find most similar documents |
|
similarities = cosine_similarity(query_embedding, doc_embeddings)[0] |
|
top_indices = similarities.argsort()[-3:][::-1] |
|
|
|
for idx in top_indices:
    print(f"Score: {similarities[idx]:.3f} - {documents[idx]}")
|
``` |
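
Clustering follows the same pattern: embed once, then run any standard clustering algorithm on the vectors. A small sketch with scikit-learn's KMeans, reusing `doc_embeddings` from above (the cluster count is an arbitrary choice for these four documents):

```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(doc_embeddings)

for doc, cluster_id in zip(documents, cluster_ids):
    print(f"Cluster {cluster_id}: {doc}")
```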
|
|
|
## Performance Characteristics |
|
|
|
### Strengths |
|
- Builds on SmolLM3-3B's strong German language coverage

- Intended for semantic similarity and retrieval tasks; benchmark results have not been published yet

- Reasonably efficient inference in bfloat16, despite the 3B parameter size

- Benefits from SmolLM3-3B's strong multilingual foundation
|
|
|
### Limitations |
|
- Larger than typical embedding models (3B parameters) |
|
- Requires GPU for optimal performance |
|
- Limited to 512 token sequences (longer inputs must be truncated or chunked; see the sketch after this list)
|
- Primarily optimized for German (cross-lingual performance not evaluated) |
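
To work around the 512-token limit, longer documents can be split into chunks before encoding and the chunk embeddings aggregated afterwards. A minimal sketch, assuming the model repo ships its tokenizer and that simple mean aggregation is acceptable:

```python
from transformers import AutoTokenizer

# Assumes the tokenizer is available in the model repo and `model` is the
# LLM2Vec instance from the usage example above.
tokenizer = AutoTokenizer.from_pretrained("mayflowergmbh/smollm3-3b-embed-de")

def chunk_text(text: str, max_tokens: int = 512) -> list[str]:
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    windows = [token_ids[i:i + max_tokens] for i in range(0, len(token_ids), max_tokens)]
    return [tokenizer.decode(window) for window in windows]

long_document = "..."  # any German text longer than 512 tokens
chunk_embeddings = model.encode(chunk_text(long_document))
document_embedding = chunk_embeddings.mean(axis=0)  # simple mean over chunk embeddings
```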
|
|
|
## Model Architecture Details |
|
|
|
``` |
|
Base Model: SmolLM3-3B |
|
- Hidden Size: 2048 |
|
- Intermediate Size: 11008 |
|
- Number of Layers: 36 |
|
- Number of Attention Heads: 16 |
|
- Vocabulary Size: 128256 |
|
- Max Position Embeddings: 65536 (RoPE)
|
``` |
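
These values can be cross-checked against the base model's configuration on the Hub, using the standard transformers config field names:

```python
from transformers import AutoConfig

# Read the architecture values straight from the base model's config.
cfg = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM3-3B")
print(cfg.hidden_size, cfg.intermediate_size, cfg.num_hidden_layers,
      cfg.num_attention_heads, cfg.vocab_size, cfg.max_position_embeddings)
```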
|
|
|
## Training Hyperparameters |
|
|
|
**MNTP Stage:** |
|
- Learning Rate: 1e-4 |
|
- Batch Size: 512 |
|
- Max Sequence Length: 512 |
|
- Gradient Accumulation: 8 |
|
- LoRA r: 16 |
|
- LoRA alpha: 32 |
|
- Warmup Steps: 100 |
|
- Total Steps: 1000 |
|
|
|
**Supervised Stage:** |
|
- Learning Rate: 2e-4 |
|
- Batch Size: 32 |
|
- Max Sequence Length: 256 |
|
- Training Epochs: 3 |
|
- Warmup Steps: 100 |
|
- Weight Decay: 0.01 |
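
As a rough illustration, the MNTP-stage values map onto PEFT and transformers configuration objects as sketched below. This is a sketch under the assumption of a standard LoRA + Trainer setup, not the exact training script; the target modules and dropout value are common choices for Llama-style blocks, not confirmed for this run.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA setup matching the MNTP-stage hyperparameters listed above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not reported above
    lora_dropout=0.05,                                        # illustrative value, not reported above
    task_type="CAUSAL_LM",
)

# Trainer arguments reflecting the MNTP stage (effective batch size 64 x 8 = 512).
training_args = TrainingArguments(
    output_dir="smollm3-3b-embed-de-mntp",
    per_device_train_batch_size=64,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    warmup_steps=100,
    max_steps=1000,
    bf16=True,
)
```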
|
|
|
## Ethical Considerations |
|
|
|
- **Bias**: Model may reflect biases present in German Wikipedia and training data |
|
- **Use Cases**: Should not be used for making decisions about individuals |
|
- **Privacy**: Do not use with personally identifiable information |
|
|
|
## Citation |
|
|
|
If you use this model, please cite: |
|
|
|
```bibtex |
|
@misc{smollm3-embed-de, |
|
title={SmolLM3-3B German Embeddings}, |
|
author={Johann-Peter Hartmann}, |
|
year={2025}, |
|
publisher={Mayflower GmbH}, |
|
url={https://huggingface.co/mayflowergmbh/smollm3-3b-embed-de} |
|
} |
|
|
|
@article{llm2vec, |
|
title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders}, |
|
author={BehnamGhader, Parishad and others},
|
journal={arXiv preprint arXiv:2404.05961}, |
|
year={2024} |
|
} |
|
``` |
|
|
|
## Acknowledgments |
|
|
|
- Base model: [HuggingFaceTB/SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) |
|
- Training methodology: [McGill-NLP/LLM2Vec](https://github.com/McGill-NLP/llm2vec) |
|
- Training data: German Wikipedia |
|
|
|
## Contact |
|
|
|
For questions or issues, please open an issue on the [GitHub repository](https://github.com/johannhartmann/german-llm-embed). |
|
|