---
language: de
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- embeddings
- german
- text-embedding
model-index:
- name: smollm3-3b-embed-de
  results: []
---

# SmolLM3-3B German Embeddings

An experimental German text embedding model based on [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B), trained using the LLM2Vec approach to transform a decoder-only LLM into a powerful text encoder.

## Model Description

This model represents German text as dense vectors suitable for semantic search, clustering, and similarity tasks. It was created by adapting SmolLM3-3B through a two-stage training process that enables bidirectional attention and teaches the model to generate meaningful text representations.

### Key Features

- **Architecture**: SmolLM3-3B with bidirectional attention
- **Embedding Dimension**: 2048
- **Max Sequence Length**: 512 tokens
- **Language**: German (primary), may have some cross-lingual capabilities
- **Training Method**: LLM2Vec (MNTP + Supervised Contrastive Learning)

## Training Process

### Stage 1: Bidirectional Conversion & MNTP (Masked Next Token Prediction)

1. **Model Transformation**: Modified the SmolLM3-3B architecture to enable bidirectional attention by:
   - Removing causal attention masks (illustrated by the toy snippet after the stage descriptions below)
   - Enabling position-agnostic attention computation
   - Preserving the original model weights

2. **MNTP Training**:
   - **Dataset**: 50,000 samples from German Wikipedia
   - **Task**: Predicting masked tokens using bidirectional context
   - **Training Steps**: 1,000
   - **Batch Size**: 512 (64 per device × 8 gradient accumulation)
   - **LoRA Configuration**: rank=16, alpha=32
   - **Learning Rate**: 1e-4 with warmup

### Stage 2: Supervised Contrastive Learning

3. **Supervised Fine-tuning**:
   - **Dataset**: German text pairs with positive/negative examples
   - **Training Format**: Contrastive learning on (query, positive, negative) triplets
   - **Training Steps**: 500
   - **Batch Size**: 32 (16 per device × 2 gradient accumulation)
   - **Learning Rate**: 2e-4 with warmup
   - **Loss**: Contrastive loss to maximize similarity between semantically related texts (a minimal sketch of this objective follows below)
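To make the Stage 1 conversion concrete, here is a toy comparison of the causal attention mask used by the original decoder-only SmolLM3-3B and the all-ones mask that bidirectional attention amounts to. This is purely illustrative; the actual conversion is done inside LLM2Vec by swapping in a modified attention implementation rather than by building masks by hand.

```python
import torch

seq_len = 6

# Causal mask of the original decoder-only model:
# token i may only attend to tokens 0..i (lower triangle of ones).
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.long))

# Mask after the bidirectional conversion:
# every token may attend to every other token.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.long)

print("causal:\n", causal_mask)
print("bidirectional:\n", bidirectional_mask)
```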
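The Stage 2 objective can be sketched as a standard InfoNCE-style contrastive loss over (query, positive, negative) triplets with in-batch negatives. The snippet below is a minimal sketch under that assumption: the function name, temperature value, and shapes are illustrative and are not taken from the actual LLM2Vec training pipeline.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, pos_emb, neg_emb, temperature=0.05):
    """Illustrative InfoNCE-style loss over (query, positive, negative) triplets.

    All inputs have shape (batch_size, hidden_dim), e.g. (32, 2048) for this
    model. Temperature and naming are assumptions, not the trained values.
    """
    # Normalize so that dot products become cosine similarities
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    n = F.normalize(neg_emb, dim=-1)

    # Each query is scored against all positives in the batch (in-batch
    # negatives) plus its own hard negative.
    sim_pos = q @ p.T / temperature                             # (B, B)
    sim_neg = (q * n).sum(dim=-1, keepdim=True) / temperature   # (B, 1)
    logits = torch.cat([sim_pos, sim_neg], dim=1)               # (B, B + 1)

    # The matching positive for query i sits in column i.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```

In real training, the three embedding matrices would come from pooling the hidden states of the LoRA-adapted bidirectional model for each text in the triplet; the sketch only isolates the loss computation.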
### Training Infrastructure

- **Hardware**: NVIDIA RTX A6000 (48GB VRAM)
- **Precision**: bfloat16
- **Framework**: Transformers + PEFT + LLM2Vec

## Usage

### Using with the LLM2Vec Library

```python
from llm2vec import LLM2Vec
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Load the model
model = LLM2Vec.from_pretrained(
    "mayflowergmbh/smollm3-3b-embed-de",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Encode German texts
texts = [
    "Berlin ist die Hauptstadt von Deutschland.",
    "Die deutsche Hauptstadt ist Berlin.",
    "München ist eine Stadt in Bayern.",
]
embeddings = model.encode(texts)

# encode() returns a torch tensor; convert to float32 NumPy for scikit-learn
embeddings = embeddings.float().cpu().numpy()

# Calculate pairwise cosine similarity
similarity_matrix = cosine_similarity(embeddings)
```

### Using with Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

# Note: requires an adapter for sentence-transformers compatibility
model = SentenceTransformer('path/to/smollm3-3b-embed-de')
embeddings = model.encode(texts)  # `texts` as defined in the example above
```

## Intended Uses

### Primary Use Cases

- **Semantic Search**: Find relevant documents in German text corpora
- **Text Classification**: Use embeddings as features for downstream classifiers
- **Clustering**: Group similar German texts together
- **Duplicate Detection**: Identify semantically similar content
- **Question Answering**: Match questions with relevant answers

### Example: Semantic Search

```python
from sklearn.metrics.pairwise import cosine_similarity

# Create document embeddings
documents = [
    "Die Katze sitzt auf dem Sofa.",
    "Der Hund spielt im Garten.",
    "Python ist eine Programmiersprache.",
    "Machine Learning revolutioniert die Technologie.",
]
doc_embeddings = model.encode(documents).float().cpu().numpy()

# Search with a query
query = "Haustiere und ihre Aktivitäten"
query_embedding = model.encode([query]).float().cpu().numpy()

# Find the most similar documents
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
top_indices = similarities.argsort()[-3:][::-1]

for idx in top_indices:
    print(f"Score: {similarities[idx]:.3f} - {documents[idx]}")
```

## Performance Characteristics

### Strengths

- Excellent German language understanding
- Strong performance on semantic similarity tasks
- Efficient inference despite the larger model size
- Benefits from SmolLM3's strong foundation

### Limitations

- Larger than typical embedding models (3B parameters)
- Requires a GPU for optimal performance
- Limited to 512-token sequences
- Primarily optimized for German (cross-lingual performance not evaluated)

## Model Architecture Details

```
Base Model: SmolLM3-3B
- Hidden Size: 2048
- Intermediate Size: 11008
- Number of Layers: 36
- Number of Attention Heads: 16
- Vocabulary Size: 128256
- Position Embeddings: 65536 (RoPE)
```

## Training Hyperparameters

**MNTP Stage:**

- Learning Rate: 1e-4
- Batch Size: 512
- Max Sequence Length: 512
- Gradient Accumulation: 8
- LoRA r: 16
- LoRA alpha: 32
- Warmup Steps: 100
- Total Steps: 1000

**Supervised Stage:**

- Learning Rate: 2e-4
- Batch Size: 32
- Max Sequence Length: 256
- Training Epochs: 3
- Warmup Steps: 100
- Weight Decay: 0.01

## Ethical Considerations

- **Bias**: The model may reflect biases present in German Wikipedia and the training data
- **Use Cases**: Should not be used for making decisions about individuals
- **Privacy**: Do not use with personally identifiable information

## Citation

If you use this model, please cite:

```bibtex
@misc{smollm3-embed-de,
  title={SmolLM3-3B German Embeddings},
  author={Johann-Peter Hartmann},
  year={2025},
  publisher={Mayflower GmbH},
  url={https://huggingface.co/mayflowergmbh/smollm3-3b-embed-de}
}

@article{llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and others},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
```

## Acknowledgments

- Base model: [HuggingFaceTB/SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
- Training methodology: [McGill-NLP/LLM2Vec](https://github.com/McGill-NLP/llm2vec)
- Training data: German Wikipedia

## Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/johannhartmann/german-llm-embed).