---
language: hi
license: mit
tags:
- hindi
- embeddings
- sentence-embeddings
- semantic-search
- text-similarity
datasets:
- custom
pipeline_tag: sentence-similarity
library_name: transformers
---

# Hindi Sentence Embeddings Model

This is a custom state-of-the-art sentence embedding model trained specifically for Hindi text. It leverages an advanced transformer architecture with specialized pooling strategies to create high-quality semantic representations of Hindi sentences.

## Features

- Specialized for Hindi language text
- Advanced transformer architecture with optimized attention mechanism
- Multiple pooling strategies for enhanced semantic representations
- Creates normalized vector representations for semantic similarity
- Supports semantic search and text similarity applications

## Usage

### Installation

```bash
pip install torch sentencepiece scikit-learn matplotlib
git lfs install
git clone https://huggingface.co/DeepMostInnovations/hindi-embedding-foundational-model-10B
cd hindi-embedding-foundational-model-10B
```

### Enhanced RAG System

This model now includes an enhanced RAG (Retrieval Augmented Generation) system that integrates Unsloth's optimized Llama-3.2-1B-Instruct model for question answering on top of Hindi document retrieval.

#### Setup and Installation

1. Install additional dependencies:

   ```bash
   pip install unsloth transformers bitsandbytes accelerate langchain langchain-community faiss-cpu
   ```

2. Index your documents:

   ```bash
   python hindi-rag-system.py --model_dir /path/to/your/model --tokenizer_dir /path/to/tokenizer --data_dir ./data --output_dir ./output --index
   ```

3. Run in QA mode with LLM:

   ```bash
   python hindi-rag-system.py --model_dir /path/to/your/model --tokenizer_dir /path/to/tokenizer --output_dir ./output --interactive --qa
   ```

### Basic Embedding Usage

```python
from hindi_embeddings import HindiEmbedder

# Initialize the embedder
model = HindiEmbedder("path/to/hindi-embedding-foundational-model-10B")

# Encode sentences to embeddings
sentences = [
    "मुझे हिंदी भाषा बहुत पसंद है।",
    "मैं हिंदी भाषा सीख रहा हूँ।"
]
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Compute similarity between sentences
similarity = model.compute_similarity(sentences[0], sentences[1])
print(f"Similarity: {similarity:.4f}")

# Perform semantic search
query = "भारत की राजधानी"
documents = [
    "दिल्ली भारत की राजधानी है।",
    "मुंबई भारत का सबसे बड़ा शहर है।",
    "हिमालय पर्वत भारत के उत्तर में स्थित है।"
]
results = model.search(query, documents)
for i, result in enumerate(results):
    print(f"{i+1}. Score: {result['score']:.4f}")
    print(f"   Document: {result['document']}")

# Visualize embeddings
example_sentences = [
    "मुझे हिंदी में पढ़ना बहुत पसंद है।",
    "आज मौसम बहुत अच्छा है।",
    "भारत एक विशाल देश है।"
]
model.visualize_embeddings(example_sentences)
```

## Model Details

This model uses an advanced transformer-based architecture with the following enhancements:

- Pre-layer normalization for stable training
- Specialized attention mechanism with relative positional encoding
- Multiple pooling strategies (weighted, mean, attention-based)
- L2-normalized vectors for cosine similarity

Technical specifications:

- Embedding dimension: 768
- Hidden dimension: 768
- Layers: 12
- Attention heads: 12
- Vocabulary size: 50,000
- Context length: 128 tokens

## Applications

- Semantic search and information retrieval
- Text clustering and categorization
- Recommendation systems
- Question answering
- Document similarity comparison
- Content-based filtering
- RAG systems for Hindi language content

## License

This model is released under the MIT License.

## Citation

If you use this model in your research or application, please cite us:

```
@misc{DeepMostInnovations2025hindi,
  author = {DeepMost Innovations},
  title = {Hindi Sentence Embeddings Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/DeepMostInnovations/hindi-embedding-foundational-model-10B}}
}
```
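
## Example: Pooling and Normalization Sketch

The Model Details section lists mean pooling and L2-normalized vectors for cosine similarity. The snippet below is a minimal, dependency-free sketch of how those two steps fit together in general. It is not the model's actual pooling code: the function names (`mean_pool`, `l2_normalize`, `dot`) and the toy 3-dimensional vectors are assumptions made for illustration, whereas the real model produces 768-dimensional embeddings.

```python
import math


def mean_pool(token_vectors, mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit length; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


def dot(a, b):
    """Dot product; for unit-length vectors this equals cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical token vectors for two short sentences (3 dims for readability).
# The last position of sentence A is padding and is masked out.
sent_a_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [9.0, 9.0, 9.0]]
sent_a_mask = [1, 1, 0]
sent_b_tokens = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
sent_b_mask = [1, 0, 0]

vec_a = l2_normalize(mean_pool(sent_a_tokens, sent_a_mask))
vec_b = l2_normalize(mean_pool(sent_b_tokens, sent_b_mask))

similarity = dot(vec_a, vec_b)
print(f"Cosine similarity: {similarity:.4f}")
```

Because both vectors are normalized to unit length first, a plain dot product is enough to compare them. The same property lets a vector index rank documents by inner product and obtain cosine-similarity ordering.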