+---
+language: hi
+license: mit
+tags:
+  - hindi
+  - embeddings
+  - sentence-embeddings
+  - semantic-search
+  - text-similarity
+datasets:
+  - custom
+pipeline_tag: sentence-similarity
+library_name: transformers
+---
+# Hindi Sentence Embeddings Model
+This is a custom sentence embedding model trained specifically for Hindi text. It uses a transformer architecture with several pooling strategies to produce normalized semantic representations of Hindi sentences.
+## Features
+- Specialized for Hindi-language text
+- Transformer architecture with pre-layer normalization and relative positional encoding in the attention mechanism
+- Multiple pooling strategies (weighted, mean, attention-based)
+- Produces L2-normalized vectors suited to cosine similarity
+- Supports semantic search and text similarity applications
+## Usage
+### Installation
+```bash
+pip install torch sentencepiece scikit-learn matplotlib
+git lfs install
+git clone https://huggingface.co/DeepMostInnovations/hindi-embedding-foundational-model
+cd hindi-embedding-foundational-model
+```
+### Enhanced RAG System
+The model now ships with a RAG (Retrieval-Augmented Generation) system that pairs Hindi document retrieval with Unsloth's optimized Llama-3.2-1B-Instruct model for question answering.
+#### Setup and Installation
+1. Install additional dependencies:
+```bash
+pip install unsloth transformers bitsandbytes accelerate langchain langchain-community faiss-cpu
+```
+2. Index your documents:
+```bash
+python hindi-rag-system.py --model_dir /path/to/your/model --tokenizer_dir /path/to/tokenizer --data_dir ./data --output_dir ./output --index
+```
+3. Run in QA mode with LLM:
+```bash
+python hindi-rag-system.py --model_dir /path/to/your/model --tokenizer_dir /path/to/tokenizer --output_dir ./output --interactive --qa
+```
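The indexing script's internals are not shown in this card. As a rough, hypothetical sketch of one preprocessing step such a pipeline often needs, the helper below splits a long document into overlapping chunks so that each piece is more likely to fit the encoder's 128-token context. The name `chunk_words` and its parameters `max_words` and `overlap` are assumptions made for illustration, and whitespace-separated words are only a crude proxy for SentencePiece tokens.

```python
def chunk_words(text, max_words=100, overlap=20):
    """Split text into overlapping word windows (hypothetical helper).

    Words are a rough stand-in for tokens; a real pipeline would count
    tokens with the model's own tokenizer instead.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


# Toy input: ten "words" split into windows of four with one word of overlap.
sample = " ".join(str(i) for i in range(10))
chunks = chunk_words(sample, max_words=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap keeps a little shared context between neighbouring chunks, which can help when a relevant sentence straddles a chunk boundary.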
+### Basic Embedding Usage
+```python
+from hindi_embeddings import HindiEmbedder
+# Initialize the embedder
+model = HindiEmbedder("path/to/hindi-embedding-foundational-model")
+# Encode sentences to embeddings
+sentences = [
+    "मुझे हिंदी भाषा बहुत पसंद है।",
+    "मैं हिंदी भाषा सीख रहा हूँ।"
+]
+embeddings = model.encode(sentences)
+print(f"Embedding shape: {embeddings.shape}")
+# Compute similarity between sentences
+similarity = model.compute_similarity(sentences[0], sentences[1])
+print(f"Similarity: {similarity:.4f}")
+# Perform semantic search
+query = "भारत की राजधानी"
+documents = [
+    "दिल्ली भारत की राजधानी है।",
+    "मुंबई भारत का सबसे बड़ा शहर है।",
+    "हिमालय पर्वत भारत के उत्तर में स्थित है।"
+]
+results = model.search(query, documents)
+for rank, result in enumerate(results, start=1):
+    print(f"{rank}. Score: {result['score']:.4f}")
+    print(f"   Document: {result['document']}")
+# Visualize embeddings
+example_sentences = [
+    "मुझे हिंदी में पढ़ना बहुत पसंद है।",
+    "आज मौसम बहुत अच्छा है।",
+    "भारत एक विशाल देश है।"
+]
+model.visualize_embeddings(example_sentences)
+```
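For readers curious what a call like `model.search` might do internally, here is a minimal sketch; it is not the library's actual implementation. It assumes the embeddings are already L2-normalized (as this card states), so a plain dot product equals cosine similarity. The function `rank_documents` and the toy 2-dimensional vectors are hypothetical; the result keys `score` and `document` mirror those used in the example above.

```python
def rank_documents(query_vec, doc_vecs, documents, top_k=3):
    """Rank documents by dot product with the query (hypothetical helper).

    Assumes every vector is already unit-length, so the dot product
    is the cosine similarity.
    """
    results = []
    for document, vec in zip(documents, doc_vecs):
        score = sum(q * d for q, d in zip(query_vec, vec))
        results.append({"document": document, "score": score})
    # Highest similarity first, then keep only the top_k entries.
    results.sort(key=lambda item: item["score"], reverse=True)
    return results[:top_k]


# Toy unit-length vectors stand in for real 768-dimensional embeddings.
query_vec = [1.0, 0.0]
doc_vecs = [[0.6, 0.8], [1.0, 0.0], [0.0, 1.0]]
documents = ["doc A", "doc B", "doc C"]

results = rank_documents(query_vec, doc_vecs, documents, top_k=2)
for rank, result in enumerate(results, start=1):
    print(f"{rank}. {result['document']} (score: {result['score']:.4f})")
```

For larger collections, an approximate nearest-neighbour index such as FAISS (installed in the RAG setup above) would typically replace this linear scan.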
+## Model Details
+The model is transformer-based, with the following design choices:
+- Pre-layer normalization for stable training
+- Specialized attention mechanism with relative positional encoding
+- Multiple pooling strategies (weighted, mean, attention-based)
+- L2-normalized vectors for cosine similarity
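To make the pooling and normalization bullets concrete, the following is a minimal sketch of masked mean pooling followed by L2 normalization, written in plain Python. It is an illustration under stated assumptions, not the model's code: the function names `mean_pool` and `l2_normalize` are hypothetical, and the weighted and attention-based variants are not shown.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                total[j] += value
    if count == 0:
        return total  # all padding: return the zero vector
    return [value / count for value in total]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length; eps guards against division by zero."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


# Three token vectors; the last one is padding and is masked out.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
sentence_vec = l2_normalize(pooled)
print(pooled)
print(sentence_vec)
```

Once vectors have unit length, cosine similarity between two sentences reduces to their dot product, which is why normalizing at encoding time is convenient for search.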
+Technical specifications:
+- Embedding dimension: 768
+- Hidden dimension: 768
+- Layers: 12
+- Attention heads: 12
+- Vocabulary size: 50,000
+- Context length: 128 tokens
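The specifications above can be captured in a small configuration object. This is a hypothetical sketch: the class name `HindiEmbeddingConfig` and its field names are assumptions, not part of the released code. The only derived quantity is the per-head dimension, which, under the usual multi-head attention convention, is the hidden dimension divided by the number of attention heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HindiEmbeddingConfig:
    """Hypothetical container for the specifications listed in this card."""
    embedding_dim: int = 768
    hidden_dim: int = 768
    num_layers: int = 12
    num_heads: int = 12
    vocab_size: int = 50_000
    max_length: int = 128  # context length in tokens

    def __post_init__(self):
        # Multi-head attention splits the hidden dimension evenly across heads.
        if self.hidden_dim % self.num_heads != 0:
            raise ValueError("hidden_dim must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        return self.hidden_dim // self.num_heads


config = HindiEmbeddingConfig()
print(f"Per-head dimension: {config.head_dim}")
```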
+## Applications
+- Semantic search and information retrieval
+- Text clustering and categorization
+- Recommendation systems
+- Question answering
+- Document similarity comparison
+- Content-based filtering
+- RAG systems for Hindi language content
+## License
+This model is released under the MIT License.
+## Citation
+If you use this model in your research or application, please cite us:
+```bibtex
+@misc{DeepMostInnovations2025hindi,
+  author = {DeepMost Innovations},
+  title = {Hindi Sentence Embeddings Model},
+  year = {2025},
+  publisher = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/DeepMostInnovations/hindi-embedding-foundational-model}}
+}
+```