# Norwegian LLM and Embedding Models Research

## Open-Source LLMs with Norwegian Language Support

### 1. NorMistral-7b-scratch
- **Description**: A large Norwegian language model pretrained from scratch on 260 billion subword tokens (six repetitions of open Norwegian texts)
- **Architecture**: Based on the Mistral architecture with 7 billion parameters
- **Context Length**: 2k tokens
- **Performance**:
  - Perplexity on the NCC validation set: 7.43
  - Good performance on reading comprehension, sentiment analysis, and machine translation tasks
- **License**: Apache-2.0
- **Hugging Face**: https://huggingface.co/norallm/normistral-7b-scratch
- **Notes**: Part of the NORA.LLM family developed by the Language Technology Group at the University of Oslo

### 2. Viking 7B
- **Description**: The first multilingual large language model covering all Nordic languages, including Norwegian
- **Architecture**: Similar to Llama 2, with flash attention, rotary embeddings, and grouped-query attention
- **Context Length**: 4k tokens
- **Performance**: Reported best-in-class performance across the Nordic languages without compromising English performance
- **License**: Apache 2.0
- **Notes**:
  - Developed by Silo AI and the University of Turku's research group TurkuNLP
  - Also available in larger sizes (13B and 33B parameters)
  - Trained on 2 trillion tokens covering Danish, English, Finnish, Icelandic, Norwegian, Swedish, and programming languages

### 3. NorskGPT
- **Description**: A Norwegian large language model made for Norwegian society
- **Versions**:
  - NorskGPT-Mistral: 7B dense transformer with an 8K context window, based on Mistral 7B
  - NorskGPT-LLAMA2: 7B and 13B parameter models with a 4K context length, based on Llama 2
- **License**: cc-by-nc-sa-4.0 (non-commercial)
- **Website**: https://www.norskgpt.com/norskgpt-llm

## Embedding Models for Norwegian

### 1. NbAiLab/nb-sbert-base
- **Description**: A SentenceTransformers model trained on a machine-translated version of the MNLI dataset
- **Architecture**: Based on nb-bert-base
- **Vector Dimensions**: 768
- **Performance**:
  - Cosine similarity: Pearson 0.8275, Spearman 0.8245
- **License**: apache-2.0
- **Hugging Face**: https://huggingface.co/NbAiLab/nb-sbert-base
- **Use Cases**:
  - Sentence similarity
  - Semantic search
  - Few-shot classification (with SetFit)
  - Keyword extraction (with KeyBERT)
  - Topic modeling (with BERTopic)
- **Notes**: Works well with both Norwegian and English, making it well suited for bilingual applications (see the usage sketch below)

### 2. FFI/SimCSE-NB-BERT-large
- **Description**: A Norwegian sentence embedding model trained using the SimCSE methodology
- **Hugging Face**: https://huggingface.co/FFI/SimCSE-NB-BERT-large
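The nb-sbert-base model can be used directly through the `sentence-transformers` library. Below is a minimal sketch, assuming `sentence-transformers` is installed; the example sentences are illustrative and chosen to show the bilingual Norwegian/English behavior noted above.

```python
from sentence_transformers import SentenceTransformer, util

# Load the Norwegian sentence embedding model (768-dimensional vectors).
model = SentenceTransformer("NbAiLab/nb-sbert-base")

sentences = [
    "Dette er en eksempelsetning",  # Norwegian: "This is an example sentence"
    "This is an example sentence",
]

# Encode both sentences into dense vectors.
embeddings = model.encode(sentences)

# Cosine similarity between the Norwegian and the English sentence.
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Cosine similarity: {similarity.item():.4f}")
```

The same embeddings can be fed into SetFit, KeyBERT, or BERTopic for the other use cases listed above.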
## Vector Database Options for Hugging Face RAG Integration

### 1. Milvus
- **Integration**: Well-documented integration with Hugging Face for RAG pipelines
- **Reference**: https://huggingface.co/learn/cookbook/en/rag_with_hf_and_milvus

### 2. MongoDB
- **Integration**: Can be used with Hugging Face models for RAG systems
- **Reference**: https://huggingface.co/learn/cookbook/en/rag_with_hugging_face_gemma_mongodb

### 3. MyScale
- **Integration**: Supports building RAG applications with Hugging Face embedding models
- **Reference**: https://medium.com/@myscale/building-a-rag-application-in-10-min-with-claude-3-and-hugging-face-10caea4ea293

### 4. FAISS (Facebook AI Similarity Search)
- **Integration**: Lightweight vector similarity search library that works well with Hugging Face
- **Notes**: Can be used with `autofaiss` for quick experimentation

## Hugging Face RAG Implementation Options

1. **Transformers Library**: Provides access to pre-trained models
2. **Sentence Transformers**: For text embeddings
3. **Datasets**: For managing and processing data
4. **LangChain Integration**: For advanced RAG pipelines
5. **Spaces**: For deploying and sharing the application
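To illustrate how these pieces fit together, here is a minimal retrieval sketch, assuming the `sentence-transformers` and `faiss-cpu` packages are installed. The document snippets and query are illustrative, and the generation step with a Norwegian LLM is only indicated in a comment.

```python
import faiss
from sentence_transformers import SentenceTransformer

# Illustrative Norwegian snippets standing in for a real document corpus.
documents = [
    "Oslo er hovedstaden i Norge.",
    "Bergen er kjent for regn og de syv fjell.",
    "Brunost er en norsk spesialitet.",
]

# Embed the corpus with the Norwegian sentence embedding model.
embedder = SentenceTransformer("NbAiLab/nb-sbert-base")
doc_embeddings = embedder.encode(documents, convert_to_numpy=True)

# Normalize so that inner product equals cosine similarity.
faiss.normalize_L2(doc_embeddings)

# Build an in-memory FAISS index over the 768-dimensional vectors.
index = faiss.IndexFlatIP(doc_embeddings.shape[1])
index.add(doc_embeddings)

# Retrieve the passages most relevant to a Norwegian question.
query = "Hva er hovedstaden i Norge?"
query_embedding = embedder.encode([query], convert_to_numpy=True)
faiss.normalize_L2(query_embedding)

scores, indices = index.search(query_embedding, 2)
for score, idx in zip(scores[0], indices[0]):
    print(f"{score:.3f}  {documents[idx]}")

# The retrieved passages would then go into the prompt of a Norwegian LLM
# (e.g. normistral-7b-scratch or NorskGPT) for the generation step of RAG.
```

For larger corpora, the flat index can be swapped for one of FAISS's approximate indexes, and `autofaiss` (noted above) can choose index parameters automatically.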