# Norwegian LLM and Embedding Models Research
## Open-Source LLMs with Norwegian Language Support
### 1. NorMistral-7b-scratch
- **Description**: A large Norwegian language model pretrained from scratch on 260 billion subword tokens (six repetitions of open Norwegian texts)
- **Architecture**: Based on the Mistral architecture, with 7 billion parameters
- **Context Length**: 2k tokens
- **Performance**:
  - Perplexity on the NCC validation set: 7.43
  - Good performance on reading comprehension, sentiment analysis, and machine translation tasks
- **License**: Apache-2.0
- **Hugging Face**: https://huggingface.co/norallm/normistral-7b-scratch
- **Notes**: Part of the NORA.LLM family developed by the Language Technology Group at the University of Oslo (a minimal loading sketch follows below)
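The NORA.LLM checkpoints are published as standard Hugging Face models, so they should load through the ordinary `transformers` API. Below is a minimal generation sketch; the dtype, `device_map` setting, and prompt are illustrative assumptions, not taken from the model card:

```python
# Minimal generation sketch for normistral-7b-scratch.
# Assumptions: transformers, torch, and accelerate are installed,
# and a GPU with bfloat16 support is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "norallm/normistral-7b-scratch"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption; use float32 on CPU
    device_map="auto",           # requires the accelerate package
)

prompt = "Oslo er hovedstaden i"  # "Oslo is the capital of"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```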
### 2. Viking 7B
- **Description**: The first multilingual large language model covering all the Nordic languages, including Norwegian
- **Architecture**: Similar to Llama 2, with flash attention, rotary embeddings, and grouped-query attention
- **Context Length**: 4k tokens
- **Performance**: The developers report best-in-class performance across the Nordic languages without compromising English performance
- **License**: Apache 2.0
- **Notes**:
  - Developed by Silo AI and the University of Turku's research group TurkuNLP
  - Also available in larger sizes (13B and 33B parameters)
  - Trained on 2 trillion tokens covering Danish, English, Finnish, Icelandic, Norwegian, Swedish, and programming languages
### 3. NorskGPT
- **Description**: A family of Norwegian large language models built for Norwegian society
- **Versions**:
  - NorskGPT-Mistral: 7B dense transformer with an 8K context window, based on Mistral 7B
  - NorskGPT-LLAMA2: 7B and 13B parameter models with a 4K context length, based on Llama 2
- **License**: CC BY-NC-SA 4.0 (non-commercial)
- **Website**: https://www.norskgpt.com/norskgpt-llm
## Embedding Models for Norwegian
### 1. NbAiLab/nb-sbert-base
- **Description**: A SentenceTransformers model trained on a machine-translated version of the MNLI dataset
- **Architecture**: Based on nb-bert-base
- **Vector Dimensions**: 768
- **Performance**:
  - Cosine similarity correlation: Pearson 0.8275, Spearman 0.8245
- **License**: Apache-2.0
- **Hugging Face**: https://huggingface.co/NbAiLab/nb-sbert-base
- **Use Cases**:
  - Sentence similarity
  - Semantic search
  - Few-shot classification (with SetFit)
  - Keyword extraction (with KeyBERT)
  - Topic modeling (with BERTopic)
- **Notes**: Works well with both Norwegian and English, making it well suited to bilingual applications (see the usage sketch below)
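Because nb-sbert-base is a standard SentenceTransformers checkpoint, a similarity check can be sketched as follows (the example sentences are illustrative):

```python
# Minimal sentence-similarity sketch with NbAiLab/nb-sbert-base.
# Assumes the sentence-transformers package is installed.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("NbAiLab/nb-sbert-base")

sentences = [
    "Dette er en norsk setning.",   # "This is a Norwegian sentence."
    "This is an English sentence.",
]
embeddings = model.encode(sentences)  # shape: (2, 768)

# Cosine similarity between the two 768-dimensional vectors.
score = util.cos_sim(embeddings[0], embeddings[1])
print(f"Cosine similarity: {score.item():.4f}")
```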
### 2. FFI/SimCSE-NB-BERT-large
- **Description**: A Norwegian sentence embedding model trained using the SimCSE methodology
- **Hugging Face**: https://huggingface.co/FFI/SimCSE-NB-BERT-large
## Vector Database Options for Hugging Face RAG Integration
### 1. Milvus
- **Integration**: Well-documented integration with Hugging Face for RAG pipelines (see the sketch below)
- **Reference**: https://huggingface.co/learn/cookbook/en/rag_with_hf_and_milvus
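A hedged sketch of how such a pipeline might look with the `pymilvus` client and nb-sbert-base as the embedder; the collection name, documents, and use of the file-backed Milvus Lite mode are assumptions for illustration (the cookbook linked above is the authoritative walkthrough):

```python
# Illustrative Milvus retrieval sketch (assumes pymilvus >= 2.4 with
# Milvus Lite support and sentence-transformers installed).
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("NbAiLab/nb-sbert-base")
client = MilvusClient("rag_demo.db")  # Milvus Lite: local, file-backed
client.create_collection(collection_name="norwegian_docs", dimension=768)

docs = ["Oslo er hovedstaden i Norge.", "Bergen ligger på Vestlandet."]
client.insert(
    collection_name="norwegian_docs",
    data=[
        {"id": i, "vector": encoder.encode(doc).tolist(), "text": doc}
        for i, doc in enumerate(docs)
    ],
)

results = client.search(
    collection_name="norwegian_docs",
    data=[encoder.encode("Hva er hovedstaden i Norge?").tolist()],
    limit=1,
    output_fields=["text"],
)
print(results[0][0]["entity"]["text"])  # expected: the Oslo sentence
```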
### 2. MongoDB
- **Integration**: MongoDB Atlas Vector Search can store embeddings and serve retrieval for Hugging Face-based RAG systems
- **Reference**: https://huggingface.co/learn/cookbook/en/rag_with_hugging_face_gemma_mongodb
### 3. MyScale
- **Integration**: Supports building RAG applications with Hugging Face embedding models
- **Reference**: https://medium.com/@myscale/building-a-rag-application-in-10-min-with-claude-3-and-hugging-face-10caea4ea293
### 4. FAISS (Facebook AI Similarity Search)
- **Integration**: A lightweight vector similarity search library (not a full database) that works well with Hugging Face embeddings
- **Notes**: Can be used with `autofaiss` for quick experimentation; a minimal indexing sketch follows below
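A minimal FAISS sketch using nb-sbert-base embeddings; normalizing the vectors and using an inner-product index makes the scores cosine similarities (the documents and query are illustrative):

```python
# Minimal FAISS flat-index sketch (assumes faiss-cpu, numpy, and
# sentence-transformers are installed).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("NbAiLab/nb-sbert-base")
docs = ["Oslo er hovedstaden i Norge.", "Bergen er kjent for regn."]

# Normalized vectors + inner-product index == cosine similarity search.
embeddings = encoder.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))

query = encoder.encode(["Hva er hovedstaden i Norge?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), 1)
print(docs[ids[0][0]], scores[0][0])
```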
## Hugging Face RAG Implementation Options
1. **Transformers Library**: Provides access to pre-trained models
2. **Sentence Transformers**: For text embeddings
3. **Datasets**: For managing and processing data
4. **LangChain Integration**: For advanced RAG pipelines
5. **Spaces**: For deploying and sharing the application
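Tying these pieces together, a bare-bones RAG loop could look like the sketch below: retrieve passages with one of the vector stores above, then prompt a Norwegian LLM. The prompt template and model choice are assumptions; note that NorMistral-7b-scratch is a base (not instruction-tuned) model, so an instruction-tuned checkpoint may follow the prompt more reliably.

```python
# Bare-bones RAG generation step (assumes transformers is installed;
# `passages` would come from a retrieval step like the FAISS sketch above).
from transformers import pipeline

def build_prompt(question: str, passages: list[str]) -> str:
    # Norwegian prompt: "Use the context below to answer the question."
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Bruk konteksten under til å svare på spørsmålet.\n"
        f"Kontekst:\n{context}\n\n"
        f"Spørsmål: {question}\nSvar:"
    )

# Loading a 7B model is heavy; pass device=0 (or a device_map) for a GPU.
generator = pipeline("text-generation", model="norallm/normistral-7b-scratch")

passages = ["Oslo er hovedstaden i Norge."]  # stand-in retrieval output
prompt = build_prompt("Hva er hovedstaden i Norge?", passages)
print(generator(prompt, max_new_tokens=50, do_sample=False)[0]["generated_text"])
```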