HyGPT-10b-it
HyGPT-10b-it is an instruction-tuned version of HyGPT-10b, the first Armenian large language model that was pretrained on a corpus of Armenian text data. This model has been fine-tuned on a diverse instruction dataset to enhance its ability to follow instructions, engage in multi-turn conversations, and perform various language tasks in Armenian, Russian, and English.
Model Details
Model Description
HyGPT-10b-it is a decoder-only language model built on the HyGPT-10b base model, which was pretrained on 10B tokens of Armenian text; the instruction-tuned variant was then produced by supervised fine-tuning (SFT) on a diverse dataset of 50,000 instruction samples.
- Developed by: Gen2B & NCCAIT
- Model type: Instruction-tuned decoder-only language model
- Language(s) (NLP): Armenian, English, Russian
- Technical Report: link
- License: HyGPT Permissive Use License
Uses
First, install the Transformers library with:
```bash
pip install -U transformers
```
Then, run this example:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

model_path = "Gen2B/HyGPT-10b-it"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Example of a single-turn conversation ("Why is grass green?")
chat = [
    {"role": "user", "content": "Ինչու է խոտը Կանաչ:"}
]

# Example of a multi-turn conversation
# ("Hi, how are you?" / "Hi, I'm fine. How can I help you today?" / "Why is grass green?")
# chat = [
#     {"role": "user", "content": "Բարև, ինչպե՞ս ես:"},
#     {"role": "assistant", "content": "Բարև, ես լավ եմ: Ինչով կարող եմ օգնել քեզ այսօր:"},
#     {"role": "user", "content": "Ինչու է խոտը Կանաչ:"}
# ]

# Render the conversation with the model's chat template and tokenize it.
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")

print("Generating...")
generation_output = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    generation_config=GenerationConfig(
        temperature=0.0001,
        repetition_penalty=1.1,
        do_sample=True,
    ),
    return_dict_in_generate=True,
    output_scores=True,
    max_new_tokens=1024,
)

for s in generation_output.sequences:
    print(tokenizer.decode(s))

# Sample output (Armenian; it explains that grass reflects the longer, green wavelengths of sunlight):
# Խոտի մեջ կան պիգմենտներ, որոնք կլանում են լույսի միայն կարճ ալիքները և անդրադարձնում երկար ալիքները։ Դրանք նաև բաց են թողնում ուլտրամանուշակագույն և ինֆրակարմիր ալիքները։ Մարդու աչքերը զգայուն չեն այս ալիքների նկատմամբ, ուստի դրանք տեսանելի չեն։ Այսպիսով, երբ արևի լույսը հարվածում է խոտին, այն անդրադարձնում է երկար ալիքները՝ առաջացնելով կանաչ գույնը, որը մենք տեսնում ենք:
```
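The generation config above enables sampling with a near-zero temperature, which is effectively greedy decoding. If fully deterministic output is preferred, sampling can simply be disabled; the following is a minimal variant of the call above, not part of the original example:

```python
# Deterministic (greedy) alternative to the near-zero-temperature sampling above.
generation_output = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    generation_config=GenerationConfig(do_sample=False, repetition_penalty=1.1),
    return_dict_in_generate=True,
    max_new_tokens=1024,
)
```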
Direct Use
HyGPT-10b-it can be used directly for:
- Multi-turn conversations in Armenian
- Rephrasing and paraphrasing Armenian text
- Question answering in Armenian
- Text summarization
- Translation between Armenian, Russian, and English
- Mathematical problem solving
- General knowledge queries
- Educational content assistance
Bias, Risks, and Limitations
- The model may reflect biases present in both the pretraining and instruction-tuning datasets
- Accuracy may vary across different Armenian dialects and regional variations
- The model may not have up-to-date knowledge beyond its training data
- Like all language models, it may occasionally generate incorrect or nonsensical responses
- The model's understanding of specialized Armenian terminology may be limited in certain domains
- Performance on complex reasoning tasks may be inconsistent
Training Details
Base Model
The base model (HyGPT-10b) was pretrained on a diverse corpus of Armenian text data comprising approximately 10 billion tokens, including:
- Armenian web content
- Armenian literature and publications
- Armenian news articles
- Armenian Wikipedia
- Other publicly available Armenian text sources
Instruction Tuning Dataset
The model was fine-tuned on a diverse instruction dataset consisting of 50,000 samples with the following characteristics:
Dataset Composition:
- Single-turn instruction-response pairs
- Multi-turn conversations (dialogues with multiple exchanges); a sample layout for both kinds is sketched after this list
- Approximately 50% synthetic data generated with Gemini 2.0 Flash
Task Types:
- Summarization tasks
- Paraphrasing exercises
- Translation between Armenian, Russian, and English
- Everyday conversational dialogues
- Wikipedia-based knowledge questions
- Mathematical and educational problems
- General knowledge queries
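For illustration only, the two kinds of samples might be laid out as follows; the field names and contents are assumptions, not the published dataset schema:

```python
# Hypothetical layout of instruction-tuning samples (field names are illustrative).
single_turn = {
    "instruction": "Թարգմանիր ռուսերեն. «Բարի երեկո»",  # "Translate to Russian: 'Good evening'"
    "response": "Добрый вечер",
}

multi_turn = {
    "turns": [
        {"role": "user", "content": "Բարև, ինչպե՞ս ես:"},  # "Hi, how are you?"
        {"role": "assistant", "content": "Բարև, ես լավ եմ: Ինչով կարող եմ օգնել քեզ այսօր:"},
        {"role": "user", "content": "Ինչու է խոտը Կանաչ:"},  # "Why is grass green?"
    ],
}
```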
Preprocessing
The instruction tuning data underwent several preprocessing steps:
- Formatting into consistent instruction-response pairs
- Translation of some samples into Armenian
- Quality filtering
- Conversion to chat format with appropriate role assignments (sketched after this list)
- Tokenization using the base model's tokenizer
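As a rough sketch of the last two steps (chat-format conversion and tokenization), a single-turn pair could be processed as follows; the sample and field names are hypothetical, and this is not the actual pipeline code:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Gen2B/HyGPT-10b-it")

# A raw (hypothetical) instruction-response pair.
sample = {"instruction": "Թարգմանիր ռուսերեն. «Բարի երեկո»", "response": "Добрый вечер"}

# Convert it to chat format with explicit role assignments...
chat = [
    {"role": "user", "content": sample["instruction"]},
    {"role": "assistant", "content": sample["response"]},
]

# ...then render it with the model's chat template and tokenize.
text = tokenizer.apply_chat_template(chat, tokenize=False)
token_ids = tokenizer(text).input_ids
```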
Training Procedure
The model was fine-tuned from the HyGPT-10b base model using supervised fine-tuning (SFT) techniques. The training focused on teaching the model to:
- Follow instructions accurately
- Maintain context across multi-turn conversations (the model performs best when supporting context is placed after the question, as sketched below)
- Generate helpful, accurate, and contextually appropriate responses
- Handle a variety of task types including translation, summarization, and question answering
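To illustrate the context placement noted above, a context-dependent question could be formatted with the supporting context after the question; the example text is illustrative only:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Gen2B/HyGPT-10b-it")

question = "Ե՞րբ է հիմնադրվել Երևանը:"  # "When was Yerevan founded?"
context = "Երևանը հիմնադրվել է մ.թ.ա. 782 թվականին՝ Էրեբունի ամրոցի կառուցմամբ:"  # "Yerevan was founded in 782 BC with the construction of the Erebuni fortress."

# Supporting context is placed after the question.
chat = [{"role": "user", "content": f"{question}\n\n{context}"}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
```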
Benchmarks
The model was evaluated on several standard benchmarks that were translated into Armenian to accurately assess its performance in the target language. The benchmarks include:
- Flores: Tests the model's ability to translate text between Armenian, Russian, and English languages
- ARC: A multiple-choice question benchmark that evaluates reasoning capabilities
- Truthful QA: Another multiple-choice benchmark that assesses the model's ability to give truthful answers (an illustrative scoring method for the multiple-choice benchmarks is sketched after this list)
- GSM8K: Evaluates the model's mathematical reasoning skills with school-level math problems
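The evaluation harness itself is not included here; as a rough illustration of how the multiple-choice benchmarks (ARC, Truthful QA) can be scored, one common approach is to pick the answer option to which the model assigns the highest log-likelihood:

```python
import torch

def option_logprob(model, tokenizer, question: str, option: str) -> float:
    """Log-likelihood of an answer option conditioned on the question (illustrative)."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    token_lp = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Sum only over the tokens that belong to the answer option.
    return token_lp[:, prompt_ids.shape[1] - 1 :].sum().item()

# The predicted answer is the option with the highest score, e.g.:
# prediction = max(options, key=lambda o: option_logprob(model, tokenizer, question, o))
```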
The table below reports the accuracy of several models on these four benchmarks. The results demonstrate significant improvements over the base model across these tasks:
| Benchmark | Gen2B/HyGPT-10b-it | google/gemma-3-12b-it | mistralai/Mistral-Small-3.1-24B-Instruct-2503 | google/gemma-2-9b-it | mistralai/Mistral-Nemo-Instruct-2407 | meta-llama/Llama-3.1-8B-Instruct |
|---|---|---|---|---|---|---|
| Flores | 79.33 | 80.59 | 80.62 | 78.61 | 79.1 | 77.67 |
| ARC | 76.1 | 79.42 | 81.76 | 72.54 | 73.2 | 58.91 |
| Truthful QA | 72.83 | 65.52 | 67.98 | 67.49 | 39.9 | 39.41 |
| GSM8K | 68.0 | 65.8 | 41.07 | 38.0 | 44.19 | 17.22 |
| Average | 74.06 | 72.83 | 67.86 | 64.16 | 59.1 | 48.3 |
Results
The instruction-tuned model demonstrates significantly improved capabilities in following instructions and engaging in conversations compared to the base model. It shows enhanced abilities in:
- Understanding and responding to complex instructions
- Maintaining context across multi-turn dialogues
- Generating more natural and helpful responses
- Performing specific tasks like translation and summarization
Summary
HyGPT-10b-it builds upon the strong foundation of HyGPT-10b to provide a more interactive and instruction-following Armenian language model. It is particularly well-suited for conversational applications, educational tools, and multilingual assistance systems that require Armenian language support.
License and Terms of Use
This model is based on Gemma and is distributed according to the Gemma Terms of Use.
Notice: Gemma is provided under and subject to the Gemma Terms of Use found at ai.google.dev/gemma/terms.
Modifications Notice
This model is a modified version of the original Gemma-2-9b model. The modifications include:
- Further pretraining on 10 billion tokens of Armenian text data
- Decoupling of the embedding and LM head layers to allow independent training of the output layer (a sketch follows this list)
- Instruction tuning (SFT) on a dataset of 50,000 instruction samples
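As a rough sketch of what the embedding/LM-head decoupling might look like with transformers (assuming the base checkpoint ships with tied weights, as Gemma-2 does; this is not the actual training code):

```python
import torch
from transformers import AutoModelForCausalLM

# Load a Gemma-2-style model whose input embeddings and LM head share weights.
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b", torch_dtype=torch.float16)

# Untie them: give the LM head its own copy of the embedding matrix so the
# output projection can be trained independently of the input embeddings.
model.config.tie_word_embeddings = False
model.lm_head.weight = torch.nn.Parameter(
    model.get_input_embeddings().weight.detach().clone()
)
```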
Use Restrictions
According to the Gemma Terms of Use, the model should not be used:
- For purposes outlined in the Gemma Prohibited Use Policy
- In violation of applicable laws and regulations
Disclaimer of Warranty
UNLESS REQUIRED BY APPLICABLE LAW, THE GEMMA SERVICES, AND OUTPUTS, ARE PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING ANY WARRANTIES OR CONDITIONS OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING, REPRODUCING, MODIFYING, PERFORMING, DISPLAYING OR DISTRIBUTING ANY OF THE GEMMA SERVICES OR OUTPUTS AND ASSUME ANY AND ALL RISKS ASSOCIATED WITH YOUR USE OR DISTRIBUTION OF ANY OF THE GEMMA SERVICES OR OUTPUTS AND YOUR EXERCISE OF RIGHTS AND PERMISSIONS UNDER THIS AGREEMENT.