HyGPT-10b-it
HyGPT-10b-it is an instruction-tuned version of HyGPT-10b, the first Armenian large language model that was pretrained on a corpus of Armenian text data. This model has been fine-tuned on a diverse instruction dataset to enhance its ability to follow instructions, engage in multi-turn conversations, and perform various language tasks in Armenian, Russian, and English.
Model Details
Model Description
HyGPT-10b-it is a decoder-only language model built on the HyGPT-10b base model, which was pretrained on 10B tokens of Armenian text; the instruction-tuned variant was then produced by supervised fine-tuning (SFT) on a diverse dataset of 50,000 instruction samples.
- Developed by: Gen2B & NCCAIT
- Model type: Instruction-tuned decoder-only language model
- Language(s) (NLP): Armenian, English, Russian
- Technical Report: link
- License: HyGPT Permissive Use License
Uses
First, install the Transformers library with:
```bash
pip install -U transformers
```
Then, run this example:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

model_path = "Gen2B/HyGPT-10b-it"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Example of a single-turn conversation ("Why is grass green?")
chat = [
    {"role": "user", "content": "Ինչու է խոտը Կանաչ:"}
]

# Example of a multi-turn conversation
# ("Hi, how are you?" / "Hi, I'm fine. How can I help you today?" / "Why is grass green?")
# chat = [
#     {"role": "user", "content": "Բարև, ինչպե՞ս ես:"},
#     {"role": "assistant", "content": "Բարև, ես լավ եմ: Ինչով կարող եմ օգնել քեզ այսօր:"},
#     {"role": "user", "content": "Ինչու է խոտը Կանաչ:"}
# ]

# Render the conversation with the model's chat template and tokenize it.
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")

print("Generating...")
generation_output = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    generation_config=GenerationConfig(
        temperature=0.0001,
        repetition_penalty=1.1,
        do_sample=True,
    ),
    return_dict_in_generate=True,
    output_scores=True,
    max_new_tokens=1024,
)

for s in generation_output.sequences:
    print(tokenizer.decode(s))

# Sample output (Armenian; it explains that grass reflects the longer, green wavelengths of sunlight):
# Խոտի մեջ կան պիգմենտներ, որոնք կլանում են լույսի միայն կարճ ալիքները և անդրադարձնում երկար ալիքները։ Դրանք նաև բաց են թողնում ուլտրամանուշակագույն և ինֆրակարմիր ալիքները։ Մարդու աչքերը զգայուն չեն այս ալիքների նկատմամբ, ուստի դրանք տեսանելի չեն։ Այսպիսով, երբ արևի լույսը հարվածում է խոտին, այն անդրադարձնում է երկար ալիքները՝ առաջացնելով կանաչ գույնը, որը մենք տեսնում ենք:
```
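The generation config above enables sampling with a near-zero temperature, which is effectively greedy decoding. If fully deterministic output is preferred, sampling can simply be disabled; the following is a minimal variant of the call above, not part of the original example:

```python
# Deterministic (greedy) alternative to the near-zero-temperature sampling above.
generation_output = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    generation_config=GenerationConfig(do_sample=False, repetition_penalty=1.1),
    return_dict_in_generate=True,
    max_new_tokens=1024,
)
```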
Direct Use
HyGPT-10b-it can be used directly for:
- Multi-turn conversations in Armenian
- Rephrasing and paraphrasing Armenian text
- Question answering in Armenian
- Text summarization
- Translation between Armenian, Russian, and English
- Mathematical problem solving
- General knowledge queries
- Educational content assistance
Bias, Risks, and Limitations
- The model may reflect biases present in both the pretraining and instruction-tuning datasets
- Accuracy may vary across different Armenian dialects and regional variations
- The model may not have up-to-date knowledge beyond its training data
- Like all language models, it may occasionally generate incorrect or nonsensical responses
- The model's understanding of specialized Armenian terminology may be limited in certain domains
- Performance on complex reasoning tasks may be inconsistent
Training Details
Base Model
The base model (HyGPT-10b) was pretrained on a diverse corpus of Armenian text data comprising approximately 10 billion tokens, including:
- Armenian web content
- Armenian literature and publications
- Armenian news articles
- Armenian Wikipedia
- Other publicly available Armenian text sources
Instruction Tuning Dataset
The model was fine-tuned on a diverse instruction dataset consisting of 50,000 samples with the following characteristics:
Dataset Composition:
- Single-turn instruction-response pairs
- Multi-turn conversations (dialogues with multiple exchanges); a sample layout for both kinds is sketched after this list
- Approximately 50% synthetic data generated with Gemini 2.0 Flash
Task Types:
- Summarization tasks
- Paraphrasing exercises
- Translation between Armenian, Russian, and English
- Everyday conversational dialogues
- Wikipedia-based knowledge questions
- Mathematical and educational problems
- General knowledge queries
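For illustration only, the two kinds of samples might be laid out as follows; the field names and contents are assumptions, not the published dataset schema:

```python
# Hypothetical layout of instruction-tuning samples (field names are illustrative).
single_turn = {
    "instruction": "Թարգմանիր ռուսերեն. «Բարի երեկո»",  # "Translate to Russian: 'Good evening'"
    "response": "Добрый вечер",
}

multi_turn = {
    "turns": [
        {"role": "user", "content": "Բարև, ինչպե՞ս ես:"},  # "Hi, how are you?"
        {"role": "assistant", "content": "Բարև, ես լավ եմ: Ինչով կարող եմ օգնել քեզ այսօր:"},
        {"role": "user", "content": "Ինչու է խոտը Կանաչ:"},  # "Why is grass green?"
    ],
}
```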
Preprocessing
The instruction tuning data underwent several preprocessing steps:
- Formatting into consistent instruction-response pairs
- Translation of some samples into Armenian
- Quality filtering
- Conversion to chat format with appropriate role assignments (sketched after this list)
- Tokenization using the base model's tokenizer
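As a rough sketch of the last two steps (chat-format conversion and tokenization), a single-turn pair could be processed as follows; the sample and field names are hypothetical, and this is not the actual pipeline code:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Gen2B/HyGPT-10b-it")

# A raw (hypothetical) instruction-response pair.
sample = {"instruction": "Թարգմանիր ռուսերեն. «Բարի երեկո»", "response": "Добрый вечер"}

# Convert it to chat format with explicit role assignments...
chat = [
    {"role": "user", "content": sample["instruction"]},
    {"role": "assistant", "content": sample["response"]},
]

# ...then render it with the model's chat template and tokenize.
text = tokenizer.apply_chat_template(chat, tokenize=False)
token_ids = tokenizer(text).input_ids
```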
Training Procedure
The model was fine-tuned from the HyGPT-10b base model using supervised fine-tuning (SFT) techniques. The training focused on teaching the model to:
- Follow instructions accurately
- Maintain context across multi-turn conversations (the model performs best when supporting context is placed after the question, as sketched below)
- Generate helpful, accurate, and contextually appropriate responses
- Handle a variety of task types including translation, summarization, and question answering
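To illustrate the context placement noted above, a context-dependent question could be formatted with the supporting context after the question; the example text is illustrative only:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Gen2B/HyGPT-10b-it")

question = "Ե՞րբ է հիմնադրվել Երևանը:"  # "When was Yerevan founded?"
context = "Երևանը հիմնադրվել է մ.թ.ա. 782 թվականին՝ Էրեբունի ամրոցի կառուցմամբ:"  # "Yerevan was founded in 782 BC with the construction of the Erebuni fortress."

# Supporting context is placed after the question.
chat = [{"role": "user", "content": f"{question}\n\n{context}"}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
```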
Benchmarks
The model was evaluated on several standard benchmarks that were translated into Armenian to accurately assess its performance in the target language. The benchmarks include:
- Flores: Tests the model's ability to translate text between Armenian, Russian, and English languages
- ARC: A multiple-choice question benchmark that evaluates reasoning capabilities
- Truthful QA: Another multiple-choice benchmark that assesses the model's ability to give truthful answers (an illustrative scoring method for the multiple-choice benchmarks is sketched after this list)
- GSM8K: Evaluates the model's mathematical reasoning skills with school-level math problems
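The evaluation harness itself is not included here; as a rough illustration of how the multiple-choice benchmarks (ARC, Truthful QA) can be scored, one common approach is to pick the answer option to which the model assigns the highest log-likelihood:

```python
import torch

def option_logprob(model, tokenizer, question: str, option: str) -> float:
    """Log-likelihood of an answer option conditioned on the question (illustrative)."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    token_lp = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Sum only over the tokens that belong to the answer option.
    return token_lp[:, prompt_ids.shape[1] - 1 :].sum().item()

# The predicted answer is the option with the highest score, e.g.:
# prediction = max(options, key=lambda o: option_logprob(model, tokenizer, question, o))
```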
The table below reports the accuracy of several models on these four benchmarks. The results demonstrate significant improvements over the base model across these tasks:
| Benchmark | Gen2B/HyGPT-10b-it | google/gemma-3-12b-it | mistralai/Mistral-Small-3.1-24B-Instruct-2503 | google/gemma-2-9b-it | mistralai/Mistral-Nemo-Instruct-2407 | meta-llama/Llama-3.1-8B-Instruct |
|---|---|---|---|---|---|---|
| Flores | 79.33 | 80.59 | 80.62 | 78.61 | 79.1 | 77.67 |
| ARC | 76.1 | 79.42 | 81.76 | 72.54 | 73.2 | 58.91 |
| Truthful QA | 72.83 | 65.52 | 67.98 | 67.49 | 39.9 | 39.41 |
| GSM8K | 68.0 | 65.8 | 41.07 | 38.0 | 44.19 | 17.22 |
| Average | 74.06 | 72.83 | 67.86 | 64.16 | 59.1 | 48.3 |
Results
The instruction-tuned model demonstrates significantly improved capabilities in following instructions and engaging in conversations compared to the base model. It shows enhanced abilities in:
- Understanding and responding to complex instructions
- Maintaining context across multi-turn dialogues
- Generating more natural and helpful responses
- Performing specific tasks like translation and summarization
Summary
HyGPT-10b-it builds upon the strong foundation of HyGPT-10b to provide a more interactive and instruction-following Armenian language model. It is particularly well-suited for conversational applications, educational tools, and multilingual assistance systems that require Armenian language support.
License and Terms of Use
This model is based on Gemma and is distributed according to the Gemma Terms of Use.
Notice: Gemma is provided under and subject to the Gemma Terms of Use found at ai.google.dev/gemma/terms.
Modifications Notice
This model is a modified version of the original Gemma-2-9b model. The modifications include:
- Further pretraining on 10 billion tokens of Armenian text data
- Decoupling of the embedding and LM head layers to allow independent training of the output layer (a sketch follows this list)
- Instruction tuning (SFT) on a dataset of 50,000 instruction samples
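As a rough sketch of what the embedding/LM-head decoupling might look like with transformers (assuming the base checkpoint ships with tied weights, as Gemma-2 does; this is not the actual training code):

```python
import torch
from transformers import AutoModelForCausalLM

# Load a Gemma-2-style model whose input embeddings and LM head share weights.
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b", torch_dtype=torch.float16)

# Untie them: give the LM head its own copy of the embedding matrix so the
# output projection can be trained independently of the input embeddings.
model.config.tie_word_embeddings = False
model.lm_head.weight = torch.nn.Parameter(
    model.get_input_embeddings().weight.detach().clone()
)
```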
Use Restrictions
According to the Gemma Terms of Use, the model should not be used:
- For purposes outlined in the Gemma Prohibited Use Policy
- In violation of applicable laws and regulations
Disclaimer of Warranty
UNLESS REQUIRED BY APPLICABLE LAW, THE GEMMA SERVICES, AND OUTPUTS, ARE PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING ANY WARRANTIES OR CONDITIONS OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING, REPRODUCING, MODIFYING, PERFORMING, DISPLAYING OR DISTRIBUTING ANY OF THE GEMMA SERVICES OR OUTPUTS AND ASSUME ANY AND ALL RISKS ASSOCIATED WITH YOUR USE OR DISTRIBUTION OF ANY OF THE GEMMA SERVICES OR OUTPUTS AND YOUR EXERCISE OF RIGHTS AND PERMISSIONS UNDER THIS AGREEMENT.