---
library_name: transformers
tags:
- reasoning
license: apache-2.0
datasets:
- d0rj/gsm8k-ru
language:
- ru
base_model:
- attn-signs/GPTR-8b-base
---

# GPT Reasoner (V1)

A reasoning model adapted for Russian text generation.

**Based on YandexGPT-pretrain -> GPTR-8b-base**

## Model Details

**GRPO-reinforced (RL) version**, trained to elicit general reasoning capabilities and deeper query understanding.
At this iteration, the model can generate coherent, well-conditioned chain-of-thought in Russian.

### Important:

This is the first stage of reinforcement learning, so do not expect the model to solve every mathematical problem.
Training is ongoing; this version is more of a proof-of-concept than a finished mathematical assistant.
Still, the model is stable and can already solve some problems.

### Further development

- GRPO on the Gromov dataset series

### Model Description

- **Developed by:** Reisen Raumberg (Attention Signs team)
- **Language(s) (NLP):** RU/EN
- **SFT from model:** YandexGPT-5-lite-8B-pretrain

Training used the HuggingFace Accelerator.

**GPU hours**: ~24h on an NVIDIA A100

### Training Framework

**GPTR was trained using the MyLLM framework (by Attention Signs):**

--==[MyLLM](https://github.com/Raumberg/myllm)==--

### Model configuration (MyLLM Framework)

```toml
[model]
model_name_or_path = "attn-signs/GPTR-8-base"

[datasets]
dataset = "d0rj/gsm8k-ru"
problem_field = "question"
solution_field = "answer"
dataloader_num_workers = 2
test_size = 0.1
extract_hash = true

[run]
run_name = "rl-gptr-8"
report_to = "wandb"
logging_first_step = true
logging_steps = 1
save_strategy = "steps"
save_steps = 500
save_total_limit = 5
output_dir = "models/attn-signs-gptr-8-grpo"
project_name = "rl-gptr"

[training]
num_train_epochs = 1
per_device_train_batch_size = 2
learning_rate = 0.00001
bf16 = true
seed = 42
use_peft = true

[grpo]
use_vllm = true
num_generations = 2
max_completion_length = 2048
num_iterations = 1      # https://github.com/huggingface/trl/releases/tag/v0.16.0
scale_rewards = false   # should be default var
beta = 0.04             # reference model beta in vllm
epsilon_high = 0.28     # Increasing upper bound epsilon leads to higher entropy during generation, promoting better exploration
preload_rm = false

[lora]
lora_target_modules = [
    "k_proj",
    "v_proj",
    "q_proj",
    "o_proj",
    "gate_proj",
    "up_proj",
    "down_proj",
]
lora_r = 32
lora_alpha = 64

[fusion]
use_liger = false
attn_implementation = "flash_attention_2"

[tokenizer]
eos_token = "</s>"
pad_token = "<unk>"
chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<s>' + message['role'] + '\n' + message['content'] + '</s>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<s>assistant\n' }}{% endif %}"
force_chat_template = true
added_special_tokens = [
    "<think>",
    "</think>"
]
system_prompt = """
[MODE: Reflection]
"""
```
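
For reference, with the `chat_template` above, a system + user conversation rendered with `add_generation_prompt=True` produces a prompt of roughly this shape (the question text below is just a placeholder):

```
<s>system
[MODE: Reflection]</s>
<s>user
<your question></s>
<s>assistant
```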

### Rewards:

- Equation structure reward
- Correctness reward
- Multilingual coherence reward
- Strict Chinese penalty
- Format reward
- Russian purity reward
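
The actual reward implementations live in the MyLLM framework linked above. As a rough illustration only (the function below is a hypothetical sketch, not the MyLLM code), a format reward in the reward-function style used by TRL's `GRPOTrainer` could look like this:

```python
import re

def format_reward(completions, **kwargs):
    """Hypothetical sketch of a format reward: score 1.0 when the completion
    wraps its reasoning in a single <think>...</think> block, else 0.0."""
    pattern = re.compile(r"<think>.+?</think>", re.DOTALL)
    rewards = []
    for completion in completions:
        # Depending on the dataset format, TRL passes completions either as
        # plain strings or as chat-style message lists; handle both here.
        text = completion if isinstance(completion, str) else completion[0]["content"]
        rewards.append(1.0 if pattern.search(text) else 0.0)
    return rewards
```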

### Using the model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = 'attn-signs/GPTR-8-v1'

# Load the model and tokenizer from the Hub
model = AutoModelForCausalLM.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

# A Russian math problem: the equations x**2 + 2019ax + b = 0 and
# x**2 + 2019bx + a = 0 share one common root; what can that root be, given a != b?
user_prompt = '''
У уравнений x**2 + 2019ax + b = 0 и x**2 + 2019bx + a = 0 есть один общий корень. Чему может быть равен этот корень, если известно, что a != b?
'''
system_prompt = "[MODE: Reflection]"

# Build the prompt with the model's chat template
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then drop the prompt tokens from the returned sequences
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=4096
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
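
Note: `<think>` and `</think>` are listed under `added_special_tokens` in the training config. If they are registered as special tokens in the released tokenizer (an assumption, not verified here), `skip_special_tokens=True` will strip them from `response`. To keep the reasoning trace and separate it from the final answer, you can decode without skipping special tokens, as in this sketch:

```python
raw = tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0]
if "</think>" in raw:
    # Split the reasoning block from the final answer (assumes the model
    # closed its <think> block; otherwise the whole output is the answer).
    reasoning, answer = raw.split("</think>", maxsplit=1)
    print("Reasoning:", reasoning.replace("<think>", "").strip())
    print("Answer:", answer.replace("</s>", "").strip())
```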