DianJin-R1-7B
Introduction
We propose DianJin-R1, a novel framework that enhances financial reasoning in LLMs through reasoning-augmented supervision and reinforcement learning. Central to our approach is DianJin-R1-Data, a high-quality dataset constructed from CFLUE, FinQA, and a proprietary compliance corpus (Chinese Compliance Check, CCC), combining diverse financial reasoning scenarios with verified annotations. We adopt a structured training paradigm where models generate reasoning steps and final answers using supervised fine-tuning. To further improve reasoning quality, we use Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that incorporates dual reward signals for output structure and answer accuracy.
We open-source our models, DianJin-R1-7B and DianJin-R1-32B, built on Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct, which are trained in two stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL).

Learning Reasoning with SFT
DianJin-R1-Data is used to fine-tune LLMs to generate a chain-of-thought (CoT) followed by a final answer. Each training instance consists of a question, a reasoning path wrapped in <think>...</think>, and an answer wrapped in <answer>...</answer>. During fine-tuning, the question serves as the model input, while the reasoning and final answer together form the target output, so the model learns to produce coherent reasoning steps along with the correct solution.
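As a minimal sketch of this format, an SFT instance could be assembled as below; the field names and helper function are illustrative assumptions, not the actual DianJin-R1-Data schema.

# Illustrative sketch of an SFT training instance; field names are assumptions,
# not the actual DianJin-R1-Data schema.
def build_sft_example(question: str, reasoning: str, answer: str) -> dict:
    # The question is the model input; the target wraps the reasoning in
    # <think>...</think> and the final answer in <answer>...</answer>.
    target = f"<think>{reasoning}</think><answer>{answer}</answer>"
    return {"input": question, "output": target}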
Enhancing Reasoning with RL
We adopt the Group Relative Policy Optimization (GRPO) algorithm for RL, incorporating two reward mechanisms: a format reward to ensure the generated output adheres to the desired structure, and an accuracy reward to encourage correct answers.
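To make the reward design concrete, below is a minimal sketch of what the two signals could look like. The tag structure follows the <think>/<answer> format described above; the function names, regular expressions, and scoring values are illustrative assumptions, not the actual DianJin-R1 implementation.

import re

# Sketch of the two GRPO reward signals (values and names are assumptions).
FORMAT_PATTERN = re.compile(r"^<think>.+?</think>\s*<answer>.+?</answer>$", re.DOTALL)

def format_reward(completion: str) -> float:
    # 1.0 if the output follows the <think>...</think><answer>...</answer> structure.
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, gold_answer: str) -> float:
    # 1.0 if the answer extracted from <answer>...</answer> matches the reference.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == gold_answer.strip() else 0.0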
Quickstart
The following code snippet shows how to use apply_chat_template to load the tokenizer and model and generate content.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model_name = "DianJin/DianJin-R1-7B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example financial question (in Chinese): "Assume you are a financial industry
# expert and answer the following question. In macroeconomic analysis, which curve
# describes equilibrium in the product market at a given interest rate?
# Please think step by step."
prompt = "假设你是一位金融行业专家,请回答下列问题。\n在宏观分析中,描述在既定利率水平下产品市场达到均衡状态的曲线是什么?\n请一步步思考。"
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# Build the chat-formatted prompt and tokenize it
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the response and strip the prompt tokens from the output
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
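Since the model is trained to emit its reasoning inside <think>...</think> and the final answer inside <answer>...</answer>, the two parts can be separated from the decoded response, for example as in the short sketch below (the parsing logic is an assumption, not an official post-processing utility).

import re

# Split the decoded response into the reasoning trace and the final answer,
# based on the <think>/<answer> format the model is trained to follow.
think_match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
answer_match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
reasoning = think_match.group(1).strip() if think_match else ""
final_answer = answer_match.group(1).strip() if answer_match else response
print(final_answer)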