Model Card for qwen2.5-0.5B-Instruct-Inshort
SFT(model='Qwen2.5-0.5B-Instruct', dataset='Inshorts-english') = 'qwen2.5-0.5B-Instruct-Inshort'
Model Details
Model Description
The model Qwen2.5-0.5B-Instruct was fine-tuned on selected layers (Qwen2DecoderLayer modules) using the Inshorts-english dataset, resulting in the new model: qwen2.5-0.5B-Instruct-Inshort.
NOTE
This model is part of my project exploring pruning a capable teacher model and recovering its performance through distillation (specifically, behavior cloning) and supervised fine-tuning (SFT), focused on an Inshorts-style summarization task.
This model acts as the teacher model.
- Developed by: nis12ram
- Model type: Autoregressive model
- Language(s) (NLP): English
- License: Apache License 2.0
- Fine-tuned from model: Qwen2.5-0.5B-Instruct
How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nis12ram/qwen2.5-0.5B-Instruct-Inshort"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

content = """Veteran leg-spinner Piyush Chawla has retired from all forms of cricket at the age of 36. "Cricket will always live within me," he said in his farewell note. Chawla took 43 wickets for India in three Tests, 25 ODIs and seven T20Is. He was a part of India's T20 World Cup 2007 and ODI World Cup 2011-winning teams."""

text = f'''<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Generate a concise news headline based on the following news content. The headline should clearly and accurately summarize the key point of the article. Avoid exaggeration or misleading phrasing.
News Content: {content}<|im_end|>
<|im_start|>assistant
'''

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128,
    do_sample=False,  # greedy decoding was found to work best
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
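Equivalently, the prompt string above can be produced with the tokenizer's chat template; Qwen2.5's template inserts the same default system message shown above when none is supplied. A minimal sketch (the variable names here are illustrative):

```python
# Build the same prompt via the chat template instead of a hand-written string.
prompt_instruction = (
    "Generate a concise news headline based on the following news content. "
    "The headline should clearly and accurately summarize the key point of the article. "
    "Avoid exaggeration or misleading phrasing.\n"
    f"News Content: {content}"
)
messages = [{"role": "user", "content": prompt_instruction}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```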
Training Details
Training Data
The model was fine-tuned on the Inshorts-english dataset.
Training Procedure
- Only the last 5 Qwen2DecoderLayer modules are trainable; the rest of the model is frozen (see the sketch after this list).
- Supervised fine-tuning (SFT) is used as the training method.
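A minimal sketch of this partial-freezing setup, assuming the standard Qwen2 module layout in transformers (not the author's exact training script):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Freeze every parameter first.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze only the last 5 Qwen2DecoderLayer modules.
for layer in model.model.layers[-5:]:
    for param in layer.parameters():
        param.requires_grad = True
```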
Training Hyperparameters
- Batch Size = 8, Gradient Accumulation = 1
- Warmup Steps = 50
- Epochs = 1.5
- Optimizer = adamw_8bit
- Learning Rate = 5e-5
- LR Scheduler Type = linear
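These settings map directly onto Hugging Face TrainingArguments; a hedged sketch of the assumed mapping (the linked training code is authoritative):

```python
from transformers import TrainingArguments

# Assumed mapping of the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="qwen2.5-0.5B-Instruct-Inshort",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    warmup_steps=50,
    num_train_epochs=1.5,
    optim="adamw_8bit",
    learning_rate=5e-5,
    lr_scheduler_type="linear",
)
```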
Training Code
Evaluation
The initial evaluation began with ROUGE scores; however, this approach was quickly abandoned, as ROUGE fails to capture semantic meaning and contextual understanding, both of which are crucial for evaluating abstractive summarization.
As a result, a custom evaluation pipeline was adopted. This pipeline uses an LLM-as-a-judge to assess the quality of summaries, assigning an accuracy score on a scale from 1 to 5; side-by-side ROUGE scores are also provided for reference.
Check out the Colab Notebook for the code of the custom evaluation pipeline.
LLM-as-a-judge details
- model = Qwen/Qwen2.5-32B-Instruct
- sampling technique = greedy sampling
- prompt =
system_prompt_for_accuracy = '''YOU ARE A HIGHLY RELIABLE NEWS HEADLINE EVALUATION JUDGE, TRAINED TO ASSESS PREDICTED HEADLINES BASED SOLELY ON THEIR ACCURACY AND FAITHFULNESS TO THE ORIGINAL NEWS CONTENT. YOUR PRIMARY OBJECTIVE IS TO ENSURE THAT THE PREDICTED HEADLINES ARE:
1. **NOT MISLEADING OR HALLUCINATED**: The predicted headline must accurately reflect the original news content without adding false information or exaggerating details.
2. **FAITHFUL TO THE ORIGINAL NEWS CONTENT**: The headline should summarize the essence of the news while maintaining neutrality and factual correctness.
### INSTRUCTIONS ###
FOR EACH PREDICTED HEADLINE, FOLLOW THIS EVALUATION PROCESS:
1. **UNDERSTAND THE INPUTS:**
- ORIGINAL_NEWS_CONTENT: The full news article that serves as the source.
- PREDICTED_HEADLINE: The generated headline to be evaluated.
2. **EVALUATE FOR MISREPRESENTATION & HALLUCINATION:**
- CHECK if the predicted headline introduces **any false claims** or **misleading phrases** that are **not supported** by the source.
- RATE on a scale of 1-5:
- (1) **Severely Misleading** – The headline contains major inaccuracies, false claims, or is entirely unrelated to the news content.
- (2) **Largely Inaccurate** – The headline distorts key facts, introduces misleading implications, or exaggerates information.
- (3) **Partially Accurate** – The headline is mostly correct but includes minor distortions or slightly misleading phrasing.
- (4) **Mostly Accurate** – The headline aligns well with the source but may have slight nuances or wording that could be improved.
- (5) **Fully Accurate** – The headline is entirely faithful to the source, correctly summarizing key details with no factual distortions.
### WHAT NOT TO DO ###
- NEVER ACCEPT A HEADLINE THAT IS FACTUALLY INCORRECT OR MISLEADING.
- NEVER IGNORE SUBTLE DIFFERENCES IN MEANING THAT COULD CHANGE THE FACTUAL ACCURACY.
### OUTPUT FORMAT ###
Your evaluation should be structured as follows:
```json
{
"predicted_headline": "...",
"score": "X/5",
"feedback": "..."
}
```'''
user_prompt_for_accuracy = '''News Content: {content}
Predicted Headline: {predicted_headline}
'''
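A hypothetical sketch of how these prompts might be wired up; the actual pipeline lives in the Colab Notebook linked above, and the helper name and JSON parsing below are illustrative assumptions:

```python
# Score one (content, predicted_headline) pair with the judge model.
import json
import re

from transformers import AutoModelForCausalLM, AutoTokenizer

judge_name = "Qwen/Qwen2.5-32B-Instruct"
judge = AutoModelForCausalLM.from_pretrained(judge_name, torch_dtype="auto", device_map="auto")
judge_tokenizer = AutoTokenizer.from_pretrained(judge_name)

def judge_headline(content: str, predicted_headline: str) -> dict:
    messages = [
        {"role": "system", "content": system_prompt_for_accuracy},
        {"role": "user", "content": user_prompt_for_accuracy.format(
            content=content, predicted_headline=predicted_headline)},
    ]
    input_ids = judge_tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(judge.device)
    output_ids = judge.generate(input_ids, max_new_tokens=256, do_sample=False)  # greedy sampling
    reply = judge_tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
    match = re.search(r"\{.*\}", reply, re.DOTALL)  # pull the JSON object out of the reply
    return json.loads(match.group(0)) if match else {}
```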
Results
Accuracy Score [main evaluation criterion]

| Metric | Value |
|---|---|
| Accuracy Score | 3.7 / 5 |
ROUGE Score

| Metric | Score |
|---|---|
| ROUGE-1 | 0.3889 |
| ROUGE-2 | 0.1669 |
| ROUGE-L | 0.3445 |
| ROUGE-Lsum | 0.3442 |
Accuracy-Aware ROUGE Score

| Metric | Score |
|---|---|
| ROUGE-1 | 0.2877 |
| ROUGE-2 | 0.1235 |
| ROUGE-L | 0.2549 |
| ROUGE-Lsum | 0.2547 |
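A minimal sketch of how ROUGE metrics like those above can be computed with the Hugging Face `evaluate` library; `predicted_headlines` and `reference_headlines` are assumed placeholder names, and the Colab Notebook remains the reference implementation:

```python
# Compute ROUGE-1/2/L/Lsum over lists of predicted and reference headlines.
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=predicted_headlines,   # list[str]: model-generated headlines
    references=reference_headlines,    # list[str]: ground-truth Inshorts headlines
)
print(scores)  # keys: 'rouge1', 'rouge2', 'rougeL', 'rougeLsum'
```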
GitHub Repository