|
---
base_model:
- meta-llama/Llama-3.2-1B-Instruct
datasets:
- BAAI/Infinity-Instruct
license: llama3.2
pipeline_tag: text-generation
library_name: transformers
---
|
|
|
## Model Overview |
|
|
|
This model is a fine-tuned version of **[Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)** produced with the **[LLM-Neo](https://arxiv.org/abs/2411.06839)** method. Usage is identical to the original Llama-3.2-1B-Instruct model.
|
|
|
The official implementation can be found here: https://github.com/yang3121099/LLM-Neo |
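
Since usage matches the base model, the standard Transformers inference flow applies. Below is a minimal sketch; the repo id is a placeholder for wherever this checkpoint is hosted:

```python
# Minimal chat inference sketch, identical to using the base
# Llama-3.2-1B-Instruct. The repo id below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<your-namespace>/Llama-3.2-1B-Instruct-Neo"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain knowledge distillation in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```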
|
|
|
## Training Details |
|
|
|
Training follows the **LLM-Neo** method, which combines low-rank adaptation (LoRA) with knowledge distillation (KD). The training data is a mixed sample drawn from **[BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)**, specifically the `0625` and `7M` subsets, totaling 10k instruction samples. The teacher model for KD is **[Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)**. The main hyperparameters are listed below, followed by a sketch of the combined training objective:
|
|
|
- **Learning Rate**: 1e-4
- **Epochs**: 1
- **KD Ratio**: 0.9
- **LoRA Rank**: 128
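
For intuition, here is a hedged sketch of how a LoRA-plus-KD objective with these settings might look. The exact formulation in the LLM-Neo paper and official repo may differ, and `neo_loss` is an illustrative name rather than the repo's API:

```python
# Hedged sketch of a combined LoRA + knowledge-distillation objective
# in the spirit of LLM-Neo. Not the official implementation.
import torch
import torch.nn.functional as F
from peft import LoraConfig

# LoRA adapter configuration matching the reported rank of 128.
lora_config = LoraConfig(r=128, task_type="CAUSAL_LM")

def neo_loss(student_logits, teacher_logits, labels,
             kd_ratio=0.9, temperature=1.0):
    """Blend hard-label cross-entropy with a KL distillation term."""
    vocab = student_logits.size(-1)
    s = student_logits.view(-1, vocab)
    t = teacher_logits.view(-1, vocab)

    # Standard next-token cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(s, labels.view(-1), ignore_index=-100)

    # KL divergence from the teacher's softened token distribution.
    kd = F.kl_div(
        F.log_softmax(s / temperature, dim=-1),
        F.softmax(t / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # kd_ratio = 0.9 weights the distillation term heavily over hard labels.
    return kd_ratio * kd + (1.0 - kd_ratio) * ce
```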
|
|
|
## Model Performance Evaluation |
|
|
|
<img src="https://raw.githubusercontent.com/Rummyyang/Rummyyang.github.io/refs/heads/main/img/radar_chart_neo_llama3.2_larger_text-1120-1-1.png" alt="Neo_radar" width="600"> |
|
|
|
|
|
|
The evaluation is divided into two parts: results from the **[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)** framework and results from the **[math-evaluation-harness](https://github.com/ZubinGou/math-evaluation-harness)** framework.
|
|
|
> **Note**: The results are influenced by the specific benchmark versions and the testing hardware/software configuration.
> The reported metrics should therefore be interpreted as relative performance within a given setup.
|
|
|
### Part 1: lm-evaluation-harness results |
|
|
|
In this part, the model was evaluated on several widely used benchmarks covering reasoning, commonsense, mathematics, and language understanding. Below is a comparison between **Llama-3.2-1B-Instruct** and the current model:
|
|
|
| Dataset | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct-Neo |
|---------------|------------------------|---------------------------|
| ARC Challenge | 36.09 | 36.43 |
| ARC Easy | 68.52 | 67.51 |
| CEval | 39.45 | 39.67 |
| CMMLU | 35.62 | 36.48 |
| MMLU | 45.91 | 46.27 |
| HellaSwag | 45.07 | 45.84 |
| OpenBookQA | 24.40 | 25.40 |
| PIQA | 73.88 | 74.32 |
| Winogrande | 59.27 | 61.17 |
|
|
|
The current model outperforms **Llama-3.2-1B-Instruct** on most of these benchmarks, with the largest gain on **Winogrande** (+1.90) and smaller improvements on commonsense tasks such as **PIQA** (+0.44); the only regression is on **ARC Easy**.
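
For reference, numbers of this kind can be regenerated with the harness's Python entry point. A hedged sketch follows; the task names use lm-eval conventions, the model path is a placeholder, and the exact harness version affects results, as noted above:

```python
# Hedged reproduction sketch using lm-evaluation-harness's Python API.
# The pretrained path is a placeholder; swap in the actual checkpoint.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=<your-namespace>/Llama-3.2-1B-Instruct-Neo,dtype=bfloat16",
    tasks=["arc_challenge", "arc_easy", "hellaswag",
           "openbookqa", "piqa", "winogrande", "mmlu"],
    batch_size=8,
)
print(results["results"])  # per-task metric dictionary
```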
|
|
|
--- |
|
|
|
### Part 2: math-evaluation-harness results |
|
|
|
In this part, the model was evaluated on mathematical reasoning benchmarks, focusing on its ability to solve math word problems and quantitative questions.
|
|
|
| Dataset | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct-Neo |
|---------------|------------------------|---------------------------|
| GSM8K | 35.00 | 39.30 |
| Minerva Math | 14.80 | 22.80 |
| SVAMP | 50.40 | 54.50 |
| ASDiv | 67.40 | 71.20 |
| MAWPS | 83.50 | 85.60 |
| TabMWP | 41.90 | 35.40 |
| MathQA | 44.20 | 48.30 |
| MMLU-STEM | 37.90 | 38.90 |
|
|
|
The mathematical evaluation shows clear gains on most datasets, most notably **Minerva Math** (+8.00) and **GSM8K** (+4.30); **TabMWP** is the one regression (−6.50).
|
|
|
--- |
|
|
|
### Summary |
|
|
|
- **Strengths**: The current model shows consistent improvements over **Llama-3.2-1B-Instruct** across most benchmarks, particularly in commonsense reasoning and mathematical problem-solving.
- **Future Directions**: Improving tabular math reasoning (e.g., **TabMWP**, the one benchmark that regressed) and continuing to strengthen general language understanding and mathematical ability.