Chinese Chat + Code Revise Patch for Llama2-base-13B

Introduction

This model repo studies how code-related SFT tasks affect conversational ability and logical reasoning.

A widely circulated claim in the community is that training on code data significantly improves a model's chain-of-thought (CoT) ability. Therefore, in addition to continuing to use the 500,000 SFT samples drawn from the BELLE project, this training run adds the code-review-instruct-critique-revision-python dataset and a set of logical-reasoning question-answer data scraped from the web.

The model uses Llama2-chat-13B as its base model and is trained with LoRA, with the embedding layer and the LM head tuned as well. The parameters have already been merged, so the model can be used directly. Alternatively, you can manually merge last_model with Llama2-chat-13B.
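For readers unfamiliar with what merging LoRA parameters means, the arithmetic is to add the scaled low-rank product to the frozen weight, W + (alpha / r) * B A. The sketch below shows this on toy matrices in plain Python. The shapes, the `lora_alpha` value, and the function names are illustrative assumptions and are not taken from this repo. It covers only the low-rank part; how the tuned embedding and LM head are stored in last_model is not described here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, rank, lora_alpha):
    """Return W + (lora_alpha / rank) * (B @ A), the merged weight.

    weight: out_dim x in_dim, lora_b: out_dim x rank, lora_a: rank x in_dim.
    """
    scale = lora_alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x3 weight with a rank-1 adapter (hypothetical values).
# Real layers are far larger, and this training run used rank 8.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]      # out_dim x rank
A = [[0.5, 0.0, 1.0]]   # rank x in_dim
merged = merge_lora(W, A, B, rank=1, lora_alpha=2.0)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged checkpoint can be loaded like an ordinary model.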

Training Details

Some training details:

Training Framework: This model was trained with the ChinChunMei-LLM codebase. Because of V100 GPU memory limits, this run used the new egs project Llama2-chat-7B-Chinese-50W-LoRA-ZERO3, which combines the DeepSpeed ZeRO-3 strategy with LoRA to save memory.
Tokenizer: This model uses the tokenizer.model from the Chinese-Alpaca-Plus model. The tokenizer.model of LLama2 is identical to that of LLama1, so in theory the tokenizer from the Chinese-LLaMa project can be reused as-is without any token misalignment.
Training Parameters: LoRA rank: 8, LR: 2e-4, Warmup ratio: 0.05, Max Seq Length: 2048
Training Resources: 8 × V100 GPUs, 155 hours
Initial Loss: 8.2501
Final Loss: 1.5515
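To make the warmup setting concrete: a warmup ratio of 0.05 means the learning rate ramps up to 2e-4 over the first 5% of training steps. The sketch below is a minimal illustration in plain Python. The total step count is hypothetical, and holding the rate constant after warmup is an assumption, since the decay schedule used for this model is not stated.

```python
def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.05):
    """Linear warmup to peak_lr over the first warmup_ratio of training.

    After warmup the rate is held at peak_lr. The actual decay schedule
    used for this model is not stated, so that part is an assumption.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


total_steps = 10_000  # hypothetical; the real step count is not given
warmup_steps = int(total_steps * 0.05)
first_lr = lr_at_step(0, total_steps)
late_lr = lr_at_step(5_000, total_steps)
```

With these hypothetical numbers, warmup lasts 500 steps, the first step runs at a small fraction of the peak rate, and every step after warmup runs at 2e-4.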

Inference

This model still uses the Stanford Alpaca template, so remember to wrap your input in the prompt template at inference time. The template is:

"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n\n${Your Content}\n\n### Response:\n\n"

For a dialogue with context, the template is:

"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n\nHuman:${Previous Human Content}\nAssistant:${Previous Machine Content}\nHuman:${Your Question}\n\n### Response:\n\n"

For code-related tasks, additionally place one of the following templates inside ${Your Content} of the template above:

For code review:

"Please critique the following codes based on their requirements: \n\nORIGINAL:\n{ORIGINAL}"

For code revision based on the original code and its review comment:

"Please revise the following codes based on their requirement and comment: \n\nORIGINAL:\n{ORIGINAL}\n\nCOMMENT:\n{CRITIQUE}"

For summarizing the changes between the original code and its revision:

"Please summarise the flaws of the code by comparing the original code and its revise: \n\nORIGINAL:\n{ORIGINAL}\n\nREVISED:\n{REVISED}"

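The prompt templates in the Inference section can be assembled programmatically. The following is a minimal sketch in plain Python. The function names, the task keys ("critique", "revise", "summarise"), and the handling of more than one previous turn (repeating the Human/Assistant pattern) are assumptions made for illustration; only the template strings themselves come from this README.

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n\n{content}\n\n### Response:\n\n"
)

# Code-task templates, keyed by hypothetical task names.
CODE_TEMPLATES = {
    "critique": (
        "Please critique the following codes based on their requirements: "
        "\n\nORIGINAL:\n{ORIGINAL}"
    ),
    "revise": (
        "Please revise the following codes based on their requirement and comment: "
        "\n\nORIGINAL:\n{ORIGINAL}\n\nCOMMENT:\n{CRITIQUE}"
    ),
    "summarise": (
        "Please summarise the flaws of the code by comparing the original code "
        "and its revise: \n\nORIGINAL:\n{ORIGINAL}\n\nREVISED:\n{REVISED}"
    ),
}


def build_prompt(question, history=()):
    """Wrap a question, plus optional (human, assistant) turns, in the Alpaca template."""
    if history:
        turns = "".join(f"Human:{h}\nAssistant:{a}\n" for h, a in history)
        content = f"{turns}Human:{question}"
    else:
        content = question
    return ALPACA_TEMPLATE.format(content=content)


def build_code_prompt(task, **fields):
    """Fill one of the code-task templates, then wrap it in the Alpaca template."""
    return build_prompt(CODE_TEMPLATES[task].format(**fields))


single_turn = build_prompt("Hi")
multi_turn = build_prompt("C", history=[("A", "B")])
critique = build_code_prompt("critique", ORIGINAL="def f(): return {}")
```

The resulting string is what gets tokenized and passed to the model; the model's answer is whatever it generates after the final "### Response:" marker.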
Licence

The models in this repository are open-sourced under the Apache-2.0 license, and use of the model weights must comply with the LLama2 MODEL LICENCE.

Future Work

I will gradually release the following models in the near future:

LLMs trained with different amounts of code-related data
LLMs of different sizes trained with code-related data