# chatglm-maths
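chatglm-6b fine-tuning / LoRA / PPO / inference. The samples are automatically generated integer and decimal addition, subtraction, multiplication, and division problems. Runs on GPU or CPU.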
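The logs in this README show generated pairs such as `('13+75=', '13+75=88')`. As a minimal sketch of how such samples could be produced (not the repository's actual generator; `random_number`, `make_sample`, and their parameters are hypothetical names), the following uses `decimal.Decimal` so that decimal answers stay exact:

```python
import random
from decimal import Decimal


def random_number(rng, max_value, decimals):
    """Draw a positive number with at most `decimals` decimal places."""
    scale = 10 ** decimals
    return Decimal(rng.randint(1, max_value * scale)) / Decimal(scale)


def make_sample(rng, max_value=100, decimals=0):
    """Return (question, question + answer), shaped like ('13+75=', '13+75=88')."""
    op = rng.choice("+-*/")
    b = random_number(rng, max_value, decimals)
    if op == "/":
        # Build the dividend from the quotient so the division is exact.
        quotient = random_number(rng, max_value, decimals)
        a, answer = quotient * b, quotient
    else:
        a = random_number(rng, max_value, decimals)
        answer = {"+": a + b, "-": a - b, "*": a * b}[op]
    question = f"{a}{op}{b}="
    return question, f"{question}{answer}"


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    for _ in range(3):
        print(make_sample(rng, decimals=2))
```

Constructing the dividend from the quotient is a design choice of this sketch: it avoids non-terminating decimals in division targets.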
# Github
[https://github.com/yongzhuo/chatglm-maths](https://github.com/yongzhuo/chatglm-maths)
## Pitfalls
```python
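1. eps=1e-5 (do not make it smaller), half-precision float16, and LN uses Post-LN (better generalization) + DeepNorm [note: there is also an LN before Attention]; the purpose is to prevent gradient overflow in a large model;
2. Model input/output: the default tokenization_chatglm.py/modeling_chatglm.py cannot be used as-is, because they are set up entirely for generation (generate); you need to write all input parameters yourself, or modify the files to fit;
   2.1 In ChatGLMModel, get_masks() is fine; in get_position_ids(), change 'context_length = seq.index(150004) + 1' to 'context_length = len(seq)';
   2.2 Training input_ids format, tentatively (post-padding for training, pre-padding for inference [tokenization_chatglm.py defaults to pre-padding])
       x: prompt_1 + "_" + text_1 + "\n" + prompt_2 + [gMASK] + [BOS] + "_" + text_2 + [PAD]*N
   2.3 Training label_ids format, tentatively (CrossEntropyLoss ignores -100 by default, so those positions do not contribute to the loss)
       y = [-100]*len(text_1) + [BOS] + text_2 + [EOS] + [-100]*N
   2.4 Pay attention to position/mask (the built-in ones only cover inference with batch_size=1, so the training inputs have to be written by hand); see the GLM-130B README.md, or the GLM source at https://github.com/THUDM/GLM/blob/main/tasks/seq2seq/dataset.py
3. Note that the chatglm-6b weights are float16, but the loss is computed in float32 and then cast back to float16 for the gradient update;
4. ChatGLMTokenizer sometimes raises odd errors; set max_new_tokens when generating, at most {"max_new_tokens": 2048}; decode sometimes hits ids that do not exist;
5. Low-rank adaptation (LoRA), RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
   Try: upgrade transformers to the latest version, call .cuda() after get_peft_model, and set device_map={'':torch.cuda.current_device()},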
```
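The input and label layouts in items 2.2 and 2.3 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's code: `build_example` is a hypothetical helper, the special-token ids are passed in by the caller rather than taken from any real vocabulary, `context_ids` stands for the already tokenized prompt, and `[EOS]` is appended to the input as well so that `input_ids` and `labels` have the same length.

```python
IGNORE_INDEX = -100  # CrossEntropyLoss skips positions with this label by default


def build_example(context_ids, target_ids, max_len,
                  gmask_id, bos_id, eos_id, pad_id):
    """Return post-padded (input_ids, labels), both of length max_len."""
    # x: context + [gMASK] + [BOS] + target + [EOS] (EOS on the input is an assumption)
    input_ids = context_ids + [gmask_id, bos_id] + target_ids + [eos_id]
    # y: ignore the context and [gMASK]; supervise [BOS] + target + [EOS]
    labels = ([IGNORE_INDEX] * (len(context_ids) + 1)
              + [bos_id] + target_ids + [eos_id])
    if len(input_ids) > max_len:
        raise ValueError("sequence is longer than max_len")
    n_pad = max_len - len(input_ids)
    # Post-padding for training: pad tokens on the right, ignored in the loss.
    return input_ids + [pad_id] * n_pad, labels + [IGNORE_INDEX] * n_pad


if __name__ == "__main__":
    # Toy ids, chosen only for readability.
    x, y = build_example([11, 12, 13], [21, 22], max_len=10,
                         gmask_id=1, bos_id=2, eos_id=3, pad_id=0)
    print(x)  # [11, 12, 13, 1, 2, 21, 22, 3, 0, 0]
    print(y)  # [-100, -100, -100, -100, 2, 21, 22, 3, -100, -100]
```

For inference the padding would go on the left instead, matching the pre-padding default noted in item 2.2.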
## Fine-tuning data
1. The raw data comes from [https://github.com/LYH-YF/MWPToolkit](https://github.com/LYH-YF/MWPToolkit)
2. Processed fine-tuning data (arithmetic expressions / equation solving), MWP: [https://huggingface.co/datasets/Macropodus/MWP-Instruct](https://huggingface.co/datasets/Macropodus/MWP-Instruct)
3. Large-number addition, subtraction, multiplication, and division data comes from: [https://github.com/liutiedong/goat.git](https://github.com/liutiedong/goat.git)
## LoRA weights
```shell
Baichuan-7B-GPT4ForALL: https://huggingface.co/Macropodus/MWP-Instruct
Bloomz-7B-GPT4ForALL: https://huggingface.co/Macropodus/MWP-Instruct
ChatGLM-6B-GPT4ForALL: https://huggingface.co/Macropodus/MWP-Instruct
LlaMA-7B-GPT4ForALL: https://huggingface.co/Macropodus/MWP-Instruct
ChatGLM-6B-MWP: https://huggingface.co/Macropodus/MWP-Instruct
```
## Datasets (Chinese)
- [https://github.com/tatsu-lab/stanford_alpaca](https://github.com/tatsu-lab/stanford_alpaca)
- [https://github.com/LianjiaTech/BELLE](https://github.com/LianjiaTech/BELLE)
- [https://github.com/carbonz0/alpaca-chinese-dataset](https://github.com/carbonz0/alpaca-chinese-dataset)
## Environment setup
```shell
transformers>=4.26.1
cpm_kernels==1.0.11
icetk==0.0.4
torch>=1.10.1
rouge==1.0.1
nltk==3.6.6
peft>=0.2.0
numpy
tqdm
lion_pytorch
macropodus
trl>=0.4.1
```
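To compare a local environment against the list above, a small check along these lines may help. It is only a sketch: `installed_versions` is a hypothetical helper, it assumes Python 3.8+ (for `importlib.metadata`), it covers a subset of the packages listed, and it reports versions without enforcing the pins.

```python
from importlib import metadata

# Subset of the distribution names from the requirements list above.
REQUIRED = ["transformers", "cpm_kernels", "icetk", "torch",
            "rouge", "nltk", "peft", "numpy", "tqdm", "trl"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```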
## Fine-tuning: arithmetic problems
```shell
lora
  fine-tune: python c00_toy_lora_train_6b.py
  inference: python p00_toy_lora_predict_6b.py
ppo
  train: python t10_toy_trl_train_ppo.py
  test: python t10_toy_trl_predict_ppo.py
6b
  fine-tune: python c00_toy_cpu_train_6b.py
  inference: python p00_toy_cpu_predit_6b.py
small-layer
  fine-tune: python c01_toy_cpu_train_small.py
  inference: python p01_toy_cpu_predict_small.py
```
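The fine-tuning log in this README prints each parameter name with a True/False flag, which appears to mark which weights are trainable: only `transformer.layers.27.*` and `transformer.final_layernorm.*` show True. A minimal sketch of that selection rule, working on names only (`trainable_flags` is a hypothetical helper, and reading the flag as `requires_grad` is an assumption):

```python
def trainable_flags(param_names, last_layer=27):
    """Mark only the last transformer block and the final layer norm as trainable."""
    prefixes = (f"transformer.layers.{last_layer}.",
                "transformer.final_layernorm.")
    return {name: name.startswith(prefixes) for name in param_names}


# With a loaded PyTorch model, the flags could be applied like this:
#   flags = trainable_flags([n for n, _ in model.named_parameters()])
#   for name, param in model.named_parameters():
#       param.requires_grad = flags[name]

if __name__ == "__main__":
    names = [
        "transformer.word_embeddings.weight",
        "transformer.layers.26.mlp.dense_4h_to_h.bias",
        "transformer.layers.27.input_layernorm.weight",
        "transformer.final_layernorm.bias",
    ]
    for name, flag in trainable_flags(names).items():
        print(name, flag)
```

The trailing dot in each prefix keeps `layers.2.` from matching `layers.27.` and vice versa.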
## References / Acknowledgements
- [https://github.com/THUDM/ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B)
- [https://github.com/THUDM/GLM](https://github.com/THUDM/GLM)
- [https://github.com/tatsu-lab/stanford_alpaca](https://github.com/tatsu-lab/stanford_alpaca)
- [https://github.com/LianjiaTech/BELLE](https://github.com/LianjiaTech/BELLE)
- [https://github.com/huggingface/peft](https://github.com/huggingface/peft)
- [https://github.com/mymusise/ChatGLM-Tuning](https://github.com/mymusise/ChatGLM-Tuning)
- [https://github.com/bojone/bert4keras](https://github.com/bojone/bert4keras)
- [trl](https://github.com/lvwerra/trl)
- [math23k](https://aclanthology.org/D17-1088)
## Inference log (toy)
```cpu
generator_calculate_line: ('13+75=', '13+75=88')
tokenizer.vocab_size: 150344
eval: 0%| | 0/1 [00:00<?, ?it/s]batch_query: ['简便运算: 98+83= 剖析: 98+83=181']
batch_qtext_0: 简便运算: 98+83= 剖析:
batch_qans_0: 98+83=181
response_0: 98+83=171
{'rouge-1': 0.0, 'rouge-2': 0.0, 'rouge-l': 0.0, 'bleu': 0.0}
请输入:
25.31+86.35=
请稍等...
25.31+86.35=101.66
```
## Fine-tuning log (toy)
```cpu
generator_calculate_line: ('13+75=', '13+75=88')
tokenizer.vocab_size: 150344
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:10<00:00, 1.31s/it]
transformer.word_embeddings.weight False
......
transformer.layers.26.mlp.dense_4h_to_h.bias False
transformer.layers.27.input_layernorm.weight True
transformer.layers.27.input_layernorm.bias True
transformer.layers.27.attention.query_key_value.weight True
transformer.layers.27.attention.query_key_value.bias True
transformer.layers.27.attention.dense.weight True
transformer.layers.27.attention.dense.bias True
transformer.layers.27.post_attention_layernorm.weight True
transformer.layers.27.post_attention_layernorm.bias True
transformer.layers.27.mlp.dense_h_to_4h.weight True
transformer.layers.27.mlp.dense_h_to_4h.bias True
transformer.layers.27.mlp.dense_4h_to_h.weight True
transformer.layers.27.mlp.dense_4h_to_h.bias True
transformer.final_layernorm.weight True
transformer.final_layernorm.bias True
model.chat start
13+75=88, but that's not the correct answer. The correct answer is 13+75=88, which is 90.
/anaconda3/envs/py371/lib/python3.7/site-packages/transformers/optimization.py:395: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
FutureWarning,
epoch: 0%|
```
---
license: cc-by-nc-4.0
---