---
license: apache-2.0
datasets:
- allenai/MADLAD-400
language:
- ta
base_model:
- Qwen/Qwen2.5-7B-Instruct
library_name: transformers
---
# Qwen2.5 7B Instruct for Tamil: Vocabulary expansion

This model adapts Qwen2.5 7B Instruct to Tamil. Its vocabulary was expanded with 10K additional target-language tokens, and it was continually pre-trained on 500M Tamil tokens sampled from MADLAD-400.

## Model Details

* **Vocabulary**: 10K additional target-language tokens on top of the original Qwen2.5 vocabulary.
* **Target vocabulary initialization**: The embedding and LM-head weights for the new tokens were initialized using mean initialization.
* **Training**: The model was continually pre-trained on 500M target-language tokens sampled from MADLAD-400.

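As a sketch of what mean initialization means here (the repository's actual code may differ): each new token's embedding row is set to the mean of the source-model embeddings of the subword pieces that token previously decomposed into. The helper name and toy sizes below are illustrative only.

```python
import numpy as np

def mean_init_expand(source_emb, new_token_pieces):
    """Append one row per new token, each row being the mean of the source
    embeddings of the subword pieces that token decomposes into
    (illustrative helper, not the repository's actual code)."""
    new_rows = np.stack([
        source_emb[pieces].mean(axis=0) for pieces in new_token_pieces
    ])
    return np.concatenate([source_emb, new_rows], axis=0)

# Toy example: 4-token source vocabulary, hidden size 8, two new tokens
# that decompose into pieces [0, 1] and [2, 3] respectively.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8)).astype(np.float32)
expanded = mean_init_expand(emb, [[0, 1], [2, 3]])
print(expanded.shape)  # (6, 8)
```

In a real model, the same rows would be appended to both the input embedding matrix and the LM head (e.g. after `resize_token_embeddings` in `transformers`).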

## Model Description

- **Language:** Tamil
- **License:** Apache 2.0
- **Fine-tuned from model:** Qwen/Qwen2.5-7B-Instruct

## Model Sources

- **Repository:** https://github.com/gucci-j/chat-cve
- **Paper:** https://arxiv.org/abs/2412.11704

## How to Get Started with the Model
Use the code below to load the model and tokenizer and generate a reply.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "atsuki-yamaguchi/Qwen2.5-7B-Instruct-ta-madlad-mean-tuned"

# Load the adapted model and its expanded tokenizer.
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate with the Qwen2.5 chat template.
messages = [{"role": "user", "content": "வணக்கம்!"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

## Citation
```
@misc{yamaguchi2024vocabularyexpansionchatmodels,
  title={{ElChat}: Adapting Chat Language Models Using Only Target Unlabeled Language Data},
  author={Atsuki Yamaguchi and Terufumi Morishita and Aline Villavicencio and Nikolaos Aletras},
  year={2024},
  eprint={2412.11704},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2412.11704},
}
```