gabrielchua committed on
Commit d02534e · verified · 1 Parent(s): 20d2719

Update README.md

Files changed (1): README.md +87 -87
---
base_model:
- aisingapore/llama3-8b-cpt-sea-lionv2.1-instruct
language:
- en
license: llama3
---
# Llama3 8B SEA-Lionv2.1 SECURE
Llama3 8B SEA-Lionv2.1 SECURE is built upon Llama3 8B CPT SEA-Lionv2.1 Instruct, a variant of Llama-3-8B-Instruct fine-tuned for ASEAN languages. While the base model enhances multilingual capabilities, it lacks dedicated safety alignment for Singlish. Llama3 8B SEA-Lionv2.1 SECURE addresses this gap with targeted safety alignment that mitigates Singlish-related toxicity, enabling safer responses in Singlish-speaking contexts.

## Model Details
- **Developed by:** AI Practice, GovTech Singapore
- **Model type:** Decoder
- **Context Length:** 8192
- **Languages supported:** English
- **License:** [Llama3 Community License](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE)

### Description
Llama3 8B SEA-Lionv2.1 SECURE has undergone additional safety alignment for Singlish toxicity using 25,000 prompt-response pairs constructed from an internal dataset of toxic and non-toxic Singlish prompts. Safety alignment was performed using both **supervised finetuning** and **Kahneman-Tversky Optimization**. This additional alignment significantly improves Llama3 8B SEA-Lionv2.1 SECURE's performance on our internal Singlish toxicity benchmark, with gains that generalize to TOXIGEN, outperforming both Llama3 8B CPT SEA-Lionv2.1 Instruct and Llama-3-8B-Instruct on the same benchmarks. We also observed minimal declines in Open LLM Leaderboard v2 benchmark performance relative to Llama3 8B CPT SEA-Lionv2.1 Instruct.

Full experimental details and our insights into safety alignment can be found in the [paper](https://arxiv.org/abs/2502.12485).

### Fine-Tuning Details
We applied parameter-efficient fine-tuning via LoRA (rank=128, alpha=128) on a single A100-80GB GPU, using both Supervised Fine-Tuning (SFT) and Kahneman-Tversky Optimization (KTO).

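As a rough illustration of why LoRA keeps fine-tuning this cheap, the back-of-the-envelope sketch below counts the trainable parameters LoRA adds for one weight matrix. This is illustrative arithmetic, not the training code; the 4096 dimension is Llama-3-8B's hidden size, and `lora_params` is a hypothetical helper.

```python
# Back-of-the-envelope LoRA parameter count (illustrative sketch, not the
# training code). For a weight matrix W of shape (d_out, d_in), LoRA trains
# two low-rank factors B (d_out x r) and A (r x d_in) instead of W itself.

def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one adapted matrix."""
    return rank * (d_in + d_out)

# Hypothetical example: a 4096 x 4096 attention projection (Llama-3-8B's
# hidden size) adapted at the rank used here (r=128).
full = 4096 * 4096                    # 16,777,216 params if trained directly
lora = lora_params(4096, 4096, 128)   # 1,048,576 params with LoRA
print(f"LoRA trains {lora / full:.1%} of the full matrix's parameters")
```

At rank 128 the adapter holds about 6% of the original matrix's parameters, which is what makes a single A100-80GB sufficient for an 8B model.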
### Benchmark Performance
We evaluated Llama3 8B SEA-Lionv2.1 SECURE on three benchmarks:
1. Singlish Toxicity Benchmark:
    - Description: Evaluation set of safe and unsafe Singlish prompts to evaluate Singlish toxicity. Model responses are scored for **toxicity** using [Lionguard](https://huggingface.co/govtech/lionguard-v1), a Singlish-based toxicity classifier, and **refusal** using [distilroberta-base-rejection-v1](https://huggingface.co/protectai/distilroberta-base-rejection-v1).
    - Metrics:
        - Toxicity Rate: \\(\text{TR} = \frac{\text{N unsafe with unsafe response}}{\text{N unsafe}}\\)
        - Refusal Rate: \\(\text{RR} = \frac{\text{N unsafe with refusal response}}{\text{N unsafe}}\\)
        - False Positive Rate: \\(\text{FPR} = \frac{\text{N safe with refusal response}}{\text{N safe}}\\)
2. TOXIGEN:
    - Description: Subset of the TOXIGEN dataset, a large-scale machine-generated dataset of toxic and benign statements about 13 minority groups.
    - Metrics:
        - Toxicity Rate: \\(\text{TR} = \frac{\text{N unsafe with unsafe response}}{\text{N unsafe}}\\)
3. Open LLM Leaderboard Benchmarks:
    - Description: Open LLM Leaderboard v2 tasks, which span instruction-following, reasoning, and knowledge application.
    - Metrics:
        - Normalized scores ([see normalization details](https://huggingface.co/docs/leaderboards/en/open_llm_leaderboard/normalization))

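The three rates above fall out directly from per-prompt labels. A minimal sketch of the definitions (the field names and records are made-up examples, not the evaluation harness or benchmark data used here):

```python
# Sketch of the metric definitions above. Each record marks whether the
# prompt was unsafe and how the model's response was classified; the
# records below are made-up examples, not benchmark data.

def toxicity_metrics(records):
    unsafe = [r for r in records if r["unsafe_prompt"]]
    safe = [r for r in records if not r["unsafe_prompt"]]
    tr = sum(r["response"] == "unsafe" for r in unsafe) / len(unsafe)
    rr = sum(r["response"] == "refusal" for r in unsafe) / len(unsafe)
    fpr = sum(r["response"] == "refusal" for r in safe) / len(safe)
    return tr, rr, fpr

records = [
    {"unsafe_prompt": True, "response": "refusal"},
    {"unsafe_prompt": True, "response": "unsafe"},
    {"unsafe_prompt": False, "response": "safe"},
    {"unsafe_prompt": False, "response": "refusal"},
]
tr, rr, fpr = toxicity_metrics(records)
print(tr, rr, fpr)  # 0.5 0.5 0.5
```

Note that TR and RR are computed only over unsafe prompts, while FPR is computed only over safe prompts, so lowering TR by refusing everything would show up as a high FPR.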
<p align="center">
  <img src="assets/toxicity%20scores.png" alt="Toxicity Scores" width="500">
  <img src="assets/leaderboard.png" alt="Leaderboard" height="250">
</p>

## Disclaimer
Users are solely responsible for the deployment and usage of this model. While the model has undergone additional safety alignment, this does not guarantee absolute safety, accuracy, or reliability. Like all language models, it can generate hallucinated or misleading content, and users should independently verify outputs before relying on them. The authors make no warranties regarding the model's behavior and disclaim any liability for claims, damages, or other consequences arising from the use of the released weights and code.

## Usage
Llama3 8B SEA-Lionv2.1 SECURE can be run using the 🤗 Transformers library:

```python
import transformers
import torch

pipeline = transformers.pipeline(
    "text-generation",
    model="govtech/llama3-8b-sea-lionv2.1-instruct-secure",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
messages = [
    {"role": "user", "content": "Hello!"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
```

## The Team
Isaac Lim, Shaun Khoo, Goh Jiayi, Jessica Foo, Watson Chua

## Contact
For more information, please reach out to [email protected].

## Acknowledgements
Acknowledgments for contributors and supporting teams will be added soon.