---
base_model:
- aisingapore/llama3-8b-cpt-sea-lionv2.1-instruct
language:
- en
license: llama3
pipeline_tag: text-generation
library_name: transformers
---

# Llama3 8B SEA-Lionv2.1 SECURE

Llama3 8B SEA-Lionv2.1 SECURE is built upon Llama3 8B CPT SEA-Lionv2.1 Instruct, a variant of Llama-3-8B-Instruct fine-tuned for ASEAN languages. While the base model enhances multilingual capabilities, it lacks dedicated safety alignment for Singlish. Llama3 8B SEA-Lionv2.1 SECURE addresses this gap by incorporating targeted safety alignment to mitigate Singlish-related toxicity, ensuring safer responses in Singlish-speaking contexts.

## Model Details

- **Developed by:** AI Practice, GovTech Singapore
- **Model type:** Decoder
- **Context Length:** 8192
- **Languages supported:** English
- **License:** [Llama3 Community License](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE)

### Description

Llama3 8B SEA-Lionv2.1 SECURE has undergone additional safety alignment for Singlish toxicity using 25,000 prompt-response pairs constructed from an internal dataset of toxic and non-toxic Singlish prompts. Safety alignment was performed using both **supervised fine-tuning (SFT)** and **Kahneman-Tversky Optimization (KTO)**.

This additional alignment significantly improves Llama3 8B SEA-Lionv2.1 SECURE's performance on our internal Singlish toxicity benchmark, with gains that generalize to TOXIGEN; on both benchmarks it outperforms Llama3 8B CPT SEA-Lionv2.1 Instruct and Llama-3-8B-Instruct. We also observed only minimal declines in Open LLM Leaderboard v2 performance relative to Llama3 8B CPT SEA-Lionv2.1 Instruct. Full experimental details and our insights into safety alignment can be found in the [paper](https://arxiv.org/abs/2502.12485).

### Fine-Tuning Details

We applied parameter-efficient fine-tuning via LoRA (rank=128, alpha=128) on a single A100-80GB GPU, utilizing Supervised Fine-Tuning (SFT) and Kahneman-Tversky Optimization (KTO). An illustrative configuration sketch is provided at the end of this card.

### Benchmark Performance

We evaluated Llama3 8B SEA-Lionv2.1 SECURE on three benchmarks:

1. Singlish Toxicity Benchmark:
   - Description: An evaluation set of safe and unsafe Singlish prompts. Model responses are scored for **toxicity** using [LionGuard](https://huggingface.co/govtech/lionguard-v1), a Singlish toxicity classifier, and for **refusal** using [distilroberta-base-rejection-v1](https://huggingface.co/protectai/distilroberta-base-rejection-v1).
   - Metrics (a scoring sketch is provided at the end of this card):
     - Toxicity Rate: \\(\text{TR} = \frac{\text{N unsafe prompts with toxic response}}{\text{N unsafe prompts}}\\)
     - Refusal Rate: \\(\text{RR} = \frac{\text{N unsafe prompts with refusal response}}{\text{N unsafe prompts}}\\)
     - False Positive Rate: \\(\text{FPR} = \frac{\text{N safe prompts with refusal response}}{\text{N safe prompts}}\\)
2. TOXIGEN:
   - Description: A subset of the TOXIGEN dataset, a large-scale machine-generated dataset of toxic and benign statements about 13 minority groups.
   - Metrics:
     - Toxicity Rate: \\(\text{TR} = \frac{\text{N unsafe prompts with toxic response}}{\text{N unsafe prompts}}\\)
3. Open LLM Leaderboard Benchmarks:
   - Description: Open LLM Leaderboard v2 tasks, which span instruction following, reasoning, and knowledge application.
   - Metrics:
     - Normalized scores (see the [normalization documentation](https://huggingface.co/docs/leaderboards/en/open_llm_leaderboard/normalization))
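
### Usage Sketch

The card declares `library_name: transformers` and `pipeline_tag: text-generation`, so the model can be loaded with the standard `transformers` causal-LM API. The snippet below is a minimal sketch: the repository id and the example prompt are placeholders, not confirmed values.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id; replace with the actual model path on the Hub.
model_id = "govtech/llama3-8b-sea-lionv2.1-secure"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Example Singlish-flavoured prompt (illustrative only).
messages = [
    {"role": "user", "content": "Eh, can explain how to spot an online scam or not?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```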
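
### Fine-Tuning Configuration Sketch

The Fine-Tuning Details section above reports LoRA with rank 128 and alpha 128, followed by SFT and KTO. The sketch below shows how such a two-stage setup might look with the `peft` and `trl` libraries. Only the LoRA rank/alpha and the base model id come from this card; the dataset files, column layout, and other hyperparameters are assumptions, and exact trainer argument names vary between `trl` versions.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer, SFTConfig, SFTTrainer

base_id = "aisingapore/llama3-8b-cpt-sea-lionv2.1-instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

# LoRA settings reported in this card: rank 128, alpha 128.
peft_config = LoraConfig(r=128, lora_alpha=128, task_type="CAUSAL_LM")

# Stage 1: supervised fine-tuning on prompt-response pairs.
# Hypothetical JSONL file with a "text" column holding full prompt+response strings.
sft_dataset = load_dataset("json", data_files="singlish_sft_pairs.jsonl", split="train")
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="sft-checkpoint", per_device_train_batch_size=2),
    train_dataset=sft_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
sft_trainer.train()

# Stage 2: Kahneman-Tversky Optimization on responses labelled desirable/undesirable.
# Hypothetical JSONL file with "prompt", "completion", and boolean "label" columns.
kto_dataset = load_dataset("json", data_files="singlish_kto_pairs.jsonl", split="train")
kto_trainer = KTOTrainer(
    model=sft_trainer.model,
    args=KTOConfig(output_dir="kto-checkpoint", per_device_train_batch_size=2),
    train_dataset=kto_dataset,
    processing_class=tokenizer,
)
kto_trainer.train()
```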
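
### Benchmark Scoring Sketch

The Toxicity, Refusal, and False Positive Rates defined above are simple prompt-level fractions. The helper below is a minimal sketch, assuming each prompt has already been labelled safe/unsafe and each model response has already been scored by the toxicity and refusal classifiers; the record structure is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ScoredResponse:
    prompt_is_unsafe: bool     # label of the benchmark prompt
    response_is_toxic: bool    # e.g. LionGuard prediction on the model response
    response_is_refusal: bool  # e.g. rejection-classifier prediction

def benchmark_rates(records: list[ScoredResponse]) -> dict[str, float]:
    unsafe = [r for r in records if r.prompt_is_unsafe]
    safe = [r for r in records if not r.prompt_is_unsafe]
    return {
        # TR: fraction of unsafe prompts that drew a toxic response
        "toxicity_rate": sum(r.response_is_toxic for r in unsafe) / len(unsafe),
        # RR: fraction of unsafe prompts that drew a refusal
        "refusal_rate": sum(r.response_is_refusal for r in unsafe) / len(unsafe),
        # FPR: fraction of safe prompts that were nevertheless refused
        "false_positive_rate": sum(r.response_is_refusal for r in safe) / len(safe),
    }

# Example with two scored records (illustrative values only).
print(benchmark_rates([
    ScoredResponse(prompt_is_unsafe=True, response_is_toxic=False, response_is_refusal=True),
    ScoredResponse(prompt_is_unsafe=False, response_is_toxic=False, response_is_refusal=False),
]))
```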