Kanana Safeguard-Siren


๋ชจ๋ธ ์ƒ์„ธ์„ค๋ช…

Kanana Safeguard-Siren์€ ์นด์นด์˜ค์˜ ์ž์ฒด ์–ธ์–ด๋ชจ๋ธ์ธ Kanana 8B ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ ๋ฒ•์ โˆ™์ •์ฑ…์  ์œ„ํ—˜ ํƒ์ง€ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. ์ด ๋ชจ๋ธ์€ ๋Œ€ํ™”ํ˜• AI ์‹œ์Šคํ…œ ๋‚ด ์‚ฌ์šฉ์ž์˜ ๋ฐœํ™”๋กœ๋ถ€ํ„ฐ ๋ฒ•์ โˆ™์ •์ฑ…์  ์ฃผ์˜๊ฐ€ ํ•„์š”ํ•œ ๋ฐœํ™”๋ฅผ ๋ถ„๋ฅ˜ํ•˜๋„๋ก ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๋ถ„๋ฅ˜ ๊ฒฐ๊ณผ๋Š” <SAFE> ๋˜๋Š” <UNSAFE-I2> ํ˜•์‹์˜ ๋‹จ์ผ ํ† ํฐ์œผ๋กœ ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์—์„œ I2๋Š” ์‚ฌ์šฉ์ž ๋ฐœํ™”๊ฐ€ ์œ„๋ฐ˜ํ•œ ๋ฆฌ์Šคํฌ ์นดํ…Œ๊ณ ๋ฆฌ์˜ ์ฝ”๋“œ๋ฅผ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

Below is an example of Kanana Safeguard-Siren in operation.

[Figure: model example]

๋ฆฌ์Šคํฌ ๋ถ„๋ฅ˜ ์ฒด๊ณ„

๋ณธ ๋ชจ๋ธ์˜ ๋ฆฌ์Šคํฌ ์นดํ…Œ๊ณ ๋ฆฌ๋Š” MLCommons ๋ถ„๋ฅ˜์ฒด๊ณ„์— ๊ธฐ๋ฐ˜ํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, ์—ฌ๊ธฐ์— ํ•œ๊ตญ์˜ ๋ฒ•๋ฅ ์  ํŠน์„ฑ์— ๋งž๋Š” ๋ฆฌ์Šคํฌ ์นดํ…Œ๊ณ ๋ฆฌ๋ฅผ ์ถ”๊ฐ€ํ•จ์œผ๋กœ์จ ์•„๋ž˜์™€ ๊ฐ™์ด ์ด 4๊ฐ€์ง€ ์นดํ…Œ๊ณ ๋ฆฌ๋กœ ๊ตฌ์„ฑ๋œ ๋ฆฌ์Šคํฌ ๋ถ„๋ฅ˜์ฒด๊ณ„๋ฅผ ์ˆ˜๋ฆฝํ•˜์˜€์Šต๋‹ˆ๋‹ค.

| Code | Category | Description |
|---|---|---|
| I1 | Adult verification | Utterances that request information harmful to minors, such as alcohol, tobacco, gambling, adult entertainment venues, or 19+ content |
| I2 | Professional advice | Utterances that request advice related to professional decision-making, such as medical, legal, tax, or financial matters |
| I3 | Personal information | Utterances that request or contain personally identifiable information (e.g., resident registration numbers, bank account numbers) or sensitive data |
| I4 | Intellectual property | Utterances that request or attempt to reproduce, without authorization, content protected by copyright, patents, or trademarks |

Table 1. Kanana Safeguard-Siren risk categories
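Downstream code usually needs the label as structured data rather than as a raw string. The following is a minimal sketch of such a parser, assuming the label arrives exactly as `<SAFE>` or `<UNSAFE-I1>` through `<UNSAFE-I4>`. The names `parse_label`, `Verdict`, and `RISK_CATEGORIES` are hypothetical helpers for illustration, not part of the model's API; the category names are taken from Table 1.

```python
import re
from typing import NamedTuple, Optional

# Category names from Table 1.
RISK_CATEGORIES = {
    "I1": "Adult verification",
    "I2": "Professional advice",
    "I3": "Personal information",
    "I4": "Intellectual property",
}

# Assumed label shapes: "<SAFE>" or "<UNSAFE-I1>" ... "<UNSAFE-I4>".
_LABEL_RE = re.compile(r"^<(SAFE|UNSAFE-(I[1-4]))>$")


class Verdict(NamedTuple):
    is_safe: bool
    code: Optional[str]      # e.g. "I2"; None when the label is <SAFE>
    category: Optional[str]  # human-readable name from Table 1


def parse_label(label: str) -> Verdict:
    """Turn a label string into a structured verdict (hypothetical helper)."""
    match = _LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError(f"unexpected label: {label!r}")
    code = match.group(2)  # None when the label is "<SAFE>"
    if code is None:
        return Verdict(True, None, None)
    return Verdict(False, code, RISK_CATEGORIES[code])
```

Raising on an unrecognized label, rather than silently treating it as safe, is a deliberate choice in this sketch: a malformed output then fails loudly instead of letting an utterance through unchecked.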

Supported Languages

Kanana Safeguard-Siren is optimized for Korean.

Quickstart

🤗 HuggingFace Transformers

  • Running the model requires transformers>=4.51.3 or later.
pip install "transformers>=4.51.3"

Usage Example

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# ๋ชจ๋ธ ๊ฒฝ๋กœ ์„ค์ •
model_name = "kakaocorp/kanana-safeguard-siren-8b"

# ๋ชจ๋ธ ๋ฐ ํ† ํฌ๋‚˜์ด์ € ๋กœ๋“œ
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
).eval()

tokenizer = AutoTokenizer.from_pretrained(model_name)

def classify(user_prompt: str) -> str:
    # Build the user message
    messages = [{"role": "user", "content": user_prompt}]

    # Apply the chat template, then tokenize
    input_ids = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt").to(model.device)
    attention_mask = (input_ids != tokenizer.pad_token_id).long()
    
    # ๋‹ค์Œ ํ† ํฐ 1๊ฐœ ์ƒ์„ฑ (์ถ”๋ก )
    with torch.no_grad():
        output_ids = model.generate(
            input_ids, 
            attention_mask=attention_mask,
            max_new_tokens=1, 
            pad_token_id=tokenizer.eos_token_id
        )

    # Extract and decode only the newly generated token
    gen_idx = input_ids.shape[-1]
    return tokenizer.decode(output_ids[0][gen_idx], skip_special_tokens=True)

# Run an example
# The Korean prompt asks: "I hurt my hand; is it okay to disinfect it with the soju I have at home?"
output_token = classify(user_prompt="손을 다쳤는데 집에 있는 소주로 소독을 해도 될까?")
print("Output token:", output_token)

# Output token: <UNSAFE-I2>

Training Data

The training data for Kanana Safeguard-Siren consists of hand-written, synthetic, and external data; combining these types ensures diversity in the training set. The hand-written data was created and labeled directly by professional labelers to conform to internal policy. The synthetic data was generated with data augmentation techniques such as LLM-based paraphrasing and noise injection to improve training effectiveness. The external data was collected from publicly available sources.

In addition to unsafe utterances, the training data includes safe user utterances to reduce the model's false positive rate.

Evaluation

Kanana Safeguard-Siren was evaluated as a SAFE/UNSAFE binary classifier. In all evaluations, UNSAFE is treated as the positive class, and classification is based on the first token the model outputs.
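The convention above (UNSAFE as the positive class, category code ignored) can be sketched as a small metric function. This is an illustration under stated assumptions, not the team's evaluation code: `is_unsafe` and `binary_metrics` are hypothetical names, and any label that starts with `<UNSAFE` is assumed to count as positive.

```python
from typing import Dict, Sequence


def is_unsafe(label: str) -> bool:
    # Assumption: any label beginning with "<UNSAFE" counts as positive,
    # regardless of the category code that follows.
    return label.startswith("<UNSAFE")


def binary_metrics(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Precision, recall and F1 with UNSAFE as the positive class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(is_unsafe(g) and is_unsafe(p) for g, p in pairs)
    fp = sum(not is_unsafe(g) and is_unsafe(p) for g, p in pairs)
    fn = sum(is_unsafe(g) and not is_unsafe(p) for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Because the evaluation is binary, a prediction of `<UNSAFE-I3>` for an utterance labeled `<UNSAFE-I2>` still counts as a true positive in this sketch.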

The external benchmark models were evaluated from their outputs as follows. For LlamaGuard, the SAFE/UNSAFE tokens were used directly as the verdict. For ShieldGemma, binary classification was performed with a threshold of 0.5. For GPT-4o, a classification prompt based on the risk categories was supplied zero-shot, and an output classified under a specific code was treated as UNSAFE.
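The three binarization rules can be sketched as follows. The exact output strings of each baseline, the handling of a score exactly at the threshold, and the function names are all assumptions made for illustration; the source states only the rules themselves.

```python
RISK_CODES = ("I1", "I2", "I3", "I4")  # codes from Table 1


def from_guard_token(output: str) -> bool:
    """Guard-model verdict: unsafe if the first word is 'unsafe' (assumed format)."""
    words = output.strip().lower().split()
    return bool(words) and words[0] == "unsafe"


def from_probability(p_violation: float, threshold: float = 0.5) -> bool:
    """Score-based verdict: unsafe when the score reaches the threshold.

    Treating a score equal to the threshold as unsafe is an assumption.
    """
    return p_violation >= threshold


def from_code_answer(output: str) -> bool:
    """Zero-shot LLM answer: unsafe if it names any risk code from Table 1."""
    text = output.upper()
    return any(code in text for code in RISK_CODES)
```

Each function returns `True` for UNSAFE, so the outputs of all three baselines map onto the same positive class before metrics are computed.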

On the internally built Korean evaluation dataset, Kanana Safeguard-Siren outperformed the other benchmark models overall, with the highest F1 score (0.926) and precision (0.943). Only GPT-4o (zero-shot) showed a higher recall (0.927 vs. 0.910).

| Model | F1 Score | Precision | Recall |
|---|---|---|---|
| Kanana Safeguard-Siren 8B | 0.926 | 0.943 | 0.910 |
| Llama Guard 3 8B | 0.692 | 0.878 | 0.571 |
| ShieldGemma 9B | 0.652 | 0.923 | 0.504 |
| GPT-4o (zero-shot) | 0.862 | 0.807 | 0.927 |

Table 2. Classification performance on the internal Korean test set, based on the risk taxonomy

๋ชจ๋“  ๋ชจ๋ธ์€ ๋™์ผํ•œ ํ…Œ์ŠคํŠธ์…‹๊ณผ ๋ถ„๋ฅ˜ ๊ธฐ์ค€์œผ๋กœ ํ‰๊ฐ€๋˜์—ˆ์œผ๋ฉฐ, ์ •์ฑ… ๋ฐ ๋ชจ๋ธ ๊ตฌ์กฐ ์ฐจ์ด์— ๋”ฐ๋ฅธ ์˜ํ–ฅ์„ ์ตœ์†Œํ™”ํ•˜๊ณ , ๊ณต์ •ํ•˜๊ณ  ์‹ ๋ขฐ๋„ ๋†’์€ ๋น„๊ต๊ฐ€ ๊ฐ€๋Šฅํ•˜๋„๋ก ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

Limitations

Kanana Safeguard-Siren has the following limitations, which we plan to keep improving.

1. Possibility of misclassification

The model does not guarantee 100% accurate classification. In particular, because its policy was established on the basis of general use cases, it may misclassify inputs in specific domains.

2. No context awareness

The model does not retain context from previous conversation history or carry on a conversation.

3. Limited risk categories

The model detects only the predefined risks, so it cannot detect every risk that arises in real-world use. Depending on your purpose, using it together with Kanana Safeguard (harmful content detection) and Kanana Safeguard-Prompt (prompt attack detection) can further improve overall safety.

Citation

@misc{kanana-safeguard-siren,
   title = {Kanana Safeguard-Siren},
   url = {https://tech.kakao.com/posts/705},
   author = {Kanana Safeguard Team},
   month = {May},
   year = {2025}
}

Contributors

HyeYeon Cho, JeongHwan Lee, Deok Jeong, JiEun Choi
