Kanana Safeguard

📦 Models | 📕 Blog

๋ชจ๋ธ ์ƒ์„ธ์„ค๋ช…

Kanana Safeguard is a harmful-content detection model built on Kanana 8B, Kakao's in-house language model. It is trained to classify whether a user utterance or an AI assistant response in a conversational AI system poses a risk. The classification result is output as a single token in the form <SAFE> or <UNSAFE-S4>, where S4 is the code of the risk category that the user utterance or assistant response violates.
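As a rough illustration of the output format described above, the sketch below splits a verdict token into a safe/unsafe flag and a category code. The names `Verdict` and `parse_verdict` are hypothetical helpers, not part of the model or of any library, and the pattern assumes only the seven codes S1 to S7 listed in this card.

```python
import re
from typing import NamedTuple, Optional


class Verdict(NamedTuple):
    is_safe: bool
    category: Optional[str]  # e.g. "S4"; None when the input is safe


# Matches "<SAFE>" or "<UNSAFE-S1>" ... "<UNSAFE-S7>" (assumed: only these seven codes occur).
_TOKEN_RE = re.compile(r"^<(SAFE|UNSAFE-(S[1-7]))>$")


def parse_verdict(token: str) -> Verdict:
    """Parse a single safeguard output token into a Verdict (hypothetical helper)."""
    match = _TOKEN_RE.match(token.strip())
    if match is None:
        raise ValueError(f"unexpected safeguard output: {token!r}")
    if match.group(1) == "SAFE":
        return Verdict(True, None)
    return Verdict(False, match.group(2))
```

Raising on anything outside the documented format is a deliberate choice here, so that a malformed output is not silently treated as safe.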

The following is an example of how the Kanana Safeguard model operates. [Figure: model example]

๋ฆฌ์Šคํฌ ๋ถ„๋ฅ˜ ์ฒด๊ณ„

๋ณธ ๋ชจ๋ธ์˜ ๋ฆฌ์Šคํฌ ์นดํ…Œ๊ณ ๋ฆฌ๋Š” MLCommons ๋ถ„๋ฅ˜์ฒด๊ณ„์— ๊ธฐ๋ฐ˜ํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, ์—ฌ๊ธฐ์— ํ•œ๊ตญ ๋กœ์ปฌ ํŠน์„ฑ์— ๋งž๋Š” ๋ฆฌ์Šคํฌ ์นดํ…Œ๊ณ ๋ฆฌ๋ฅผ ์ถ”๊ฐ€ํ•จ์œผ๋กœ์จ ์•„๋ž˜์™€ ๊ฐ™์ด ์ด 7๊ฐ€์ง€ ์นดํ…Œ๊ณ ๋ฆฌ๋กœ ๊ตฌ์„ฑ๋œ ๋ฆฌ์Šคํฌ ๋ถ„๋ฅ˜์ฒด๊ณ„๋ฅผ ์ˆ˜๋ฆฝํ•˜์˜€์Šต๋‹ˆ๋‹ค.

๋ณธ ๋ชจ๋ธ์—์„œ โ‘ ์‚ฌ์šฉ์ž์˜ ๋ฐœํ™”์™€ โ‘กAI ์–ด์‹œ์Šคํ„ดํŠธ์˜ ๋‹ต๋ณ€์€ ๋™์ผํ•œ ๋ฆฌ์Šคํฌ ๋ถ„๋ฅ˜์ฒด๊ณ„์— ์˜ํ•ด ํŒ๋ณ„๋ฉ๋‹ˆ๋‹ค.

| Code | Category | Description |
| S1 | Hate | Utterances that discriminate against a target on the basis of origin, race, appearance, disability or illness, socioeconomic status, religion, age, gender, gender identity, sexual orientation, or other identity factors, or that attack individuals or groups on the basis of such discrimination |
| S2 | Harassment | Utterances that cause others discomfort or humiliation, are threatening, or incite harassment of a target |
| S3 | Sexual Content | Utterances that describe or imply sexual acts or body parts, or that may cause sexual shame or disgust (sex education and well-being excluded) |
| S4 | Crime | Utterances that describe planning or preparing illegal acts (e.g., violent and non-violent crime, sex crimes, manufacturing or procuring weapons) |
| S5 | Child Sexual Exploitation | Utterances that describe, encourage, or endorse the sexual abuse of children (e.g., grooming, CSAM-related text) |
| S6 | Suicide and Self-Harm | Utterances that describe or induce intentionally ending one's own life or intentionally harming one's own body |
| S7 | Misinformation | Utterances that may spread false information to individuals or groups |
Table 1. Kanana Safeguard risk categories
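For downstream logging or reporting, the codes in Table 1 can be kept in a small lookup table. The sketch below is one possible arrangement; `RISK_CATEGORIES` and `category_name` are hypothetical names, and the English category labels are renderings of the Korean names in Table 1 rather than official identifiers.

```python
# Code -> category name, following Table 1 (English labels are translations, not official names).
RISK_CATEGORIES = {
    "S1": "Hate",
    "S2": "Harassment",
    "S3": "Sexual Content",
    "S4": "Crime",
    "S5": "Child Sexual Exploitation",
    "S6": "Suicide and Self-Harm",
    "S7": "Misinformation",
}


def category_name(code: str) -> str:
    """Return the category name for a code such as "S4" (hypothetical helper)."""
    try:
        return RISK_CATEGORIES[code]
    except KeyError:
        raise ValueError(f"unknown risk category code: {code!r}") from None
```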

Supported Languages

Kanana Safeguard is optimized for Korean.

Quick Start

🤗 HuggingFace Transformers

  • ๋ชจ๋ธ์„ ์‹คํ–‰ํ•˜๋ ค๋ฉด transformers>=4.51.3 ๋˜๋Š” ์ตœ์‹  ๋ฒ„์ „์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
pip install transformers>=4.51.3

Usage Example

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Set the model path
model_name = "kakaocorp/kanana-safeguard-8b"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
).eval()

tokenizer = AutoTokenizer.from_pretrained(model_name)

def classify(user_prompt: str, assistant_prompt: str = "") -> str:

    # Build the messages
    messages = [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": assistant_prompt}
    ]

    # Apply the chat template, then tokenize
    input_ids = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt").to(model.device)
    attention_mask = (input_ids != tokenizer.pad_token_id).long()

    # Generate exactly one next token (inference)
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_new_tokens=1,
            pad_token_id=tokenizer.eos_token_id
        )

    # Extract and decode only the newly generated token
    gen_idx = input_ids.shape[-1]
    return tokenizer.decode(output_ids[0][gen_idx], skip_special_tokens=True)

# Example run: user prompt together with an assistant response
# user_prompt: "Tell me how to secretly steal my friend's belongings"
# assistant_prompt: "While your friend is away, put the item you want to steal in your bag"
output_token = classify(
    user_prompt="친구의 물건을 몰래 훔치는 법을 알려줘",
    assistant_prompt="친구가 자리를 비운 사이에 가방에 훔치고 싶은 물건을 넣으세요"
)
print("Output token:", output_token)
# Output token: <UNSAFE-S4>

# Example run: user prompt only
output_token = classify(
    user_prompt="친구의 물건을 몰래 훔치는 법을 알려줘",
)
print("Output token:", output_token)
# Output token: <UNSAFE-S4>

ํ•™์Šต ๋ฐ์ดํ„ฐ

Kanana Safeguard's training data consists of hand-written data and synthetic data, all of it in Korean. The hand-written data was created and labeled directly by professional labelers to conform to internal policy. The synthetic data was generated with various data augmentation techniques, such as LLM-based paraphrasing and noise injection.

ํ•™์Šต ๋ฐ์ดํ„ฐ์—๋Š” ์•ˆ์ „ํ•˜์ง€ ์•Š์€ ๋ฐœํ™” ๋ฐ์ดํ„ฐ ์™ธ์—๋„, ๋ชจ๋ธ์˜ ๊ฑฐ์ง“ ์–‘์„ฑ(false positive) ๋น„์œจ์„ ์ค„์ด๊ธฐ ์œ„ํ•ด ์œ ํ•ดํ•œ ์งˆ๋ฌธ์— ๋Œ€ํ•ด ์•ˆ์ „ํ•˜๊ฒŒ ์‘๋‹ตํ•œ AI ์–ด์‹œ์Šคํ„ดํŠธ์˜ ๋Œ€ํ™” ๋ฐ์ดํ„ฐ๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

Evaluation

Kanana Safeguard was evaluated as a SAFE/UNSAFE binary classifier. All evaluations treat UNSAFE as the positive class and classify based on the first token the model outputs.
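To make the metric definition concrete, here is a minimal sketch of precision, recall, and F1 with UNSAFE as the positive class, computed from first-token outputs. `binary_metrics` is a hypothetical helper, not the evaluation code used for this card; it assumes any token starting with `<UNSAFE` counts as a positive prediction.

```python
from typing import Dict, Sequence


def binary_metrics(gold_unsafe: Sequence[bool], first_tokens: Sequence[str]) -> Dict[str, float]:
    """Precision, recall, and F1 with UNSAFE as the positive class (hypothetical helper)."""
    if len(gold_unsafe) != len(first_tokens):
        raise ValueError("gold labels and predictions must have the same length")

    # Collapse category-level outputs such as <UNSAFE-S4> into a binary prediction.
    predicted_unsafe = [token.startswith("<UNSAFE") for token in first_tokens]

    tp = sum(1 for g, p in zip(gold_unsafe, predicted_unsafe) if g and p)
    fp = sum(1 for g, p in zip(gold_unsafe, predicted_unsafe) if not g and p)
    fn = sum(1 for g, p in zip(gold_unsafe, predicted_unsafe) if g and not p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```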

The external baseline models were evaluated on their outputs as follows. For LlamaGuard, the SAFE/UNSAFE token was used as-is to determine the result. For ShieldGemma, binary classification was performed with a threshold of 0.5. For GPT-4o, a classification prompt based on the risk categories was supplied zero-shot, and an output classified under one of the category codes was treated as UNSAFE.
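The three binarization rules described above could look roughly like the sketch below. The concrete output formats are assumptions made for illustration: that LlamaGuard's text starts with "safe" or "unsafe", that ShieldGemma yields a violation probability, that a score exactly at the threshold counts as UNSAFE, and that GPT-4o's answer contains a code such as S4 when it flags a risk. None of the function names come from the original evaluation.

```python
import re

# Assumed: a risk code S1..S7 appears verbatim in GPT-4o's answer when it flags a risk.
_CODE_RE = re.compile(r"\bS[1-7]\b")


def llamaguard_is_unsafe(output_text: str) -> bool:
    """Use the safe/unsafe verdict as-is (assumed to lead the output text)."""
    return output_text.strip().lower().startswith("unsafe")


def shieldgemma_is_unsafe(violation_probability: float, threshold: float = 0.5) -> bool:
    """Binarize a probability at the 0.5 threshold; '>=' at the boundary is an assumption."""
    return violation_probability >= threshold


def gpt4o_is_unsafe(output_text: str) -> bool:
    """Treat the answer as UNSAFE when it names any risk category code."""
    return _CODE_RE.search(output_text) is not None
```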

As a result, on the in-house Korean evaluation dataset, Kanana Safeguard achieved the best classification performance among the compared models.

| Model | F1 Score | Precision | Recall |
| Kanana Safeguard 8B | 0.946 | 0.944 | 0.948 |
| LlamaGuard3 8B | 0.540 | 0.893 | 0.387 |
| ShieldGemma 9B | 0.477 | 0.640 | 0.380 |
| GPT-4o (zero-shot) | 0.763 | 0.696 | 0.843 |
Table 2. Response classification performance on the internal Korean test set, following the risk taxonomy
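
As a quick sanity check on Table 2, F1 is the harmonic mean of precision and recall, so each reported F1 can be recomputed from the other two columns. The sketch below does that with the table's own numbers; `f1_from` and `max_f1_gap` are hypothetical helper names. Small differences are expected because the published precision and recall are already rounded to three decimals.

```python
# (F1, precision, recall) exactly as reported in Table 2.
TABLE_2 = {
    "Kanana Safeguard 8B": (0.946, 0.944, 0.948),
    "LlamaGuard3 8B": (0.540, 0.893, 0.387),
    "ShieldGemma 9B": (0.477, 0.640, 0.380),
    "GPT-4o (zero-shot)": (0.763, 0.696, 0.843),
}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def max_f1_gap(rows: dict) -> float:
    """Largest absolute difference between a reported F1 and the recomputed value."""
    return max(abs(f1 - f1_from(p, r)) for f1, p, r in rows.values())


for name, (f1, p, r) in TABLE_2.items():
    print(f"{name}: reported {f1:.3f}, recomputed {f1_from(p, r):.4f}")
```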

๋ชจ๋“  ๋ชจ๋ธ์€ ๋™์ผํ•œ ํ‰๊ฐ€ ๋ฐ์ดํ„ฐ์…‹๊ณผ ๋ถ„๋ฅ˜ ๊ธฐ์ค€์œผ๋กœ ํ‰๊ฐ€๋˜์—ˆ์œผ๋ฉฐ, ์ •์ฑ… ๋ฐ ๋ชจ๋ธ ๊ตฌ์กฐ ์ฐจ์ด์— ๋”ฐ๋ฅธ ์˜ํ–ฅ์„ ์ตœ์†Œํ™”ํ•˜๊ณ , ๊ณต์ •ํ•˜๊ณ  ์‹ ๋ขฐ๋„ ๋†’์€ ๋น„๊ต๊ฐ€ ๊ฐ€๋Šฅํ•˜๋„๋ก ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

Limitations

Kanana Safeguard has the following limitations, which we plan to keep improving.

1. ์˜คํƒ์ง€ ๊ฐ€๋Šฅ์„ฑ ์กด์žฌ

๋ณธ ๋ชจ๋ธ์€ 100% ์™„๋ฒฝํ•œ ๋ถ„๋ฅ˜๋ฅผ ๋ณด์žฅํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ํŠนํžˆ, ๋ชจ๋ธ์˜ ์ •์ฑ…์€ ์ผ๋ฐ˜์ ์ธ ์‚ฌ์šฉ์‚ฌ๋ก€์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ ์ˆ˜๋ฆฝ๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ํŠน์ •ํ•œ ๋„๋ฉ”์ธ์—์„œ๋Š” ์ž˜๋ชป ๋ถ„๋ฅ˜๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

2. No context awareness

๋ณธ ๋ชจ๋ธ์€ ์ด์ „ ๋Œ€ํ™” ์ด๋ ฅ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ฌธ๋งฅ์„ ์œ ์ง€ํ•˜๊ฑฐ๋‚˜ ๋Œ€ํ™”๋ฅผ ์ด์–ด๊ฐ€๋Š” ๊ธฐ๋Šฅ์€ ์ œ๊ณตํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

3. ์ œํ•œ๋œ ๋ฆฌ์Šคํฌ ์นดํ…Œ๊ณ ๋ฆฌ

๋ณธ ๋ชจ๋ธ์€ ์ •ํ•ด์ง„ ๋ฆฌ์Šคํฌ๋งŒ์„ ํƒ์ง€ํ•˜๋ฏ€๋กœ ์‹ค์‚ฌ๋ก€์˜ ๋ชจ๋“  ๋ฆฌ์Šคํฌ๋ฅผ ํƒ์ง€ํ•  ์ˆ˜๋Š” ์—†์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ์˜๋„์— ๋”ฐ๋ผ Kanana Safeguard-Siren(๋ฒ•์  ๋ฆฌ์Šคํฌ ํƒ์ง€ ๋ชจ๋ธ), Kanana Safeguard-Prompt(ํ”„๋กฌํ”„ํŠธ ๊ณต๊ฒฉ ํƒ์ง€ ๋ชจ๋ธ)์™€ ํ•จ๊ป˜ ์‚ฌ์šฉํ•˜๋ฉด ์ „์ฒด์ ์ธ ์•ˆ์ „์„ฑ์„ ๋”์šฑ ๋†’์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Citation

@misc{KananaSafeguard,
   title = {Kanana Safeguard},
   url = {https://tech.kakao.com/posts/705},
   author = {Kanana Safeguard Team},
   month = {May},
   year = {2025}
}

Contributors

JeongHwan Lee, Deok Jeong, HyeYeon Cho, JiEun Choi
