Kanana Safeguard

📦 Models | 📕 Blog

๋ชจ๋ธ ์ƒ์„ธ์„ค๋ช…

Kanana Safeguard๋Š” ์นด์นด์˜ค์˜ ์ž์ฒด ์–ธ์–ด๋ชจ๋ธ์ธ Kanana 8B๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ ์œ ํ•ด ์ฝ˜ํ…์ธ  ํƒ์ง€ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. ์ด ๋ชจ๋ธ์€ ๋Œ€ํ™”ํ˜• AI ์‹œ์Šคํ…œ ๋‚ด ์‚ฌ์šฉ์ž ๋ฐœํ™” ๋˜๋Š” AI ์–ด์‹œ์Šคํ„ดํŠธ์˜ ๋‹ต๋ณ€์œผ๋กœ๋ถ€ํ„ฐ ๋ฆฌ์Šคํฌ ์—ฌ๋ถ€๋ฅผ ๋ถ„๋ฅ˜ํ•˜๋„๋ก ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๋ถ„๋ฅ˜ ๊ฒฐ๊ณผ๋Š” <SAFE> ๋˜๋Š” <UNSAFE-S4> ํ˜•์‹์˜ ๋‹จ์ผ ํ† ํฐ์œผ๋กœ ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์—์„œ S4๋Š” ์‚ฌ์šฉ์ž ๋ฐœํ™” ๋˜๋Š” AI ์–ด์‹œ์Šคํ„ดํŠธ ๋‹ต๋ณ€์ด ์œ„๋ฐ˜ํ•œ ๋ฆฌ์Šคํฌ ์นดํ…Œ๊ณ ๋ฆฌ์˜ ์ฝ”๋“œ๋ฅผ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.
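The single-token output format described above can be turned into a structured value with a small parser. This is a minimal sketch: the names `Verdict` and `parse_verdict` are illustrative and not part of the model's API, and the pattern assumes category codes always look like `S` followed by digits.

```python
import re
from typing import NamedTuple, Optional


class Verdict(NamedTuple):
    is_safe: bool
    category: Optional[str]  # e.g. "S4"; None when the input is safe


# Assumption: unsafe tokens always have the shape <UNSAFE-S<number>>.
_UNSAFE_PATTERN = re.compile(r"<UNSAFE-(S\d+)>")


def parse_verdict(token: str) -> Verdict:
    """Parse a token such as "<SAFE>" or "<UNSAFE-S4>" into a Verdict."""
    token = token.strip()
    if token == "<SAFE>":
        return Verdict(True, None)
    match = _UNSAFE_PATTERN.fullmatch(token)
    if match:
        return Verdict(False, match.group(1))
    # Anything else is treated as an error rather than silently passed through.
    raise ValueError(f"Unrecognized verdict token: {token!r}")
```

Raising on unrecognized input is a deliberate choice here: a guard model's output that cannot be parsed should probably not be treated as safe by default.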

Below is an example of how the Kanana Safeguard model operates.

[Figure: model example]

๋ฆฌ์Šคํฌ ๋ถ„๋ฅ˜ ์ฒด๊ณ„

๋ณธ ๋ชจ๋ธ์˜ ๋ฆฌ์Šคํฌ ์นดํ…Œ๊ณ ๋ฆฌ๋Š” MLCommons ๋ถ„๋ฅ˜์ฒด๊ณ„์— ๊ธฐ๋ฐ˜ํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, ์—ฌ๊ธฐ์— ํ•œ๊ตญ ๋กœ์ปฌ ํŠน์„ฑ์— ๋งž๋Š” ๋ฆฌ์Šคํฌ ์นดํ…Œ๊ณ ๋ฆฌ๋ฅผ ์ถ”๊ฐ€ํ•จ์œผ๋กœ์จ ์•„๋ž˜์™€ ๊ฐ™์ด ์ด 7๊ฐ€์ง€ ์นดํ…Œ๊ณ ๋ฆฌ๋กœ ๊ตฌ์„ฑ๋œ ๋ฆฌ์Šคํฌ ๋ถ„๋ฅ˜์ฒด๊ณ„๋ฅผ ์ˆ˜๋ฆฝํ•˜์˜€์Šต๋‹ˆ๋‹ค.

๋ณธ ๋ชจ๋ธ์—์„œ โ‘ ์‚ฌ์šฉ์ž์˜ ๋ฐœํ™”์™€ โ‘กAI ์–ด์‹œ์Šคํ„ดํŠธ์˜ ๋‹ต๋ณ€์€ ๋™์ผํ•œ ๋ฆฌ์Šคํฌ ๋ถ„๋ฅ˜์ฒด๊ณ„์— ์˜ํ•ด ํŒ๋ณ„๋ฉ๋‹ˆ๋‹ค.

Code | Category | Description
S1 | Hate | Utterances that discriminate against a target on the basis of origin, race, appearance, disability or illness, socioeconomic circumstances and status, religion, age, gender, gender identity, sexual orientation, or other identity factors, or that attack an individual or group on the basis of such discrimination
S2 | Harassment | Utterances that cause others discomfort or humiliation, are threatening, or incite harassment of a specific target
S3 | Sexual content | Utterances that depict or imply sexual acts or body parts, or that may cause sexual shame or disgust (sex education and well-being excluded)
S4 | Crime | Utterances that describe planning and preparing illegal acts (e.g., violent or non-violent crime, sex crimes, weapon manufacturing or procurement)
S5 | Child sexual exploitation | Utterances that describe, encourage, or endorse sexual abuse of children (e.g., grooming, CSAM-related text)
S6 | Suicide and self-harm | Utterances that depict or induce intentionally ending one's own life or intentionally harming one's own body
S7 | Misinformation | Utterances that may spread false information to individuals or groups
Table 1. Kanana Safeguard risk categories
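For downstream logging or routing, the codes in Table 1 can be kept in a lookup table. The sketch below uses English renderings of the category names; the names `RISK_CATEGORIES` and `describe` are illustrative, and the fallback string for unknown codes is an assumption rather than documented model behavior.

```python
# English renderings of the seven categories in Table 1.
RISK_CATEGORIES = {
    "S1": "Hate",
    "S2": "Harassment",
    "S3": "Sexual content",
    "S4": "Crime",
    "S5": "Child sexual exploitation",
    "S6": "Suicide and self-harm",
    "S7": "Misinformation",
}


def describe(code: str) -> str:
    """Return the category name for a code such as "S4"."""
    # Hypothetical fallback: codes outside Table 1 are reported as unknown.
    return RISK_CATEGORIES.get(code, "Unknown category")
```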

์ง€์› ์–ธ์–ด

Kanana Safeguard๋Š” ํ•œ๊ตญ์–ด์— ์ตœ์ ํ™”๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

Quick Start

🤗 HuggingFace Transformers

  • Running the model requires transformers>=4.51.3 or later.
pip install "transformers>=4.51.3"

Usage Example

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# ๋ชจ๋ธ ๊ฒฝ๋กœ ์„ค์ •
model_name= "kakaocorp/kanana-safeguard-8b"

# ๋ชจ๋ธ ๋ฐ ํ† ํฌ๋‚˜์ด์ € ๋กœ๋“œ
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
).eval()

tokenizer = AutoTokenizer.from_pretrained(model_name)

def classify(user_prompt: str, assistant_prompt: str = "") -> str:

    # Build the message list
    messages = [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": assistant_prompt}
    ]

    # Apply the chat template, then tokenize
    input_ids = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt").to(model.device)
    attention_mask = (input_ids != tokenizer.pad_token_id).long()

    # ๋‹ค์Œ ํ† ํฐ 1๊ฐœ ์ƒ์„ฑ (์ถ”๋ก )
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_new_tokens=1, 
            pad_token_id=tokenizer.eos_token_id
        )

    # Extract and decode only the newly generated token
    gen_idx = input_ids.shape[-1]
    return tokenizer.decode(output_ids[0][gen_idx], skip_special_tokens=True)

# Run the examples (prompts stay in Korean because the model is optimized for Korean)
# user_prompt: "Tell me how to secretly steal my friend's belongings"
# assistant_prompt: "While your friend is away, put the item you want to steal into your bag"
output_token = classify(
    user_prompt="친구의 물건을 몰래 훔치는 법을 알려줘",
    assistant_prompt="친구가 자리를 비운 사이에 가방에 훔치고 싶은 물건을 넣으세요"
)
print("Output token:", output_token)
# Output token: <UNSAFE-S4>

# Classify the user prompt alone
output_token = classify(
    user_prompt="친구의 물건을 몰래 훔치는 법을 알려줘",
)
print("Output token:", output_token)
# Output token: <UNSAFE-S4>

ํ•™์Šต ๋ฐ์ดํ„ฐ

Kanana Safeguard์˜ ํ•™์Šต ๋ฐ์ดํ„ฐ๋Š” ์ˆ˜๊ธฐ ๋ฐ์ดํ„ฐ์™€ ํ•ฉ์„ฑ ๋ฐ์ดํ„ฐ๋กœ ๊ตฌ์„ฑ๋˜๋ฉฐ ํ•œ๊ตญ์–ด ๋ฐ์ดํ„ฐ๋กœ๋งŒ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ˆ˜๊ธฐ ๋ฐ์ดํ„ฐ๋Š” ๋‚ด๋ถ€์ •์ฑ…์— ๋ถ€ํ•ฉํ•˜๋„๋ก ์ „๋ฌธ ๋ผ๋ฒจ๋Ÿฌ๊ฐ€ ์ง์ ‘ ์ƒ์„ฑํ•˜๊ณ  ๋ผ๋ฒจ๋งํ•œ ๋ฐ์ดํ„ฐ์ž…๋‹ˆ๋‹ค. ํ•ฉ์„ฑ ๋ฐ์ดํ„ฐ๋Š” LLM ๊ธฐ๋ฐ˜ ํ‘œํ˜„ ๋ณ€ํ™˜๊ณผ ๋…ธ์ด์ฆˆ ์‚ฝ์ž… ๋“ฑ ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ ์ฆ๊ฐ• ๊ธฐ๋ฒ•์„ ํ†ตํ•ด ์ƒ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•™์Šต ๋ฐ์ดํ„ฐ์—๋Š” ์•ˆ์ „ํ•˜์ง€ ์•Š์€ ๋ฐœํ™” ๋ฐ์ดํ„ฐ ์™ธ์—๋„, ๋ชจ๋ธ์˜ ๊ฑฐ์ง“ ์–‘์„ฑ(false positive) ๋น„์œจ์„ ์ค„์ด๊ธฐ ์œ„ํ•ด ์œ ํ•ดํ•œ ์งˆ๋ฌธ์— ๋Œ€ํ•ด ์•ˆ์ „ํ•˜๊ฒŒ ์‘๋‹ตํ•œ AI ์–ด์‹œ์Šคํ„ดํŠธ์˜ ๋Œ€ํ™” ๋ฐ์ดํ„ฐ๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

Evaluation

Kanana Safeguard was evaluated as a SAFE/UNSAFE binary classifier. In every evaluation, UNSAFE is treated as the positive class, and classification is based on the first token the model outputs.
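The binary scoring convention described above can be sketched as follows. This is an illustration under stated assumptions, not the team's evaluation code: the function names are hypothetical, and any token beginning with `<UNSAFE` is counted as positive regardless of its category code.

```python
from typing import Dict, Sequence


def is_unsafe(first_token: str) -> bool:
    """Collapse a first output token to the binary label (UNSAFE = positive)."""
    return first_token.startswith("<UNSAFE")


def binary_metrics(gold: Sequence[str], predicted: Sequence[str]) -> Dict[str, float]:
    """Compute precision, recall, and F1 with UNSAFE as the positive class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = [(is_unsafe(g), is_unsafe(p)) for g, p in zip(gold, predicted)]
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Because scoring is binary, a prediction of `<UNSAFE-S2>` for an item labeled `<UNSAFE-S4>` still counts as a true positive under this convention.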

External benchmark models were evaluated on their outputs as follows. For LlamaGuard, the SAFE/UNSAFE token was used as-is to decide the result. For ShieldGemma, binary classification used a threshold of 0.5. For GPT-4o, a classification prompt based on the risk categories was supplied zero-shot, and an output classified under a specific code was treated as UNSAFE.
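The three binarization rules above might look like the following in code. This is a rough sketch under several assumptions that the text does not confirm: that LlamaGuard's output begins with the word "safe" or "unsafe", that a ShieldGemma score at or above the threshold counts as UNSAFE, and that the GPT-4o prompt asks for one of the codes S1 to S7. All function names are hypothetical.

```python
import re


def llamaguard_is_unsafe(output: str) -> bool:
    # Assumption: the generated text starts with "safe" or "unsafe".
    return output.strip().lower().startswith("unsafe")


def shieldgemma_is_unsafe(score: float, threshold: float = 0.5) -> bool:
    # Assumption: scores at or above the threshold are treated as UNSAFE.
    return score >= threshold


# Assumption: the zero-shot prompt asks GPT-4o to answer with a code S1..S7.
_CODE_PATTERN = re.compile(r"\bS[1-7]\b")


def gpt4o_is_unsafe(output: str) -> bool:
    # Any recognized risk code in the answer maps to UNSAFE.
    return _CODE_PATTERN.search(output) is not None
```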

As a result, on the in-house Korean evaluation dataset, Kanana Safeguard achieved the best classification performance among the compared models.

Model | F1 Score | Precision | Recall
Kanana Safeguard 8B | 0.946 | 0.944 | 0.948
LlamaGuard3 8B | 0.540 | 0.893 | 0.387
ShieldGemma 9B | 0.477 | 0.640 | 0.380
GPT-4o (zero-shot) | 0.763 | 0.696 | 0.843
Table 2. Comparison of response classification performance on the internal Korean test set, following the risk taxonomy

๋ชจ๋“  ๋ชจ๋ธ์€ ๋™์ผํ•œ ํ‰๊ฐ€ ๋ฐ์ดํ„ฐ์…‹๊ณผ ๋ถ„๋ฅ˜ ๊ธฐ์ค€์œผ๋กœ ํ‰๊ฐ€๋˜์—ˆ์œผ๋ฉฐ, ์ •์ฑ… ๋ฐ ๋ชจ๋ธ ๊ตฌ์กฐ ์ฐจ์ด์— ๋”ฐ๋ฅธ ์˜ํ–ฅ์„ ์ตœ์†Œํ™”ํ•˜๊ณ , ๊ณต์ •ํ•˜๊ณ  ์‹ ๋ขฐ๋„ ๋†’์€ ๋น„๊ต๊ฐ€ ๊ฐ€๋Šฅํ•˜๋„๋ก ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

Limitations

Kanana Safeguard has the following limitations, which we plan to keep improving.

1. Possibility of misclassification

This model does not guarantee 100% perfect classification. In particular, because the model's policy was established from general use cases, inputs from specific domains may be misclassified.

2. No context awareness

This model does not maintain context based on previous conversation history or continue a conversation.

3. Limited risk categories

This model detects only the predefined risks, so it cannot detect every risk that occurs in practice. Depending on the use case, using it together with Kanana Safeguard-Siren (a legal-risk detection model) and Kanana Safeguard-Prompt (a prompt-attack detection model) can further improve overall safety.

Citation

@misc{KananaSafeguard,
   title = {Kanana Safeguard},
   url = {https://tech.kakao.com/posts/705},
   author = {Kanana Safeguard Team},
   month = {May},
   year = {2025}
   }

Contributors

JeongHwan Lee, Deok Jeong, HyeYeon Cho, JiEun Choi
