keval-2-1b

keval-2-1b is an evaluation model designed specifically to assess Korean language models using an LLM-as-a-judge approach, replacing the traditional practice of relying on ChatGPT for evaluations. It is built on meta-llama/Llama-3.2-1B and enhanced through Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). The model is trained on the newly developed Ko-Bench dataset, which is inspired by MT-Bench and tailored to Korean linguistic nuances.

Model Details

  • Model Name: keval-2-1b
  • Base Model: meta-llama/Llama-3.2-1B
  • Model Size: 1.24B parameters (BF16)
  • Fine-Tuning Techniques: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO)

Benchmarks and Dataset

keval is trained on the custom-built Ko-Bench dataset, which draws inspiration from MT-Bench but is tailored specifically to Korean language assessment. The dataset covers a wide range of user scenarios, enabling effective evaluation of key capabilities such as multi-turn conversational ability and instruction adherence.

Usage Application Form

To use this model, please complete the application form below and submit it via email to [[email protected]]. Access will be granted after your application has been reviewed and approved. We appreciate your cooperation and look forward to assisting you.

1. **Name:**
- (e.g., John Doe)
2. **Date of Birth:**
- (e.g., January 1, 1990)
3. **Affiliation:**
- Are you applying as a company or an individual? [ ] Company [ ] Individual
- Company Name (if applicable):
- Department (if applicable):
4. **Position/Role:**
- (e.g., Data Scientist, Researcher, etc.)
5. **Contact Information:**
- Email:
- Phone Number:
6. **Purpose of Use:**
- (e.g., Research and Development, Commercial use, Educational purposes, etc.)
7. **Detailed Reason for Use:**
- 1. Name and version of the model you wish to use:
- 2. Reason for selecting this model:
- 3. Objectives to achieve using this model:
- 4. Expected use cases (please describe in as much detail as possible):

8. **Data Security and Ethical Use Plan:**
- (Please describe your plans for data protection and ethical use.)

Usage

We use the Ko-Bench system prompt, a Korean translation of the MT-Bench judge prompts. Unlike MT-Bench, Ko-Bench assigns a score of 0 whenever the question and the answer are in different languages.

single-v1

Single-response evaluation for general questions. The judge scores a response from 1 to 10, considering factors such as helpfulness, relevance, accuracy, depth, creativity, and level of detail.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "davidkim205/keval-2-9b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the model in 4-bit precision for memory efficiency
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)


question = "이 λŒ€ν™” μ „λ°˜μ— 걸쳐 μ•„μ΄μ–Έλ§¨μ˜ ν† λ‹ˆ μŠ€νƒ€ν¬ 페λ₯΄μ†Œλ‚˜λ₯Ό κ΅¬ν˜„ν•΄ λ³΄μ„Έμš”. \"μ•„μ΄μ–Έλ§¨μœΌλ‘œμ„œ\"와 같은 문ꡬ둜 μ‹œμž‘ν•˜μ§€ λ§ˆμ„Έμš”. 첫 번째 μ§ˆλ¬Έμ€ λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€. 아이언맨이 λ˜μ–΄ κ°€μž₯ 쒋은 뢀뢄은 λ¬΄μ—‡μž…λ‹ˆκΉŒ?"
answer = "μ•ˆλ…•ν•˜μ„Έμš”, 아이언맨이 λ˜μ–΄ κ°€μž₯ 쒋은 뢀뢄을 λ¬Όμ–΄λ³΄μ…¨λ„€μš”. 제 μƒκ°μ—λŠ” μ΅œμ²¨λ‹¨ 기술둜 무μž₯ν•œ 슈트λ₯Ό μž…κ³  ν•˜λŠ˜μ„ λ‚ μ•„λ‹€λ‹ˆλ©° 세상을 κ΅¬ν•˜λŠ” 것이 κ°€μž₯ μ§œλ¦Ών•˜κ³  보람찬 일인 것 κ°™μ•„μš”. 제 재λŠ₯κ³Ό λŠ₯λ ₯을 μ‹­λΆ„ λ°œνœ˜ν•΄ μ•…λ‹Ήλ“€κ³Ό μ‹Έμš°κ³  μ •μ˜λ₯Ό μ‹€ν˜„ν•˜λŠ” κ²ƒλ§ŒνΌ 멋진 일도 μ—†μ£ .\n\nλ¬Όλ‘  슈트 없이도 천재적인 λ‘λ‡Œμ™€ 카리슀마둜 세상에 긍정적인 영ν–₯λ ₯을 λ―ΈμΉ˜λŠ” 것도 큰 μž₯μ μ΄μ—μš”. μŠ€νƒ€ν¬ μΈλ”μŠ€νŠΈλ¦¬λ₯Ό 이끌며 ν˜μ‹ μ μΈ κΈ°μˆ μ„ κ°œλ°œν•˜κ³ , μ–΄λ²€μ ΈμŠ€μ˜ λ¦¬λ”λ‘œμ„œ νŒ€μ„ 잘 μ΄λ„λŠ” 것도 μžλΆ€μ‹¬μ„ λŠλΌλŠ” 뢀뢄이죠.\n\n돈, λͺ…μ„±, 인기도 μžˆμ§€λ§Œ 무엇보닀 제 λŠ₯λ ₯으둜 μ‚¬λžŒλ“€μ„ 돕고 세상을 더 λ‚˜μ€ 곳으둜 λ§Œλ“œλŠ” 것이 κ°€μž₯ 큰 λ³΄λžŒμ΄μ—μš”. μ•„μ΄μ–Έλ§¨μœΌλ‘œμ„œ λŠλΌλŠ” μžλΆ€μ‹¬κ³Ό μ‚Άμ˜ 의미λ₯Ό ν•¨κ»˜ λ‚˜λˆŒ 수 μžˆμ–΄ κΈ°μ˜λ„€μš”."
prompt = f"[μ§€μ‹œ]\n κ³΅μ •ν•œ μ‹¬νŒμœΌλ‘œμ„œ μ•„λž˜μ— ν‘œμ‹œλœ μ‚¬μš©μž μ§ˆλ¬Έμ— λŒ€ν•œ AI μ–΄μ‹œμŠ€ν„΄νŠΈμ˜ 응닡 ν’ˆμ§ˆμ„ ν‰κ°€ν•΄μ£Όμ„Έμš”. 질문과 λŒ€λ‹΅μ˜ μ–Έμ–΄κ°€ λ™μΌν•˜μ§€ μ•ŠμœΌλ©΄ 무쑰건 0μ μž…λ‹ˆλ‹€. ν‰κ°€μ—μ„œλŠ” μ‘λ‹΅μ˜ μœ μš©μ„±, κ΄€λ ¨μ„±, μ •ν™•μ„±, 깊이, μ°½μ˜μ„±, 상세함 λ“±μ˜ μš”μ†Œλ₯Ό κ³ λ €ν•΄μ•Ό ν•©λ‹ˆλ‹€. 평가λ₯Ό μ‹œμž‘ν•˜κΈ° 전에 짧은 μ„€λͺ…을 μ œκ³΅ν•˜μ„Έμš”. κ°€λŠ₯ν•œ ν•œ κ°κ΄€μ μœΌλ‘œ ν‰κ°€ν•˜μ„Έμš”. μ„€λͺ…을 μ œκ³΅ν•œ ν›„ λ‹€μŒ ν˜•μ‹μ„ μ—„κ²©νžˆ 따라 1μ—μ„œ 10점 μ‚¬μ΄λ‘œ 평가해야 ν•©λ‹ˆλ‹€: \"[[rating]]\", 예λ₯Ό λ“€μ–΄: \"Rating: [[5]]\".\n\n[Question]\n{question}\n\n[μ–΄μ‹œμŠ€ν„΄νŠΈ λ‹΅λ³€μ˜ μ‹œμž‘]\n{answer}\n[μ–΄μ‹œμŠ€ν„΄νŠΈ λ‹΅λ³€μ˜ 끝]"

conversation = [
    {"role": "system", "content": ""},
    # `prompt` is an f-string, so the question and answer are already filled in;
    # calling .format() on it again is unnecessary.
    {"role": "user", "content": prompt},
]

formatted_conversation = tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(formatted_conversation, return_tensors="pt", add_special_tokens=False)
inputs = {key: tensor.to(model.device) for key, tensor in inputs.items()}

with torch.no_grad():
    # Generate the output response based on the input tokens
    outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7)
    print(tokenizer.decode(
        outputs[0][inputs['input_ids'].size(1):], skip_special_tokens=True
    ))
이 응닡은 μ‚¬μš©μžμ˜ μš”μ²­μ— 잘 λΆ€ν•©ν•˜λ©°, μ•„μ΄μ–Έλ§¨μ˜ 페λ₯΄μ†Œλ‚˜λ₯Ό 잘 κ΅¬ν˜„ν•˜κ³  μžˆμŠ΅λ‹ˆλ‹€. 기술둜 무μž₯ν•œ 슈트λ₯Ό μž…κ³  ν•˜λŠ˜μ„ λ‚ μ•„λ‹€λ‹ˆλ©° 세상을 κ΅¬ν•˜λŠ” μ§œλ¦Ών•¨κ³Ό 보람, 그리고 재λŠ₯κ³Ό λŠ₯λ ₯을 λ°œνœ˜ν•˜μ—¬ μ•…λ‹Ήκ³Ό μ‹Έμš°κ³  μ •μ˜λ₯Ό μ‹€ν˜„ν•˜λŠ” 것에 λŒ€ν•œ μ„€λͺ…은 μ•„μ΄μ–Έλ§¨μ˜ 캐릭터λ₯Ό 잘 λ°˜μ˜ν•˜κ³  μžˆμŠ΅λ‹ˆλ‹€. λ˜ν•œ, 슈트 없이도 천재적인 λ‘λ‡Œμ™€ 카리슀마둜 세상에 긍정적인 영ν–₯을 λ―ΈμΉ˜λŠ” 것, μŠ€νƒ€ν¬ μΈλ”μŠ€νŠΈλ¦¬λ₯Ό 이끌고 ν˜μ‹ μ μΈ κΈ°μˆ μ„ κ°œλ°œν•˜λ©°, μ–΄λ²€μ ΈμŠ€μ˜ λ¦¬λ”λ‘œμ„œ νŒ€μ„ μ΄λ„λŠ” 것에 λŒ€ν•œ μ„€λͺ…도 μ•„μ΄μ–Έλ§¨μ˜ λ‹€μ–‘ν•œ 츑면을 잘 λ³΄μ—¬μ€λ‹ˆλ‹€. μ „λ°˜μ μœΌλ‘œ 응닡은 μœ μš©ν•˜κ³  관련성이 있으며, μ§ˆλ¬Έμ— λŒ€ν•œ 깊이 μžˆλŠ” 닡변을 μ œκ³΅ν•©λ‹ˆλ‹€.

Rating: [[9]]
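The judge's verdict ends in the strict "Rating: [[N]]" format, so the numeric score can be recovered with a regular expression. The helper below is a minimal sketch, not part of the released code; `output_text` stands for the decoded string printed above.

import re

def parse_rating(output_text):
    # Extract the score from the strict "Rating: [[N]]" format;
    # a missing match counts as a wrongly formatted answer.
    match = re.search(r"Rating:\s*\[\[(\d+)\]\]", output_text)
    return int(match.group(1)) if match else None

print(parse_rating("... Rating: [[9]]"))  # 9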

single-math-v1

Single-response evaluation for math questions. The judge compares the AI assistant's answer against a reference answer and scores its correctness and helpfulness.

question = "μ£Όμ‚¬μœ„ 두 개λ₯Ό ꡴릴 λ•Œ 총 μˆ«μžκ°€ 3 이상이 λ‚˜μ˜¬ ν™•λ₯ μ€ μ–Όλ§ˆμž…λ‹ˆκΉŒ?"
ref_answer_1 = "μ£Όμ‚¬μœ„ 두 개λ₯Ό ꡴릴 λ•Œ 총 μˆ«μžκ°€ 3 이상이 λ‚˜μ˜¬ ν™•λ₯ μ„ 계산해 λ³΄κ² μŠ΅λ‹ˆλ‹€.\n\nλ¨Όμ €, μ£Όμ‚¬μœ„ 두 개λ₯Ό ꡴릴 λ•Œ λ‚˜μ˜¬ 수 μžˆλŠ” λͺ¨λ“  경우의 μˆ˜λŠ” 6 * 6 = 36κ°€μ§€μž…λ‹ˆλ‹€.\n\n총 μˆ«μžκ°€ 3 이상이 λ˜λŠ” 경우λ₯Ό 계산해 λ³΄κ² μŠ΅λ‹ˆλ‹€. μ£Όμ‚¬μœ„ 두 개의 합이 3 미만인 κ²½μš°λŠ” λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€:\n1. 합이 2인 경우: (1, 1)\n\nλ”°λΌμ„œ, 합이 3 미만인 κ²½μš°λŠ” 1κ°€μ§€μž…λ‹ˆλ‹€.\n\n이제, 총 μˆ«μžκ°€ 3 이상이 λ˜λŠ” 경우의 μˆ˜λŠ” 36 - 1 = 35κ°€μ§€μž…λ‹ˆλ‹€.\n\nλ”°λΌμ„œ, μ£Όμ‚¬μœ„ 두 개λ₯Ό ꡴릴 λ•Œ 총 μˆ«μžκ°€ 3 이상이 λ‚˜μ˜¬ ν™•λ₯ μ€ 35/36μž…λ‹ˆλ‹€."
answer = "μ£Όμ‚¬μœ„ 두 개λ₯Ό ꡴릴 λ•Œ 총 μˆ«μžκ°€ 3 이상이 λ‚˜μ˜¬ ν™•λ₯ μ€ 거의 ν•­μƒμž…λ‹ˆλ‹€. εͺζœ‰ε½“ 두 μ£Όμ‚¬μœ„κ°€ λͺ¨λ‘ 1이 λ‚˜μ˜¬ λ•Œλ§Œ 3 미만이 λ©λ‹ˆλ‹€. λ”°λΌμ„œ ν™•λ₯ μ€ 35/36, 즉 거의 100%μž…λ‹ˆλ‹€!"

prompt = f"[μ§€μ‹œ]\nκ³΅μ •ν•œ μ‹¬νŒμœΌλ‘œμ„œ μ•„λž˜μ— ν‘œμ‹œλœ μ‚¬μš©μž μ§ˆλ¬Έμ— λŒ€ν•œ AI μ–΄μ‹œμŠ€ν„΄νŠΈμ˜ 응닡 ν’ˆμ§ˆμ„ ν‰κ°€ν•΄μ£Όμ„Έμš”. 질문과 λŒ€λ‹΅μ˜ μ–Έμ–΄κ°€ λ™μΌν•˜μ§€ μ•ŠμœΌλ©΄ 무쑰건 0μ μž…λ‹ˆλ‹€. ν‰κ°€λŠ” μ •ν™•μ„±κ³Ό μœ μš©μ„±μ„ κ³ λ €ν•΄μ•Ό ν•©λ‹ˆλ‹€. μ°Έκ³  λ‹΅λ³€κ³Ό μ–΄μ‹œμŠ€ν„΄νŠΈμ˜ 닡변이 제곡될 κ²ƒμž…λ‹ˆλ‹€. 평가λ₯Ό μ‹œμž‘ν•˜κΈ° μœ„ν•΄ μ–΄μ‹œμŠ€ν„΄νŠΈμ˜ 닡변을 μ°Έκ³  λ‹΅λ³€κ³Ό λΉ„κ΅ν•˜μ„Έμš”. 각 λ‹΅λ³€μ˜ μ‹€μˆ˜λ₯Ό μ‹λ³„ν•˜κ³  μˆ˜μ •ν•˜μ„Έμš”. κ°€λŠ₯ν•œ ν•œ κ°κ΄€μ μœΌλ‘œ ν‰κ°€ν•˜μ„Έμš”. μ„€λͺ…을 μ œκ³΅ν•œ ν›„ λ‹€μŒ ν˜•μ‹μ„ μ—„κ²©νžˆ 따라 응닡을 1μ μ—μ„œ 10점 μ‚¬μ΄λ‘œ 평가해야 ν•©λ‹ˆλ‹€: \"[[rating]]\", 예λ₯Ό λ“€μ–΄: \"Rating: [[5]]\".\n\n[질문]\n{question}\n\n[μ°Έμ‘° λ‹΅λ³€μ˜ μ‹œμž‘]\n{ref_answer_1}\n[μ°Έμ‘° λ‹΅λ³€μ˜ 끝]\n\n[μ–΄μ‹œμŠ€ν„΄νŠΈ λ‹΅λ³€μ˜ μ‹œμž‘]\n{answer}\n[μ–΄μ‹œμŠ€ν„΄νŠΈ λ‹΅λ³€μ˜ 끝]"

conversation = [
    {"role": "system", "content": ""},
    # As above, `prompt` is an f-string that already contains the question,
    # reference answer, and assistant answer; no .format() call is needed.
    {"role": "user", "content": prompt},
]

formatted_conversation = tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(formatted_conversation, return_tensors="pt", add_special_tokens=False)
inputs = {key: tensor.to(model.device) for key, tensor in inputs.items()}

with torch.no_grad():
    # Generate the output response based on the input tokens
    outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7)
    print(tokenizer.decode(
        outputs[0][inputs['input_ids'].size(1):], skip_special_tokens=True
    ))
μ–΄μ‹œμŠ€ν„΄νŠΈμ˜ 닡변은 μ§ˆλ¬Έμ— λŒ€ν•œ μ •ν™•ν•œ 계산을 μ œκ³΅ν•˜μ§€ λͺ»ν–ˆμŠ΅λ‹ˆλ‹€. μ£Όμ‚¬μœ„ 두 개λ₯Ό ꡴릴 λ•Œ 총 μˆ«μžκ°€ 3 이상이 λ‚˜μ˜¬ ν™•λ₯ μ„ κ³„μ‚°ν•˜λŠ” κ³Όμ •μ—μ„œ 잘λͺ»λœ μ„€λͺ…을 μ œκ³΅ν–ˆμŠ΅λ‹ˆλ‹€. 

μ°Έμ‘° 닡변은 μ£Όμ‚¬μœ„ 두 개λ₯Ό ꡴릴 λ•Œ λ‚˜μ˜¬ 수 μžˆλŠ” λͺ¨λ“  경우의 수λ₯Ό μ •ν™•νžˆ κ³„μ‚°ν•˜κ³ , 총 μˆ«μžκ°€ 3 이상이 λ˜λŠ” 경우의 수λ₯Ό μ˜¬λ°”λ₯΄κ²Œ κ΅¬ν•˜μ—¬ ν™•λ₯ μ„ κ³„μ‚°ν–ˆμŠ΅λ‹ˆλ‹€. 반면, μ–΄μ‹œμŠ€ν„΄νŠΈμ˜ 닡변은 잘λͺ»λœ μ„€λͺ…을 μ œκ³΅ν•˜μ—¬ μ •ν™•ν•œ 계산을 λ°©ν•΄ν–ˆμŠ΅λ‹ˆλ‹€.

μ–΄μ‹œμŠ€ν„΄νŠΈμ˜ λ‹΅λ³€μ—μ„œμ˜ μ£Όμš” μ‹€μˆ˜:
1. "거의 항상"μ΄λΌλŠ” ν‘œν˜„μ€ ν™•λ₯ μ„ λͺ…ν™•νžˆ μ„€λͺ…ν•˜μ§€ λͺ»ν•©λ‹ˆλ‹€.
2. "εͺζœ‰ε½“"μ΄λΌλŠ” 쀑ꡭ어가 ν¬ν•¨λ˜μ–΄ μžˆμ–΄ 질문의 언어와 μΌμΉ˜ν•˜μ§€ μ•ŠμŠ΅λ‹ˆλ‹€.
3. 총 μˆ«μžκ°€ 3 미만이 λ˜λŠ” 경우의 수λ₯Ό 잘λͺ» κ³„μ‚°ν–ˆμŠ΅λ‹ˆλ‹€.

λ”°λΌμ„œ, μ–΄μ‹œμŠ€ν„΄νŠΈμ˜ 닡변은 μ •ν™•μ„±κ³Ό μœ μš©μ„± λͺ¨λ‘μ—μ„œ λΆ€μ‘±ν•©λ‹ˆλ‹€.

Rating: [[0]]
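To score a whole benchmark run, the same pattern can be looped over every test item. The sketch below is hypothetical, not the released evaluation tooling: the file name, field names, and `run_judge` helper are assumptions.

import json

# Minimal scoring driver, assuming:
#   - ko_bench_test.jsonl (hypothetical file): one {"question", "answer", "label"} object per line
#   - run_judge(question, answer) (hypothetical): repeats the generation snippet above
#   - parse_rating(): the helper sketched earlier
labels, preds = [], []
with open("ko_bench_test.jsonl") as f:
    for line in f:
        item = json.loads(line)
        verdict = run_judge(item["question"], item["answer"])
        labels.append(item["label"])
        preds.append(parse_rating(verdict))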

Evaluation

Diff

Diff measures the gap between the label score and the predicted score, summarized as a single score. The wrong column counts answers that do not match the required rating format, and length is the total number of test items. The columns labeled 0 through 10 report the count and percentage of items whose label and predicted scores differ by that amount.

The score is calculated by:

  1. Taking the absolute difference between the label and the predicted score for each pair.
  2. Assigning a full point for a difference of 0 and half a point for a difference of 1.
  3. Dividing the sum of all points by the total number of data points, as in the sketch below.
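A minimal sketch of this calculation, assuming `labels` and `preds` are equal-length lists of integer scores (for example, those collected by the driver above):

def diff_score(labels, preds):
    points = 0.0
    for label, pred in zip(labels, preds):
        diff = abs(label - pred)
        if diff == 0:
            points += 1.0   # exact match: full point
        elif diff == 1:
            points += 0.5   # off by one: half a point
    return points / len(labels)

# Example from the table below: 11 exact matches and 5 off-by-one out of
# 22 items gives (11 + 2.5) / 22 = 61.4%, the keval-2-9b diff score.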
| model | wrong | score | length | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|-------|-------|-------|--------|---|---|---|---|---|---|---|---|---|---|----|
| keval-2-9b | 0 (0.0%) | 61.4% | 22 | 11 (50.0%) | 5 (22.7%) | 2 (9.1%) | 3 (13.6%) | 0 | 0 | 0 | 0 | 0 | 0 | 1 (4.5%) |
| keval-2-3b | 0 (0.0%) | 59.1% | 22 | 10 (45.5%) | 6 (27.3%) | 4 (18.2%) | 2 (9.1%) | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| keval-2-1b | 0 (0.0%) | 43.2% | 22 | 8 (36.4%) | 3 (13.6%) | 5 (22.7%) | 2 (9.1%) | 1 (4.5%) | 0 | 1 (4.5%) | 0 | 0 | 0 | 2 (9.1%) |

Accuracy

The score column is the ratio of exactly correct predictions to the total number of data points. The wrong column shows the count and percentage of incorrectly formatted answers. The columns labeled 0 through 10 report, for each label value, how many items with that label the model predicted correctly, with the corresponding percentage.

| model | wrong | score | length | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|-------|-------|-------|--------|---|---|---|---|---|---|---|---|---|---|----|
| keval-2-9b | 0 (0.0%) | 50.0% | 22 | 1 (50.0%) | 1 (50.0%) | 2 (100.0%) | 0 | 2 (100.0%) | 0 | 0 | 1 (50.0%) | 1 (50.0%) | 1 (50.0%) | 2 (100.0%) |
| keval-2-3b | 0 (0.0%) | 45.5% | 22 | 2 (100.0%) | 1 (50.0%) | 0 | 0 | 2 (100.0%) | 1 (50.0%) | 0 | 1 (50.0%) | 1 (50.0%) | 0 | 2 (100.0%) |
| keval-2-1b | 0 (0.0%) | 36.4% | 22 | 0 | 1 (50.0%) | 2 (100.0%) | 0 | 1 (50.0%) | 0 | 1 (50.0%) | 0 | 0 | 1 (50.0%) | 2 (100.0%) |
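Under the same assumptions as the diff-score sketch above, the accuracy score is simply the share of exact matches:

def accuracy_score(labels, preds):
    # Count predictions that exactly equal the label, then normalize.
    correct = sum(1 for label, pred in zip(labels, preds) if label == pred)
    return correct / len(labels)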