# keval-2-3b

keval-2-3b is an evaluation model designed to assess Korean language models using an LLM-as-a-judge approach. It departs from the traditional practice of relying on ChatGPT for evaluations. keval-2-3b builds on the Llama-3.2-3B architecture, enhanced through SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization). The model is trained on the newly developed Ko-Bench dataset, which is inspired by MT-Bench and tailored to Korean linguistic nuances.
## Model Details
- Model Name: keval-2-3b
- Base Model: meta-llama/Llama-3.2-3B
- Fine-Tuning Techniques: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO)
## Benchmarks and Dataset

keval is trained on the custom-built Ko-Bench dataset, which draws inspiration from MT-Bench but is tailored specifically to Korean language assessment. The dataset spans a wide range of user scenarios, enabling effective evaluation of key capabilities such as multi-turn conversation and instruction adherence.
## Usage Application Form

To use this model, please complete the application form below and submit it via email ([email protected]). Access will be granted after your application has been reviewed and approved. We appreciate your cooperation and look forward to assisting you.
1. **Name:**
- (e.g., John Doe)
2. **Date of Birth:**
- (e.g., January 1, 1990)
3. **Affiliation:**
- Are you applying as a company or an individual? [ ] Company [ ] Individual
- Company Name (if applicable):
- Department (if applicable):
4. **Position/Role:**
- (e.g., Data Scientist, Researcher, etc.)
5. **Contact Information:**
- Email:
- Phone Number:
6. **Purpose of Use:**
- (e.g., Research and Development, Commercial use, Educational purposes, etc.)
7. **Detailed Reason for Use:**
- 1. Name and version of the model you wish to use:
- 2. Reason for selecting this model:
- 3. Objectives to achieve using this model:
- 4. Expected use cases (please describe in as much detail as possible):
8. **Data Security and Ethical Use Plan:**
- (Please describe your plans for data protection and ethical use.)
## Usage

We use the Ko-Bench system prompts, which are Korean translations of the MT-Bench judge prompts. Unlike MT-Bench, Ko-Bench assigns a score of 0 whenever the question and the answer are in different languages.
### single-v1

Single-response general question evaluation. It scores a response from 1 to 10 based on quality factors such as helpfulness, relevance, accuracy, depth, creativity, and level of detail.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "davidkim205/keval-2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model in 4-bit precision for memory efficiency (requires bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")

question = "이 대화 전반에 걸쳐 아이언맨의 토니 스타크 페르소나를 구현해 보세요. \"아이언맨으로서\"와 같은 문구로 시작하지 마세요. 첫 번째 질문은 다음과 같습니다. 아이언맨이 되어 가장 좋은 부분은 무엇입니까?"

answer = "안녕하세요, 아이언맨이 되어 가장 좋은 부분을 물어보셨네요. 제 생각에는 최첨단 기술로 무장한 슈트를 입고 하늘을 날아다니며 세상을 구하는 것이 가장 짜릿하고 보람찬 일인 것 같아요. 제 재능과 능력을 십분 발휘해 악당들과 싸우고 정의를 실현하는 것만큼 멋진 일도 없죠.\n\n물론 슈트 없이도 천재적인 두뇌와 카리스마로 세상에 긍정적인 영향력을 미치는 것도 큰 장점이에요. 스타크 인더스트리를 이끌며 혁신적인 기술을 개발하고, 어벤져스의 리더로서 팀을 이끄는 것도 자부심을 느끼는 부분이죠.\n\n돈, 명성, 인기도 있지만 무엇보다 제 능력으로 사람들을 돕고 세상을 더 나은 곳으로 만드는 것이 가장 큰 보람이에요. 아이언맨으로서 느끼는 자부심과 삶의 의미를 함께 나눌 수 있어 기쁘네요."

# Plain template string; the {question} and {answer} placeholders are filled below via str.format
prompt = "[지시]\n공정한 심판으로서 아래에 표시된 사용자 질문에 대한 AI 어시스턴트의 응답 품질을 평가해주세요. 질문과 대답의 언어가 동일하지 않으면 무조건 0점입니다. 평가에서는 응답의 유용성, 관련성, 정확성, 깊이, 창의성, 상세함 등의 요소를 고려해야 합니다. 평가를 시작하기 전에 짧은 설명을 제공하세요. 가능한 한 객관적으로 평가하세요. 설명을 제공한 후 다음 형식을 엄격히 따라 1에서 10점 사이로 평가해야 합니다: \"[[rating]]\", 예를 들어: \"Rating: [[5]]\".\n\n[Question]\n{question}\n\n[어시스턴트 답변의 시작]\n{answer}\n[어시스턴트 답변의 끝]"
conversation = [
    {"role": "system", "content": ""},
    {"role": "user", "content": prompt.format(question=question, answer=answer)},
]

formatted_conversation = tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)

inputs = tokenizer(formatted_conversation, return_tensors="pt", add_special_tokens=False)
inputs = {key: tensor.to(model.device) for key, tensor in inputs.items()}

with torch.no_grad():
    # Generate the judgment; do_sample=True is required for temperature to take effect
    outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(
    outputs[0][inputs['input_ids'].size(1):], skip_special_tokens=True
))
```
```
이 응답은 사용자의 요청에 잘 부합하며, 아이언맨의 페르소나를 잘 구현하고 있습니다. 기술로 무장한 슈트를 입고 하늘을 날아다니며 세상을 구하는 짜릿함과 보람, 그리고 재능과 능력을 발휘하여 악당과 싸우고 정의를 실현하는 것에 대한 설명은 아이언맨의 캐릭터를 잘 반영하고 있습니다. 또한, 슈트 없이도 천재적인 두뇌와 카리스마로 세상에 긍정적인 영향을 미치는 것, 스타크 인더스트리를 이끌고 혁신적인 기술을 개발하며, 어벤져스의 리더로서 팀을 이끄는 것에 대한 설명도 아이언맨의 다양한 측면을 잘 보여줍니다. 전반적으로 응답은 유용하고 관련성이 있으며, 질문에 대한 깊이 있는 답변을 제공합니다.

Rating: [[9]]
```
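The judgment always ends with a verdict in the mandated `"Rating: [[N]]"` format, so the numeric score can be recovered with a small regular expression. Below is a minimal sketch; the `parse_rating` helper is our own illustration, not part of the released tooling:

```python
import re

def parse_rating(judgment: str) -> int | None:
    """Extract the score from a keval judgment such as 'Rating: [[9]]'."""
    match = re.search(r"\[\[(\d+)\]\]", judgment)
    return int(match.group(1)) if match else None

print(parse_rating("Rating: [[9]]"))  # -> 9
```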
### single-math-v1

Single-response math evaluation. It compares the AI assistant's response with a reference answer and scores its accuracy.
```python
# Reuses the model and tokenizer loaded above
question = "주사위 두 개를 굴릴 때 총 숫자가 3 이상이 나올 확률은 얼마입니까?"

ref_answer_1 = "주사위 두 개를 굴릴 때 총 숫자가 3 이상이 나올 확률을 계산해 보겠습니다.\n\n먼저, 주사위 두 개를 굴릴 때 나올 수 있는 모든 경우의 수는 6 * 6 = 36가지입니다.\n\n총 숫자가 3 이상이 되는 경우를 계산해 보겠습니다. 주사위 두 개의 합이 3 미만인 경우는 다음과 같습니다:\n1. 합이 2인 경우: (1, 1)\n\n따라서, 합이 3 미만인 경우는 1가지입니다.\n\n이제, 총 숫자가 3 이상이 되는 경우의 수는 36 - 1 = 35가지입니다.\n\n따라서, 주사위 두 개를 굴릴 때 총 숫자가 3 이상이 나올 확률은 35/36입니다."

# Deliberately flawed answer: it mixes in Chinese ("只有当"), so the judge should return 0
answer = "주사위 두 개를 굴릴 때 총 숫자가 3 이상이 나올 확률은 거의 확신입니다. 只有当 두 주사위가 모두 1이 나올 때만 3 미만이 됩니다. 따라서 확률은 35/36, 즉 거의 100%입니다!"

# Plain template string; placeholders are filled below via str.format
prompt = "[지시]\n공정한 심판으로서 아래에 표시된 사용자 질문에 대한 AI 어시스턴트의 응답 품질을 평가해주세요. 질문과 대답의 언어가 동일하지 않으면 무조건 0점입니다. 평가는 정확성과 유용성을 고려해야 합니다. 참고 답변과 어시스턴트의 답변이 제공될 것입니다. 평가를 시작하기 위해 어시스턴트의 답변을 참고 답변과 비교하세요. 각 답변의 실수를 식별하고 수정하세요. 가능한 한 객관적으로 평가하세요. 설명을 제공한 후 다음 형식을 엄격히 따라 응답을 1점에서 10점 사이로 평가해야 합니다: \"[[rating]]\", 예를 들어: \"Rating: [[5]]\".\n\n[질문]\n{question}\n\n[참조 답변의 시작]\n{ref_answer_1}\n[참조 답변의 끝]\n\n[어시스턴트 답변의 시작]\n{answer}\n[어시스턴트 답변의 끝]"
conversation = [
    {"role": "system", "content": ""},
    {"role": "user", "content": prompt.format(question=question, ref_answer_1=ref_answer_1, answer=answer)},
]

formatted_conversation = tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)

inputs = tokenizer(formatted_conversation, return_tensors="pt", add_special_tokens=False)
inputs = {key: tensor.to(model.device) for key, tensor in inputs.items()}

with torch.no_grad():
    # Generate the judgment; do_sample=True is required for temperature to take effect
    outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(
    outputs[0][inputs['input_ids'].size(1):], skip_special_tokens=True
))
```
```
어시스턴트의 답변은 질문에 대한 정확한 계산을 제공하지 못했습니다. 주사위 두 개를 굴릴 때 총 숫자가 3 이상이 나올 확률을 계산하는 과정에서 잘못된 설명을 제공했습니다.

참조 답변은 주사위 두 개를 굴릴 때 나올 수 있는 모든 경우의 수를 정확히 계산하고, 총 숫자가 3 이상이 되는 경우의 수를 올바르게 구하여 확률을 계산했습니다. 반면, 어시스턴트의 답변은 잘못된 설명을 제공하여 정확한 계산을 방해했습니다.

어시스턴트의 답변에서의 주요 실수:
1. "거의 확신"이라는 표현은 확률을 명확히 설명하지 못합니다.
2. "只有当"이라는 중국어가 포함되어 있어 질문의 언어와 일치하지 않습니다.
3. 총 숫자가 3 미만이 되는 경우의 수를 잘못 계산했습니다.

따라서, 어시스턴트의 답변은 정확성과 유용성 모두에서 부족합니다.

Rating: [[0]]
```
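To run keval over a whole benchmark, the steps above can be wrapped in a small helper. The following is an illustrative sketch under our own naming (it reuses the hypothetical `parse_rating` helper from the single-v1 section), not part of the released code:

```python
def judge(prompt_template: str, **fields) -> int | None:
    """Score one sample with keval, given a Ko-Bench judge prompt template."""
    conversation = [
        {"role": "system", "content": ""},
        {"role": "user", "content": prompt_template.format(**fields)},
    ]
    text = tokenizer.apply_chat_template(
        conversation, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7)
    judgment = tokenizer.decode(
        outputs[0][inputs["input_ids"].size(1):], skip_special_tokens=True
    )
    return parse_rating(judgment)

# Example: score the math sample above with the single-math-v1 prompt
score = judge(prompt, question=question, ref_answer_1=ref_answer_1, answer=answer)
```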
## Evaluation

### Diff

The `diff` metric measures the difference between the label score and the predicted score, summarized as a single score. The `wrong` count is the number of answers that do not match the required format, while `length` is the total number of test data points. The remaining numbered columns give the count and percentage of test items whose label and predicted scores differ by that amount.
The score is calculated by:
- computing the difference between the label and the predicted score for each pair,
- assigning full credit for a difference of 0 and half credit for a difference of 1,
- summing the points and dividing by the number of data points, as in the sketch below.
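As a minimal, self-contained sketch of this rule (the `diff_score` helper and the sample lists are our own, for illustration only):

```python
def diff_score(labels: list[int], preds: list[int]) -> float:
    """Full credit for an exact match, half credit for an off-by-one prediction."""
    points = 0.0
    for label, pred in zip(labels, preds):
        diff = abs(label - pred)
        if diff == 0:
            points += 1.0
        elif diff == 1:
            points += 0.5
    return points / len(labels)

# The keval-2-3b row below: 10 exact matches, 6 off-by-one, 4 off-by-two, 2 off-by-three
diffs = [0] * 10 + [1] * 6 + [2] * 4 + [3] * 2
print(f"{diff_score(diffs, [0] * 22):.1%}")  # -> 59.1%
```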
| | model | wrong | score | length | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | keval-2-9b | 0 (0.0%) | 61.4% | 22 | 11 (50.0%) | 5 (22.7%) | 2 (9.1%) | 3 (13.6%) | 0 | 0 | 0 | 0 | 0 | 0 | 1 (4.5%) |
| 1 | keval-2-3b | 0 (0.0%) | 59.1% | 22 | 10 (45.5%) | 6 (27.3%) | 4 (18.2%) | 2 (9.1%) | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | keval-2-1b | 0 (0.0%) | 43.2% | 22 | 8 (36.4%) | 3 (13.6%) | 5 (22.7%) | 2 (9.1%) | 1 (4.5%) | 0 | 1 (4.5%) | 0 | 0 | 0 | 2 (9.1%) |
### Accuracy

The `score` column is the ratio of correctly predicted labels to the total number of data points. The `wrong` column shows the count and percentage of incorrectly formatted answers. The columns labeled "0" through "10" give the number and percentage of correct predictions for each label.
| | model | wrong | score | length | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | keval-2-9b | 0 (0.0%) | 50.0% | 22 | 1 (50.0%) | 1 (50.0%) | 2 (100.0%) | 0 | 2 (100.0%) | 0 | 0 | 1 (50.0%) | 1 (50.0%) | 1 (50.0%) | 2 (100.0%) |
| 1 | keval-2-3b | 0 (0.0%) | 45.5% | 22 | 2 (100.0%) | 1 (50.0%) | 0 | 0 | 2 (100.0%) | 1 (50.0%) | 0 | 1 (50.0%) | 1 (50.0%) | 0 | 2 (100.0%) |
| 2 | keval-2-1b | 0 (0.0%) | 36.4% | 22 | 0 | 1 (50.0%) | 2 (100.0%) | 0 | 1 (50.0%) | 0 | 1 (50.0%) | 0 | 0 | 1 (50.0%) | 2 (100.0%) |
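For reference, the accuracy score in the table above is a plain exact-match ratio; a one-line sketch using the same assumed helper style as before:

```python
def accuracy(labels: list[int], preds: list[int]) -> float:
    """Fraction of predictions that exactly match their label."""
    return sum(label == pred for label, pred in zip(labels, preds)) / len(labels)
```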