---
language:
- ko
license: apache-2.0
library_name: transformers
tags:
- text-generation-inference
metrics:
- accuracy
- f1
- precision
- recall
pipeline_tag: text-classification
---

# EEVE-Korean-Instruct-10.8B-v1.0-Grade-Retrieval


## About the Model
This model is fine-tuned to judge whether the context retrieved for a question in a RAG (retrieval-augmented generation) pipeline contains enough information to answer that question. It replies with "예" (yes) or "아니오" (no).

The base model is [yanolja/EEVE-Korean-Instruct-10.8B-v1.0](https://huggingface.co/yanolja/EEVE-Korean-Instruct-10.8B-v1.0).

## Prompt Template
```
주어진 질문과 정보가 주어졌을 때 질문에 답하기에 충분한 정보인지 평가해줘.
정보가 충분한지를 평가하기 위해 "예" 또는 "아니오"로 답해줘. 

### 질문: 
{question}

### 정보: 
{context}

### 평가: 
```

In English, the prompt reads roughly: "Given a question and some information, evaluate whether the information is sufficient to answer the question. Answer "예" (yes) or "아니오" (no) to indicate whether the information is sufficient." The section labels are 질문 (question), 정보 (information), and 평가 (evaluation). Pass the Korean text to the model unchanged.

## How to Use It
```python
import torch
from transformers import (
    BitsAndBytesConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
)

model_path = "sinjy1203/EEVE-Korean-Instruct-10.8B-v1.0-Grade-Retrieval"
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, quantization_config=nf4_config, device_map={'': 'cuda:0'}
)

prompt_template = '주어진 질문과 정보가 주어졌을 때 질문에 답하기에 충분한 정보인지 평가해줘.\n정보가 충분한지를 평가하기 위해 "예" 또는 "아니오"로 답해줘.\n\n### 질문:\n{question}\n\n### 정보:\n{context}\n\n### 평가:\n'
query = {
    "question": "동아리 종강총회가 언제인가요?",
    "context": "종강총회 날짜는 6월 21일입니다."
}

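# Tokenize the filled-in prompt and move it to the same device as the model.
model_inputs = tokenizer(
    prompt_template.format_map(query), return_tensors='pt'
).to(model.device)
# max_new_tokens alone bounds the generation; also setting max_length conflicts with it.
output = model.generate(**model_inputs, max_new_tokens=100)
# Decode the generated token IDs back into text.
print(tokenizer.decode(output[0]))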
```

### Example Output
```
주어진 질문과 정보가 주어졌을 때 질문에 답하기에 충분한 정보인지 평가해줘.
정보가 충분한지를 평가하기 위해 "예" 또는 "아니오"로 답해줘.

### 질문:
동아리 종강총회가 언제인가요?

### 정보:
종강총회 날짜는 6월 21일입니다.

### 평가:
예<|end_of_text|>
```
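
To use the grade in a RAG pipeline, the decoded text usually needs to be reduced to a boolean. The following is a minimal sketch, assuming the decoded output has the shape shown above; `parse_grade` is a hypothetical helper and not part of this repository.

```python
from typing import Optional


def parse_grade(decoded: str, marker: str = "### 평가:") -> Optional[bool]:
    """Return True for "예", False for "아니오", None if neither is found.

    Hypothetical helper: assumes the answer follows the last evaluation marker.
    """
    # Keep only the text after the last evaluation marker.
    answer = decoded.rsplit(marker, 1)[-1]
    # Drop the end-of-text token and surrounding whitespace.
    answer = answer.replace("<|end_of_text|>", "").strip()
    if answer.startswith("아니오"):
        return False
    if answer.startswith("예"):
        return True
    return None
```

Returning `None` for unexpected text lets the caller decide whether to treat an unparseable answer as insufficient context or to retry.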

## Training Data
- The dataset was generated with reference to the instruction-generation approach of [stanford_alpaca](https://github.com/tatsu-lab/stanford_alpaca).
- [yanolja/EEVE-Korean-Instruct-10.8B-v1.0](https://huggingface.co/yanolja/EEVE-Korean-Instruct-10.8B-v1.0) was used as the model for question generation.

## Metrics

### Korean LLM Benchmark

|         Model                                   | Average |  Ko-ARC   | Ko-HellaSwag | Ko-MMLU | Ko-TruthfulQA | Ko-CommonGen V2|
|:-------------------------------:|:--------:|:-----:|:---------:|:------:|:------:|:------:|
| EEVE-Korean-Instruct-10.8B-v1.0                 | 56.08    | 55.2 | 66.11     | 56.48  | 49.14 | 53.48 |
| EEVE-Korean-Instruct-10.8B-v1.0-Grade-Retrieval | 56.1    | 55.55 | 65.95     | 56.24  | 48.66 | 54.07 |

### Generated Dataset

|         Model                                   | Accuracy |  F1   | Precision | Recall |
|:-------------------------------:|:--------:|:-----:|:---------:|:------:|
| EEVE-Korean-Instruct-10.8B-v1.0                 | 0.824    | 0.800 | 0.885     | 0.697  |
| EEVE-Korean-Instruct-10.8B-v1.0-Grade-Retrieval | 0.892    | 0.875 | 0.903     | 0.848  |
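
For readers who want to compute the same kinds of scores on their own labeled data, here is a minimal sketch of the standard binary definitions. It assumes "예" (sufficient) is the positive class and that labels and predictions are booleans; the model card does not state how the figures above were computed, so `binary_metrics` is illustrative only.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for boolean labels.

    Assumption: True means "예" (context is sufficient), the positive class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```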