JinaJudge: Proxy Judgement for Russian LLM Arena

Description

This model is trained to replicate the judgement patterns of GPT-4-1106-Preview in the Russian LLM Arena, which makes evaluation of language models faster and more cost-effective. Although the model focuses on Russian LLM evaluation, it can also be used for English-centric models.


Model Details

This is a small upgrade to the kaleinaNyan/jina-v3-rullmarena-judge model:

  • Number of decoder blocks increased from 4 to 5.
  • Hidden activation dimensionality reduced from 1024 to 512 (via a projection layer after the JINA encoder).
  • The resulting model size went from 614M params to 589M params.
  • I also tweaked some training hyperparameters, but the training data composition is the same.

Surprisingly, these changes gave a tangible performance improvement, so I decided to upload the model. Evaluation on the train set showed that the previous model was not expressive enough.


Evaluation

Validation used existing judgements from the Russian LLM Arena. These judgements were filtered and simplified to match the three-class structure used in training.
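
The simplification step can be pictured as a label mapping. The sketch below assumes the stored judgements use five-level verdict strings (`A>>B`, `A>B`, `A=B`, `B>A`, `B>>A`); that format, `VERDICT_TO_CLASS`, and `simplify_verdicts` are illustrative assumptions and are not taken from the actual pipeline. Verdicts that do not parse are dropped, which stands in for the filtering.

```python
# Hypothetical mapping from five-level verdicts to the three classes
# used by the judge (0: A better, 1: tie, 2: B better).
VERDICT_TO_CLASS = {
    "A>>B": 0,
    "A>B": 0,
    "A=B": 1,
    "B>A": 2,
    "B>>A": 2,
}


def simplify_verdicts(verdicts):
    """Drop unparseable verdicts and collapse the rest to three classes."""
    labels = []
    for verdict in verdicts:
        # Non-string entries (e.g. None for a failed judgement) are skipped.
        key = verdict.strip() if isinstance(verdict, str) else None
        if key in VERDICT_TO_CLASS:
            labels.append(VERDICT_TO_CLASS[key])
    return labels


print(simplify_verdicts(["A>>B", "A=B", None, "B>A", "garbage"]))  # [0, 1, 2]
```

The class ids match the `judgement_map` in the usage example, so the resulting labels can be compared directly with the model's argmax output.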

NOTE: values in parentheses show the improvement over the previous model.

Models evaluated:

  • gemma-2-9b-it-sppo-iter3
  • glm-4-9b-chat
  • gpt-3.5-turbo-1106
  • mistral-7b-instruct-v0.3
  • storm-7b

Validation Performance:

  • Accuracy: 80.76% (+2.67)
  • Precision: 78.56% (+2.74)
  • Recall: 79.48% (+2.71)
  • F1-score: 79.00% (+2.73)

For the test phase, new judgements were generated using GPT-4 for the kolibri-mistral-0427-upd model.

Test Performance:

  • Accuracy: 82.72% (+2.64)
  • Precision: 80.11% (+3.43)
  • Recall: 82.42% (+4.69)
  • F1-score: 81.18% (+4.10)
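
The card does not say how precision, recall, and F1 are averaged over the three classes. The sketch below assumes macro averaging (each class weighted equally), with F1 computed per class and then averaged; `macro_metrics` is a hypothetical helper, not the author's evaluation code.

```python
def macro_metrics(y_true, y_pred, num_classes=3):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # A class with no predictions (or no true samples) contributes 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / num_classes,
        "recall": sum(recalls) / num_classes,
        "f1": sum(f1s) / num_classes,
    }


# Toy example: reference labels vs. proxy-judge predictions.
print(macro_metrics([0, 1, 2, 2], [0, 1, 2, 0]))
```

Under this assumption the reported F1 need not equal the harmonic mean of the reported precision and recall, because it is an average of per-class F1 values.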

Usage Example

from transformers import AutoModel

jina = AutoModel.from_pretrained("kaleinaNyan/jina-v3-rullmarena-judge-300924", trust_remote_code=True)

prompt_template = """
<user prompt>
{user_prompt}
<end>
<assistant A answer>
{assistant_a}
<end>
<assistant B answer>
{assistant_b}
<end>
""".strip()

user_prompt = "your prompt"
assistant_a = "assistant a response"
assistant_b = "assistant b response"

example = prompt_template.format(
    user_prompt=user_prompt,
    assistant_a=assistant_a,
    assistant_b=assistant_b,
)

# .item() turns the tensor index into a plain int so the dict lookup below works
judgement = jina([example])[0].argmax().item()

judgement_map = {
  0: "A is better than B",
  1: "A == B",
  2: "B is better than A"
}

print(judgement_map[judgement])
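
To judge several pairs in one call, the same template can be applied to a list of triples. The sketch below is a hypothetical wrapper: `judge_pairs`, `score_fn`, and `fake_scores` are names introduced here for illustration. It assumes the model returns one row of three scores per input string, as the indexing in the example above suggests; with the real model, `score_fn` could be something like `lambda xs: jina(xs).tolist()`, which has not been verified here.

```python
PROMPT_TEMPLATE = """
<user prompt>
{user_prompt}
<end>
<assistant A answer>
{assistant_a}
<end>
<assistant B answer>
{assistant_b}
<end>
""".strip()

JUDGEMENT_MAP = {0: "A is better than B", 1: "A == B", 2: "B is better than A"}


def judge_pairs(score_fn, pairs):
    """Format (user_prompt, assistant_a, assistant_b) triples and label each one.

    score_fn is any callable that takes a list of strings and returns one
    sequence of three scores per string.
    """
    examples = [
        PROMPT_TEMPLATE.format(user_prompt=u, assistant_a=a, assistant_b=b)
        for u, a, b in pairs
    ]
    labels = []
    for scores in score_fn(examples):
        best = max(range(len(scores)), key=lambda i: scores[i])  # argmax
        labels.append(JUDGEMENT_MAP[best])
    return labels


# Stub scorer so the sketch runs without downloading the model.
def fake_scores(examples):
    return [[0.1, 0.2, 0.7] for _ in examples]


print(judge_pairs(fake_scores, [("Q1", "answer A", "answer B")]))
```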

Generated ranking

The ranking was obtained using a modified Russian LLM Arena code. All judgements were regenerated using the jina-judge model.

| Model | Score | 95% CI | Average #Tokens |
|---|---|---|---|
| gpt-4-1106-preview | 81.6 | (-2.3, 3.0) | 541 |
| gpt-4o-mini | 76.0 | (-2.7, 2.4) | 448 |
| qwen-2.5-72b-it | 72.5 | (-3.6, 3.6) | 557 |
| gemma-2-9b-it-sppo-iter3 | 72.1 | (-3.7, 3.6) | 569 |
| gemma-2-27b-it | 71.1 | (-3.3, 3.2) | 482 |
| gemma-2-9b-it | 70.8 | (-3.4, 3.5) | 569 |
| t-lite-instruct-0.1 | 68.3 | (-3.8, 4.5) | 810 |
| suzume-llama-3-8b-multilingual-orpo | 62.9 | (-3.9, 4.0) | 682 |
| glm-4-9b-chat | 60.5 | (-3.9, 4.0) | 516 |
| sfr-iterative-dpo-llama-3-8b-r | 59.9 | (-4.0, 4.3) | 682 |
| c4ai-command-r-v01 | 56.9 | (-4.2, 3.8) | 516 |
| phi-3-medium-4k-instruct | 56.4 | (-2.8, 3.3) | 566 |
| mistral-nemo-instruct-2407 | 56.1 | (-2.9, 3.4) | 682 |
| yandex_gpt_pro | 51.7 | (-3.4, 3.4) | 345 |
| suzume-llama-3-8b-multilingual | 51.3 | (-3.4, 4.0) | 489 |
| hermes-2-theta-llama-3-8b | 50.9 | (-3.2, 3.4) | 485 |
| starling-lm-7b-beta | 50.2 | (-3.3, 3.4) | 495 |
| gpt-3.5-turbo-0125 | 50.0 | (0.0, 0.0) | 220 |
| llama-3-instruct-8b-sppo-iter3 | 49.8 | (-3.4, 4.0) | 763 |
| llama-3-8b-saiga-suzume-ties | 48.2 | (-4.1, 3.9) | 569 |
| llama-3-smaug-8b | 46.6 | (-3.9, 3.8) | 763 |
| vikhr-it-5.4-fp16-orpo-v2 | 46.6 | (-3.7, 4.0) | 379 |
| aya-23-8b | 46.3 | (-3.8, 3.9) | 571 |
| saiga-llama3-8b_v6 | 45.5 | (-3.8, 3.9) | 471 |
| vikhr-it-5.2-fp16-cp | 43.8 | (-3.9, 4.0) | 543 |
| qwen2-7b-instruct | 43.7 | (-2.5, 2.7) | 492 |
| openchat-3.5-0106 | 43.4 | (-3.3, 3.7) | 485 |
| gpt-3.5-turbo-1106 | 41.7 | (-2.9, 3.5) | 220 |
| kolibri-mistral-0427-upd | 41.5 | (-3.2, 3.5) | 551 |
| paralex-llama-3-8b-sft | 40.6 | (-3.8, 3.3) | 688 |
| mistral-7b-instruct-v0.3 | 40.3 | (-3.3, 3.4) | 469 |
| llama-3-instruct-8b-simpo | 40.2 | (-2.9, 3.7) | 551 |
| gigachat_pro | 40.2 | (-3.2, 3.5) | 294 |
| hermes-2-pro-llama-3-8b | 39.5 | (-2.9, 3.4) | 689 |
| vikhr-it-5.3-fp16-32k | 39.5 | (-2.8, 3.2) | 519 |
| openchat-3.6-8b-20240522 | 37.7 | (-3.3, 3.7) | 409 |
| meta-llama-3-8b-instruct | 37.5 | (-3.1, 3.5) | 450 |
| kolibri-vikhr-mistral-0427 | 37.1 | (-3.1, 3.8) | 488 |
| neural-chat-v3.3 | 36.5 | (-2.7, 3.6) | 523 |
| vikhr-it-5.1-fp16 | 36.4 | (-3.5, 3.5) | 448 |
| gigachat-lite | 36.0 | (-2.8, 3.0) | 523 |
| saiga-7b | 25.9 | (-3.1, 3.7) | 927 |
| storm-7b | 25.1 | (-3.6, 4.1) | 419 |
| snorkel-mistral-pairrm-dpo | 16.5 | (-3.8, 3.2) | 773 |
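
The card does not describe how the arena code turns judgements into a score and a confidence interval. As a rough intuition only, the sketch below computes a win rate against a baseline (ties count as half a win) with a percentile bootstrap interval reported as offsets from the score, which is the shape of the "95% CI" column. This is a simplified stand-in, not the arena's scoring procedure; `win_rate` and `bootstrap_ci` are hypothetical names.

```python
import random


def win_rate(judgements):
    """Score in [0, 100]: a win counts 1, a tie 0.5, a loss 0.

    judgements holds class ids with the evaluated model as assistant A
    and the baseline as assistant B (0: A better, 1: tie, 2: B better).
    """
    points = {0: 1.0, 1: 0.5, 2: 0.0}
    return 100.0 * sum(points[j] for j in judgements) / len(judgements)


def bootstrap_ci(judgements, n_rounds=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, reported as offsets from the score."""
    rng = random.Random(seed)
    score = win_rate(judgements)
    # Resample the judgements with replacement and rescore each resample.
    samples = sorted(
        win_rate(rng.choices(judgements, k=len(judgements)))
        for _ in range(n_rounds)
    )
    lower = samples[int((alpha / 2) * n_rounds)]
    upper = samples[int((1 - alpha / 2) * n_rounds) - 1]
    return score, (lower - score, upper - score)


# Toy data: 60 wins, 10 ties, 30 losses against the baseline.
score, offsets = bootstrap_ci([0] * 60 + [1] * 10 + [2] * 30)
print(score, offsets)
```

A model compared against itself would have no variation in its score, which is consistent with the baseline row showing 50.0 and a (0.0, 0.0) interval.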