Rank-R1-v0.2

The following changes are made in v0.2:

The base models were switched from Qwen2.5 to the Qwen3 series.
The training data was changed from Tevatron/msmarco-passage to Tevatron/reasonir-data-hn, which uses ReasonIR provided synthetic queries, positive documents, synthetic hard negative documents, and we further mined other hard negatives using BM25.
The number of documents in the prompt are varing from 2 to 10 during RL training. During inference, we use 10 documents in the prompt.
Some GRPO training hyperparameters changes.
The prompt was changed as shown below:

prompt_system = "You are RankLLM, an intelligent assistant capable of evaluating the relevancy of passages to a given query."

prompt_user = '''You will be presented with a query, and a set of documents.

Your task consists of the following step:

1. Analyze the query: Carefully read the query and identify the core problem or question being asked.

2. Analyze the documents: Thoroughly examine each document and briefly explain how each document is relevant or not relevant to the query.

3. Find the most relevant document: Based on your analysis, select the most relevant document to the query from the set and briefly explain why.

Important: Provide your analysis within the <think> </think> tags and answer only the label of the most relevant document, enclosed in square brackets, within the <answer> </answer> tags. For example, if the third document is the most relevant, your response should be:
<think> Your analysis here </think>
<answer>[3]</answer>

Here is the query: {query}

Here are the documents:
{docs}'''

BRIGHT results

Method	Bio.	Earth.	Econ.	Psy.	Rob.	Stack.	Sus.	Leet.	Pony	AoPS	TheoQ.	TheoT	Average
a. BM25+GPT4 CoT	53.6	54.1	24.3	38.7	18.9	27.7	26.3	19.3	17.6	3.9	19.2	20.8	27.0
b. ReasonIR+GPT4 CoT	43.6	42.9	32.7	38.8	20.9	25.8	27.5	31.5	19.6	7.4	33.1	35.7	29.9
c. a => Rank1-32B	49.7	35.8	22.0	37.5	22.5	21.7	35.0	18.8	32.5	10.8	22.9	43.7	29.4
d. a => Rank-K-32B	50.4	46.2	30.6	46.7	32.4	33.0	41.2	24.0	32.2	7.6	28.3	26.6	33.3
e. b => QwenRerank	58.2	53.2	32.0	43.6	28.8	37.6	36.0	33.2	34.8	7.9	32.6	45.0	36.9
f. a => Rank-R1-v0.2-32B (ours)	62.3	59.3	34.1	50.7	32.4	38.9	46.3	26.6	18.1	10.6	31.1	41.2	37.6
g. b => Rank-R1-v0.2-32B (ours)	60.1	56.3	36.6	52.1	30.2	37.6	45.9	25.5	14.6	10.1	38.6	44.3	37.7
h. g + b (Hybrid)^ (ours)	59.5	55.1	37.9	52.7	30.0	39.3	45.1	32.1	17.1	10.7	40.4	45.6	38.8

All the rerankers rerank using the original query without GPT4 CoT.
^ reranked results hybrid with the first-stage results, with score min-max norm and 0.1 weight on the first-stage document scores, no extra ranker and retrieval is introduced.

ielabgroup
/

Rank-R1-32B-v0.2

Rank-R1-v0.2

BRIGHT results

Dataset used to train ielabgroup/Rank-R1-32B-v0.2

Collection including ielabgroup/Rank-R1-32B-v0.2

Rank-R1-v0.2