# J1-7B-RL

*Not yet finished!*

## Model Description

J1-7B-RL is an LLM-as-a-Judge model trained through a two-stage process of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). The model is specifically designed to benefit from Simple Test-Time Scaling (STTS) techniques and serves as an improved preference judge for evaluating LLM outputs. It is the implementation of the model described in the paper "J1: Exploring Simple Test-Time Scaling for LLM-as-a-Judge".

## Key Features

- **Enhanced Reflective Reasoning**: Trained to make optimal use of reflective reasoning tokens through a novel two-stage training paradigm
- **STTS Compatibility**: Demonstrates superior scaling behavior under Simple Test-Time Scaling compared to previous LLM-as-a-Judge models
- **Performance Improvement**: Achieves a 4.8% improvement in overall judgment performance and exhibits a 5.1% stronger scaling trend under STTS

## Model Details

- **Base Model**: Qwen2.5-7B-Base
- **Training Procedure**:
  - Stage 1: SFT on the J1-SFT-53K dataset (curated from HelpSteer2, OffsetBias, WildGuard, and Magpie)
  - Stage 2: RL with the Reinforce++ algorithm on the English subset of the RISE dataset
- **Context Length**: 8192 tokens
- **Parameters**: 7 billion
- **Training Hardware**: NVIDIA H800 cluster

## Evaluation Results

J1-7B-RL was evaluated on four diverse preference datasets and outperforms previous state-of-the-art models:

| Model | RewardBench | RewardMath | Anthropic Harmless | CodePrefBench | Overall |
|-------|-------------|------------|--------------------|---------------|---------|
| Llama3.1-8B-Instruct | 70.47 | 61.12 | 46.43 | 67.10 | 61.28 |
| Qwen2.5-7B-Instruct | 78.50 | 69.70 | 49.56 | 67.59 | 66.34 |
| Skywork-Critic-Llama3.1-8B | 88.86 | 66.51 | 58.61 | 60.57 | 68.64 |
| RISE-Judge-Qwen2.5-7B | 87.42 | 81.69 | 56.35 | 59.22 | 71.17 |
| J1-7B (SFT Only) | 85.01 | 82.40 | 53.88 | 49.20 | 67.62 |
| J1-7B (SFT + RL) | 86.91 | **90.15** | 59.05 | **67.80** | **75.98** |

This represents a significant improvement over previous state-of-the-art LLM-as-a-Judge models of the same size class.

## Usage

J1-7B-RL can be used both conventionally and with STTS for enhanced performance:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("test-time-scaling/J1_7B_RL")
tokenizer = AutoTokenizer.from_pretrained("test-time-scaling/J1_7B_RL")

# Example question and candidate responses
query = "What are the advantages and disadvantages of remote work?"
response_a = "Remote work offers flexibility and eliminates commuting, but can lead to isolation and blurred work-life boundaries."
response_b = "Working remotely is convenient because you can work from anywhere, but sometimes it's hard to communicate with colleagues."

# Judge prompt template
prompt_template = (
    "Please act as an impartial judge and evaluate the quality of the responses provided by two AI "
    "assistants to the user question displayed below. You should choose the assistant that follows the "
    "user's instructions and answers the user's question better. Your evaluation should consider factors "
    "such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their "
    "responses. Begin your evaluation by comparing the two responses and provide a short explanation. "
    "Avoid any position biases and ensure that the order in which the responses were presented does not "
    "influence your decision. Do not allow the length of the responses to influence your evaluation. "
    "Do not favor certain names of the assistants. Be as objective as possible. Please first analysis "
    "both of the answer step by step, directly point out the position of error and output why it is an "
    "error in detail when finding error in analysis. If the question is open-ended, directly point out "
    "why the rejected answer is worse than the chosen one. After providing your explanation, output your "
    "final verdict by strictly following this format: '[[A]]' if assistant A is better, '[[B]]' if "
    "assistant B is better.\n\n"
    "[User Question]\n{instruction}\n\n"
    "{{The Start of Assistant A's Answer}}\n{answer_a}\n{{The End of Assistant A's Answer}}\n\n"
    "{{The Start of Assistant B's Answer}}\n{answer_b}\n{{The End of Assistant B's Answer}}"
)

# Standard usage (without STTS)
prompt = prompt_template.format(instruction=query, answer_a=response_a, answer_b=response_b)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
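
# Optional: parse the final verdict from the judgment text.
# The judge prompt above instructs the model to end with a strict '[[A]]' or '[[B]]'
# tag, so a simple regex is sufficient. The helper below is a minimal sketch; the
# name `parse_verdict` is ours and not part of the original release.
import re

def parse_verdict(judgment):
    """Return 'A' or 'B' from the last [[A]]/[[B]] tag, or None if no verdict is found."""
    matches = re.findall(r"\[\[([AB])\]\]", judgment)
    return matches[-1] if matches else None

print(parse_verdict(result))  # e.g. 'A'
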
# Usage with STTS: stop at the end-of-thinking token, inject "wait", and let the
# model keep reflecting before committing to a verdict.
def apply_stts(model, tokenizer, query, response_a, response_b, num_waits=2):
    prompt = f"Question:\n{query}\n\nAnswer A:\n{response_a}\n\nAnswer B:\n{response_b}\n\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # End-of-thinking delimiter; assumed here to be the single token "</think>".
    # Adjust if your checkpoint uses a different reasoning delimiter.
    think_end_id = tokenizer.encode("</think>", add_special_tokens=False)[0]

    # Initial generation, stopping at the end-of-thinking token
    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        eos_token_id=think_end_id,
    )
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Replace the end-of-thinking token with "wait" and continue generation
    for i in range(num_waits):
        prompt_with_thinking = result + " wait,"
        inputs = tokenizer(prompt_with_thinking, return_tensors="pt").to(model.device)
        if i == num_waits - 1:
            # Final round: generate to completion so the verdict is produced
            continued = model.generate(
                **inputs,
                max_new_tokens=1024,
            )
        else:
            # Intermediate rounds: stop again at the end-of-thinking token
            continued = model.generate(
                **inputs,
                max_new_tokens=1024,
                eos_token_id=think_end_id,
            )
        result = tokenizer.decode(continued[0], skip_special_tokens=True)
    return result

stts_result = apply_stts(model, tokenizer, query, response_a, response_b, num_waits=2)
print(stts_result)
```

## License

CC-BY-NC-4.0

## Acknowledgements

This work builds upon research in LLM-as-a-Judge, test-time scaling techniques, and reinforcement learning methodologies. We acknowledge the creators of the Qwen2.5 model series, the RISE dataset, and the evaluation benchmarks used to assess model performance, as well as the verl and OpenRLHF frameworks used for RL and SFT training.