Personal thoughts on the randomized study of an LM-based review feedback agent at ICLR 2025

Last year, ICLR 2025 announced an interesting experiment, "Assisting ICLR 2025 reviewers with feedback". Unfortunately, I had to decline the reviewer invitation from ICLR 2025 due to other duties at the time, so I missed the opportunity to be part of the experiment.
Since then, I had been waiting for the results of the experiment and feedback from the community, as I was very interested in this kind of work. In fact, I had proposed a similar idea to organizers of other venues before, and some people advised me to reuse this system for some of the venues I serve, so my expectation bar was set pretty high. A few days ago, ICLR 2025 published a follow-up blog post along with a preprint, "Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025".
I'm writing this post to share my personal thoughts on this work, focusing on the design of the experiment and its evaluation.
TL;DR
I find it difficult for other venues to apply the same approach to their review systems. Moreover, I cannot buy many of the findings and conclusions reported in the preprint (v1), as I find the evaluation and analysis of the experiment not convincing enough to support the claims. If you are an organizer of a conference or journal and feel inclined to introduce this approach based on the reported findings, I suggest you read the preprint first.
Let me briefly summarize what you could (and could not) reuse from their work as is.
What you could reuse
- Prompts
- Example code
The preprint shows the authors' effort toward transparency, providing all the prompts used in the experiment, many statistics, and minimal example code to run their proposed Review Feedback Agent. If other venues want to implement similar approaches, these may be good references.
Difficult for other venues to reuse the approach as is
First and foremost, the experiment uses paper content parsed from the submitted PDF as part of the prompt context, yet it relies only on proprietary LM-powered APIs: GPT-4o, Gemini 1.5 Flash, and Claude Sonnet 3.5.
ICLR traditionally opens (almost) everything involved in its review process, including submitted papers, even during the review period, which might have enabled the organizers to feed the paper content to the proprietary APIs. I am not sure what terms and plans were used for the proprietary API services, but if those allow the providers to store and/or use the paper content for training their own models, I'd assume that the organizers would need the authors' agreement beforehand, especially if the submitted papers are not publicly available.
It is unclear to me whether any form of consent was obtained from the authors of ICLR 2025 submissions for their papers being exposed to the proprietary LM-powered APIs.
Now I switch gears and share my personal thoughts on the work itself.
Key concerns
- This work doesn't show any real user feedback on the generated review feedback.
- Many findings are based on analyses performed by LM-powered APIs, whose outputs seem to be (blindly) trusted without much discussion of their accuracy.
1. No real user feedback
By real users, I mean those involved in the review process, such as authors, reviewers, and ACs (area chairs).
After the final paper decisions have been released, we will distribute an anonymous, voluntary survey to authors, reviewers, and ACs to gather feedback. We will carefully analyze the responses to assess the pilot system's impact and to guide improvements for future iterations.
Despite the above statement in an earlier ICLR blog post, I didn't find any discussion in this work of real user feedback (i.e., feedback from authors, reviewers, and ACs) on the generated review feedback or its impact. While the work suggests that reviews updated after receiving the feedback became somewhat longer, slightly longer reviews do not necessarily mean improved reviews.
The preprint reports that "two human AI researchers" conducted a blind preference evaluation between the initial and modified pre-rebuttal reviews. The two researchers checked 100 examples that met certain criteria and preferred the modified reviews 89% of the time. However, I do not think this is a strong signal of higher review quality. This type of evaluation should be based on feedback from real users, as they are the ones whom review quality actually affects. It might be a different story if the two researchers had been selected as known high-quality reviewers and had assessed the modified and original reviews against the review guidelines, but the work does not provide such details.
On top of that, the 100 examples were not randomly sampled but extracted from a group about which the authors claim, "We see that when reviewers receive fewer feedback items, they are more likely to incorporate more (or even all) of the items." Thus, I do not think the finding based on these 100 examples represents an average (expected) experience.
It is also notable that 73.4% of the reviews that received generated feedback were "not updated". Though this is lower than the 90.6% in the control group, it would have been an interesting discussion if the authors had learned something from real user feedback about how to improve reviewer engagement.
The goal of this work is to improve review quality, but I think "review quality" is neither well defined nor properly assessed here. Without positive feedback from authors, reviewers, and/or ACs, it is difficult to claim that the generated feedback indeed improves review quality.
2. Findings built on decisions by LM-powered APIs, with no discussion of their accuracy
Ex1: "Well-written" reviews
Less than 8% of the selected reviews did not receive feedback for one of two reasons: 2,692 reviews were originally well-written and did not need feedback, while 829 reviews had feedback that failed the reliability tests.
This work doesn't explain how it was determined that those 2,692 reviews were well-written. From their prompt:
If you find no issues in the review at all, respond with: 'Thanks for your hard work!'
I assume that this conclusion is drawn from predictions by the LM-powered APIs, and the work doesn't discuss how accurate those predictions are.
Ex2: Measuring how much feedback reviewers incorporate
Of the reviewers that updated their review, we wanted to measure what proportion of them incorporated one or more pieces of feedback they were provided. This analysis helped us estimate how many reviewers found the feedback useful. … To systematically carry out this analysis, we developed an LLM-based pipeline to run on all updated reviews (see Supplementary Figure S2A). We used the Claude Sonnet 3.5 model to evaluate whether each feedback item received by a reviewer was incorporated into their modified review.
Similarly, this finding (how many of the updated reviews incorporated the feedback they received) rests entirely on Claude Sonnet 3.5's judgments. Again, there is no discussion of how accurate those judgments are for this specific task; the sketch below illustrates the kind of check I have in mind.
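To be concrete, here is a minimal sketch (not the authors' pipeline) of the kind of accuracy check I would like to see: compare the LLM judge's "was this feedback item incorporated?" decisions against a small human-annotated sample and report the agreement. The function names and data layout below are hypothetical placeholders.

```python
from typing import Callable

# Hypothetical sketch: `llm_judge` stands in for the Claude Sonnet 3.5 call the
# preprint describes; `samples` is a small human-annotated set of feedback items
# paired with the corresponding review edits.
def judge_agreement(
    samples: list[dict],                    # each: {"feedback": str, "review_diff": str, "human_label": bool}
    llm_judge: Callable[[str, str], bool],  # returns True if it judges the feedback as incorporated
) -> dict:
    """Report how often the LLM judge agrees with human annotators."""
    tp = fp = fn = tn = 0
    for s in samples:
        pred = llm_judge(s["feedback"], s["review_diff"])
        gold = s["human_label"]
        if pred and gold:
            tp += 1
        elif pred and not gold:
            fp += 1
        elif not pred and gold:
            fn += 1
        else:
            tn += 1
    n = len(samples)
    return {
        "n": n,
        "accuracy": (tp + tn) / n,
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
    }
```

Even agreement numbers on a few hundred human-labeled pairs would make a finding like "X% of reviewers incorporated at least one feedback item" much easier to interpret.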
Ex3: Reliability tests
Before a piece of generated feedback is shared with a reviewer, it needs to pass four tests. These reliability tests are automated (probably with the proprietary APIs), but the work discusses neither the quality of each test nor its error rate.
I also feel that the following prompt tuning strategy may lead to overfitting.
To refine our Review Feedback Agent's pipeline and prompts, we passed our test set reviews through the validated reliability tests until we achieved a 100% pass rate.
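For instance, here is a minimal sketch, under assumed placeholder names, of the held-out check I would expect when prompts are tuned against reliability tests: tune prompts on one split of reviews and report the pass rate only on reviews that were never used for tuning. The functions `generate_feedback`, `run_reliability_tests`, and `tune_prompts` are hypothetical, not the authors' code.

```python
import random

def heldout_pass_rate(reviews, generate_feedback, run_reliability_tests, tune_prompts, seed=0):
    """Tune prompts on one split and measure the reliability-test pass rate on unseen reviews."""
    rng = random.Random(seed)
    shuffled = list(reviews)
    rng.shuffle(shuffled)
    split = len(shuffled) // 2
    tune_set, heldout_set = shuffled[:split], shuffled[split:]

    prompts = tune_prompts(tune_set)  # iterate on prompts using the tuning split only

    passed = sum(
        run_reliability_tests(generate_feedback(review, prompts))  # True/False per review
        for review in heldout_set
    )
    return passed / len(heldout_set)
```

A 100% pass rate obtained by iterating on the very same test set tells us little about how the tests and prompts behave on new reviews.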
Suggestions
Still, I think this was a great initiative by the ICLR 2025 team toward improving review quality with an automated approach. This experiment was only one of many tasks the team completed, and I can imagine from my own experience that, even without the experiment, running a regular review process is very challenging and keeps organizers busy.
In this post, I intentionally avoided discussing the Review Feedback Agent approach itself, since I didn't find solid evaluations of its effectiveness (though the high-level pipeline design looks reasonable). If other venues conduct similar experiments, I would like to see discussions of review quality based on feedback from authors, reviewers, and/or ACs.
Specifically, I imagine that conference organizers and journal editorial board members would be more interested in what users in the review process thought about the generated feedback and the quality of reviews, beyond the kinds of experimental statistics the ICLR 2025 team provided.
I also suggest that organizers who follow this experiment discuss the risks of using proprietary LM-powered APIs, such as whether the service terms allow paper content fed to the APIs to be stored or used for model training, and whether authors are OK with their papers being exposed to such external APIs. They should also consider open-source/open-weight models for their experiments.
Dear ICLR 2025 authors, reviewers, ACs, and research community, what are your thoughts on the experiment?
In this post, I shared my personal thoughts on the experiment conducted at ICLR 2025. As mentioned above, I was not involved in the ICLR 2025 review process this time, nor did I submit papers to the venue. If you experienced the ICLR 2025 review process, I am very eager to hear your thoughts on the experiment: how useful the generated feedback was for you, the quality of the reviews you saw or received, and so on. As I'm fortunately in a position to propose and technically support such experiments, it would be very helpful if you could share your experience and feedback on the experiment!
Even if you were not in the ICLR 2025 review process, I still want to hear your thoughts on this work!
P.S.
I'm cross-posting this content on Medium in case you don't have a Hugging Face account to leave a comment. If you don't have a Medium account either, feel free to reply to my Bluesky or X posts.