---
title: TxAgent RAO Evaluation
emoji: 🌍
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 5.28.0
app_file: app.py
pinned: false
---

The TxAgent Rare-as-one Evaluation Portal is a Gradio-based web application for human evaluation of TxAgent's responses against those of other models. Users log in with their credentials, receive questions relevant to their expertise, and then perform pairwise comparisons and detailed ratings of the model responses.


## Features

- User Registration: The portal collects each evaluator's name, email, evaluator ID, medical specialty, subspecialty, NPI ID, and years of experience.
- Dynamic Question Assignment: Questions are fetched based on the evaluator's ID, and the system tracks previously evaluated questions so users never see the same question twice.
- Model Response Display: Two model responses are presented side by side in scrollable chat windows for easy comparison.
- Reference Answer: A reference answer is provided to offer context and guidance during evaluation.
- Pairwise Comparison: Users compare Model A and Model B across five criteria: Problem Resolution, Helpfulness, Scientific Consensus, Accuracy, and Completeness.
- Detailed Rating: Evaluators rate each model's response individually on a 1-5 scale for the same five criteria, with an "Unable to Judge" option.
- Constraint-Based Rating: To maintain consistency, individual model ratings are restricted by the pairwise comparison choices. For instance, if Model A is chosen as "better" for a criterion, its individual rating for that criterion cannot be lower than Model B's.
- Validation: The system checks for inconsistencies between pairwise comparisons and individual ratings and reports them to the user.
- Progress Tracking: Users are told how many questions they have left to evaluate.
- Data Submission: All evaluation data is submitted to a Google Sheet for storage and analysis. Importantly, the sheet name is generated dynamically by appending the evaluator's unique ID to a base name (`TxAgent_Human_Eval_Results_CROWDSOURCED`), so each evaluator's submissions land in a dedicated sheet (e.g., `TxAgent_Human_Eval_Results_CROWDSOURCED_1234`); a minimal sketch of this naming follows the list.
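
A minimal sketch of that naming, assuming a hypothetical helper `results_sheet_name` (the app may inline this string composition instead):

```python
TXAGENT_RESULTS_SHEET_BASE_NAME = "TxAgent_Human_Eval_Results_CROWDSOURCED"

def results_sheet_name(evaluator_id: str) -> str:
    # Hypothetical helper: append the evaluator's unique ID to the base
    # name so each evaluator writes to a dedicated Google Sheet.
    return f"{TXAGENT_RESULTS_SHEET_BASE_NAME}_{evaluator_id}"

# results_sheet_name("1234")
# -> "TxAgent_Human_Eval_Results_CROWDSOURCED_1234"
```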

## Application Flow

The application guides the user through a multi-page evaluation process (a sketch of the page-switching pattern follows the list):

1. Welcome Page (`page-1`): The initial landing page, featuring a "Participate in TxAgent Evaluation" button and general information about the TxAgent project.
2. Registration/Information Page (`page0`): Users enter their personal and professional details here. Once submitted, the application uses this information to assign relevant questions.
3. Evaluation Progress Modal: A pop-up informs the user of the number of questions they have left to evaluate.
4. Pairwise Comparison Page (`page1`): Users compare two model responses to a given prompt against the predefined criteria.
5. Detailed Rating Page (`page2`): Users then rate each model's response individually across the same criteria.
6. Confirmation Modal: Before final submission, a pop-up asks the user to confirm their evaluation.
7. Final Page (`final_page`): Once all assigned questions are evaluated, a thank-you message is displayed.
8. Error Modal: Validation errors and other important messages are conveyed through this pop-up.
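
In Gradio, this kind of multi-page flow is typically built by wrapping each page in a `gr.Column` and toggling visibility from event handlers. A minimal sketch of the pattern (component names are illustrative, not the app's actual ones):

```python
import gradio as gr

with gr.Blocks() as demo:
    # Each "page" is a Column; exactly one is visible at a time.
    with gr.Column(visible=True) as welcome_page:        # page-1
        participate_btn = gr.Button("Participate in TxAgent Evaluation")
    with gr.Column(visible=False) as registration_page:  # page0
        name_box = gr.Textbox(label="Name")

    def show_registration():
        # Hide the welcome page and reveal the registration page.
        return gr.Column(visible=False), gr.Column(visible=True)

    participate_btn.click(show_registration,
                          outputs=[welcome_page, registration_page])

demo.launch()
```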

## Code Components and Interactions

The Gradio application is built using several key components and functions that orchestrate the user experience and data handling.

### Configuration and Data Handling

- `REPO_ID`: Points to the Hugging Face dataset repository where evaluation data is stored.
- `EVALUATOR_MAP_DICT`: The name of the JSON file within the Hugging Face repository that maps evaluator IDs to their specific data directories.
- `TXAGENT_RESULTS_SHEET_BASE_NAME`: As mentioned, this string forms the base of the Google Sheet names; the sheet actually used for saving data appends the user's evaluator ID to it.
- Tool Lists: The application dynamically loads JSON files from the `tool_lists` directory. These files define the various tools used by the models.
- `tool_database_labels`: A dictionary constructed from the loaded tool lists and used by the `format_chat` function to label tool calls within the displayed model responses, making the reasoning trace easier to follow (see the sketch after this list).
- HTML & Image Assets: `index.html` provides the content for the initial project page, dynamically embedding images (such as `anatomyofAgentResponse.jpg`) by converting them to base64 strings.
- Specialty Data: `specialties.json` and `subspecialties.json` provide the lists of medical specialties and subspecialties for the dropdown menus during user registration.
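
A hedged reconstruction of the tool-list loading, assuming each JSON file in `tool_lists/` holds a list of tool definitions with a `name` field (the exact schema is an assumption):

```python
import json
import os

tool_database_labels = {}
for filename in os.listdir("tool_lists"):
    if not filename.endswith(".json"):
        continue
    # Use the file stem (e.g. "opentargets" from "opentargets.json")
    # as the database label for every tool defined in that file.
    label = os.path.splitext(filename)[0]
    with open(os.path.join("tool_lists", filename)) as f:
        tools = json.load(f)  # assumed: a list of tool dicts
    for tool in tools:
        tool_database_labels[tool["name"]] = label
```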

### Core Logic Functions

- `encode_image_to_base64(image_path)`: A utility that converts image files into base64 strings, allowing them to be embedded directly into HTML without separate file hosting.
- `preprocess_question_id(question_id)`: Ensures that question IDs are consistently formatted as strings, regardless of their original type.
- `get_evaluator_questions(evaluator_id, all_files, evaluator_directory)`: Orchestrates the retrieval of questions for a given evaluator. It downloads the relevant evaluation data from Hugging Face based on the evaluator's assigned directory, then filters these questions against previously submitted evaluations (read from the evaluator's specific Google Sheet) so that users only receive questions they haven't already answered (a sketch of this filtering step follows the list). It returns the list of (question_ID, other_model_name) pairs to be evaluated along with all the loaded model data.
- `format_chat(reasoning_trace, tool_database_labels)`: Takes a model's raw reasoning trace and transforms it into a clean, human-readable chat-history format, highlighting tool calls using the labels from `tool_database_labels`.
- `append_to_sheet(...)` and `read_sheet_to_df(...)`: These (presumed) utility functions handle the Google Sheets integration. `append_to_sheet` writes the collected evaluation data, and `read_sheet_to_df` reads existing data to prevent re-evaluation of questions. The `custom_sheet_name` parameter of these functions is where the dynamic sheet naming (base name + evaluator ID) comes into play.
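
The filtering step inside `get_evaluator_questions` can be pictured as follows. This is a sketch, not the actual implementation: it reuses the app's `read_sheet_to_df` and `preprocess_question_id` helpers, and the sheet column names (`question_id`, `other_model_name`) are assumptions about the sheet layout:

```python
def filter_unseen_questions(candidate_pairs, evaluator_id):
    # Read this evaluator's dedicated sheet (presumed helper, see above).
    sheet_name = f"{TXAGENT_RESULTS_SHEET_BASE_NAME}_{evaluator_id}"
    done = read_sheet_to_df(custom_sheet_name=sheet_name)
    if done is None or done.empty:
        return candidate_pairs
    # Column names are assumptions about the sheet layout.
    seen = {
        (preprocess_question_id(qid), model)
        for qid, model in zip(done["question_id"], done["other_model_name"])
    }
    # Keep only (question_ID, other_model_name) pairs not yet evaluated.
    return [
        (qid, model)
        for qid, model in candidate_pairs
        if (preprocess_question_id(qid), model) not in seen
    ]
```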

### User Interface and Page Transitions

The application uses various Gradio UI components for input, display, and interaction:

- Layout Components: `gr.Column` and `gr.Row` arrange UI elements.
- Content Display: `gr.HTML` and `gr.Markdown` embed rich content, instructions, and formatted text, such as the highlighted prompt and reference answers.
- Input Fields: `gr.Textbox` handles text input for user details and comments; `gr.Dropdown` provides multi-select options for specialties; `gr.Radio` is used for single-choice selections such as years of experience, pairwise comparisons, and individual ratings.
- Interactive Display: `gr.Chatbot` provides scrollable, interactive displays for model responses, capable of rendering markdown.
- Buttons: `gr.Button` elements control navigation ("Next", "Back", "Home Page"), initiate actions ("Participate in Evaluation", "Submit"), and manage specific UI elements ("Clear Selection", "This question does not make sense").
- `gr.State()`: This fundamental Gradio component stores and passes information between function calls and pages. Variables such as `user_info_state`, `pairwise_state`, and `data_subset_state` preserve crucial evaluation context (see the sketch after this list).
- `Modal()`: Presents critical information or confirmation requests in a pop-up window, improving the experience for actions such as submission and error notifications.
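
A minimal sketch of how `gr.State` and a pop-up `Modal` interact, assuming `Modal` comes from the `gradio_modal` custom component (component and field names here are illustrative, not the app's actual ones):

```python
import gradio as gr
from gradio_modal import Modal  # assumed source of the Modal component

with gr.Blocks() as demo:
    user_info_state = gr.State({})  # per-session storage across callbacks

    name_box = gr.Textbox(label="Name")
    next_btn = gr.Button("Next")

    with Modal(visible=False) as eval_progress_modal:
        progress_md = gr.Markdown()

    def register(name, user_info):
        # Copy the state and record the new field.
        user_info = dict(user_info, name=name)
        # Returning Modal(visible=True) opens the pop-up.
        return user_info, Modal(visible=True), f"Welcome, {name}!"

    next_btn.click(register,
                   inputs=[name_box, user_info_state],
                   outputs=[user_info_state, eval_progress_modal, progress_md])

demo.launch()
```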

### Event Handling and Validation

Gradio's event handling mechanism (`.click()`, `.change()`, `.then()`) is central to the application's interactive nature:

- `go_to_eval_progress_modal(...)`: Triggered by clicking "Next" on the registration page. It validates user inputs, then fetches and prepares the first question for evaluation, showing the `eval_progress_modal`.
- `go_to_page2(...)`: Activated by the "Next: Rate Responses" button on the pairwise comparison page. It saves the pairwise choices and reasons, then moves the user to the detailed rating page.
- `restrict_choices(...)`: A key part of the constraint-based rating, triggered whenever a rating radio button for Model A or Model B changes. It dynamically updates the available choices for the other model's rating based on the selected pairwise comparison and the current rating, ensuring consistency (a sketch of this narrowing logic follows the list).
- `validate_ratings(...)`: Called when the "Submit" button is clicked. It performs a consistency check between the pairwise comparisons and the individual model ratings; if any discrepancy is found (e.g., Model A rated lower than Model B but chosen as "better" in the pairwise comparison), an error message is generated.
- Submission Flow: The `submit_btn`'s `.click()` event first calls `validate_ratings`; its `.then()` step then processes the validation result. If the ratings are valid, a confirmation modal (`confirm_modal`) appears.
- `final_submit(...)`: If the user confirms submission, this function is executed. It constructs the complete evaluation record using `build_row_dict`, appends it to the correct Google Sheet (`TXAGENT_RESULTS_SHEET_BASE_NAME` + evaluator ID), and then recounts the remaining questions. Depending on the count, it either shows the `eval_progress_modal` for the next question or the `final_page` when no more questions are available.
- `reset_everything_except_user_info()`: After a successful submission, when proceeding to a new question, this function resets all evaluation-specific UI components to their default empty or unselected states, ensuring a clean slate for the next task.
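
A hedged sketch of the narrowing logic behind `restrict_choices`, assuming 1-5 integer ratings and pairwise labels like "Model A"/"Model B"/"Tie" (the exact labels, function signature, and strictness rules are assumptions):

```python
import gradio as gr

RATING_CHOICES = [1, 2, 3, 4, 5]

def restrict_model_b_choices(pairwise_choice, rating_a):
    # With no rating for Model A yet, or a tie, leave all choices open.
    if rating_a is None or pairwise_choice == "Tie":
        return gr.update(choices=RATING_CHOICES)
    if pairwise_choice == "Model A":
        # A was judged better, so B's rating may not exceed A's.
        allowed = [c for c in RATING_CHOICES if c <= rating_a]
    else:
        # B was judged better, so B's rating may not fall below A's.
        allowed = [c for c in RATING_CHOICES if c >= rating_a]
    return gr.update(choices=allowed)
```

Wired to a `.change()` event on Model A's rating radio, this keeps the detailed ratings consistent with the earlier pairwise verdict before `validate_ratings` performs its final check.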