# Arena-Lite (formerly VARCO Arena)
Arena-Lite runs a tournament among the models being compared for every test-set prompt, ranking the models accurately at an affordable cost. This is more accurate and cost-effective than estimating win rates against reference outputs.

For more information, the following resources may help you understand how it works.
* [Paper](https://arxiv.org/abs/2411.01281)
* [Blog Post (KR)](https://ncsoft.github.io/ncresearch/12cc62c1ea0d981971a8923401e8fe6a0f18563d)

## Quickstart
### Running the Web Demo Locally (Streamlit, recommended!)
```bash
git clone [THIS_REPO]
# install the requirements below; we recommend miniforge for managing the environment
cd streamlit_app_local
bash run.sh
```
For more details, see `[THIS_REPO]/streamlit_app_local/README.md`.
### CLI use
* Located at `varco_arena/`
* Debug configurations for VS Code are in `varco_arena/.vscode`
```bash
## gpt-4o-mini as a judge
python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -m tournament -e "gpt-4o-mini"
## vllm-openai served LLM as a judge
python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -e SOME_MODEL_NAME_SERVED -m tournament -u "http://url_to/your/vllm_openai_server:someport"

# dbg lines
## openai api judge dbg
python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
## other testing lines
python main.py -i "rsc/inputs_for_dbg/[SOME_DIRECTORY]/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
## dummy judge dbg (checking errors without api requests)
python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e debug
```
## Requirements
```
pip install -r requirements.txt # python 3.11

# Linux
uvloop
# Windows
winloop
```
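The two event-loop packages listed above are platform-specific: `uvloop` accelerates `asyncio` on Linux/macOS, while `winloop` plays the same role on Windows. A minimal sketch of how such a conditional setup typically looks (illustrative only, not the repo's actual wiring):

```python
# Illustrative sketch of the platform split above (not the repo's actual code).
import sys

if sys.platform == "win32":
    import winloop as fast_loop  # Windows counterpart of uvloop
else:
    import uvloop as fast_loop

fast_loop.install()  # both packages expose an uvloop-style install() helper
```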
#### Arguments
- `-i, --input` : directory containing the input JSONL files (LLM outputs)
- `-o, --output_dir` : directory where the results will be written
- `-e, --evaluation` : judge model specification (e.g. `gpt-4o-2024-05-13`, `gpt-4o-mini`, `[vllm-served-model-name]`)
- `-k, --openai_api_key` : OpenAI API key
- `-u, --openai_url` : URL of an OpenAI-compatible LLM server (requests are made with the OpenAI SDK)
#### Advanced
- `-j, --n_jobs` : number of concurrent jobs, passed to `asyncio.Semaphore()`
- `-p, --evalprompt` : evaluation prompt to use ([see the prompts directory](./varco_arena/prompts/))
- `-lr, --limit_requests` : vLLM OpenAI server request limit (default: 7,680)
- `-lt, --limit_tokens` : vLLM OpenAI server token limit (default: 15,728,640)
#### Input Data Format
[Input JSONL guide](./streamlit_app_local/guide_mds/input_jsonls_en.md)
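Before running, it can help to sanity-check that the directory you will pass to `-i` really contains JSONL files with one JSON object per line. The snippet below only inspects the files; the required field names are documented in the guide linked above, and none of them are assumed here:

```python
# Quick sanity check for the input directory passed to -i:
# each .jsonl file should contain one JSON object per line.
import json
from pathlib import Path

input_dir = Path("./some/dirpath/to/jsonl/files")  # same path as -i

for fpath in sorted(input_dir.glob("*.jsonl")):
    with fpath.open(encoding="utf-8") as f:
        first_record = json.loads(next(f))  # parse the first line only
    print(f"{fpath.name}: keys = {list(first_record.keys())}")
```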
## Contributing & Customizing
#### After cloning and installing
```bash
pip install pre-commit
pre-commit install
```
#### Before committing
```bash
bash precommit.sh # the black formatter will reformat the code
```
### 📝 Adding a Custom Prompt
Here's how to add a new evaluation prompt. The process was recently simplified: the judge logic now relies only on the `parsed_output` method.

The easiest way is to copy `llmbar_brief.py` and `llmbar_brief.yaml` and modify them into your own prompt.
#### 1. Create Prompt `.py` and `.yaml` Files
- Create files such as `my_prompt.py` and `my_prompt.yaml` in the `varco_arena/varco_arena_core/prompts/` directory.
- **`my_prompt.py`**:
  - Define a class that inherits from `ComparisonPromptBase`.
  - You **must** implement the `parsed_output(self, response)` method. It should take the LLM judge's `response` and return a decision token (e.g. `'a'`, `'b'`) indicating the winner (see the sketch after this list).
- **`my_prompt.yaml`**:
  - Define the elements your prompt needs, such as `sampling_parameters`, `decision_tokens`, and `prompt_template`.
  - The strings in `prompt_template` are processed with `string.Template` and finalized in `eval_utils.py` via `BasePrompt.complete_prompt()`.
  - Do not use `${task}` in `prompt_template`; it is a reserved keyword because of the llmbar prompt.
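A minimal sketch of what `my_prompt.py` could look like. The import path of `ComparisonPromptBase` and the assumption that `response` is the judge's raw text are guesses; mirror `llmbar_brief.py` for the real interface, and adapt the parsing to whatever answer format your `prompt_template` requests:

```python
# varco_arena/varco_arena_core/prompts/my_prompt.py
# Minimal sketch. The import below and the assumption that `response` is the
# judge's raw text are guesses; mirror llmbar_brief.py for the real interface.
from .base import ComparisonPromptBase  # assumed module path


class MyPrompt(ComparisonPromptBase):
    def parsed_output(self, response):
        """Map the judge's response to a decision token ('a' or 'b')."""
        text = response.strip().lower()
        if text.endswith("(a)") or text == "a":
            return "a"
        if text.endswith("(b)") or text == "b":
            return "b"
        raise ValueError(f"Could not parse a decision from: {response!r}")
```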
#### 2. Register the Prompt in `prompts/__init__.py`
- Import your new prompt class:
```python
from .my_prompt import MyPrompt
```
- Add your new prompt's name and class instance to the `NAME2PROMPT_CLS` dictionary:
```python
NAME2PROMPT_CLS = dict(
    # ... other prompts
    my_prompt=MyPrompt(),
)
```
- Add the new prompt name to the `Literal` type hint of the `promptname` argument in the `load_prompt` function:
```python
def load_prompt(
    promptname: Literal[
        # ... other prompt names
        "my_prompt",
    ],
    # ...
):
```
#### 3. Add the Prompt to `eval_prompt_list.txt`
- Open the `eval_prompt_list.txt` file in the project root and add the name of your new prompt (`my_prompt`) on a new line.
#### 4. (Recommended) Test and Debug
- It is highly recommended to debug your prompt to make sure it works as expected.
- In `.vscode/launch.json`, modify the `"VA"` configuration's `args`:
  - Change `"-p", "translation_fortunecookie"` to `"-p", "my_prompt"`.
  - If necessary, update the `"-i", "..."` argument to point to test data suited to your new prompt.
- Open the `Run and Debug` tab in VS Code (Ctrl+Shift+D), select the "VA" configuration, and press F5 to start the debugger.
- Check `result.json` inside the output directory you specified with `-o`; it records every judge prompt used for each match.
## FAQ
* I want to use my own judge prompt with Arena-Lite.
  * [`./varco_arena/prompts/`](./varco_arena/prompts/__init__.py) defines each prompt via a `yaml` file plus a corresponding class object. Edit these as needed.
* I want a different judge prompt for different rows of the test set (e.g. rows up to 100 use `prompt1`, rows from 101 on use `prompt2`).
  * `load_prompt` (linked above) receives `promptname` and `task` as parameters when loading a prompt; the function is called in [`./varco_arena/manager.py:async_run`](./varco_arena/manager.py). A hypothetical routing sketch follows below.
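A hypothetical sketch of that idea: since `load_prompt` already receives `task` alongside `promptname`, you can branch on `task` to pick a prompt per row. The mapping below is made up for illustration; `my_prompt` refers to the custom prompt from the section above, and the real `load_prompt` in `prompts/__init__.py` does more than this:

```python
# Hypothetical per-row routing inside load_prompt (illustrative only).
# `task` comes from your test set rows; the mapping below is invented.
TASK2PROMPT = {
    "summarization": "my_prompt",  # rows tagged "summarization" use the custom prompt
}

def load_prompt(promptname, task=None):
    # fall back to the prompt chosen via -p / --evalprompt for unmapped tasks
    chosen = TASK2PROMPT.get(task, promptname)
    return NAME2PROMPT_CLS[chosen]
```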
## Special Thanks to (contributors)
- Minho Lee (@Dialogue Model Team, NCSOFT) [github](https://github.com/minolee/)
  - query wrapper
  - rag prompt
- Jumin Oh (@Generation Model Team, NCSOFT)
  - overall prototyping of the system in haste
## Citation
If you found our work helpful, consider citing our paper!
```
@misc{son2024varcoarenatournamentapproach,
      title={VARCO Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
      author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
      year={2024},
      eprint={2411.01281},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.01281},
}
```