# Arena-Lite (formerly VARCO Arena)
Arena-Lite runs a tournament among the models being compared for every test-set prompt, ranking the models accurately at an affordable cost. This is more accurate and cost-effective than estimating win rates against reference outputs.

For more information, the following resources may help you understand how it works.
* [Paper](https://arxiv.org/abs/2411.01281)
* [Blog Post (KR)](https://ncsoft.github.io/ncresearch/12cc62c1ea0d981971a8923401e8fe6a0f18563d)

## Quickstart
### Running the Web Demo Locally (Streamlit, recommended!)
```bash
git clone [THIS_REPO]
# install the requirements below; we recommend miniforge for managing the environment
cd streamlit_app_local
bash run.sh
```
For more details, see `[THIS_REPO]/streamlit_app_local/README.md`.
### CLI use
* Located at `varco_arena/`
* Debug configurations for VS Code are in `varco_arena/.vscode`
```bash
## gpt-4o-mini as a judge
python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -m tournament -e "gpt-4o-mini"
## vllm-openai served LLM as a judge
python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -e SOME_MODEL_NAME_SERVED -m tournament -u "http://url_to/your/vllm_openai_server:someport"

# dbg lines
## openai api judge dbg
python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
## other testing lines
python main.py -i "rsc/inputs_for_dbg/[SOME_DIRECTORY]/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
## dummy judge dbg (checking errors without api requests)
python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e debug
```
## Requirements
```
pip install -r requirements.txt # python 3.11

# Linux
uvloop
# Windows
winloop
```
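The two event-loop packages listed above are platform-specific: `uvloop` accelerates `asyncio` on Linux/macOS, while `winloop` plays the same role on Windows. A minimal sketch of how such a conditional setup typically looks (illustrative only, not the repo's actual wiring):

```python
# Illustrative sketch of the platform split above (not the repo's actual code).
import sys

if sys.platform == "win32":
    import winloop as fast_loop  # Windows counterpart of uvloop
else:
    import uvloop as fast_loop

fast_loop.install()  # both packages expose an uvloop-style install() helper
```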
#### Arguments
- `-i, --input` : directory containing the input JSONL files (LLM outputs)
- `-o, --output_dir` : directory where the results will be written
- `-e, --evaluation` : judge model specification (e.g. `gpt-4o-2024-05-13`, `gpt-4o-mini`, `[vllm-served-model-name]`)
- `-k, --openai_api_key` : OpenAI API key
- `-u, --openai_url` : URL of an OpenAI-compatible LLM server (requests are made with the OpenAI SDK)
#### Advanced
- `-j, --n_jobs` : number of concurrent jobs, passed to `asyncio.Semaphore()`
- `-p, --evalprompt` : evaluation prompt to use ([see the prompts directory](./varco_arena/prompts/))
- `-lr, --limit_requests` : vLLM OpenAI server request limit (default: 7,680)
- `-lt, --limit_tokens` : vLLM OpenAI server token limit (default: 15,728,640)
#### Input Data Format
[Input JSONL guide](./streamlit_app_local/guide_mds/input_jsonls_en.md)
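Before running, it can help to sanity-check that the directory you will pass to `-i` really contains JSONL files with one JSON object per line. The snippet below only inspects the files; the required field names are documented in the guide linked above, and none of them are assumed here:

```python
# Quick sanity check for the input directory passed to -i:
# each .jsonl file should contain one JSON object per line.
import json
from pathlib import Path

input_dir = Path("./some/dirpath/to/jsonl/files")  # same path as -i

for fpath in sorted(input_dir.glob("*.jsonl")):
    with fpath.open(encoding="utf-8") as f:
        first_record = json.loads(next(f))  # parse the first line only
    print(f"{fpath.name}: keys = {list(first_record.keys())}")
```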
## Contributing & Customizing
#### After cloning and installing
```bash
pip install pre-commit
pre-commit install
```
#### Before committing
```bash
bash precommit.sh # the black formatter will reformat the code
```
### 📝 Adding a Custom Prompt
Here's how to add a new evaluation prompt. The process was recently simplified: the judge logic now relies only on the `parsed_output` method.

The easiest way is to copy `llmbar_brief.py` and `llmbar_brief.yaml` and modify them into your own prompt.
#### 1. Create Prompt `.py` and `.yaml` Files
- Create files such as `my_prompt.py` and `my_prompt.yaml` in the `varco_arena/varco_arena_core/prompts/` directory.
- **`my_prompt.py`**:
  - Define a class that inherits from `ComparisonPromptBase`.
  - You **must** implement the `parsed_output(self, response)` method. It should take the LLM judge's `response` and return a decision token (e.g. `'a'`, `'b'`) indicating the winner (see the sketch after this list).
- **`my_prompt.yaml`**:
  - Define the elements your prompt needs, such as `sampling_parameters`, `decision_tokens`, and `prompt_template`.
  - The strings in `prompt_template` are processed with `string.Template` and finalized in `eval_utils.py` via `BasePrompt.complete_prompt()`.
  - Do not use `${task}` in `prompt_template`; it is a reserved keyword because of the llmbar prompt.
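A minimal sketch of what `my_prompt.py` could look like. The import path of `ComparisonPromptBase` and the assumption that `response` is the judge's raw text are guesses; mirror `llmbar_brief.py` for the real interface, and adapt the parsing to whatever answer format your `prompt_template` requests:

```python
# varco_arena/varco_arena_core/prompts/my_prompt.py
# Minimal sketch. The import below and the assumption that `response` is the
# judge's raw text are guesses; mirror llmbar_brief.py for the real interface.
from .base import ComparisonPromptBase  # assumed module path


class MyPrompt(ComparisonPromptBase):
    def parsed_output(self, response):
        """Map the judge's response to a decision token ('a' or 'b')."""
        text = response.strip().lower()
        if text.endswith("(a)") or text == "a":
            return "a"
        if text.endswith("(b)") or text == "b":
            return "b"
        raise ValueError(f"Could not parse a decision from: {response!r}")
```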
#### 2. Register the Prompt in `prompts/__init__.py`
- Import your new prompt class:
```python
from .my_prompt import MyPrompt
```
- Add your new prompt's name and class instance to the `NAME2PROMPT_CLS` dictionary:
```python
NAME2PROMPT_CLS = dict(
    # ... other prompts
    my_prompt=MyPrompt(),
)
```
- Add the new prompt name to the `Literal` type hint of the `promptname` argument in the `load_prompt` function:
```python
def load_prompt(
    promptname: Literal[
        # ... other prompt names
        "my_prompt",
    ],
    # ...
):
```
#### 3. Add the Prompt to `eval_prompt_list.txt`
- Open the `eval_prompt_list.txt` file in the project root and add the name of your new prompt (`my_prompt`) on a new line.
#### 4. (Recommended) Test and Debug
- It is highly recommended to debug your prompt to make sure it works as expected.
- In `.vscode/launch.json`, modify the `"VA"` configuration's `args`:
  - Change `"-p", "translation_fortunecookie"` to `"-p", "my_prompt"`.
  - If necessary, update the `"-i", "..."` argument to point to test data suited to your new prompt.
- Open the `Run and Debug` tab in VS Code (Ctrl+Shift+D), select the "VA" configuration, and press F5 to start the debugger.
- Check `result.json` inside the output directory you specified with `-o`; it records every judge prompt used for each match.
## FAQ
* I want to use my own judge prompt with Arena-Lite.
  * [`./varco_arena/prompts/`](./varco_arena/prompts/__init__.py) defines each prompt via a `yaml` file plus a corresponding class object. Edit these as needed.
* I want a different judge prompt for different rows of the test set (e.g. rows up to 100 use `prompt1`, rows from 101 on use `prompt2`).
  * `load_prompt` (linked above) receives `promptname` and `task` as parameters when loading a prompt; the function is called in [`./varco_arena/manager.py:async_run`](./varco_arena/manager.py). A hypothetical routing sketch follows below.
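A hypothetical sketch of that idea: since `load_prompt` already receives `task` alongside `promptname`, you can branch on `task` to pick a prompt per row. The mapping below is made up for illustration; `my_prompt` refers to the custom prompt from the section above, and the real `load_prompt` in `prompts/__init__.py` does more than this:

```python
# Hypothetical per-row routing inside load_prompt (illustrative only).
# `task` comes from your test set rows; the mapping below is invented.
TASK2PROMPT = {
    "summarization": "my_prompt",  # rows tagged "summarization" use the custom prompt
}

def load_prompt(promptname, task=None):
    # fall back to the prompt chosen via -p / --evalprompt for unmapped tasks
    chosen = TASK2PROMPT.get(task, promptname)
    return NAME2PROMPT_CLS[chosen]
```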
## Special Thanks to (contributors)
- Minho Lee (@Dialogue Model Team, NCSOFT) [github](https://github.com/minolee/)
  - query wrapper
  - rag prompt
- Jumin Oh (@Generation Model Team, NCSOFT)
  - overall prototyping of the system in haste
## Citation
If you found our work helpful, consider citing our paper!
```
@misc{son2024varcoarenatournamentapproach,
      title={VARCO Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
      author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
      year={2024},
      eprint={2411.01281},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.01281},
}
```