# Arena-Lite (formerly VARCO Arena)

Arena-Lite runs a tournament among the compared models for every instruction in your test set, producing an accurate ranking of the models. This is more precise, and slightly cheaper, than scoring win rates against reference outputs.

For more details, see the links below.

* [Paper](https://arxiv.org/abs/2411.01281)
* [NCSOFT Tech Blog (KR)](https://ncsoft.github.io/ncresearch/12cc62c1ea0d981971a8923401e8fe6a0f18563d)
## Quickstart
### Run the Streamlit app locally (recommended!)
```bash
git clone [THIS_REPO]
# install requirements below. we recommend miniforge to manage environment
cd streamlit_app_local
bash run.sh
```
For more details, see `[THIS_REPO]/streamlit_app_local/README.md`!
### CLI use
* The CLI and the web app share the same code, which lives in the directory below.
  * `varco_arena/`
* Test commands for each preset prompt (for debugging in VS Code) are written in the following file:
  * `varco_arena/.vscode/launch.json`
```bash
## gpt-4o-mini as a judge
python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -m tournament -e "gpt-4o-mini"
## vllm-openai served LLM as a judge
python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -e SOME_MODEL_NAME_SERVED -m tournament -u "http://url_to/your/vllm_openai_server:someport"

# dbg lines
## openai api judge dbg
python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
## other testing lines
python main.py -i "rsc/inputs_for_dbg/[SOME_DIRECTORY]/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
## dummy judge dbg (checking errors without api requests)
python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e debug
```
## Requirements
Tested on `python = 3.11.9`. `requirements.txt`:
```
openai>=1.17.0
munch
pandas
numpy
tqdm>=4.48.0
plotly
scikit-learn
kaleido
tiktoken>=0.7.0
pyyaml
transformers
streamlit>=1.40.2
openpyxl
fire==0.6.0
git+https://github.com/shobrook/openlimit.git#egg=openlimit # do not install this by pypi

# on Linux
uvloop
# on Windows
winloop
```
#### Arguments
- -i, --input : input file, directory, or regex over file names
- -o, --output_dir : directory where the output files are saved
- -e, --evaluation : judge model (e.g. "gpt-4o-2024-05-13", "gpt-4o-mini", the name of a model served on vLLM, etc.)
- -m, --matching_method : matching method (default: "tournament"; "league" is not recommended)
- -k, --openai_api_key : OpenAI API key
- -u, --openai_url : URL (IP address + port) of a local vLLM OpenAI-compatible server
#### Advanced
- -j, --n_jobs : argument passed to asyncio.Semaphore(). If the arena stalls, try lowering it below the default of 32.
- -p, --evalprompt : [see this directory](./varco_arena/prompts/*.yaml)
- -lr, --limit_requests : vLLM OpenAI server request limit (default: 7,680)
- -lt, --limit_tokens : vLLM OpenAI server token limit (default: 15,728,640)
#### Input Data Format
[Input jsonl guide (KR)](./streamlit_app_local/guide_mds/input_jsonls_kr.md)
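For orientation only, here is a guessed minimal shape of the input (one JSON object per line, one line per model output). The `task`, `generated`, and `model_id` fields correspond to the reserved template keywords mentioned in the custom-prompt section; any other field name here (e.g. `instruction`) is an assumption, so treat the linked guide as authoritative.

```jsonl
{"model_id": "model_A", "task": "summarization", "instruction": "Summarize the following article.", "generated": "The article argues that ..."}
{"model_id": "model_B", "task": "summarization", "instruction": "Summarize the following article.", "generated": "In short, the article says ..."}
```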
## Contributing & Customizing
#### After git clone and installing dependencies
```bash
pip install pre-commit
pre-commit install
```
#### Before you commit
```bash
bash precommit.sh # this reformats all the code for you
```
### Adding a Custom Prompt
The process for adding a new evaluation prompt is as follows. The judge logic was recently simplified to rely only on the `parsed_output` method, so adding a prompt is easier than before.

The simplest way is to copy `llmbar_brief.py` and `llmbar_brief.yaml` and build your own prompt from them.
#### 1. Create the prompt `.py` and `.yaml` files
- Create files such as `my_prompt.py` and `my_prompt.yaml` under `varco_arena/varco_arena_core/prompts/`.
- **`my_prompt.py`**:
  - Define a class that inherits from `ComparisonPromptBase`.
  - You must implement the `parsed_output(self, response)` method. It receives the LLM judge's reply (`response`) and must return the decision token indicating the winner (e.g. `'a'`, `'b'`).
- **`my_prompt.yaml`**:
  - Define the elements the prompt needs, such as `sampling_parameters`, `decision_tokens`, and `prompt_template`.
  - The string in `prompt_template` is processed as a `string.Template` and is finalized in `eval_utils.py` via `BasePrompt.complete_prompt()`.
  - Do not use `${task}`, `${generated}`, or `${model_id}` in `prompt_template`; these are reserved keywords.
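The `parsed_output` contract described above can be sketched roughly as follows. This is a hypothetical illustration, not the repository's code: the real class inherits `ComparisonPromptBase` from `varco_arena_core/prompts`, which is stubbed out here so the snippet runs standalone.

```python
class ComparisonPromptBase:
    """Stub standing in for the repo's base class (illustration only)."""
    pass


class MyPrompt(ComparisonPromptBase):
    # should mirror decision_tokens in my_prompt.yaml
    DECISION_TOKENS = ("a", "b")

    def parsed_output(self, response: str) -> str:
        """Map the judge's raw reply to a decision token ('a' or 'b')."""
        # tolerate replies like "a", "(b)", or "Answer: a"
        text = response.strip().lower().strip("().")
        for token in self.DECISION_TOKENS:
            if text == token or text.endswith(token):
                return token
        raise ValueError(f"no decision token found in: {response!r}")
```

In the real class the accepted tokens and any reply-cleanup rules should follow whatever your `prompt_template` asks the judge to emit; the string handling above is only illustrative.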
#### 2. Register the prompt in `prompts/__init__.py`
- `import` the prompt class you created.
```python
from .my_prompt import MyPrompt
```
- Add the prompt name and a class instance to the `NAME2PROMPT_CLS` dictionary.
```python
NAME2PROMPT_CLS = dict(
    # ... existing prompts
    my_prompt=MyPrompt(),
)
```
- Add the new prompt name to the `Literal` type hint of the `promptname` argument of `load_prompt`.
```python
def load_prompt(
    promptname: Literal[
        # ... existing prompt names
        "my_prompt",
    ],
    # ...
):
```
#### 3. Add the prompt to `eval_prompt_list.txt`
- Open `eval_prompt_list.txt` at the project root and add the new prompt's name (`my_prompt`) on a new line.
#### 4. (Recommended) Test and debug
- Debugging is recommended to verify that the prompt works as intended.
- In `.vscode/launch.json`, edit the `args` of the `"VA"` configuration as follows:
  - Change `"-p", "translation_fortunecookie"` to `"-p", "my_prompt"`.
  - If needed, point the `"-i", "..."` part at test data suited to the new prompt.
- Open the `Run and Debug` panel in VS Code (Ctrl+Shift+D), select the "VA" configuration, and press F5 to start the debugger.
- Find `result.json` in the output directory given after `-o` and check that everything behaved as intended. It records the prompt used for every judge and match.
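After step 3, `eval_prompt_list.txt` would look something like the fragment below. Only `llmbar_brief` and `translation_fortunecookie` are names mentioned elsewhere in this README; check the file itself for the actual list shipped with the repo.

```
llmbar_brief
translation_fortunecookie
my_prompt
```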
Contact: Seonil Son

* I want to use my own prompt.
  * [`./varco_arena/prompts/`](./varco_arena_core/prompts/__init__.py) loads prompts defined as prompt classes plus `yaml` files. Refer to the presets when writing yours.
* I want to use a different evaluation prompt per test case (e.g. a different prompt depending on the task).
  * Via the `load_prompt` function linked above, prompts are loaded as `promptname` + `task` in [`./varco_arena_core/manager.py:async_run`](./varco_arena_core/manager.py).
## Special Thanks to (Contributors)
- Minho Lee (@Dialogue Model Team, NCSOFT) [github](https://github.com/minolee/)
  - query wrapper
  - rag prompt
- Ju-Min Oh (@Generation Model Team, NCSOFT)
  - overall prototyping of the system in haste
## Citation
If our work has been helpful to you, please return the favor by citing us!
```
@misc{son2024varcoarenatournamentapproach,
      title={VARCO Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
      author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
      year={2024},
      eprint={2411.01281},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.01281},
}
```