---
language:
- en
license: llama2
library_name: transformers
tags:
- llama-2
- code
datasets:
- jondurbin/airoboros-2.2
- Open-Orca/OpenOrca
- garage-bAInd/Open-Platypus
- WizardLM/WizardLM_evol_instruct_V2_196k
- TokenBender/python_eval_instruct_51k
pipeline_tag: text-generation
model-index:
- name: SpeechlessCoder
  results:
  - task:
      type: text-generation
    dataset:
      name: HumanEval
      type: openai_humaneval
    metrics:
    - type: pass@1
      value: 52.439
      name: pass@1
      verified: false
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 41.13
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=speechlessai/speechless-coding-7b-16k-tora
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 64.48
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=speechlessai/speechless-coding-7b-16k-tora
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 38.86
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=speechlessai/speechless-coding-7b-16k-tora
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 44.95
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=speechlessai/speechless-coding-7b-16k-tora
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 63.85
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=speechlessai/speechless-coding-7b-16k-tora
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 17.06
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=speechlessai/speechless-coding-7b-16k-tora
      name: Open LLM Leaderboard
---

	<p><h1> speechless-coding-7b-16k-tora </h1></p>

Use the following datasets to fine-tune llm_agents/tora-code-7b-v1.0 in order to improve the model's reasoning and planning abilities.

- context window length: 16,384
- prompt_type = "alpaca"
- max_tokens > 128 && < 16384

Total 177,333 samples, 316 MB:
	- jondurbin/airoboros-2.2: Filter categories related to coding, reasoning and planning. 21,923 samples.
	- Open-Orca/OpenOrca: Filter the 'cot' category in 1M GPT4 dataset. 62,973 samples.
	- garage-bAInd/Open-Platypus: 100%, 22,760 samples.
- WizardLM/WizardLM_evol_instruct_V2_196k: Coding conversation part. 30,081 samples.
- TokenBender/python_eval_instruct_51k: Samples with "python" in the output. 39,596 samples.
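
As a quick consistency check on the list above, the per-dataset counts can be summed and compared with the reported total of 177,333 samples. The sketch below also includes a length filter matching the `max_tokens > 128 && < 16384` bound. The function name `within_length_bounds` and the reading of the bound as a strict inequality on a per-sample token count are assumptions, not something taken from the training code.

```python
# Sample counts per source dataset, copied from the list above.
SAMPLE_COUNTS = {
    "jondurbin/airoboros-2.2": 21_923,
    "Open-Orca/OpenOrca": 62_973,
    "garage-bAInd/Open-Platypus": 22_760,
    "WizardLM/WizardLM_evol_instruct_V2_196k": 30_081,
    "TokenBender/python_eval_instruct_51k": 39_596,
}

# Bounds from "max_tokens > 128 && < 16384" (assumed to be strict).
MIN_TOKENS, MAX_TOKENS = 128, 16_384


def within_length_bounds(num_tokens: int) -> bool:
    """Hypothetical filter: keep samples strictly between the two bounds."""
    return MIN_TOKENS < num_tokens < MAX_TOKENS


total = sum(SAMPLE_COUNTS.values())
# Fraction of the mixture contributed by each dataset.
shares = {name: count / total for name, count in SAMPLE_COUNTS.items()}

if __name__ == "__main__":
    print(f"total samples: {total:,}")
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {share:.1%}")
```

The five counts add up to exactly the stated total, so the list appears to be complete.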


Generation settings for evaluation: 50 samples / T=0.2 / MaxTokens=512 / Top_P=0.95
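
A minimal sketch of prompting the model with the settings above, using the standard `transformers` API. Several things here are assumptions: the prompt template is the common Alpaca no-input template (the card only says `prompt_type = "alpaca"`, so the exact wording used in training may differ), `MaxTokens=512` is mapped to `max_new_tokens`, and the helper names `build_prompt` and `complete` are illustrative. The model ID is taken from the leaderboard URLs in this card.

```python
MODEL_ID = "speechlessai/speechless-coding-7b-16k-tora"

# Assumption: the widely used Alpaca no-input template.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

# Sampling settings from the line above (T=0.2, Top_P=0.95, MaxTokens=512).
GENERATION_KWARGS = {
    "do_sample": True,
    "temperature": 0.2,
    "top_p": 0.95,
    "max_new_tokens": 512,  # assumed mapping of "MaxTokens"
}


def build_prompt(instruction: str) -> str:
    """Wrap a plain instruction in the assumed Alpaca template."""
    return ALPACA_TEMPLATE.format(instruction=instruction.strip())


def complete(instruction: str) -> str:
    """Generate one completion. Needs torch, transformers, and accelerate,
    plus enough memory for a 7B model; the weights are downloaded on first use."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(build_prompt(instruction), return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, **GENERATION_KWARGS)
    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `complete("Write a Python function that reverses a string.")` would return a single sampled completion; the imports live inside the function so the prompt helper can be used without the heavy dependencies.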

	Code: https://github.com/uukuguy/speechless

	## HumanEval

| Metric | Value |
| --- | --- |
| humaneval-python | 52.44 |
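
With several samples per problem (50 in the settings listed earlier), pass@1 is commonly computed with the unbiased pass@k estimator introduced alongside HumanEval. The card does not say which implementation produced the 52.44 figure, so the following is only a sketch of that standard estimator; the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimates over the whole benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

For k = 1 the estimator reduces to the fraction of correct samples, so a problem with 25 correct completions out of 50 contributes 0.5 to the average.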

For comparison, scores listed on the [Big Code Models Leaderboard](https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard):

	CodeLlama-34B-Python: 53.29

	CodeLlama-34B-Instruct: 50.79

	CodeLlama-13B-Instruct: 50.6

	CodeLlama-34B: 45.11

	CodeLlama-13B-Python: 42.89

	CodeLlama-13B: 35.07

	## MultiPL-E

| Metric | Value |
| --- | --- |
| python | 55.96 |
| java | 37.84 |
| javascript | 46.93 |
| cpp | 37.48 |
| rust | 29.01 |
| go | 28.99 |
| sh | 12.11 |
| julia | 31.47 |
| typescript | 47.80 |

	## LMEval

[Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)

| Metric | Value |
| --- | --- |
| ARC | |
| HellaSwag | |
| MMLU | |
| TruthfulQA | |
| Average | |

	## Parameters

| Parameter | Value |
| ------ | ------ |
| lr | 2e-4 |
| lr_scheduler_type | cosine |
| weight_decay | 0.0 |
| optim | paged_adamw_8bit |
| flash_attention | True |
| rerope | False |
| max_new_tokens | 16384 |
| num_train_epochs | 2 |
| bits | 4 |
| lora_r | 64 |
| lora_alpha | 256 |
| lora_dropout | 0.05 |
| double_quant | True |
| quant_type | nf4 |
| dataset_format | sharegpt |
| mini_batch_size | 2 |
| gradient_accumulation_steps | 32 |
| bf16 | True |
	A100-40G x 4
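
A few derived quantities follow from the table and the hardware line. The sketch below rests on assumptions the card does not confirm: that `mini_batch_size` is per GPU, that the four GPUs run data-parallel, and that all 177,333 samples are used for training without packing or a held-out split. The class name `TrainSetup` is illustrative.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class TrainSetup:
    mini_batch_size: int = 2             # from the table; assumed to be per GPU
    gradient_accumulation_steps: int = 32
    num_gpus: int = 4                    # "A100-40G x 4", assumed data-parallel
    num_train_epochs: int = 2
    lora_r: int = 64
    lora_alpha: int = 256
    num_samples: int = 177_333           # dataset total reported earlier

    @property
    def effective_batch_size(self) -> int:
        """Samples consumed per optimizer step across all GPUs."""
        return self.mini_batch_size * self.gradient_accumulation_steps * self.num_gpus

    @property
    def lora_scale(self) -> float:
        """Standard LoRA scales the low-rank update by alpha / r."""
        return self.lora_alpha / self.lora_r

    @property
    def optimizer_steps(self) -> int:
        """Approximate optimizer steps over all epochs."""
        return ceil(self.num_samples / self.effective_batch_size) * self.num_train_epochs


setup = TrainSetup()
```

Under these assumptions the effective batch size is 2 x 32 x 4 = 256 samples, the LoRA scaling factor is 256 / 64 = 4, and two epochs correspond to roughly 1,386 optimizer steps.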


	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_speechlessai__speechless-coding-7b-16k-tora)

| Metric |Value|
|---------------------------------|----:|
|Avg. |45.05|
|AI2 Reasoning Challenge (25-Shot)|41.13|
|HellaSwag (10-Shot) |64.48|
|MMLU (5-Shot) |38.86|
|TruthfulQA (0-shot) |44.95|
|Winogrande (5-shot) |63.85|
|GSM8k (5-shot) |17.06|
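
The "Avg." row appears to be the unweighted mean of the six benchmark scores. A short check, with the scores copied from the table above:

```python
from statistics import mean

# Benchmark scores copied from the leaderboard table.
SCORES = {
    "AI2 Reasoning Challenge (25-Shot)": 41.13,
    "HellaSwag (10-Shot)": 64.48,
    "MMLU (5-Shot)": 38.86,
    "TruthfulQA (0-shot)": 44.95,
    "Winogrande (5-shot)": 63.85,
    "GSM8k (5-shot)": 17.06,
}

average = mean(SCORES.values())

if __name__ == "__main__":
    print(f"unweighted mean over {len(SCORES)} benchmarks: {average:.3f}")
```

The mean comes out at about 45.05, in line with the reported value.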