---
library_name: transformers
language:
- en
license: apache-2.0
base_model: BEE-spoke-data/tFINE-680m-e32-d16-infinity_instruct-L1
tags:
- gqa
- t5
- instruct
datasets:
- pszemraj/infinity-instruct-7m-T2T_en
pipeline_tag: text2text-generation
---
|
|
|
|
|
# tFINE-680m-e32-d16-infinity_instruct-L2

This is an instruction-tuned version of a pretrained T5 model that uses grouped-query attention (GQA).
|
|
|
## Model description

This model is a fine-tuned version of [BEE-spoke-data/tFINE-680m-e32-d16-infinity_instruct-L1](https://huggingface.co/BEE-spoke-data/tFINE-680m-e32-d16-infinity_instruct-L1) on the pszemraj/infinity-instruct-7m-T2T_en dataset (config `deduped-L2`).
|
|
|
It achieves the following results on the evaluation set:

- Loss: 1.3139
- Num Input Tokens Seen: 361724696
|
|
|
## Usage

Prerequisites: you need the [t5-gqa fork of transformers](https://huggingface.co/BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan#testing) installed, along with `accelerate`.
|
|
|
```py
from transformers import pipeline

pipe = pipeline(
    "text2text-generation",
    model="BEE-spoke-data/tFINE-680m-e32-d16-infinity_instruct-L2",
    device_map="auto",
)

prompt = "Write me a python fn that demonstrates an advanced sorting algorithm"
res = pipe(
    prompt, max_new_tokens=384, num_beams=4, early_stopping=True, repetition_penalty=1.1
)
print(res[0]["generated_text"])
```
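
If you prefer to manage tokenization and decoding yourself, the same generation can be written with the Auto classes. This is a minimal sketch, assuming the t5-gqa fork registers the architecture under the standard `AutoModelForSeq2SeqLM` interface; the decoding settings mirror the pipeline example above.

```py
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "BEE-spoke-data/tFINE-680m-e32-d16-infinity_instruct-L2"

# assumption: the t5-gqa fork exposes the model through the usual Auto classes
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Write me a python fn that demonstrates an advanced sorting algorithm"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# same decoding settings as the pipeline example
outputs = model.generate(
    **inputs,
    max_new_tokens=384,
    num_beams=4,
    early_stopping=True,
    repetition_penalty=1.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```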
|
|
|
## Quick eval

Quick eval for: `BEE-spoke-data/tFINE-680m-e32-d16-infinity_instruct-L2`

hf (pretrained=BEE-spoke-data/tFINE-680m-e32-d16-infinity_instruct-L2,trust_remote_code=True,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
|
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|------|
|boolq | 2|none | 0|acc |↑ |0.6364|± |0.0084|
|openbookqa | 1|none | 0|acc |↑ |0.1480|± |0.0159|
| | |none | 0|acc_norm|↑ |0.2860|± |0.0202|
|piqa | 1|none | 0|acc |↑ |0.6083|± |0.0114|
| | |none | 0|acc_norm|↑ |0.6132|± |0.0114|
|social_iqa | 0|none | 0|acc |↑ |0.3854|± |0.0110|
|tinyArc | 0|none | 25|acc_norm|↑ |0.3122|± | N/A|
|tinyHellaswag| 0|none | 10|acc_norm|↑ |0.3356|± | N/A|
|tinyMMLU | 0|none | 0|acc_norm|↑ |0.2793|± | N/A|
|winogrande | 1|none | 0|acc |↑ |0.5201|± |0.0140|
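
The table above is in lm-evaluation-harness output format. Below is a hedged sketch of launching a comparable run through the harness's Python API; the task list mirrors the standard tasks in the table (the `tiny*` tasks come from the separate tinyBenchmarks task pack and are omitted here), and the exact harness version may affect scores slightly.

```py
import lm_eval

# sketch only: model_args mirror the header line above the results table
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=BEE-spoke-data/tFINE-680m-e32-d16-infinity_instruct-L2,"
        "dtype=bfloat16,trust_remote_code=True"
    ),
    tasks=["boolq", "openbookqa", "piqa", "social_iqa", "winogrande"],
    batch_size=8,
)
print(results["results"])
```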
|
|
|
## Training procedure

### Training hyperparameters
|
|
|
The following hyperparameters were used during training (a sketch of how they might map to `Seq2SeqTrainingArguments` follows the list):

- learning_rate: 2.5e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 17868
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 32
- total_train_batch_size: 256
- total_eval_batch_size: 8
- optimizer: paged_ademamix_32bit (no additional optimizer arguments)
- lr_scheduler_type: constant_with_warmup
- lr_scheduler_warmup_ratio: 0.02
- num_epochs: 1.0
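
A minimal sketch of how these settings might be expressed as `Seq2SeqTrainingArguments`; the `output_dir` and `bf16` values are assumptions, the evaluation cadence is inferred from the results table below, and the two-GPU data parallelism comes from the launcher (e.g. `accelerate launch`) rather than from these arguments:

```py
from transformers import Seq2SeqTrainingArguments

# illustrative sketch: reproduces the reported hyperparameters only; dataset
# preprocessing and the trainer/data-collator setup are not shown
training_args = Seq2SeqTrainingArguments(
    output_dir="tFINE-680m-e32-d16-infinity_instruct-L2",  # placeholder
    learning_rate=2.5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=32,
    num_train_epochs=1.0,
    lr_scheduler_type="constant_with_warmup",
    warmup_ratio=0.02,
    optim="paged_ademamix_32bit",  # paged AdEMAMix (32-bit), via bitsandbytes
    seed=17868,
    bf16=True,  # assumption, consistent with the bfloat16 eval above
    eval_strategy="steps",
    eval_steps=1000,  # matches the validation cadence in the results table
)
```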
|
|
|
### Training results

| Training Loss | Epoch  | Step | Validation Loss | Input Tokens Seen |
|:-------------:|:------:|:----:|:---------------:|:-----------------:|
| 1.4008        | 0.2534 | 1000 | 1.4020          | 91375832          |
| 1.3456        | 0.5068 | 2000 | 1.3669          | 182939052         |
| 1.3437        | 0.7602 | 3000 | 1.3378          | 274855796         |