|
# parseny/TinyLlama1.1B-Nvidia-QA |
|
|
|
This repository contains the parseny/TinyLlama1.1B-Nvidia-QA model, a fine-tuned version of the TinyLlama language model designed to answer questions about NVIDIA documentation. The model was fine-tuned on a [dataset of question-answer pairs](https://www.kaggle.com/datasets/gondimalladeepesh/nvidia-documentation-question-and-answer-pairs) drawn from NVIDIA documentation and evaluated with ROUGE and METEOR.
|
|
|
## Model Details |
|
|
|
- **Model ID**: parseny/TinyLlama1.1B-Nvidia-QA |
|
- **Model Type**: Causal Language Model |
|
- **Base Model**: TinyLlama-1.1B |
|
- **Quantization**: 4-bit quantization using BitsAndBytes |
|
- **Fine-Tuning Framework**: Hugging Face Transformers and PEFT |
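
Because the model was trained with BitsAndBytes quantization, it can also be loaded in 4-bit for inference. Below is a minimal sketch, assuming a CUDA GPU and the `bitsandbytes` package; the quantization settings shown are illustrative, not recorded training values:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 4-bit settings; adjust as needed.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "parseny/TinyLlama1.1B-Nvidia-QA",
    quantization_config=bnb_config,
    device_map="auto",
)
```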
|
|
|
## Training Configuration |
|
|
|
The model was fine-tuned with the following training arguments: |
|
|
|
```python
from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir="./logs",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    fp16=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=5,
    load_best_model_at_end=True,
    learning_rate=5e-4
)
```
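
With `per_device_train_batch_size=16` and `gradient_accumulation_steps=4`, the effective batch size is 64 examples per device. Setting both `evaluation_strategy` and `save_strategy` to `"epoch"` is required for `load_best_model_at_end=True`, which restores the checkpoint with the lowest evaluation loss once training finishes.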
|
|
|
## Evaluation Metrics |
|
|
|
The performance of the fine-tuned model was evaluated using the following metrics: |
|
|
|
- **ROUGE Scores**: |
|
- **ROUGE-1**: 0.3122 |
|
- **ROUGE-2**: 0.1228 |
|
- **ROUGE-L**: 0.2599 |
|
- **ROUGE-Lsum**: 0.2600 |
|
|
|
- **METEOR Score**: 0.27 |
|
|
|
These scores indicate that the model generates answers with reasonable lexical overlap with the reference answers (ROUGE), while METEOR, which also credits stem and synonym matches, confirms a moderate degree of semantic similarity.
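
The evaluation script is not included here; the sketch below shows how scores like these can be computed with the Hugging Face `evaluate` library, assuming lists of generated and reference answers (the example strings are hypothetical):

```python
import evaluate

# Hypothetical data; substitute model outputs and gold answers.
predictions = ["The RAID array caches dataset reads for faster training."]
references = ["The DGX RAID memory was set up to cache dataset reads."]

rouge = evaluate.load("rouge")    # reports rouge1, rouge2, rougeL, rougeLsum
meteor = evaluate.load("meteor")

print(rouge.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=references))
```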
|
|
|
## Model Usage |
|
|
|
You can use this model to generate answers in chat-based applications. Below is an example of how to load the model and generate a response:
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

# Load the model and tokenizer
model_id = "parseny/TinyLlama1.1B-Nvidia-QA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.to('cuda')

# Generation settings
generation_config = GenerationConfig(
    penalty_alpha=0.6, do_sample=True,
    top_k=5, temperature=0.5, repetition_penalty=1.2,
    max_new_tokens=47, pad_token_id=tokenizer.eos_token_id
)

def generate_response(question):
    # The post-processing below expects ChatML-style turn markers, so the
    # question is wrapped in the same format before generation.
    prompt = f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"
    try:
        inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
        outputs = model.generate(**inputs, generation_config=generation_config)
        generated_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        # Keep only the assistant's turn.
        marker = '<|im_start|>assistant\n'
        start_idx = generated_response.find(marker)
        if start_idx != -1:
            generated_response = generated_response[start_idx + len(marker):]
        end_idx = generated_response.find('<|im_end|>')
        if end_idx != -1:
            generated_response = generated_response[:end_idx]
        return generated_response.strip()
    except Exception:
        return ""

# Example usage
prompt = "What was the purpose of setting up the DGX RAID memory in version 2 of the pipeline?"
response = generate_response(prompt)
print(response)
```
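
Note that with `do_sample=True`, `generate` uses top-k sampling; `penalty_alpha` only triggers contrastive search when sampling is disabled, so it has no effect with these settings. The small `max_new_tokens=47` keeps answers short; increase it for longer responses.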
|
|
|
## Training Procedure |
|
|
|
The model was fine-tuned on the question-answer dataset linked above. The fine-tuning process involved the following steps (a code sketch follows the list):
|
|
|
1. Loading the pre-trained TinyLlama-1.1B model. |
|
2. Quantizing the model to 4-bit precision to reduce memory usage and increase inference speed. |
|
3. Fine-tuning the model using the `SFTTrainer` with the specified training arguments. |
|
4. Evaluating the model at the end of each epoch and saving the best-performing model. |
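
The training script itself is not included in this repository. Below is a minimal sketch of the setup described above, assuming a QLoRA-style recipe and the older `trl` API in which `dataset_text_field` and `max_seq_length` are passed to `SFTTrainer` directly; the base checkpoint name, LoRA hyperparameters, and the `dataset` variable are illustrative placeholders, not recorded values.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer

# Steps 1-2: load the pre-trained base model in 4-bit precision.
base_model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed base checkpoint
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Illustrative LoRA settings; the actual rank and targets are not recorded here.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# Steps 3-4: SFTTrainer injects the adapters and, with the arguments above,
# evaluates each epoch and restores the best checkpoint at the end.
trainer = SFTTrainer(
    model=model,
    args=training_arguments,         # the TrainingArguments shown earlier
    train_dataset=dataset["train"],  # placeholder: formatted QA pairs
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    peft_config=peft_config,
    dataset_text_field="text",       # assumes a ChatML-formatted "text" column
    max_seq_length=512,
)
trainer.train()
```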
|
|
|
## How to Cite |
|
|
|
If you use this model in your research or applications, please cite it as follows: |
|
|
|
```
@misc{parseny-tinyllama-nvidia-qa,
  author = {Your Name},
  title = {TinyLlama1.1B-Nvidia-QA: NVIDIA documentation helper},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/parseny/TinyLlama1.1B-Nvidia-QA},
}
```
|
|
|
## Contact |
|
|
|
For any questions or issues, please open an issue on the Hugging Face model repository. |