ctranslate2-4you
/

Mistral-Small-Instruct-2409-ct2-AWQ

	---
	base_model:
	- mistralai/Mistral-Small-Instruct-2409
	---

	# Mistral-Small-Instruct CTranslate2 Model

	This repository contains a CTranslate2 version of the [Mistral-Small-Instruct model](https://huggingface.co/mistralai/Mistral-Small-Instruct-2409). The conversion process involved AWQ quantization followed by CTranslate2 format conversion.
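
The second step, converting an already AWQ-quantized checkpoint to CTranslate2 format, might look roughly like the sketch below. It assumes the `TransformersConverter` class from `ctranslate2.converters`. The helper names `ct2_output_dir` and `convert_awq_checkpoint` and the `-ct2` directory suffix are hypothetical, not taken from this repository. No `quantization` argument is passed, on the assumption that the weights are already 4-bit.

```python
from pathlib import Path


def ct2_output_dir(awq_model_dir: str) -> Path:
    """Derive a sibling output directory (hypothetical naming scheme)."""
    src = Path(awq_model_dir)
    return src.with_name(src.name + "-ct2")


def convert_awq_checkpoint(awq_model_dir: str) -> Path:
    """Convert an AWQ-quantized Hugging Face checkpoint to CTranslate2 format."""
    # Imported lazily so the path helper works without ctranslate2 installed.
    from ctranslate2.converters import TransformersConverter

    out_dir = ct2_output_dir(awq_model_dir)
    converter = TransformersConverter(awq_model_dir)
    # No quantization argument: the AWQ weights are assumed to be kept as-is.
    converter.convert(str(out_dir), force=True)
    return out_dir
```

Because the sample script loads the tokenizer from the converted model directory, the tokenizer files would also need to end up there; the converter's `copy_files` argument is one way to do that.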

	## Quantization Parameters

	The following AWQ parameters were used:
	- `zero_point=true`
	- `q_group_size=128`
	- `w_bit=4`
	- `version=gemv`
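
As a rough illustration, the parameters above could be written as an AutoAWQ-style `quant_config` dictionary. The helper `approx_bits_per_weight` is hypothetical and only estimates storage cost. It assumes one 16-bit scale per group plus, when `zero_point` is enabled, one `w_bit`-wide zero point per group; the actual packed layout may differ.

```python
# The parameters above, expressed as an AutoAWQ-style quant_config dict.
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMV",
}


def approx_bits_per_weight(config, scale_bits=16):
    """Rough estimate: payload bits plus per-group scale/zero-point overhead.

    Assumption: one scale of `scale_bits` bits per group and, if zero_point
    is enabled, one zero point of `w_bit` bits per group.
    """
    overhead = scale_bits + (config["w_bit"] if config["zero_point"] else 0)
    return config["w_bit"] + overhead / config["q_group_size"]


print(approx_bits_per_weight(quant_config))
```

Under these assumptions the group size of 128 adds only a fraction of a bit per weight on top of the 4-bit payload, which is why larger groups are cheaper to store but coarser.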

	## Quantization Process

	The quantization was performed using the [AutoAWQ library](https://casper-hansen.github.io/AutoAWQ/examples/). AutoAWQ supports two quantization approaches:

	1. Without calibration data:
	   - Quick process (a few minutes)
	   - Uses the standard quantization scheme
	   - Suitable for general use cases

	2. With calibration data:
	   - Longer process (3-4 hours on an RTX 4090)
	   - Uses the calibration data to identify and protect the weights that matter most for the targeted tasks
	   - Slightly better performance for targeted tasks

	## Calibration Details

	This model was quantized with calibration data. Specifically, the [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k) dataset was used, which is good for overall QA and instruction-following.

	Key parameters:
	- `max_calib_seq_len`: 8192 (maximum length, in tokens, of each calibration sequence; chosen to favor long-form responses)
	- `text_token_length`: 2048 (minimum token length of the dataset samples used during quantization)

	These parameters do not change the model's architecture; they bias the quantization toward specific input-output length patterns and topic domains.
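
A calibrated quantization run along these lines might be sketched as follows. This is not the exact script used for this model. It assumes an AutoAWQ release whose `quantize` method accepts `calib_data` and `max_calib_seq_len`, and that the dataset exposes `text` and `text_token_length` columns. The function names and paths are hypothetical.

```python
MIN_TOKENS = 2048      # minimum text_token_length of a calibration sample
MAX_CALIB_SEQ_LEN = 8192

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMV"}


def select_calibration_texts(rows, min_tokens=MIN_TOKENS):
    """Keep only the texts whose recorded token length reaches min_tokens."""
    return [row["text"] for row in rows if row["text_token_length"] >= min_tokens]


def quantize_with_cosmopedia(model_path, quant_path):
    """Quantize model_path with AWQ, calibrating on long cosmopedia samples."""
    # Heavy imports are kept local so the filter above can be used on its own.
    from awq import AutoAWQForCausalLM
    from datasets import load_dataset
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    rows = load_dataset("HuggingFaceTB/cosmopedia-100k", split="train")
    model.quantize(
        tokenizer,
        quant_config=quant_config,
        calib_data=select_calibration_texts(rows),
        max_calib_seq_len=MAX_CALIB_SEQ_LEN,
    )

    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
```

Filtering on the precomputed `text_token_length` column avoids tokenizing the whole dataset just to discard short samples.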

	## Requirements

	- `torch 2.2.2`
	- `ctranslate2 4.4.0`
	- NOTE: The soon-to-be-released `ctranslate2 4.5.0` will support `torch` versions greater than 2.2.2. These instructions will be updated when that occurs.
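
To confirm that an environment matches these pins before loading the model, a small check like the one below could be used. It is only a sketch: the helper names are hypothetical, and it assumes that a local build suffix such as `+cu121` on the installed `torch` version can be ignored.

```python
from importlib import metadata

PINNED = {"torch": "2.2.2", "ctranslate2": "4.4.0"}


def base_version(version: str) -> str:
    """Drop a local build suffix such as '+cu121' from a version string."""
    return version.split("+", 1)[0]


def check_pinned(pinned=PINNED):
    """Return {package: (installed, expected)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = base_version(metadata.version(name))
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (installed, expected)
    return mismatches


if __name__ == "__main__":
    for name, (installed, expected) in check_pinned().items():
        print(f"{name}: found {installed}, expected {expected}")
```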

	## Sample Script

	```python
	import gc
	import os

	import ctranslate2
	import torch
	from transformers import AutoTokenizer

	system_message = "You are a helpful person who answers questions."
	user_message = "Hello, how are you today? I'd like you to write me a funny poem that is a parody of Milton's Paradise Lost if you are familiar with that famous epic poem?"

	model_dir = r"D:\Scripts\bench_chat\models\mistralai--Mistral-Small-Instruct-2409-AWQ-ct2-awq" # uses ~13.8 GB

	beam_size = 1  # 1 = greedy decoding


	def build_prompt_mistral_small():
	    # Wrap the system and user messages in Mistral's [INST] ... [/INST] tags.
	    prompt = f"""<s>
	[INST] {system_message}

	{user_message}[/INST]"""

	    return prompt


	def main():
	    model_name = os.path.basename(model_dir)

	    print(f"\033[32mLoading the model: {model_name}...\033[0m")

	    # Leave a few cores free; fall back to 4 threads if the count is unknown.
	    intra_threads = max((os.cpu_count() or 8) - 4, 4)

	    # NOTE: do not pass compute_type (e.g. "int8_bfloat16") when loading
	    # AWQ models converted to CTranslate2.
	    generator = ctranslate2.Generator(
	        model_dir,
	        device="cuda",
	        intra_threads=intra_threads,
	    )

	    tokenizer = AutoTokenizer.from_pretrained(model_dir, add_prefix_space=None)

	    prompt = build_prompt_mistral_small()

	    # CTranslate2 expects token strings rather than token ids.
	    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

	    print(f"\nRun 1 (Beam Size: {beam_size}):")

	    results_batch = generator.generate_batch(
	        [tokens],
	        include_prompt_in_result=False,
	        max_batch_size=4096,
	        batch_type="tokens",
	        beam_size=beam_size,
	        num_hypotheses=1,
	        max_length=512,
	        sampling_temperature=0.0,
	    )

	    output = tokenizer.decode(results_batch[0].sequences_ids[0])

	    print("\nGenerated response:")
	    print(output)

	    # Release the model and free GPU memory.
	    del generator
	    del tokenizer
	    torch.cuda.empty_cache()
	    gc.collect()


	if __name__ == "__main__":
	    main()
	```