---
tags:
- FP8
- vllm
- audio
license: apache-2.0
license_link: https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md
language:
- en
base_model: openai/whisper-tiny
library_name: transformers
---

# whisper-tiny-FP8-Dynamic

## Model Overview
- **Model Architecture:** whisper-tiny
  - **Input:** Audio-Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:** 04/16/2025
- **Version:** 1.0
- **Model Developers:** Neural Magic

Quantized version of [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny).

### Model Optimizations

This model was obtained by quantizing the weights and activations of [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) to the FP8 data type, ready for inference with vLLM >= 0.5.2.

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm.assets.audio import AudioAsset
from vllm import LLM, SamplingParams

# prepare model
llm = LLM(
    model="neuralmagic/whisper-tiny-FP8-Dynamic",
    max_model_len=448,
    max_num_seqs=400,
    limit_mm_per_prompt={"audio": 1},
)

# prepare inputs with an explicit encoder/decoder prompt
inputs = {
    "encoder_prompt": {
        "prompt": "",
        "multi_modal_data": {
            "audio": AudioAsset("winning_call").audio_and_sample_rate,
        },
    },
    "decoder_prompt": "<|startoftranscript|>",
}

# generate response
print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.0, max_tokens=64))
print(f"PROMPT : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
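As a minimal sketch of that serving path, assuming a recent vLLM build that exposes the OpenAI-compatible transcription endpoint for Whisper-style models (`sample.wav` and the local server address are placeholders, not part of the original card), the model can be served with `vllm serve` and queried through the standard OpenAI client:

```python
from openai import OpenAI

# Point the OpenAI client at a local vLLM server, e.g. started with:
#   vllm serve neuralmagic/whisper-tiny-FP8-Dynamic
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# "sample.wav" is a placeholder path to any local audio file
with open("sample.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="neuralmagic/whisper-tiny-FP8-Dynamic",
        file=audio_file,
    )

print(transcription.text)
```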
## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

**Model Creation Code**

```bash
python quantize.py \
    --model_path openai/whisper-tiny \
    --quant_path output_dir/whisper-tiny-FP8-Dynamic
```

```python
import argparse
import os

from transformers import WhisperProcessor

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers.tracing import TraceableWhisperForConditionalGeneration

# --- Args ---
parser = argparse.ArgumentParser()
parser.add_argument('--model_path', type=str, required=True)
parser.add_argument('--quant_path', type=str, required=True)
parser.add_argument('--observer', type=str, default="minmax")
args = parser.parse_args()

# --- Load Model ---
model = TraceableWhisperForConditionalGeneration.from_pretrained(
    args.model_path,
    device_map="auto",
    torch_dtype="auto",
)
model.config.forced_decoder_ids = None
processor = WhisperProcessor.from_pretrained(args.model_path)

# --- Recipe (FP8 Dynamic: FP8 weights, dynamic FP8 activations; lm_head left unquantized) ---
recipe = [
    QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        sequential_targets=["WhisperEncoderLayer", "WhisperDecoderLayer"],
        ignore=["re:.*lm_head"],
    )
]

# --- Run oneshot (the dynamic scheme requires no calibration data) ---
oneshot(
    model=model,
    recipe=recipe,
    trust_remote_code_model=True,
)

# --- Save compressed checkpoint and processor ---
os.makedirs(args.quant_path, exist_ok=True)
model.save_pretrained(args.quant_path, save_compressed=True)
processor.save_pretrained(args.quant_path)
```
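As an optional sanity check (not part of the original creation script), the quantization metadata written by `save_compressed=True` can be inspected from the saved config; the path below matches the `--quant_path` used in the invocation above.

```python
from transformers import AutoConfig

# Load only the config of the compressed checkpoint produced above
config = AutoConfig.from_pretrained("output_dir/whisper-tiny-FP8-Dynamic")

# compressed-tensors checkpoints record their scheme under `quantization_config`;
# for this model it should report FP8 weights and dynamic FP8 activations.
print(config.quantization_config)
```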
## Evaluation

The model was evaluated on the [LibriSpeech](https://huggingface.co/datasets/lmms-lab/librispeech) and [Fleurs](https://huggingface.co/datasets/lmms-lab/fleurs) datasets using [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), via the following commands:
**Evaluation Commands**

LibriSpeech:

```bash
lmms-eval \
    --model=whisper_vllm \
    --model_args="pretrained=neuralmagic/whisper-tiny-FP8-Dynamic" \
    --batch_size 64 \
    --output_path <output_path> \
    --tasks librispeech
```

Fleurs:

```bash
lmms-eval \
    --model=whisper_vllm \
    --model_args="pretrained=neuralmagic/whisper-tiny-FP8-Dynamic" \
    --batch_size 64 \
    --output_path <output_path> \
    --tasks fleurs
```
| Benchmark | Split | BF16 | w8a8 | Recovery (%) |
|---|---|---|---|---|
| LibriSpeech (WER) | test-clean | 7.6602 | 7.8941 | 96.53% |
| LibriSpeech (WER) | test-other | 17.1041 | 17.1325 | 98.74% |
| Fleurs (X→en, WER) | cmn_hans_cn | 43.8226 | 45.0539 | 97.27% |
| Fleurs (X→en, WER) | en | 13.6638 | 15.2980 | 89.32% |
| Fleurs (X→en, WER) | yue_hant_hk | 60.1848 | 67.5437 | 89.10% |
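The results above are word error rates (WER), where lower is better and recovery compares the quantized model against the BF16 baseline. Purely as an illustration of the metric (the `jiwer` package below is not part of the evaluation harness invocation shown above), WER counts word-level edits against a reference transcript:

```python
from jiwer import wer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over the lazy dog"

# WER = (substitutions + insertions + deletions) / reference word count
# One substitution ("jumps" -> "jumped") over 9 reference words ≈ 0.1111
print(f"WER: {wer(reference, hypothesis):.4f}")
```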