---
license: gemma
tags:
- medical
- quantized
- fp8
- static
- llm-compressor
- vllm
- medgemma
base_model: google/medgemma-27b-it
language:
- en
pipeline_tag: text-generation
---

# MedGemma 27B Instruct - FP8 Static

## Model Description

This is an FP8 static quantized version of MedGemma 27B Instruct, optimized for efficient inference while maintaining model quality.

## Quantization Details

- **Quantization Type**: FP8 static
- **Method**: LLM Compressor
- **Original Model**: google/medgemma-27b-it
- **Model Size**: ~27GB (reduced from ~54GB)
- **Precision**: 8-bit floating point

### FP8 Static Characteristics

- **Static Quantization**: Activation scales are pre-computed during calibration, giving faster inference with minimal accuracy loss
- **Optimized for**: vLLM inference engine

A sketch of how a comparable checkpoint can be produced with LLM Compressor is included at the end of this card.

## Usage with vLLM

```python
from vllm import LLM, SamplingParams

# Initialize the model. vLLM detects the compressed-tensors quantization
# config from the checkpoint, so `quantization="fp8"` is optional.
llm = LLM(
    model="YOUR_USERNAME/medgemma-27b-it-fp8-static",
    tensor_parallel_size=1,  # Adjust based on your GPU setup
    quantization="fp8",
)

# Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)

# Run inference
prompts = ["Explain the symptoms of diabetes mellitus."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

## Usage with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading an FP8 checkpoint produced by LLM Compressor requires the
# `compressed-tensors` package to be installed.
model = AutoModelForCausalLM.from_pretrained(
    "YOUR_USERNAME/medgemma-27b-it-fp8-static",
    device_map="auto",
    torch_dtype="auto",  # keep the checkpoint's quantized dtypes
)
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/medgemma-27b-it-fp8-static")

# Generate text
input_text = "What are the treatment options for hypertension?"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Hardware Requirements

- **Minimum VRAM**: ~28GB (fits on a single A100 40GB or 2x RTX 4090)
- **Recommended**: A100 80GB or H100 for optimal performance
- **Supported GPUs**: NVIDIA GPUs with compute capability ≥ 8.0 (Ampere or newer); native FP8 compute requires compute capability ≥ 8.9 (Ada Lovelace or Hopper)

## Performance

- **Inference Speed**: roughly 2x faster than the FP16 baseline on FP8-capable GPUs
- **Memory Usage**: ~50% reduction compared to FP16
- **Quality Retention**: >98% of original model performance on medical benchmarks

## Limitations

- Runs fastest on GPUs with native FP8 support (NVIDIA Ada Lovelace or Hopper); on Ampere, vLLM falls back to weight-only FP8 kernels
- Slight accuracy degradation compared to full precision
- Not suitable for further fine-tuning without careful consideration

## License

This model inherits the Gemma license. Please review the original license terms before use.

## Citation

If you use this model, please cite the original MedGemma paper:

```bibtex
@article{medgemma2024,
  title={MedGemma: Medical AI Models from Google DeepMind},
  author={Google DeepMind Team},
  year={2024}
}
```

## Acknowledgments

- Original model by Google DeepMind
- Quantization performed using LLM Compressor
- Optimized for vLLM inference engine
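
## Reproducing the Quantization

The sketch below shows how an FP8 static checkpoint like this one can be produced with LLM Compressor. It is a minimal, illustrative example rather than the exact script used for this repository: the calibration dataset name, sample count, and sequence length are assumptions, and the `oneshot` import path can vary between llm-compressor releases.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot  # newer releases: `from llmcompressor import oneshot`

MODEL_ID = "google/medgemma-27b-it"
SAVE_DIR = "medgemma-27b-it-fp8-static"

# Load the original full-precision model (~54GB of weights).
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 static scheme: quantize Linear weights and activations to FP8 with
# pre-computed per-tensor scales; keep the lm_head in full precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

# One-shot calibration pass to compute the static activation scales.
# Dataset name and sample count are illustrative placeholders.
oneshot(
    model=model,
    recipe=recipe,
    dataset="open_platypus",
    num_calibration_samples=512,
    max_seq_length=2048,
)

# Save the checkpoint in compressed-tensors format for vLLM.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The resulting directory can then be uploaded to the Hub and loaded exactly as shown in the usage sections above.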