MedGemma 27B Instruct - FP8 Static
Model Description
This is an FP8 Static quantized version of MedGemma 27B Instruct, optimized for efficient inference while maintaining model quality.
Quantization Details
- Quantization Type: FP8 Static
- Method: LLM Compressor
- Original Model: google/medgemma-27b-it
- Model Size: ~27GB (reduced from ~54GB)
- Precision: 8-bit floating point
FP8 Static Characteristics
- Static Quantization: Weight and activation scales are pre-computed on calibration data, avoiding per-token scale computation at inference time while keeping accuracy loss minimal
- Optimized for: vLLM inference engine
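For reference, the sketch below shows how an FP8 Static checkpoint of this kind is typically produced with LLM Compressor. It is a minimal reproduction sketch, not the exact recipe used for this checkpoint: the calibration dataset (`open_platypus`), sample count, and output directory are illustrative assumptions, and the exact import paths depend on your `llm-compressor` version.

```python
# Hypothetical reproduction sketch (not the exact recipe used for this checkpoint).
from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",      # quantize all Linear layers
    scheme="FP8",          # static per-tensor FP8 for weights and activations
    ignore=["lm_head"],    # keep the output head in higher precision
)

oneshot(
    model="google/medgemma-27b-it",    # original BF16 checkpoint
    dataset="open_platypus",           # illustrative calibration set
    recipe=recipe,
    output_dir="medgemma-27b-it-fp8-static",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```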
Usage with vLLM
```python
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(
    model="YOUR_USERNAME/medgemma-27b-it-fp8-static",
    tensor_parallel_size=1,  # Adjust based on your GPU setup
    quantization="fp8",      # usually auto-detected from the checkpoint config; omit if vLLM reports a mismatch
)

# Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)

# Run inference
prompts = ["Explain the symptoms of diabetes mellitus."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
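For serving rather than offline batch inference, the same checkpoint can also be exposed through vLLM's OpenAI-compatible server. A minimal sketch, assuming the default port 8000 and the `openai` Python client (the server launch command is shown as a comment):

```python
# Start the OpenAI-compatible server first, e.g.:
#   vllm serve YOUR_USERNAME/medgemma-27b-it-fp8-static --tensor-parallel-size 1
from openai import OpenAI

# Point the client at the local vLLM server (no real API key is required)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="YOUR_USERNAME/medgemma-27b-it-fp8-static",
    messages=[{"role": "user", "content": "Explain the symptoms of diabetes mellitus."}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
```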
Usage with Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "YOUR_USERNAME/medgemma-27b-it-fp8-static",
    device_map="auto",
    torch_dtype="auto",  # picks up bfloat16 and the FP8 quantization config from the checkpoint
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/medgemma-27b-it-fp8-static")

# Generate text
input_text = "What are the treatment options for hypertension?"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
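Because MedGemma 27B Instruct is a chat-tuned model, prompts generally behave better when wrapped in the tokenizer's chat template. A small sketch, assuming the checkpoint ships the standard Gemma chat template inherited from the base model:

```python
# Sketch: apply the chat template before generation (assumes the tokenizer
# carries the standard Gemma chat template, as the upstream checkpoint does).
messages = [
    {"role": "user", "content": "What are the treatment options for hypertension?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn marker
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=200)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```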
Hardware Requirements
- Minimum VRAM: ~28GB (fits on single A100 40GB or 2x RTX 4090)
- Recommended: A100 80GB or H100 for optimal performance
- Supported GPUs: NVIDIA GPUs with compute capability ≥ 8.9 (Ada Lovelace or Hopper) for native FP8 compute; Ampere GPUs (compute capability 8.0) can still run the checkpoint in vLLM via weight-only FP8, with smaller speedups
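To check what a given machine supports before loading the model, the compute capability can be queried directly with PyTorch; the 8.9 threshold used below reflects current vLLM behavior for native FP8 and is an assumption, not a property of this checkpoint:

```python
import torch

# Report each visible GPU's compute capability and whether it has native
# FP8 tensor-core support (Ada Lovelace / Hopper, i.e. >= 8.9).
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    native_fp8 = (major, minor) >= (8, 9)
    print(f"GPU {i}: {name}, compute capability {major}.{minor}, "
          f"native FP8 tensor cores: {native_fp8}")
```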
Performance
- Inference Speed: ~2x faster than the FP16 baseline (hardware- and batch-size-dependent)
- Memory Usage: ~50% reduction in weight memory compared to FP16
- Quality Retention: >98% of the original model's performance on medical benchmarks
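These figures are workload-dependent; a rough way to measure throughput on your own hardware is sketched below. The prompt, batch size, and token counts are arbitrary placeholders, and a fair comparison should run the same script against the BF16/FP16 baseline checkpoint.

```python
import time
from vllm import LLM, SamplingParams

# Rough throughput check: generate a fixed number of tokens and measure wall time.
llm = LLM(model="YOUR_USERNAME/medgemma-27b-it-fp8-static")
params = SamplingParams(max_tokens=256, temperature=0.0, ignore_eos=True)
prompts = ["Summarize the first-line management of type 2 diabetes."] * 8

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```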
Limitations
- Native FP8 compute requires Ada Lovelace or Hopper GPUs (compute capability ≥ 8.9); on Ampere the model runs via weight-only FP8 with reduced speed benefits
- Slight accuracy degradation compared to full precision
- Not suitable for further fine-tuning without careful consideration
License
This model inherits the Gemma license. Please review the original license terms before use.
Citation
If you use this model, please cite the original MedGemma paper:
```bibtex
@article{medgemma2024,
  title={MedGemma: Medical AI Models from Google DeepMind},
  author={Google DeepMind Team},
  year={2024}
}
```
Acknowledgments
- Original model by Google DeepMind
- Quantization performed using LLM Compressor
- Optimized for vLLM inference engine