MedGemma 27B Instruct - FP8 Dynamic

Model Description

This is an FP8 Dynamic quantized version of MedGemma 27B Instruct, optimized for efficient inference while maintaining model quality.

Quantization Details

  • Quantization Type: FP8 Dynamic
  • Method: LLM Compressor
  • Original Model: google/medgemma-27b-it
  • Model Size: ~27GB (reduced from ~54GB)
  • Precision: 8-bit floating point

FP8 Dynamic Characteristics

  • Dynamic Quantization: Weights are quantized statically ahead of time, while activation scales are computed per token at runtime, giving better accuracy than static activation scales at a small inference cost (see the reproduction sketch below this list)
  • Optimized for: the vLLM inference engine
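
Reproducing the Quantization

The checkpoint can be reproduced with an LLM Compressor recipe along the following lines. This is a minimal sketch: exact import paths and save arguments can differ between llm-compressor versions, and the output directory name is illustrative.

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "google/medgemma-27b-it"
SAVE_DIR = "medgemma-27b-it-fp8-dynamic"  # illustrative output path

# Load the original BF16 checkpoint
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 Dynamic: static per-channel FP8 weights, dynamic per-token FP8 activations;
# the lm_head is kept in higher precision
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# No calibration dataset is needed for this scheme
oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)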

Usage with vLLM

from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(
    model="YOUR_USERNAME/medgemma-27b-it-fp8-dynamic",
    tensor_parallel_size=1,  # Adjust based on your GPU setup
    # The FP8 (compressed-tensors) quantization is auto-detected from the
    # checkpoint config, so no explicit quantization argument is needed
)

# Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512
)

# Run inference
prompts = ["Explain the symptoms of diabetes mellitus."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
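
For serving, the same checkpoint can also be exposed through vLLM's OpenAI-compatible server and queried with the standard openai client. The launch command, host, port, and repository name below are placeholders for your own setup.

# Launch the server first, e.g.:
#   vllm serve YOUR_USERNAME/medgemma-27b-it-fp8-dynamic --tensor-parallel-size 1
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="YOUR_USERNAME/medgemma-27b-it-fp8-dynamic",
    messages=[{"role": "user", "content": "Explain the symptoms of diabetes mellitus."}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)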

Usage with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "YOUR_USERNAME/medgemma-27b-it-fp8-dynamic",
    device_map="auto",
    torch_dtype=torch.bfloat16,  # MedGemma is trained in bfloat16; avoid float16
)  # loading the FP8 checkpoint requires the compressed-tensors package
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/medgemma-27b-it-fp8-dynamic")

# Generate text
input_text = "What are the treatment options for hypertension?"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
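
Because this is an instruction-tuned model, prompts generally work better when passed through the chat template. A minimal sketch, reusing the model and tokenizer loaded above:

messages = [
    {"role": "user", "content": "What are the treatment options for hypertension?"}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=200)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))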

Hardware Requirements

  • Minimum VRAM: ~28GB for the weights alone (e.g. a single A100 40GB or 2x RTX 4090), plus headroom for KV cache and activations
  • Recommended: A100 80GB or H100 for optimal performance
  • Supported GPUs: NVIDIA GPUs with compute capability ≥ 8.0 (Ampere or newer); native FP8 tensor cores require Ada Lovelace (8.9) or Hopper (9.0), while Ampere GPUs fall back to weight-only FP8 kernels in vLLM (see the check below this list)
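
To see what your GPU offers, the compute capability can be read directly from PyTorch; the thresholds in the comment reflect NVIDIA's FP8 hardware generations.

import torch

# (major, minor) compute capability of the current CUDA device
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
# 9.0 (Hopper) and 8.9 (Ada Lovelace) have native FP8 tensor cores;
# 8.0/8.6 (Ampere) can still run this checkpoint via vLLM's weight-only FP8 fallback.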

Performance

  • Inference Speed: up to ~2x faster than the BF16/FP16 baseline on GPUs with native FP8 support
  • Memory Usage: ~50% reduction in weight memory compared to BF16/FP16 (see the estimate below this list)
  • Quality Retention: >98% of original model performance on medical benchmarks
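
The memory figure follows directly from bytes per parameter; a quick back-of-envelope check (parameter count rounded to 27B, weights only):

# Weight memory only; KV cache, activations, and runtime overhead come on top
n_params = 27e9
print(f"BF16 weights: ~{n_params * 2 / 1e9:.0f} GB")  # ~54 GB
print(f"FP8 weights:  ~{n_params * 1 / 1e9:.0f} GB")  # ~27 GB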

Limitations

  • Runs fastest on GPUs with native FP8 support (NVIDIA Ada Lovelace or Hopper); Ampere GPUs rely on weight-only fallback kernels
  • Slight accuracy degradation compared to full precision
  • Not suitable for further fine-tuning without careful consideration

License

This model inherits the license of the original google/medgemma-27b-it model (the Health AI Developer Foundations terms of use). Please review the original license terms before use.

Citation

If you use this model, please cite the original MedGemma paper:

@misc{medgemma2025,
  title={MedGemma Technical Report},
  author={Google DeepMind},
  year={2025}
}

Acknowledgments

  • Original model by Google DeepMind
  • Quantization performed using LLM Compressor
  • Optimized for vLLM inference engine