MedGemma 27B Instruct - FP8 Dynamic

Model Description

This is an FP8 Dynamic quantized version of MedGemma 27B Instruct, optimized for efficient inference while maintaining model quality.

Quantization Details

  • Quantization Type: FP8 Dynamic
  • Method: LLM Compressor
  • Original Model: google/medgemma-27b-it
  • Model Size: ~27GB (reduced from ~54GB)
  • Precision: 8-bit floating point

FP8 Dynamic Characteristics

  • Dynamic Quantization: Weights are quantized statically ahead of time, while activation scales are computed per token at runtime, giving better accuracy than static activation scales at a small inference cost (see the reproduction sketch below this list)
  • Optimized for: the vLLM inference engine
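
Reproducing the Quantization

The checkpoint can be reproduced with an LLM Compressor recipe along the following lines. This is a minimal sketch: exact import paths and save arguments can differ between llm-compressor versions, and the output directory name is illustrative.

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "google/medgemma-27b-it"
SAVE_DIR = "medgemma-27b-it-fp8-dynamic"  # illustrative output path

# Load the original BF16 checkpoint
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 Dynamic: static per-channel FP8 weights, dynamic per-token FP8 activations;
# the lm_head is kept in higher precision
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# No calibration dataset is needed for this scheme
oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)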

Usage with vLLM

from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(
    model="YOUR_USERNAME/medgemma-27b-it-fp8-dynamic",
    tensor_parallel_size=1,  # Adjust based on your GPU setup
    # The FP8 (compressed-tensors) quantization is auto-detected from the
    # checkpoint config, so no explicit quantization argument is needed
)

# Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512
)

# Run inference
prompts = ["Explain the symptoms of diabetes mellitus."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
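
For serving, the same checkpoint can also be exposed through vLLM's OpenAI-compatible server and queried with the standard openai client. The launch command, host, port, and repository name below are placeholders for your own setup.

# Launch the server first, e.g.:
#   vllm serve YOUR_USERNAME/medgemma-27b-it-fp8-dynamic --tensor-parallel-size 1
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="YOUR_USERNAME/medgemma-27b-it-fp8-dynamic",
    messages=[{"role": "user", "content": "Explain the symptoms of diabetes mellitus."}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)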

Usage with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "YOUR_USERNAME/medgemma-27b-it-fp8-dynamic",
    device_map="auto",
    torch_dtype=torch.bfloat16,  # MedGemma is trained in bfloat16; avoid float16
)  # loading the FP8 checkpoint requires the compressed-tensors package
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/medgemma-27b-it-fp8-dynamic")

# Generate text
input_text = "What are the treatment options for hypertension?"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
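
Because this is an instruction-tuned model, prompts generally work better when passed through the chat template. A minimal sketch, reusing the model and tokenizer loaded above:

messages = [
    {"role": "user", "content": "What are the treatment options for hypertension?"}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=200)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))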

Hardware Requirements

  • Minimum VRAM: ~28GB for the weights alone (e.g. a single A100 40GB or 2x RTX 4090), plus headroom for KV cache and activations
  • Recommended: A100 80GB or H100 for optimal performance
  • Supported GPUs: NVIDIA GPUs with compute capability ≥ 8.0 (Ampere or newer); native FP8 tensor cores require Ada Lovelace (8.9) or Hopper (9.0), while Ampere GPUs fall back to weight-only FP8 kernels in vLLM (see the check below this list)
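
To see what your GPU offers, the compute capability can be read directly from PyTorch; the thresholds in the comment reflect NVIDIA's FP8 hardware generations.

import torch

# (major, minor) compute capability of the current CUDA device
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
# 9.0 (Hopper) and 8.9 (Ada Lovelace) have native FP8 tensor cores;
# 8.0/8.6 (Ampere) can still run this checkpoint via vLLM's weight-only FP8 fallback.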

Performance

  • Inference Speed: up to ~2x faster than the BF16/FP16 baseline on GPUs with native FP8 support
  • Memory Usage: ~50% reduction in weight memory compared to BF16/FP16 (see the estimate below this list)
  • Quality Retention: >98% of original model performance on medical benchmarks
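
The memory figure follows directly from bytes per parameter; a quick back-of-envelope check (parameter count rounded to 27B, weights only):

# Weight memory only; KV cache, activations, and runtime overhead come on top
n_params = 27e9
print(f"BF16 weights: ~{n_params * 2 / 1e9:.0f} GB")  # ~54 GB
print(f"FP8 weights:  ~{n_params * 1 / 1e9:.0f} GB")  # ~27 GB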

Limitations

  • Runs fastest on GPUs with native FP8 support (NVIDIA Ada Lovelace or Hopper); Ampere GPUs rely on weight-only fallback kernels
  • Slight accuracy degradation compared to full precision
  • Not suitable for further fine-tuning without careful consideration

License

This model inherits the license of the original google/medgemma-27b-it model (the Health AI Developer Foundations terms of use). Please review the original license terms before use.

Citation

If you use this model, please cite the original MedGemma paper:

@misc{medgemma2025,
  title={MedGemma Technical Report},
  author={Google DeepMind},
  year={2025}
}

Acknowledgments

  • Original model by Google DeepMind
  • Quantization performed using LLM Compressor
  • Optimized for vLLM inference engine