---
license: gemma
tags:
- medical
- quantized
- fp8
- static
- llm-compressor
- vllm
- medgemma
base_model: google/medgemma-27b-it
language:
- en
pipeline_tag: text-generation
---
# MedGemma 27B Instruct - FP8 Static
## Model Description
This is an FP8 Static quantized version of MedGemma 27B Instruct, optimized for efficient inference while maintaining model quality.
## Quantization Details
- **Quantization Type**: FP8 Static
- **Method**: LLM Compressor
- **Original Model**: google/medgemma-27b-it
- **Model Size**: ~27GB (reduced from ~54GB)
- **Precision**: 8-bit floating point
### FP8 Static Characteristics
- **Static Quantization**: Weight and activation scales are pre-computed from calibration data, so no per-batch scale estimation is needed at inference time (see the reproduction sketch below)
- **Optimized for**: the vLLM inference engine
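For reference, a typical FP8 static (W8A8) one-shot run with LLM Compressor looks roughly like the sketch below. The base model path, calibration dataset, sample count, and output directory are placeholders; the exact recipe used to produce this checkpoint is not documented here.

```python
# Hedged sketch of an FP8 static quantization run with LLM Compressor.
# Import paths may differ slightly across llm-compressor versions.
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# FP8 static: quantize Linear weights and activations with pre-computed scales,
# keeping the output head in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

oneshot(
    model="google/medgemma-27b-it",          # original BF16 checkpoint (placeholder)
    dataset="open_platypus",                 # illustrative calibration dataset
    recipe=recipe,
    output_dir="medgemma-27b-it-fp8-static",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

The resulting directory can then be pushed to the Hub and loaded directly by vLLM or Transformers.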
## Usage with vLLM
```python
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(
    model="YOUR_USERNAME/medgemma-27b-it-fp8-static",
    tensor_parallel_size=1,  # Adjust based on your GPU setup
    quantization="fp8",
)

# Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)

# Run inference
prompts = ["Explain the symptoms of diabetes mellitus."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```
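vLLM can also serve the model behind its OpenAI-compatible HTTP API (for example, `vllm serve YOUR_USERNAME/medgemma-27b-it-fp8-static`). The sketch below queries such a server with the `openai` Python client; the base URL, API key, and model name are placeholders.

```python
# Hedged sketch: querying a vLLM OpenAI-compatible server started separately.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="YOUR_USERNAME/medgemma-27b-it-fp8-static",
    messages=[{"role": "user", "content": "Explain the symptoms of diabetes mellitus."}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```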
## Usage with Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Loading an LLM Compressor checkpoint requires the compressed-tensors package
model = AutoModelForCausalLM.from_pretrained(
    "YOUR_USERNAME/medgemma-27b-it-fp8-static",
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/medgemma-27b-it-fp8-static")

# Generate text
input_text = "What are the treatment options for hypertension?"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
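Since MedGemma Instruct is chat-tuned, prompts are normally wrapped in the Gemma chat template rather than passed as raw text. A minimal sketch, assuming `model` and `tokenizer` are loaded as above and the quantized repo keeps the original chat template:

```python
# Hedged sketch: prompting through the chat template instead of raw text.
messages = [
    {"role": "user", "content": "What are the treatment options for hypertension?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=200)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```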
## Hardware Requirements
- **Minimum VRAM**: ~28GB (fits on single A100 40GB or 2x RTX 4090)
- **Recommended**: A100 80GB or H100 for optimal performance
- **Supported GPUs**: NVIDIA GPUs with compute capability ≥ 8.0 (Ampere or newer); native FP8 tensor cores require compute capability ≥ 8.9 (Ada Lovelace / Hopper), while Ampere GPUs run FP8 checkpoints in vLLM via weight-only FP8 (Marlin) kernels (see the check below)
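A quick way to verify what the local GPU supports, sketched with PyTorch:

```python
# Hedged sketch: check whether the local GPU can run this FP8 checkpoint.
# >= 8.9 (Ada Lovelace / Hopper) has native FP8 tensor cores; 8.x Ampere GPUs
# run FP8 checkpoints in vLLM via weight-only FP8 kernels.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Compute capability: {major}.{minor}")
    if (major, minor) >= (8, 9):
        print("Native FP8 (W8A8) supported.")
    elif major >= 8:
        print("Ampere: FP8 checkpoint runs via weight-only FP8 kernels in vLLM.")
    else:
        print("GPU too old for FP8 inference.")
else:
    print("No CUDA GPU detected.")
```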
## Performance
- **Inference Speed**: ~2x faster than FP16 baseline
- **Memory Usage**: ~50% reduction compared to FP16
- **Quality Retention**: >98% of original model performance on medical benchmarks
## Limitations
- Native FP8 compute requires NVIDIA Ada Lovelace or Hopper GPUs (compute capability ≥ 8.9); Ampere GPUs fall back to weight-only FP8 kernels in vLLM
- Slight accuracy degradation compared to full precision
- Not suitable for further fine-tuning without careful consideration
## License
This model inherits the Gemma license. Please review the original license terms before use.
## Citation
If you use this model, please cite the original MedGemma paper:
```bibtex
@misc{medgemma2024,
  title  = {MedGemma: Medical AI Models from Google DeepMind},
  author = {{Google DeepMind}},
  year   = {2024}
}
```
## Acknowledgments
- Original model by Google DeepMind
- Quantization performed using LLM Compressor
- Optimized for vLLM inference engine