🩺 MediMaven Llama-3.1-8B – AWQ 4-bit (v1.1)

Drop-in 4-bit AWQ quantisation of the MediMaven fp16 weights – fits on a 16 GB GPU (e.g. T4).


💡 Why use this repo?

  • Footprint ≈ 5.9 GB on disk / in VRAM

  • Throughput ~29 tok/s on a single T4 (batch = 1)

  • Accuracy loss < 0.3 ROUGE vs fp16

⚡ Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("dranreb1660/medimaven-llama3-8b-awq")
model = AutoModelForCausalLM.from_pretrained(
    "dranreb1660/medimaven-llama3-8b-awq",
    device_map="auto",
    torch_dtype="auto"   # AWQ 4-bit weights are detected from the checkpoint's quantization config (requires autoawq)
)

🔧 Quantisation details

  • AWQ group_size=128, zero_point=True, zero_sym=True.

  • Calibrated on 128 in-domain prompts (medical Q&A).

  • Exported with AutoAWQ v0.2.3 (a reproduction sketch follows below).
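
For reference, here is a minimal sketch of how an export with these settings could be reproduced with AutoAWQ. The output directory, the "GEMM" kernel version, and the placeholder calibration list are assumptions; the card's group_size/zero_point settings are mapped onto AutoAWQ's quant_config keys, and the actual 128 in-domain calibration prompts are not included here.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

src = "dranreb1660/medimaven-llama3-8b-fp16"   # fp16 source repo
out_dir = "medimaven-llama3-8b-awq"            # hypothetical local output path

# Card settings expressed as AutoAWQ quant_config keys; "GEMM" is an assumed kernel version.
quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(src)
tokenizer = AutoTokenizer.from_pretrained(src)

# Placeholder for the 128 in-domain medical Q&A calibration prompts mentioned above
calib_prompts = ["..."]
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_prompts)

model.save_quantized(out_dir)
tokenizer.save_pretrained(out_dir)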

πŸ“ Usage notes

  • The model inherits all limitations and licensing terms of the fp16 weights.

  • For maximum accuracy in secondary fine-tuning, use the fp16 repo instead.

⬆️ Versioning

  • v1.1 = first public release (merged weights, new tokenizer template).

📜 Citation

@misc{medimaven2025llama3,
  title        = {MediMaven Llama-3.1-8B},
  author       = {Kyei-Mensah, Bernard},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/dranreb1660/medimaven-llama3-8b-fp16}}
}