🩺 MediMaven Llama-3.1-8B – AWQ 4-bit (v1.1)

Drop-in 4-bit AWQ quantisation of the MediMaven fp16 weights – fits on a 16 GB GPU (e.g. T4).


💡 Why use this repo?

  • Footprint ≈ 5.9 GB on disk / in VRAM

  • Throughput ~29 tok/s on a single T4 (batch = 1)

  • Accuracy loss < 0.3 ROUGE vs fp16

⚡ Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("dranreb1660/medimaven-llama3-8b-awq")
model = AutoModelForCausalLM.from_pretrained(
    "dranreb1660/medimaven-llama3-8b-awq",
    device_map="auto",
    torch_dtype="auto"   # AWQ 4-bit weights are detected from the checkpoint's quantization config (requires autoawq)
)

🔧 Quantisation details

  • AWQ group_size=128, zero_point=True, zero_sym=True.

  • Calibrated on 128 in-domain prompts (medical Q&A).

  • Exported with AutoAWQ v0.2.3 (a reproduction sketch follows below).
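
For reference, here is a minimal sketch of how an export with these settings could be reproduced with AutoAWQ. The output directory, the "GEMM" kernel version, and the placeholder calibration list are assumptions; the card's group_size/zero_point settings are mapped onto AutoAWQ's quant_config keys, and the actual 128 in-domain calibration prompts are not included here.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

src = "dranreb1660/medimaven-llama3-8b-fp16"   # fp16 source repo
out_dir = "medimaven-llama3-8b-awq"            # hypothetical local output path

# Card settings expressed as AutoAWQ quant_config keys; "GEMM" is an assumed kernel version.
quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(src)
tokenizer = AutoTokenizer.from_pretrained(src)

# Placeholder for the 128 in-domain medical Q&A calibration prompts mentioned above
calib_prompts = ["..."]
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_prompts)

model.save_quantized(out_dir)
tokenizer.save_pretrained(out_dir)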

πŸ“ Usage notes

  • The model inherits all limitations and licensing terms of the fp16 weights.

  • For maximum accuracy in secondary fine-tuning, use the fp16 repo instead.

⬆️ Versioning

  • v1.1 = first public release (merged weights, new tokenizer template).

📜 Citation

@misc{medimaven2025llama3,
  title        = {MediMaven Llama-3.1-8B},
  author       = {Kyei-Mensah, Bernard},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/dranreb1660/medimaven-llama3-8b-fp16}}
}