# Ameena Qwen3-8B e3 Quantized GGUF
This is a quantized version of a fine-tuned Qwen3-8B model, optimized for efficient inference.
## Model Details
- Base Model: Qwen/Qwen3-8B
- Quantization: Q4_K_M (4-bit with K-quant mixed precision)
- Original Size: ~15.26 GB
- Quantized Size: ~4.68 GB
- Compression Ratio: 3.3x
- Format: GGUF (GPT-Generated Unified Format)
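
The size and compression figures above are consistent with each other; here is a quick check in Python using only the numbers reported on this card:

```python
original_gb = 15.26    # FP16 checkpoint size reported above
quantized_gb = 4.68    # Q4_K_M GGUF size reported above

ratio = original_gb / quantized_gb           # ~3.26x, rounded to 3.3x
reduction = 1 - quantized_gb / original_gb   # ~0.69, i.e. ~69% smaller

print(f"compression: {ratio:.2f}x, size reduction: {reduction:.0%}")
```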
## Usage

### With llama-cpp-python
```python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="Ameena_Qwen3-8B_e3.gguf",
    n_gpu_layers=-1,   # -1 offloads all layers to the GPU
    n_ctx=4096,        # Context window
    verbose=False
)

# Generate text
response = llm(
    "Your prompt here",
    max_tokens=512,
    temperature=0.7,
    top_p=0.9
)
print(response["choices"][0]["text"])
```
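
If the GGUF file carries Qwen3's chat template (converted Qwen checkpoints usually do, though that is an assumption about this particular file), the same `llm` object can also be driven through the chat API. A minimal sketch:

```python
# Chat-style generation; assumes the GGUF embeds a chat template.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize what GGUF quantization does."},
]

chat_response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7,
)

print(chat_response["choices"][0]["message"]["content"])
```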
### With huggingface_hub + llama.cpp
```python
# Download the model
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="Tohirju/Ameena_Qwen3-8B_e3_Quantised_gguf",
    filename="Ameena_Qwen3-8B_e3.gguf"
)
```
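
The returned `model_path` points into the local Hugging Face cache and can be passed straight to the loader from the previous section, for example:

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download (or reuse the cached copy), then load the quantized model.
model_path = hf_hub_download(
    repo_id="Tohirju/Ameena_Qwen3-8B_e3_Quantised_gguf",
    filename="Ameena_Qwen3-8B_e3.gguf",
)
llm = Llama(model_path=model_path, n_gpu_layers=-1, n_ctx=4096, verbose=False)

print(llm("GGUF is", max_tokens=32)["choices"][0]["text"])
```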
## Quantization Details
- Method: Q4_K_M, a K-quant scheme that stores most weights in 4-bit blocks while selected tensors stay at higher precision
- Quality: widely used as a good trade-off between file size and output quality
- Speed: Optimized for fast inference on both CPU and GPU
- Memory: VRAM/RAM requirements drop roughly in line with the ~3.3x size reduction
## Performance
- Loading: ~3.3x less data to read from disk, so the model loads correspondingly faster
- Memory Usage: ~69% reduction relative to the FP16 weights
- Quality: typically minimal quality loss compared to the FP16 version
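
Actual throughput depends heavily on your hardware, so it is worth measuring locally. A rough sketch that times one generation and reads the token counts llama-cpp-python reports (the prompt and sampling settings are placeholders):

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="Ameena_Qwen3-8B_e3.gguf", n_gpu_layers=-1,
            n_ctx=4096, verbose=False)

start = time.perf_counter()
out = llm("Explain quantization in one paragraph.", max_tokens=256, temperature=0.7)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```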
## Hardware Requirements
- CPU: Any modern CPU (optimized for x86_64)
- GPU: CUDA-compatible GPU recommended (e.g., RTX 3060 or better)
- RAM: 8GB minimum, 16GB recommended
- Storage: ~5GB for the model file
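
If the whole model does not fit in VRAM, llama.cpp can keep part of the network on the CPU by offloading only some layers. The layer count below is purely illustrative and should be tuned to your GPU:

```python
from llama_cpp import Llama

# Hybrid CPU/GPU inference: offload only part of the transformer stack.
# n_gpu_layers=20 is an illustrative value, not a recommendation from this card.
llm = Llama(
    model_path="Ameena_Qwen3-8B_e3.gguf",
    n_gpu_layers=20,   # remaining layers run on the CPU from system RAM
    n_ctx=4096,
    verbose=False,
)
```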
## License
This model follows the Apache 2.0 license of the base Qwen3-8B model.