# Ameena Qwen3-8B e3 Quantized GGUF
This is a quantized version of a fine-tuned Qwen3-8B model, optimized for efficient inference.
## Model Details
- Base Model: Qwen/Qwen3-8B
- Quantization: Q4_K_M (4-bit with K-quant mixed precision)
- Original Size: ~15.26 GB
- Quantized Size: ~4.68 GB
- Compression Ratio: 3.3x
- Format: GGUF (GPT-Generated Unified Format)
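
The size and compression figures above are consistent with each other; here is a quick check in Python using only the numbers reported on this card:

```python
original_gb = 15.26    # FP16 checkpoint size reported above
quantized_gb = 4.68    # Q4_K_M GGUF size reported above

ratio = original_gb / quantized_gb           # ~3.26x, rounded to 3.3x
reduction = 1 - quantized_gb / original_gb   # ~0.69, i.e. ~69% smaller

print(f"compression: {ratio:.2f}x, size reduction: {reduction:.0%}")
```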
## Usage

### With llama-cpp-python
```python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="Ameena_Qwen3-8B_e3.gguf",
    n_gpu_layers=-1,   # -1 offloads all layers to the GPU
    n_ctx=4096,        # Context window
    verbose=False
)

# Generate text
response = llm(
    "Your prompt here",
    max_tokens=512,
    temperature=0.7,
    top_p=0.9
)
print(response["choices"][0]["text"])
```
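
If the GGUF file carries Qwen3's chat template (converted Qwen checkpoints usually do, though that is an assumption about this particular file), the same `llm` object can also be driven through the chat API. A minimal sketch:

```python
# Chat-style generation; assumes the GGUF embeds a chat template.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize what GGUF quantization does."},
]

chat_response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7,
)

print(chat_response["choices"][0]["message"]["content"])
```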
### With huggingface_hub + llama.cpp
```python
# Download the model
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="Tohirju/Ameena_Qwen3-8B_e3_Quantised_gguf",
    filename="Ameena_Qwen3-8B_e3.gguf"
)
```
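
The returned `model_path` points into the local Hugging Face cache and can be passed straight to the loader from the previous section, for example:

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download (or reuse the cached copy), then load the quantized model.
model_path = hf_hub_download(
    repo_id="Tohirju/Ameena_Qwen3-8B_e3_Quantised_gguf",
    filename="Ameena_Qwen3-8B_e3.gguf",
)
llm = Llama(model_path=model_path, n_gpu_layers=-1, n_ctx=4096, verbose=False)

print(llm("GGUF is", max_tokens=32)["choices"][0]["text"])
```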
## Quantization Details
- Method: Q4_K_M, a K-quant scheme that stores most weights in 4-bit blocks while selected tensors stay at higher precision
- Quality: widely used as a good trade-off between file size and output quality
- Speed: Optimized for fast inference on both CPU and GPU
- Memory: VRAM/RAM requirements drop roughly in line with the ~3.3x size reduction
## Performance
- Loading: ~3.3x less data to read from disk, so the model loads correspondingly faster
- Memory Usage: ~69% reduction relative to the FP16 weights
- Quality: typically minimal quality loss compared to the FP16 version
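
Actual throughput depends heavily on your hardware, so it is worth measuring locally. A rough sketch that times one generation and reads the token counts llama-cpp-python reports (the prompt and sampling settings are placeholders):

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="Ameena_Qwen3-8B_e3.gguf", n_gpu_layers=-1,
            n_ctx=4096, verbose=False)

start = time.perf_counter()
out = llm("Explain quantization in one paragraph.", max_tokens=256, temperature=0.7)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```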
## Hardware Requirements
- CPU: Any modern CPU (optimized for x86_64)
- GPU: CUDA-compatible GPU recommended (e.g., RTX 3060 or better)
- RAM: 8GB minimum, 16GB recommended
- Storage: ~5GB for the model file
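
If the whole model does not fit in VRAM, llama.cpp can keep part of the network on the CPU by offloading only some layers. The layer count below is purely illustrative and should be tuned to your GPU:

```python
from llama_cpp import Llama

# Hybrid CPU/GPU inference: offload only part of the transformer stack.
# n_gpu_layers=20 is an illustrative value, not a recommendation from this card.
llm = Llama(
    model_path="Ameena_Qwen3-8B_e3.gguf",
    n_gpu_layers=20,   # remaining layers run on the CPU from system RAM
    n_ctx=4096,
    verbose=False,
)
```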
## License
This model follows the Apache 2.0 license of the base Qwen3-8B model.