---
library_name: transformers
tags: [quantization, qwen3, qlora, causal-lm, low-rank-adapters, 4bit, bitsandbytes, peft, efficient-finetuning]
---

# Qwen3-0.6B Quantized with QLoRA for Reasoning Tasks

This is a 4-bit quantized version of `Qwen/Qwen3-0.6B-Base`, fine-tuned with LoRA adapters on multiple MCQA-style reasoning datasets. The model was optimized with QLoRA, a parameter-efficient fine-tuning method that keeps the memory footprint small with minimal loss in accuracy.

## Model Details

### Model Description

This model is:

- A quantized version of `Qwen/Qwen3-0.6B-Base` using `bitsandbytes` 4-bit NormalFloat (NF4)
- Fine-tuned using Low-Rank Adaptation (LoRA) with rank 8
- Adapted to multiple-choice reasoning datasets such as AQuA-RAT and TheoremQA
- Fully compatible with Hugging Face Transformers

- **Developed by:** Ahmed Abdelmalek (EPFL CS-552 Project)
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** Apache 2.0
- **Fine-tuned from model:** `Qwen/Qwen3-0.6B-Base`

### Model Sources

- [Repository](https://huggingface.co/Qwen/Qwen3-0.6B-Base)

## Uses

### Direct Use

The model can be used directly for MCQA-style question answering via text generation.

### Out-of-Scope Use

- Not intended for open-ended generation or safety-critical applications
- Not intended for real-time or commercial deployment without further evaluation

## Bias, Risks, and Limitations

- Inherits biases from its base model and training data (e.g., the reasoning datasets)
- May fail on adversarial or out-of-distribution logic tasks

### Recommendations

Evaluate the model against your specific reasoning task before production use.

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "your-username/MNLP_M2_quantized_model"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "Question: What is 3 + 5? Options: A) 6 B) 8 C) 9 D) 10 Answer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

### Training Data

- Processed versions of AQuA-RAT, TheoremQA, and custom MCQA datasets
- Unified into a single format with rationale-enhanced prompts

### Training Procedure

- **Precision:** fp16
- **Quantization:** 4-bit NF4 with double quantization and float16 compute
- **Adapter type:** LoRA (r=8, α=16, dropout=0.05)
- **Base model:** frozen; only the LoRA adapters are trained

An illustrative configuration sketch of this setup is included in the appendix at the end of this card.

#### Training Hyperparameters

- **Epochs:** 3
- **Batch size:** 4
- **Gradient accumulation steps:** 2
- **Optimizer:** paged_adamw_8bit

## Evaluation

### Testing Data

A validation set of 1,000 samples held out from the unified dataset.

### Metrics

- Accuracy / F1 (to be reported in the evaluation phase)

## Environmental Impact

- **Hardware:** Google Colab Pro, A100 GPU
- **Hours used:** ~6–7 hours
- **Carbon emitted:** Estimated with the [MLCO2 Impact calculator](https://mlco2.github.io/impact#compute)

## Technical Specifications

### Architecture

- Qwen3-0.6B base
- 28-layer transformer with rotary positional embeddings and 16 attention heads

### Compute Infrastructure

- **Hardware:** Colab A100 GPU, high-RAM runtime
- **Software:** Python 3.10, PyTorch 2.2.2, Transformers 4.51.3

## Contact

- **Author:** Ahmed Abdelmalek
- **Email:** ahmed.abdelmalek@epfl.ch
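
## Appendix: Example QLoRA Configuration

The snippet below is a minimal sketch of how the quantization and adapter settings described under "Training Procedure" (4-bit NF4, double quantization, float16 compute, LoRA with r=8, α=16, dropout=0.05) could be expressed with `bitsandbytes` and `peft`. It is not the exact training script used for this model; in particular, the `target_modules` list is an assumption about typical Qwen-style projection layers.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization and float16 compute,
# matching the settings listed under "Training Procedure".
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B-Base",
    quantization_config=bnb_config,
    device_map="auto",
)
base_model = prepare_model_for_kbit_training(base_model)

# LoRA adapters: r=8, alpha=16, dropout=0.05; the base weights stay frozen.
# target_modules is an assumption (common attention projections), not taken
# from the original training script.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```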
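
Similarly, the listed hyperparameters (3 epochs, batch size 4, gradient accumulation 2, `paged_adamw_8bit`, fp16) could be wired into `transformers.TrainingArguments` as sketched below; `output_dir` is a placeholder, and the arguments would then be passed to a `Trainer` together with the tokenized MCQA dataset.

```python
from transformers import TrainingArguments

# Hyperparameters from the "Training Hyperparameters" section.
# output_dir is a placeholder, not the original project path.
training_args = TrainingArguments(
    output_dir="qwen3-0.6b-qlora-mcqa",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    optim="paged_adamw_8bit",
    fp16=True,
)
# These arguments would be passed to transformers.Trainer along with the
# PEFT-wrapped model and the tokenized training set.
```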