---
license: apache-2.0
---

# Qwen2-7B-ReLU

Qwen2-7B-ReLU is a variant of Qwen2-7B that replaces the SiLU/Swish activation function with dReLU, achieving higher activation sparsity while maintaining the performance of the original model.

## Key Features

- Replaces the SiLU/Swish activation function with dReLU
- Maintains performance comparable to, or even better than, the original Qwen2-7B
- Significantly increases activation sparsity, enabling further optimization and compression

## Benchmarks

The model has been evaluated on standard benchmarks to verify its performance:

- **MMLU**: 69.19% (5-shot)
- **IFEval**: 73.2% (Prompt Strict-Accuracy)
- **Livebench**:
  - Average: 32.1%
  - Coding: 39.8%
  - Data Analysis: 45.3%
  - Instruction Following: 58.1%
  - Language: 9.0%
  - Math: 22.0%
  - Reasoning: 18.7%

These results indicate that the ReLU modification maintains competitive performance while achieving higher sparsity than the original model.

## Technical Details

The key modification in this version is the application of ReLU activation to both branches of the MLP block. The implementation modifies the original `Qwen2MLP` class as follows:

```python
from torch import nn
from transformers.activations import ACT2FN


class Qwen2MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        # Activation is looked up from config.hidden_act (ReLU for this model, per the description).
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        # dReLU: the activation is applied to BOTH the gate and up branches before multiplying.
        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.act_fn(self.up_proj(x)))
        return down_proj
```

The key change is in the forward pass, where the activation function is now applied to both the gate projection and the up projection outputs before multiplication.
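
To make the effect of the forward-pass change described above concrete, the following is a minimal standard-library sketch, not the model's actual code. It compares the element-wise gating of the original design (SiLU on the gate branch only) with dReLU (ReLU on both branches). The helper names (`swiglu_unit`, `drelu_unit`, `zero_fraction`) and the standard-normal inputs are assumptions made for illustration; they stand in for the outputs of `gate_proj(x)` and `up_proj(x)` and say nothing about the real activation distribution of the model.

```python
import math
import random


def silu(v: float) -> float:
    """SiLU/Swish activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def relu(v: float) -> float:
    """ReLU activation: negative inputs become exactly zero."""
    return max(0.0, v)


def swiglu_unit(gate: float, up: float) -> float:
    """Original gating: activation on the gate branch only."""
    return silu(gate) * up


def drelu_unit(gate: float, up: float) -> float:
    """dReLU gating: ReLU on both branches before multiplication."""
    return relu(gate) * relu(up)


def zero_fraction(values) -> float:
    """Fraction of entries that are exactly zero (the activation sparsity)."""
    return sum(1 for v in values if v == 0.0) / len(values)


# Hypothetical pre-activations: independent standard-normal draws that
# stand in for gate_proj(x) and up_proj(x). Real activations will differ.
rng = random.Random(0)
pairs = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(10_000)]

swiglu_sparsity = zero_fraction([swiglu_unit(g, u) for g, u in pairs])
drelu_sparsity = zero_fraction([drelu_unit(g, u) for g, u in pairs])

print(f"SiLU-gated zero fraction: {swiglu_sparsity:.3f}")
print(f"dReLU zero fraction:      {drelu_sparsity:.3f}")
```

Under this toy assumption of independent, zero-mean inputs, a dReLU product is zero whenever either branch is non-positive, which happens about three quarters of the time, while the SiLU-gated product is almost never exactly zero. Sparsity in the trained model depends on its learned weights and inputs, so these numbers are illustrative only.
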
This modification, combined with the use of ReLU, contributes to the increased sparsity of the model.

## Intended Usage

This release primarily targets the research community for:

- Studying sparsity in large language models
- Model compression and optimization research
- Understanding the impact of activation functions on model behavior

## Model Limitations

- The model may exhibit biases present in the training data
- May generate incorrect, inappropriate, or harmful content
- Performance may vary across different domains and tasks
- Not suitable for production deployment without proper evaluation

## Quick Start

Before loading the model, first replace the FFN implementation in the original `modeling_qwen` code with the dReLU version shown in Technical Details.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("PowerInfer/SparseQwen2-7B")
tokenizer = AutoTokenizer.from_pretrained("PowerInfer/SparseQwen2-7B")

# Tokenize a prompt, generate a continuation, and decode it back to text.
prompt = "Hello"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs)
response = tokenizer.decode(outputs[0])
print(response)
```

## Citation

If you use this model in your research, please cite:

```bibtex
@article{song2024turbo,
  title={Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters},
  author={Song, Yixin and Xie, Haotong and Zhang, Zhengyan and Wen, Bo and Ma, Li and Mi, Zeyu and Chen, Haibo},
  journal={arXiv preprint arXiv:2406.05955},
  year={2024}
}
```