MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs
Abstract
A novel Mixture-of-Basis-Experts (MoBE) method is introduced to compress MoE-based large language models with minimal accuracy loss.
The Mixture-of-Experts (MoE) architecture has become a predominant paradigm for scaling large language models (LLMs). Despite offering strong performance and computational efficiency, large MoE-based LLMs such as DeepSeek-V3-0324 and Kimi-K2-Instruct pose serious deployment challenges due to their substantial memory requirements. While recent works have explored MoE compression to address this issue, existing methods often suffer from considerable accuracy drops (e.g., 7%-14% relative) even at modest compression rates. This paper introduces a novel Mixture-of-Basis-Experts (MoBE) method that achieves model compression while incurring minimal accuracy drops. Specifically, each up/gate matrix in an expert is decomposed via a rank decomposition as W = AB, where the matrix A is unique to each expert. The relatively larger matrix B is further re-parameterized as a linear combination of basis matrices {Bi} shared across all experts within a given MoE layer. The factorization is learned by minimizing the reconstruction error relative to the original weight matrices. Experiments demonstrate that MoBE achieves notably lower accuracy drops than prior works. For instance, MoBE can reduce the parameter counts of Qwen3-235B-A22B-2507, DeepSeek-V3-0324 (671B) and Kimi-K2-Instruct (1T) by 24%-30% with only a 1%-2% accuracy drop (about a 2% relative drop).
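The abstract specifies the structure of the factorization but not its implementation. The following minimal PyTorch sketch illustrates the idea under assumed shapes and names (`MoBELayer`, `fit`, and every dimension below are illustrative choices, not taken from the paper): each expert's up/gate matrix is approximated as W_e ≈ A_e (Σ_i α_{e,i} B_i), with A_e expert-specific and the basis matrices {B_i} shared across the layer, and the parameters are fitted by minimizing the reconstruction error against the original weights.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the MoBE re-parameterization (not the authors' code).
# Each expert's up/gate matrix W_e (d_in x d_out) is approximated as
#   W_e ≈ A_e @ B_e,   with B_e = sum_i alpha_{e,i} * B_i,
# where A_e is expert-specific and the basis matrices {B_i} are shared
# across all experts of one MoE layer. All sizes below are assumptions.

class MoBELayer(nn.Module):
    def __init__(self, num_experts=128, num_bases=16, d_in=1536, rank=256, d_out=4096):
        super().__init__()
        self.A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.02)      # per-expert
        self.basis = nn.Parameter(torch.randn(num_bases, rank, d_out) * 0.02)   # shared
        self.alpha = nn.Parameter(torch.randn(num_experts, num_bases) * 0.02)   # mixing coeffs

    def reconstruct(self):
        # B_e = sum_i alpha_{e,i} B_i   -> (num_experts, rank, d_out)
        B = torch.einsum("eb,brd->erd", self.alpha, self.basis)
        # W_e = A_e @ B_e               -> (num_experts, d_in, d_out)
        return torch.einsum("eir,erd->eid", self.A, B)

def fit(mobe, W_orig, steps=1000, lr=1e-3):
    """Fit the factorization by minimizing the mean squared reconstruction error
    against the original expert weights W_orig of shape (num_experts, d_in, d_out)."""
    opt = torch.optim.Adam(mobe.parameters(), lr=lr)
    for _ in range(steps):
        loss = (mobe.reconstruct() - W_orig).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()
```

With the assumed sizes above, each per-expert matrix A_e holds 1536 × 256 parameters while each shared basis matrix holds 256 × 4096, matching the abstract's description of B as the relatively larger factor that is amortized across experts.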
Community
While Mixture-of-Experts (MoE) models are state-of-the-art, their massive size makes them very difficult to deploy due to high memory costs. Current methods for compressing these models cause a large relative drop in performance (7-14%). This paper introduces a new, highly effective method called Mixture-of-Basis-Experts (MoBE).
The core idea is to change how the "experts" are built. Instead of each expert being completely independent, MoBE re-engineers them to be a combination of a small, unique component and a larger component built from a set of "basis" matrices that are shared across all experts. This efficient parameter-sharing strategy allows MoBE to reduce the size of massive models (from hundreds of billions to over a trillion parameters) by 24-30% while causing a negligible performance drop of only 1-2%, a significant improvement over previous techniques.
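To see where the savings come from, here is a back-of-the-envelope parameter count in Python. All sizes (number of experts, matrix dimensions, rank, number of bases) are purely illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope parameter count for one MoE layer's up/gate matrices
# (all sizes below are illustrative assumptions, not taken from the paper).
E, d_in, d_out, r, n_b = 128, 1536, 4096, 256, 16   # experts, dims, rank, #bases

# Original MoE: every expert stores its own full d_in x d_out matrix.
original = E * d_in * d_out

# MoBE: each expert keeps only a small A_e (d_in x r) plus n_b mixing
# coefficients; the n_b basis matrices (r x d_out) are stored once and
# shared by all experts in the layer.
mobe = E * (d_in * r + n_b) + n_b * r * d_out

print(f"up/gate parameters saved: {1 - mobe / original:.1%}")
# The printed fraction applies only to the up/gate matrices of this layer;
# the paper's 24%-30% figure is over total model parameters, which also
# include uncompressed components such as down projections and attention.
```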
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis (2025)
- EAC-MoE: Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models (2025)
- Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging (2025)
- Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts (2025)
- Unveiling Super Experts in Mixture-of-Experts Large Language Models (2025)
- MoNE: Replacing Redundant Experts with Lightweight Novices for Structured Pruning of MoE (2025)
- BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity (2025)