stereoplegic's Collections
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
Paper • 2310.16795 • Published • 27
Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
Paper • 2308.12066 • Published • 4
Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
Paper • 2303.06182 • Published • 1
EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate
Paper • 2112.14397 • Published • 1
From Sparse to Soft Mixtures of Experts
Paper • 2308.00951 • Published • 22
Experts Weights Averaging: A New General Training Scheme for Vision Transformers
Paper • 2308.06093 • Published • 2
ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer
Paper • 2306.06446 • Published • 1
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
Paper • 2212.05055 • Published • 6
SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing
Paper • 2212.05191 • Published • 1
Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks
Paper • 2306.04073 • Published • 2
Multi-Head Adapter Routing for Cross-Task Generalization
Paper • 2211.03831 • Published • 2
Improving Visual Prompt Tuning for Self-supervised Vision Transformers
Paper • 2306.05067 • Published • 2
A Unified View of Long-Sequence Models towards Modeling Million-Scale Dependencies
Paper • 2302.06218 • Published • 1
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
Paper • 2305.06324 • Published • 1
Sparse Backpropagation for MoE Training
Paper • 2310.00811 • Published • 2
Zorro: the masked multimodal transformer
Paper • 2301.09595 • Published • 2
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
Paper • 2201.05596 • Published • 2
Approximating Two-Layer Feedforward Networks for Efficient Transformers
Paper • 2310.10837 • Published • 11
Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers
Paper • 2303.13755 • Published • 1
Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation
Paper • 2310.15961 • Published • 1
LoRA ensembles for large language model fine-tuning
Paper • 2310.00035 • Published • 2
Build a Robust QA System with Transformer-based Mixture of Experts
Paper • 2204.09598 • Published • 1
PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts
Paper • 2305.14839 • Published • 1
A Mixture-of-Expert Approach to RL-based Dialogue Management
Paper • 2206.00059 • Published • 1
Spatial Mixture-of-Experts
Paper • 2211.13491 • Published • 1
FastMoE: A Fast Mixture-of-Expert Training System
Paper • 2103.13262 • Published • 2
SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System
Paper • 2205.10034 • Published • 1
Eliciting and Understanding Cross-Task Skills with Task-Level Mixture-of-Experts
Paper • 2205.12701 • Published • 1
FEAMOE: Fair, Explainable and Adaptive Mixture of Experts
Paper • 2210.04995 • Published • 1
On the Adversarial Robustness of Mixture of Experts
Paper • 2210.10253 • Published • 1
HMOE: Hypernetwork-based Mixture of Experts for Domain Generalization
Paper • 2211.08253 • Published • 1
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Paper • 2101.03961 • Published • 13
A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
Paper • 2303.06318 • Published • 1
Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts
Paper • 2306.04845 • Published • 4
AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation
Paper • 2210.07535 • Published • 1
Optimizing Mixture of Experts using Dynamic Recompilations
Paper • 2205.01848 • Published • 1
Towards Understanding Mixture of Experts in Deep Learning
Paper • 2208.02813 • Published • 1
Learning Factored Representations in a Deep Mixture of Experts
Paper • 1312.4314 • Published • 1
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Paper • 1701.06538 • Published • 7
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
Paper • 2112.06905 • Published • 2
Contextual Mixture of Experts: Integrating Knowledge into Predictive Modeling
Paper • 2211.00558 • Published • 1
Taming Sparsely Activated Transformer with Stochastic Experts
Paper • 2110.04260 • Published • 2
Heterogeneous Multi-task Learning with Expert Diversity
Paper • 2106.10595 • Published • 2
SteloCoder: a Decoder-Only LLM for Multi-Language to Python Code Translation
Paper • 2310.15539 • Published • 1
LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition
Paper • 2307.13269 • Published • 34
SkillNet-NLG: General-Purpose Natural Language Generation with a Sparsely Activated Approach
Paper • 2204.12184 • Published • 1
SkillNet-NLU: A Sparsely Activated Model for General-Purpose Natural Language Understanding
Paper • 2203.03312 • Published • 1
Residual Mixture of Experts
Paper • 2204.09636 • Published • 1
Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models
Paper • 2203.01104 • Published • 2
Emergent Mixture-of-Experts: Can Dense Pre-trained Transformers Benefit from Emergent Modular Structures?
Paper • 2310.10908 • Published • 1
One Student Knows All Experts Know: From Sparse to Dense
Paper • 2201.10890 • Published • 1
HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System
Paper • 2203.14685 • Published • 1
Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts
Paper • 2305.18691 • Published • 1
An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training
Paper • 2306.17165 • Published • 1
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
Paper • 2211.15841 • Published • 8
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts
Paper • 2105.03036 • Published • 2
Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition
Paper • 2307.05956 • Published • 1
M6-T: Exploring Sparse Expert Models and Beyond
Paper • 2105.15082 • Published • 1
Cross-token Modeling with Conditional Computation
Paper • 2109.02008 • Published • 1
Sparsely-gated Mixture-of-Expert Layers for CNN Interpretability
Paper • 2204.10598 • Published • 2
Efficient Language Modeling with Sparse all-MLP
Paper • 2203.06850 • Published • 1
Efficient Large Scale Language Modeling with Mixtures of Experts
Paper • 2112.10684 • Published • 2
TAME: Task Agnostic Continual Learning using Multiple Experts
Paper • 2210.03869 • Published • 1
Learning an evolved mixture model for task-free continual learning
Paper • 2207.05080 • Published • 1
Model Spider: Learning to Rank Pre-Trained Models Efficiently
Paper • 2306.03900 • Published • 1
Task-Specific Expert Pruning for Sparse Mixture-of-Experts
Paper • 2206.00277 • Published • 1
SiRA: Sparse Mixture of Low Rank Adaptation
Paper • 2311.09179 • Published • 9
Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning
Paper • 2309.05444 • Published • 1
MoEC: Mixture of Expert Clusters
Paper • 2207.09094 • Published • 1
Image Super-resolution Via Latent Diffusion: A Sampling-space Mixture Of Experts And Frequency-augmented Decoder Approach
Paper • 2310.12004 • Published • 2
A General Theory for Softmax Gating Multinomial Logistic Mixture of Experts
Paper • 2310.14188 • Published • 1
Extending Mixture of Experts Model to Investigate Heterogeneity of Trajectories: When, Where and How to Add Which Covariates
Paper • 2007.02432 • Published • 1
Mixture of experts models for multilevel data: modelling framework and approximation theory
Paper • 2209.15207 • Published • 1
ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization
Paper • 2311.13171 • Published • 1
The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles
Paper • 2306.01705 • Published • 1
Exponentially Faster Language Modelling
Paper • 2311.10770 • Published • 119
Scaling Expert Language Models with Unsupervised Domain Discovery
Paper • 2303.14177 • Published • 2
Hash Layers For Large Sparse Models
Paper • 2106.04426 • Published • 2
Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model
Paper • 2212.09811 • Published • 1
Exploiting Transformer Activation Sparsity with Dynamic Inference
Paper • 2310.04361 • Published • 1
Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness
Paper • 2310.02410 • Published • 3
Punica: Multi-Tenant LoRA Serving
Paper • 2310.18547 • Published • 2
Merging Experts into One: Improving Computational Efficiency of Mixture of Experts
Paper • 2310.09832 • Published • 1
Adaptive Gating in Mixture-of-Experts based Language Models
Paper • 2310.07188 • Published • 2
Making Small Language Models Better Multi-task Learners with Mixture-of-Task-Adapters
Paper • 2309.11042 • Published • 2
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Paper • 2312.07987 • Published • 41
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts
Paper • 2312.00968 • Published • 1
Memory Augmented Language Models through Mixture of Word Experts
Paper • 2311.10768 • Published • 19
Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models
Paper • 2311.08692 • Published • 13
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Paper • 2401.06066 • Published • 59
Mixture of Attention Heads: Selecting Attention Heads Per Token
Paper • 2210.05144 • Published • 2
Direct Neural Machine Translation with Task-level Mixture of Experts models
Paper • 2310.12236 • Published • 3
Mixture-of-Linguistic-Experts Adapters for Improving and Interpreting Pre-trained Language Models
Paper • 2310.16240 • Published • 1
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks
Paper • 2401.02731 • Published • 3
Parameter-Efficient Conformers via Sharing Sparsely-Gated Experts for End-to-End Speech Recognition
Paper • 2209.08326 • Published • 1
Mixture-of-experts VAEs can disregard variation in surjective multimodal data
Paper • 2204.05229 • Published • 1
One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code
Paper • 2205.06126 • Published • 1
Specialized Language Models with Cheap Inference from Limited Domain Data
Paper • 2402.01093 • Published • 47
BlackMamba: Mixture of Experts for State-Space Models
Paper • 2402.01771 • Published • 25
OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models
Paper • 2402.01739 • Published • 28
Fast Inference of Mixture-of-Experts Language Models with Offloading
Paper • 2312.17238 • Published • 7
Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
Paper • 2401.08383 • Published • 1
A Review of Sparse Expert Models in Deep Learning
Paper • 2209.01667 • Published • 3
Robust Mixture-of-Expert Training for Convolutional Neural Networks
Paper • 2308.10110 • Published • 2
On the Representation Collapse of Sparse Mixture of Experts
Paper • 2204.09179 • Published • 1
StableMoE: Stable Routing Strategy for Mixture of Experts
Paper • 2204.08396 • Published • 1
DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning
Paper • 2106.03760 • Published • 4
CPM-2: Large-scale Cost-effective Pre-trained Language Models
Paper • 2106.10715 • Published • 1
Demystifying Softmax Gating Function in Gaussian Mixture of Experts
Paper • 2305.03288 • Published • 1
Statistical Perspective of Top-K Sparse Softmax Gating Mixture of Experts
Paper • 2309.13850 • Published • 1
Sparse Mixture-of-Experts are Domain Generalizable Learners
Paper • 2206.04046 • Published • 1
Unified Scaling Laws for Routed Language Models
Paper • 2202.01169 • Published • 2
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Paper • 2206.02770 • Published • 4
Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy
Paper • 2310.01334 • Published • 3
ST-MoE: Designing Stable and Transferable Sparse Expert Models
Paper • 2202.08906 • Published • 3
Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts
Paper • 2309.04354 • Published • 16
A non-asymptotic approach for model selection via penalization in high-dimensional mixture of experts models
Paper • 2104.02640 • Published • 1
Non-asymptotic oracle inequalities for the Lasso in high-dimensional mixture of experts
Paper • 2009.10622 • Published • 1
Fast Feedforward Networks
Paper • 2308.14711 • Published • 3
Mixture-of-Experts with Expert Choice Routing
Paper • 2202.09368 • Published • 4
Go Wider Instead of Deeper
Paper • 2107.11817 • Published • 1
Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT
Paper • 2205.12399 • Published • 1
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning
Paper • 2205.12410 • Published • 1
Mixtures of Experts Unlock Parameter Scaling for Deep RL
Paper • 2402.08609 • Published • 36
Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
Paper • 2402.07033 • Published • 19
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper • 2403.07508 • Published • 77
Scattered Mixture-of-Experts Implementation
Paper • 2403.08245 • Published • 1
Sparse Universal Transformer
Paper • 2310.07096 • Published
Multi-Head Mixture-of-Experts
Paper • 2404.15045 • Published • 60
LoRAMoE: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment
Paper • 2312.09979 • Published • 2
JetMoE: Reaching Llama2 Performance with 0.1M Dollars
Paper • 2404.07413 • Published • 38
Learning to Route Among Specialized Experts for Zero-Shot Generalization
Paper • 2402.05859 • Published • 5
MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models
Paper • 2402.12851 • Published • 2
MoEUT: Mixture-of-Experts Universal Transformers
Paper • 2405.16039 • Published • 3
Yuan 2.0-M32: Mixture of Experts with Attention Router
Paper • 2405.17976 • Published • 21
Enhancing Fast Feed Forward Networks with Load Balancing and a Master Leaf Node
Paper • 2405.16836 • Published
Mixture of A Million Experts
Paper • 2407.04153 • Published • 5