Hymba: A Hybrid-head Architecture for Small Language Models
Abstract
We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size reduction, and 3.49x throughput.
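To make the cache-size claim concrete, here is a minimal back-of-the-envelope sketch of how partial sliding window attention and cross-layer KV sharing shrink a KV cache. The function `kv_cache_bytes`, its parameters, and the numbers in the usage note are all illustrative assumptions, not Hymba's actual configuration or accounting. For simplicity it also assumes the sharing factor applies uniformly to every attention layer.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   n_global_layers=None, window=None, share_group=1,
                   bytes_per_elem=2):
    """Rough KV cache size in bytes for one sequence.

    Hypothetical accounting for illustration only:
    - `n_global_layers` layers cache the full context;
    - the remaining layers cache at most `window` tokens;
    - every `share_group` consecutive layers reuse a single cache.
    """
    if n_global_layers is None:
        n_global_layers = n_layers  # default: every layer uses full attention
    n_local_layers = n_layers - n_global_layers

    # Tokens kept per sliding-window layer (full context if no window is set).
    local_tokens = seq_len if window is None else min(seq_len, window)
    cached_tokens = n_global_layers * seq_len + n_local_layers * local_tokens

    # Each cached token stores one key and one value vector per KV head.
    bytes_per_token = 2 * n_kv_heads * head_dim * bytes_per_elem

    # Sharing one cache across `share_group` layers divides the total.
    return cached_tokens * bytes_per_token // share_group


# Illustrative comparison with made-up dimensions.
full = kv_cache_bytes(seq_len=8192, n_layers=32, n_kv_heads=8, head_dim=64)
compact = kv_cache_bytes(seq_len=8192, n_layers=32, n_kv_heads=8, head_dim=64,
                         n_global_layers=3, window=1024, share_group=2)
print(f"full attention: {full / 2**20:.1f} MiB")
print(f"windowed + shared: {compact / 2**20:.1f} MiB")
print(f"reduction: {full / compact:.2f}x")
```

With these made-up dimensions, keeping only a few full-attention layers and sharing caches between layer pairs cuts the cache by roughly an order of magnitude. The reduction reported in the abstract depends on the real model configurations and is not reproduced by this sketch.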
Community
Hymba is an efficient small language model with a hybrid architecture. We have released the 1.5B model; feel free to ask questions here.
GitHub: https://github.com/NVlabs/hymba