Abstract
A novel Chain-of-Model framework introduces hierarchical hidden state chains in Transformers to improve scaling efficiency and inference flexibility for language models.
In this paper, we propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates a causal relationship into the hidden states of each layer in a chain style, thereby introducing great scaling efficiency in model training and inference flexibility in deployment. We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i.e., chains) at the hidden dimension level. In each layer, each chain of the output representation can only view its preceding chains in the input representation. Consequently, a model built upon the CoM framework can progressively scale up the model size by adding chains on top of previous models (i.e., chains), and can offer multiple sub-models of varying sizes for elastic inference by using different numbers of chains. Based on this principle, we devise the Chain-of-Language-Model (CoLM), which incorporates the idea of CoM into each layer of the Transformer architecture. Building on CoLM, we further introduce CoLM-Air with a KV sharing mechanism that computes all keys and values within the first chain and then shares them across all chains. This design offers additional extensibility, such as seamless LM switching and prefilling acceleration. Experimental results demonstrate that our CoLM family achieves performance comparable to the standard Transformer while enabling greater flexibility, such as progressive scaling to improve training efficiency and multiple model sizes for elastic inference, paving a new way toward building language models. Our code will be released in the future at: https://github.com/microsoft/CoLM.
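The following is a minimal PyTorch sketch of the chain-wise visibility constraint described above, assuming equal chain widths, a single linear projection per chain, and that chain i may read input chains up to and including i; the module name ChainLinear and its structure are illustrative only, not the released implementation:

```python
import torch
import torch.nn as nn


class ChainLinear(nn.Module):
    """Illustrative Chain-of-Representation projection: the hidden dimension
    is split into n chains, and output chain i is computed only from input
    chains 0..i, so later chains can be added or dropped without affecting
    earlier ones."""

    def __init__(self, dim: int, n_chains: int):
        super().__init__()
        assert dim % n_chains == 0, "dim must split evenly into chains"
        self.chain_dim = dim // n_chains
        # Projection for chain i consumes the first (i + 1) chains of the input.
        self.projs = nn.ModuleList(
            nn.Linear((i + 1) * self.chain_dim, self.chain_dim)
            for i in range(n_chains)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs = []
        for i, proj in enumerate(self.projs):
            # Output chain i only sees input chains 0..i.
            outs.append(proj(x[..., : (i + 1) * self.chain_dim]))
        return torch.cat(outs, dim=-1)
```

Keeping only the first k chains (and their projections) then yields a smaller, self-contained sub-model, which is what enables the elastic inference and progressive scaling described above.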
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Empirical Evaluation of Knowledge Distillation from Transformers to Subquadratic Language Models (2025)
- Adaptive Layer-skipping in Pre-trained LLMs (2025)
- Parallel Scaling Law for Language Models (2025)
- S'MoRE: Structural Mixture of Residual Experts for LLM Fine-tuning (2025)
- Beyond Standard MoE: Mixture of Latent Experts for Resource-Efficient Language Models (2025)
- LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models (2025)
- Efficient Construction of Model Family through Progressive Training Using Model Expansion (2025)
How do you chain two different hidden sizes? Different models usually have different hidden dimensions, so I'm not entirely sure what we are looking at when they are chained into the same representation.
You can actually choose different chain sizes. As mentioned in Section 3.1, we introduce a hyper-parameter C = {c1, c2, ..., cn} to determine the size of each chain, so the size of chain i is computed as ci / sum(C) * D. In our experiments, we chose the same size for every chain to slightly improve training efficiency due to resource limitations: using equal chain sizes lets us use grouped operations in some operators (e.g., normalization). Besides, you can set different sizes by using Algorithm 1 (Naive PyTorch Implementation) in the Appendix, but it will be slower due to more data access and all-reduce operations. Therefore, we designed a block-sparse kernel using Triton, so we expect each chain size to be a multiple of the block size, where block_size is a power of 2 (at least 64 or larger).
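As a concrete illustration of that sizing rule (a sketch under the assumptions above, not the authors' code), chain i gets roughly ci / sum(C) * D dimensions, here rounded down to a multiple of the Triton block size:

```python
def chain_widths(C, hidden_dim, block_size=64):
    """Split hidden_dim into chain widths proportional to C = [c1, ..., cn].

    Each width is rounded down to a multiple of block_size (assumed here to
    match the block-sparse kernel), and any remainder is assigned to the
    last chain so the widths sum to hidden_dim.
    """
    total = sum(C)
    widths = [int(hidden_dim * c / total) // block_size * block_size for c in C]
    widths[-1] += hidden_dim - sum(widths)  # absorb rounding remainder
    return widths


# Example: D = 1024 with C = [1, 1, 2] gives chain widths [256, 256, 512].
print(chain_widths([1, 1, 2], 1024))
```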
Did you try to unroll W and use a dense GEMM? Some space would be wasted, but it might run faster.
Thanks for your suggestion. We considered it, but the awkward constraint is that our GPUs are only 40GB A100s, so I had to apply many memory-efficient techniques. That said, our Triton implementation is also slightly faster than a standard MLP, as shown in Table 15 in the Appendix. But I admit it could be better.
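For reference, the dense-GEMM alternative discussed here could look like the sketch below: keep the full D x D weight, zero out the blocks that would let an output chain read later input chains, and run one ordinary matmul, trading the wasted zero blocks for dense-kernel speed (equal chain widths assumed; this is not the paper's Triton kernel):

```python
import torch


def masked_dense_chain_matmul(x, weight, n_chains):
    """Dense-GEMM variant of the chained projection.

    x: (..., D), weight: (D, D) used as y = x @ weight, with D divisible by
    n_chains. weight[i, j] is zeroed whenever input dim i belongs to a later
    chain than output dim j, enforcing the chain visibility constraint.
    """
    D = weight.shape[0]
    chain = D // n_chains
    idx = torch.arange(D, device=weight.device) // chain  # chain index per dim
    mask = (idx[:, None] <= idx[None, :]).to(weight.dtype)
    return x @ (weight * mask)
```

In practice the mask would be precomputed once and folded into the weight, so the only overhead is the memory spent on the zeroed blocks.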