Abstract
A novel Chain-of-Model framework introduces hierarchical hidden state chains in Transformers to improve scaling efficiency and inference flexibility for language models.
In this paper, we propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates a causal relationship into the hidden states of each layer in a chain style, thereby introducing great scaling efficiency in model training and inference flexibility in deployment. We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i.e., chains) at the hidden dimension level. In each layer, each chain of the output representation can only view its preceding chains in the input representation. Consequently, a model built upon the CoM framework can progressively scale up the model size by adding chains on top of previous models (i.e., chains), and can offer multiple sub-models of varying sizes for elastic inference by using different numbers of chains. Based on this principle, we devise the Chain-of-Language-Model (CoLM), which incorporates the idea of CoM into each layer of the Transformer architecture. Building on CoLM, we further introduce CoLM-Air with a KV sharing mechanism that computes all keys and values within the first chain and then shares them across all chains. This design offers additional extensibility, such as seamless LM switching and prefilling acceleration. Experimental results demonstrate that our CoLM family achieves performance comparable to the standard Transformer while enabling greater flexibility, such as progressive scaling to improve training efficiency and multiple model sizes for elastic inference, paving a new way toward building language models. Our code will be released in the future at: https://github.com/microsoft/CoLM.
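The following is a minimal PyTorch sketch of the chain-wise visibility constraint described above, assuming equal chain widths, a single linear projection per chain, and that chain i may read input chains up to and including i; the module name ChainLinear and its structure are illustrative only, not the released implementation:

```python
import torch
import torch.nn as nn


class ChainLinear(nn.Module):
    """Illustrative Chain-of-Representation projection: the hidden dimension
    is split into n chains, and output chain i is computed only from input
    chains 0..i, so later chains can be added or dropped without affecting
    earlier ones."""

    def __init__(self, dim: int, n_chains: int):
        super().__init__()
        assert dim % n_chains == 0, "dim must split evenly into chains"
        self.chain_dim = dim // n_chains
        # Projection for chain i consumes the first (i + 1) chains of the input.
        self.projs = nn.ModuleList(
            nn.Linear((i + 1) * self.chain_dim, self.chain_dim)
            for i in range(n_chains)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs = []
        for i, proj in enumerate(self.projs):
            # Output chain i only sees input chains 0..i.
            outs.append(proj(x[..., : (i + 1) * self.chain_dim]))
        return torch.cat(outs, dim=-1)
```

Keeping only the first k chains (and their projections) then yields a smaller, self-contained sub-model, which is what enables the elastic inference and progressive scaling described above.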
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Empirical Evaluation of Knowledge Distillation from Transformers to Subquadratic Language Models (2025)
- Adaptive Layer-skipping in Pre-trained LLMs (2025)
- Parallel Scaling Law for Language Models (2025)
- S'MoRE: Structural Mixture of Residual Experts for LLM Fine-tuning (2025)
- Beyond Standard MoE: Mixture of Latent Experts for Resource-Efficient Language Models (2025)
- LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models (2025)
- Efficient Construction of Model Family through Progressive Training Using Model Expansion (2025)
How do you chain two different hidden sizes? Different models usually have different hidden dimensions, so I'm not entirely sure what we are looking at when they are chained into the same representation.
You can actually choose different chain sizes. As mentioned in Section 3.1, we introduce a hyper-parameter C = {c1, c2, ..., cn} to determine the size of each chain, so the size of chain i is computed as ci / sum(C) * D. In our experiments, we chose the same size for every chain to slightly improve training efficiency due to resource limitations: using equal chain sizes lets us use grouped operations in some operators (e.g., normalization). Besides, you can set different sizes by using Algorithm 1 (Naive PyTorch Implementation) in the Appendix, but it will be slower due to more data access and all-reduce operations. Therefore, we designed a block-sparse kernel using Triton, so we expect each chain size to be a multiple of the block size, where block_size is a power of 2 (at least 64 or larger).
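As a concrete illustration of that sizing rule (a sketch under the assumptions above, not the authors' code), chain i gets roughly ci / sum(C) * D dimensions, here rounded down to a multiple of the Triton block size:

```python
def chain_widths(C, hidden_dim, block_size=64):
    """Split hidden_dim into chain widths proportional to C = [c1, ..., cn].

    Each width is rounded down to a multiple of block_size (assumed here to
    match the block-sparse kernel), and any remainder is assigned to the
    last chain so the widths sum to hidden_dim.
    """
    total = sum(C)
    widths = [int(hidden_dim * c / total) // block_size * block_size for c in C]
    widths[-1] += hidden_dim - sum(widths)  # absorb rounding remainder
    return widths


# Example: D = 1024 with C = [1, 1, 2] gives chain widths [256, 256, 512].
print(chain_widths([1, 1, 2], 1024))
```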
Did you try to unroll W and use a dense GEMM? Some space would be wasted, but it might run faster.
Thanks for your suggestion. We considered it, but the awkward constraint is that our GPUs are only 40GB A100s, so I had to apply many memory-efficient techniques. That said, our Triton implementation is also slightly faster than a standard MLP, as shown in Table 15 in the Appendix. But I admit it could be better.
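For reference, the dense-GEMM alternative discussed here could look like the sketch below: keep the full D x D weight, zero out the blocks that would let an output chain read later input chains, and run one ordinary matmul, trading the wasted zero blocks for dense-kernel speed (equal chain widths assumed; this is not the paper's Triton kernel):

```python
import torch


def masked_dense_chain_matmul(x, weight, n_chains):
    """Dense-GEMM variant of the chained projection.

    x: (..., D), weight: (D, D) used as y = x @ weight, with D divisible by
    n_chains. weight[i, j] is zeroed whenever input dim i belongs to a later
    chain than output dim j, enforcing the chain visibility constraint.
    """
    D = weight.shape[0]
    chain = D // n_chains
    idx = torch.arange(D, device=weight.device) // chain  # chain index per dim
    mask = (idx[:, None] <= idx[None, :]).to(weight.dtype)
    return x @ (weight * mask)
```

In practice the mask would be precomputed once and folded into the weight, so the only overhead is the memory spent on the zeroed blocks.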