Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
Abstract
A constructive approach to scaling large language models, combining modular composition with layer-wise growth on fixed embeddings, improves both performance and flexibility.
The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive approach to model development, built upon the foundation of non-trainable, deterministic input embeddings. In prior work [1], we established that high-level semantic reasoning can emerge in Transformers using frozen embeddings derived from the visual structure of Unicode glyphs. Here, we demonstrate that this fixed representational substrate acts as a universal "docking port," enabling two powerful and efficient scaling paradigms: seamless modular composition and progressive layer-wise growth. First, we show that specialist models trained on disparate datasets (e.g., Russian and Chinese text) can be merged into a single, more capable Mixture-of-Experts (MoE) model, post-training, with zero architectural modification. This is achieved by simply averaging their output logits. The resulting MoE model exhibits immediate performance improvements on reasoning benchmarks like MMLU, surpassing its constituent experts without catastrophic forgetting. Second, we introduce a layer-wise constructive training methodology, where a deep Transformer is "grown" by progressively stacking and training one layer at a time. This method demonstrates stable convergence and a clear correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD. Our findings suggest a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development, where complexity is built incrementally and modules can be composed freely. This opens new avenues for resource-efficient scaling, continual learning, and a more democratized ecosystem for building powerful AI systems. We release all code and models to facilitate further research.
Community
How does an LLM understand the meaning of 'wRiTe' when its building blocks—the individual character tokens 'w', 'R', 'i'—have no semantic content? This simple question challenges the very foundation of modern AI.
Our paper argues that high-level meaning is not contained in embeddings, but is constructed by the Transformer architecture. We prove this by replacing standard trainable embeddings with a completely frozen layer derived from the raw visual structure of Unicode glyphs. These non-semantic vectors are fixed before training even begins.
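As a toy illustration of such a frozen substrate, the sketch below builds a deterministic embedding table from glyph bitmaps. The 5x3 bitmaps are hypothetical stand-ins for real rasterized Unicode glyphs (the paper derives its vectors from actual glyph renderings), but the key property is the same: the vectors are fixed before training and never receive gradients.

```python
# Hypothetical 5x3 glyph bitmaps standing in for real Unicode glyph rasters.
GLYPH_BITMAPS = {
    # each string is one scanline of a toy 3-pixel-wide raster
    "w": ["101", "101", "101", "111", "101"],
    "R": ["110", "101", "110", "101", "101"],
    "i": ["010", "000", "010", "010", "010"],
}

def frozen_embedding(char):
    """Deterministic vector from a glyph bitmap: no training, no gradients."""
    bitmap = GLYPH_BITMAPS[char]
    return [float(px) for row in bitmap for px in row]  # flatten to 15 dims

# Computed once, before training begins, and never updated afterwards.
EMBEDDINGS = {c: frozen_embedding(c) for c in GLYPH_BITMAPS}
```

Because the table depends only on the glyphs themselves, any two models built on it share an identical input space, which is what makes the composition tricks below possible.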
The result is paradigm-shifting: our models not only converge but consistently outperform identical architectures on reasoning benchmarks. This reveals a core principle for development: Induction. Instead of forcing a model to guess all its knowledge at once, we give it simple, immutable rules (the visual form of characters) and let it build complexity from there.
It’s the difference between trying to freeze an entire lake instantly, versus letting a solid sheet of ice form layer by layer. It’s the power of a locomotive moving an entire train by first conquering the inertia of a single car.
This foundational discovery unlocks a powerful new methodology. In this paper we demonstrate the practical payoff: merging expert models like LEGOs and "growing" powerful AI systems incrementally.
This two-part work presents a blueprint for a more modular, efficient, and scalable future for AI.
What if we've been building LLMs all wrong? Instead of forging a monolithic giant in a single, resource-intensive fire, our work shows AI can be grown.
Building on our foundational paper (arXiv:2507.04886, https://huggingface.co/papers/2507.04886), we introduce "Constructive Learning." Our frozen, non-semantic embeddings act as a universal substrate, allowing us to:
Merge experts like LEGOs: Combine RU and ZH models post-training into a superior MoE.
Grow models layer-by-layer: We build deep knowledge incrementally, like a frozen lake's ice thickening from a thin crust to a solid core. This is how complex reasoning emerges.
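The LEGO-style merge can be sketched in a few lines. The expert logits below are hypothetical numbers; the real requirement, per the paper, is only that all experts share the same output space thanks to the common frozen substrate.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def merge_logits(expert_logits):
    """Average per-token output logits from independently trained experts.
    Valid only because every expert shares the same frozen embedding
    substrate and hence the same vocabulary / output space."""
    n = len(expert_logits)
    return [sum(col) / n for col in zip(*expert_logits)]

# Hypothetical logits over a shared 4-token vocabulary:
ru_expert = [2.0, 0.5, -1.0, 0.1]
zh_expert = [1.0, 1.5, -0.5, 0.3]
merged = merge_logits([ru_expert, zh_expert])
probs = softmax(merged)
```

No weights are touched and no router is trained: the "MoE" here is pure post-hoc output averaging, which is why the merge requires zero architectural modification.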
Think of it like moving a massive train: you don't push the whole thing at once; you gain momentum by moving one car at a time. This paradigm isn't just about efficiency; it's about the future. When monolithic models exhaust the world's data and data centers, this method will still allow for growth.
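The train-car schedule can be sketched as follows, assuming previously trained layers are frozen while each new layer is fitted; the `Layer` and `grow_model` names and the dummy `train_fn` are illustrative, not the paper's actual code.

```python
class Layer:
    """Placeholder for one Transformer block in the growing stack."""
    def __init__(self, idx):
        self.idx = idx
        self.trainable = True  # only the newest layer receives updates

    def freeze(self):
        self.trainable = False

def grow_model(depth, train_fn):
    """Progressively stack layers: freeze everything built so far,
    append one new layer, train only that layer, then repeat."""
    stack = []
    for i in range(depth):
        for layer in stack:
            layer.freeze()          # earlier layers (and embeddings) stay fixed
        new_layer = Layer(i)
        stack.append(new_layer)
        train_fn(stack, new_layer)  # gradients flow only into new_layer
    return stack

# Stand-in for a real training loop: record the order layers are trained in.
log = []
model = grow_model(4, lambda stack, layer: log.append(layer.idx))
```

Each step only has to "move one car": the optimization problem at every stage is a single layer on top of a fixed stack, rather than the whole depth at once.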
Why continue down a path with a known ceiling and a massive carbon footprint? Let's start building AI constructively.