# TorchScale - A Library for Transformers at (Any) Scale
TorchScale is a PyTorch library that allows researchers and developers to scale up Transformers efficiently and effectively. It implements fundamental research to improve the modeling generality and capability, as well as the training stability and efficiency, of scaling Transformers.

- Stability - [**DeepNet**](https://arxiv.org/abs/2203.00555): scaling Transformers to 1,000 layers and beyond
- Generality - [**Foundation Transformers (Magneto)**](https://arxiv.org/abs/2210.06423): towards true general-purpose modeling across tasks and modalities (including language, vision, speech, and multimodal)
- Efficiency - [**X-MoE**](https://arxiv.org/abs/2204.09179): scalable & finetunable sparse Mixture-of-Experts (MoE)

## News

- November 2022: TorchScale 0.1.1 released [[Paper](https://arxiv.org/abs/2211.13184)] [[PyPI](https://pypi.org/project/torchscale/)]

## Installation

To install:

```
pip install torchscale
```

Alternatively, you can develop it locally:

```
git clone https://github.com/microsoft/torchscale.git
cd torchscale
pip install -e .
```

## Getting Started

It takes only a few lines of code to create a model with the fundamental research features above enabled.
Here is how to quickly obtain a BERT-like encoder:

```python
>>> from torchscale.architecture.config import EncoderConfig
>>> from torchscale.architecture.encoder import Encoder

>>> config = EncoderConfig(vocab_size=64000)
>>> model = Encoder(config)

>>> print(model)
```

We also support the `Decoder` architecture and the `EncoderDecoder` architecture:

```python
# Creating a decoder model
>>> from torchscale.architecture.config import DecoderConfig
>>> from torchscale.architecture.decoder import Decoder

>>> config = DecoderConfig(vocab_size=64000)
>>> decoder = Decoder(config)

>>> print(decoder)

# Creating an encoder-decoder model
>>> from torchscale.architecture.config import EncoderDecoderConfig
>>> from torchscale.architecture.encoder_decoder import EncoderDecoder

>>> config = EncoderDecoderConfig(vocab_size=64000)
>>> encdec = EncoderDecoder(config)

>>> print(encdec)
```

## Key Features

- [DeepNorm to improve the training stability of Post-LayerNorm Transformers](https://arxiv.org/abs/2203.00555)
  * enabled by setting *deepnorm=True* in the `Config` class.
  * It adjusts both the residual connection and the initialization method according to the model architecture (i.e., encoder, decoder, or encoder-decoder).
- [SubLN for the model generality and the training stability](https://arxiv.org/abs/2210.06423)
  * enabled by *subln=True*. This is enabled by default.
  * It introduces another LayerNorm to each sublayer and adjusts the initialization according to the model architecture.
  * Note that SubLN and DeepNorm cannot be used in one single model.
- [X-MoE: efficient and finetunable sparse MoE modeling](https://arxiv.org/abs/2204.09179)
  * enabled by *use_xmoe=True*.
  * It replaces every *moe_freq*-th `FeedForwardNetwork` layer with an X-MoE layer.
- [Multiway architecture for multimodality](https://arxiv.org/abs/2208.10442)
  * enabled by *multiway=True*.
  * It provides a pool of Transformer parameters used for different modalities.
- [Relative position bias](https://arxiv.org/abs/1910.10683)
  * enabled by adjusting *rel_pos_buckets* and *max_rel_pos*.
- [SparseClip: improving the gradient clipping for sparse MoE models](https://arxiv.org/abs/2211.13184)
  * We provide [sample code](examples/fairseq/utils/sparse_clip.py) that can be easily adapted to the FairSeq (or other) repo.

Most of the features above can be used by simply passing the corresponding parameters to the config. For example:

```python
>>> from torchscale.architecture.config import EncoderConfig
>>> from torchscale.architecture.encoder import Encoder

>>> config = EncoderConfig(vocab_size=64000, deepnorm=True, multiway=True)
>>> model = Encoder(config)

>>> print(model)
```

## Examples

We have examples of how to use TorchScale in the following scenarios/tasks:

- Language
  * [Decoder/GPT](examples/fairseq/README.md#example-gpt-pretraining)
  * [Encoder-Decoder/Neural Machine Translation](examples/fairseq/README.md#example-machine-translation)
  * [Encoder/BERT](examples/fairseq/README.md#example-bert-pretraining)
- Vision
  * ViT/BEiT [In progress]
- Speech
- Multimodal
  * [Multiway Transformers/BEiT-3](torchscale/model/BEiT3.py) [In progress]

We plan to provide more examples covering different tasks (e.g., vision pretraining and speech recognition) and various deep learning toolkits (e.g., [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)). Any comments or PRs are welcome!

## Results

### Stability Evaluation