---
license: apache-2.0
language:
- en
- zh
pipeline_tag: text-generation
---
# Model Card for sparsing-law-0.1b-relu

## Introduction
This model is one of the key checkpoints used for most of the analyses in the paper [Sparsing Law: Towards Large Language Models with Greater Activation Sparsity](https://arxiv.org/pdf/2411.02335.pdf). It is ReLU-activated and contains approximately 0.1 billion non-embedding parameters.
The model was trained from scratch on the pre-training dataset described in our paper, using the WSD (Warmup-Stable-Decay) learning rate scheduler. Note that it is a base model taken from the last checkpoint of the stable pre-training stage; it has not undergone the decay stage or supervised fine-tuning (SFT).
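Since the paper centers on activation sparsity, here is a minimal sketch of how the sparsity of a ReLU layer can be measured as the fraction of exactly-zero outputs. The function names and toy values below are our own illustration, not code or data from the paper; the paper defines its own, more refined sparsity metrics.

```python
def relu(x: float) -> float:
    """Standard ReLU: negative pre-activations are clipped to zero."""
    return max(x, 0.0)

def activation_sparsity(pre_activations: list[float]) -> float:
    """Fraction of neurons whose ReLU output is exactly zero.

    A higher value means a sparser (cheaper-to-compute) activation pattern.
    """
    acts = [relu(v) for v in pre_activations]
    return sum(1 for a in acts if a == 0.0) / len(acts)

# Hypothetical pre-activation values for one layer, for illustration only.
print(activation_sparsity([-1.5, 0.2, 2.3, -0.7, 4.1, -3.2]))  # → 0.5
```

Three of the six toy pre-activations are negative, so ReLU zeroes them out and the sparsity ratio is 3/6 = 0.5.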
## Citation
Please kindly cite our work using the following BibTeX entry:

```bibtex
@article{luo2024sparsinglaw,
  title={{Sparsing Law}: Towards Large Language Models with Greater Activation Sparsity},
  author={Yuqi Luo and Chenyang Song and Xu Han and Yingfa Chen and Chaojun Xiao and Zhiyuan Liu and Maosong Sun},
  year={2024},
  journal={arXiv preprint arXiv:2411.02335},
  url={https://arxiv.org/pdf/2411.02335.pdf}
}
```