---
license: apache-2.0
language:
- en
- zh
pipeline_tag: text-generation
---
# Model Card for sparsing-law-0.1b-relu

## Introduction
This model is one of the key checkpoints used for most of the analyses in the paper [Sparsing Law: Towards Large Language Models with Greater Activation Sparsity](https://arxiv.org/pdf/2411.02335.pdf). It is ReLU-activated and contains approximately 0.1 billion non-embedding parameters.
The model was trained from scratch on the pre-training dataset described in our paper, using the WSD (Warmup-Stable-Decay) learning rate scheduler. Note that it is a base model taken from the last checkpoint of the stable pre-training stage; it has not undergone the decay stage or supervised fine-tuning (SFT).
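Since the paper centers on activation sparsity, here is a minimal sketch of how the sparsity of a ReLU layer can be measured as the fraction of exactly-zero outputs. The function names and toy values below are our own illustration, not code or data from the paper; the paper defines its own, more refined sparsity metrics.

```python
def relu(x: float) -> float:
    """Standard ReLU: negative pre-activations are clipped to zero."""
    return max(x, 0.0)

def activation_sparsity(pre_activations: list[float]) -> float:
    """Fraction of neurons whose ReLU output is exactly zero.

    A higher value means a sparser (cheaper-to-compute) activation pattern.
    """
    acts = [relu(v) for v in pre_activations]
    return sum(1 for a in acts if a == 0.0) / len(acts)

# Hypothetical pre-activation values for one layer, for illustration only.
print(activation_sparsity([-1.5, 0.2, 2.3, -0.7, 4.1, -3.2]))  # → 0.5
```

Three of the six toy pre-activations are negative, so ReLU zeroes them out and the sparsity ratio is 3/6 = 0.5.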
## Citation
Please kindly cite our work using the following BibTeX entry:

```bibtex
@article{luo2024sparsinglaw,
  title={{Sparsing Law}: Towards Large Language Models with Greater Activation Sparsity},
  author={Yuqi Luo and Chenyang Song and Xu Han and Yingfa Chen and Chaojun Xiao and Zhiyuan Liu and Maosong Sun},
  year={2024},
  journal={arXiv preprint arXiv:2411.02335},
  url={https://arxiv.org/pdf/2411.02335.pdf}
}
```