---
license: mit
language:
  - en
  - zh
---

# Model Card for sparsing-law-0.1b-relu

- **Paper:** paper
- **Repository and demo code:** github

This model is ReLU-activated and contains approximately 0.1 billion non-embedding parameters.
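Below is a minimal loading sketch using the standard `transformers` causal-LM interface. The repository id is a placeholder assumption (substitute the actual Hugging Face repo id), and `trust_remote_code=True` is included only in case the architecture ships custom modeling code.

```python
# Minimal loading sketch; the repo id below is a placeholder, not confirmed by this card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "demerzel-iv/sparsing-law-0.1b-relu"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```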

The model was trained from scratch using the pre-training dataset described in our paper, with the WSD (Warmup-Stable-Decay) learning rate scheduler. It represents the final checkpoint of the stable stage in WSD, meaning it has not undergone the decay stage.
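For intuition, here is a sketch of the WSD learning-rate shape. The step counts and peak learning rate are illustrative placeholders, not the settings used in the paper; this checkpoint corresponds to the end of the stable stage, before any decay is applied.

```python
# Illustrative WSD (Warmup-Stable-Decay) schedule; all hyperparameters are placeholders.
def wsd_lr(step, peak_lr=1e-3, warmup_steps=1_000, stable_steps=100_000, decay_steps=10_000):
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable stage: constant peak LR. This checkpoint is taken at the end of this stage.
        return peak_lr
    # Decay stage (not applied to this checkpoint): linear decay toward 0.
    progress = (step - warmup_steps - stable_steps) / decay_steps
    return peak_lr * max(0.0, 1.0 - progress)
```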