Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture — Paper 2412.11834, published Dec 16, 2024
Post: Pre-training a model on only a single RTX 4090 is really slow, even for small language models! (https://huggingface.co/collections/JingzeShi/doge-slm-677fd879f8c4fd0f43e05458)
Post: Warmup -> stable -> decay learning rate scheduler: use the stable-phase checkpoints to continue training the model on any new dataset without training spikes! SmallDoge/Doge-20M-checkpoint SmallDoge/Doge-60M-checkpoint
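The idea above — a schedule that warms up, holds a constant learning rate (where checkpoints are saved for later continuation), then decays — can be sketched as a simple step-to-LR function. This is a minimal illustration; the phase lengths, peak/minimum learning rates, and cosine decay shape are assumptions, not the exact Doge training configuration:

```python
import math

def wsd_lr(step, max_lr=1e-3, warmup_steps=1000, stable_steps=8000,
           decay_steps=1000, min_lr=1e-4):
    """Warmup -> stable -> decay (WSD) learning rate schedule.

    Checkpoints saved during the stable phase can seed continued
    training on a new dataset: resuming there re-enters the schedule
    at the constant-LR plateau, avoiding a loss spike from a sudden
    learning rate jump.
    """
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr.
        return max_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable phase: constant learning rate.
        return max_lr
    # Decay phase: cosine anneal from max_lr down to min_lr.
    progress = min(1.0, (step - warmup_steps - stable_steps) / decay_steps)
    return min_lr + (max_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))
```

To resume on a new dataset, one would load a stable-phase checkpoint and keep calling `wsd_lr` with step counts inside the stable window, deferring the decay phase until the final dataset mix.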