More training details?
Hi,
Thanks so much for the model and the very well-written, approachable technical report – it's great to see continuing work on BitNet!
I wonder if you could share any more details about the training, especially regarding cost / resource utilisation and how it compares to un-quantised training runs? Naively, I would expect meaningful efficiency gains at training time as well, but it would be great to get some concrete numbers.
Thanks in advance,
Ed
I don't think there is any benefit in training efficiency, since it still uses full precision during the training stage.
Really??
I thought the whole point of BitNet was that the ternary “quantisation” is applied during training, rather than afterwards. Otherwise, why even bother training a new model in the first place?
The authors could have just applied post-training ternary quantisation to an existing model like Phi-4 instead. That would probably have been quicker and cheaper, it would have made clear that this is “just” post-training quantisation, and it would have provided a more direct baseline to compare against.
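For concreteness, here's my mental model of how BitNet-style training works – a minimal PyTorch sketch of quantisation-aware training with a straight-through estimator, not the authors' actual code (the layer name, scaling rule and hyperparameters here are my own assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinear(nn.Linear):
    """Hypothetical QAT layer: ternary weights in the forward pass,
    full-precision latent weights underneath."""
    def forward(self, x):
        w = self.weight
        # Absmean scaling, then round weights to {-1, 0, +1}.
        scale = w.abs().mean().clamp(min=1e-5)
        w_q = (w / scale).round().clamp(-1, 1) * scale
        # Straight-through estimator: the forward pass sees the ternary
        # weights, the backward pass flows through the full-precision ones.
        w_ste = w + (w_q - w).detach()
        return F.linear(x, w_ste, self.bias)

layer = TernaryLinear(64, 64)
opt = torch.optim.AdamW(layer.parameters(), lr=1e-3)

x = torch.randn(8, 64)
loss = layer(x).pow(2).mean()
loss.backward()  # gradients land on the full-precision latent weights
opt.step()       # the optimiser updates full-precision weights and states
```

If that's roughly right, both statements would be compatible: the ternary quantisation does happen during training (so this isn't post-training quantisation), but the latent weights, gradients and optimiser states all stay in full precision, so the training cost stays close to an unquantised run. Is that what's going on here?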