More training details?
Hi,
Thanks so much for the model and the very well-written, approachable technical report – it's great to see continuing work on BitNet!
I wonder if you could share any more details about the training, especially regarding cost / resource utilisation and how it compares to un-quantised training runs? Naively, I would expect meaningful efficiency gains at training time as well, but it would be great to get some concrete numbers.
Thanks in advance,
Ed
I don't think there is any benefit in training efficiency, since it still uses full precision during the training stage.
Really??
I thought the whole point of BitNet was that the ternary “quantisation” is applied during training, rather than afterwards. Otherwise, why even bother training a new model in the first place?
The authors could have just applied post-training ternary quantisation to an existing model like Phi-4 instead. That would probably have been quicker and cheaper, it would have made clear that this is “just” post-training quantisation, and it would have provided a more direct baseline to compare against.
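For concreteness, here's my mental model of how BitNet-style training works – a minimal PyTorch sketch of quantisation-aware training with a straight-through estimator, not the authors' actual code (the layer name, scaling rule and hyperparameters here are my own assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinear(nn.Linear):
    """Hypothetical QAT layer: ternary weights in the forward pass,
    full-precision latent weights underneath."""
    def forward(self, x):
        w = self.weight
        # Absmean scaling, then round weights to {-1, 0, +1}.
        scale = w.abs().mean().clamp(min=1e-5)
        w_q = (w / scale).round().clamp(-1, 1) * scale
        # Straight-through estimator: the forward pass sees the ternary
        # weights, the backward pass flows through the full-precision ones.
        w_ste = w + (w_q - w).detach()
        return F.linear(x, w_ste, self.bias)

layer = TernaryLinear(64, 64)
opt = torch.optim.AdamW(layer.parameters(), lr=1e-3)

x = torch.randn(8, 64)
loss = layer(x).pow(2).mean()
loss.backward()  # gradients land on the full-precision latent weights
opt.step()       # the optimiser updates full-precision weights and states
```

If that's roughly right, both statements would be compatible: the ternary quantisation does happen during training (so this isn't post-training quantisation), but the latent weights, gradients and optimiser states all stay in full precision, so the training cost stays close to an unquantised run. Is that what's going on here?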