Is the cache working?

#2
by FremyCompany - opened

I have the impression that the cache implementation is not working right now.

When I try to run the model, I see the following one-time warning:

NemotronH requires an initialized `NemotronHHybridDynamicCache` to return a cache. None was provided, so no cache will be returned.

Changing the logger call from `warning_once` to `warning` shows that this message repeats for every generated token, so it seems possible that no cache is ever used. (I believe this is indeed the case, but I could be wrong here.)
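
For context, this is roughly how I run the model. It's a minimal sketch: the checkpoint id is a placeholder, and `trust_remote_code` may or may not be needed depending on your transformers version.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id: substitute the actual Nemotron-H checkpoint being tested.
model_id = "nvidia/<nemotron-h-checkpoint>"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)

# With the default use_cache=True, this call is where the
# "NemotronH requires an initialized `NemotronHHybridDynamicCache` ..." warning shows up.
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```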

To accurately represent the speed capabilities of this model, I think it would be good to make sure the cache is functioning correctly in the transformers implementation.

See #3 for an early investigation.
If my test is right, enabling the cache should yield a ~6x perf increase for batch_size 1 in Q4 BF16 on an RTX 5090.
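
For reference, the kind of comparison I have in mind is a simple timing of generation with and without `use_cache`, continuing from the snippet above. This is a rough sketch, not a proper benchmark; the prompt and token count are arbitrary.

```python
import time
import torch

def time_generate(model, inputs, use_cache, max_new_tokens=128):
    """Time one greedy generation pass, with or without the cache."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new_tokens,
                   do_sample=False, use_cache=use_cache)
    torch.cuda.synchronize()
    return time.perf_counter() - start

# If the cache is actually used, the use_cache=True run should be much faster;
# if both timings are similar, the cache is effectively not being used.
print("use_cache=True :", time_generate(model, inputs, use_cache=True))
print("use_cache=False:", time_generate(model, inputs, use_cache=False))
```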
