Is the cache working?
#2 opened by FremyCompany
I have the impression that the cache implementation is not working right now.
When I try to run the model, I see the following one-time warning:
NemotronH requires an initialized `NemotronHHybridDynamicCache` to return a cache. None was provided, so no cache will be returned.
Changing the logger call from `warning_once` to `warning` shows that this message repeats for every generated token, so it is possible that no cache is ever used. (I think this is indeed the case, but I could be wrong here.)
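To illustrate why that switch matters, here is a minimal standalone sketch. It does not touch transformers at all: `warn_once` emulates a once-only warning with `functools.lru_cache` (an assumption about how such a helper can be built), and `fake_forward` and `count_warnings` are hypothetical names standing in for a forward pass that warns when no cache is passed.

```python
import functools
import logging


class CountingHandler(logging.Handler):
    """Collects emitted messages so they can be counted."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("cache_check_demo")
logger.setLevel(logging.WARNING)
logger.propagate = False  # keep the demo silent on the console
handler = CountingHandler()
logger.addHandler(handler)


@functools.lru_cache(maxsize=None)
def warn_once(message):
    """Emulated once-only warning: repeated identical messages are dropped."""
    logger.warning(message)


def fake_forward(cache, once):
    """Hypothetical stand-in for a forward pass that warns when cache is None."""
    if cache is None:
        message = "no cache provided, so no cache will be returned"
        if once:
            warn_once(message)
        else:
            logger.warning(message)


def count_warnings(steps, once):
    """Run `steps` decoding steps without a cache and count emitted warnings."""
    handler.messages.clear()
    warn_once.cache_clear()
    for _ in range(steps):
        fake_forward(cache=None, once=once)
    return len(handler.messages)


once_count = count_warnings(steps=8, once=True)
every_count = count_warnings(steps=8, once=False)
print(once_count, every_count)  # 1 8
```

With the once-only variant the message appears a single time no matter how many steps hit the cache-less path, so only the per-call variant reveals that the path is taken on every generated token.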
To represent the speed capabilities of this model accurately, I think it would be good to make sure the cache works correctly in the transformers implementation.
See #3 for an early investigation.
If my test is right, enabling the cache should yield a ~6x perf increase for batch_size 1 in Q4 BF16 on an RTX 5090.
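As a rough intuition for why a missing cache is expensive, the sketch below counts how many token positions have to be fed through a model during generation. It is a simplified counting model, not a measurement: `positions_processed` is a hypothetical helper, it assumes that without a cache the whole sequence is reprocessed at every step, and it does not reproduce or predict the ~6x figure above.

```python
def positions_processed(prompt_len, new_tokens, use_cache):
    """Count token positions fed to the model while generating `new_tokens`.

    Assumption: with a cache, the prompt is processed once and each later
    step feeds only the newest token; without a cache, every step feeds
    the full sequence generated so far.
    """
    if new_tokens <= 0:
        return 0
    if use_cache:
        return prompt_len + (new_tokens - 1)
    return sum(prompt_len + step for step in range(new_tokens))


with_cache = positions_processed(prompt_len=16, new_tokens=4, use_cache=True)
without_cache = positions_processed(prompt_len=16, new_tokens=4, use_cache=False)
print(with_cache, without_cache)  # 19 70
```

The gap widens as the prompt and the number of generated tokens grow, although the actual speedup depends on the hardware, the batch size, and how much of each step is bound by memory rather than compute.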