Is the cache working?

#2
by FremyCompany - opened

I have the impression that the cache implementation is not working right now.

When I try to run the model, I see the following one-time warning:

NemotronH requires an initialized `NemotronHHybridDynamicCache` to return a cache. None was provided, so no cache will be returned.

Changing the logger call from `warning_once` to `warning` shows that this message repeats for every generated token, so it seems possible that no cache is ever used. (I believe this is indeed the case, but I could be wrong here.)
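
For context, this is roughly how I run the model. It's a minimal sketch: the checkpoint id is a placeholder, and `trust_remote_code` may or may not be needed depending on your transformers version.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id: substitute the actual Nemotron-H checkpoint being tested.
model_id = "nvidia/<nemotron-h-checkpoint>"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)

# With the default use_cache=True, this call is where the
# "NemotronH requires an initialized `NemotronHHybridDynamicCache` ..." warning shows up.
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```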

To accurately represent the speed capabilities of this model, I think it would be good to make sure the cache is functioning correctly in the transformers implementation.

See #3 for an early investigation.
If my test is right, enabling the cache should yield a ~6x perf increase for batch_size 1 in Q4 BF16 on an RTX 5090.
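
For reference, the kind of comparison I have in mind is a simple timing of generation with and without `use_cache`, continuing from the snippet above. This is a rough sketch, not a proper benchmark; the prompt and token count are arbitrary.

```python
import time
import torch

def time_generate(model, inputs, use_cache, max_new_tokens=128):
    """Time one greedy generation pass, with or without the cache."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new_tokens,
                   do_sample=False, use_cache=use_cache)
    torch.cuda.synchronize()
    return time.perf_counter() - start

# If the cache is actually used, the use_cache=True run should be much faster;
# if both timings are similar, the cache is effectively not being used.
print("use_cache=True :", time_generate(model, inputs, use_cache=True))
print("use_cache=False:", time_generate(model, inputs, use_cache=False))
```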
