Why clamp qkv_states? Is it common?

#44

by jay68 - opened Apr 8, 2024

In line 318 of modeling_dbrx.py, with the "clip_qkv": 8 configuration, DBRX clamps the values of qkv_states to the range [-8, 8].
Is this setting applied only at inference, or during both training and inference?
Why does DBRX do this? Is there any prior work or paper that motivates it?
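
For reference, here is my understanding of the operation as a minimal plain-Python sketch. The helper name `clamp_qkv` and the sample values are my own and are not taken from modeling_dbrx.py; I assume the real code applies the same element-wise clamp to a tensor rather than a list.

```python
def clamp_qkv(values, clip_qkv=8.0):
    """Clamp each element to the range [-clip_qkv, clip_qkv].

    Hypothetical helper for illustration only; the model itself
    operates on tensors, not Python lists.
    """
    return [max(-clip_qkv, min(clip_qkv, v)) for v in values]


# Made-up sample values standing in for a slice of qkv_states.
sample = [-12.5, -8.0, 0.3, 7.9, 42.0]
print(clamp_qkv(sample))  # [-8.0, -8.0, 0.3, 7.9, 8.0]
```

So values already inside the range pass through unchanged, and only the outliers are saturated at -8 or 8.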
