In line 318 of modeling_dbrx.py, together with the "clip_qkv": 8 configuration, DBRX clamps the values of qkv_states to the range [-8, 8]. Is this setting applied only at inference, or during both training and inference? Why does DBRX do this, and is there any published work that this clamping is based on?
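
To make the question concrete, here is a minimal plain-Python sketch of the behavior I am asking about. It is not copied from modeling_dbrx.py: the function name `clip_qkv_values` and the sample values are hypothetical, and a flat list stands in for the qkv_states tensor. In PyTorch the same effect would come from a clamp with min=-clip_qkv and max=clip_qkv.

```python
# Hypothetical stand-in for the clamping step; not the actual DBRX code.
def clip_qkv_values(qkv_states, clip_qkv=None):
    """Clamp every element to [-clip_qkv, clip_qkv].

    Returns an unchanged copy when clip_qkv is None (assumed to mean "no clipping").
    """
    if clip_qkv is None:
        return list(qkv_states)
    return [max(-clip_qkv, min(clip_qkv, x)) for x in qkv_states]


# Made-up values: anything outside [-8, 8] is pulled back to the boundary,
# values already inside the range pass through untouched.
sample = [-12.5, -8.0, 0.3, 7.9, 20.0]
clipped = clip_qkv_values(sample, clip_qkv=8)
print(clipped)
```

Note that this sketch has no training/inference switch at all; whether the real implementation is gated on one is exactly what I would like to know.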