I've got a question
So you're using open-source, fine-tuned, quantized LLMs in your Space. If I were a user, the first question I'd ask is why the models take so long (~2 min per prompt, except TinyLlama). But I'm not just a user. The reason I'm in your Space typing this comment is that I'm going to do the same thing you did: deploy fine-tuned, quantized models (particularly Mistral-7B). But even quantized models, running on the free tier's default CPU, aren't going to predict next tokens quickly.
So, is it possible to fine-tune quantized models using LoRA? LoRA-style fine-tuning is pretty fast, but can LoRA work with quantized models? Quantized models chunk each weight matrix into several parts; e.g., in a GPTQ-quantized model, each weight matrix is stored as scales, zeros, g_idx, and qweight tensors to keep it compressed.
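From what I've read, this is essentially what QLoRA does: the quantized base weights stay frozen, and small higher-precision LoRA adapter matrices are trained on top of them. Here's a minimal sketch using the transformers + peft + bitsandbytes stack (the model id, target modules, and hyperparameters are my own illustrative picks, not anything from your Space):

```python
# Minimal QLoRA-style sketch. Assumes transformers, peft, and bitsandbytes
# are installed and a GPU is available; model id and hyperparameters below
# are illustrative choices, not taken from the Space.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mistral-7B-v0.1"  # illustrative base model

# Load the base model with frozen 4-bit quantized weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the forward pass
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Enables gradient checkpointing and casts norms/embeddings for stable training.
model = prepare_model_for_kbit_training(model)

# The LoRA adapters are small bf16 matrices added on top of the frozen
# quantized layers; only these are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total params
```

The packed tensors you mention (qweight, scales, zeros, g_idx in GPTQ) are never updated during training; the forward pass dequantizes them, and gradients flow only into the LoRA matrices. As far as I know, peft can attach adapters to GPTQ-loaded models in the same way, so the compressed storage isn't a blocker.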