genai_config.json

#1
by davesoma - opened

Hello,
Could you share the genai_config.json?

Instead of sharing the file directly, I recommend building your own version with the onnxruntime_genai model builder:

python -m onnxruntime_genai.models.builder -m Qwen/Qwen2-0.5B-Instruct -o .\gen_ai -p int4 -e web -c .\cache_dir
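
Once built, the output folder can be loaded directly with the onnxruntime_genai API, and it will contain the generated genai_config.json. A minimal sketch, assuming a recent version of the package (the prompt and search options are just examples):

import onnxruntime_genai as og

# "gen_ai" is the builder's output folder from the command above;
# it holds the ONNX model plus the generated genai_config.json.
model = og.Model("gen_ai")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=128)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Hello, who are you?"))

# Greedy decode until the model emits an end-of-sequence token.
while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))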

In fact, I created the quantized ONNX version of Qwen2 hosted here differently: I first converted it to ORT format, saved it back as ONNX, and then dynamically quantized it. So the result of the command above will likely differ from mine. You can also adapt the command to your needs, and the output will come with the genai_config.json.
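
For reference, the dynamic quantization step can be done with ONNX Runtime's quantization tool. A minimal sketch; the file paths are placeholders:

from onnxruntime.quantization import quantize_dynamic, QuantType

# QUInt8 stores weights as uint8 while activations stay in float;
# this is post-training, so no calibration data is needed.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_quant.onnx",
    weight_type=QuantType.QUInt8,
)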

Thanks. I'm working on a mobile app, and unfortunately I'm stuck trying to use the 800 MB Qwen2 model in it.
Android keeps throwing memory errors once I hit 573 MB.
I've tried all the usual tricks - largeHeap, chunking, ONNX Runtime, you name it.
The int4 Qwen is still too large; any suggestions?

Okay, I understand your problem. Here are a few additional strategies you could explore:

  • Model compression: Use tools like TensorFlow Lite's compression techniques or ONNX's model optimizer to compress the model further. Techniques such as weight clustering and post-training quantization can sometimes shave off more memory. I also tried the ONNX optimizer, but I don't think it supports the Qwen2 model yet. I didn't try TensorFlow compression; it may help you.

  • Use a lite model: Consider a smaller pre-trained model better suited to mobile devices. Models like MobileBERT or TinyBERT are optimized for mobile environments and might strike a good balance between performance and size. They don't always support all languages, but you may find a smaller model that fits your needs. I tried running Qwen on the web, found it too slow, and plan to switch to a lighter model myself. To check whether a candidate fits your memory budget up front, see the sketch after this list.
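
Since you're hitting a hard memory ceiling, it may help to sanity-check a candidate model's on-disk footprint before wiring it into the app. A rough sketch; the directory path is a placeholder, and file size is only a lower bound on load-time memory:

import os

def model_size_mb(model_dir):
    # Sum the ONNX graph plus any external weight files; loading the
    # model will need at least this much RAM.
    total = 0
    for root, _, files in os.walk(model_dir):
        for name in files:
            if name.endswith((".onnx", ".onnx.data", ".bin")):
                total += os.path.getsize(os.path.join(root, name))
    return total / (1024 ** 2)

print(model_size_mb("gen_ai"))  # e.g. the builder output folder from above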

I don't have much hands-on experience running models on mobile, so I hope this helps. Let me know if you manage to solve it :)
