
How to reproduce int8 quantized model for QNN backend? #213

Open
Maphist0 opened this issue Dec 14, 2024 · 1 comment
@Maphist0

Thank you for contributing such amazing work! I'm impressed by the acceleration on the Qualcomm NPU.

I guess the QNN backend in mllm only supports int8 models, right?

If so, it seems that the quantizer (code here) does not support the int8 format. How can I quantize a model to int8? More specifically, how can I reproduce mllm's Qwen 1.5 1.8B int8 model listed in the README?

Thanks!

@oreomaker
Collaborator

oreomaker commented Dec 14, 2024

Thanks for your attention. To get the quantized int8 model for QNN prefilling, you first need to obtain the profiled PyTorch model with the quantization scales and outlier weights, using tools/convertor/profiling_activation. Then you need to convert the weights to the mllm format using src/quantizer/main.cpp.
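Roughly, the profiling step boils down to collecting activation statistics on calibration data and deriving symmetric int8 scales from them. The snippet below is only an illustrative sketch of that idea, not the actual script in tools/convertor/profiling_activation; the layer selection, scale formula, and outlier handling in the real tool may differ.

```python
import torch
import torch.nn as nn

def profile_int8_scales(model: nn.Module, calib_batches):
    """Collect per-layer activation absmax with forward hooks and derive
    symmetric int8 scales (scale = absmax / 127). Illustrative only."""
    absmax = {}

    def make_hook(name):
        def hook(_module, inputs, _output):
            cur = inputs[0].detach().abs().max().item()
            absmax[name] = max(absmax.get(name, 0.0), cur)
        return hook

    # Hook every Linear layer; the real profiler may target a different set.
    handles = [
        m.register_forward_hook(make_hook(n))
        for n, m in model.named_modules()
        if isinstance(m, nn.Linear)
    ]

    model.eval()
    with torch.no_grad():
        for batch in calib_batches:
            model(batch)

    for h in handles:
        h.remove()

    # Symmetric int8: q = round(x / scale), clamped to [-128, 127].
    return {name: amax / 127.0 for name, amax in absmax.items()}
```

The resulting scales (plus any outlier weights kept in higher precision) are what the converter then packs into the mllm weight file.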

The I8 quantization option used by the QNN models is not yet integrated into src/quantizer/main.cpp. This is an oversight, sorry for that. 😓 We will add it as soon as possible.
