
How to reproduce int8 quantized model for QNN backend? #213

Open
Maphist0 opened this issue Dec 14, 2024 · 1 comment
@Maphist0

Thank you for contributing such amazing work! I'm impressed by the acceleration on the Qualcomm NPU.

I guess the QNN backend in mllm only supports int8 models, right?

If so, it seems that the quantizer (code here) does not support the int8 format. How can I quantize a model to int8? More specifically, how can I reproduce mllm's Qwen 1.5 1.8B int8 model listed in the README?

Thanks!

@oreomaker
Collaborator

oreomaker commented Dec 14, 2024

Thanks for your attention. To get the quantized int8 model for QNN prefilling, you first need to obtain the profiled PyTorch model with the quantization scales and outlier weights, using tools/convertor/profiling_activation. Then you need to convert the weights to the mllm format using src/quantizer/main.cpp.
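Roughly, the profiling step boils down to collecting activation statistics on calibration data and deriving symmetric int8 scales from them. The snippet below is only an illustrative sketch of that idea, not the actual script in tools/convertor/profiling_activation; the layer selection, scale formula, and outlier handling in the real tool may differ.

```python
import torch
import torch.nn as nn

def profile_int8_scales(model: nn.Module, calib_batches):
    """Collect per-layer activation absmax with forward hooks and derive
    symmetric int8 scales (scale = absmax / 127). Illustrative only."""
    absmax = {}

    def make_hook(name):
        def hook(_module, inputs, _output):
            cur = inputs[0].detach().abs().max().item()
            absmax[name] = max(absmax.get(name, 0.0), cur)
        return hook

    # Hook every Linear layer; the real profiler may target a different set.
    handles = [
        m.register_forward_hook(make_hook(n))
        for n, m in model.named_modules()
        if isinstance(m, nn.Linear)
    ]

    model.eval()
    with torch.no_grad():
        for batch in calib_batches:
            model(batch)

    for h in handles:
        h.remove()

    # Symmetric int8: q = round(x / scale), clamped to [-128, 127].
    return {name: amax / 127.0 for name, amax in absmax.items()}
```

The resulting scales (plus any outlier weights kept in higher precision) are what the converter then packs into the mllm weight file.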

The I8 quantization option used by the QNN models is not yet integrated into src/quantizer/main.cpp. This is an oversight, sorry for that. 😓 We will add it as soon as possible.
