Support pytorch engine kv int4/int8 quantization #2438
Conversation
Conflicts:
- lmdeploy/pytorch/config.py
- lmdeploy/pytorch/engine/cache_engine.py
- lmdeploy/pytorch/kernels/cuda/fill_kv_cache.py
- lmdeploy/pytorch/kernels/cuda/pagedattention.py
- lmdeploy/pytorch/models/internlm2.py
- lmdeploy/pytorch/models/llama.py
Benchmark
Tested gsm8k accuracy.
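For context, a minimal sketch of how the quantized kv cache might be enabled through the PyTorch engine, assuming `PytorchEngineConfig` exposes a `quant_policy` field that follows the TurboMind convention (4 for kv int4, 8 for kv int8); the model path is only an example, not one from this PR's benchmark.

```python
from lmdeploy import pipeline, PytorchEngineConfig

# Assumed usage: quant_policy=8 selects kv int8, quant_policy=4 would select
# kv int4 (mirroring the TurboMind engine's convention).
backend_config = PytorchEngineConfig(quant_policy=8)

# Example model path; substitute any model supported by the pytorch backend.
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
print(pipe(['Hello, introduce yourself.']))
```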
Please resolve the conflicts.
Conflicts:
- lmdeploy/pytorch/config.py
- lmdeploy/pytorch/kernels/cuda/pagedattention.py
Since kv int4 requires triton>=2.3.0, it would be good to add a check in the engine. https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/pytorch/check_env/__init__.py
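A minimal sketch of what such a version guard could look like (hypothetical helper name; not the actual code in lmdeploy/pytorch/check_env/__init__.py):

```python
from packaging import version


def _check_triton_for_kv_int4(min_version: str = '2.3.0'):
    """Raise if the installed triton is too old for kv int4 kernels."""
    try:
        import triton
    except ImportError:
        raise RuntimeError(
            'kv int4 quantization requires triton, but it is not installed.')
    if version.parse(triton.__version__) < version.parse(min_version):
        raise RuntimeError(
            f'kv int4 quantization requires triton>={min_version}, '
            f'but found triton=={triton.__version__}.')
```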
Yes, but it is the same name as
@AllentDan Can you update the supported models doc? I will add test cases according to it. https://github.com/InternLM/lmdeploy/blob/main/docs/en/supported_models/supported_models.md
I did not test all the models since some models may fail when
All models supported by the pytorch backend were tested with 4-bit kv cache; the following errors were found.
@AllentDan qwen2-vl-2b and 7b passed with kv int8.
Only the internlm and llama models are updated here. After #2104, all the models should be updated.