Refactor torch inference engine #871
Conversation
* 4D input, model.eval and llama config
* use auto dtype
* current best redist w/o dtensor; host mem in queue; less rewrite, less code; update model weight
* add baichuan WIP
* support baichuan
* support baichuan-13b
* fix
* add chat template
* lint
* comments
* fix
* cherry-pick "Fix meta tensor error" commits
* fix smooth quant

Co-authored-by: pppppM <[email protected]>
In this PR, we focus on ensuring functional correctness:
Alternatively, you can manually convert original 16-bit weights into 8-bit by referring to the content under the ["8bit Weight Quantization"](#8bit-weight-quantization) section. Save them in the internlm-chat-7b-w8 directory, using the command below:

```shell
python lmdeploy/lite/apis/smooth_quant.py internlm/internlm-7b ./internlm-chat-7b-w8
```
Following the doc, I got:

    from .calibrate import calibrate
    ImportError: attempted relative import with no known parent package
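This is the usual symptom of running a file that lives inside a package directly as a script: its relative imports have no parent package to resolve against. A hedged workaround, assuming `lmdeploy` is installed or importable from the working directory, is to invoke it as a module instead:

```shell
# Run the script as a module so that relative imports such as
# `from .calibrate import calibrate` resolve against the lmdeploy package.
python -m lmdeploy.lite.apis.smooth_quant internlm/internlm-7b ./internlm-chat-7b-w8
```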
docs/en/w8a8.md
Afterwards, use the following command to interact with the model via the terminal:

```shell
python lmdeploy/pytorch_poc/chat.py ./internlm-chat-7b-w8 internlm-chat
```
`pytorch_poc` -> `pytorch`
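Applying the suggested rename, the command in the doc would become (assuming the module now lives at `lmdeploy/pytorch/chat.py`):

```shell
python lmdeploy/pytorch/chat.py ./internlm-chat-7b-w8 internlm-chat
```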
And another relative import error for the command.
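The same module-invocation workaround as above should apply here; a sketch, assuming the package layout at the time of this PR:

```shell
# Invoke chat as a module to avoid the relative import error.
python -m lmdeploy.pytorch_poc.chat ./internlm-chat-7b-w8 internlm-chat
```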
After converting the 20b model to w8, running it directly with tp 2 crashed.
Without quantization, does TP work correctly?

It works correctly.
I added a PR that brings TP support to W8A8; please help review it. If Weight, Bias, and Scale were all registered as Parameters, the register-parameter logic would probably need to be changed uniformly: Weight has to be of int8 dtype, so requires_grad must be set to False when registering it as a Parameter, but accelerate's init_on_device overrides the Parameter registration mechanism, which causes ...
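A minimal sketch of the buffer-based approach the comments describe, with hypothetical names and signatures (the actual QLinear in this PR may differ): registering the int8 weight and its scale via `register_buffer` keeps them in the state_dict and lets them follow `.to()`/`.cuda()`, but they never pass through `register_parameter`, so hooks that override it (such as `init_on_device`) are not involved.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QLinear(nn.Module):
    """Hypothetical int8 linear layer sketch; names are illustrative."""

    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        # Buffers bypass register_parameter entirely, so integer dtypes
        # and the requires_grad restriction never become an issue.
        self.register_buffer(
            'weight', torch.empty(out_features, in_features, dtype=torch.int8))
        self.register_buffer(
            'scale', torch.ones(out_features, 1, dtype=torch.float16))
        if bias:
            self.register_buffer(
                'bias', torch.zeros(out_features, dtype=torch.float16))
        else:
            self.bias = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Naive dequantize-then-matmul for clarity; a real implementation
        # would dispatch to a fused int8 kernel.
        w = self.weight.to(x.dtype) * self.scale.to(x.dtype)
        b = self.bias.to(x.dtype) if self.bias is not None else None
        return F.linear(x, w, b)
```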
* fix smooth quant save_pretrained
* support w8a8 tp
* change weight and bias in QLinear back to buffer
* remove debug codes and add comments
LGTM
LGTM. But the scripts in the documents should be updated, e.g. `pytorch_poc` -> `pytorch`.