Starting from v0.1.0, Turbomind adds the ability to pre-process the model parameters on-the-fly while loading them from huggingface style models.
Currently, Turbomind support loading three types of model:
- A lmdeploy-quantized model hosted on huggingface.co, such as llama2-70b-4bit, internlm-chat-20b-4bit, etc.
- Other LM models on huggingface.co like Qwen/Qwen-7B-Chat
- A model converted by
lmdeploy convert
, legacy format
For models quantized by lmdeploy.lite
such as llama2-70b-4bit, internlm-chat-20b-4bit, etc.
repo_id=internlm/internlm-chat-20b-4bit
model_name=internlm-chat-20b
# or
# repo_id=/path/to/downloaded_model
# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name
# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name
# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
For other LM models such as Qwen/Qwen-7B-Chat or baichuan-inc/Baichuan2-7B-Chat. LMDeploy supported models can be viewed through lmdeploy list
.
repo_id=Qwen/Qwen-7B-Chat
model_name=qwen-7b
# or
# repo_id=/path/to/Qwen-7B-Chat/local_path
# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name
# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name
# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
The usage is like previous
# Convert a model
lmdeploy convert /path/to/model ./workspace --model-name MODEL_NAME
# Inference by TurboMind
lmdeploy chat turbomind ./workspace
# Serving with gradio
lmdeploy serve gradio ./workspace
# Serving with Restful API
lmdeploy serve api_server ./workspace --instance_num 32 --tp 1