Fix turbomind end session bug. Add huggingface demo document #1017

Merged · 4 commits · Jan 25, 2024 · diff shown from 2 commits
34 changes: 34 additions & 0 deletions docs/en/serving/gradio.md
@@ -0,0 +1,34 @@
# Steps to create a Hugging Face online demo

## Create a Space

First, register for a Hugging Face account. After successful registration, click on your profile picture in the upper right corner and select “New Space” to create one. Follow the Hugging Face guide to choose the necessary configurations, and you will have a blank demo space ready.

## A demo for LMDeploy

Replace the content of `app.py` in your Space with the following code (using `internlm/internlm2-chat-7b` as the example model):

```python
from lmdeploy.serve.gradio.turbomind_coupled import run_local
from lmdeploy.messages import TurbomindEngineConfig

backend_config = TurbomindEngineConfig(max_batch_size=1, cache_max_entry_count=0.05)
model_path = 'internlm/internlm2-chat-7b'
run_local(model_path, backend_config=backend_config, huggingface_demo=True)
```

Create a `requirements.txt` file with the following content:

```
lmdeploy
```

## FAQs

- ZeroGPU compatibility issue. ZeroGPU is better suited to PyTorch-style inference than to TurboMind; you can switch to the PyTorch backend (a sketch follows this list) or enable standard GPUs.
- Gradio version issue: versions above 4.0.0 are not currently supported. You can pin the version in `app.py`, for example:
```python
import os
os.system("pip uninstall -y gradio")
os.system("pip install gradio==3.43.0")
```
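
If ZeroGPU forces a switch to the PyTorch backend, `app.py` could look roughly like the sketch below. This is a minimal sketch assuming that `PytorchEngineConfig` from `lmdeploy.messages` is accepted by `run_local` and that the engine is selected from the config type; argument support varies across LMDeploy versions, so check the release you have installed.

```python
from lmdeploy.serve.gradio.turbomind_coupled import run_local
from lmdeploy.messages import PytorchEngineConfig

# Assumption: passing a PytorchEngineConfig selects the PyTorch engine.
backend_config = PytorchEngineConfig(max_batch_size=1)
model_path = 'internlm/internlm2-chat-7b'
run_local(model_path, backend_config=backend_config, huggingface_demo=True)
```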
35 changes: 35 additions & 0 deletions docs/zh_cn/serving/gradio.md
@@ -0,0 +1,35 @@
# Create a Hugging Face online demo from LMDeploy

## Create a Space

First, register a Hugging Face account. After registering, click the avatar in the upper-right corner and select New Space to create one.
Follow the Hugging Face guide to choose the required configuration, and you will end up with a blank demo Space.

## A demo using LMDeploy

Taking the `internlm/internlm2-chat-7b` model as an example, fill in the Space's `app.py` with:

```python
from lmdeploy.serve.gradio.turbomind_coupled import run_local
from lmdeploy.messages import TurbomindEngineConfig

backend_config = TurbomindEngineConfig(max_batch_size=1, cache_max_entry_count=0.05)
model_path = 'internlm/internlm2-chat-7b'
run_local(model_path, backend_config=backend_config, huggingface_demo=True)
```

Create a `requirements.txt` text file with the following package:

```
lmdeploy
```

## FAQs

- ZeroGPU compatibility issue. ZeroGPU is better suited to PyTorch-style inference than to TurboMind; you can switch to the PyTorch backend or enable standard GPUs.
- Gradio version issue: versions above 4.0.0 are not currently supported. You can pin the version in `app.py`, for example:
```python
import os
os.system("pip uninstall -y gradio")
os.system("pip install gradio==3.43.0")
```
25 changes: 16 additions & 9 deletions lmdeploy/serve/gradio/turbomind_coupled.py
@@ -124,6 +124,7 @@ def run_local(model_path: str,
server_name: str = 'localhost',
server_port: int = 6006,
tp: int = 1,
huggingface_demo: bool = False,
**kwargs):
"""chat with AI assistant through web ui.

@@ -153,6 +154,8 @@ def run_local(model_path: str,
server_name (str): the ip address of gradio server
server_port (int): the port of gradio server
tp (int): tensor parallel for Turbomind
huggingface_demo (bool): whether to run as a Hugging Face Space demo. Running
    on a Hugging Face Space requires no host name or port to be specified.
"""
InterFace.async_engine = AsyncEngine(
model_path=model_path,
@@ -220,15 +223,19 @@ def init():

demo.load(init, inputs=None, outputs=[state_session_id])

print(f'server is gonna mount on: http://{server_name}:{server_port}')
demo.queue(concurrency_count=InterFace.async_engine.instance_num,
max_size=100,
api_open=True).launch(
max_threads=10,
share=True,
server_port=server_port,
server_name=server_name,
)
if huggingface_demo is True:
demo.queue(concurrency_count=InterFace.async_engine.instance_num,
max_size=100).launch()
else:
print(f'server is gonna mount on: http://{server_name}:{server_port}')
demo.queue(concurrency_count=InterFace.async_engine.instance_num,
max_size=100,
api_open=True).launch(
max_threads=10,
share=True,
server_port=server_port,
server_name=server_name,
)


if __name__ == '__main__':
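
For context, a minimal sketch of how the new flag changes the launch path; the `on_space` switch is hypothetical and purely for illustration, and the arguments follow the `run_local` signature shown in the diff above.

```python
from lmdeploy.serve.gradio.turbomind_coupled import run_local
from lmdeploy.messages import TurbomindEngineConfig

backend_config = TurbomindEngineConfig(max_batch_size=1, cache_max_entry_count=0.05)
on_space = True  # hypothetical switch: True when deployed to a Hugging Face Space

if on_space:
    # Hugging Face Space: launch() picks the host/port the Space expects.
    run_local('internlm/internlm2-chat-7b',
              backend_config=backend_config,
              huggingface_demo=True)
else:
    # Self-hosted: bind to an explicit address and port (the pre-existing path).
    run_local('internlm/internlm2-chat-7b',
              backend_config=backend_config,
              server_name='0.0.0.0',
              server_port=6006)
```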
2 changes: 1 addition & 1 deletion lmdeploy/turbomind/turbomind.py
@@ -522,7 +522,7 @@ def _update_generation_config(self, config: EngineGenerationConfig,
if k in config.__dict__:
config.__dict__[k] = v
deprecated_kwargs.append(k)
if kwargs.get('request_output_len'):
if 'request_output_len' in kwargs:
config.max_new_tokens = kwargs['request_output_len']
deprecated_kwargs.append('request_output_len')
for k in deprecated_kwargs:
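
The one-line change above replaces a truthiness check with a membership test. The PR title suggests that ending a session passes `request_output_len=0`, which the old check silently dropped because `0` is falsy. A small illustration, where the dict literal stands in for the real `kwargs`:

```python
kwargs = {'request_output_len': 0}  # e.g. an end-session request asking for no new tokens

if kwargs.get('request_output_len'):   # old check: 0 is falsy, so the value is ignored
    print('old check fires')

if 'request_output_len' in kwargs:     # new check: fires whenever the key is present
    print('new check fires')           # only this line prints
```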