
Commit

update doc and fix initialization
sallyjunjun committed Dec 25, 2023
1 parent 12fe0f8 commit 26040fb
Showing 4 changed files with 72 additions and 69 deletions.
27 changes: 12 additions & 15 deletions docs/en/qos.md
@@ -1,14 +1,14 @@
## Lmdeploy-QoS Introduce and Usage
## LMDeploy-QoS Introduction and Usage

### Background

Inference frameworks have emerged alongside LLMs and AGI over the past period of time. Many of them provide scalable, high-performance services for serving online workloads with language models. The workloads they serve usually come from multiple user groups, and the workload pattern changes quickly within a short time. Because many inference frameworks struggle to meet the requirements of these multi-tenant traffic patterns and do not shape user behavior well, we think it is valuable and necessary to consider this systematically in the LLM inference framework world.

### User Categorizations for Multi-tenancy Handling

Lmdeploy-QoS comes along with Lmdeploy to provide a series of multi-tenancy functionality. It requires user to tag their inference requests with proper user identifications(user_id in config or codebase). It works based on a dictionary like configuration as the multi-tenancy policy. In this configuration, users are mapped to various classes called "user groups" and configured with a ratio value. Our multi-tenancy strategy reads the configuration and schedules user inference requests based on their class priority and the delta between predefined ratio and the realtime allocation ratio. With thorough tests, our Lmdeploy-QoS greatly improves the Llm serving reliability and the GPU resources utilizations for real world large language model inference workload.
LMDeploy-QoS comes along with LMDeploy to provide a set of multi-tenancy features. It requires users to tag their inference requests with a proper user identification (user_id in config or codebase). It works based on a dictionary-like configuration that serves as the multi-tenancy policy. In this configuration, users are mapped to various classes called "user groups", and each group is configured with a ratio value. Our multi-tenancy strategy reads the configuration and schedules user inference requests based on their class priority and the delta between the predefined ratio and the real-time allocation ratio. With thorough testing, our LMDeploy-QoS greatly improves LLM serving reliability and GPU resource utilization for real-world large language model inference workloads.
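
To make the scheduling idea concrete, the sketch below illustrates ratio-delta scheduling in plain Python. It is not LMDeploy's actual scheduler: the group names match the list that follows, the ratio values are made up, and priority between groups is omitted for brevity; it only shows how the next request could be taken from the group whose real-time share lags its configured ratio the most.

```python
# Illustrative sketch of ratio-delta scheduling (not LMDeploy's implementation).
# The group whose observed share of served requests falls furthest below its
# configured ratio is served next; priority between groups is ignored here.
from collections import deque

configured_ratio = {'Platinum': 0.05, 'Gold': 0.5, 'Silver': 0.35, 'Bronze': 0.1}  # assumed values
queues = {group: deque() for group in configured_ratio}
served = {group: 0 for group in configured_ratio}


def pick_next_group():
    total = sum(served.values()) or 1
    waiting = [g for g, q in queues.items() if q]
    if not waiting:
        return None
    # Largest delta = most under-served group relative to its predefined ratio.
    return max(waiting, key=lambda g: configured_ratio[g] - served[g] / total)


def schedule_one():
    group = pick_next_group()
    if group is None:
        return None
    request = queues[group].popleft()
    served[group] += 1
    return request
```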

Lmdeploy categorized users into 4 groups:
LMDeploy categorizes users into 4 groups:

- Platinum
- Gold
@@ -25,9 +25,9 @@ Based on our particular experience in serving Llm services, we can map below des

- Bronze: heavy users that pay very little to LLM providers.

The purpose of introducing the above user group categorizations is for instructions rather than recommendations for all Lmdeploy users because it's not necessarily the one suitable for any Llm business providers. The user can decide their own way of categorizing users based on their observations of daily workload.
The purpose of introducing the above user group categorization is to provide guidance rather than a recommendation for all LMDeploy users, because it is not necessarily suitable for every LLM business provider. Users can decide their own way of categorizing users based on their observations of daily workloads.

Below let's talk about how Lmdeploy schedules requests based on this categorizations.
Below, let's talk about how LMDeploy schedules requests based on these categorizations.

### Multi-tenancy Strategies

@@ -63,7 +63,7 @@ That been said, there are ways to meet users' simple requirements to only make u

### A Sample QoS Configuration

The configuration will be placed along with lmdeploy binary and be periodically loaded by program in runtime.
The configuration will be deployed along with the lmdeploy binary and periodically loaded by the program at runtime.

```json
{
@@ -135,7 +135,6 @@ We provide the code link below to show how to call infer request with multi-tena

```python
'''
lmdeploy/serve/openai/api_server.py:420
- temperature (float): to modulate the next token probability
- repetition_penalty (float): The parameter for repetition penalty.
1.0 means no penalty
@@ -148,7 +147,6 @@

```python
'''
lmdeploy/serve/openai/api_server.py:110
Additional arguments supported by LMDeploy:
- ignore_eos (bool): indicator for ignoring eos
- session_id (int): if not specified, will set random value
@@ -160,7 +158,6 @@ Additional arguments supported by LMDeploy:

```python
'''
lmdeploy/serve/openai/api_server.py:387
Additional arguments supported by LMDeploy:
- ignore_eos (bool): indicator for ignoring eos
- session_id (int): if not specified, will set random value
@@ -187,7 +184,7 @@ curl -X POST http://localhost/v1/chat/interactive_qos \
"temperature": 0.8,
"repetition_penalty": 1,
"ignore_eos": false,
"user_id": "default"
"user_id": "user_id0"
}'
```
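
The same request can also be issued from Python. The snippet below is a hypothetical client-side sketch rather than code from this repository: it assumes the api_server is reachable at http://localhost as in the curl command above, and the `prompt` field, which is not visible in the truncated example, is an assumption.

```python
# Hypothetical Python client for /v1/chat/interactive_qos (requires the `requests` package).
import requests

payload = {
    'prompt': 'Hello, how are you?',  # assumed field; not shown in the truncated curl example
    'temperature': 0.8,
    'repetition_penalty': 1,
    'ignore_eos': False,
    'user_id': 'user_id0',  # should match a user_id listed in the QoS configuration
}
response = requests.post('http://localhost/v1/chat/interactive_qos', json=payload)
print(response.text)
```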

@@ -210,7 +207,7 @@ curl -X POST http://localhost/v1/chat/completions_qos \
"repetition_penalty": 1,
"session_id": -1,
"ignore_eos": false,
"user_id": "default"
"user_id": "user_id0"
}'
```
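
An analogous hypothetical Python call for the OpenAI-style /v1/chat/completions_qos endpoint is sketched below. The `model` and `messages` fields are assumptions based on the OpenAI-compatible schema and are not visible in the truncated curl example; /v1/completions_qos can be called the same way with a `prompt` field instead of `messages`.

```python
# Hypothetical Python client for /v1/chat/completions_qos (requires the `requests` package).
import requests

payload = {
    'model': 'internlm-chat-7b',  # assumed model name; use the model actually being served
    'messages': [{'role': 'user', 'content': 'Hello, how are you?'}],  # assumed field
    'repetition_penalty': 1,
    'session_id': -1,
    'ignore_eos': False,
    'user_id': 'user_id0',
}
response = requests.post('http://localhost/v1/chat/completions_qos', json=payload)
print(response.json())
```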

@@ -232,7 +229,7 @@ curl -X POST http://localhost/v1/completions_qos \
"repetition_penalty": 1,
"session_id": -1,
"ignore_eos": false,
"user_id": "default"
"user_id": "user_id0"
}'
```

@@ -247,7 +244,7 @@ The content format should follow the guidelines provided in the `qos_config.json
Upon starting the api_server, pass the configuration file and its path using the `--qos_config_path` flag. An example is illustrated below:

```bash
CUDA_VISIBLE_DEVICES=4 python main.py serve api_server ../../download/workspace/workspace/ --server_port 11454 --qos_config_path lmdeploy/serve/qos_engine/qos_config.json
CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server ./workspace/ --server_port 8000 --qos_config_path lmdeploy/serve/qos_engine/qos_config.json.template
```

### Contributor
@@ -256,8 +253,8 @@ Eric (https://github.com/rhinouser0)

sallyjunjun (https://github.com/sallyjunjun)

sfireworks(https://github.com/sfireworks)
sfireworks (https://github.com/sfireworks)

Dofgal (https://github.com/Dofgal)

shadow(https://github.com/awslshadowstar)
shadow (https://github.com/awslshadowstar)
45 changes: 21 additions & 24 deletions docs/zh_cn/qos.md
@@ -1,33 +1,33 @@
## Lmdeploy-QoS 介绍与用法
## LMDeploy-QoS 介绍与用法

### 背景

推理框架伴随着LLM和AGI在过去一段时间内出现。我们看到许多推理框架为语言模型提供可扩展和高性能的在线工作负载服务。它们所服务的工作负载通常涉及多个用户群体,并且工作负载模式在短时间内快速变化。由于许多推理框架在满足这些多租户流量模式的要求方面存在困难,并且未能很好地塑造用户的行为,我们认为在LLM推理框架领域系统地考虑这一点是有价值且必要的。

### 多租户处理的用户分类

Lmdeploy-QoS与Lmdeploy一起提供一系列多租户功能。它要求用户使用适当的用户标识配置文件或代码库中的user_id标记其推理请求。它基于类似字典的配置作为多租户策略。在这个配置中,用户被映射到称为“用户组”的各种类别,并配置有一个比率值。我们的多租户策略读取配置,并根据其类别优先级和预定义比率与实时分配比率之间的差异安排用户推理请求的调度。经过彻底的测试,我们的Lmdeploy-QoS极大地提高了Llm的服务可靠性和用于实际世界大型语言模型推理工作负载的GPU资源利用率。
LMDeploy-QoS与LMDeploy一起提供一系列多租户功能。它要求用户使用适当的用户标识(配置文件或代码库中的user_id)标记其推理请求。它基于类似字典的配置作为多租户策略。在这个配置中,用户被映射到称为“用户组”的各种类别,并配置有一个比率值。我们的多租户策略读取配置,并根据其类别优先级和预定义比率与实时分配比率之间的差异安排用户推理请求的调度。经过彻底的测试,我们的LMDeploy-QoS极大地提高了Llm的服务可靠性和用于实际世界大型语言模型推理工作负载的GPU资源利用率。

Lmdeploy将用户分为4组
LMDeploy将用户分为4组

- 白金Platinum
-Gold
-Silver
- 青铜Bronze
- 白金(Platinum)
-(Gold)
-(Silver)
- 青铜(Bronze)

根据我们在提供Llm服务方面的特定经验,我们可以将以下描述的4种类型的用户映射到这些用户组中:

- Platinum(白金): VIP用户或管理员用户。典型例子包括需要不间断在线服务的服务检查员或产品演示员。他们的工作负载频率低,对资源需求也不高。
- Platinum(白金): VIP用户或管理员用户。典型例子包括需要不间断在线服务的服务检查员或产品演示员。他们的工作负载频率低,对资源需求也不高。

- Gold(金): 具有合同的业务用户群体,需要可衡量的可靠服务。例如,某个公司A与Llm服务提供商签订了合同,购买了每秒X个请求的服务能力,可用性为Z%,供A公司员工使用,年付Y百万美元。
- Gold(金): 具有合同的业务用户群体,需要可衡量的可靠服务。例如,某个公司A与Llm服务提供商签订了合同,购买了每秒X个请求的服务能力,可用性为Z%,供A公司员工使用,年付Y百万美元。

- Silver(银): 绝大多数用户。大多数试用或每月订阅的用户被归类为此类别。他们需要相对较少的服务,但他们的用户体验对于Llm服务的声誉也很重要。
- Silver(银): 绝大多数用户。大多数试用或每月订阅的用户被归类为此类别。他们需要相对较少的服务,但他们的用户体验对于Llm服务的声誉也很重要。

- Bronze(青铜): 支付很少费用给Llm提供商的重度用户。
- Bronze(青铜): 支付很少费用给Llm提供商的重度用户。

以上引入用户组分类的目的是为了提供指导,而不是为所有Lmdeploy用户提供建议,因为这并不一定适用于所有Llm业务提供商。用户可以根据他们对日常工作负载的观察,自行决定如何对用户进行分类。
以上引入用户组分类的目的是为了提供指导,而不是为所有LMDeploy用户提供建议,因为这并不一定适用于所有Llm业务提供商。用户可以根据他们对日常工作负载的观察,自行决定如何对用户进行分类。

接下来让我们讨论一下Lmdeploy如何根据这些分类安排请求
接下来让我们讨论一下LMDeploy如何根据这些分类安排请求

### 多租户策略

@@ -63,7 +63,7 @@ Lmdeploy将用户分为4组:

### 一个示例的QoS配置

配置将与lmdeploy二进制文件一起放置,并将由运行时程序定期加载。
配置将与lmdeploy二进制文件一起部署,并将由运行时程序定期加载。

```json
{
@@ -127,15 +127,14 @@ Lmdeploy将用户分为4组:
}
```

### 如何使用 Lmdeploy-QoS 感知进行推理
### 如何使用 LMDeploy-QoS 感知进行推理

我们提供以下代码链接,展示如何调用具有多租户策略感知的推理请求。

@app.post('/v1/chat/interactive_qos'):

```python
'''
lmdeploy/serve/openai/api_server.py:420
- temperature (float): to modulate the next token probability
- repetition_penalty (float): The parameter for repetition penalty.
1.0 means no penalty
@@ -148,7 +147,6 @@

```python
'''
lmdeploy/serve/openai/api_server.py:110
Additional arguments supported by LMDeploy:
- ignore_eos (bool): indicator for ignoring eos
- session_id (int): if not specified, will set random value
@@ -160,7 +158,6 @@ Additional arguments supported by LMDeploy:

```python
'''
lmdeploy/serve/openai/api_server.py:387
Additional arguments supported by LMDeploy:
- ignore_eos (bool): indicator for ignoring eos
- session_id (int): if not specified, will set random value
@@ -187,7 +184,7 @@ curl -X POST http://localhost/v1/chat/interactive_qos \
"temperature": 0.8,
"repetition_penalty": 1,
"ignore_eos": false,
"user_id": "default"
"user_id": "user_id0"
}'
```

@@ -210,7 +207,7 @@ curl -X POST http://localhost/v1/chat/completions_qos \
"repetition_penalty": 1,
"session_id": -1,
"ignore_eos": false,
"user_id": "default"
"user_id": "user_id0"
}'
```

@@ -232,7 +229,7 @@ curl -X POST http://localhost/v1/completions_qos \
"repetition_penalty": 1,
"session_id": -1,
"ignore_eos": false,
"user_id": "default"
"user_id": "user_id0"
}'
```

@@ -247,7 +244,7 @@ curl -X POST http://localhost/v1/completions_qos \
启动api_server时,通过`--qos_config_path`,将配置文件及路径传入,示例如下:

```bash
CUDA_VISIBLE_DEVICES=4 python main.py serve api_server ../../download/workspace/workspace/ --server_port 11454 --qos_config_path lmdeploy/serve/qos_engine/qos_config.json
CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server ./workspace/ --server_port 8000 --qos_config_path lmdeploy/serve/qos_engine/qos_config.json.template
```

### 贡献者
@@ -256,8 +253,8 @@ Eric (https://github.com/rhinouser0)

sallyjunjun (https://github.com/sallyjunjun)

sfireworks(https://github.com/sfireworks)
sfireworks (https://github.com/sfireworks)

Dofgal (https://github.com/Dofgal)

shadow(https://github.com/awslshadowstar)
shadow (https://github.com/awslshadowstar)
40 changes: 27 additions & 13 deletions lmdeploy/serve/openai/api_server.py
@@ -129,6 +129,11 @@ async def chat_completions_v1_qos(request: ChatCompletionRequestQos,
request_id = str(request.session_id)
created_time = int(time.time())

if VariableInterface.qos_engine is None:
return create_error_response(
HTTPStatus.NOT_FOUND,
'cannot parse qos engine config, this api is not work')

result_generator = await VariableInterface.qos_engine.generate_with_qos(
request)

@@ -417,6 +422,11 @@ async def completions_v1_qos(request: CompletionRequestQos,
if isinstance(request.prompt, str):
request.prompt = [request.prompt]

if VariableInterface.qos_engine is None:
return create_error_response(
HTTPStatus.NOT_FOUND,
'cannot parse qos engine config, this api is not work')

generators = await VariableInterface.qos_engine.generate_with_qos(request)

def create_stream_response_json(
@@ -743,6 +753,11 @@ async def chat_interactive_v1_qos(request: GenerateRequestQos,
if request.session_id == -1:
request.session_id = random.randint(10087, 23333)

if VariableInterface.qos_engine is None:
return create_error_response(
HTTPStatus.NOT_FOUND,
'cannot parse qos engine config, this api is not work')

generation = await VariableInterface.qos_engine.generate_with_qos(request)

# Streaming case
@@ -911,24 +926,23 @@ def serve(model_path: str,
allow_methods=allow_methods,
allow_headers=allow_headers,
)
qos_config_str = ''
if qos_config_path:
try:
with open(qos_config_path, 'r') as file:
qos_config_str = file.read()
except FileNotFoundError:
qos_config_str = ''

VariableInterface.async_engine = AsyncEngine(model_path=model_path,
model_name=model_name,
instance_num=instance_num,
tp=tp,
**kwargs)
VariableInterface.qos_engine = QosEngine(
qos_tag=qos_config_str,
engine=VariableInterface.async_engine,
**kwargs)
VariableInterface.qos_engine.start()

if qos_config_path:
try:
with open(qos_config_path, 'r') as file:
qos_config_str = file.read()
VariableInterface.qos_engine = QosEngine(
qos_tag=qos_config_str,
engine=VariableInterface.async_engine,
**kwargs)
VariableInterface.qos_engine.start()
except FileNotFoundError:
VariableInterface.qos_engine = None

for i in range(3):
print(f'HINT: Please open \033[93m\033[1mhttp://{server_name}:'
29 changes: 12 additions & 17 deletions lmdeploy/serve/qos_engine/qos_engine.py
@@ -18,18 +18,15 @@ class QosConfig:
"""qos config class: parse qosconfig for qos engine."""

def __init__(self, qos_tag=''):
try:
qos_config = json.loads(qos_tag)
self.is_qos_enabled = qos_config['enable_user_qos']
qos_config = json.loads(qos_tag)
self.is_qos_enabled = qos_config.get('enable_user_qos', False)
logger.debug(f'is_qos_enabled: {self.is_qos_enabled}')

if self.is_qos_enabled:
self.user_id_maps = qos_config['user_group_map']
self.user_group_prio = qos_config['user_groups']
except Exception:
self.is_qos_enabled = False
self.user_id_maps = dict()
self.user_group_prio = []
logger.debug(f'is_qos_enabled: {self.is_qos_enabled}')
logger.debug(f'user_id_maps: {self.user_id_maps}')
logger.debug(f'user_group_prio: {self.user_group_prio}')
logger.debug(f'user_id_maps: {self.user_id_maps}')
logger.debug(f'user_group_prio: {self.user_group_prio}')
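
For quick reference, below is a hypothetical sketch of the configuration string the reworked QosConfig expects. The field names (enable_user_qos, user_groups, user_group_map) come from the code above; the values are placeholders, and the actual template ships as lmdeploy/serve/qos_engine/qos_config.json.template.

```python
# Hypothetical QosConfig input, mirroring the parsing logic shown above (not part of this commit).
import json

qos_tag = json.dumps({
    'enable_user_qos': True,
    'user_groups': ['Platinum', 'Gold', 'Silver', 'Bronze'],  # placeholder group list
    'user_group_map': {},  # placeholder; the real user_id -> group mapping lives in the template
})

qos_config = json.loads(qos_tag)
is_qos_enabled = qos_config.get('enable_user_qos', False)  # defaults to False if the key is absent
print(is_qos_enabled)  # True
```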


class QosEngine:
@@ -45,8 +42,11 @@ def __init__(self, qos_tag='', engine=None, **kwargs) -> None:

self.qos_user_group = QosGroupQueue(self.qos_config)

self.usage_stats = UsageStats(60, 6, 0,
self.qos_config.user_group_prio)
self.usage_stats = UsageStats(
total_duration=60,
buffer_count=6,
start_index=0,
user_groups=self.qos_config.user_group_prio)
self.user_served_reqs = dict()
self._dump_stats_thread = threading.Thread(target=self._dump_stats,
daemon=True)
@@ -205,11 +205,6 @@ def dequeue(self, usage_stats):
"""dequeue from multiqueue."""
return self.qos_user_group.dequeue(usage_stats)

def stop(self):
"""end qos engine session."""
self._stop_event.set()
self._dequeue_thread.join()


class QosGroupQueue:
"""create groups for qos outer group schedule."""
