Update docs, refine en/qos.md
Zhou Zihan authored and sallyjunjun committed Dec 26, 2023
1 parent 26040fb commit d303ef5
Showing 3 changed files with 29 additions and 32 deletions.
46 changes: 23 additions & 23 deletions docs/en/qos.md
@@ -2,11 +2,11 @@

### Background

- Inference frameworks emerges along with the LLM and AGI for the past period time. We see many inference frameworks providing scalable and high performant services in serving online workloads with language models. The workload they are serving usually comes with multiple user groups and the workload pattern changes quickly in a short time. Because many inference frameworks struggle in meeting requirements of these multi-tenancy traffic patterns and doesn't well shape user's behaviors, we think considering this systematically in LLM inference framework world is valuable and necessary.
+ Inference frameworks have emerged alongside LLMs and AGI over the past period. We see many inference frameworks providing scalable, high-performance services for online workloads with language models. The workloads they serve usually come from multiple user groups, and the workload pattern can change quickly within a short time. Because many inference frameworks struggle to meet the requirements of these multi-tenancy traffic patterns and do not shape user behavior well, we think addressing this systematically in the LLM inference framework world is valuable and necessary.

### User Categorizations for Multi-tenancy Handling

- LMDeploy-QoS comes along with LMDeploy to provide a series of multi-tenancy functionality. It requires user to tag their inference requests with proper user identifications(user_id in config or codebase). It works based on a dictionary like configuration as the multi-tenancy policy. In this configuration, users are mapped to various classes called "user groups" and configured with a ratio value. Our multi-tenancy strategy reads the configuration and schedules user inference requests based on their class priority and the delta between predefined ratio and the realtime allocation ratio. With thorough tests, our LMDeploy-QoS greatly improves the Llm serving reliability and the GPU resources utilizations for real world large language model inference workload.
+ LMDeploy-QoS comes along with LMDeploy to provide a set of multi-tenancy capabilities. It requires users to tag their inference requests with proper user identification (user_id in config or codebase). It works based on a dictionary-like configuration that serves as the multi-tenancy policy. In this configuration, users are mapped to various classes called "user groups" and configured with a ratio value. Our multi-tenancy strategy reads the configuration and schedules user inference requests based on their class priority and the delta between the predefined ratio and the real-time allocation ratio. With thorough tests, our LMDeploy-QoS greatly improves LLM serving reliability and GPU resource utilization for real-world large language model inference workloads.

LMDeploy categorizes users into 4 groups:

@@ -17,53 +17,53 @@ LMDeploy categorized users into 4 groups:

Based on our experience in serving LLM services, we can map the 4 types of users described below to these user groups:

- - Platinum: vip users or admin users. Typical examples are service inspector or product demo presenter that requires uninterruptible online service. Their work load are in low frequency and low resource demand.
+ - Platinum: VIP users or admin users. Typical examples are service inspectors or product demo presenters who require uninterruptible online service. Their workloads are low in frequency and low in resource demand.

- - Gold: business user groups with contracts that require measurable amount of reliable services. For example certain company A signed a contract with Llm service provider and bought X request/sec service ability with availability Z% for A's employees with a Y millions $ per year payment.
+ - Gold: business user groups with contracts that require a measurable amount of reliable service. For example, company A signs a contract with the LLM service provider and buys X requests/sec of service capacity with Z% availability for A's employees, for a payment of Y million dollars per year.

- - Silver: vast majority of users. Most trial or monthly subscribed users are categorized into this group. They need relatively small amount of services but their user experiences are also important to Llm service reputation.
+ - Silver: the vast majority of users. Most trial or monthly subscribed users are categorized into this group. They need a relatively small amount of service, but their user experience is also important to the LLM service's reputation.

- - Bronze: heavy users that pays very few to Llm providers.
+ - Bronze: heavy users that pay very little to LLM providers.

- The purpose of introducing the above user group categorizations is for instructions rather than recommendations for all LMDeploy users because it's not necessarily the one suitable for any Llm business providers. The user can decide their own way of categorizing users based on their observations of daily workload.
+ The above user group categorization is introduced as guidance rather than as a recommendation for all LMDeploy users, because it is not necessarily suitable for every LLM business provider. Users can decide their own way of categorizing users based on their observations of daily workloads.

- Below let's talk about how LMDeploy schedules requests based on this categorizations.
+ Below let's talk about how LMDeploy schedules requests based on these categorizations.

### Multi-tenancy Strategies

#### Strategy 1: prioritized scheduling between groups

- This strategy works as simple as its title. Requests with higher priority always have higher priority to be inferred. To be noted that because the scheduling action is performed at request receiving time, it won't retrospectively chase back requests with lower priority already under inference.
+ This strategy works as simply as its title suggests. Requests from higher-priority groups are always scheduled for inference first. Note that because the scheduling decision is made when a request is received, it will not retroactively preempt lower-priority requests that are already under inference.

The diagram below shows how the prioritization works. As you can see, the platinum request is reprioritized and moved to the queue head.

![](https://github.com/InternLM/lmdeploy/assets/52888924/9d63f081-7168-4c74-8456-24f0a4b41649)
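
As an illustration of this idea, here is a minimal sketch of cross-group prioritized scheduling. It is not the actual LMDeploy implementation; the class name, method names, and the group-priority list are assumptions made for this example.

```python
# Minimal sketch of Strategy 1 (hypothetical, not the LMDeploy implementation):
# requests from higher-priority user groups are always dequeued first.
import collections

GROUP_PRIORITY = ["Platinum", "Gold", "Silver", "Bronze"]  # highest to lowest


class CrossGroupScheduler:
    def __init__(self):
        # One FIFO queue per user group.
        self.queues = {group: collections.deque() for group in GROUP_PRIORITY}

    def enqueue(self, group: str, request) -> None:
        self.queues[group].append(request)

    def dequeue(self):
        # Serve the first non-empty queue, scanning from highest to lowest
        # priority; within a group, requests keep their arrival (FIFO) order.
        for group in GROUP_PRIORITY:
            if self.queues[group]:
                return self.queues[group].popleft()
        return None  # nothing pending


scheduler = CrossGroupScheduler()
scheduler.enqueue("Silver", "req-1")
scheduler.enqueue("Platinum", "req-2")
assert scheduler.dequeue() == "req-2"  # the platinum request jumps ahead
```

Consistent with the description above, a request that has already been dispatched for inference is not preempted; prioritization only affects which pending request is picked next.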

#### Strategy 2: proportionally rated scheduling with a pre-defined ratio within user group

- This strategy works only within user group. We introduce a within group user quota configuration table. This table defines users' "ideal share ratio" with a sum value of 100% GPU resource. Each "user" appears in the list as a user_id, and a user can only belong to one user group. The term "ideal share" means when the system is full of requests defined in the configuration list in the pending requests, without anyone absent, the user will be schedule in a way that it appears taking the GPU "share" with an amount proportional to the ratio in the configuration. To be noted, if one or more user on the list is absent in the pending requests, the rest users shall be scheduled appeared as taking their share with ratios proportional to their corresponding configured value left among the list.
+ This strategy works only within a user group. We introduce a within-group user quota configuration table. This table defines users' "ideal share ratio", summing to 100% of the GPU resource. Each "user" appears in the list as a user_id, and a user can only belong to one user group. The term "ideal share" means that when the pending requests contain every user defined in the configuration list, with no one absent, each user will be scheduled so that it appears to take a GPU "share" proportional to its ratio in the configuration. Note that if one or more users on the list are absent from the pending requests, the remaining users shall be scheduled so that their shares are proportional to their configured values among those remaining on the list.

- To be noted, the share value configured on the table shall be reasonably large. As a contrary example, to allocate a 1% quota ratio for certain user may represent a ratio of GPU resource's granularity too-fined to serve a inference request, which could lead to starvation for this user.
+ Note that the share value configured in the table should be reasonably large. As a counterexample, allocating a 1% quota ratio to a certain user may represent a slice of GPU resource too fine-grained to serve an inference request, which could lead to starvation for this user.

The diagram below shows a typical example of how this strategy works.

![](https://github.com/InternLM/lmdeploy/assets/52888924/3e1d7135-6b11-4998-89a1-b72af6c962c3)
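
The following sketch illustrates one way such quota-based selection could work. It is a simplified, assumption-laden example (per-request accounting, hypothetical class name, quotas passed in as percentages), not the code LMDeploy ships.

```python
# Minimal sketch of Strategy 2 (hypothetical): within a single user group,
# serve the pending user whose actual share lags furthest behind its quota.
import collections


class WithinGroupScheduler:
    def __init__(self, quota_pct: dict):
        total = sum(quota_pct.values())
        # Normalized "ideal share" per user_id, summing to 1.0.
        self.quota = {user: pct / total for user, pct in quota_pct.items()}
        self.queues = {user: collections.deque() for user in quota_pct}
        self.served = {user: 0 for user in quota_pct}  # requests served so far

    def enqueue(self, user_id: str, request) -> None:
        self.queues[user_id].append(request)

    def dequeue(self):
        pending = [user for user, queue in self.queues.items() if queue]
        if not pending:
            return None
        # Renormalize quotas over the users that actually have pending requests,
        # then pick the user with the largest (ideal share - actual share) deficit.
        quota_sum = sum(self.quota[user] for user in pending)
        served_sum = sum(self.served[user] for user in pending) or 1

        def deficit(user):
            return self.quota[user] / quota_sum - self.served[user] / served_sum

        chosen = max(pending, key=deficit)
        self.served[chosen] += 1
        return self.queues[chosen].popleft()


sched = WithinGroupScheduler({"user_id0": 60, "user_id1": 40})
```

A production scheduler might account for tokens or GPU time rather than request counts; the deficit-based selection here is only meant to convey the "ideal share" idea.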

#### Strategy 3: a combination strategy of 1 and 2

- We can call it hybrid strategy. They way we hybrid these 2 strategies is fairly simple: we adopt the strategy 1 in between user groups, and adopt strategy 2 within a user group. So users with different groups having different priority will only obey strategy 1 to determine their privilege in resource allocation. That said, when both strategies appears, the first strategy will overpower the second. When it comes to a situation that no cross group requests are in pending for serving, the within group strategy 2 comes to play.
+ We can call this a hybrid strategy. The way we combine these 2 strategies is fairly simple: we adopt strategy 1 between user groups, and strategy 2 within a user group. So users from different groups with different priorities obey only strategy 1 to determine their privilege in resource allocation; when both strategies apply, the first strategy overpowers the second. When no cross-group requests are pending for serving, the within-group strategy 2 comes into play.

- Below is a diagram shows it.
+ Below is a diagram showing it.

![](https://github.com/InternLM/lmdeploy/assets/52888924/e335f976-ff15-48db-b1ff-abf1c3327d6e)
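
Continuing the two sketches above (and reusing their hypothetical `GROUP_PRIORITY` and `WithinGroupScheduler`), a hybrid scheduler could be layered as follows. Again, this is an illustration under those assumptions, not the LMDeploy code.

```python
# Minimal sketch of Strategy 3 (hypothetical): strategy 1 decides which user
# group is served; strategy 2 decides which user within that group is served.
class HybridScheduler:
    def __init__(self, group_quotas: dict):
        # group_quotas example: {"Platinum": {"vip_user": 100},
        #                        "Silver": {"user_id0": 60, "user_id1": 40}}
        self.group_schedulers = {
            group: WithinGroupScheduler(quotas)
            for group, quotas in group_quotas.items()
        }

    def enqueue(self, group: str, user_id: str, request) -> None:
        self.group_schedulers[group].enqueue(user_id, request)

    def dequeue(self):
        # Strategy 1: the highest-priority group with pending requests wins outright.
        for group in GROUP_PRIORITY:
            scheduler = self.group_schedulers.get(group)
            if scheduler is None:
                continue
            # Strategy 2: pro-rated, quota-deficit selection within that group.
            request = scheduler.dequeue()
            if request is not None:
                return request
        return None
```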

- To be noted, there could be other ways of hybridding strategy 1 &2, this doc only introduced one method that works well at our scenario. Other hybrid methods shall consider that the prioritization and pro-rated sharing are obviously conflicting strategies so there isn't easy way to mix them to work within a single dimension.
+ Note that there could be other ways of hybridizing strategies 1 & 2; this doc only introduces one method that works well in our scenario. Other hybrid methods should consider that prioritization and pro-rated sharing are obviously conflicting strategies, so there is no easy way to mix them within a single dimension.

- That been said, there are ways to meet users' simple requirements to only make use either one of the strategy. For example, a system requires only fair-sharing like strategy can configure with every users under one group. A good example is one model deployed for two different business purposes without dependency; A system having need of strictly chained prioritizations between different group of users can configure with at most 4 groups and each group having only one user. A good example is DAG inference workload.
+ That being said, there are ways to meet simpler requirements that use only one of the strategies. For example, a system that requires only a fair-sharing-like strategy can be configured with all users under one group; a good example is one model deployed for two different business purposes without dependency. A system that needs strictly chained prioritization between different groups of users can be configured with at most 4 groups, each containing only one user; a good example is a DAG inference workload.

### A Sample QoS Configuration

- The configuration will be deployed along with lmdeploy binary and be periodically loaded by program in runtime.
+ The configuration is specified by the `--qos_config_path` flag and is loaded by the program upon startup.

```json
{
@@ -129,7 +129,7 @@ The configuration will be deployed along with lmdeploy binary and be periodicall

### How to perform inference jobs with LMDeploy-QoS awareness

- We provide the code link below to show how to call infer request with multi-tenancy strategy awarded.
+ We provide the code link below to show how to send inference requests with the multi-tenancy strategy applied.

@app.post('/v1/chat/interactive_qos'):

@@ -165,7 +165,7 @@ Additional arguments supported by LMDeploy:
'''
```
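
For illustration, a client might call this endpoint as sketched below. Only the endpoint path and the user_id argument come from this document; the host, port, and the other body fields are assumptions and may differ from the actual API.

```python
# Hypothetical client-side sketch: tag the request with a user_id so the QoS
# scheduler can map it to the configured user group and quota.
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/interactive_qos",  # assumed host and port
    json={
        "prompt": "Hello, what is LMDeploy-QoS?",  # assumed field name
        "user_id": "user_id0",                     # identity used for group/quota lookup
    },
)
print(response.text)
```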

- What the qos related argument appear as in http body:
+ How the QoS-related argument appears in the http body:

/v1/chat/interactive_qos

@@ -249,12 +249,12 @@ CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server ./workspace/ --server_port 8000

### Contributor

- Eric (https://github.com/rhinouser0)
+ [Eric](https://github.com/rhinouser0)

- sallyjunjun (https://github.com/sallyjunjun)
+ [sallyjunjun](https://github.com/sallyjunjun)

- sfireworks (https://github.com/sfireworks)
+ [sfireworks](https://github.com/sfireworks)

- Dofgal (https://github.com/Dofgal)
+ [Dofgal](https://github.com/Dofgal)

- shadow (https://github.com/awslshadowstar)
+ [shadow](https://github.com/awslshadowstar)
12 changes: 6 additions & 6 deletions docs/zh_cn/qos.md
@@ -63,7 +63,7 @@ LMDeploy categorizes users into 4 groups:

### A Sample QoS Configuration

- The configuration will be deployed along with the lmdeploy binary and periodically loaded by the program at runtime
+ The configuration file is specified by the `--qos_config_path` startup argument and is loaded by the program at startup

```json
{
@@ -249,12 +249,12 @@ CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server ./workspace/ --server_port 8000

### Contributor

- Eric (https://github.com/rhinouser0)
+ [Eric](https://github.com/rhinouser0)

- sallyjunjun (https://github.com/sallyjunjun)
+ [sallyjunjun](https://github.com/sallyjunjun)

- sfireworks (https://github.com/sfireworks)
+ [sfireworks](https://github.com/sfireworks)

- Dofgal (https://github.com/Dofgal)
+ [Dofgal](https://github.com/Dofgal)

- shadow (https://github.com/awslshadowstar)
+ [shadow](https://github.com/awslshadowstar)
3 changes: 0 additions & 3 deletions lmdeploy/serve/qos_engine/inner_group_schd.py
@@ -1,7 +1,6 @@
# Copyright (c) OpenMMLab. All rights reserved.
import collections
import logging
- import threading

logger = logging.getLogger(__name__)

@@ -23,8 +22,6 @@ def __init__(self, group: str, user_id_map: dict):
self.user_queue_map[user_id] = collections.deque()
self.user_quota_map[user_id] = item['quota_pct'] / total_quota

- self.lock = threading.Lock()

def enqueue(self, request_event):
"""Enqueue request to corresponding user queue."""
if request_event[0].user_id in self.user_queue_map:
