Can't reproduce benchmark results #4

Open
shahar-lev opened this issue Aug 7, 2023 · 4 comments
shahar-lev commented Aug 7, 2023

I ran the following benchmark scripts:

  1. benchmark_configs/vllm_variable_size
  2. benchmark_configs/vllm_variable_size_latency

The results I got deviate from the ones published in the blog.
The throughput results are between 6% and 14% lower than the expected ones:

| Throughput (tokens/s) | max tokens 32 | max tokens 128 | max tokens 512 | max tokens 1536 |
|---|---|---|---|---|
| Expected | 6121 | 3592 | 2029 | 1898 |
| Actual | 5752 | 3180 | 1734 | 1653 |

For qps=1 the latency matches, but for qps=4 it's about 54% worse (7.1 s vs. 4.6 s).

| Median e2e latency (s) | qps 1 | qps 4 |
|---|---|---|
| Expected | 3.6 | 4.6 |
| Actual | 3.6 | 7.1 |

Setup details:

Linux 5.10.176+ #1 SMP Sat May 6 15:10:33 UTC 2023 x86_64 GNU/Linux
NVIDIA Driver Version: 525.105.17
GPU: 1 x NVIDIA A100-SXM4-40GB
Python 3.10.6
vLLM 0.1.2
CUDA 11.8
Torch 2.0.1+cu118
Transformers 4.30.1

Can you explain what might cause the performance difference?

Note: I had to fix bad imports in launch_scripts/launch_vllm to make it work (for example ServerArgs => EngineArgs)
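
For anyone else hitting this, here is a minimal sketch of the rename, assuming the vLLM 0.1.2 top-level exports (the surrounding launch-script code is omitted; treat the exact names as my best guess at the 0.1.2 API):

```python
# Hypothetical sketch of the import fix, assuming the vLLM 0.1.2 public API.
# Old names that no longer resolve:
#   from vllm import ServerArgs, LLMServer
# Renamed equivalents ("ServerArgs" -> "EngineArgs", "LLMServer" -> "LLMEngine"):
from vllm import EngineArgs, LLMEngine

# Construct the engine with the new names (model is the one used in the blog benchmarks).
engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-13b"))
```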

Below are the detailed results (of my runs):

vllm_range_32_2023-08-02_19:08:59.log:
backend vLLM dur_s 91.73 tokens_per_s 5751.92 qps 10.90 successful_responses 1000 prompt_token_count 512000 response_token_count 15610, median_token_latency=3.129498200757163, median_e2e_latency=49.0919646024704

vllm_range_128_2023-08-02_19:11:07.log:
backend vLLM dur_s 178.07 tokens_per_s 3180.34 qps 5.62 successful_responses 1000 prompt_token_count 512000 response_token_count 54331, median_token_latency=1.7179552376270295, median_e2e_latency=90.2003003358841

vllm_range_512_2023-08-02_19:14:42.log:
backend vLLM dur_s 364.71 tokens_per_s 1738.68 qps 2.74 successful_responses 1000 prompt_token_count 512000 response_token_count 122108, median_token_latency=1.7062032730021375, median_e2e_latency=181.4516226053238

vllm_range_1536_2023-08-02_19:21:23.log:
backend vLLM dur_s 387.94 tokens_per_s 1653.79 qps 2.58 successful_responses 1000 prompt_token_count 512000 response_token_count 129570, median_token_latency=1.7450935804206906, median_e2e_latency=193.9225560426712

vllm_qps_1_numprompts_5000_range_1536_2023-08-02_22:22:20.log:
backend vLLM dur_s 5024.24 tokens_per_s 382.84 qps 1.00 successful_responses 5000 prompt_token_count 1290681 response_token_count 632820, median_token_latency=0.0408235232035319, median_e2e_latency=3.566364049911499

vllm_qps_4_numprompts_5000_range_1536_2023-08-02_23:46:54.log:
backend vLLM dur_s 1279.33 tokens_per_s 1503.52 qps 3.91 successful_responses 5000 prompt_token_count 1290681 response_token_count 632820, median_token_latency=0.0629778996757839, median_e2e_latency=7.079227566719055

vllm_qps_8_numprompts_5000_range_1536_2023-08-02_19:30:41.log:
backend vLLM dur_s 1180.98 tokens_per_s 1628.74 qps 4.23 successful_responses 5000 prompt_token_count 1290681 response_token_count 632820, median_token_latency=2.7175515592098236, median_e2e_latency=267.9675291776657

vllm_qps_16_numprompts_5000_range_1536_2023-08-02_19:51:12.log:
backend vLLM dur_s 1175.71 tokens_per_s 1636.03 qps 4.25 successful_responses 5000 prompt_token_count 1290681 response_token_count 632820, median_token_latency=4.252413267313048, median_e2e_latency=419.6503413915634

vllm_qps_32_numprompts_5000_range_1536_2023-08-02_20:11:37.log:
backend vLLM dur_s 1175.90 tokens_per_s 1635.77 qps 4.25 successful_responses 5000 prompt_token_count 1290681 response_token_count 632820, median_token_latency=5.0277330653612005, median_e2e_latency=496.7233476638794
cadedaniel (Contributor) commented

Hi! We ran on the June 19th version. I believe the newer versions auto-configure the number of GPU blocks available to vLLM, up to 90% of GPU memory. Can you share how many GPU blocks / size of GPU block are present when you run OPT-13B on the A100-40GB?

Also, we found that the number of GPUs vLLM has access to can impact throughput. Is there only one GPU on your machine?
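
For context, a sketch of the settings that drive this auto-configuration, with field names and defaults as I understand them for vLLM 0.1.x (treat them as assumptions):

```python
# Knobs that control KV-cache sizing in vLLM ~0.1.x (names/defaults assumed).
from vllm import EngineArgs

args = EngineArgs(
    model="facebook/opt-13b",
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may claim (the "up to 90%")
    block_size=16,                # tokens per KV-cache block
    swap_space=4,                 # GiB of host memory for swapped-out (preempted) blocks
)
# On startup the engine profiles available memory and logs a line like
#   # GPU blocks: <n>, # CPU blocks: <m>
# which is the number asked about here.
```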

shahar-lev (Author) commented Aug 10, 2023

I'm using a single A100-40GB GPU. I'm running on GCP, so I imagine the physical machine itself has more than one such GPU, but my VM has only one at its disposal.
When running the vllm server, I get the following:

# GPU blocks: 866, # CPU blocks: 327

I believe the block_size used is the default one in version 0.1.2, which is 16.
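
A rough back-of-the-envelope check (my own arithmetic, assuming OPT-13B's published shape of 40 layers, hidden size 5120, and an fp16 KV cache) of what 866 blocks of 16 tokens corresponds to:

```python
# Rough KV-cache sizing for OPT-13B on an A100-40GB (estimate only).
num_layers, hidden_size, dtype_bytes = 40, 5120, 2               # OPT-13B, fp16
block_size, num_gpu_blocks = 16, 866                             # reported above

kv_bytes_per_token = 2 * num_layers * hidden_size * dtype_bytes  # key + value, ~0.78 MiB/token
cache_gib = num_gpu_blocks * block_size * kv_bytes_per_token / 2**30
cache_tokens = num_gpu_blocks * block_size                       # total cacheable tokens
weights_gib = 13e9 * dtype_bytes / 2**30                         # ~24 GiB of fp16 weights

print(f"{cache_tokens} tokens, {cache_gib:.1f} GiB cache, {weights_gib + cache_gib:.1f} GiB total")
# -> 13856 tokens, 10.6 GiB cache, ~34.8 GiB total, i.e. roughly 90% of the 40 GB card
```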

shahar-lev (Author) commented

If it helps, I ran the qps=4 latency benchmark on a single A100-80GB GPU (instead of the 40GB one) with --swap-space 0 and varying --gpu-memory-utilization, and I got the following results:

  • median e2e latency 4.8 when using --gpu-memory-utilization=0.9 (# GPU blocks: 3797, # CPU blocks: 0).
  • median e2e latency 5.3 when using --gpu-memory-utilization=0.45 (to "resemble" the 40GB run - # GPU blocks: 879, # CPU blocks: 0).

Just a reminder: with the single A100-40GB I got a median e2e latency of 7.1 s (for qps=4)
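
A quick sketch that just turns the reported block counts into cacheable-token capacity (plain arithmetic on the numbers in this thread), next to the measured qps=4 latencies:

```python
# KV-cache capacity implied by the reported GPU block counts (block_size = 16).
block_size = 16
runs = [
    ("A100-40GB, default settings",              866,  7.1),  # median e2e latency (s), qps=4
    ("A100-80GB, --gpu-memory-utilization=0.45", 879,  5.3),
    ("A100-80GB, --gpu-memory-utilization=0.9",  3797, 4.8),
]
for name, gpu_blocks, latency_s in runs:
    print(f"{name}: {gpu_blocks * block_size} cacheable tokens, {latency_s} s median e2e")
# 13856 / 14064 / 60752 tokens respectively
```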

JenniePing commented

> If it helps, I ran the qps=4 latency benchmark on a single A100-80GB GPU (instead of the 40GB one) with --swap-space 0 and varying --gpu-memory-utilization, and I got the following results:
>
>   • median e2e latency 4.8 when using --gpu-memory-utilization=0.9 (# GPU blocks: 3797, # CPU blocks: 0).
>   • median e2e latency 5.3 when using --gpu-memory-utilization=0.45 (to "resemble" the 40GB run - # GPU blocks: 879, # CPU blocks: 0).
>
> Just a reminder: with the single A100-40GB I got a median e2e latency of 7.1 s (for qps=4)

Hey, I got the same import problems. Could you please tell me what I should do about `cannot import name 'LLMServer' from 'vllm'`? Thanks.
