Can't reproduce benchmark results #4
Comments
Hi! We ran on the June 19th version. I believe the newer versions auto-configure the number of GPU blocks available to vLLM, up to 90% of GPU memory. Can you share how many GPU blocks there are, and the GPU block size, when you run OPT-13B on the A100-40GB? Also, we found that the number of GPUs vLLM has access to can impact throughput. Is there only one GPU on your machine?
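For anyone comparing runs: a minimal sketch, assuming a recent vLLM release, of pinning the memory fraction that determines the block count (`gpu_memory_utilization` is vLLM's knob for the 90% figure above; the model name is just the one from this issue):

```python
# Minimal sketch, assuming a recent vLLM release: pin the GPU memory
# fraction explicitly so the derived number of GPU blocks is comparable
# across machines. gpu_memory_utilization defaults to 0.90, i.e. the 90%
# figure mentioned above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-13b",
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may claim
)
# At startup the engine logs the block counts it derived (a line of the
# form "# GPU blocks: ..., # CPU blocks: ..."); that is the number to compare.
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```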
I'm using a single A100-40GB GPU. I'm running on GCP, so I imagine the physical machine itself has more than one such GPU, but my VM has only one at its disposal.
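For completeness, a generic way to confirm how many GPUs the benchmark process actually sees (plain PyTorch calls, not anything from this repo):

```python
# Generic visibility check: CUDA_VISIBLE_DEVICES can hide devices, so this
# reports what the process (and therefore vLLM) will actually use.
import torch

print(torch.cuda.device_count())          # expected: 1 on this VM
for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_name(i))  # e.g. "NVIDIA A100-SXM4-40GB"
```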
I believe the
If it helps, I ran the
Just a reminder, with the single
Hey, I got the same import problem. Could you please tell me what I should do about "cannot import name 'LLMServer' from 'vllm'"? Thanks.
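If this is the same renaming issue described in the report below, the fix is likely just the updated class name; a hedged sketch, assuming a recent vLLM where the `*Server` classes became `*Engine`:

```python
# Old code (fails on recent vLLM with "cannot import name 'LLMServer'"):
#   from vllm import LLMServer
# Assumed equivalent after the rename:
from vllm import EngineArgs, LLMEngine

engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-13b"))
```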
I ran the following benchmark scripts (invoked as sketched below):
- `benchmark_configs/vllm_variable_size`
- `benchmark_configs/vllm_variable_size_latency`
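In case the exact invocation matters for reproducing this, here is a hypothetical driver; it assumes the configs are shell scripts run from the repo root, which their paths suggest, but the repo may document a different entry point:

```python
# Hypothetical reproduction driver; assumes the benchmark configs are shell
# scripts run from the repo root. Adjust if the repo invokes them differently.
import subprocess

for cfg in ("benchmark_configs/vllm_variable_size",
            "benchmark_configs/vllm_variable_size_latency"):
    subprocess.run(["bash", cfg], check=True)
```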
The results I got deviate from the ones published in the blog. The throughput results are between 6% and 14% lower than the expected ones. For `qps=1` the latency is the same, but for `qps=4` it's 54% worse.

Setup details:
Can you explain what might cause the performance difference?
Note: I had to fix bad imports in `launch_scripts/launch_vllm` to make it work (for example `ServerArgs` => `EngineArgs`).
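A sketch of that fix, hedged: the exact lines in `launch_scripts/launch_vllm` may differ, but the rename pattern matches vLLM's `Server*` -> `Engine*` API change:

```python
# Before (fails on newer vLLM; older code used names like ServerArgs /
# AsyncLLMServer -- the script's actual imports may differ):
#   from vllm import ServerArgs, AsyncLLMServer
# After the rename (assumed):
from vllm import AsyncEngineArgs, AsyncLLMEngine

# Illustrative only -- launch_vllm presumably wires these into its server.
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="facebook/opt-13b"))
```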
Below are the detailed results (of my runs):