Releases · triton-inference-server/vllm_backend
Release 2.49.0 corresponding to NGC container 24.08
What's Changed
- refactor: Remove explicit calls to garbage collection by @kthui in #55
- perf: Check for cancellation on response thread by @kthui in #54 (see the response-thread sketch under the 24.07 notes below)
- feat: Add vLLM counter metrics access through Triton by @yinggeh in #53 (see the metrics-scraping sketch after this list)
- feat: Report histogram metrics to Triton metrics server by @yinggeh in #58
- feat: Report more histogram metrics by @yinggeh in #61
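The counter and histogram metrics added in #53, #58, and #61 are exported through Triton's standard Prometheus metrics endpoint. Below is a minimal scraping sketch, assuming the default metrics port (8002) and the `vllm:` metric-name prefix; the helper name is our own, not part of the backend:

```python
# Minimal sketch: read vLLM metrics from Triton's Prometheus endpoint.
# Assumes the default metrics port (8002) and the "vllm:" name prefix;
# adjust both for your deployment.
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"  # Triton's default metrics port

def vllm_metrics(url: str = METRICS_URL) -> dict[str, str]:
    """Return {metric_sample: value} for every vLLM-prefixed metric line."""
    body = urllib.request.urlopen(url).read().decode("utf-8")
    samples = {}
    for line in body.splitlines():
        # Prometheus text format: comments start with '#', samples are
        # "<name>{<labels>} <value>".
        if line.startswith("vllm:"):
            name, _, value = line.rpartition(" ")
            samples[name] = value
    return samples

if __name__ == "__main__":
    for name, value in vllm_metrics().items():
        print(name, value)
```

Enabling these metrics may additionally require the backend's metrics parameter in the model config (see the backend README); the sketch only reads whatever the endpoint already exposes.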
Full Changelog: v24.07...v24.08
Release 2.48.0 corresponding to NGC container 24.07
What's Changed
- Removed explicit mode for multi-lora by @oandreeva-nv in #45
- test: Limiting multi-gpu tests to use Ray as distributed_executor_backend by @oandreeva-nv in #47
- perf: Improve vLLM backend performance by using a separate thread for responses by @Tabrizian in #46 (see the sketch after this list)
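#46 moves response sending onto a dedicated thread so the generation loop never blocks on response I/O, and #54 (in the 24.08 notes above) adds a cancellation check on that same thread. The following is a schematic sketch of the pattern only; the queue, sentinel, and `sender` object are illustrative stand-ins, not the backend's actual internals:

```python
# Schematic sketch of the decoupled-response pattern from #46/#54:
# the generation loop enqueues outputs while a dedicated thread sends
# them and polls for cancellation. Names are illustrative stand-ins.
import queue
import threading

_SENTINEL = object()  # placed on the queue to signal end-of-stream

def response_loop(q: queue.Queue, sender) -> None:
    """Drain responses on a dedicated thread so generation never blocks on I/O."""
    while True:
        item = q.get()
        if item is _SENTINEL:
            break
        # Checking cancellation on the response thread (per #54) lets a
        # cancelled request stop producing traffic without stalling generation.
        if sender.is_cancelled():
            continue  # drop remaining outputs for a cancelled request
        sender.send(item)

def handle_request(sender, generate) -> None:
    q: queue.Queue = queue.Queue()
    t = threading.Thread(target=response_loop, args=(q, sender), daemon=True)
    t.start()
    for output in generate():  # e.g. tokens streamed by the vLLM engine
        q.put(output)
    q.put(_SENTINEL)
    t.join()
```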
Full Changelog: v24.06...v24.07
Release 2.47.0 corresponding to NGC container 24.06
What's Changed
- fix: Enhance checks around KIND_GPU and tensor parallelism in #42 (co-authored by Olga Andreeva <[email protected]>)
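#42 tightens validation of the instance-group kind against vLLM's `tensor_parallel_size`: a tensor-parallel engine spans multiple GPUs, so pairing it with a single-GPU `KIND_GPU` instance group is inconsistent. A minimal sketch of such a check, with a hypothetical stand-in for the parsed model config:

```python
# Minimal sketch of the kind of validation #42 adds: reject configs that
# pair a multi-GPU tensor-parallel engine with a single-GPU instance
# group. The dataclass stands in for the parsed model config; its field
# names are hypothetical.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    instance_group_kind: str   # e.g. "KIND_GPU" or "KIND_MODEL"
    tensor_parallel_size: int  # from vLLM's engine arguments

def validate(config: ModelConfig) -> None:
    if config.instance_group_kind == "KIND_GPU" and config.tensor_parallel_size > 1:
        raise ValueError(
            "KIND_GPU pins an instance to one GPU, but tensor_parallel_size="
            f"{config.tensor_parallel_size} needs multiple GPUs; "
            "use KIND_MODEL instead."
        )

validate(ModelConfig(instance_group_kind="KIND_MODEL", tensor_parallel_size=2))  # ok
```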