Releases: triton-inference-server/vllm_backend

Release 2.49.0 corresponding to NGC container 24.08

30 Aug 18:37
98947a7

What's Changed

  • refactor: Remove explicit callings to garbage collect by @kthui in #55
  • perf: Check for cancellation on response thread by @kthui in #54
  • feat: Add vLLM counter metrics access through Triton by @yinggeh in #53
  • feat: Report histogram metrics to Triton metrics server by @yinggeh in #58
  • feat: Report more histogram metrics by @yinggeh in #61

Full Changelog: v24.07...v24.08

Release 2.48.0 corresponding to NGC container 24.07

05 Aug 20:38
128abc3

What's Changed

  • Removed explicit mode for multi-lora by @oandreeva-nv in #45
  • test: Limiting multi-gpu tests to use Ray as distributed_executor_backend by @oandreeva-nv in #47
  • perf: Improve vLLM backend performance by using a separate thread for responses by @Tabrizian in #46

Full Changelog: v24.06...v24.07

Release 2.47.0 corresponding to NGC container 24.06

23 Jul 19:27
18a96e3
fix: Enhance checks around KIND_GPU and tensor parallelism (#42)

Co-authored-by: Olga Andreeva