Update on the development branch #2437

kaiyux · 2024-11-12T07:35:41Z

kaiyux
Nov 12, 2024
Maintainer

Hi,

The TensorRT-LLM team is pleased to announce that we pushed an update to the development branch (and the Triton backend) on Nov 12, 2024.

This update includes:

Model Support
- Added support for Minitron, see examples/nemotron.
- Added a GPT variant, Granite (20B and 34B), see “GPT Variant - Granite” section in examples/gpt/README.md.
- Added support for LLaVA-OneVision model, see “LLaVA, LLaVa-NeXT, LLaVA-OneVision and VILA” section in examples/multimodal/README.md.
Features
- Added a trtllm-serve command to launch a FastAPI-based server.
- Added support for prompt-lookup speculative decoding, see examples/prompt_lookup/README.md.
- Added FP8 support for Nemotron NAS 51B. See examples/nemotron_nas/README.md.
- Integrated the QServe w4a8 per-group/per-channel quantization, see “w4aINT8 quantization (QServe)” section in examples/llama/README.md.
- Added a C++ example for fast logits using the executor API, see “executorExampleFastLogits” section in examples/cpp/executor/README.md.
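
For readers unfamiliar with prompt-lookup speculative decoding, the general idea can be sketched as follows. This is a minimal illustration of the technique only, not the TensorRT-LLM implementation: the function name `propose_draft_tokens` and its parameters are hypothetical, and the step where the target model verifies the draft tokens is omitted. See examples/prompt_lookup/README.md for actual usage.

```python
from typing import List


def propose_draft_tokens(
    token_ids: List[int],
    max_ngram_size: int = 3,      # hypothetical parameter: longest suffix to match
    num_draft_tokens: int = 4,    # hypothetical parameter: max tokens to propose
) -> List[int]:
    """Propose draft tokens by matching the latest n-gram against earlier tokens.

    The last n tokens of the sequence are searched for in the earlier part of
    the sequence (prompt plus generated text). If a match is found, the tokens
    that followed the match are returned as draft tokens.
    """
    n_tokens = len(token_ids)
    # Try longer n-grams first, since they give more specific matches.
    for ngram_size in range(min(max_ngram_size, n_tokens - 1), 0, -1):
        suffix = token_ids[n_tokens - ngram_size:]
        # Scan earlier positions, most recent first, excluding the suffix itself.
        for start in range(n_tokens - ngram_size - 1, -1, -1):
            if token_ids[start:start + ngram_size] == suffix:
                follow = start + ngram_size
                return token_ids[follow:follow + num_draft_tokens]
    # No earlier occurrence found: propose nothing.
    return []


if __name__ == "__main__":
    history = [5, 6, 7, 8, 9, 1, 2, 5, 6, 7]
    print(propose_draft_tokens(history))
```

In a full speculative decoding loop, the target model would then check these draft tokens in a single forward pass and keep only the prefix it agrees with.
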
API
- [BREAKING CHANGE] auto is now the default value for the --dtype option in the quantize and checkpoint conversion scripts.
- [BREAKING CHANGE] Deprecated gptManager API path in gptManagerBenchmark.
Bug fixes
- Fixed an issue where the moeTopK() kernel could not find the correct expert when the number of experts is not a power of two. Thanks @dongjiyingdjy for reporting this bug.
- Fixed an assertion failure on crossKvCacheFraction. (Assertion failed: Must set crossKvCacheFraction for encoder-decoder model #2419)
- Fixed an issue when using smoothquant to quantize the Qwen2 model. (Fix errors when using smoothquant to quantize Qwen2 model #2370)
- Fixed a PDL typo in docs/source/performance/perf-benchmarking.md, thanks @MARD1NO for pointing it out in Small Typo #2425.
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.10-py3.
- The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.10-py3.
- The dependent TensorRT version is updated to 10.6.
- The dependent CUDA version is updated to 12.6.2.
- The dependent PyTorch version is updated to 2.5.1.

Thanks,
The TensorRT-LLM Engineering Team
