TensorRT-LLM Release 0.16.0 #2614
kaiyux announced in Announcements
Hi,
We are very pleased to announce the 0.16.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Added quantization support for RecurrentGemma. Refer to `examples/recurrentgemma/README.md`.
- Added Ulysses context parallel support. Refer to `examples/llama/README.md`.
- Added a runtime `max_num_tokens` dynamic tuning feature, which can be enabled by setting `--enable_max_num_tokens_tuning` to `gptManagerBenchmark`.
- Added `max_num_tokens` and `max_batch_size` arguments to control the runtime parameters.
- Added `extended_runtime_perf_knob_config` to enable various performance configurations.
- Added AutoAWQ checkpoints support for Qwen. Refer to the "INT4-AWQ" section in `examples/qwen/README.md`.
- Added AutoAWQ and AutoGPTQ Hugging Face checkpoints support for LLaMA (#2458).
- Added `allottedTimeMs` to the C++ `Request` class to support per-request timeout (see the sketch after this list).
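Several of these knobs surface together in the executor API. Below is a minimal Python sketch, not taken from the release notes: the engine path and token ids are placeholders, and the `allotted_time_ms` keyword assumes the Python binding mirrors the new C++ `Request` field.

```python
import datetime

from tensorrt_llm.bindings import executor as trtllm

# Example performance configuration via the extended runtime perf knobs.
perf_knobs = trtllm.ExtendedRuntimePerfKnobConfig(
    multi_block_mode=True,
    enable_context_fmha_fp32_acc=False,
)
config = trtllm.ExecutorConfig(
    max_beam_width=1,
    max_batch_size=64,    # runtime override of the engine-build defaults
    max_num_tokens=2048,
    extended_runtime_perf_knob_config=perf_knobs,
)
executor = trtllm.Executor(
    "/path/to/engine_dir",  # placeholder: a prebuilt TensorRT-LLM engine
    trtllm.ModelType.DECODER_ONLY,
    config,
)

request = trtllm.Request(
    input_token_ids=[1, 2, 3, 4],  # illustrative token ids
    max_tokens=32,
    # New in this release: per-request time budget (keyword name assumed
    # to mirror the C++ allottedTimeMs field).
    allotted_time_ms=datetime.timedelta(milliseconds=500),
)
request_id = executor.enqueue_request(request)
for response in executor.await_responses(request_id):
    if not response.has_error():
        print(response.result.output_token_ids)
```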
API Changes
- [BREAKING CHANGE] Removed the `enable_xqa` argument from `trtllm-build`.
- [BREAKING CHANGE] Removed the `--use_embedding_sharing` flag from convert checkpoints scripts.
- [BREAKING CHANGE] The `if __name__ == "__main__"` entry point is required for both single-GPU and multi-GPU cases when using the `LLM` API (see the sketch after this list).
- Added the `enable_chunked_prefill` flag to the `LlmArgs` of the `LLM` API.
- Integrated BERT and RoBERTa models to the `trtllm-build` command.
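For illustration, a minimal sketch of the new entry-point requirement together with the new flag. The model name is a placeholder, and passing `enable_chunked_prefill` directly to `LLM` assumes keyword arguments are forwarded to `LlmArgs`.

```python
from tensorrt_llm import LLM, SamplingParams

def main():
    llm = LLM(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
        enable_chunked_prefill=True,  # new LlmArgs flag in 0.16.0
    )
    params = SamplingParams(max_tokens=32)
    for output in llm.generate(["Hello, my name is"], params):
        print(output.outputs[0].text)

# Required entry point for both single-GPU and multi-GPU LLM API runs.
if __name__ == "__main__":
    main()
```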
Model Updates
- Added Qwen2-VL support. Refer to the "Qwen2-VL" section of `examples/multimodal/README.md`.
- Added multimodal evaluation examples in `examples/multimodal`.
- Added Stable Diffusion XL support. Refer to `examples/sdxl/README.md`. Thanks for the contribution from @Zars19 in #1514.

Fixed Issues
- Fixed `sampling_params` to only be set up if `end_id` is None and `tokenizer` is not None in the `LLM` API (see the sketch below). Thanks to the contribution from @mfuntowicz in #2573.
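As a hedged illustration of the fixed behavior (the model name and token ids are placeholders, not from the release notes):

```python
from tensorrt_llm import LLM, SamplingParams

def main():
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    # An end_id passed explicitly by the caller is now respected; the
    # tokenizer-based defaulting only runs when end_id is None.
    params = SamplingParams(max_tokens=16, end_id=2, pad_id=2)
    for output in llm.generate(["The capital of France is"], params):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```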
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.11-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.11-py3`.

Known Issues
- There is a known AllReduce performance issue on AMD-based CPU platforms, which can be worked around by setting `export NCCL_P2P_LEVEL=SYS`.
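The same workaround can be applied from Python before the multi-GPU runtime starts, for example:

```python
import os

# Equivalent to `export NCCL_P2P_LEVEL=SYS` in the shell; it must be set
# before the first NCCL communicator is created.
os.environ["NCCL_P2P_LEVEL"] = "SYS"
```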
We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequencies depend on your feedback.

Thanks,
The TensorRT-LLM Engineering Team