TensorRT-LLM 0.9.0 Release
Hi,
We are very pleased to announce the 0.9.0 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.
This update includes:
- Model Support
  - Support distil-whisper, thanks to the contribution from @Bhuvanesh09 in PR #1061
  - Support HuggingFace StarCoder2
  - Support VILA
  - Support Smaug-72B-v0.1
  - Migrate BLIP-2 examples to `examples/multimodal`
- Features
  - [BREAKING CHANGE] TopP sampling optimization with deterministic AIR TopP algorithm is enabled by default
  - [BREAKING CHANGE] Support embedding sharing for Gemma
  - Add support for context chunking to work with KV cache reuse
  - Enable different rewind tokens per sequence for Medusa
  - BART LoRA support (limited to the Python runtime)
  - Enable multi-LoRA for BART LoRA
  - Support `early_stopping=False` in beam search for C++ Runtime
  - Add logits post processor to the batch manager (see docs/source/batch_manager.md#logits-post-processor-optional)
  - Support importing and converting HuggingFace Gemma checkpoints, thanks to the contribution from @mfuntowicz in #1147
  - Support loading Gemma from HuggingFace
  - Support auto parallelism planner for high-level API and unified builder workflow
  - Support running `GptSession` without OpenMPI #1220
  - Medusa IFB support
  - [Experimental] Support FP8 FMHA; note that the performance is not optimal, and we will keep optimizing it
  - More head sizes support for LLaMA-like models
    - Ampere (sm80, sm86), Ada (sm89), Hopper (sm90) all support head sizes [32, 40, 64, 80, 96, 104, 128, 160, 256] now.
  - OOTB functionality support
    - T5
    - Mixtral 8x7B
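To check whether a LLaMA-like model falls inside the head-size list above, a small lookup can help. The following is a minimal sketch, not a TensorRT-LLM API: the helper names `head_size` and `is_listed` are hypothetical, and it assumes the head size is simply the hidden size divided by the number of attention heads.

```python
# Values copied from the release note above; the helper names are hypothetical.
SUPPORTED_SM = {80, 86, 89, 90}  # Ampere (sm80, sm86), Ada (sm89), Hopper (sm90)
SUPPORTED_HEAD_SIZES = {32, 40, 64, 80, 96, 104, 128, 160, 256}


def head_size(hidden_size: int, num_attention_heads: int) -> int:
    """Per-head dimension, assuming hidden_size splits evenly across heads."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return hidden_size // num_attention_heads


def is_listed(sm: int, size: int) -> bool:
    """True if the (SM version, head size) pair appears in the list quoted above."""
    return sm in SUPPORTED_SM and size in SUPPORTED_HEAD_SIZES


# Example: a model with hidden size 4096 and 32 attention heads has head size 128.
size = head_size(4096, 32)
print(size, is_listed(80, size))
# A head size of 112 is not in the list quoted above.
print(is_listed(90, 112))
```

This only mirrors the list in this note; the actual kernel selection happens inside the library.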
- API
  - C++ `executor` API
    - Add Python bindings, see documentation and examples in `examples/bindings`
    - Add advanced and multi-GPU examples for the Python bindings of the `executor` C++ API, see `examples/bindings/README.md`
    - Add documentation for the C++ `executor` API, see `docs/source/executor.md`
  - High-level API (refer to `examples/high-level-api/README.md` for guidance)
    - [BREAKING CHANGE] Reuse the `QuantConfig` used in the `trtllm-build` tool to support broader quantization features
    - Support in the `LLM()` API for accepting engines built by the `trtllm-build` command
    - Add support for TensorRT-LLM checkpoint as model input
    - Refine the `SamplingConfig` used in the `LLM.generate` or `LLM.generate_async` APIs, with support for beam search, a variety of penalties, and more features
    - Add support for the StreamingLLM feature, enable it by setting `LLM(streaming_llm=...)`
    - Migrate Mixtral to high level API and unified builder workflow
  - [BREAKING CHANGE] Refactored Qwen model to the unified build workflow, see `examples/qwen/README.md` for the latest commands
  - [BREAKING CHANGE] Move LLaMA convert checkpoint script from examples directory into the core library
  - [BREAKING CHANGE] Refactor GPT with unified building workflow, see `examples/gpt/README.md` for the latest commands
  - [BREAKING CHANGE] Moved all the LoRA related flags from the convert_checkpoint.py script and the checkpoint content to the `trtllm-build` command, to generalize the feature better to more models
  - [BREAKING CHANGE] Removed the use_prompt_tuning flag and options from the convert_checkpoint.py script and the checkpoint content, to generalize the feature better to more models. Use `trtllm-build --max_prompt_embedding_table_size` instead.
  - [BREAKING CHANGE] Changed the `trtllm-build --world_size` flag to the `--auto_parallel` flag; the option is used for the auto parallel planner only.
  - [BREAKING CHANGE] `AsyncLLMEngine` is removed; the `tensorrt_llm.GenerationExecutor` class is refactored to work both when launched explicitly with `mpirun` at the application level and when accepting an MPI communicator created by `mpi4py`
  - [BREAKING CHANGE] `examples/server` is removed, see `examples/app` instead.
  - [BREAKING CHANGE] Remove LoRA related parameters from convert checkpoint scripts
  - [BREAKING CHANGE] Simplify Qwen convert checkpoint script
  - [BREAKING CHANGE] Remove `model` parameter from `gptManagerBenchmark` and `gptSessionBenchmark`
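For build scripts that previously passed use_prompt_tuning at checkpoint conversion time, the change above means the table size now goes to `trtllm-build`. Below is a minimal sketch of assembling such a command line. It only builds the argument list and does not run the tool. The `--checkpoint_dir` and `--output_dir` flags, the paths, and the value 128 are assumptions for illustration; please check `trtllm-build --help` and the example READMEs for the exact options of your model.

```python
import shlex


def build_command(checkpoint_dir, output_dir, max_prompt_embedding_table_size=0):
    """Assemble a trtllm-build argument list (hypothetical helper, not part of the library)."""
    # --checkpoint_dir and --output_dir are assumed here; verify with `trtllm-build --help`.
    cmd = [
        "trtllm-build",
        "--checkpoint_dir", checkpoint_dir,
        "--output_dir", output_dir,
    ]
    # Prompt tuning is requested at build time through the flag named in the note above.
    if max_prompt_embedding_table_size > 0:
        cmd += ["--max_prompt_embedding_table_size", str(max_prompt_embedding_table_size)]
    return cmd


# Placeholder paths and table size, for illustration only.
cmd = build_command("./ckpt", "./engine", max_prompt_embedding_table_size=128)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once the flags have been confirmed for your setup.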
- Bug fixes
  - Fix a weight-only quant bug for Whisper to make sure that the `encoder_input_len_range` is not 0, thanks to the contribution from @Eddie-Wang1120 in #992
  - Fix the issue that log probabilities in Python runtime are not returned #983
  - Multi-GPU fixes for multimodal examples #1003
  - Fix wrong `end_id` issue for Qwen #987
  - Fix a non-stopping generation issue #1118 #1123
  - Fix wrong link in examples/mixtral/README.md #1181
  - Fix LLaMA2-7B bad results when int8 kv cache and per-channel int8 weight only are enabled #967
  - Fix wrong `head_size` when importing Gemma model from HuggingFace Hub, thanks to the contribution from @mfuntowicz in #1148
  - Fix ChatGLM2-6B building failure on INT8 #1239
  - Fix wrong relative path in Baichuan documentation #1242
  - Fix wrong `SamplingConfig` tensors in `ModelRunnerCpp` #1183
  - Fix error when converting SmoothQuant LLaMA #1267
  - Fix the issue that `examples/run.py` only loads one line from `--input_file`
  - Fix the issue that `ModelRunnerCpp` does not transfer `SamplingConfig` tensor fields correctly #1183
- Benchmark
  - Add emulated static batching in `gptManagerBenchmark`
  - Support arbitrary dataset from HuggingFace for C++ benchmarks, see “Prepare dataset” section in `benchmarks/cpp/README.md`
  - Add percentile latency report to `gptManagerBenchmark`
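For readers less familiar with percentile latency, the sketch below shows one common way to derive P50/P90/P99 from per-request latencies. It is an illustration only and not necessarily how `gptManagerBenchmark` computes its report: the nearest-rank method is chosen here for simplicity, and the sample latencies are made up.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up per-request latencies in milliseconds.
latencies_ms = [12.0, 15.5, 11.2, 40.3, 13.1, 14.8, 12.9, 90.7, 13.4, 12.2]
report = {f"P{p}": percentile(latencies_ms, p) for p in (50, 90, 99)}
print(report)
```

Tail percentiles such as P99 highlight slow outlier requests that an average latency would hide.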
- Performance
- Infra
  - Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.02-py3`
  - Base Docker image for TensorRT-LLM backend is updated to `nvcr.io/nvidia/tritonserver:24.02-py3`
  - The dependent TensorRT version is updated to 9.3
  - The dependent PyTorch version is updated to 2.2
  - The dependent CUDA version is updated to 12.3.2 (a.k.a. 12.3 Update 2)
Currently, there are two key branches in the project:
- The rel branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
- The main branch is the dev branch. It is more experimental.
We are updating the main branch regularly with new features, bug fixes and performance optimizations. The stable branch will be updated less frequently, and the exact frequency depends on your feedback.
Thanks,
The TensorRT-LLM Engineering Team