TensorRT-LLM 0.9.0 Release #1451
kaiyux announced in Announcements
Hi,
We are very pleased to announce the 0.9.0 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.
This update includes:
- Migrated BLIP-2 examples to `examples/multimodal`.
- Added support for `early_stopping=False` in beam search for the C++ Runtime.
- Added Gemma support, including importing checkpoints from HuggingFace `transformers` (Gemma implementation #1147).
- Added support for running `GptSession` without OpenMPI (Run GptSession without openmpi? #1220).
- Added the C++ `executor` API with Python bindings, see examples in `examples/bindings` (a short usage sketch follows below).
- Added advanced and multi-GPU examples for the Python bindings of the `executor` C++ API, see `examples/bindings/README.md`.
- Added documentation for the C++ `executor` API, see `docs/source/executor.md`.
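To give a feel for the new bindings, here is a minimal sketch modeled on the basic example in `examples/bindings`; the class and argument names follow that example for this release, the engine path is a hypothetical placeholder, and details may change in later versions.

```python
# Minimal sketch of the executor Python bindings, modeled on examples/bindings;
# names may differ slightly between versions.
import tensorrt_llm.bindings.executor as trtllm

# Load a decoder-only engine built with trtllm-build.
executor = trtllm.Executor(
    "/path/to/engine_dir",           # hypothetical engine location
    trtllm.ModelType.DECODER_ONLY,
    trtllm.ExecutorConfig(1),        # positional arg: max beam width
)

if executor.can_enqueue_requests():
    # Enqueue a request with pre-tokenized input and wait for the result.
    request = trtllm.Request(input_token_ids=[1, 2, 3, 4], max_new_tokens=8)
    request_id = executor.enqueue_request(request)
    for response in executor.await_responses(request_id):
        print(response.result.output_token_ids)
```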
- High-level API improvements (refer to `examples/high-level-api/README.md` for guidance; a short sketch also follows below):
  - Reused the `QuantConfig` from the `trtllm-build` tool, supporting broader quantization features.
  - Added support in the `LLM()` API to accept engines built by the `trtllm-build` command.
  - Refined the `SamplingConfig` used in the `LLM.generate` and `LLM.generate_async` APIs, with support for beam search, a variety of penalties, and more features.
  - Added support for the StreamingLLM feature; enable it via `LLM(streaming_llm=...)`.
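As a quick orientation, here is a minimal sketch of the high-level API flow; the `hlapi` module path, `ModelConfig`, and the model directory are assumptions based on `examples/high-level-api/README.md` and may differ in other releases.

```python
# Minimal sketch of the high-level API; module path and class names follow
# examples/high-level-api/README.md and may differ in other releases.
from tensorrt_llm.hlapi.llm import LLM, ModelConfig

# Point the config either at a HuggingFace model directory or at engines
# built by `trtllm-build` (newly supported in this release).
config = ModelConfig(model_dir="/path/to/model_or_engine_dir")  # hypothetical path
llm = LLM(config)

# SamplingConfig now carries beam search, penalties, and more; generation can
# be synchronous (generate) or asynchronous (generate_async).
for output in llm.generate(["To tell a story"]):
    print(output)
```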
- Refactored the Qwen model to the unified build workflow, see `examples/qwen/README.md` for the latest commands.
- Refactored the GPT model to the unified build workflow, see `examples/gpt/README.md` for the latest commands.
- Moved the LoRA-related options from the checkpoint-conversion scripts to the `trtllm-build` command, to generalize the feature better to more models.
- Removed the prompt-tuning options from the checkpoint-conversion scripts; use `trtllm-build --max_prompt_embedding_table_size` instead.
- Changed the `trtllm-build --world_size` flag to the `--auto_parallel` flag; the option is used for the auto parallel planner only.
- Removed `AsyncLLMEngine`; the `tensorrt_llm.GenerationExecutor` class is refactored to work both when launched explicitly with `mpirun` at the application level and when given an MPI communicator created by `mpi4py` (see the sketch below).
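The two launch modes for the refactored `tensorrt_llm.GenerationExecutor` can be pictured as follows. This sketch only shows the `mpi4py` side; how the communicator is handed to the executor is left schematic, since the exact parameter name is not given here.

```python
# Sketch of the two GenerationExecutor launch modes described above.
from mpi4py import MPI

# Mode 1: the whole application is started with
#   mpirun -n <world_size> python app.py
# and every rank runs this script, constructing its part of the executor.
world = MPI.COMM_WORLD
rank = world.Get_rank()

# Mode 2: the application creates (or splits) a communicator with mpi4py and
# hands it to tensorrt_llm.GenerationExecutor. For example, dedicate the
# first two ranks to a 2-way tensor-parallel model:
color = 0 if rank < 2 else MPI.UNDEFINED
model_comm = world.Split(color, key=rank)  # sub-communicator for the model ranks
```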
- Removed `examples/server`; see `examples/app` instead.
- Removed the `model` parameter from `gptManagerBenchmark` and `gptSessionBenchmark`.
- Fixed a Whisper bug to make sure that `encoder_input_len_range` is not 0, thanks to the contribution from @Eddie-Wang1120 (Fix enc_dec bug and Make several improvements to whisper #992).
- Fixed the wrong `end_id` issue for Qwen (qwen end_id setting is wrong so cannot stop at right postition! #987).
- Fixed the wrong `head_size` when importing the Gemma model from HuggingFace Hub, thanks for the contribution from @mfuntowicz (Specify the head_size from the config when importing Gemma from Hugging Face. #1148).
- Fixed wrong `SamplingConfig` tensors in `ModelRunnerCpp` (ModelRunnerCpp does not transfer SamplingConfig Tensor fields correctly #1183).
- Fixed the issue that `examples/run.py` only loads one line from `--input_file`.
- Fixed the issue that `ModelRunnerCpp` does not transfer `SamplingConfig` tensor fields correctly (ModelRunnerCpp does not transfer SamplingConfig Tensor fields correctly #1183).
- Added emulated static batching in `gptManagerBenchmark`.
- Supported arbitrary datasets from HuggingFace for the C++ benchmarks; see the "Prepare dataset" section in `benchmarks/cpp/README.md`.
- Added a percentile latency report to `gptManagerBenchmark`.
- Optimized `gptDecoderBatch` to support batched sampling.
- Updated the base Docker image for TensorRT-LLM to `nvcr.io/nvidia/pytorch:24.02-py3`.
- Updated the base Docker image for the TensorRT-LLM backend to `nvcr.io/nvidia/tritonserver:24.02-py3`.
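After pulling an image based on the updated containers, a one-line check confirms which wheel is installed:

```python
# Verify the installed TensorRT-LLM version inside the updated container.
import tensorrt_llm

print(tensorrt_llm.__version__)  # expected to print 0.9.0 for this release
```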
Currently, there are two key branches in the project:
- The rel branch is the stable branch for releases of TensorRT-LLM. It has been QA-ed and carefully tested.
- The main branch is the development branch. It is more experimental.

We are updating the main branch regularly with new features, bug fixes and performance optimizations. The stable branch will be updated less frequently, and the exact frequency will depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team