Update TensorRT-LLM Release branch (#745)

* Update TensorRT-LLM

---------

Co-authored-by: Shixiaowei02 <[email protected]>
NVIDIA · Dec 26, 2023 · 80bc075
1 parent a8018c1
commit 80bc075

Showing 19 changed files with 450 additions and 169 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,99 @@
+# Change Log
+
+## Versions 0.6.0 / 0.6.1
+
+  * Models
+      * ChatGLM3
+      * InternLM (contributed by @wangruohui)
+      * Mistral 7B (developed in collaboration with Mistral.AI)
+      * MQA/GQA support to MPT (and GPT) models (contributed by @bheilbrun)
+      * Qwen (contributed by @Tlntin and @zhaohb)
+      * Replit Code V-1.5 3B (external contribution)
+      * T5, mT5, Flan-T5 (Python runtime only)
+
+  * Features
+      * Add runtime statistics related to active requests and KV cache
+        utilization from the batch manager (see
+        the [batch manager](docs/source/batch_manager.md) documentation)
+      * Add `sequence_length` tensor to support proper lengths in beam-search
+        (when beam-width > 1 - see
+        [tensorrt_llm/batch_manager/GptManager.h](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
+      * BF16 support for encoder-decoder models (Python runtime - see
+        [examples/enc_dec](examples/enc_dec/README.md))
+      * Improvements to memory utilization (CPU and GPU - including memory
+        leaks)
+      * Improved error reporting and memory consumption
+      * Improved support for stop and bad words
+      * INT8 SmoothQuant and INT8 KV Cache support for the Baichuan models (see
+        [examples/baichuan](examples/baichuan/README.md))
+      * INT4 AWQ Tensor Parallelism support and INT8 KV cache + AWQ/weight-only
+        support for the GPT-J model (see [examples/gptj](examples/gptj/README.md))
+      * INT4 AWQ support for the Falcon models
+        (see [examples/falcon](examples/falcon/README.md))
+      * LoRA support (functional preview only - limited to the Python runtime,
+        only QKV support and not optimized in terms of runtime performance) for
+        the GPT model (see the
+        [Run LoRA with the Nemo checkpoint](examples/gpt/README.md#Run-LoRA-with-the-Nemo-checkpoint)
+        in the GPT example)
+      * Multi-GPU support for encoder-decoder models (Python runtime - see
+        [examples/enc_dec](examples/enc_dec/README.md))
+      * New heuristic for launching the Multi-block Masked MHA kernel (similar
+        to FlashDecoding - see
+        [decoderMaskedMultiheadAttentionLaunch.h](cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionLaunch.h))
+      * Prompt-Tuning support for GPT and LLaMA models (see the
+        [Prompt-tuning](examples/gpt/README.md#Prompt-tuning) Section in the GPT example)
+      * Performance optimizations in various CUDA kernels
+      * Possibility to exclude input tokens from the output (see `excludeInputInOutput` in
+        [`GptManager`](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
+      * Python binding for the C++ runtime (GptSession - see [`pybind`](cpp/tensorrt_llm/pybind))
+      * Support for different micro batch sizes for context and generation
+        phases with pipeline parallelism (see `GptSession::Config::ctxMicroBatchSize` and
+        `GptSession::Config::genMicroBatchSize` in
+        [tensorrt_llm/runtime/gptSession.h](cpp/include/tensorrt_llm/runtime/gptSession.h))
+      * Support for "remove input padding" for encoder-decoder models (see
+        [examples/enc_dec](examples/enc_dec/README.md))
+      * Support for context and generation logits (see `mComputeContextLogits` and
+        `mComputeGenerationLogits` in
+        [tensorrt_llm/runtime/gptModelConfig.h](cpp/include/tensorrt_llm/runtime/gptModelConfig.h))
+      * Support for `logProbs` and `cumLogProbs` (see `"output_log_probs"` and
+        `"cum_log_probs"` in [`GptManager`](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
+      * Update to CUTLASS 3.x
+
+  * Bug fixes
+      * Fix for ChatGLM2 #93 and #138
+      * Fix tensor names error "RuntimeError: Tensor names
+        (`host_max_kv_cache_length`) in engine are not the same as expected in
+        the main branch" #369
+      * Fix weights split issue in BLOOM when `world_size = 2` ("array split
+        does not result in an equal division") #374
+      * Fix SmoothQuant multi-GPU failure when tensor parallelism is 2 #267
+      * Fix a crash in GenerationSession if stream keyword argument is not None
+        #202
+      * Fix a typo when calling PyNVML API [BUG] code bug #410
+      * Fix bugs related to the improper management of the `end_id` for various
+        models [C++ and Python]
+      * Fix memory leaks [C++ code and Python models]
+      * Fix the std::bad_alloc error when running the gptManagerBenchmark -- issue
+        gptManagerBenchmark std::bad_alloc error #66
+      * Fix a bug in pipeline parallelism when beam-width > 1
+      * Fix a bug with Llama GPTQ due to improper support of GQA
+      * Fix issue #88
+      * Fix an issue with the Huggingface Transformers version #16
+      * Fix link jump in windows readme.md #30 - by @yuanlehome
+      * Fix typo in batchScheduler.h #56 - by @eltociear
+      * Fix typo #58 - by @RichardScottOZ
+      * Fix Multi-block MMHA: Difference between `max_batch_size` in the engine
+        builder and `max_num_sequences` in TrtGptModelOptionalParams? #65
+      * Fix the log message to be more accurate on KV cache #224
+      * Fix Windows release wheel installation: Failed to install the release
+        wheel for Windows using pip #261
+      * Fix missing torch dependencies: [BUG] The batch_manage.a choice error
+        in --cpp-only when torch's cxx_abi version is different with gcc #151
+      * Fix linking error during compiling google-test & benchmarks #277
+      * Fix logits dtype for Baichuan and ChatGLM: segmentation fault caused by
+        the lack of bfloat16 #335
+      * Minor bug fixes
+
+## Version 0.5.0
+
+  * TensorRT-LLM v0.5.0 is the first public release.
diff --git a/README.md b/README.md
@@ -8,7 +8,7 @@ TensorRT-LLM
 [![python](https://img.shields.io/badge/python-3.10.12-green)](https://www.python.org/downloads/release/python-31012/)
 [![cuda](https://img.shields.io/badge/cuda-12.2-green)](https://developer.nvidia.com/cuda-downloads)
 [![trt](https://img.shields.io/badge/TRT-9.2-green)](https://developer.nvidia.com/tensorrt)
-[![version](https://img.shields.io/badge/release-0.7.0-green)](./setup.py)
+[![version](https://img.shields.io/badge/release-0.7.1-green)](./setup.py)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)
 
 [Architecture](./docs/source/architecture.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Results](./docs/source/performance.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Examples](./examples/)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentation](./docs/source/)
@@ -108,16 +108,16 @@ concepts used in TensorRT-LLM, we recommend you to read the following
 
 ## Installation
 
-*For Windows installation, see [`Windows`](windows/README.md).*
-
-TensorRT-LLM must be built from source, instructions can be found
+The documentation for installing TensorRT-LLM can be found
 [here](./docs/source/installation.md). An image of a Docker container with
 TensorRT-LLM and its Triton Inference Server Backend will be made available
 soon.
 
 The remaining commands in that document must be executed from the TensorRT-LLM
 container.
 
+*For Windows installation, see [`Windows`](windows/README.md).*
+
 ## Quick Start
 
 To create a TensorRT engine for an existing model, there are 3 steps:
@@ -379,103 +379,43 @@ For example: `mpirun -n 1 python3 examples/gpt/build.py ...`
 
 ### Change Log
 
-#### Version 0.6.1
-
-  * Models
-      * ChatGLM3
-      * InternLM (contributed by @wangruohui)
-      * Mistral 7B (developed in collaboration with Mistral.AI)
-      * MQA/GQA support to MPT (and GPT) models (contributed by @bheilbrun)
-      * Qwen (contributed by @Tlntin and @zhaohb)
-      * Replit Code V-1.5 3B (external contribution)
-      * T5, mT5, Flan-T5 (Python runtime only)
-
-  * Features
-      * Add runtime statistics related to active requests and KV cache
-        utilization from the batch manager (see
-        the [batch manager](docs/source/batch_manager.md) documentation)
-      * Add `sequence_length` tensor to support proper lengths in beam-search
-        (when beam-width > 1 - see
-        [tensorrt_llm/batch_manager/GptManager.h](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
-      * BF16 support for encoder-decoder models (Python runtime - see
-        [examples/enc_dec](examples/enc_dec/README.md))
-      * Improvements to memory utilization (CPU and GPU - including memory
-        leaks)
-      * Improved error reporting and memory consumption
-      * Improved support for stop and bad words
-      * INT8 SmoothQuant and INT8 KV Cache support for the Baichuan models (see
-        [examples/baichuan](examples/baichuan/README.md))
-      * INT4 AWQ Tensor Parallelism support and INT8 KV cache + AWQ/weight-only
-        support for the GPT-J model (see [examples/gptj](examples/gptj/README.md))
-      * INT4 AWQ support for the Falcon models
-        (see [examples/falcon](examples/falcon/README.md))
-      * LoRA support (functional preview only - limited to the Python runtime,
-        only QKV support and not optimized in terms of runtime performance) for
-        the GPT model (see the
-        [Run LoRA with the Nemo checkpoint](examples/gpt/README.md#Run-LoRA-with-the-Nemo-checkpoint)
-        in the GPT example)
-      * Multi-GPU support for encoder-decoder models (Python runtime - see
-        [examples/enc_dec](examples/enc_dec/README.md))
-      * New heuristic for launching the Multi-block Masked MHA kernel (similar
-        to FlashDecoding - see
-        [decoderMaskedMultiheadAttentionLaunch.h](cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionLaunch.h))
-      * Prompt-Tuning support for GPT and LLaMA models (see the
-        [Prompt-tuning](examples/gpt/README.md#Prompt-tuning) Section in the GPT example)
-      * Performance optimizations in various CUDA kernels
-      * Possibility to exclude input tokens from the output (see `excludeInputInOutput` in
-        [`GptManager`](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
-      * Python binding for the C++ runtime (GptSession - see [`pybind`](cpp/tensorrt_llm/pybind))
-      * Support for different micro batch sizes for context and generation
-        phases with pipeline parallelism (see `GptSession::Config::ctxMicroBatchSize` and
-        `GptSession::Config::genMicroBatchSize` in
-        [tensorrt_llm/runtime/gptSession.h](cpp/include/tensorrt_llm/runtime/gptSession.h))
-      * Support for "remove input padding" for encoder-decoder models (see
-        [examples/enc_dec](examples/enc_dec/README.md))
-      * Support for context and generation logits (see `mComputeContextLogits` and
-        `mComputeGenerationLogits` in
-        [tensorrt_llm/runtime/gptModelConfig.h](cpp/include/tensorrt_llm/runtime/gptModelConfig.h))
-      * Support for `logProbs` and `cumLogProbs` (see `"output_log_probs"` and
-        `"cum_log_probs"` in [`GptManager`](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
-      * Update to CUTLASS 3.x
-
-  * Bug fixes
-      * Fix for ChatGLM2 #93 and #138
-      * Fix tensor names error "RuntimeError: Tensor names
-        (`host_max_kv_cache_length`) in engine are not the same as expected in
-        the main branch" #369
-      * Fix weights split issue in BLOOM when `world_size = 2` ("array split
-        does not result in an equal division") #374
-      * Fix SmoothQuant multi-GPU failure with tensor parallelism is 2 #267
-      * Fix a crash in GenerationSession if stream keyword argument is not None
-        #202
-      * Fix a typo when calling PyNVML API [BUG] code bug #410
-      * Fix bugs related to the improper management of the `end_id` for various
-        models [C++ and Python]
-      * Fix memory leaks [C++ code and Python models]
-      * Fix the std::alloc error when running the gptManagerBenchmark -- issue
-        gptManagerBenchmark std::bad_alloc error #66
-      * Fix a bug in pipeline parallelism when beam-width > 1
-      * Fix a bug with Llama GPTQ due to improper support of GQA
-      * Fix issue #88
-      * Fix an issue with the Huggingface Transformers version #16
-      * Fix link jump in windows readme.md #30 - by @yuanlehome
-      * Fix typo in batchScheduler.h #56 - by @eltociear
-      * Fix typo #58 - by @RichardScottOZ
-      * Fix Multi-block MMHA: Difference between `max_batch_size` in the engine
-        builder and `max_num_sequences` in TrtGptModelOptionalParams? #65
-      * Fix the log message to be more accurate on KV cache #224
-      * Fix Windows release wheel installation: Failed to install the release
-        wheel for Windows using pip #261
-      * Fix missing torch dependencies: [BUG] The batch_manage.a choice error
-        in --cpp-only when torch's cxx_abi version is different with gcc #151
-      * Fix linking error during compiling google-test & benchmarks #277
-      * Fix logits dtype for Baichuan and ChatGLM: segmentation fault caused by
-        the lack of bfloat16 #335
-      * Minor bug fixes
-
-#### Version 0.5.0
-
-  * TensorRT-LLM v0.5.0 is the first public release.
+#### Versions 0.7.0 / 0.7.1
+
+* Models
+  - BART and mBART support in encoder-decoder models
+  - FairSeq Neural Machine Translation (NMT) family
+  - Mixtral-8x7B model
+    - Support weight loading for HuggingFace Mixtral model
+  - OpenAI Whisper
+  - Mixture of Experts support
+  - MPT - Int4 AWQ / SmoothQuant support
+  - Baichuan FP8 quantization support
+* Features
+  - [Preview] Speculative decoding
+  - Add Python binding for `GptManager`
+  - Add a Python class `ModelRunnerCpp` that wraps C++ `gptSession`
+  - System prompt caching
+  - Enable split-k for weight-only cutlass kernels
+  - FP8 KV cache support for XQA kernel
+  - New Python builder API and `trtllm-build` command (already applied to [blip2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/blip2) and [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/opt#3-build-tensorrt-engines))
+  - Support `StoppingCriteria` and `LogitsProcessor` in Python generate API (thanks to the contribution from @zhang-ge-hao)
+  - fMHA support for chunked attention and paged kv cache
+* Bug fixes
+  - Fix tokenizer usage in quantize.py #288, thanks to the contribution from @0xymoro
+  - Fix LLaMa with LoRA error #637
+  - Fix LLaMA GPTQ failure #580
+  - Fix Python binding for InferenceRequest issue #528
+  - Fix CodeLlama SQ accuracy issue #453
+* Performance
+  - MMHA optimization for MQA and GQA
+  - LoRA optimization: cutlass grouped gemm
+  - Optimize Hopper warp specialized kernels
+  - Optimize AllReduce for parallel attention on Falcon and GPT-J
+  - Enable split-k for weight-only cutlass kernel when SM>=75
+* Documentation
+  - Add [documentation for new builder workflow](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/new_workflow.md)
+
+#### For history change log, please see [CHANGELOG.md](./CHANGELOG.md).
 
 ### Known Issues
 

diff --git a/benchmarks/python/allowed_configs.py b/benchmarks/python/allowed_configs.py
@@ -232,6 +232,7 @@ class ModelConfig:
                     builder_opt=None,
                     pre_norm=False,
                     do_layer_norm_before=False,
+                    use_custom_all_reduce=False,
                 )),
     "opt_2.7b":
     ModelConfig(name="opt_2.7b",
@@ -250,6 +251,7 @@ class ModelConfig:
                     builder_opt=None,
                     pre_norm=False,
                     do_layer_norm_before=True,
+                    use_custom_all_reduce=False,
                 )),
     "opt_6.7b":
     ModelConfig(name="opt_6.7b",
@@ -268,6 +270,7 @@ class ModelConfig:
                     builder_opt=None,
                     pre_norm=False,
                     do_layer_norm_before=True,
+                    use_custom_all_reduce=False,
                 )),
     "opt_66b":
     ModelConfig(name="opt_66b",
@@ -286,6 +289,7 @@ class ModelConfig:
                     builder_opt=None,
                     pre_norm=True,
                     do_layer_norm_before=True,
+                    use_custom_all_reduce=False,
                 )),
     "llama_7b":
     ModelConfig(name="llama_7b",
@@ -512,6 +516,7 @@ class ModelConfig:
                     max_output_len=200,
                     builder_opt=None,
                     remove_input_padding=False,
+                    use_custom_all_reduce=False,
                 )),
     "bloom_560m":
     ModelConfig(name="bloom_560m",
@@ -528,6 +533,7 @@ class ModelConfig:
                     max_input_len=1024,
                     max_output_len=1024,
                     builder_opt=None,
+                    use_custom_all_reduce=False,
                 )),
     "bloom_176b":
     ModelConfig(name="bloom_176b",
@@ -544,6 +550,7 @@ class ModelConfig:
                     max_input_len=1024,
                     max_output_len=1024,
                     builder_opt=None,
+                    use_custom_all_reduce=False,
                 )),
     "bert_base":
     ModelConfig(name="bert_base",

diff --git a/cpp/tensorrt_llm/batch_manager/aarch64-linux-gnu/libtensorrt_llm_batch_manager_static.a b/cpp/tensorrt_llm/batch_manager/aarch64-linux-gnu/libtensorrt_llm_batch_manager_static.a
diff --git a/...orrt_llm/batch_manager/aarch64-linux-gnu/libtensorrt_llm_batch_manager_static.pre_cxx11.a b/...orrt_llm/batch_manager/aarch64-linux-gnu/libtensorrt_llm_batch_manager_static.pre_cxx11.a
diff --git a/cpp/tensorrt_llm/batch_manager/aarch64-linux-gnu/version.txt b/cpp/tensorrt_llm/batch_manager/aarch64-linux-gnu/version.txt
@@ -1,3 +1,3 @@
-516ff2db1e17536e92150b0c05200589  libtensorrt_llm_batch_manager_static.a
-428a500536705184a1aad8aaf5c9c0ca  libtensorrt_llm_batch_manager_static.pre_cxx11.a
-33b6139e3bb108df093aab3a6de38a87f1f1e2dd commit
+ffe001b0bf9ee66b3e3696423d6d09a2  libtensorrt_llm_batch_manager_static.a
+3657ea3400959a64be77c12d8598dd72  libtensorrt_llm_batch_manager_static.pre_cxx11.a
+9a775b3dbb20444f130f13f90e675cc971fe7e15 commit
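The `version.txt` hunk above pairs a hex digest with each prebuilt static library and ends with a 40-character `commit` line. A minimal sketch of how such a manifest could be checked locally follows. It assumes the 32-character values are MD5 digests of the library files (the commit does not say so), and the helper names `parse_manifest` and `verify` are hypothetical, not part of TensorRT-LLM.

```python
import hashlib
import re
import tempfile
from pathlib import Path

# Assumption: manifest lines look like "<32 hex chars>  <file name>" and the
# digests are MD5. The trailing "<40 hex chars> commit" line does not match
# this pattern and is therefore skipped.
MD5_LINE = re.compile(r"^([0-9a-f]{32})\s+(\S+)$")


def parse_manifest(text):
    """Return {file name: digest} for every MD5-shaped line in the manifest."""
    entries = {}
    for line in text.splitlines():
        match = MD5_LINE.match(line.strip())
        if match:
            entries[match.group(2)] = match.group(1)
    return entries


def verify(directory, manifest_name="version.txt"):
    """Return {file name: True/False} depending on whether each digest matches."""
    directory = Path(directory)
    expected = parse_manifest((directory / manifest_name).read_text())
    results = {}
    for name, digest in expected.items():
        actual = hashlib.md5((directory / name).read_bytes()).hexdigest()
        results[name] = actual == digest
    return results


if __name__ == "__main__":
    # Self-contained demo with a throwaway file instead of the real libraries.
    with tempfile.TemporaryDirectory() as tmp:
        payload = b"demo archive contents"
        Path(tmp, "demo_static.a").write_bytes(payload)
        digest = hashlib.md5(payload).hexdigest()
        Path(tmp, "version.txt").write_text(f"{digest}  demo_static.a\n")
        print(verify(tmp))
```

Pointing `verify` at a directory such as `cpp/tensorrt_llm/batch_manager/aarch64-linux-gnu` would report, per library, whether the local file matches the digest recorded in this commit, provided the MD5 assumption holds.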
diff --git a/cpp/tensorrt_llm/batch_manager/x86_64-linux-gnu/libtensorrt_llm_batch_manager_static.a b/cpp/tensorrt_llm/batch_manager/x86_64-linux-gnu/libtensorrt_llm_batch_manager_static.a
diff --git a/...sorrt_llm/batch_manager/x86_64-linux-gnu/libtensorrt_llm_batch_manager_static.pre_cxx11.a b/...sorrt_llm/batch_manager/x86_64-linux-gnu/libtensorrt_llm_batch_manager_static.pre_cxx11.a
diff --git a/cpp/tensorrt_llm/batch_manager/x86_64-linux-gnu/version.txt b/cpp/tensorrt_llm/batch_manager/x86_64-linux-gnu/version.txt
@@ -1,2 +1,2 @@
-0403e89a23fd77aed43cac0ecd8136cf  libtensorrt_llm_batch_manager_static.a
-9fa2a1c18860eaf226a6ce61a8e3ed5d  libtensorrt_llm_batch_manager_static.pre_cxx11.a
+bb69bf376c5f955c327e867049639d78  libtensorrt_llm_batch_manager_static.a
+14b107676c74ce17bfc8ce950b36a984  libtensorrt_llm_batch_manager_static.pre_cxx11.a
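The `fmhaRunner.cpp` hunk that follows replaces `M_LOG2E` with a local constant `kLog2e` and multiplies `scale_bmm1` by it; the surrounding comment calls this the log2f optimization. The identity behind it is exp(x) = 2^(x · log2(e)), so a scale factor can absorb log2(e) once and the exponentials can then be taken in base 2. The sketch below illustrates that identity on a plain softmax. It is an illustration only: the function names are hypothetical, and whether the kernel applies the constant in exactly this way is an assumption.

```python
import math

# log2(e), the same value as the kLog2e constant introduced in the diff.
LOG2E = 1.4426950408889634


def softmax_reference(scores, scale):
    """Softmax of scale * scores using natural exponentials."""
    scaled = [s * scale for s in scores]
    peak = max(scaled)
    weights = [math.exp(v - peak) for v in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_exp2(scores, scale):
    """Same softmax, with log2(e) folded into the scale and base-2 exponentials."""
    folded_scale = scale * LOG2E  # done once, mirroring scale_bmm1 * kLog2e
    scaled = [s * folded_scale for s in scores]
    peak = max(scaled)
    weights = [2.0 ** (v - peak) for v in scaled]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    scores = [0.5, -1.25, 3.0, 2.0]
    scale = 0.125
    print(softmax_reference(scores, scale))
    print(softmax_exp2(scores, scale))
```

Both functions return the same probabilities up to floating-point rounding, which is why folding the constant into the scale does not change the result.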
diff --git a/cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp b/cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp
@@ -121,7 +121,8 @@ class FusedMHARunnerV2::mhaImpl
         if (mLaunchParams.useKernelWithoutAlibi)
         {
             // The kernel adopts the log2f optimziation.
-            set_alpha(params.scale_bmm1, scale_bmm1 * float(M_LOG2E), DATA_TYPE_FP32);
+            constexpr float kLog2e = 1.4426950408889634074; // log_2(e) = M_LOG2E
+            set_alpha(params.scale_bmm1, scale_bmm1 * float(kLog2e), DATA_TYPE_FP32);
         }
         else
         {