Update TensorRT-LLM Release branch (#745)
* Update TensorRT-LLM

---------

Co-authored-by: Shixiaowei02 <[email protected]>
kaiyux and Shixiaowei02 authored Dec 26, 2023
1 parent a8018c1 commit 80bc075
Showing 19 changed files with 450 additions and 169 deletions.
99 changes: 99 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,99 @@
# Change Log

## Versions 0.6.0 / 0.6.1

* Models
* ChatGLM3
* InternLM (contributed by @wangruohui)
* Mistral 7B (developed in collaboration with Mistral.AI)
* MQA/GQA support to MPT (and GPT) models (contributed by @bheilbrun)
* Qwen (contributed by @Tlntin and @zhaohb)
* Replit Code V-1.5 3B (external contribution)
* T5, mT5, Flan-T5 (Python runtime only)

* Features
* Add runtime statistics related to active requests and KV cache
utilization from the batch manager (see
the [batch manager](docs/source/batch_manager.md) documentation)
  * Add `sequence_length` tensor to support proper lengths in beam search
    (when beam width > 1; see
    [tensorrt_llm/batch_manager/GptManager.h](cpp/include/tensorrt_llm/batch_manager/GptManager.h)
    and the Python sketch after this feature list)
* BF16 support for encoder-decoder models (Python runtime - see
[examples/enc_dec](examples/enc_dec/README.md))
  * Improvements to CPU and GPU memory utilization (including fixes for memory
    leaks)
  * Improved error reporting and reduced memory consumption
* Improved support for stop and bad words
* INT8 SmoothQuant and INT8 KV Cache support for the Baichuan models (see
[examples/baichuan](examples/baichuan/README.md))
* INT4 AWQ Tensor Parallelism support and INT8 KV cache + AWQ/weight-only
support for the GPT-J model (see [examples/gptj](examples/gptj/README.md))
* INT4 AWQ support for the Falcon models
(see [examples/falcon](examples/falcon/README.md))
  * LoRA support (functional preview only - limited to the Python runtime,
    QKV-only support, and not optimized for runtime performance) for
    the GPT model (see the
    [Run LoRA with the Nemo checkpoint](examples/gpt/README.md#Run-LoRA-with-the-Nemo-checkpoint)
    section of the GPT example)
* Multi-GPU support for encoder-decoder models (Python runtime - see
[examples/enc_dec](examples/enc_dec/README.md))
* New heuristic for launching the Multi-block Masked MHA kernel (similar
to FlashDecoding - see
[decoderMaskedMultiheadAttentionLaunch.h](cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionLaunch.h))
  * Prompt-Tuning support for GPT and LLaMA models (see the
    [Prompt-tuning](examples/gpt/README.md#Prompt-tuning) section in the GPT example)
* Performance optimizations in various CUDA kernels
* Possibility to exclude input tokens from the output (see `excludeInputInOutput` in
[`GptManager`](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
* Python binding for the C++ runtime (GptSession - see [`pybind`](cpp/tensorrt_llm/pybind))
* Support for different micro batch sizes for context and generation
phases with pipeline parallelism (see `GptSession::Config::ctxMicroBatchSize` and
`GptSession::Config::genMicroBatchSize` in
[tensorrt_llm/runtime/gptSession.h](cpp/include/tensorrt_llm/runtime/gptSession.h))
* Support for "remove input padding" for encoder-decoder models (see
[examples/enc_dec](examples/enc_dec/README.md))
* Support for context and generation logits (see `mComputeContextLogits` and
`mComputeGenerationLogits` in
[tensorrt_llm/runtime/gptModelConfig.h](cpp/include/tensorrt_llm/runtime/gptModelConfig.h))
* Support for `logProbs` and `cumLogProbs` (see `"output_log_probs"` and
`"cum_log_probs"` in [`GptManager`](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
* Update to CUTLASS 3.x
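
As a rough illustration of the beam-search support above, the sketch below drives the
Python runtime through the `ModelRunner` helper used by the repository's example scripts.
The engine path, token ids, and keyword names (`num_beams`, `output_sequence_lengths`)
are assumptions taken from those scripts and may differ between releases.

```python
# Minimal, hedged sketch of beam search through the Python runtime.
# ModelRunner and the keyword arguments below mirror the example scripts;
# treat the names as assumptions rather than a stable API.
import torch
from tensorrt_llm.runtime import ModelRunner

runner = ModelRunner.from_dir(engine_dir="gpt_engine_dir", rank=0)  # single GPU

# One prompt, already tokenized (token ids are placeholders).
batch_input_ids = [torch.tensor([1, 2, 3, 4], dtype=torch.int32)]

outputs = runner.generate(
    batch_input_ids,
    max_new_tokens=32,
    num_beams=4,                    # beam width > 1
    end_id=50256,
    pad_id=50256,
    output_sequence_lengths=True,   # return the true length of each beam
    return_dict=True,
)

print(outputs["output_ids"].shape)   # [batch, num_beams, max_seq_len]
print(outputs["sequence_lengths"])   # per-beam lengths, padding excluded
```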

* Bug fixes
* Fix for ChatGLM2 #93 and #138
* Fix tensor names error "RuntimeError: Tensor names
(`host_max_kv_cache_length`) in engine are not the same as expected in
the main branch" #369
* Fix weights split issue in BLOOM when `world_size = 2` ("array split
does not result in an equal division") #374
  * Fix SmoothQuant multi-GPU failure when tensor parallelism is 2 #267
* Fix a crash in GenerationSession if stream keyword argument is not None
#202
  * Fix a typo when calling the PyNVML API (issue "[BUG] code bug" #410)
* Fix bugs related to the improper management of the `end_id` for various
models [C++ and Python]
* Fix memory leaks [C++ code and Python models]
  * Fix the `std::bad_alloc` error when running gptManagerBenchmark (issue
    "gptManagerBenchmark std::bad_alloc error" #66)
* Fix a bug in pipeline parallelism when beam-width > 1
* Fix a bug with Llama GPTQ due to improper support of GQA
* Fix issue #88
* Fix an issue with the Huggingface Transformers version #16
* Fix link jump in windows readme.md #30 - by @yuanlehome
* Fix typo in batchScheduler.h #56 - by @eltociear
* Fix typo #58 - by @RichardScottOZ
  * Fix Multi-block MMHA (issue "Difference between `max_batch_size` in the engine
    builder and `max_num_sequences` in TrtGptModelOptionalParams?" #65)
* Fix the log message to be more accurate on KV cache #224
* Fix Windows release wheel installation: Failed to install the release
wheel for Windows using pip #261
* Fix missing torch dependencies: [BUG] The batch_manage.a choice error
in --cpp-only when torch's cxx_abi version is different with gcc #151
* Fix linking error during compiling google-test & benchmarks #277
* Fix logits dtype for Baichuan and ChatGLM: segmentation fault caused by
the lack of bfloat16 #335
* Minor bug fixes

## Version 0.5.0

* TensorRT-LLM v0.5.0 is the first public release.
142 changes: 41 additions & 101 deletions README.md
@@ -8,7 +8,7 @@ TensorRT-LLM
[![python](https://img.shields.io/badge/python-3.10.12-green)](https://www.python.org/downloads/release/python-31012/)
[![cuda](https://img.shields.io/badge/cuda-12.2-green)](https://developer.nvidia.com/cuda-downloads)
[![trt](https://img.shields.io/badge/TRT-9.2-green)](https://developer.nvidia.com/tensorrt)
[![version](https://img.shields.io/badge/release-0.7.0-green)](./setup.py)
[![version](https://img.shields.io/badge/release-0.7.1-green)](./setup.py)
[![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)

[Architecture](./docs/source/architecture.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Results](./docs/source/performance.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Examples](./examples/)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentation](./docs/source/)
@@ -108,16 +108,16 @@ concepts used in TensorRT-LLM, we recommend you to read the following

## Installation

The documentation for installing TensorRT-LLM can be found
[here](./docs/source/installation.md). An image of a Docker container with
TensorRT-LLM and its Triton Inference Server Backend will be made available
soon.

The remaining commands in that document must be executed from the TensorRT-LLM
container.

*For Windows installation, see [`Windows`](windows/README.md).*

## Quick Start

To create a TensorRT engine for an existing model, there are 3 steps:
@@ -379,103 +379,43 @@ For example: `mpirun -n 1 python3 examples/gpt/build.py ...`

### Change Log

#### Versions 0.7.0 / 0.7.1

* Models
- BART and mBART support in encoder-decoder models
- FairSeq Neural Machine Translation (NMT) family
- Mixtral-8x7B model
- Support weight loading for HuggingFace Mixtral model
- OpenAI Whisper
- Mixture of Experts support
- MPT - Int4 AWQ / SmoothQuant support
- Baichuan FP8 quantization support
* Features
- [Preview] Speculative decoding
- Add Python binding for `GptManager`
  - Add a Python class `ModelRunnerCpp` that wraps the C++ `GptSession` (a usage sketch follows this change log)
- System prompt caching
- Enable split-k for weight-only cutlass kernels
- FP8 KV cache support for XQA kernel
  - New Python builder API and `trtllm-build` command (already applied to [blip2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/blip2) and [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/opt#3-build-tensorrt-engines))
- Support `StoppingCriteria` and `LogitsProcessor` in Python generate API (thanks to the contribution from @zhang-ge-hao)
- fMHA support for chunked attention and paged kv cache
* Bug fixes
- Fix tokenizer usage in quantize.py #288, thanks to the contribution from @0xymoro
- Fix LLaMa with LoRA error #637
- Fix LLaMA GPTQ failure #580
- Fix Python binding for InferenceRequest issue #528
- Fix CodeLlama SQ accuracy issue #453
* Performance
- MMHA optimization for MQA and GQA
- LoRA optimization: cutlass grouped gemm
- Optimize Hopper warp specialized kernels
- Optimize AllReduce for parallel attention on Falcon and GPT-J
- Enable split-k for weight-only cutlass kernel when SM>=75
* Documentation
- Add [documentation for new builder workflow](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/new_workflow.md)
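
As a hedged illustration of the `ModelRunnerCpp` addition noted above, the sketch below
follows the calling pattern of the repository's example scripts. The engine directory,
token ids, and the `end_id`/`pad_id` values are placeholders, and the exact keyword
names should be checked against the released API.

```python
# Minimal sketch: ModelRunnerCpp exposes the C++ GptSession behind the same
# generate() surface as the pure-Python ModelRunner. Names are assumptions
# based on the example scripts, not a specification.
import torch
from tensorrt_llm.runtime import ModelRunnerCpp

runner = ModelRunnerCpp.from_dir(engine_dir="llama_engine_dir")

batch_input_ids = [torch.tensor([1, 2, 3], dtype=torch.int32)]  # placeholder ids
outputs = runner.generate(
    batch_input_ids,
    max_new_tokens=64,
    end_id=2,
    pad_id=2,
    return_dict=True,
)
print(outputs["output_ids"])
```

Because the two runners share the same `generate()` interface, a script written against
the Python `ModelRunner` can usually switch to the C++-backed session by changing only
the import.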

#### For the full change log history, please see [CHANGELOG.md](./CHANGELOG.md).

### Known Issues

7 changes: 7 additions & 0 deletions benchmarks/python/allowed_configs.py
@@ -232,6 +232,7 @@ class ModelConfig:
builder_opt=None,
pre_norm=False,
do_layer_norm_before=False,
use_custom_all_reduce=False,
)),
"opt_2.7b":
ModelConfig(name="opt_2.7b",
@@ -250,6 +251,7 @@ class ModelConfig:
builder_opt=None,
pre_norm=False,
do_layer_norm_before=True,
use_custom_all_reduce=False,
)),
"opt_6.7b":
ModelConfig(name="opt_6.7b",
@@ -268,6 +270,7 @@ class ModelConfig:
builder_opt=None,
pre_norm=False,
do_layer_norm_before=True,
use_custom_all_reduce=False,
)),
"opt_66b":
ModelConfig(name="opt_66b",
@@ -286,6 +289,7 @@ class ModelConfig:
builder_opt=None,
pre_norm=True,
do_layer_norm_before=True,
use_custom_all_reduce=False,
)),
"llama_7b":
ModelConfig(name="llama_7b",
@@ -512,6 +516,7 @@ class ModelConfig:
max_output_len=200,
builder_opt=None,
remove_input_padding=False,
use_custom_all_reduce=False,
)),
"bloom_560m":
ModelConfig(name="bloom_560m",
@@ -528,6 +533,7 @@ class ModelConfig:
max_input_len=1024,
max_output_len=1024,
builder_opt=None,
use_custom_all_reduce=False,
)),
"bloom_176b":
ModelConfig(name="bloom_176b",
@@ -544,6 +550,7 @@ class ModelConfig:
max_input_len=1024,
max_output_len=1024,
builder_opt=None,
use_custom_all_reduce=False,
)),
"bert_base":
ModelConfig(name="bert_base",
6 changes: 3 additions & 3 deletions cpp/tensorrt_llm/batch_manager/aarch64-linux-gnu/version.txt
@@ -1,3 +1,3 @@
516ff2db1e17536e92150b0c05200589 libtensorrt_llm_batch_manager_static.a
428a500536705184a1aad8aaf5c9c0ca libtensorrt_llm_batch_manager_static.pre_cxx11.a
33b6139e3bb108df093aab3a6de38a87f1f1e2dd commit
ffe001b0bf9ee66b3e3696423d6d09a2 libtensorrt_llm_batch_manager_static.a
3657ea3400959a64be77c12d8598dd72 libtensorrt_llm_batch_manager_static.pre_cxx11.a
9a775b3dbb20444f130f13f90e675cc971fe7e15 commit
4 changes: 2 additions & 2 deletions cpp/tensorrt_llm/batch_manager/x86_64-linux-gnu/version.txt
@@ -1,2 +1,2 @@
0403e89a23fd77aed43cac0ecd8136cf libtensorrt_llm_batch_manager_static.a
9fa2a1c18860eaf226a6ce61a8e3ed5d libtensorrt_llm_batch_manager_static.pre_cxx11.a
bb69bf376c5f955c327e867049639d78 libtensorrt_llm_batch_manager_static.a
14b107676c74ce17bfc8ce950b36a984 libtensorrt_llm_batch_manager_static.pre_cxx11.a
@@ -121,7 +121,8 @@ class FusedMHARunnerV2::mhaImpl
if (mLaunchParams.useKernelWithoutAlibi)
{
// The kernel adopts the log2f optimization.
set_alpha(params.scale_bmm1, scale_bmm1 * float(M_LOG2E), DATA_TYPE_FP32);
constexpr float kLog2e = 1.4426950408889634074; // log_2(e) = M_LOG2E
set_alpha(params.scale_bmm1, scale_bmm1 * float(kLog2e), DATA_TYPE_FP32);
}
else
{
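
For context on the `kLog2e` constant introduced above: the "log2f optimization" mentioned
in the comment folds the change of base into the BMM1 (Q·K^T) scale, using
exp(x * s) == exp2(x * s * log2(e)) with log2(e) ≈ 1.4426950408889634. Pre-multiplying
`scale_bmm1` by this constant lets the kernel evaluate the softmax exponentials with the
cheaper base-2 exponential; the commit simply replaces the `M_LOG2E` macro with an
equivalent local `constexpr`.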