Update on the development branch #2111

kaiyux · 2024-08-13T14:38:04Z

Hi,

The TensorRT-LLM team is pleased to announce that we pushed an update to the development branch (and the Triton backend) on August 13, 2024.

This update includes:

Model Support
- Supported the EXAONE model; see examples/exaone/README.md.
Features
- Supported GLM, Baichuan and Gemma models for the Python high level API.
- Supported LoRA for MoE models.
- Supported FP8 FMHA for the NVIDIA Ada Lovelace architecture.
- Supported beam search for streaming mode.
- The ModelWeightsLoader is enabled for LLaMA family models (experimental).
  - There are known issues with W4A16, W8A16 and W4A16_GPTQ. For those cases, set the environment variable TRTLLM_DISABLE_UNIFIED_CONVERTER=1 to disable the model weights loader and fall back to the legacy path.
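
For anyone driving the conversion from a Python script, the workaround above could be applied per process rather than globally. The following is only a minimal sketch: the variable name and value come from this update, while the helper names `legacy_loader_env` and `run_with_legacy_loader` are hypothetical, and the child command is a stand-in that just echoes the variable.

```python
import os
import subprocess
import sys

# Variable name and value are taken from the note above.
DISABLE_FLAG = "TRTLLM_DISABLE_UNIFIED_CONVERTER"


def legacy_loader_env(base_env=None):
    """Return a copy of the environment with the model weights loader disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env[DISABLE_FLAG] = "1"
    return env


def run_with_legacy_loader(command):
    """Run `command` (a list of arguments) with the flag set for that process only."""
    return subprocess.run(
        command,
        env=legacy_loader_env(),
        check=True,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # Stand-in child process that only echoes the variable; replace it with
    # your own checkpoint conversion or engine build command.
    probe = [
        sys.executable,
        "-c",
        f"import os; print(os.environ.get('{DISABLE_FLAG}'))",
    ]
    print(run_with_legacy_loader(probe).stdout.strip())
```

Setting the variable only in the child's environment keeps the rest of the session on the new loader, which may be convenient when only some checkpoints use the affected quantization modes.
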
API
- The C++ batch manager API is deprecated in favor of the C++ executor API, and it will be removed in a future release of TensorRT-LLM.
Bug fixes
- Fixed the engine build failure when the deduced max_seq_len is not an integer. (llama 3.1 70B Instruct would not build engine "TypeError: set_shape(): incompatible function arguments." #2018)
Infra
- Base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.07-py3.
- Base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.07-py3.
- The dependent TensorRT version is updated to 10.3.0.
- The dependent CUDA version is updated to 12.5.1.
- The dependent PyTorch version is updated to 2.4.0.
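
When building outside the base images, a quick check that the local environment matches the versions listed above may save some debugging. This is a rough sketch, not an official tool: it assumes the packages are installed under the pip distribution names `tensorrt` and `torch`, and it uses a prefix match because some builds append a suffix to the release number. The helper names are hypothetical.

```python
from importlib import metadata

# Versions listed in this update. The pip distribution names ("tensorrt",
# "torch") are assumptions about how the packages are installed.
EXPECTED = {"tensorrt": "10.3.0", "torch": "2.4.0"}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def matches(installed, expected):
    """Prefix match, so suffixes after the release number are accepted."""
    return installed is not None and installed.startswith(expected)


def check_versions(expected=EXPECTED):
    """Map each distribution name to (installed version, whether it matches)."""
    report = {}
    for name, want in expected.items():
        found = installed_version(name)
        report[name] = (found, matches(found, want))
    return report


if __name__ == "__main__":
    for name, (found, ok) in check_versions().items():
        status = "ok" if ok else "mismatch"
        print(f"{name}: found {found}, expected {EXPECTED[name]} ({status})")
```

The CUDA version is not covered here, since reading it reliably depends on how the toolkit was installed.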

Thanks,
The TensorRT-LLM Engineering Team