TensorRT-LLM 0.9.0 Release #1451
kaiyux announced in Announcements
Hi,
We are very pleased to announce the 0.9.0 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.
This update includes:
- Migrated BLIP-2 examples to `examples/multimodal`.
- Added support for `early_stopping=False` in beam search for the C++ Runtime.
- Added Gemma support, including importing checkpoints from HuggingFace `transformers` (Gemma implementation #1147).
- Added support for running `GptSession` without OpenMPI (Run GptSession without openmpi? #1220).
- Added the C++ `executor` API with Python bindings, see examples in `examples/bindings` (a short usage sketch follows below).
- Added advanced and multi-GPU examples for the Python bindings of the `executor` C++ API, see `examples/bindings/README.md`.
- Added documentation for the C++ `executor` API, see `docs/source/executor.md`.
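To give a feel for the new bindings, here is a minimal sketch modeled on the basic example in `examples/bindings`; the class and argument names follow that example for this release, the engine path is a hypothetical placeholder, and details may change in later versions.

```python
# Minimal sketch of the executor Python bindings, modeled on examples/bindings;
# names may differ slightly between versions.
import tensorrt_llm.bindings.executor as trtllm

# Load a decoder-only engine built with trtllm-build.
executor = trtllm.Executor(
    "/path/to/engine_dir",           # hypothetical engine location
    trtllm.ModelType.DECODER_ONLY,
    trtllm.ExecutorConfig(1),        # positional arg: max beam width
)

if executor.can_enqueue_requests():
    # Enqueue a request with pre-tokenized input and wait for the result.
    request = trtllm.Request(input_token_ids=[1, 2, 3, 4], max_new_tokens=8)
    request_id = executor.enqueue_request(request)
    for response in executor.await_responses(request_id):
        print(response.result.output_token_ids)
```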
- High-level API improvements (refer to `examples/high-level-api/README.md` for guidance; a short sketch also follows below):
  - Reused the `QuantConfig` from the `trtllm-build` tool, supporting broader quantization features.
  - Added support in the `LLM()` API to accept engines built by the `trtllm-build` command.
  - Refined the `SamplingConfig` used in the `LLM.generate` and `LLM.generate_async` APIs, with support for beam search, a variety of penalties, and more features.
  - Added support for the StreamingLLM feature; enable it via `LLM(streaming_llm=...)`.
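As a quick orientation, here is a minimal sketch of the high-level API flow; the `hlapi` module path, `ModelConfig`, and the model directory are assumptions based on `examples/high-level-api/README.md` and may differ in other releases.

```python
# Minimal sketch of the high-level API; module path and class names follow
# examples/high-level-api/README.md and may differ in other releases.
from tensorrt_llm.hlapi.llm import LLM, ModelConfig

# Point the config either at a HuggingFace model directory or at engines
# built by `trtllm-build` (newly supported in this release).
config = ModelConfig(model_dir="/path/to/model_or_engine_dir")  # hypothetical path
llm = LLM(config)

# SamplingConfig now carries beam search, penalties, and more; generation can
# be synchronous (generate) or asynchronous (generate_async).
for output in llm.generate(["To tell a story"]):
    print(output)
```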
- Refactored the Qwen model to the unified build workflow, see `examples/qwen/README.md` for the latest commands.
- Refactored the GPT model to the unified build workflow, see `examples/gpt/README.md` for the latest commands.
- Moved the LoRA-related options from the checkpoint-conversion scripts to the `trtllm-build` command, to generalize the feature better to more models.
- Removed the prompt-tuning options from the checkpoint-conversion scripts; use `trtllm-build --max_prompt_embedding_table_size` instead.
- Changed the `trtllm-build --world_size` flag to the `--auto_parallel` flag; the option is used for the auto parallel planner only.
- Removed `AsyncLLMEngine`; the `tensorrt_llm.GenerationExecutor` class is refactored to work both when launched explicitly with `mpirun` at the application level and when given an MPI communicator created by `mpi4py` (see the sketch below).
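The two launch modes for the refactored `tensorrt_llm.GenerationExecutor` can be pictured as follows. This sketch only shows the `mpi4py` side; how the communicator is handed to the executor is left schematic, since the exact parameter name is not given here.

```python
# Sketch of the two GenerationExecutor launch modes described above.
from mpi4py import MPI

# Mode 1: the whole application is started with
#   mpirun -n <world_size> python app.py
# and every rank runs this script, constructing its part of the executor.
world = MPI.COMM_WORLD
rank = world.Get_rank()

# Mode 2: the application creates (or splits) a communicator with mpi4py and
# hands it to tensorrt_llm.GenerationExecutor. For example, dedicate the
# first two ranks to a 2-way tensor-parallel model:
color = 0 if rank < 2 else MPI.UNDEFINED
model_comm = world.Split(color, key=rank)  # sub-communicator for the model ranks
```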
- Removed `examples/server`; see `examples/app` instead.
- Removed the `model` parameter from `gptManagerBenchmark` and `gptSessionBenchmark`.
- Fixed a Whisper bug to make sure that `encoder_input_len_range` is not 0, thanks to the contribution from @Eddie-Wang1120 (Fix enc_dec bug and Make several improvements to whisper #992).
- Fixed the wrong `end_id` issue for Qwen (qwen end_id setting is wrong so cannot stop at right postition! #987).
- Fixed the wrong `head_size` when importing the Gemma model from HuggingFace Hub, thanks for the contribution from @mfuntowicz (Specify the head_size from the config when importing Gemma from Hugging Face. #1148).
- Fixed wrong `SamplingConfig` tensors in `ModelRunnerCpp` (ModelRunnerCpp does not transfer SamplingConfig Tensor fields correctly #1183).
- Fixed the issue that `examples/run.py` only loads one line from `--input_file`.
- Fixed the issue that `ModelRunnerCpp` does not transfer `SamplingConfig` tensor fields correctly (ModelRunnerCpp does not transfer SamplingConfig Tensor fields correctly #1183).
- Added emulated static batching in `gptManagerBenchmark`.
- Supported arbitrary datasets from HuggingFace for the C++ benchmarks; see the "Prepare dataset" section in `benchmarks/cpp/README.md`.
- Added a percentile latency report to `gptManagerBenchmark`.
- Optimized `gptDecoderBatch` to support batched sampling.
- Updated the base Docker image for TensorRT-LLM to `nvcr.io/nvidia/pytorch:24.02-py3`.
- Updated the base Docker image for the TensorRT-LLM backend to `nvcr.io/nvidia/tritonserver:24.02-py3`.
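After pulling an image based on the updated containers, a one-line check confirms which wheel is installed:

```python
# Verify the installed TensorRT-LLM version inside the updated container.
import tensorrt_llm

print(tensorrt_llm.__version__)  # expected to print 0.9.0 for this release
```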
Currently, there are two key branches in the project:
- The rel branch is the stable branch for releases of TensorRT-LLM. It has been QA-ed and carefully tested.
- The main branch is the development branch. It is more experimental.

We are updating the main branch regularly with new features, bug fixes and performance optimizations. The stable branch will be updated less frequently, and the exact frequency will depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team