v3.0.0-beta2
Pre-release
ZHUI released this 08 Oct 08:52 · 42 commits to release/3.0-beta2 since this release
This release strengthens PaddleNLP's infrastructure, adds the Qwen2.5 and Mixtral 8x22B models, upgrades the Tokenizer, and renames the data-indexing tool.
It also fixes issues such as MoE model parameter saving and loading, improves text-processing accuracy, and updates documentation and test cases. Inference performance, hardware support, and auto parallelism have been optimized as well, including support for more models and parameter configurations, multi-GPU inference, stronger support for domestic (Chinese) hardware, and a streamlined distributed training workflow. (A short usage sketch follows the outline below.)
Core changes and enhancements
- Infrastructure hardening
- Bug fixes
- Documentation and test updates
- Other key changes:
  - Inference performance optimization
  - Expanded hardware support
  - Auto-parallel optimization
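For orientation, here is a minimal, hedged sketch of loading one of the newly added models through PaddleNLP's Auto classes. The checkpoint identifier and the generate() arguments are illustrative assumptions, not names confirmed by this release; check the model zoo for the exact strings.

```python
# Hedged sketch: loading a newly supported model via PaddleNLP's Auto classes.
# The checkpoint id "Qwen/Qwen2.5-7B-Instruct" is assumed for illustration.
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", dtype="bfloat16"
)

# return_tensors="pd" yields Paddle tensors.
inputs = tokenizer("PaddleNLP 3.0 adds", return_tensors="pd")
outputs = model.generate(**inputs, max_new_tokens=32)
# In dygraph mode PaddleNLP's generate() typically returns (ids, scores).
print(tokenizer.batch_decode(outputs[0], skip_special_tokens=True))
```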
What's Changed
- [Unified checkpoint] update optimizer async save signal by @DesmonDay in #8975
- Correct the run_dpo.py file path by @Mangodadada in #8952
- fix the loss base in llama_align_dygraph_dy2st_auto_bs2_bf16_DP2-MP1-… by @winter-wang in #8986
- [Bug fix] fix skip consumed_samples twice bug by @zhangyuqin1998 in #8980
- fix pip error in legacy benchmarks by @fightfat in #8978
- [auto_parallel] Add checkpoint converter by @xingmingyyj in #8847
- [llm]update finetune.md by @lugimzzz in #8990
- Support up to 32766 datasets after the tool_helpers upgrade by @JunnYu in #8994
- add DCU inference docs by @YanhuiDua in #8983
- [Distributed]Add loss nan/inf checker by @ForFishes in #8943
- [llm] update docs by @lugimzzz in #8999
- [Feature] Fused Mixtral support by @penPenf28 in #8901
- [XPU] Add README.md for llama2-7b by @xiguapipi in #8979
- Add gcu llama readme by @EnflameGCU in #8950
- fix qwen model use_casual_mask by @deepllz in #9009
- [ZeroPadding] revert zero_padding #8973 by @DrownFish19 in #9003
- [LLM Inference] Fix step.cu bug by @yuanlehome in #8995
- Refine checkpoint converter by @zhangbo9674 in #9001
- [Feature] fused mixtral wint4 by @penPenf28 in #9013
- llm inference docs by @Sunny-bot1 in #8976
- [LLM Inference] Support Qwen2_Moe Inference Model by @CJ77Qi in #8892
- fix llama3 static run by @yuanlehome in #8849
- [paddle inference cpu]update cpu inference by @bukejiyu in #8984
- fix the tipc ce case by @wawltor in #8748
- [Cherry-pick] Add is_distributed field in sharding reshard param_meta by @sneaxiy in #9028
- [Tokenizer] Support for loading added_tokens_decoder by @DrownFish19 in #8997
- [Inference] Add a8w8(fp8) a8w8c8(int8) quant_type support by @lixcli in #9032
- Fix checker of nan/inf by @ForFishes in #9029
- [Cherry-pick] add comm buffer size (#8963) by @ForFishes in #9031
- [Unified Checkpoint] Update async save info by @DesmonDay in #8982
- [llm]support pad to max_length & fix sp bug by @lugimzzz in #9040
- [Bugfix] fix bias optional by @penPenf28 in #9037
- fix setup.py for llm inference by @yuanlehome in #9041
- [Inference] Add cutlass gemm dequant op by @gzy19990617 in #8909
- [Inference] update fakequant support by @lixcli in #9047
- add test for pir sequence parallel on llama model by @liym27 in #9015
- Fix moe save load by @Meiyim in #9045
- Update quantization.md by @ZHUI in #9057
- [Fix] Initialize dp degree in single GPU by @greycooker in #9056
- fix bos download by @westfish in #9023
- [Inference] Update fakequant script by @lixcli in #9054
- [AutoParallel][PIR] Fit pir grad merge by @AndSonder in #8985
- [MLU] Support rms_norm_mlu by @PeiyuLau in #8504
- [Inference] support llama3 a8w8c8_fp8 inference and cutlass_fp8_gemm by @ckl117 in #8953
- [Inference] Qwen2 support fp8 inference by @ckl117 in #8954
- [Version] update version info by @DrownFish19 in #9060
- [NPU] Fix baichuan2-13b-chat infer by @ronny1996 in #9070
- [MLU] Fix Llama attention_mask in npu and mlu by @DrownFish19 in #9075
- Fix the memory overflow bug of the tune_cublaslt_gemm operator by @Hanyonggong in #9076
- [Inference] Fix weight_only_int4 bug by @lixcli in #9073
- [Auto Parallel] fix data stream bug of dist.to_static by @zhangyuqin1998 in #9077
- fix hang when Flag_dataloader_use_file_descriptor=True by @deepllz in #9080
- fix llm predict install error by @fightfat in #9088
- [PIR] add pir grad merge test by @AndSonder in #9074
- Update readme by @EnflameGCU in #9046
- [LLM] Add tensor parallel for chatglmv2 by @SevenSamon in #9014
- [data] update tool_helpers version and add unittest by @JunnYu in #9093
- fix baseline because of PR#8769 by @fightfat in #9092
- fix use paddle.incubate.jit.inference(model) errors by @chang-wenbin in #9016
- [CI] Fix paddlepaddle install by @DesmonDay in #9102
- [LLM] fix train on npu by @SylarTiaNII in #9101
- Disable ut by @zhangbo9674 in #9108
- [AutoParallel] Enable CI for gradclip by @JZ-LIANG in #9059
- [Inference] Remove ceval from run_finetune by @lixcli in #9100
- [Bugfix] fix multi-gpu infer by @penPenf28 in #9107
- [Inference] fix step kernel by @gzy19990617 in #9122
- [DCU] fix DCU w8a8c8 GEMM shape by @YanhuiDua in #9115
- [Inference] FP8 gemm auto-tune by @ckl117 in #9094
- Open ut llama_align_dygraph_dy2st_pir_auto_grad_merge_bs2_fp32_DP1-MP1-PP1 by @zhangbo9674 in #9120
- [LLM Inference] Support Qwen2_Moe Inference with MultiGPU by @CJ77Qi in #9121
- [Unified Checkpoint] Fix uc lora config, fix release_grads by @DesmonDay in #9082
- [Inference]qwen2-a8w8c8 support use_fake_parameter by @ckl117 in #9109
- Add fast_ln spmd rules by @From00 in #9125
- fix pir dtype by @wanghuancoder in #9130
- Remove ring_flash_attention warning by @DrownFish19 in #9119
- [DOC] Fix LLM page 404 Not Found by @DrRyanHuang in #9127
- Add hardware flops for pretraining by @ZHUI in #9069
- [Benchmark] Fix amp level bug in some gpt tests by @zhangbo9674 in #9116
- [Auto Parallel] Fix ckpt_converter for auto_parallel by @zhangyuqin1998 in #9136
- [Inference] Update fakequant by @lixcli in #9140
- [DOC] Update docs by @DrownFish19 in #9141
- [LLM Inference] Qwen2_Moe Support wint4 by @CJ77Qi in #9129
- add multy devices supported models by @a31413510 in #9079
- [fix] Fix redundant storage of frozen parameters, compatible with shard-reshard (#9067) by @bo-ke in #9148
- [Docs] Update LLM docs by @DrownFish19 in #9143
- fix llm ce predict run error by @fightfat in #9149
- [Tokenizer] Add replace_additional_special_tokens parameter to add_special_tokens by @lvdongyi in #9144 (usage sketch after this list)
- [Tokenizer] Fix decode output with space in decode_token by @DrownFish19 in #9010
- [Inference] Optimize top_p kernel performance by @gzy19990617 in #9132
- [Models] Add Qwen2.5 by @DrownFish19 in #9157
- Update README.md by @ZHUI in #9160
- [Inference] FP8 dual gemm auto-tune and support compile parallelization by @ckl117 in #9151
- [AutoParallel] enable ci for dp amp clip by @JZ-LIANG in #9062
- [llm]support dpo pp by @lugimzzz in #9039
- [Tools] Rename tool_helpers to fast_dataindex by @ZHUI in #9134 (migration sketch after this list)
- [Trainer] Support skip data intervals by @greycooker in #8989
- remove run_pretrain_auto_static.py CI when open PIR by @fightfat in #9177
- [Tokenizer] Enable padding_side as call time kwargs by @lvdongyi in #9161
- Revert "[Tokenizer] Enable padding_side as call time kwargs" by @ZHUI in #9192
- [XPU] add xpu support for llama sft by @tizhou86 in #9152
- [AutoParallel] Add FLAGS_enable_fused_ffn_qkv_pass for llama by @zhangbo9674 in #9182
- [AutoParallel] Fix ckpt convert bug for sharding v2 by @zhangbo9674 in #9179
- [Test] Disable dynamic to static test case for paddle PIR by @DrownFish19 in #9196
- Fix ppt eval hang by @gongel in #9218
- Update branch version to 3.0.0b2 by @gongel in #9220
- Update branch version to 3.0.0b2 by @gongel in #9221
- Revert "Fix ppt eval hang" by @ZHUI in #9229
New Contributors
- @Mangodadada made their first contribution in #8952
- @xingmingyyj made their first contribution in #8847
- @penPenf28 made their first contribution in #8901
- @xiguapipi made their first contribution in #8979
- @Sunny-bot1 made their first contribution in #8976
- @CJ77Qi made their first contribution in #8892
- @lixcli made their first contribution in #9032
- @gzy19990617 made their first contribution in #8909
- @SevenSamon made their first contribution in #9014
- @chang-wenbin made their first contribution in #9016
- @DrRyanHuang made their first contribution in #9127
- @a31413510 made their first contribution in #9079
- @lvdongyi made their first contribution in #9144
- @tizhou86 made their first contribution in #9152
Full Changelog: v3.0.0-beta1...v3.0.0-beta2