v3.0.0-beta2
Pre-release
ZHUI released this 08 Oct 08:52 · 42 commits to release/3.0-beta2 since this release
This release strengthens PaddleNLP's infrastructure, adds the Qwen2.5 and Mixtral 8x22B models, upgrades the Tokenizer, and renames the data-indexing tool.
It also fixes issues such as MoE model parameter saving and loading, improves text-processing accuracy, and updates documentation and test cases. Inference performance, hardware support, and auto parallelism have been optimized as well, including support for more models and parameter configurations, multi-GPU inference, stronger support for domestic (Chinese) hardware, and a streamlined distributed training workflow. (A short usage sketch follows the outline below.)
Core changes and enhancements
- Infrastructure hardening
- Bug fixes
- Documentation and test updates
- Other key changes:
  - Inference performance optimization
  - Expanded hardware support
  - Auto-parallel optimization
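For orientation, here is a minimal, hedged sketch of loading one of the newly added models through PaddleNLP's Auto classes. The checkpoint identifier and the generate() arguments are illustrative assumptions, not names confirmed by this release; check the model zoo for the exact strings.

```python
# Hedged sketch: loading a newly supported model via PaddleNLP's Auto classes.
# The checkpoint id "Qwen/Qwen2.5-7B-Instruct" is assumed for illustration.
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", dtype="bfloat16"
)

# return_tensors="pd" yields Paddle tensors.
inputs = tokenizer("PaddleNLP 3.0 adds", return_tensors="pd")
outputs = model.generate(**inputs, max_new_tokens=32)
# In dygraph mode PaddleNLP's generate() typically returns (ids, scores).
print(tokenizer.batch_decode(outputs[0], skip_special_tokens=True))
```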
What's Changed
- [Unified checkpoint] update optimizer async save signal by @DesmonDay in #8975
- Correct the run_dpo.py file path by @Mangodadada in #8952
- fix the loss base in llama_align_dygraph_dy2st_auto_bs2_bf16_DP2-MP1-… by @winter-wang in #8986
- [Bug fix] fix skip consumed_samples twice bug by @zhangyuqin1998 in #8980
- fix pip error in legacy benchmarks by @fightfat in #8978
- [auto_parallel] Add checkpoint converter by @xingmingyyj in #8847
- [llm]update finetune.md by @lugimzzz in #8990
- Support up to 32766 datasets after the tool_helpers upgrade by @JunnYu in #8994
- add DCU inference docs by @YanhuiDua in #8983
- [Distributed]Add loss nan/inf checker by @ForFishes in #8943
- [llm] update docs by @lugimzzz in #8999
- [Feature] Fused Mixtral support by @penPenf28 in #8901
- [XPU] Add README.md for llama2-7b by @xiguapipi in #8979
- Add gcu llama readme by @EnflameGCU in #8950
- fix qwen model use_casual_mask by @deepllz in #9009
- [ZeroPadding] revert zero_padding #8973 by @DrownFish19 in #9003
- [LLM Inference] Fix step.cu bug by @yuanlehome in #8995
- Refine checkpoint converter by @zhangbo9674 in #9001
- [Feature] fused mixtral wint4 by @penPenf28 in #9013
- llm inference docs by @Sunny-bot1 in #8976
- [LLM Inference] Support Qwen2_Moe Inference Model by @CJ77Qi in #8892
- fix llama3 static run by @yuanlehome in #8849
- [paddle inference cpu]update cpu inference by @bukejiyu in #8984
- fix the tipc ce case by @wawltor in #8748
- [Cherry-pick] Add is_distributed field in sharding reshard param_meta by @sneaxiy in #9028
- [Tokenizer] Support for loading added_tokens_decoder by @DrownFish19 in #8997
- [Inference] Add a8w8(fp8) a8w8c8(int8) quant_type support by @lixcli in #9032
- Fix checker of nan/inf by @ForFishes in #9029
- [Cherry-pick] add comm buffer size (#8963) by @ForFishes in #9031
- [Unified Checkpoint] Update async save info by @DesmonDay in #8982
- [llm]support pad to max_length & fix sp bug by @lugimzzz in #9040
- [Bugfix] fix bias optional by @penPenf28 in #9037
- fix setup.py for llm inference by @yuanlehome in #9041
- [Inference] Add cutlass gemm dequant op by @gzy19990617 in #8909
- [Inference] update fakequant support by @lixcli in #9047
- add test for pir sequence parallel on llama model by @liym27 in #9015
- Fix moe save load by @Meiyim in #9045
- Update quantization.md by @ZHUI in #9057
- [Fix] Initialize dp degree in single GPU by @greycooker in #9056
- fix bos download by @westfish in #9023
- [Inference] Update fakequant script by @lixcli in #9054
- [AutoParallel][PIR] Fit pir grad merge by @AndSonder in #8985
- [MLU] Support rms_norm_mlu by @PeiyuLau in #8504
- [Inference] support llama3 a8w8c8_fp8 inference and cutlass_fp8_gemm by @ckl117 in #8953
- [Inference] Qwen2 support fp8 inference by @ckl117 in #8954
- [Version] update version info by @DrownFish19 in #9060
- [NPU] Fix baichuan2-13b-chat infer by @ronny1996 in #9070
- [MLU] Fix Llama attention_mask in npu and mlu by @DrownFish19 in #9075
- Fix the memory overflow bug of the tune_cublaslt_gemm operator by @Hanyonggong in #9076
- [Inference] Fix weight_only_int4 bug by @lixcli in #9073
- [Auto Parallel] fix data stream bug of dist.to_static by @zhangyuqin1998 in #9077
- fix hang when Flag_dataloader_use_file_descriptor=True by @deepllz in #9080
- fix llm predict install error by @fightfat in #9088
- [PIR] add pir grad merge test by @AndSonder in #9074
- Update readme by @EnflameGCU in #9046
- [LLM] Add tensor parallel for chatglmv2 by @SevenSamon in #9014
- [data] update tool_helpers version and add unittest by @JunnYu in #9093
- fix baseline because of PR#8769 by @fightfat in #9092
- fix use paddle.incubate.jit.inference(model) errors by @chang-wenbin in #9016
- [CI] Fix paddlepaddle install by @DesmonDay in #9102
- [LLM] fix train on npu by @SylarTiaNII in #9101
- Disable ut by @zhangbo9674 in #9108
- [AutoParallel] Enable CI for gradclip by @JZ-LIANG in #9059
- [Inference] Remove ceval from run_finetune by @lixcli in #9100
- [Bugfix] fix multi-gpu infer by @penPenf28 in #9107
- [Inference] fix step kernel by @gzy19990617 in #9122
- [DCU] fix DCU w8a8c8 GEMM shape by @YanhuiDua in #9115
- [Inference] FP8 gemm auto-tune by @ckl117 in #9094
- Open ut llama_align_dygraph_dy2st_pir_auto_grad_merge_bs2_fp32_DP1-MP1-PP1 by @zhangbo9674 in #9120
- [LLM Inference] Support Qwen2_Moe Inference with MultiGPU by @CJ77Qi in #9121
- [Unified Checkpoint] Fix uc lora config, fix release_grads by @DesmonDay in #9082
- [Inference]qwen2-a8w8c8 support use_fake_parameter by @ckl117 in #9109
- Add fast_ln spmd rules by @From00 in #9125
- fix pir dtype by @wanghuancoder in #9130
- Remove ring_flash_attention warning by @DrownFish19 in #9119
- [DOC] Fix LLM page 404 Not Found by @DrRyanHuang in #9127
- Add hardware flops for pretraining by @ZHUI in #9069
- [Benchmark] Fix amp level bug in some gpt tests by @zhangbo9674 in #9116
- [Auto Parallel] Fix ckpt_converter for auto_parallel by @zhangyuqin1998 in #9136
- [Inference] Update fakequant by @lixcli in #9140
- [DOC] Update docs by @DrownFish19 in #9141
- [LLM Inference] Qwen2_Moe Support wint4 by @CJ77Qi in #9129
- add multy devices supported models by @a31413510 in #9079
- [fix] Fix redundant storage of frozen parameters, compatible with shard-reshard (#9067) by @bo-ke in #9148
- [Docs] Update LLM docs by @DrownFish19 in #9143
- fix llm ce predict run error by @fightfat in #9149
- [Tokenizer] Add replace_additional_special_tokens parameter to add_special_tokens by @lvdongyi in #9144 (usage sketch after this list)
- [Tokenizer] Fix decode output with space in decode_token by @DrownFish19 in #9010
- [Inference] Optimize top_p kernel performance by @gzy19990617 in #9132
- [Models] Add Qwen2.5 by @DrownFish19 in #9157
- Update README.md by @ZHUI in #9160
- [Inference] FP8 dual gemm auto-tune and support compile parallelization by @ckl117 in #9151
- [AutoParallel] enable ci for dp amp clip by @JZ-LIANG in #9062
- [llm]support dpo pp by @lugimzzz in #9039
- [Tools] Rename tool_helpers to fast_dataindex by @ZHUI in #9134 (migration sketch after this list)
- [Trainer] Support skip data intervals by @greycooker in #8989
- remove run_pretrain_auto_static.py CI when open PIR by @fightfat in #9177
- [Tokenizer] Enable padding_side as call time kwargs by @lvdongyi in #9161
- Revert "[Tokenizer] Enable padding_side as call time kwargs" by @ZHUI in #9192
- [XPU] add xpu support for llama sft by @tizhou86 in #9152
- [AutoParallel] Add FLAGS_enable_fused_ffn_qkv_pass for llama by @zhangbo9674 in #9182
- [AutoParallel] Fix ckpt convert bug for sharding v2 by @zhangbo9674 in #9179
- [Test] Disable dynamic to static test case for paddle PIR by @DrownFish19 in #9196
- Fix ppt eval hang by @gongel in #9218
- Update branch version to 3.0.0b2 by @gongel in #9220
- Update branch version to 3.0.0b2 by @gongel in #9221
- Revert "Fix ppt eval hang" by @ZHUI in #9229
New Contributors
- @Mangodadada made their first contribution in #8952
- @xingmingyyj made their first contribution in #8847
- @penPenf28 made their first contribution in #8901
- @xiguapipi made their first contribution in #8979
- @Sunny-bot1 made their first contribution in #8976
- @CJ77Qi made their first contribution in #8892
- @lixcli made their first contribution in #9032
- @gzy19990617 made their first contribution in #8909
- @SevenSamon made their first contribution in #9014
- @chang-wenbin made their first contribution in #9016
- @DrRyanHuang made their first contribution in #9127
- @a31413510 made their first contribution in #9079
- @lvdongyi made their first contribution in #9144
- @tizhou86 made their first contribution in #9152
Full Changelog: v3.0.0-beta1...v3.0.0-beta2