v2.8.0
We are pleased to announce the v2.8.0 release of the PaddlePaddle (飞桨) large language model toolkit. In this release, we deeply optimized the toolkit's LLM fine-tuning and alignment capabilities and improved its training and inference support on domestic computing hardware. The main work is as follows:
- Specialized fine-tuning and efficient alignment: provides the in-house, fast-converging RsLoRA+ algorithm, which substantially improves PEFT training convergence speed and quality (a minimal scaling sketch follows this list); introduces high-performance generation acceleration into the RLHF PPO algorithm, removing the generation-speed bottleneck in PPO training and delivering a large lead in PPO training performance.
- Faster LLM training: generalized support for multiple training performance optimizations such as FastFNN and FusedQKV, making LLM training faster and more stable.
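The release notes ship no code for RsLoRA+, so the snippet below is only a minimal, hypothetical sketch of the rank-stabilized scaling idea that rsLoRA-style methods build on: the low-rank update is scaled by alpha / sqrt(r) instead of plain LoRA's alpha / r. The class name and layer structure are illustrative and do not reflect PaddleNLP's actual PEFT implementation.

```python
import math

import paddle
import paddle.nn as nn


class RsLoRALinear(nn.Layer):
    """Illustrative LoRA linear layer with rank-stabilized scaling.

    Plain LoRA scales the low-rank update by alpha / r; the rank-stabilized
    variant uses alpha / sqrt(r), which keeps the update magnitude stable as
    the rank grows and tends to converge faster at higher ranks.
    """

    def __init__(self, in_features, out_features, r=8, lora_alpha=16):
        super().__init__()
        self.weight = self.create_parameter([in_features, out_features])
        self.lora_A = self.create_parameter([in_features, r])
        # B starts at zero so the adapter is a no-op before training.
        self.lora_B = self.create_parameter(
            [r, out_features], default_initializer=nn.initializer.Constant(0.0)
        )
        # rsLoRA scaling; plain LoRA would use lora_alpha / r.
        self.scaling = lora_alpha / math.sqrt(r)

    def forward(self, x):
        base = paddle.matmul(x, self.weight)
        update = paddle.matmul(paddle.matmul(x, self.lora_A), self.lora_B)
        return base + update * self.scaling
```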
LLM Fine-tuning, Alignment, Training, and Inference Optimizations
- Fine-tuning
- Inference
- Added static-graph inference for QWenVL #7808
New Models
- Added QWenVL static-graph inference #7808
- Added DeBERTa and DeBERTa-v2 models #8227
- deepset/deberta-v3-large-squad2
- microsoft/deberta-v2-xlarge
- microsoft/deberta-v3-base
- microsoft/deberta-v3-large
- microsoft/deberta-base
- Added Mixtral mixture-of-experts models #7803
- mistralai/Mixtral-8x7B-Instruct-v0.1
- mistralai/Mixtral-8x7B-v0.1
- Added Llama3 #8315
- meta-llama/Meta-Llama-3-8B
- meta-llama/Meta-Llama-3-8B-Instruct
- meta-llama/Meta-Llama-3-70B
- meta-llama/Meta-Llama-3-70B-Instruct
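As a usage note (not part of the release notes themselves), the newly added checkpoints can be loaded through PaddleNLP's Auto classes. The `dtype` keyword, generation arguments, and return handling below are assumptions that may need adjusting for your installed version.

```python
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# dtype kwarg assumed; drop or change it if your version does not accept it.
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="bfloat16")

inputs = tokenizer("Write a short poem about the sea.", return_tensors="pd")
# PaddleNLP's generate() is assumed to return (generated_ids, scores).
output_ids, _ = model.generate(**inputs, max_length=64)
print(tokenizer.decode(output_ids[0].tolist(), skip_special_tokens=True))
```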
Base Framework Upgrades
- Trainer upgrades
- AutoParallel upgrades
- Other
Other Support
- Added a matryoshka representation learning retrieval strategy to save compute and storage resources (see the sketch below). #8165
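A minimal sketch of the matryoshka idea, not the toolkit's actual implementation: embeddings trained this way keep their leading sub-dimensions usable, so retrieval vectors can be truncated and re-normalized to shrink the index and cheapen similarity search. The random arrays stand in for whatever embedding model produces the vectors.

```python
import numpy as np


def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)


# Stand-ins for encode(corpus) / encode(query) from a matryoshka-trained model.
corpus_emb = np.random.randn(1000, 768).astype("float32")
query_emb = np.random.randn(1, 768).astype("float32")

# Index only the first 256 dimensions: ~3x less storage, cheaper dot products.
corpus_small = truncate_and_normalize(corpus_emb, 256)
query_small = truncate_and_normalize(query_emb, 256)
scores = corpus_small @ query_small.T   # cosine similarity on unit vectors
top_k = np.argsort(-scores[:, 0])[:5]   # indices of the 5 best matches
```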
Bug Fixes
- Adjusted log levels and added a timelog timer, compatible across devices. #8261
- Fixed inconsistent randomly initialized shared weights in pipeline parallelism, covering models such as GPT and OPT. #7772
- Disabled downloading from the Hugging Face Hub in CI and unit tests #7798 #8198
- Fixed duplicated concatenation of query and history in the LLM Gradio UI when the chat template is enabled. #7992
- Fixed a key error when downloading GPT models. #8253
- Fixed LlamaRotaryEmbedding #7882
- Fixed allreduce tensor dtype issue #7876
- Fixed an issue caused by the framework's dev branch removing the paddle.jit.dy2static.utils_helper API #7989
- Fixed the read-data timer when ignore_data_skip=False and skip_profile_timer=False. #8177
- Fixed Wandb unit test issues #8066 #8056
- Fixed an error when the Trainer parses JSON files and command-line list arguments at the same time #7860
- Fixed inference issues in the Gradio UI #7740 #7788
- Fixed basic tokenizer-related issues #7797 #7870
- Fixed loading RNG state on custom devices. #7894
- Fixed garbled output when printing BF16 loss in auto parallelism #7874
- Initialize the model in float to fix AMP errors in static-graph auto parallelism #8033 #8199
- Fixed incorrect usage of the ShardDataloader interface under pipeline parallelism #8014
- Fixed Llama precision issues on custom devices. #7895
- Fixed NPU AICPU operator issues #7976
- Fixed missing arguments in FusedLinearWithGradAdd. #8178
What's Changed
- [Unified Checkpoint] Add unified checkpoint training args doc. by @DesmonDay in #7756
- [AutoParallel] Auto Trans PP to VPP by @zhaoyinglia in #7747
- Add codecov check by @zjjlivein in #7760
- [CE] Delete gpt_for_sequence_classification by @ZHUI in #7757
- [DOC] Update trainer.md by @ZHUI in #7761
- [Release] Change version to 2.7.0 by @ZHUI in #7764
- [benchmark]close skip_memory_metrics for ips by @Liujie0926 in #7732
- [Release] Update release.yml to release tags by @ZHUI in #7765
- [AutoParallel] Add Sequence Parallel for Static LLaMA by @JZ-LIANG in #7746
- [New Features] support dynamic src_length by @wj-Mcat in #7740
- Fix unified_checkpoint bug by @DrownFish19 in #7770
- [DONE] aistudio, hf hub, bos update download by @JunnYu in #7608
- [Trainer] Fix dist dataloader eval by @DesmonDay in #7777
- [Paddle-pipelines] Update convert_files_to_dicts_splitter by @w5688414 in #7748
- [PEFT]fix lora model tp when existing other trainable module by @lugimzzz in #7781
- [Paddle-Pipelines] update faiss by @qingzhong1 in #7793
- Fix shared weights sync for PipelineLayer by @DrownFish19 in #7772
- [tests] download slow by @JunnYu in #7798
- [INFER][LLM] Support qwen in fined grained dybatch v1 by @DanGuge in #7644
- Add CE for Distributed Hybrid Parallel by @iosmers in #7782
- add MP2-SP2-pp4-vpp2-SD2-stage1-mbs2-acc8 ce by @tianhaodongbd in #7774
- [Pretrain] Fix eval during pretrain by @DesmonDay in #7806
- pipeline parallel benchmark by @zhangting2020 in #7759
- [Bug fixes] fix br gradio by @wj-Mcat in #7788
- delete useless code for write_cache_kv.cu by @yuanlehome in #7812
- [llm]support qlora pp by @lugimzzz in #7801
- Trainer support simultaneously parse JSON files and cmd arguments. by @greycooker in #7768
- [LLM] Support block_attention/cachekv quant for llama by @RichardWooSJTU in #7649
- [Bug Fix] fix paddle multipy_fwd_func warning message by @BeingGod in #7818
- [llm]fix lora by @lugimzzz in #7824
- fused rms spmd by @liuzhenhai93 in #7830
- [Pretrain] Fix eval during pretrain by @DesmonDay in #7827
- [neural search][fix bug of evaluate.py] by @ZeyuTeng96 in #7832
- [neural search] fix the bug of reading files when calculating the recall scores by @shenghwa in #7836
- [Bug fixes] update chatglm tokenizer by @wj-Mcat in #7797
- [semantic_indexing] fix bug of evaluate.py by @ZeyuTeng96 in #7843
- [faq] fix bug of evaluate.py by @ZeyuTeng96 in #7840
- [text_classification_retrieval_based] fix bug of evaluate.py by @ZeyuTeng96 in #7844
- [LLM] add Qwen-7B-Chat to PaddleNLP unit test by @ziangqin-baidu in #7823
- Support 5.2 bloom by @zhoutianzi666 in #7846
- [unified checkpoint] Fix last checkpoint save by @DrownFish19 in #7854
- [unified checkpoint] fix checkpoint names by @DrownFish19 in #7795
- [New Features]add ranks testing for test_predictor by @wj-Mcat in #7800
- [Auto Parallel] Support dynamic semi-auto training in Llama2 model by @haohongxiang in #7851
- [CI] add ci approval pipelines by @zjjlivein in #7859
- [fix] fix a bug of trainer/argparser.py by @greycooker in #7860
- [Improvement] fix ops improting in utils by @wj-Mcat in #7865
- [Add CE] Add CE for Hybrid Parallism by @iosmers in #7817
- [Unified Checkpoint] Cherry pick empty cache. by @ZHUI in #7868
- Add PPO training. by @guoshengCS in #7305
- Update reward_main.py by @wawltor in #7880
- Update ppo_main.py by @wawltor in #7881
- [LLM] revert benchmark codes by @RichardWooSJTU in #7871
- [LLM]support QWenVL second part by @DanGuge in #7808
- [Bug Fixes] update chatglm1 tokenizer by @wj-Mcat in #7870
- 【AutoParallel】Support 'master_grad' in Llama in static auto-parallelism by @heavyrain-lzy in #7658
- [Bug Fix] fix slice bug in LlamaRotaryEmbedding by @MarioLulab in #7882
- 【AutoParallel】Support bf16 loss in static by @heavyrain-lzy in #7874
- [Bug Fix] fix allreduce tensor dtype by @BeingGod in #7876
- [CE] Add Qwen into CE process by @ziangqin-baidu in #7887
- [Hackathon 5th No.73] ToT by @ErnestinaQiu in #7660
- [CustomDevice] fix loading rng state on custom devices by @SylarTiaNII in #7894
- [LLM] fix llama precision on custom devices by @SylarTiaNII in #7895
- [AutoConfig]add benchmark scripts by @Liujie0926 in #7897
- [RELEASE] Update README.md by @ZHUI in #7834
- add qwen benchmark by @wtmlon in #7758
- [Trainer] Refactor by @ZHUI in #7909
- [CE]add gpt sharding_v2 case by @Liujie0926 in #7914
- [Improvement] fix logger level by @KB-Ding in #7903
- RuntimeTimer for the toolkit by @KB-Ding in #7913
- [New Features] Trainer add Wandb and Tensorboard by @greycooker in #7863
- [Bug Fix] Fix timer device by @KB-Ding in #7939
- [Auto Parallel] Support semi-auto trainer and fit Llama2 training by @haohongxiang in #7885
- gqa fuse attention qkv by @FeixLiu in #7890
- rename files and add readme for llama auto_parallel by @zhiqiu in #7944
- [Trainer] Skip some trainer test. by @ZHUI in #7949
- [Unified checkpoint] Turn off unified checkpoint when using sharding stage3 by @DesmonDay in #7969
- [Text Matching] Update text matching by @w5688414 in #7973
- Fix NPU AICPU operator issues by @NINGBENZHE in #7976
- [Unified Checkpoint] Fix multi-node output share-folder by @DesmonDay in #7977
- Add SwiGLU operator by @sneaxiy in #7967
- [model_zoo/gpt-3] Fix bugs from PR-61236 which cleared paddle.jit.dy2static.utils_helper by @haohongxiang in #7989
- 【AutoParallel】Add semi autoparallel amp by @heavyrain-lzy in #7985
- [Trainer] ignore_save_lr_and_optim by @JunnYu in #7978
- [Gradio] fix llm gradio multi-turn dialogue bug by @JunnYu in #7992
- support GQA by @zhangting2020 in #7906
- [AutoConfig]add N1C8_resume by @Difers in #7950
- [AutoConfig]add N2C16 by @Liujie0926 in #7915
- [Unified Checkpoint] Add document by @DesmonDay in #7961
- Add SearchApi integration by @SebastjanPrachovskij in #7936
- add autotuner buffer check ce case by @Difers in #7993
- [Unified Checkpoint] Support peft model by @DesmonDay in #7691
- [DATA] Remove repeated chars during preprocessing by @DrownFish19 in #7739
- 【AutoParalle】construct model using float32 in "amp-o2" by @heavyrain-lzy in #8033
- support the loss mask for the pretrain by @wawltor in #8034
- [Mixtral] Add mixtral moe by @DesmonDay in #7803
- [CI] fix test ptuning by @zjjlivein in #8040
- Add SwiGLU for auto Llama by @From00 in #8038
- Fix _cache_founf_inf by @co63oc in #7997
- 【AutoParallelism】fix dataloader bug and add ci for static by @heavyrain-lzy in #8014
- fix the index_dataset with old data format by @wawltor in #8049
- Fit sharding optimization for auto parallel llama by @From00 in #8021
- Optimize the log and enable to print the number of tokens each second. by @Xreki in #7853
- 【fix】 fix TestWandbCallback by @greycooker in #8056
- Fit pir flag in predictor by @cyber-pioneer in #8048
- update pp by @lugimzzz in #8059
- Revert "Fit pir flag in predictor" by @zjjlivein in #8065
- [CI]fix ci scripts for distribute by @Liujie0926 in #8063
- unify_criterion_inputs_dynamic_and_static by @liuzhenhai93 in #8053
- [Unified Checkpoint] Fix lora unittest by @DesmonDay in #8070
- fit cinn and pir flag in predictor by @cyber-pioneer in #8071
- Support hybrid_parallel_topo_order for auto parallel Llama by @From00 in #8011
- Download refactor by @LOVE-YOURSELF-1 in #8020
- [Distributed] Add dp_gradient_sync_after_accumulate by @AndSonder in #8045
- [Distributed]Add distributed config for pipeline parallel by @ForFishes in #8051
- [UC] Ignore optimizer when UC by @gongel in #8058
- 【fix】fix TestTensorboardCallback by @greycooker in #8066
- [BugFix]Rm overlap limit in dp & pp by @ForFishes in #8089
- dist dataloader: add cuda compilation check by @PeiyuLau in #8099
- Download----fix new bug by @LOVE-YOURSELF-1 in #8088
- [Bug fixes] convert min_new_token -> min_new_tokens by @wj-Mcat in #7883
- [CI]update llm_gpt loss_base for Paddle#62500 by @Liujie0926 in #8107
- [dist benchmark]add llama2 with autotuner by @Liujie0926 in #8108
- [Trainer] Change num_train_epochs default value by @DesmonDay in #8113
- [BugFix] shutil.rmtree ignore_errors for shared disks between train nodes. by @ZHUI in #8117
- qwen init bug fix by @wtmlon in #8120
- 【AutoParallel】Add strategy with more options by @heavyrain-lzy in #8114
- [AutoParallel] unify llama model by @deepllz in #8127
- [benchmark]add skip_memory_metrics for ce_gpt by @Liujie0926 in #8132
- [Distributed]Fix comm_overlap config bug by @ForFishes in #8128
- Commented out autonlp test by @lugimzzz in #8110
- add rslora & lora+ by @wtmlon in #8111
- adapter new type promotion rule for Paddle 2.6 by @zxcd in #8079
- [benchmark]add auto_pir case by @Liujie0926 in #8144
- [Unified Checkpoint] Fix tie_weights save and load by @DesmonDay in #8137
- [BugFix] fix test_sample_generate bug by @ZHUI in #8157
- support mc2 for mp lora. by @wuhuachaocoding in #8161
- Replace Sequence Parallel to Paddle Sequence Parallel by @iosmers in #7966
- Trainer json args-parser supports raise error by @gongel in #8163
- [Paddle-pipelines] Add pytorch retrieval model tutorials by @w5688414 in #8159
- [sharding] Add arg of disabling sharding reduce_avg for accuracy verification by @haohongxiang in #8168
- [LoRA] add quick_lora by @JunnYu in #8106
- fix read-data timer when ignore_data_skip=False and skip_profile_timer=False by @GuoxiaWang in #8177
- Fix FusedLinearWithGradAdd bug by @MarioLulab in #8178
- adapt to npu FA. by @wuhuachaocoding in #8171
- add long sequence strategies by @WAI-clear in #8076
- [Trainer] Saving rng state not seed. by @ZHUI in #8185
- 【AutoParallel】Change llama in auto-parallel by @heavyrain-lzy in #8151
- [CI] Disable unit tests that download from HF and AIStudio by @JunnYu in #8198
- 【AutoParallel】Change the dtype of initializing the model by @heavyrain-lzy in #8199
- [Paddle-Pipelines] Add matryoshka representation learning by @w5688414 in #8165
- update for npu. by @wuhuachaocoding in #8210
- [Paddle-pipelines] remove ._static_mode for static model by @w5688414 in #8214
- Support sharding for auto_trainer by @zhangbo9674 in #8164
- [Cherry-pick] [Distributed] Support pp non batch comm (#8097) by @SylarTiaNII in #8222
- add finetune fused & add mc2 by @NINGBENZHE in #8139
- Add checkpoint_done by @gongel in #8223
- Support GQA for auto parallel by @zhangbo9674 in #8234
- bug fix for pure sharding with [fp16 + main_grad] by @FeixLiu in #8238
- [BugFix][NPU] fix llama fa bug by @tianhaodongbd in #8237
- [AutoParallel] support GPT for auto_parallel by @liym27 in #8160
- [Cherry-pick] [LLM] add decay steps option for finetuning by @SylarTiaNII in #8251
- Pissa by @wtmlon in #8250
- Optimize llm/GPT3 performance by @MarioLulab in #8172
- [BUG] fix to_static by @JunnYu in #8194
- Add DeBERTa model by @w5688414 in #8227
- [GPT bugs]Fix gpt download bug by @w5688414 in #8253
- Fix timer for NPU&XPU by @KB-Ding in #8261
- [lora]cherry-pick add scaling by @lugimzzz in #8264
- Upgrade paddlenlp to 2.8.0 by @w5688414 in #8266
- [BugFix] Try except sequence parallel utils (#8189) by @DesmonDay in #8274
- Support Llama3 by @ZHUI in #8315
- bug fixer (#8314) by @FeixLiu in #8318
New Contributors
- @DanGuge made their first contribution in #7644
- @greycooker made their first contribution in #7768
- @ZeyuTeng96 made their first contribution in #7832
- @shenghwa made their first contribution in #7836
- @ziangqin-baidu made their first contribution in #7823
- @MarioLulab made their first contribution in #7882
- @ErnestinaQiu made their first contribution in #7660
- @Difers made their first contribution in #7950
- @SebastjanPrachovskij made their first contribution in #7936
- @LOVE-YOURSELF-1 made their first contribution in #8020
- @PeiyuLau made their first contribution in #8099
- @deepllz made their first contribution in #8127
- @liym27 made their first contribution in #8160
Full Changelog: v2.7.2...v2.8.0