v2.8.0
We are pleased to announce the v2.8.0 release of the PaddlePaddle (飞桨) large language model toolkit. In this release, we deeply optimized the toolkit's LLM fine-tuning and alignment capabilities and improved its training and inference support on domestic computing hardware. The main work is as follows:
- Specialized fine-tuning and efficient alignment: provides the in-house, fast-converging RsLoRA+ algorithm, which substantially improves PEFT training convergence speed and quality (a minimal scaling sketch follows this list); introduces high-performance generation acceleration into the RLHF PPO algorithm, removing the generation-speed bottleneck in PPO training and delivering a large lead in PPO training performance.
- Faster LLM training: generalized support for multiple training performance optimizations such as FastFNN and FusedQKV, making LLM training faster and more stable.
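The release notes ship no code for RsLoRA+, so the snippet below is only a minimal, hypothetical sketch of the rank-stabilized scaling idea that rsLoRA-style methods build on: the low-rank update is scaled by alpha / sqrt(r) instead of plain LoRA's alpha / r. The class name and layer structure are illustrative and do not reflect PaddleNLP's actual PEFT implementation.

```python
import math

import paddle
import paddle.nn as nn


class RsLoRALinear(nn.Layer):
    """Illustrative LoRA linear layer with rank-stabilized scaling.

    Plain LoRA scales the low-rank update by alpha / r; the rank-stabilized
    variant uses alpha / sqrt(r), which keeps the update magnitude stable as
    the rank grows and tends to converge faster at higher ranks.
    """

    def __init__(self, in_features, out_features, r=8, lora_alpha=16):
        super().__init__()
        self.weight = self.create_parameter([in_features, out_features])
        self.lora_A = self.create_parameter([in_features, r])
        # B starts at zero so the adapter is a no-op before training.
        self.lora_B = self.create_parameter(
            [r, out_features], default_initializer=nn.initializer.Constant(0.0)
        )
        # rsLoRA scaling; plain LoRA would use lora_alpha / r.
        self.scaling = lora_alpha / math.sqrt(r)

    def forward(self, x):
        base = paddle.matmul(x, self.weight)
        update = paddle.matmul(paddle.matmul(x, self.lora_A), self.lora_B)
        return base + update * self.scaling
```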
LLM Fine-tuning, Alignment, Training, and Inference Optimizations
- Fine-tuning
- Inference
- Added static-graph inference for QWenVL #7808
New Models
- Added QWenVL static-graph inference #7808
- Added DeBERTa and DeBERTa-v2 models #8227
- deepset/deberta-v3-large-squad2
- microsoft/deberta-v2-xlarge
- microsoft/deberta-v3-base
- microsoft/deberta-v3-large
- microsoft/deberta-base
- Added Mixtral mixture-of-experts models #7803
- mistralai/Mixtral-8x7B-Instruct-v0.1
- mistralai/Mixtral-8x7B-v0.1
- Added Llama3 #8315
- meta-llama/Meta-Llama-3-8B
- meta-llama/Meta-Llama-3-8B-Instruct
- meta-llama/Meta-Llama-3-70B
- meta-llama/Meta-Llama-3-70B-Instruct
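As a usage note (not part of the release notes themselves), the newly added checkpoints can be loaded through PaddleNLP's Auto classes. The `dtype` keyword, generation arguments, and return handling below are assumptions that may need adjusting for your installed version.

```python
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# dtype kwarg assumed; drop or change it if your version does not accept it.
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="bfloat16")

inputs = tokenizer("Write a short poem about the sea.", return_tensors="pd")
# PaddleNLP's generate() is assumed to return (generated_ids, scores).
output_ids, _ = model.generate(**inputs, max_length=64)
print(tokenizer.decode(output_ids[0].tolist(), skip_special_tokens=True))
```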
Base Framework Upgrades
- Trainer upgrades
- AutoParallel upgrades
- Other
Other Support
- Added a matryoshka representation learning retrieval strategy to save compute and storage resources (see the sketch below). #8165
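A minimal sketch of the matryoshka idea, not the toolkit's actual implementation: embeddings trained this way keep their leading sub-dimensions usable, so retrieval vectors can be truncated and re-normalized to shrink the index and cheapen similarity search. The random arrays stand in for whatever embedding model produces the vectors.

```python
import numpy as np


def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)


# Stand-ins for encode(corpus) / encode(query) from a matryoshka-trained model.
corpus_emb = np.random.randn(1000, 768).astype("float32")
query_emb = np.random.randn(1, 768).astype("float32")

# Index only the first 256 dimensions: ~3x less storage, cheaper dot products.
corpus_small = truncate_and_normalize(corpus_emb, 256)
query_small = truncate_and_normalize(query_emb, 256)
scores = corpus_small @ query_small.T   # cosine similarity on unit vectors
top_k = np.argsort(-scores[:, 0])[:5]   # indices of the 5 best matches
```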
Bug Fixes
- Adjusted log levels and added a timelog timer, compatible across devices. #8261
- Fixed inconsistent randomly initialized shared weights in pipeline parallelism, covering models such as GPT and OPT. #7772
- Disabled downloading from the Hugging Face Hub in CI and unit tests #7798 #8198
- Fixed duplicated concatenation of query and history in the LLM Gradio UI when the chat template is enabled. #7992
- Fixed a key error when downloading GPT models. #8253
- Fixed LlamaRotaryEmbedding #7882
- Fixed allreduce tensor dtype issue #7876
- Fixed an issue caused by the framework's dev branch removing the paddle.jit.dy2static.utils_helper API #7989
- Fixed the read-data timer when ignore_data_skip=False and skip_profile_timer=False. #8177
- Fixed Wandb unit test issues #8066 #8056
- Fixed an error when the Trainer parses JSON files and command-line list arguments at the same time #7860
- Fixed inference issues in the Gradio UI #7740 #7788
- Fixed basic tokenizer-related issues #7797 #7870
- Fixed loading RNG state on custom devices. #7894
- Fixed garbled output when printing BF16 loss in auto parallelism #7874
- Initialize the model in float to fix AMP errors in static-graph auto parallelism #8033 #8199
- Fixed incorrect usage of the ShardDataloader interface under pipeline parallelism #8014
- Fixed Llama precision issues on custom devices. #7895
- Fixed NPU AICPU operator issues #7976
- Fixed missing arguments in FusedLinearWithGradAdd. #8178
What's Changed
- [Unified Checkpoint] Add unified checkpoint training args doc. by @DesmonDay in #7756
- [AutoParallel] Auto Trans PP to VPP by @zhaoyinglia in #7747
- Add codecov check by @zjjlivein in #7760
- [CE] Delete gpt_for_sequence_classification by @ZHUI in #7757
- [DOC] Update trainer.md by @ZHUI in #7761
- [Release] Change version to 2.7.0 by @ZHUI in #7764
- [benchmark]close skip_memory_metrics for ips by @Liujie0926 in #7732
- [Release] Update release.yml to release tags by @ZHUI in #7765
- [AutoParallel] Add Sequence Parallel for Static LLaMA by @JZ-LIANG in #7746
- [New Features] support dynamic src_length by @wj-Mcat in #7740
- Fix unified_checkpoint bug by @DrownFish19 in #7770
- [DONE] aistudio, hf hub, bos update download by @JunnYu in #7608
- [Trainer] Fix dist dataloader eval by @DesmonDay in #7777
- [Paddle-pipelines] Update convert_files_to_dicts_splitter by @w5688414 in #7748
- [PEFT]fix lora model tp when existing other trainable module by @lugimzzz in #7781
- [Paddle-Pipelines] update faiss by @qingzhong1 in #7793
- Fix shared weights sync for PipelineLayer by @DrownFish19 in #7772
- [tests] download slow by @JunnYu in #7798
- [INFER][LLM] Support qwen in fined grained dybatch v1 by @DanGuge in #7644
- Add CE for Distributed Hybrid Parallel by @iosmers in #7782
- add MP2-SP2-pp4-vpp2-SD2-stage1-mbs2-acc8 ce by @tianhaodongbd in #7774
- [Pretrain] Fix eval during pretrain by @DesmonDay in #7806
- pipeline parallel benchmark by @zhangting2020 in #7759
- [Bug fixes] fix br gradio by @wj-Mcat in #7788
- delete useless code for write_cache_kv.cu by @yuanlehome in #7812
- [llm]support qlora pp by @lugimzzz in #7801
- Trainer support simultaneously parse JSON files and cmd arguments. by @greycooker in #7768
- [LLM] Support block_attention/cachekv quant for llama by @RichardWooSJTU in #7649
- [Bug Fix] fix paddle multipy_fwd_func warning message by @BeingGod in #7818
- [llm]fix lora by @lugimzzz in #7824
- fused rms spmd by @liuzhenhai93 in #7830
- [Pretrain] Fix eval during pretrain by @DesmonDay in #7827
- [neural search][fix bug of evaluate.py] by @ZeyuTeng96 in #7832
- [neural search] fix the bug of reading files when calculating the recall scores by @shenghwa in #7836
- [Bug fixes] update chatglm tokenizer by @wj-Mcat in #7797
- [semantic_indexing] fix bug of evaluate.py by @ZeyuTeng96 in #7843
- [faq] fix bug of evaluate.py by @ZeyuTeng96 in #7840
- [text_classification_retrieval_based] fix bug of evaluate.py by @ZeyuTeng96 in #7844
- [LLM] add Qwen-7B-Chat to PaddleNLP unit test by @ziangqin-baidu in #7823
- Support 5.2 bloom by @zhoutianzi666 in #7846
- [unified checkpoint] Fix last checkpoint save by @DrownFish19 in #7854
- [unified checkpoint] fix checkpoint names by @DrownFish19 in #7795
- [New Features]add ranks testing for test_predictor by @wj-Mcat in #7800
- [Auto Parallel] Support dynamic semi-auto training in Llama2 model by @haohongxiang in #7851
- [CI] add ci approval pipelines by @zjjlivein in #7859
- [fix] fix a bug of trainer/argparser.py by @greycooker in #7860
- [Improvement] fix ops improting in utils by @wj-Mcat in #7865
- [Add CE] Add CE for Hybrid Parallism by @iosmers in #7817
- [Unified Checkpoint] Cherry pick empty cache. by @ZHUI in #7868
- Add PPO training. by @guoshengCS in #7305
- Update reward_main.py by @wawltor in #7880
- Update ppo_main.py by @wawltor in #7881
- [LLM] revert benchmark codes by @RichardWooSJTU in #7871
- [LLM]support QWenVL second part by @DanGuge in #7808
- [Bug Fixes] update chatglm1 tokenizer by @wj-Mcat in #7870
- 【AutoParallel】Support 'master_grad' in Llama in static auto-parallelism by @heavyrain-lzy in #7658
- [Bug Fix] fix slice bug in LlamaRotaryEmbedding by @MarioLulab in #7882
- 【AutoParallel】Support bf16 loss in static by @heavyrain-lzy in #7874
- [Bug Fix] fix allreduce tensor dtype by @BeingGod in #7876
- [CE] Add Qwen into CE process by @ziangqin-baidu in #7887
- [Hackathon 5th No.73] ToT by @ErnestinaQiu in #7660
- [CustomDevice] fix loading rng state on custom devices by @SylarTiaNII in #7894
- [LLM] fix llama precision on custom devices by @SylarTiaNII in #7895
- [AutoConfig]add benchmark scripts by @Liujie0926 in #7897
- [RELEASE] Update README.md by @ZHUI in #7834
- add qwen benchmark by @wtmlon in #7758
- [Trainer] Refactor by @ZHUI in #7909
- [CE]add gpt sharding_v2 case by @Liujie0926 in #7914
- [Improvement] fix logger level by @KB-Ding in #7903
- RuntimeTimer for the toolkit by @KB-Ding in #7913
- [New Features] Trainer add Wandb and Tensorboard by @greycooker in #7863
- [Bug Fix] Fix timer device by @KB-Ding in #7939
- [Auto Parallel] Support semi-auto trainer and fit Llama2 training by @haohongxiang in #7885
- gqa fuse attention qkv by @FeixLiu in #7890
- rename files and add readme for llama auto_parallel by @zhiqiu in #7944
- [Trainer] Skip some trainer test. by @ZHUI in #7949
- [Unified checkpoint] Turn off unified checkpoint when using sharding stage3 by @DesmonDay in #7969
- [Text Matching] Update text matching by @w5688414 in #7973
- Fix NPU AICPU operator issues by @NINGBENZHE in #7976
- [Unified Checkpoint] Fix multi-node output share-folder by @DesmonDay in #7977
- Add SwiGLU operator by @sneaxiy in #7967
- [model_zoo/gpt-3] Fix bugs from PR-61236 which cleared paddle.jit.dy2static.utils_helper by @haohongxiang in #7989
- 【AutoParallel】Add semi autoparallel amp by @heavyrain-lzy in #7985
- [Trainer] ignore_save_lr_and_optim by @JunnYu in #7978
- [Gradio] fix llm gradio multi-turn dialogue bug by @JunnYu in #7992
- support GQA by @zhangting2020 in #7906
- [AutoConfig]add N1C8_resume by @Difers in #7950
- [AutoConfig]add N2C16 by @Liujie0926 in #7915
- [Unified Checkpoint] Add document by @DesmonDay in #7961
- Add SearchApi integration by @SebastjanPrachovskij in #7936
- add autotuner buffer check ce case by @Difers in #7993
- [Unified Checkpoint] Support peft model by @DesmonDay in #7691
- [DATA] Remove repeated chars during preprocessing by @DrownFish19 in #7739
- 【AutoParalle】construct model using float32 in "amp-o2" by @heavyrain-lzy in #8033
- support the loss mask for the pretrain by @wawltor in #8034
- [Mixtral] Add mixtral moe by @DesmonDay in #7803
- [CI] fix test ptuning by @zjjlivein in #8040
- Add SwiGLU for auto Llama by @From00 in #8038
- Fix _cache_founf_inf by @co63oc in #7997
- 【AutoParallelism】fix dataloader bug and add ci for static by @heavyrain-lzy in #8014
- fix the index_dataset with old data format by @wawltor in #8049
- Fit sharding optimization for auto parallel llama by @From00 in #8021
- Optimize the log and enable to print the number of tokens each second. by @Xreki in #7853
- 【fix】 fix TestWandbCallback by @greycooker in #8056
- Fit pir flag in predictor by @cyber-pioneer in #8048
- update pp by @lugimzzz in #8059
- Revert "Fit pir flag in predictor" by @zjjlivein in #8065
- [CI]fix ci scripts for distribute by @Liujie0926 in #8063
- unify_criterion_inputs_dynamic_and_static by @liuzhenhai93 in #8053
- [Unified Checkpoint] Fix lora unittest by @DesmonDay in #8070
- fit cinn and pir flag in predictor by @cyber-pioneer in #8071
- Support hybrid_parallel_topo_order for auto parallel Llama by @From00 in #8011
- Download refactor by @LOVE-YOURSELF-1 in #8020
- [Distributed] Add dp_gradient_sync_after_accumulate by @AndSonder in #8045
- [Distributed]Add distributed config for pipeline parallel by @ForFishes in #8051
- [UC] Ignore optimizer when UC by @gongel in #8058
- 【fix】fix TestTensorboardCallback by @greycooker in #8066
- [BugFix]Rm overlap limit in dp & pp by @ForFishes in #8089
- dist dataloader: add cuda compilation check by @PeiyuLau in #8099
- Download----fix new bug by @LOVE-YOURSELF-1 in #8088
- [Bug fixes] convert min_new_token -> min_new_tokens by @wj-Mcat in #7883
- [CI]update llm_gpt loss_base for Paddle#62500 by @Liujie0926 in #8107
- [dist benchmark]add llama2 with autotuner by @Liujie0926 in #8108
- [Trainer] Change num_train_epochs default value by @DesmonDay in #8113
- [BugFix] shutil.rmtree ignore_errors for shared disks between train nodes. by @ZHUI in #8117
- qwen init bug fix by @wtmlon in #8120
- 【AutoParallel】Add strategy with more options by @heavyrain-lzy in #8114
- [AutoParallel] unify llama model by @deepllz in #8127
- [benchmark]add skip_memory_metrics for ce_gpt by @Liujie0926 in #8132
- [Distributed]Fix comm_overlap config bug by @ForFishes in #8128
- Commented out autonlp test by @lugimzzz in #8110
- add rslora & lora+ by @wtmlon in #8111
- adapter new type promotion rule for Paddle 2.6 by @zxcd in #8079
- [benchmark]add auto_pir case by @Liujie0926 in #8144
- [Unified Checkpoint] Fix tie_weights save and load by @DesmonDay in #8137
- [BugFix] fix test_sample_generate bug by @ZHUI in #8157
- support mc2 for mp lora. by @wuhuachaocoding in #8161
- Replace Sequence Parallel to Paddle Sequence Parallel by @iosmers in #7966
- Trainer json args-parser supports raise error by @gongel in #8163
- [Paddle-pipelines] Add pytorch retrieval model tutorials by @w5688414 in #8159
- [sharding] Add arg of disabling sharding reduce_avg for accuracy verification by @haohongxiang in #8168
- [LoRA] add quick_lora by @JunnYu in #8106
- fix read-data timer when ignore_data_skip=False and skip_profile_timer=False by @GuoxiaWang in #8177
- Fix FusedLinearWithGradAdd bug by @MarioLulab in #8178
- adapt to npu FA. by @wuhuachaocoding in #8171
- add long sequence strategies by @WAI-clear in #8076
- [Trainer] Saving rng state not seed. by @ZHUI in #8185
- 【AutoParallel】Change llama in auto-parallel by @heavyrain-lzy in #8151
- [CI] Disable unit tests that download from HF and AIStudio by @JunnYu in #8198
- 【AutoParallel】Change the dtype of initializing the model by @heavyrain-lzy in #8199
- [Paddle-Pipelines] Add matryoshka representation learning by @w5688414 in #8165
- update for npu. by @wuhuachaocoding in #8210
- [Paddle-pipelines] remove ._static_mode for static model by @w5688414 in #8214
- Support sharding for auto_trainer by @zhangbo9674 in #8164
- [Cherry-pick] [Distributed] Support pp non batch comm (#8097) by @SylarTiaNII in #8222
- add finetune fused & add mc2 by @NINGBENZHE in #8139
- Add checkpoint_done by @gongel in #8223
- Support GQA for auto parallel by @zhangbo9674 in #8234
- bug fix for pure sharding with [fp16 + main_grad] by @FeixLiu in #8238
- [BugFix][NPU] fix llama fa bug by @tianhaodongbd in #8237
- [AutoParallel] support GPT for auto_parallel by @liym27 in #8160
- [Cherry-pick] [LLM] add decay steps option for finetuning by @SylarTiaNII in #8251
- Pissa by @wtmlon in #8250
- Optimize llm/GPT3 performance by @MarioLulab in #8172
- [BUG] fix to_static by @JunnYu in #8194
- Add DeBERTa model by @w5688414 in #8227
- [GPT bugs]Fix gpt download bug by @w5688414 in #8253
- Fix timer for NPU&XPU by @KB-Ding in #8261
- [lora]cherry-pick add scaling by @lugimzzz in #8264
- Upgrade paddlenlp to 2.8.0 by @w5688414 in #8266
- [BugFix] Try except sequence parallel utils (#8189) by @DesmonDay in #8274
- Support Llama3 by @ZHUI in #8315
- bug fixer (#8314) by @FeixLiu in #8318
New Contributors
- @DanGuge made their first contribution in #7644
- @greycooker made their first contribution in #7768
- @ZeyuTeng96 made their first contribution in #7832
- @shenghwa made their first contribution in #7836
- @ziangqin-baidu made their first contribution in #7823
- @MarioLulab made their first contribution in #7882
- @ErnestinaQiu made their first contribution in #7660
- @Difers made their first contribution in #7950
- @SebastjanPrachovskij made their first contribution in #7936
- @LOVE-YOURSELF-1 made their first contribution in #8020
- @PeiyuLau made their first contribution in #8099
- @deepllz made their first contribution in #8127
- @liym27 made their first contribution in #8160
Full Changelog: v2.7.2...v2.8.0