[Feature] Integrated Training and Inference -- Part 1 #532

Open
wants to merge 32 commits into base: main
Conversation

pppppM (Collaborator) commented Mar 29, 2024

Load the model & Chat example: xtuner/model/auto.py

Train alpaca:

# Convert the alpaca dataset to OpenAI-format JSON
python xtuner/tools/convert_dataset.py tatsu-lab/alpaca alpaca --save-dir converted_alpaca
xtuner train xtuner/configs/internlm/internlm2_chat_1_8b/example.py
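For reference, a minimal sketch of what one converted record might look like, assuming "OpenAI format" here means the usual chat-messages schema; the exact field names emitted by convert_dataset.py are not shown in this thread:

# Illustrative only: field names are assumptions, not taken from convert_dataset.py.
sample = {
    'messages': [
        {'role': 'user', 'content': 'Give three tips for staying healthy.'},
        {'role': 'assistant', 'content': '1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep.'},
    ]
}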

HIT-cwh and others added 4 commits March 29, 2024 18:10
* support sequence

* add configs

* add sp example to custom dataset

* WIP

* add dispatch utils

* delete useless codes

* move xtuner/engine/sequence_parallel to xtuner/parallel/sequence

* fix lint

* fix lint

* add init_dist to xtuner and add trust_remote_code=True to AutoConfig

* add internlm2 custom_dataset sp4 config

* Sequence Parallel doc V1

* Sequence Parallel doc V1

* Sequence Parallel doc V1

* fix bugs in llama_varlen_attn_forward

* rename indexes to position_ids

* add attn_implementation to config

* delete useless codes

* fix lint

* refine default_collate_fn

* refine doc

* refine doc

* refine doc

* delete replace_internlm2_rote

* add repeat_kv_bshd

* fix apply_rotary_pos_emb bug

* add enable_sequence_parallel flag

* refine doc

* assert {'input_ids', 'labels'}.issubset(dataset.column_names)

* refine doc
xtuner/model/base.py (resolved)
xtuner/model/base.py (outdated, resolved)
xtuner/model/base.py (outdated, resolved)
xtuner/model/base.py (outdated, resolved)
xtuner/model/base.py (resolved)
xtuner/model/text/finetune.py (resolved)
xtuner/dataset/hybrid/collate.py (outdated, resolved)
pppppM requested review from hhaAndroid, LZHgrla and HIT-cwh, April 7, 2024 09:58
attn_kwargs = cls._flash_attn_kwargs(config)
kwargs.update(attn_kwargs)

if torch.cuda.is_bf16_supported():
Collaborator:

Written this way, won't users be unable to change the model type (the torch_dtype here) through the config or input arguments?
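A minimal sketch of one way to make this overridable, assuming a new dtype argument (the name is hypothetical) that defaults to automatic selection:

import torch

def _select_dtype(dtype: str = 'auto') -> torch.dtype:
    # 'auto' reproduces the current behaviour; any explicit value overrides it.
    if dtype == 'auto':
        return torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    return getattr(torch, dtype)  # e.g. 'float32', 'float16', 'bfloat16'

# e.g. kwargs.update(torch_dtype=_select_dtype(cfg.get('dtype', 'auto')))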

return model

@staticmethod
def _flash_attn_kwargs(config):
Collaborator:

If a user adds a new LLM of their own, how should this field be modified? Or rather, how would the user know it needs to be modified?

Collaborator:

This is mainly to ensure the attn_mask shape is correct (flash_attn, sdpa, and plain attention may expect different attn_mask shapes).
We could later move _built_in_flash_attn_1 and _built_in_flash_attn_2 somewhere else, and then write a doc explaining what needs to be considered when adding a new model.
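A rough sketch of the idea described here; the availability checks and the returned keyword are assumptions for illustration, not the PR's actual implementation:

import torch
from transformers.utils import is_flash_attn_2_available

def _flash_attn_kwargs_sketch(config):
    # `config` is unused in this simplified sketch.
    # Pick an attention backend the environment actually supports, so the
    # model code can build the matching attn_mask shape
    # (flash_attn vs. sdpa vs. plain/eager attention).
    if is_flash_attn_2_available():
        return {'attn_implementation': 'flash_attention_2'}
    if hasattr(torch.nn.functional, 'scaled_dot_product_attention'):
        return {'attn_implementation': 'sdpa'}
    return {'attn_implementation': 'eager'}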

xtuner/model/text/finetune.py (resolved)
xtuner/dataset/hybrid/dataset.py (outdated, resolved)
from pydantic import BaseModel


class SampleParams(BaseModel):
Collaborator:

Can this object be modified in the config? Also keep in mind that during evaluation this parameter differs across datasets, so it needs to be passed to the model on the fly at evaluation time.
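For context, a sketch of what such a pydantic object and a per-dataset override could look like; the field names and the call at the end are assumptions for illustration:

from pydantic import BaseModel

class SampleParams(BaseModel):
    # Illustrative fields and defaults; the real definition lives in this PR.
    max_new_tokens: int = 512
    temperature: float = 0.8
    top_p: float = 0.95
    top_k: int = 40

# Per-dataset override passed to the model at evaluation time (hypothetical API):
# model.chat(prompt, sample_params=SampleParams(temperature=0.0, max_new_tokens=100))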

xtuner/model/auto.py (outdated, resolved)
checkpoint: str,
config: Optional[str] = None,
from_hub: bool = False):
config = Config.fromfile(config)
LZHgrla (Collaborator) commented Apr 7, 2024:

Shouldn't there be a check here on whether config is None, to go along with the if-else branch below?
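A minimal sketch of the guard being suggested, assuming the checkpoint itself can supply the config when none is passed in:

if config is not None:
    config = Config.fromfile(config)
# else: fall back to the config stored alongside the checkpoint (handled by the branch below)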

xtuner/model/auto.py (outdated, resolved)
HIT-cwh (Collaborator) commented Apr 8, 2024:

Load the model & Chat example: xtuner/model/auto.py

Train alpaca:

# Convert the alpaca dataset to OpenAI-format JSON
python xtuner/tools/convert_dataset.py tatsu-lab/alpaca alpaca --save-dir converted_alpaca
xtuner train xtuner/configs/internlm/internlm2_chat_1_8b/example.py

If I want to train on Alpaca and Alpaca-zh together, should I convert them separately and then use ConcatDataset, or convert them together?
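One possible workflow, sketched as an illustration rather than an authoritative answer: convert each dataset separately and point a single dataset config at both converted files, since data_files and sample_ratio are shown later in this thread to accept lists. The class name and paths below are hypothetical:

dataset = dict(
    type=TextDataset,  # hypothetical class name; see xtuner/dataset/hybrid/dataset.py
    data_files=[
        'converted_alpaca/converted.json',     # hypothetical output paths
        'converted_alpaca_zh/converted.json',
    ],
    sample_ratio=[1.0, 1.0],
    max_length=2048,
    pack_to_max_length=True,
)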

position_ids.append(torch.arange(chunk_tokens))
position_ids = torch.cat(position_ids, dim=0).unsqueeze(0)

from mmengine import MessageHub
Collaborator (Author):

Code placement.

xtuner/dataset/hybrid/_pack.py (outdated, resolved)
def main():
args = parse_args()

dataset = load_dataset(path=args.path)
Collaborator (Author):

There are many ways to load the dataset.

Collaborator (Author):

Support existing users passing in an old-style config and converting their data to the new format.

else:
raise RuntimeError

model: BaseAlgorithm = BUILDER.build(config.model)
LZHgrla (Collaborator) commented Apr 8, 2024:

This step automatically downloads the un-finetuned model; we should find a way to avoid that.

assert eos_token_ids is not None, \
'Please set eos_token for Qwen tokenizer!'
elif tokenizer.__class__.__name__ == 'ChatGLMTokenizer':
eos_token_ids = tokenizer.eos_token_id
Collaborator:

Is there any difference between this if branch and the else below?


shard = converted.select(range(begin, end)).to_list()
with open(save_path, 'w') as f:
json.dump(shard, f)
Collaborator:

Suggested change:
- json.dump(shard, f)
+ json.dump(shard, f, indent=2)

chat_template: Union[Dict, ChatTemplate],
sample_ratio: Union[float, List[float]] = 1.0,
max_length: int = 2048,
pack_to_max_length: bool = True,
Collaborator:

Add a shuffle_before_pack parameter?

Collaborator (Author):

The current default is already shuffle-before-pack. Is there any scenario where one would not want to shuffle before packing?

Collaborator:

In pretraining scenarios, data from the same context is often stored consecutively, and some users will want those samples to stay adjacent.
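A sketch of what such a flag could look like in the packing step; the function name and structure are illustrative, not the PR's implementation:

import random

def pack_samples(samples, max_length, shuffle_before_pack=True, seed=0):
    # Optionally keep the original order so samples from the same
    # pretraining context stay adjacent when packed together.
    if shuffle_before_pack:
        random.Random(seed).shuffle(samples)
    packed, chunk, chunk_len = [], [], 0
    for sample in samples:
        n = len(sample['input_ids'])
        if chunk and chunk_len + n > max_length:
            packed.append(chunk)
            chunk, chunk_len = [], 0
        chunk.append(sample)
        chunk_len += n
    if chunk:
        packed.append(chunk)
    return packed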

Comment on lines +190 to +195
if isinstance(sample_ratio, (list, tuple)):
if len(sample_ratio) != len(data_files):
raise ValueError('The length of `sample_ratio`'
f'({len(sample_ratio)}) should be the same '
'as the length of `data_files`'
f'({len(data_files)})')
Collaborator:

When data_files is None and the data is passed via data_dir, this will raise an error. Consider converting data_dir into data_files before this point.
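A sketch of the normalization being suggested; the glob pattern and helper name are assumptions:

from pathlib import Path

def resolve_data_files(data_dir=None, data_files=None):
    # Turn data_dir into an explicit file list before the sample_ratio
    # length check, so both input styles go through the same code path.
    if data_files is None and data_dir is not None:
        data_files = sorted(str(p) for p in Path(data_dir).glob('*.json'))
    elif isinstance(data_files, str):
        data_files = [data_files]
    return data_files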

return dataset

def filter_non_labels_data(self, dataset: List[dict]) -> List[dict]:
"""Filter the data which all labels are ignore.
LZHgrla (Collaborator) commented Apr 16, 2024:

Suggested change:
- """Filter the data which all labels are ignore.
+ """Filter out data that do not contain valid labels.

Comment on lines +447 to +448
f'Filtered {ori_samples - new_samples} samples '
'(all labels are ignore)',
Collaborator:

Suggested change:
- f'Filtered {ori_samples - new_samples} samples '
- '(all labels are ignore)',
+ f'Filtered {ori_samples - new_samples} samples '
+ 'that do not contain valid labels.',

Comment on lines +224 to +227
if torch.cuda.is_bf16_supported():
kwargs.update(torch_dtype=torch.bfloat16)
else:
kwargs.update(torch_dtype=torch.float16)
LZHgrla (Collaborator) commented Apr 16, 2024:

If DeepSpeed is not used and a plain AMP optimizer is used directly, it raises an error.

# set to bf16
RuntimeError: "_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'
# set to fp16
ValueError: Attempting to unscale FP16 gradients.
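A sketch of one possible guard, assuming a hypothetical use_deepspeed flag is available at model-construction time; a plain AMP optim wrapper unscales gradients with a GradScaler, which expects float32 master weights:

import torch

def _training_dtype(use_deepspeed: bool) -> torch.dtype:
    if not use_deepspeed:
        # Keep float32 weights and let autocast handle mixed precision,
        # avoiding the unscale errors shown above.
        return torch.float32
    return torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

# e.g. kwargs.update(torch_dtype=_training_dtype(use_deepspeed))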

runner.logger.info(f'(ChatHook {position}){answer}')

def before_train(self, runner: Union[Runner, FlexibleRunner]):
runner.logger.info('before_train in EvaluateChatHook.')
Collaborator:

Suggested change:
- runner.logger.info('before_train in EvaluateChatHook.')
+ runner.logger.info('before_train in ChatHook.')

LZHgrla (Collaborator) commented Apr 16, 2024:

Training cannot be run from the config saved in work_dirs; I'm currently stuck at TypeError: collate_fn should be a dict or callable object, but got xtuner.model.TextFinetune.dataloader_collate_fn.
Is it worth fixing? (It seems hard to fix, since this logic belongs to the dataloader and is presumably wrapped inside mmengine.)

super().__init__()

self.llm = llm
self.llm.cuda()
Collaborator:

For a quantized model, wouldn't calling .cuda() directly cause problems?
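A sketch of a guard for that case, using the attributes that bitsandbytes-quantized transformers models expose; whether such a check belongs here is a design question for the PR:

# Quantized models loaded with a device_map are already placed on the GPU;
# calling .cuda() on them again can raise an error.
is_quantized = getattr(llm, 'is_loaded_in_4bit', False) or \
               getattr(llm, 'is_loaded_in_8bit', False)
if not is_quantized:
    llm.cuda()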

# PART 2 Model & Tokenizer #
#######################################################################
model = dict(
type=TextFinetune,
Collaborator:

Pass in use_varlen_attn.
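A sketch of the requested addition; whether TextFinetune accepts this keyword is an assumption based on xtuner's existing use_varlen_attn configs:

model = dict(
    type=TextFinetune,
    use_varlen_attn=True,  # assumed keyword name
    # ... remaining fields unchanged from the config above
)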

HIT-cwh (Collaborator) commented Apr 16, 2024:

The changes from PR 567 need to be synced into this branch.
