Could you please provide some details about the difference between the Megatron-LM tokenizer and the HF tokenizer? #2

yeyunhu opened this issue Jun 29, 2023 · 2 comments



yeyunhu commented Jun 29, 2023

  1. There are some differences between the Megatron-LM tokenizer and the HF tokenizer; a short comparison sketch follows this list. This is the preprocessing command I used:
python llama/tools/preprocess_data.py \
       --input /mnt/workspace/{}.json \
       --output-prefix  \
       --vocab-file gpt2-vocab.json \
       --dataset-impl mmap \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file gpt2-merges.txt \
       --append-eod
  2. I am also confused about the tokenizer file provided in this repo under llama/tokenizer, which is different from the HF one.
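
For reference, here is a minimal sketch that loads both tokenizers through the Hugging Face transformers library and shows why the two are not interchangeable; "gpt2" is the public hub id and "llama_path" is a placeholder for a local converted Llama directory.

```python
# Minimal sketch comparing the GPT-2 BPE tokenizer (what Megatron-LM's
# GPT2BPETokenizer uses via gpt2-vocab.json / gpt2-merges.txt) with the Llama
# SentencePiece tokenizer, both loaded through Hugging Face transformers.
# "llama_path" is a placeholder for a local converted Llama directory.
from transformers import AutoTokenizer

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
llama_tok = AutoTokenizer.from_pretrained("llama_path")

text = "Megatron-LM and Hugging Face tokenizers are not interchangeable."
print("GPT-2 vocab size:", gpt2_tok.vocab_size)   # 50257
print("Llama vocab size:", llama_tok.vocab_size)  # 32000 for the original Llama
print("GPT-2 ids:", gpt2_tok.encode(text))
print("Llama ids:", llama_tok.encode(text))
# The id sequences and vocabularies differ, so data preprocessed with
# GPT2BPETokenizer cannot be fed to a model that expects the Llama vocabulary.
```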
yeyunhu (author) commented Jun 29, 2023

My converted Llama model (Hugging Face format) looks like this:

llama/
- config.json
- generation_config.json
- pytorch_model-00001-of-00002.bin
- pytorch_model-00002-of-00002.bin
- pytorch_model.bin.index.json
- special_tokens_map.json
- tokenizer.json
- tokenizer.model
- tokenizer_config.json
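
A quick sanity check, assuming the directory is the llama/ folder listed above, that the converted weights and tokenizer load together through the standard Hugging Face API:

```python
# Quick sanity check that the converted files above load together through the
# Hugging Face API. "llama/" is the directory listed above; the prompt is
# arbitrary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "llama/"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16)

inputs = tokenizer("Hello, Llama!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# logits shape is (batch, seq_len, vocab_size); vocab_size should match
# len(tokenizer) built from tokenizer.model / tokenizer.json.
print(outputs.logits.shape)
```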

MoFHeka (owner) commented Jul 14, 2023

You could write a new custom_pretrain_llama.py to add the HF tokenizer to the training step. Add it in the build_train_iterable_loaders function or somewhere else; see the sketch below.
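
For example, a rough sketch of such an adapter, assuming the attribute names Megatron-LM's built-in tokenizers expose (tokenize, detokenize, vocab_size, eod); the _HFTokenizerAdapter class and the place it gets wired in are hypothetical, and the exact interface may differ between Megatron-LM versions:

```python
# Rough sketch of wrapping a Hugging Face tokenizer so it can stand in for a
# Megatron-LM tokenizer inside a custom custom_pretrain_llama.py. The
# _HFTokenizerAdapter class and its registration point are hypothetical; the
# attribute names follow the interface Megatron-LM's built-in tokenizers
# expose, which may differ between versions.
from transformers import AutoTokenizer


class _HFTokenizerAdapter:
    def __init__(self, pretrained_dir):
        self._tok = AutoTokenizer.from_pretrained(pretrained_dir)

    @property
    def vocab_size(self):
        return len(self._tok)

    @property
    def eod(self):
        # Reuse the HF end-of-sequence id as Megatron's end-of-document token.
        return self._tok.eos_token_id

    def tokenize(self, text):
        return self._tok.encode(text)

    def detokenize(self, token_ids):
        return self._tok.decode(token_ids)


# Example: build the adapter before constructing the data loaders (e.g. inside
# build_train_iterable_loaders) and pass it wherever the GPT2BPETokenizer
# instance was used before.
if __name__ == "__main__":
    tok = _HFTokenizerAdapter("llama/")  # path from the comment above
    ids = tok.tokenize("hello world")
    print(ids, tok.detokenize(ids), tok.vocab_size, tok.eod)
```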
