mC4 sampling & pre-processing #61

Open
sbmaruf opened this issue Aug 17, 2022 · 1 comment

Comments

@sbmaruf
Contributor

sbmaruf commented Aug 17, 2022

Hi @TevenLeScao,

I think there are some confusing and broken links in the mC4 data preprocessing section. Could you take a look?

Both of these links are broken:

  1. mc4_preprocessing
  2. mc4_sampled_raw

The original links should be:

  1. mc4_preprocessing
  2. mc4_sampled_raw

In addition, the multinomial data processing code that creates the different language splits is in this pull request: bigscience-workshop/Megatron-DeepSpeed#9
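
For readers unfamiliar with the term, here is a minimal sketch of what multinomial language sampling usually means in multilingual pretraining: each language's share of the corpus is raised to an exponent `alpha` and renormalized, so low-resource languages are sampled more often than their raw share. This is not the code from the pull request; the function name, the `alpha` value, and the document counts are all assumptions made up for illustration.

```python
def multinomial_sampling_probs(doc_counts, alpha=0.3):
    """Return per-language sampling probabilities q_i proportional to p_i ** alpha.

    doc_counts: mapping of language code -> number of documents (hypothetical).
    alpha: smoothing exponent (assumed value); alpha < 1 flattens the
           distribution, alpha = 1 keeps the raw corpus shares.
    """
    total = sum(doc_counts.values())
    # Empirical share of each language in the raw corpus.
    shares = {lang: n / total for lang, n in doc_counts.items()}
    # Exponentiate, then renormalize so the probabilities sum to 1.
    weights = {lang: p ** alpha for lang, p in shares.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical document counts, for illustration only.
counts = {"en": 1_000_000, "fr": 100_000, "sw": 1_000}
probs = multinomial_sampling_probs(counts, alpha=0.3)
```

With `alpha` below 1, the smallest language in `counts` receives a larger sampling probability than its raw share, while the ordering of languages by size is preserved.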

Here are a few things:

  1. Did you use this data for any of your experiments?
  2. If not, I think you can update the doc: https://github.com/bigscience-workshop/bigscience/tree/master/data/mc4

If you want to keep the code for reference, I'm happy to open a pull request here. If not, I'll close the pull request in the bigscience-workshop/Megatron-DeepSpeed repo.

Let me know what you think.

@sbmaruf sbmaruf changed the title MC4 Pre-processing mC4 sampling & pre-processing Aug 17, 2022
@TevenLeScao
Collaborator

TevenLeScao commented Aug 18, 2022

We did use mC4 for early multilingual experiments before switching to OSCAR, so let's keep the code for future reference. Thanks for catching this!
