Download the C4 dataset! #5056
-
This is great news, and thanks for making it available! C4 is arguably an expensive and difficult dataset to get, and even more so the multilingual part. I work at the National Library of Norway, where I have been trying to download and process mC4 to extract data for training Scandinavian language models (Danish, Icelandic, Norwegian, and Swedish), but just processing the smallest dump was quite expensive (~950€ using Google Dataflow and Tensorflow Datasets), so processing all ~72 dumps is simply not an option. So if y'all have extracted that data and there is a way to access or transfer it, that would be awesome :)
-
Would it be possible to separately provide a manifest of the files with MD5 checksums for validation purposes?
-
Thanks! Would it be possible to add license info?
-
A heads up that you'll have to specify your GCP project to perform requester-pays downloads. So the command should look something like this:
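A minimal sketch, assuming the bucket path from the announcement below; `my-gcp-project` is a placeholder for your own project ID:

```
# -u sets the GCP project that gets billed for this requester-pays download
gsutil -u my-gcp-project -m cp 'gs://allennlp-tensorflow-datasets/c4/en/3.0.1/*' local_datasets_dir/c4/en/3.0.1/
```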
-
To download all:
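Presumably something along these lines, using the Git LFS workflow from the announcement below (the repository URL is taken from there):

```
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull
```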
To download only some, replace the last command with a filtered pull:
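For example, to fetch only the `en` variant (the include pattern here is just an illustration):

```
git lfs pull --include "en/*"
```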
-
We added a new dataset today! Also, we documented some of what we found inside at http://www.cs.cmu.edu/~jessed/data_hosting/documenting_c4.pdf. Go check it out!
-
I've made the 'en' version a public BigQuery dataset if that's helpful for anyone:
-
Please explain what each contains? For example: "en: 800GB in TFDS format, 300GB in JSON format"
-
That's great news! Thank you for making it available.
It seems only
-
That's really awesome! I have a question: what is the dump date of this CC corpus?
-
Thanks a lot for providing this incredible dataset! I'm trying to download the JSON format by cloning the huggingface repo directly to a GCP bucket, but I keep getting the error shown in the image below. Did anyone manage to clone the repo successfully?
-
@dirkgr Can I download a specific language from the mC4 data? I found that the data format is .json.gz.
-
How can you import the downloaded JSON C4 dataset with huggingface?
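One approach that should work is the generic JSON loader in the `datasets` library; the shard name below is just an example of one downloaded file:

```python
from datasets import load_dataset

# The generic "json" loader reads gzipped JSON-lines files directly;
# point data_files at whichever shards you downloaded.
ds = load_dataset(
    "json",
    data_files="en/c4-train.00000-of-01024.json.gz",
    split="train",
)
print(ds[0]["text"])
```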
-
Does anyone know how to retrieve a sentence pair (a true sentence plus a sentence with synthetic grammar errors) from the C4 dataset published on Huggingface?
-
I would like to cite this dataset in a paper I will be publishing shortly. I have Google's C4 citation in there, but I would also like to credit AllenNLP, since this isn't Google's original C4 dataset. Please let me know how to cite it. Thanks.
-
Hey, which region is the allennlp-tensorflow-datasets GCP bucket hosted in? I'd like to avoid egress fees if possible. Thanks!
-
I would love to learn more about how the C4 processing code was used to generate these data dumps. How much time and how many resources did it take to process the data? Is there any documentation, or are there logs, from the data processing jobs? And how can I convert or process the data dumps into JSON format?
-
Lots of people are interested in looking at or working with Google's C4 dataset. Unfortunately, Google does not offer it for download, and instead published open source tools to re-create it from the original Common Crawl data. Fortunately, Common Crawl has allowed us to offer a downloadable version, so here we are!
Five variants

We prepared five variants of the data: `en`, `en.noclean`, `en.noblocklist`, `realnewslike`, and `multilingual`.

All the code snippets below assume you want the `c4/en` dataset. To get the other variants, just substitute `en` for any of the other names.

For reference, these are the sizes of the sets:

- `en`: 800GB in TFDS format, 305GB in JSON format
- `en.noclean`: 6.3TB in TFDS format, 2.3TB in JSON format
- `en.noblocklist`: 1,003GB in TFDS format, 380GB in JSON format
- `realnewslike`: 38GB in TFDS format, 15GB in JSON format
- `multilingual`: 27TB in TFDS format, 9.7TB in JSON format

The `en.noblocklist` variant is exactly the same as the `en` variant, except we turned off the so-called "badwords filter", which removes all documents that contain words from the lists at https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words.

Tensorflow native format
We uploaded the dataset in Tensorflow native format into a requester-pays bucket in Google Storage. "Requester-pays" means you might have to pay Google for downloading it, which means you need an account on Google Cloud Platform. The actual pricing is complicated and depends on a lot of factors. If you're processing the data inside Google Cloud, it is most likely free. If you're downloading the data in the US or Europe, it will cost $0.12/GB, so downloading the whole `c4/en` dataset will cost about $100.

To use the dataset in tensorflow-datasets format, you will need two things:

- `gsutil`, or some other method of downloading data from Google Cloud Storage. There are many ways of obtaining it. Here are the official instructions from Google.
- The `tensorflow` and `tensorflow-datasets` Python packages. You can `pip install` both of these.

TFDS is incompatible with requester-pays buckets, so you have to download the data locally before you can use it. To do that, run this in your shell:

```
mkdir -p local_datasets_dir/c4/en/3.0.1/
gsutil -m cp 'gs://allennlp-tensorflow-datasets/c4/en/3.0.1/*' local_datasets_dir/c4/en/3.0.1/
```
Once it is done, you can read the dataset in Python like this:
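Something like this should work (a sketch; adjust the version directory to match what you downloaded):

```python
import tensorflow_datasets as tfds

# data_dir must be the directory that contains c4/en/3.0.1/ from above.
ds = tfds.load("c4/en:3.0.1", data_dir="local_datasets_dir")

# Each example is a dict of tensors; "text" holds the document body.
for example in ds["train"].take(3):
    print(example["text"].numpy().decode("utf-8"))
```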
JSON format
Not everyone likes the Tensorflow native format, and it is uncompressed, so the file sizes are much larger. For that reason, we also prepared the data in JSON format. huggingface.co agreed to host this dataset for us. Thank you! You can take a look at what's available at https://huggingface.co/datasets/allenai/c4/tree/main.
Huggingface uses Git Large File Storage to actually store the data, so you will need to install that on your machine to get to the files.
Once that is done, downloading the whole dataset, all five variants, is easy:
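A sketch, assuming a plain Git LFS clone of the repository linked above:

```
git lfs install
git clone https://huggingface.co/datasets/allenai/c4
```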
This will download 13TB to your local drive. If you want to be more precise with what you are downloading, follow these commands instead:
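For instance, something along these lines: clone only the LFS stubs first, then pull the files you actually want (the `en/*` pattern is illustrative):

```
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "en/*"
```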
The `git clone` command in this variant will download a bunch of stub files that Git LFS uses, so you can see all the filenames that exist that way. You can then convert the stubs into their real files with `git lfs pull --include "..."`. For example, if you wanted all the Dutch documents from the multilingual set, you would run `git lfs pull --include "multilingual/c4-nl.*.json.gz"`.
Acknowledgements
Big ups to the good folks at Common Crawl whose data made this possible (consider donating!), and to Google for creating the code that curates and filters the data!
License
We are releasing this dataset under the terms of ODC-BY. By using this, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.