Download the C4 dataset! #5056
-
This is great news, and thanks for making it available! C4 is arguably an expensive and difficult dataset to get, and even more so the multilingual part. I work at the National Library of Norway, where I have been trying to download and process mC4 to extract data for training Scandinavian language models (Danish, Icelandic, Norwegian, and Swedish), but just processing the smallest dump was quite expensive (~950€ using Google Dataflow and Tensorflow Datasets), so processing all ~72 dumps is simply not an option. So if y'all have extracted that data and there is a way to access or transfer it, that would be awesome :)
-
Would it be possible to separately provide a manifest of the files with MD5 checksums for validation purposes?
-
Thanks! Would it be possible to add license info?
-
A heads up that you'll have to specify your GCP project to perform requester-pays downloads. So the command should look something like this:
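A minimal sketch, assuming the bucket path from the announcement below; `my-gcp-project` is a placeholder for your own project ID:

```
# -u sets the GCP project that gets billed for this requester-pays download
gsutil -u my-gcp-project -m cp 'gs://allennlp-tensorflow-datasets/c4/en/3.0.1/*' local_datasets_dir/c4/en/3.0.1/
```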
-
To download all:
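Presumably something along these lines, using the Git LFS workflow from the announcement below (the repository URL is taken from there):

```
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull
```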
To download only some, replace the last command with a filtered pull:
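For example, to fetch only the `en` variant (the include pattern here is just an illustration):

```
git lfs pull --include "en/*"
```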
-
We added a new dataset today! Also, we documented some of what we found inside at http://www.cs.cmu.edu/~jessed/data_hosting/documenting_c4.pdf. Go check it out!
-
I've made the 'en' version a public BigQuery dataset if that's helpful for anyone:
-
Please explain what each contains? For example: "en: 800GB in TFDS format, 300GB in JSON format"
-
That's great news! Thank you for making it available.
It seems only
-
That's really awesome! I have a question: what is the dump date of this CC corpus?
-
Thanks a lot for providing this incredible dataset! I'm trying to download the JSON format by cloning the huggingface repo directly to a GCP bucket, but I keep getting the error shown in the image below. Did anyone manage to clone the repo successfully?
-
@dirkgr Can I download a specific language from the mC4 data? I found that the data format is .json.gz.
-
How can you import the downloaded JSON C4 dataset with huggingface?
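One approach that should work is the generic JSON loader in the `datasets` library; the shard name below is just an example of one downloaded file:

```python
from datasets import load_dataset

# The generic "json" loader reads gzipped JSON-lines files directly;
# point data_files at whichever shards you downloaded.
ds = load_dataset(
    "json",
    data_files="en/c4-train.00000-of-01024.json.gz",
    split="train",
)
print(ds[0]["text"])
```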
-
Does anyone know how to retrieve a sentence pair (a true sentence plus a sentence with synthetic grammar errors) from the C4 dataset published on Huggingface?
-
I would like to cite this dataset in a paper I will be publishing shortly. I have Google's C4 citation in there, but I would also like to credit AllenNLP, since this isn't Google's original C4 dataset. Please let me know how to cite it. Thanks.
-
Hey, which region is the allennlp-tensorflow-datasets GCP bucket hosted in? I'd like to avoid egress fees if possible. Thanks!
-
I would love to learn more about how the C4 processing code was used to generate these data dumps. How much time and how many resources did it take to process the data? Is there any documentation, or are there logs, from the data processing jobs? And how can I convert or process the data dumps into JSON format?
-
Lots of people are interested in looking at or working with Google's C4 dataset. Unfortunately, Google does not offer it for download, and instead published open source tools to re-create it from the original Common Crawl data. Fortunately, Common Crawl has allowed us to offer a downloadable version, so here we are!
Five variants

We prepared five variants of the data: `en`, `en.noclean`, `en.noblocklist`, `realnewslike`, and `multilingual`.

All the code snippets below assume you want the `c4/en` dataset. To get the other variants, just substitute `en` for any of the other names.

For reference, these are the sizes of the sets:

- `en`: 800GB in TFDS format, 305GB in JSON format
- `en.noclean`: 6.3TB in TFDS format, 2.3TB in JSON format
- `en.noblocklist`: 1,003GB in TFDS format, 380GB in JSON format
- `realnewslike`: 38GB in TFDS format, 15GB in JSON format
- `multilingual`: 27TB in TFDS format, 9.7TB in JSON format

The `en.noblocklist` variant is exactly the same as the `en` variant, except we turned off the so-called "badwords filter", which removes all documents that contain words from the lists at https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words.

Tensorflow native format
We uploaded the dataset in Tensorflow native format into a requester-pays bucket in Google Storage. "Requester-pays" means you might have to pay Google for downloading it, which means you need an account on Google Cloud Platform. The actual pricing is complicated and depends on a lot of factors. If you're processing the data inside Google Cloud, it is most likely free. If you're downloading the data in the US or Europe, it will cost $0.12/GB, so downloading the whole `c4/en` dataset will cost about $100.

To use the dataset in tensorflow-datasets format, you will need two things:

- `gsutil`, or some other method of downloading data from Google Cloud Storage. There are many ways of obtaining it. Here are the official instructions from Google.
- The `tensorflow` and `tensorflow-datasets` Python packages. You can `pip install` both of these.

TFDS is incompatible with requester-pays buckets, so you have to download the data locally before you can use it. To do that, run this in your shell:

```
mkdir -p local_datasets_dir/c4/en/3.0.1/
gsutil -m cp 'gs://allennlp-tensorflow-datasets/c4/en/3.0.1/*' local_datasets_dir/c4/en/3.0.1/
```
Once it is done, you can read the dataset in Python like this:
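Something like this should work (a sketch; adjust the version directory to match what you downloaded):

```python
import tensorflow_datasets as tfds

# data_dir must be the directory that contains c4/en/3.0.1/ from above.
ds = tfds.load("c4/en:3.0.1", data_dir="local_datasets_dir")

# Each example is a dict of tensors; "text" holds the document body.
for example in ds["train"].take(3):
    print(example["text"].numpy().decode("utf-8"))
```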
JSON format
Not everyone likes the Tensorflow native format, and it is uncompressed, so the file sizes are much larger. For that reason, we also prepared the data in JSON format. huggingface.co agreed to host this dataset for us. Thank you! You can take a look at what's available at https://huggingface.co/datasets/allenai/c4/tree/main.
Huggingface uses Git Large File Storage to actually store the data, so you will need to install that on your machine to get to the files.
Once that is done, downloading the whole dataset, all five variants, is easy:
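A sketch, assuming a plain Git LFS clone of the repository linked above:

```
git lfs install
git clone https://huggingface.co/datasets/allenai/c4
```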
This will download 13TB to your local drive. If you want to be more precise with what you are downloading, follow these commands instead:
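For instance, something along these lines: clone only the LFS stubs first, then pull the files you actually want (the `en/*` pattern is illustrative):

```
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "en/*"
```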
The `git clone` command in this variant will download a bunch of stub files that Git LFS uses, so you can see all the filenames that exist that way. You can then convert the stubs into their real files with `git lfs pull --include "..."`. For example, if you wanted all the Dutch documents from the multilingual set, you would run `git lfs pull --include "multilingual/c4-nl.*.json.gz"`.
Acknowledgements
Big ups to the good folks at Common Crawl whose data made this possible (consider donating!), and to Google for creating the code that curates and filters the data!
License
We are releasing this dataset under the terms of ODC-BY. By using this, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.