QubitPi/wiktionary-data

license: apache-2.0
pretty_name: English Wiktionary Data in JSONL
language:
  - en
  - de
  - la
  - grc
  - ko
  - peo
  - akk
  - elx
  - sa
configs:
  - config_name: Wiktionary
    data_files:
      - split: German
        path: german-wiktextract-data.jsonl
      - split: Latin
        path: latin-wiktextract-data.jsonl
      - split: AncientGreek
        path: ancient-greek-wiktextract-data.jsonl
      - split: Korean
        path: korean-wiktextract-data.jsonl
      - split: OldPersian
        path: old-persian-wiktextract-data.jsonl
      - split: Akkadian
        path: akkadian-wiktextract-data.jsonl
      - split: Elamite
        path: elamite-wiktextract-data.jsonl
      - split: Sanskrit
        path: sanskrit-wiktextract-data.jsonl
  - config_name: Knowledge Graph
    data_files:
      - split: AllLanguage
        path: word-definition-graph-data.jsonl
tags:
  - Natural Language Processing
  - NLP
  - Wiktionary
  - Vocabulary
  - German
  - Latin
  - Ancient Greek
  - Korean
  - Old Persian
  - Akkadian
  - Elamite
  - Sanskrit
  - Knowledge Graph
size_categories:
  - 100M<n<1B

Wiktionary Data on Hugging Face Datasets

wiktionary-data is an extraction of a subset of the English Wiktionary that currently supports the following languages:

  • Deutsch - German
  • Latinum - Latin
  • Ἑλληνική - Ancient Greek
  • 한국어 - Korean
  • 𐎠𐎼𐎹 - Old Persian
  • 𒀝𒅗𒁺𒌑(𒌝) - Akkadian
  • Elamite
  • संस्कृतम् - Sanskrit, or Classical Sanskrit

wiktionary-data was originally a sub-module of wilhelm-graphdb. As the dataset grew, I noticed that its potential uses stretch well beyond the scope of the containing project, so I promoted it to a dedicated module, which is this repo.

The Wiktionary language data is available on 🤗 Hugging Face Datasets.

from datasets import load_dataset
dataset = load_dataset("QubitPi/wiktionary-data")

There are two data subsets:

  1. Languages subset, which contains the extracted data for each of the supported languages:

    dataset = load_dataset("QubitPi/wiktionary-data", "Wiktionary")

    The subset contains the following splits:

    • German
    • Latin
    • AncientGreek
    • Korean
    • OldPersian
    • Akkadian
    • Elamite
    • Sanskrit
  2. Graph subset that is useful for constructing knowledge graphs:

    dataset = load_dataset("QubitPi/wiktionary-data", "Knowledge Graph")

    The subset contains the following splits:

    • AllLanguage: all the languages listed above in a giant graph

    The Graph data ontology is the following:

    [Figure: ontology.png, a diagram of the graph data ontology]
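To work with a single split instead of a whole subset, the split name can be passed to load_dataset. The following is a minimal sketch, not part of this repo: load_split and SPLITS are hypothetical names, and the subset and split names are simply copied from the lists above.

```python
# Subset (config) and split names copied from the lists in this README.
SPLITS = {
    "Wiktionary": [
        "German", "Latin", "AncientGreek", "Korean",
        "OldPersian", "Akkadian", "Elamite", "Sanskrit",
    ],
    "Knowledge Graph": ["AllLanguage"],
}


def load_split(config: str, split: str):
    """Load one split of one subset (hypothetical helper, not part of the repo)."""
    if split not in SPLITS.get(config, []):
        raise ValueError(f"unknown split {split!r} for subset {config!r}")
    # Imported lazily so the name check above runs even without the package installed.
    from datasets import load_dataset
    return load_dataset("QubitPi/wiktionary-data", config, split=split)
```

For example, load_split("Wiktionary", "Latin") would download only the Latin split, while a misspelled split name fails fast with a ValueError before any network access.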

Tip

Two words are structurally similar if and only if they share the same stem
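The rule in the tip can be illustrated by grouping words by stem and pairing up the members of each group. This is only a sketch of the relation itself: how stems are actually derived for the graph is not described here, so the word-to-stem mapping below is a hypothetical input with made-up example values.

```python
from collections import defaultdict
from itertools import combinations


def structurally_similar_pairs(stem_of):
    """Return all word pairs that share a stem.

    stem_of maps each word to its stem; where the stems come from is an
    assumption left open by this sketch.
    """
    by_stem = defaultdict(list)
    for word, stem in stem_of.items():
        by_stem[stem].append(word)

    pairs = []
    for words in by_stem.values():
        # Every unordered pair inside one stem group is structurally similar.
        pairs.extend(combinations(sorted(words), 2))
    return sorted(pairs)


# Illustrative stems only; these values are not taken from the dataset.
example = {"lieben": "lieb", "Liebe": "lieb", "lieblich": "lieb", "gehen": "geh"}
similar = structurally_similar_pairs(example)
```

A word whose stem is shared with no other word, such as "gehen" in the example, does not appear in any pair.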

Development

Data Source

Although the original Wiktionary dump is available, parsing it from scratch is a rather complicated process. For example, acquiring the inflection data of most Indo-European languages on Wiktionary has already triggered some research-level efforts. We may do that ourselves in the future. At present, however, we simply rely on the excellent work by tatuylonen, which has already processed the dump and published it in JSONL format. wiktionary-data sources its data from the "raw Wiktextract data (JSONL, one object per line)" option there.
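Because the source is JSONL with one object per line, it can be streamed without loading the whole file. The sketch below shows one way to keep only the supported languages; it is not the repo's actual extraction code. It assumes Wiktextract entries carry "word", "lang_code", and "senses" (each with "glosses") fields, the function names are hypothetical, and the two sample lines are invented for illustration.

```python
import json

# Language codes of the supported languages, as listed in the dataset metadata.
SUPPORTED_LANG_CODES = {"de", "la", "grc", "ko", "peo", "akk", "elx", "sa"}


def iter_supported_entries(lines):
    """Yield parsed entries whose language is supported, one JSON object per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if entry.get("lang_code") in SUPPORTED_LANG_CODES:
            yield entry


def glosses(entry):
    """Collect every gloss string across all senses of an entry."""
    return [g for sense in entry.get("senses", []) for g in sense.get("glosses", [])]


# Invented sample lines; a real run would iterate over an open file instead.
sample = [
    '{"word": "aqua", "lang_code": "la", "pos": "noun", "senses": [{"glosses": ["water"]}]}',
    '{"word": "eau", "lang_code": "fr", "pos": "noun", "senses": [{"glosses": ["water"]}]}',
]
for entry in iter_supported_entries(sample):
    print(entry["word"], glosses(entry))
```

Passing an open file object in place of sample works the same way, since iterating a text file yields one line at a time.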

Environment Setup

Get the source code:

git clone git@github.com:QubitPi/wiktionary-data.git
cd wiktionary-data

It is strongly recommended to work in an isolated environment. Install virtualenv and create an isolated Python environment with:

python3 -m pip install --user -U virtualenv
python3 -m virtualenv .venv

To activate this environment:

source .venv/bin/activate

or, on Windows:

.venv\Scripts\activate

Tip

To deactivate this environment, use

deactivate

Installing Dependencies

pip3 install -r requirements.txt

License

The use and distribution terms for wiktionary-data are covered by the Apache License, Version 2.0.