Create dataset loader for id-en-code-mixed #303

SamuelCahyawijaya · 2022-10-02T16:06:59Z

Dataset	id_en_code_mixed
Description	This dataset contains 825 Indonesian-English code-mixed tweets covering four NLP tasks: tokenization, language identification, lexical normalization, and word translation. The data for the lexical normalization task is curated in MultiLexNorm (already in Nusa Catalogue), but the data for the other tasks is not. Tokenizing social media text is not as trivial as splitting on whitespace. In this dataset, language identification is performed at the token level.
License	CC-BY-NC-SA 4.0
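
To illustrate why whitespace splitting falls short on tweets, here is a minimal sketch. The example tweet and the regex pattern are assumptions made up for illustration; they are not taken from the dataset and do not represent its actual tokenization scheme.

```python
import re

# Hypothetical code-mixed tweet (NOT from the dataset).
tweet = "otw kampus, meeting-nya jam 9!! @budi #senin"

# Naive whitespace splitting leaves punctuation glued to words.
whitespace_tokens = tweet.split()

# Illustrative regex tokenizer (an assumption, not the dataset's rules):
# keep mentions/hashtags whole, keep hyphenated words together,
# and split every other punctuation mark into its own token.
TOKEN_RE = re.compile(r"[@#]\w+|\w+(?:-\w+)*|[^\w\s]")
regex_tokens = TOKEN_RE.findall(tweet)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, `kampus,` and `9!!` stay fused with their punctuation, while the regex version separates them and still keeps `@budi`, `#senin`, and `meeting-nya` intact.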

VanillaMacchiato · 2022-10-04T01:27:23Z

#self-assign

haryoa · 2022-12-20T11:55:58Z

#self-assign

SamuelCahyawijaya added this to Nusantara Dataset Initiative Oct 2, 2022

muhsatrio added the hacktoberfest label Oct 3, 2022

github-actions bot assigned VanillaMacchiato Oct 4, 2022

github-actions bot assigned haryoa Dec 20, 2022
