We release a dataset called X-CLAIM for the task of multilingual claim span identification. X-CLAIM consists of 7K real-world claims, and the social media posts containing them, collected from various social media platforms (e.g., Instagram) in English, Hindi, Punjabi, Tamil, Telugu, and Bengali. We also provide code for a baseline model that trains encoder-only language models such as mDeBERTa on the X-CLAIM dataset.
This work appeared at the EMNLP 2023 (main) conference (paper link). Check out the video (link) or poster (link) presentation for a brief overview.
Authors: Shubham Mittal, Megha Sundriyal, Preslav Nakov.
The `train`, `dev`, and `test` splits for the `lang` language are provided inside the `./data/` folder, in files named `./data/{split}-{lang}.csv`. Each file contains three columns (a loading sketch follows the list):
- `tokens`: list of tokens in the social media post's text
- `span_start_index`: starting token index of the claim span in the `tokens` list
- `span_end_index`: ending token index (inclusive) of the claim span in the `tokens` list
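For a quick sanity check, here is a minimal sketch of how a split file can be loaded and the claim span recovered from the token indices. It assumes the `tokens` column stores the list as a Python-literal string (a common CSV round-trip of a list column); adjust the parsing if the files are serialized differently.

```python
import ast
import pandas as pd

# Load the English training split (path per the naming scheme above).
df = pd.read_csv("./data/train-en.csv")

row = df.iloc[0]
# Assumption: the tokens list is stored as a Python-literal string.
tokens = ast.literal_eval(row["tokens"])
start, end = int(row["span_start_index"]), int(row["span_end_index"])

# span_end_index is inclusive, hence end + 1.
claim_span = " ".join(tokens[start : end + 1])
print(claim_span)
```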
For reproducibility of the results, we provide the data translated into the target `lang` language in the `./data/{split}-en2{lang}.csv` files for the `train` and `dev` splits. Note that the `dev` split is only provided for Telugu and Bengali.
We use the following language IDs in the `lang` variable:
- English: `en`
- Hindi: `hi`
- Punjabi: `pa`
- Tamil: `ta`
- Telugu: `te`
- Bengali: `bn`
To reproduce the multilingual training baselines, the `./data/{split}-multilingual.csv` file contains the aggregated data of all languages for the `train` and `dev` splits.
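The naming scheme above can be wrapped in a small helper. This is a convenience sketch, not part of the repository; the `translated` flag mirrors the `en2{lang}` convention described above.

```python
LANG_IDS = {"English": "en", "Hindi": "hi", "Punjabi": "pa",
            "Tamil": "ta", "Telugu": "te", "Bengali": "bn"}

def split_path(split: str, lang: str, translated: bool = False) -> str:
    """Build a ./data/ file path following the naming scheme above.

    split: "train", "dev", or "test"; lang: a language ID or "multilingual".
    translated selects the en2{lang} files (train/dev only, see above).
    """
    if lang == "multilingual":
        return f"./data/{split}-multilingual.csv"
    if translated:
        return f"./data/{split}-en2{lang}.csv"
    return f"./data/{split}-{lang}.csv"

# split_path("train", "hi")                -> "./data/train-hi.csv"
# split_path("dev", "te", translated=True) -> "./data/dev-en2te.csv"
```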
Run the commands below to install the required dependencies.

```bash
conda create --name xclaim python=3.7
conda activate xclaim
pip install -r requirements.txt
```
Our two-step methodology to create the X-CLAIM dataset.
Here, we provide the command to run the automated annotation step for marking the claim span in the social media post text using the normalized claim.
```bash
cd ./code/
python automated_annotation.py --inp_csv <input_csv_file_path> --out_csv <output_csv_file_path> --gpu <gpu id>
```
The `<input_csv_file_path>` file should contain two columns:
- `source`: the text (e.g., the social media post text) onto which the target text is mapped
- `target`: the text (e.g., the normalized claim) which is mapped onto the source text

The `<output_csv_file_path>` file will contain four columns (in addition to the `source` and `target` columns; see the illustrative sketch after this list):
- `target_on_source`: the target text mapped onto the source text, i.e., the claim span in the social media post created from the normalized claim
- `source_tokens`: list of the tokens in the source text
- `start_index`: starting token index of the `target_on_source` text
- `end_index`: ending token index (inclusive) of the `target_on_source` text
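For intuition, here is one plausible way to map a normalized claim onto the post text: pick the contiguous token window in the source whose text is most similar to the target, using character-level fuzzy matching from `difflib`. This is only an illustrative sketch of the input/output contract above; the repository's `automated_annotation.py` may use a different (e.g., model-based) alignment, so do not treat this as the actual algorithm.

```python
from difflib import SequenceMatcher

def map_target_on_source(source: str, target: str):
    """Return (target_on_source, source_tokens, start_index, end_index).

    Heuristic: choose the contiguous token window in `source` whose text
    is most similar to `target` (illustrative only, not the repo's method).
    """
    source_tokens = source.split()
    best_score, start_index, end_index = 0.0, 0, 0
    for i in range(len(source_tokens)):
        for j in range(i, len(source_tokens)):
            window = " ".join(source_tokens[i : j + 1])
            score = SequenceMatcher(None, window.lower(), target.lower()).ratio()
            if score > best_score:
                best_score, start_index, end_index = score, i, j
    target_on_source = " ".join(source_tokens[start_index : end_index + 1])
    return target_on_source, source_tokens, start_index, end_index
```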
Train encoder-only language models such as mBERT, mDeBERTa, and XLM-R on the X-CLAIM dataset using the instructions below.

Run the command below with the following variables: `setting`, `model`, `lang`, and `batchsize`.
```bash
cd ./scripts/
sh train.sh <setting> <model> <lang> <batchsize>
```
- `setting` = monolingual, multilingual, or translatetrain
- `model` = mbert, mdeberta, or xlmr
- `lang` = en, hi, pa, ta, te, bn, or multilingual
- `batchsize` = 16 (for xlmr) or 32 (for mbert and mdeberta)
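The model shorthands presumably map to HuggingFace checkpoints. Only `mdeberta` → `microsoft/mdeberta-v3-base` is confirmed by the evaluation command below; the other two entries in this sketch are assumptions.

```python
# Shorthand -> HuggingFace model ID. Only mdeberta is confirmed by the
# evaluation command in this README; mbert and xlmr are assumptions.
PLM_IDS = {
    "mbert": "bert-base-multilingual-cased",   # assumption
    "mdeberta": "microsoft/mdeberta-v3-base",  # confirmed below
    "xlmr": "xlm-roberta-large",               # assumption
}
```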
Example command for training our best baseline model, Multilingual mDeBERTa, on the aggregated training data of all languages:
```bash
cd ./scripts/
sh train.sh multilingual mdeberta multilingual 32
```
All experiments are run with a single A100 (40GB) GPU.
Use the command below to evaluate a model, e.g., the Multilingual mDeBERTa checkpoint trained above, on the `lang` language.
```bash
cd ./code/
python main.py --plm microsoft/mdeberta-v3-base --path_test ../data/test-<lang>.csv --weights <path to model checkpoint>
```
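Claim span identification is typically scored with token-level overlap between the predicted and gold spans. The sketch below computes a token-level F1 for two inclusive `[start, end]` spans, matching the index convention of the data format above; the repository's evaluation in `main.py` may compute its metrics differently.

```python
def span_token_f1(pred_start: int, pred_end: int,
                  gold_start: int, gold_end: int) -> float:
    """Token-level F1 between two inclusive [start, end] spans (a sketch;
    the repo's evaluation may differ)."""
    pred = set(range(pred_start, pred_end + 1))
    gold = set(range(gold_start, gold_end + 1))
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# span_token_f1(3, 7, 5, 9) -> 0.6
```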
The commands to evaluate the models in different settings, such as zero-shot transfer or evaluation on multiple languages, are provided in `./scripts/test.sh`.
Please cite our paper if you use or extend our work:
```bibtex
@inproceedings{mittal-etal-2023-lost,
    title = "{L}ost in {T}ranslation, {F}ound in {S}pans: {I}dentifying {C}laims in {M}ultilingual {S}ocial {M}edia",
    author = "Mittal, Shubham and
      Sundriyal, Megha and
      Nakov, Preslav",
    editor = "Bouamor, Houda and
      Pino, Juan and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.236",
    pages = "3887--3902",
}
```