We release a dataset called X-CLAIM for the task of multilingual claim span identification. X-CLAIM consists of 7K real-world claims, and the social media posts containing them, collected from various social media platforms (e.g., Instagram) in English, Hindi, Punjabi, Tamil, Telugu, and Bengali. We also provide code for a baseline model that trains encoder-only language models such as mDeBERTa on the X-CLAIM dataset.
This work appeared at the EMNLP 2023 (main) conference (paper link). Check out the video (link) or poster (link) presentation for a brief overview.
Authors: Shubham Mittal, Megha Sundriyal, Preslav Nakov.
The `train`, `dev`, and `test` splits for the `lang` language are provided inside the `./data/` folder, in files named `./data/{split}-{lang}.csv`. Each file contains three columns (a loading sketch follows the list):
- `tokens`: list of tokens in the social media post's text
- `span_start_index`: starting token index of the claim span in the `tokens` list
- `span_end_index`: ending token index (inclusive) of the claim span in the `tokens` list
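For a quick sanity check, here is a minimal sketch of how a split file can be loaded and the claim span recovered from the token indices. It assumes the `tokens` column stores the list as a Python-literal string (a common CSV round-trip of a list column); adjust the parsing if the files are serialized differently.

```python
import ast
import pandas as pd

# Load the English training split (path per the naming scheme above).
df = pd.read_csv("./data/train-en.csv")

row = df.iloc[0]
# Assumption: the tokens list is stored as a Python-literal string.
tokens = ast.literal_eval(row["tokens"])
start, end = int(row["span_start_index"]), int(row["span_end_index"])

# span_end_index is inclusive, hence end + 1.
claim_span = " ".join(tokens[start : end + 1])
print(claim_span)
```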
For reproducibility of the results, we provide the data translated into the target `lang` language in the `./data/{split}-en2{lang}.csv` files for the `train` and `dev` splits. Note that the `dev` split is only provided for Telugu and Bengali.
We use the following language IDs in the `lang` variable:
- English: `en`
- Hindi: `hi`
- Punjabi: `pa`
- Tamil: `ta`
- Telugu: `te`
- Bengali: `bn`
To reproduce the multilingual training baselines, the `./data/{split}-multilingual.csv` file contains the aggregated data of all languages for the `train` and `dev` splits.
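The naming scheme above can be wrapped in a small helper. This is a convenience sketch, not part of the repository; the `translated` flag mirrors the `en2{lang}` convention described above.

```python
LANG_IDS = {"English": "en", "Hindi": "hi", "Punjabi": "pa",
            "Tamil": "ta", "Telugu": "te", "Bengali": "bn"}

def split_path(split: str, lang: str, translated: bool = False) -> str:
    """Build a ./data/ file path following the naming scheme above.

    split: "train", "dev", or "test"; lang: a language ID or "multilingual".
    translated selects the en2{lang} files (train/dev only, see above).
    """
    if lang == "multilingual":
        return f"./data/{split}-multilingual.csv"
    if translated:
        return f"./data/{split}-en2{lang}.csv"
    return f"./data/{split}-{lang}.csv"

# split_path("train", "hi")                -> "./data/train-hi.csv"
# split_path("dev", "te", translated=True) -> "./data/dev-en2te.csv"
```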
Run the commands below to install the required dependencies.

```bash
conda create --name xclaim python=3.7
conda activate xclaim
pip install -r requirements.txt
```
Our two-step methodology to create the X-CLAIM dataset.
Here, we provide the command to run the automated annotation step for marking the claim span in the social media post text using the normalized claim.
```bash
cd ./code/
python automated_annotation.py --inp_csv <input_csv_file_path> --out_csv <output_csv_file_path> --gpu <gpu id>
```
The `<input_csv_file_path>` file should contain two columns:
- `source`: the text (e.g., the social media post text) onto which the target text is mapped
- `target`: the text (e.g., the normalized claim) which is mapped onto the source text

The `<output_csv_file_path>` file will contain four columns (in addition to the `source` and `target` columns; see the illustrative sketch after this list):
- `target_on_source`: the target text mapped onto the source text, i.e., the claim span in the social media post created from the normalized claim
- `source_tokens`: list of the tokens in the source text
- `start_index`: starting token index of the `target_on_source` text
- `end_index`: ending token index (inclusive) of the `target_on_source` text
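For intuition, here is one plausible way to map a normalized claim onto the post text: pick the contiguous token window in the source whose text is most similar to the target, using character-level fuzzy matching from `difflib`. This is only an illustrative sketch of the input/output contract above; the repository's `automated_annotation.py` may use a different (e.g., model-based) alignment, so do not treat this as the actual algorithm.

```python
from difflib import SequenceMatcher

def map_target_on_source(source: str, target: str):
    """Return (target_on_source, source_tokens, start_index, end_index).

    Heuristic: choose the contiguous token window in `source` whose text
    is most similar to `target` (illustrative only, not the repo's method).
    """
    source_tokens = source.split()
    best_score, start_index, end_index = 0.0, 0, 0
    for i in range(len(source_tokens)):
        for j in range(i, len(source_tokens)):
            window = " ".join(source_tokens[i : j + 1])
            score = SequenceMatcher(None, window.lower(), target.lower()).ratio()
            if score > best_score:
                best_score, start_index, end_index = score, i, j
    target_on_source = " ".join(source_tokens[start_index : end_index + 1])
    return target_on_source, source_tokens, start_index, end_index
```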
Train encoder-only language models such as mBERT, mDeBERTa, and XLM-R on the X-CLAIM dataset using the instructions below.

Run the command below with the following variables: `setting`, `model`, `lang`, and `batchsize`.
```bash
cd ./scripts/
sh train.sh <setting> <model> <lang> <batchsize>
```
- `setting` = monolingual, multilingual, or translatetrain
- `model` = mbert, mdeberta, or xlmr
- `lang` = en, hi, pa, ta, te, bn, or multilingual
- `batchsize` = 16 (for xlmr) or 32 (for mbert and mdeberta)
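The model shorthands presumably map to HuggingFace checkpoints. Only `mdeberta` → `microsoft/mdeberta-v3-base` is confirmed by the evaluation command below; the other two entries in this sketch are assumptions.

```python
# Shorthand -> HuggingFace model ID. Only mdeberta is confirmed by the
# evaluation command in this README; mbert and xlmr are assumptions.
PLM_IDS = {
    "mbert": "bert-base-multilingual-cased",   # assumption
    "mdeberta": "microsoft/mdeberta-v3-base",  # confirmed below
    "xlmr": "xlm-roberta-large",               # assumption
}
```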
Example command for training our best baseline model, Multilingual mDeBERTa, on the aggregated training data of all languages:
```bash
cd ./scripts/
sh train.sh multilingual mdeberta multilingual 32
```
All experiments are run with a single A100 (40GB) GPU.
Use the command below to evaluate a model, e.g., the Multilingual mDeBERTa checkpoint trained above, on the `lang` language.
```bash
cd ./code/
python main.py --plm microsoft/mdeberta-v3-base --path_test ../data/test-<lang>.csv --weights <path to model checkpoint>
```
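Claim span identification is typically scored with token-level overlap between the predicted and gold spans. The sketch below computes a token-level F1 for two inclusive `[start, end]` spans, matching the index convention of the data format above; the repository's evaluation in `main.py` may compute its metrics differently.

```python
def span_token_f1(pred_start: int, pred_end: int,
                  gold_start: int, gold_end: int) -> float:
    """Token-level F1 between two inclusive [start, end] spans (a sketch;
    the repo's evaluation may differ)."""
    pred = set(range(pred_start, pred_end + 1))
    gold = set(range(gold_start, gold_end + 1))
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# span_token_f1(3, 7, 5, 9) -> 0.6
```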
The commands to evaluate the models in different settings, such as zero-shot transfer or evaluation on multiple languages, are provided in `./scripts/test.sh`.
Please cite our paper if you use or extend our work:
```bibtex
@inproceedings{mittal-etal-2023-lost,
    title = "{L}ost in {T}ranslation, {F}ound in {S}pans: {I}dentifying {C}laims in {M}ultilingual {S}ocial {M}edia",
    author = "Mittal, Shubham and
      Sundriyal, Megha and
      Nakov, Preslav",
    editor = "Bouamor, Houda and
      Pino, Juan and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.236",
    pages = "3887--3902",
}
```