Skip to content

Latest commit

 

History

History
113 lines (85 loc) · 7.37 KB

File metadata and controls

113 lines (85 loc) · 7.37 KB

DeToxy: A Large-Scale Multimodal Dataset for Toxicity Classification in Spoken Utterances

Introduction

DeToxy is a large scale multimodal dataset created by manually annotating toxic labels onto spoken utterances that allows content moderation researchers to encompass audio modality onto their work as well. DeToxy is sourced from various openly available speech databases and consists of over 2 million utterances. The datasets that have been used for creating DeToxy are CMU-MOSEI, CMU-MOSI, Common Voice, IEMOCAP, LJ Speech, MELD, MSP-Improv, MSP-Podcast, Social-IQ, SwitchBoard and VCTK. We also provide DeToxy-B, a balanced version of the dataset, curated from the original larger version taking into consideration auxiliary factors like trigger terms and utterance sentiment labels.

Dataset Statistics

Paper

The paper containing the detailed explanation of the dataset can be found here - https://arxiv.org/pdf/2110.07592.pdf

Objective

Social network platforms are generally meant to share positive, constructive, and insightful content. However, in recent times, people often get exposed to objectionable content like threats, identity attacks, hate speech, insults, obscene texts, offensive remarks, or bullying. With the rise of different forms of content available online beyond just written text, i.e., audio and video, it is crucial that we device efficient content moderation systems for these forms of shared media. However, most prior work in literature and available datasets focus primarily on the modality of conversational text, with other modalities of conversation ignored at large. Thus, to alleviate this problem, in this paper, we propose a new dataset DeToxy, for the relatively new and unexplored Spoken Language Processing (SLP) task of toxicity classification in spoken utterances, which remains a crucial problem to solve for interactive intelligent systems, with broad applications in the field of content moderation in online audio/video content, gaming, customer service, etc.

Download the data

The datasets can be downloaded through the use of links below. For some of the datasets requests to the respective labs might be required.

Dataset Link
CMU-MOSEI http://multicomp.cs.cmu.edu/resources/cmu-mosei-dataset/
CMU-MOSI http://multicomp.cs.cmu.edu/resources/cmu-mosi-dataset/
Common Voice https://commonvoice.mozilla.org/en/datasets (We had used Version 6.1 for our dataset).
IEMOCAP https://sail.usc.edu/iemocap/
LJ Speech https://keithito.com/LJ-Speech-Dataset/
MELD https://affective-meld.github.io/ (Download raw data)
MSP-Improv https://ecs.utdallas.edu/research/researchlabs/msp-lab/MSP-Improv.html
MSP-Podcast https://ecs.utdallas.edu/research/researchlabs/msp-lab/MSP-Podcast.html
Social-IQ https://www.thesocialiq.com/
SwitchBoard https://catalog.ldc.upenn.edu/LDC97S62
VCTK https://datashare.ed.ac.uk/handle/10283/2950

Annotated dataset can be found in the data folder.

Description of the .csv files

Column Name Description
Dataset This column gives the name of the dataset.
FileName The Filename of each audio file from datasets.
text Individual utterances from each audio file as a string.
label2a The label (toxic: 1, non-toxic: 0) annotated for each utterance.
Starting The starting time of the utterance in the given audio file in seconds (only used to segment data).
Ending The ending time of the utterance in the given audio file in seconds (only used to segment data).

The files

/data/metadata.csv - contains notes for the preparation of each dataset.
/data/test.csv - contains the utterances in the test set along with Toxicity Labels and Starting/Ending Time.
/data/train.csv - contains the utterances in the training set along with Toxicity Labels and Starting/Ending Time.
/data/trigger_test.csv - contains the utterances in the trigger term test set along with Toxicity Labels and Starting/Ending Time.
/data/valid.csv - contains the utterances in the validation set along with Toxicity Labels and Starting/Ending Time.

Setup Instructions

  1. Clone the entire repository into your local machine.
  2. Search and download online all the open source datasets mentioned in the introduction and place them all in a new folder in this directory.
  3. Open Anaconda Command Prompt and Setup a new environment

  4.  C:\> conda create -n DeToxy pip python=3.6
    
  5. Activate the environment and upgrade pip.

  6. C:\> activate DeToxy
    (DeToxy) C:\>python -m pip install --upgrade pip
    
  7. All other requirements can be installed using requirements.txt

  8.  (DeToxy) C:\>pip install -r requirements.txt
    
  9. Using the Audio (Final DeToxy-B Audio Prep.ipynb) and Transcript (Final DeToxy-B Transcript Prep.ipynb) Jupyter Notebooks present in the data prep folder to prepare the datasets. The Gold Transcripts are already present in the data folder and is not required to run.
  10.  (DeToxy) C:\> jupyter notebook
    
  11. The codes present in the two_step_approach folder is used to run the two step experiments present in the paper. First use the transcribe.py code is used to generate the transcripts for all the audio files using models pretrained on Librispeech, Common Voice and Switchboard.

  12.  (DeToxy) C:\Toxicity-Detection-in-Spoken-Utterances\e2e\> python transcribe.py
    
  13. After creating the transcripts for each dataset and pretrained model use the Civil_Comments Jupyter Notebook to predict and evaluate using the model trained on publicly available toxic dataset(text).
  14. To perform the DB(DeToxy-B) two step approach find the main.py file present in the two_step_approach/DB folder. Make changes to the default paths accordingly inside the parse function. Also make changes to the csv file names preset in the main function. Change the name of the Stats File according to the transcripts that is being evaluated. The training dataset will remain the same dataset for all the other transcripts. Then run the main.py file.

  15.  (DeToxy) C:\Toxicity-Detection-in-Spoken-Utterances\two_step_approach\DB\> python main.py
    
  16. For the End to End approach, first select the model that you want to train on. This change has to be done in line 1559 under the MMI_Model_Single class in mmi_module_only_audio.py. Change it accordingly in run_iemocap-ours-meld.py under line 32 and 33. In the main function the paths for files can be changed according to one's needs. Then run the run_iemocap script using the line below.
  17.  (DeToxy) C:\Toxicity-Detection-in-Spoken-Utterances\e2e\> python run_iemocap-ours-meld.py 
    

Citation

Please cite the following paper if you find this dataset useful in your research:
S. Ghosh, S. Lepcha, Sakshi, R. R. Shah and S. Umesh. DeToxy: A Large-Scale Multimodal Dataset for Toxicity Classification in Spoken Utterances. arXiv preprint arxiv.2110.07592 (2022).