DeToxy is a large-scale multimodal dataset created by manually annotating spoken utterances with toxicity labels, allowing content moderation researchers to bring the audio modality into their work. DeToxy is sourced from various openly available speech databases and consists of over 2 million utterances. The source datasets are CMU-MOSEI, CMU-MOSI, Common Voice, IEMOCAP, LJ Speech, MELD, MSP-Improv, MSP-Podcast, Social-IQ, SwitchBoard and VCTK. We also provide DeToxy-B, a balanced version of the dataset, curated from the larger original version by taking into account auxiliary factors such as trigger terms and utterance sentiment labels.
A detailed description of the dataset is given in the paper: https://arxiv.org/pdf/2110.07592.pdf
Social network platforms are generally meant for sharing positive, constructive, and insightful content. However, in recent times, people are often exposed to objectionable content such as threats, identity attacks, hate speech, insults, obscene texts, offensive remarks, or bullying. With the rise of forms of online content beyond written text, i.e., audio and video, it is crucial that we devise efficient content moderation systems for these forms of shared media. However, most prior work in the literature, and most available datasets, focus primarily on conversational text and largely ignore other modalities of conversation. To alleviate this problem, in this paper we propose a new dataset, DeToxy, for the relatively new and unexplored Spoken Language Processing (SLP) task of toxicity classification in spoken utterances. This remains a crucial problem to solve for interactive intelligent systems, with broad applications in content moderation for online audio/video content, gaming, customer service, etc.
The source datasets can be downloaded from the links below. Some of them may require an access request to the respective lab.
Dataset | Link |
---|---|
CMU-MOSEI | http://multicomp.cs.cmu.edu/resources/cmu-mosei-dataset/ |
CMU-MOSI | http://multicomp.cs.cmu.edu/resources/cmu-mosi-dataset/ |
Common Voice | https://commonvoice.mozilla.org/en/datasets (we used version 6.1 for our dataset) |
IEMOCAP | https://sail.usc.edu/iemocap/ |
LJ Speech | https://keithito.com/LJ-Speech-Dataset/ |
MELD | https://affective-meld.github.io/ (Download raw data) |
MSP-Improv | https://ecs.utdallas.edu/research/researchlabs/msp-lab/MSP-Improv.html |
MSP-Podcast | https://ecs.utdallas.edu/research/researchlabs/msp-lab/MSP-Podcast.html |
Social-IQ | https://www.thesocialiq.com/ |
SwitchBoard | https://catalog.ldc.upenn.edu/LDC97S62 |
VCTK | https://datashare.ed.ac.uk/handle/10283/2950 |
The annotated dataset can be found in the data folder.
Column Name | Description |
---|---|
Dataset | This column gives the name of the dataset. |
FileName | The file name of each audio file in its source dataset. |
text | Individual utterances from each audio file as a string. |
label2a | The label (toxic: 1, non-toxic: 0) annotated for each utterance. |
Starting | The starting time of the utterance in the given audio file in seconds (only used to segment data). |
Ending | The ending time of the utterance in the given audio file in seconds (only used to segment data). |
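
The Starting and Ending columns are given in seconds, so cutting an utterance out of its source recording amounts to converting both values into frame offsets. The following is a minimal sketch using only Python's standard `wave` module, assuming the source audio is an uncompressed PCM WAV file. `cut_segment` and its arguments are hypothetical names for illustration and are not part of this repository; the data prep notebooks remain the reference for how the released data was actually segmented.

```python
import wave


def cut_segment(src_path, dst_path, start_s, end_s):
    """Copy the [start_s, end_s) span of a PCM WAV file into dst_path.

    Hypothetical helper: start_s and end_s correspond to the Starting and
    Ending columns (in seconds). Returns the number of frames written.
    """
    if start_s < 0:
        raise ValueError("start_s must be non-negative")
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        first = int(round(start_s * rate))
        # Clamp the end offset so it never runs past the end of the file.
        last = min(int(round(end_s * rate)), src.getnframes())
        if first >= last:
            raise ValueError("empty or out-of-range segment")
        src.setpos(first)
        frames = src.readframes(last - first)
        params = src.getparams()
    with wave.open(dst_path, "wb") as dst:
        # Reuse channel count, sample width and rate from the source file;
        # the frame count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)
    return last - first
```

Datasets distributed in other formats (for example MP3 or FLAC) would need to be decoded first; that step is outside the scope of this sketch.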
/data/metadata.csv - contains notes for the preparation of each dataset.
/data/test.csv - contains the utterances in the test set along with Toxicity Labels and Starting/Ending Time.
/data/train.csv - contains the utterances in the training set along with Toxicity Labels and Starting/Ending Time.
/data/trigger_test.csv - contains the utterances in the trigger term test set along with Toxicity Labels and Starting/Ending Time.
/data/valid.csv - contains the utterances in the validation set along with Toxicity Labels and Starting/Ending Time.
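
As a quick sanity check on the split files listed above, the class balance of each one can be counted from the `label2a` column. This is a minimal sketch, assuming the files are comma-separated with a header row that matches the column names in the table; `label_counts` is a hypothetical helper, not a function from this repository.

```python
import csv
from collections import Counter


def label_counts(csv_path):
    """Count toxic (1) and non-toxic (0) utterances in one split file.

    Assumes a header row containing a `label2a` column, as described
    in the column table of this README.
    """
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # int(float(...)) tolerates labels stored as "1" or "1.0".
            counts[int(float(row["label2a"]))] += 1
    return counts
```

For example, calling `label_counts("data/train.csv")` and `label_counts("data/test.csv")` from the repository root would show how many toxic and non-toxic utterances each split contains.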
- Clone the entire repository onto your local machine.
- Download all the open source datasets mentioned in the introduction and place them in a new folder in this directory.
- Open the Anaconda Command Prompt and set up a new environment.
- Activate the environment and upgrade pip.
- All other requirements can be installed using requirements.txt.
- Use the Audio (Final DeToxy-B Audio Prep.ipynb) and Transcript (Final DeToxy-B Transcript Prep.ipynb) Jupyter notebooks in the data prep folder to prepare the datasets. The gold transcripts are already present in the data folder, so they do not need to be regenerated.
- The code in the two_step_approach folder runs the two-step experiments from the paper. First, use transcribe.py to generate transcripts for all the audio files using models pretrained on LibriSpeech, Common Voice and Switchboard.
- After creating the transcripts for each dataset and pretrained model, use the Civil_Comments Jupyter notebook to predict and evaluate with the model trained on a publicly available text toxicity dataset.
- To perform the DB (DeToxy-B) two-step approach, find the main.py file in the two_step_approach/DB folder. Change the default paths inside the parse function as needed, and update the CSV file names in the main function. Change the name of the stats file to match the transcripts being evaluated. The training dataset stays the same for all transcripts. Then run main.py.
- For the end-to-end approach, first select the model that you want to train. This change has to be made on line 1559, under the MMI_Model_Single class in mmi_module_only_audio.py. Make the corresponding change on lines 32 and 33 of run_iemocap-ours-meld.py. The file paths in the main function can be changed as needed. Then run the run_iemocap-ours-meld.py script using the corresponding command below.
C:\> conda create -n DeToxy pip python=3.6
C:\> activate DeToxy
(DeToxy) C:\> python -m pip install --upgrade pip
(DeToxy) C:\> pip install -r requirements.txt
(DeToxy) C:\> jupyter notebook
(DeToxy) C:\Toxicity-Detection-in-Spoken-Utterances\e2e\> python transcribe.py
(DeToxy) C:\Toxicity-Detection-in-Spoken-Utterances\two_step_approach\DB\> python main.py
(DeToxy) C:\Toxicity-Detection-in-Spoken-Utterances\e2e\> python run_iemocap-ours-meld.py
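
The step that asks you to edit default paths inside the parse function could, as an alternative, be handled with command-line overrides so that the script itself does not need to change between runs. The sketch below is not the repository's code: the argument names (`--train_csv`, `--test_csv`, `--stats_file`) and the `stats.csv` default are hypothetical and only illustrate the pattern with Python's standard `argparse` module.

```python
import argparse


def parse(argv=None):
    """Build run settings from defaults, overridable on the command line.

    All argument names and defaults here are illustrative assumptions,
    not the options defined in two_step_approach/DB/main.py.
    """
    parser = argparse.ArgumentParser(
        description="Hypothetical path settings for a two-step run")
    parser.add_argument("--train_csv", default="data/train.csv",
                        help="training split (kept fixed across runs)")
    parser.add_argument("--test_csv", default="data/test.csv",
                        help="transcripts to evaluate")
    parser.add_argument("--stats_file", default="stats.csv",
                        help="where to write evaluation statistics")
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)
```

With a function like this, one evaluation per transcript set only needs a different `--test_csv` and `--stats_file` value, while the training CSV default stays untouched.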
Please cite the following paper if you find this dataset useful in your research:
S. Ghosh, S. Lepcha, Sakshi, R. R. Shah and S. Umesh. DeToxy: A Large-Scale Multimodal Dataset for Toxicity Classification in Spoken Utterances. arXiv preprint arXiv:2110.07592 (2022).