Automatic Toxic Comment Detection in Social Media for Russian

NRU HSE, Fundamental and computational linguistics, Moscow 2022

All collected and utilized data is provided in the corresponding folders and files (see detailed structure below).
Code for replicating one of the models is also provided in this fork.

Links to the trained single- and multitask BERT models:

VK data, 1 task
https://drive.google.com/uc?id=1barEeEUgEUXHHkYN-l-s2i8AZYEyTtYp
VK data, 2 tasks
https://drive.google.com/uc?id=1--iwGBQHBUXXktC9kqHllnmPzwN9wRYz
data from several sources, 1 task
https://drive.google.com/uc?id=1gJ1IPzpaVG81EzyyF7l9m67L_IbH4uZQ
data from several sources, 2 tasks
https://drive.google.com/uc?id=1Xu-4-3kYv8HCU2j7zgx84FZm778lzIKk

If the links become unavailable, feel free to contact me at [email protected]
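Outside Colab, the same checkpoints can be fetched from Python. The sketch below is one possible way to do it, not part of the original code: the `MODEL_URLS` keys, the `download_model` helper, and the `sample_dir/model.pt` target are illustrative names, and it assumes each link points to a single checkpoint file that the third-party `gdown` package can download.

```python
from pathlib import Path

# Google Drive links copied from the list above,
# keyed by (training data, number of tasks). The keys are illustrative.
MODEL_URLS = {
    ("vk", 1): "https://drive.google.com/uc?id=1barEeEUgEUXHHkYN-l-s2i8AZYEyTtYp",
    ("vk", 2): "https://drive.google.com/uc?id=1--iwGBQHBUXXktC9kqHllnmPzwN9wRYz",
    ("several_sources", 1): "https://drive.google.com/uc?id=1gJ1IPzpaVG81EzyyF7l9m67L_IbH4uZQ",
    ("several_sources", 2): "https://drive.google.com/uc?id=1Xu-4-3kYv8HCU2j7zgx84FZm778lzIKk",
}


def download_model(data: str, tasks: int, out_dir: str = "sample_dir") -> str:
    """Download one checkpoint and return the local file path (hypothetical helper)."""
    # gdown is a third-party package (pip install gdown); it is imported here
    # so that the URL mapping stays usable even when gdown is not installed.
    import gdown

    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Assumption: the link resolves to a single checkpoint file.
    output = str(target / "model.pt")
    gdown.download(url=MODEL_URLS[(data, tasks)], output=output, quiet=False)
    return output
```

For example, `download_model("vk", 2)` would request the two-task model trained on VK data.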

Quick example of running inference with the multitask models in Colab:

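!gdown https://drive.google.com/link_from_above

# inferPipeline is provided by the multi-task-NLP repository, which must be
# available on the Python path (module name infer_pipeline is assumed here)
from infer_pipeline import inferPipeline

pipe = inferPipeline(modelPath = 'sample_dir/model.pt',
                     maxSeqLen = 128)
# each input text goes in its own inner list
# for predicting on one task:
output = pipe.infer([['every text is in'], ['separate list']],
                    ['ToxicityDetection'])
# for predicting on both tasks:
output = pipe.infer([['every text is in'], ['separate list']],
                    ['ToxicityDetection', 'DistortionDetection'])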

For more details, please refer to the multi-task-NLP documentation.
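The calls above expect one inner list per text. A minimal sketch of building that shape from a flat list of comments follows; `to_samples` is a hypothetical helper name, not part of multi-task-NLP.

```python
from typing import Iterable, List


def to_samples(texts: Iterable[str]) -> List[List[str]]:
    """Wrap each text in its own list, the input shape used in the calls above."""
    return [[text] for text in texts]


comments = ["first comment", "second comment"]
samples = to_samples(comments)
print(samples)  # [['first comment'], ['second comment']]
```

The resulting `samples` list can then be passed as the first argument of `pipe.infer`.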

Repository structure:

├── hypothesis_testing_data  # data needed to test the hypothesis  
│   ├── uncorrected_data_NEW.tsv  # uncorrected test comments  
│   ├── corrected_data_NEW.tsv  # test comments with manual correction  
│   └── preprocessed_data_NEW.tsv  # test comments preprocessed automatically  
│  
├── preprocessing_data  # data needed for preprocessing approach  
│   ├── bad_wordlist.txt  # list of offensive, obscene and otherwise toxic words  
│   └── replacement.json  # rules for replacing cyrillic letters  
│  
├── toxicity_corpus  # folder for publishing collected distorted toxicity data  
│   ├── DATASTATEMENT.md  # data statement for the corpus  
│   └── distorted_toxicity.tsv  # corpus file  
│      
├── training_data  # train and val data and task files for training neural networks  
│   ├── ...     
│  
├── Testing models.ipynb  # notebook for first experiment  
├── Approach 1 - preprocessing.ipynb  # notebook for first approach of second experiment  
├── Approach 2 - MT BERT.ipynb  # notebook for second approach of second experiment  
├── parsing and preparing data.ipynb  # code for getting and structuring data  
├── corpus analysis.ipynb  # code for counting some corpus statistics  
└── README.md
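The `.tsv` files listed above can be inspected with the standard library alone. The sketch below assumes each file is tab-separated with a header row and default CSV quoting; the column names are not documented here, so the demo uses made-up columns (`col_a`, `col_b`) on in-memory data, and `parse_tsv` / `read_tsv` are illustrative helper names.

```python
import csv
import io
from typing import Dict, Iterable, List


def parse_tsv(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse tab-separated lines with a header row into a list of row dicts."""
    return list(csv.DictReader(lines, delimiter="\t"))


def read_tsv(path: str) -> List[Dict[str, str]]:
    """Read a TSV file from disk, e.g. toxicity_corpus/distorted_toxicity.tsv."""
    with open(path, encoding="utf-8", newline="") as handle:
        return parse_tsv(handle)


# Demo on in-memory data with hypothetical column names.
demo = io.StringIO("col_a\tcol_b\nfirst\t1\nsecond\t0\n")
rows = parse_tsv(demo)
print(list(rows[0].keys()))  # column names taken from the header row
```

Printing the keys of the first row is a quick way to discover the actual columns of each file before relying on them.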

Repository: alla-g/toxicity-detection-thesis
