Kaggle Competition: Quora Insincere Questions Classification

Introduction

This competition is sponsored by Quora. The objective is to predict whether a question asked on Quora is sincere or not. This is a kernels-only competition with a two-hour runtime constraint.

An insincere question is defined as a question intended to make a statement rather than look for helpful answers. Some characteristics that can signify that a question is insincere:

has a non-neutral tone
is disparaging or inflammatory
isn't grounded in reality
uses sexual content

Submissions are evaluated on the F1 score between the predicted and the observed targets.
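
Because F1 is computed on hard 0/1 labels, a model that outputs probabilities also needs a decision threshold. The following is a minimal sketch of the metric and a simple threshold search on held-out predictions. The helper names (f1_score, best_threshold) and the candidate grid are illustrative assumptions, not part of the competition code or of the models linked below.

```python
def f1_score(y_true, y_pred):
    """F1 = 2 * precision * recall / (precision + recall), positive class = 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision/recall are 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, y_prob, candidates=None):
    """Return (best F1, threshold) over a grid of cutoffs (hypothetical helper)."""
    if candidates is None:
        # Assumed grid: 0.10, 0.11, ..., 0.90
        candidates = [i / 100 for i in range(10, 91)]
    scored = []
    for t in candidates:
        y_pred = [1 if p >= t else 0 for p in y_prob]
        scored.append((f1_score(y_true, y_pred), t))
    return max(scored)  # ties on F1 resolve to the larger threshold
```

In practice the threshold would be tuned on validation predictions only and then applied unchanged to the test set.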

Model Development

I follow a standard workflow for model development: start with a simple linear model, then add complexity as needed. Eventually, I will deploy neural network models with an ensemble technique for the final submission. The steps of my model development are as follows:

Establish a strong baseline with the hybrid "NB-SVM" model (link to model V0)
Try tree-based model LightGBM (link to model V1)
Try a blending model: "NB-SVM" + LightGBM (link to the blending model V11)
Establish baseline for neural network model (link to model V2)

1st layer: embedding layer without pretrained weights
2nd layer: spatial dropout
3rd layer: bidirectional LSTM
4th layer: global max pooling 1D
5th layer: output dense layer
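
The five layers above could be wired together roughly as follows. This is a sketch assuming TensorFlow's Keras API. The function name build_baseline and every hyperparameter (vocabulary size, sequence length, embedding size, LSTM units, dropout rate, optimizer) are placeholder assumptions, not the values used in model V2.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_baseline(vocab_size=50000, maxlen=70, embed_dim=300,
                   lstm_units=64, dropout=0.2):
    """Embedding -> spatial dropout -> BiLSTM -> global max pool -> dense."""
    inputs = keras.Input(shape=(maxlen,), dtype="int32")
    # 1st layer: embedding trained from scratch (no pretrained weights)
    x = layers.Embedding(vocab_size, embed_dim)(inputs)
    # 2nd layer: drops entire embedding channels instead of single values
    x = layers.SpatialDropout1D(dropout)(x)
    # 3rd layer: bidirectional LSTM, keeping the output at every timestep
    x = layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True))(x)
    # 4th layer: max over the time axis
    x = layers.GlobalMaxPooling1D()(x)
    # 5th layer: a single sigmoid unit for the binary target
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

The LSTM returns the full sequence so that the pooling layer has a time axis to reduce over.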

Try neural network models with pretrained embedding weights. I used a neural network architecture very similar to the one above. The only changes are 1) adding text cleaning and 2) using pretrained word embedding weights.

Neural Networks with GloVe word embedding (link to model V30)
Neural Networks with Paragram word embedding (link to model V31)
Neural Networks with FastText word embedding (link to model V32)
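
All three variants need the pretrained vectors turned into a weight matrix indexed by token id. A minimal sketch, assuming the vectors come as text lines of the form "token v1 v2 ... v_dim" and that word_index maps tokens to integer ids (as a Keras-style tokenizer would produce). The function names, the lowercase fallback, and the zero vectors for unknown words are illustrative choices, not necessarily what models V30 to V32 do.

```python
import io
import numpy as np


def load_vectors(lines, dim):
    """Parse 'token v1 ... v_dim' lines into a {token: vector} dict."""
    vectors = {}
    for line in lines:
        # Split from the right so tokens that contain spaces stay intact
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # e.g. a "count dim" header line, or a malformed row
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors


def build_embedding_matrix(word_index, vectors, dim, max_features):
    """Row i holds the vector for the token with id i; unknown tokens stay zero."""
    matrix = np.zeros((max_features, dim), dtype="float32")
    for word, idx in word_index.items():
        if idx >= max_features:
            continue
        vec = vectors.get(word)
        if vec is None:
            vec = vectors.get(word.lower())  # assumed fallback for cased tokens
        if vec is not None:
            matrix[idx] = vec
    return matrix


# Tiny in-memory stand-in for a real embedding file
demo = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 1 2 3\n")
demo_vectors = load_vectors(demo, dim=3)
demo_matrix = build_embedding_matrix({"hello": 1, "World": 2, "oov": 3},
                                     demo_vectors, dim=3, max_features=5)
```

The resulting matrix can then be used as the initial weights of the embedding layer.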

Try LSTM Attention with GloVe word embedding (link to model V40)
Use both LSTM Attention and Capsule Neural Network (CapsNet) (link to model V5)
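
For readers unfamiliar with attention over LSTM outputs, the sketch below shows one common formulation in NumPy: each timestep gets a score, the scores are normalized with a softmax, and the LSTM outputs are averaged with those weights. This is an assumption about the general technique; whether models V40 and V5 use this exact variant is not stated here, and the names attention_pool, w, and b are hypothetical.

```python
import numpy as np


def attention_pool(h, w, b):
    """Weighted average of LSTM outputs.

    h: (timesteps, features) LSTM outputs for one question
    w: (features,) learned projection, b: (timesteps,) learned bias
    """
    scores = np.tanh(h @ w + b)              # one score per timestep
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights = weights / weights.sum()
    context = weights @ h                    # (features,) weighted sum
    return context, weights


rng = np.random.default_rng(0)
h = rng.normal(size=(5, 4))
context, weights = attention_pool(h, w=rng.normal(size=4), b=np.zeros(5))
```

Compared with global max pooling, this replaces the fixed max over time with a learned weighting of the timesteps.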