# A strong and simple baseline to classify toxic comments on Wikipedia with Keras

This time we’re going to discuss a current machine learning competition on Kaggle. In this competition, you’re challenged to build a multi-headed model that’s capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate. You’ll be using a dataset of comments from Wikipedia’s talk page edits. I will show you how to create a strong baseline using Python and Keras.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use("ggplot")
%matplotlib inline


Let’s have a look at the data first and fill missing values, such as an empty comment_text, with a placeholder string.

train_df = pd.read_csv("data/train.csv").fillna("sterby")
test_df = pd.read_csv("data/test.csv").fillna("sterby")  # assumed path; test_df is used below

train_df.head()


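| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 22256635 | Nonsense? kiss off, geek. what I said is true... | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 27450690 | "\n\n Please do not vandalize pages, as you di... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 54037174 | "\n\n ""Points of interest"" \n\nI removed the... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 77493077 | Asking some his nationality is a Racial offenc... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 79357270 | The reader here is not going by my say so for ... | 0 | 0 | 0 | 0 | 0 | 0 |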

So we have the comment in the field “comment_text” and six different labels. Note that this problem is a multi-label multi-class problem. You can find further information about this type of problem here [link to blogpost].
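To make the multi-label setting concrete, here is a minimal sketch with made-up label rows (they are not taken from the dataset, and `active_labels` is a hypothetical helper). Each comment gets a six-dimensional 0/1 vector, and any number of entries can be set at once.

```python
import numpy as np

label_names = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical label rows for three comments (illustrative values only).
y_toy = np.array([
    [1, 0, 1, 0, 1, 0],  # toxic, obscene and insult at the same time
    [0, 0, 0, 0, 0, 0],  # clean comment: no label is set
    [1, 0, 0, 0, 0, 0],  # only toxic
])

def active_labels(row):
    """Return the names of all labels that are set in a 0/1 row."""
    return [name for name, flag in zip(label_names, row) if flag == 1]

for row in y_toy:
    print(active_labels(row))

# Number of labels per comment; counts other than exactly one are what
# separates multi-label from ordinary single-label classification.
labels_per_comment = y_toy.sum(axis=1)
print(labels_per_comment)
```

Because a row may contain several ones or none at all, a softmax over the six labels would not fit; each label needs its own yes/no decision.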

X_train = train_df["comment_text"].values
y_train = train_df[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]].values
X_test = test_df["comment_text"].values


So how do we approach the problem?

## A fasttext-like model

A simple and efficient baseline for sentence classification is to represent sentences as bag of words and train a linear classifier, e.g., a logistic regression or an SVM. However, linear classifiers do not share parameters among features and classes, especially in a multi-label setting like ours. This possibly limits their generalization ability. Common solutions to this problem are to factorize the linear classifier into low rank matrices or to use multilayer neural networks.
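As a rough illustration of the bag-of-words baseline described above, the following sketch uses a toy vocabulary, a toy comment, and random weights; all names and values are assumptions made for the example. It shows that a plain linear model keeps one independent weight per (word, label) pair.

```python
import numpy as np

# Toy vocabulary, purely illustrative.
vocab = {"you": 0, "are": 1, "a": 2, "geek": 3, "thanks": 4}
n_labels = 6

def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs; word order is discarded."""
    counts = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in vocab:
            counts[vocab[word]] += 1
    return counts

rng = np.random.default_rng(0)
# One independent weight per (word, label) pair: nothing is shared between labels.
W = rng.normal(scale=0.1, size=(len(vocab), n_labels))
b = np.zeros(n_labels)

x = bag_of_words("You are a geek a geek", vocab)
logits = x @ W + b
probs = 1.0 / (1.0 + np.exp(-logits))  # one sigmoid per label, as in logistic regression
print(x)
print(probs.shape)
```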

We represent each sample as a sequence of words $x_1, \dots, x_N$. For each word we have a look-up table $A$ of so-called word embeddings. These $m$-dimensional embeddings are initialized randomly and updated during training. The word representations are then averaged into a text representation $h = \frac{1}{N}\sum_{i=1}^N A[x_i]$, which is in turn fed to a linear classifier. The text representation $h$ is a hidden variable with the dimension of the embedding, and it can potentially be reused. This architecture is similar to the CBOW model of Mikolov et al., where the middle word is replaced by a label. But here, the embeddings are trained to solve a specific problem.
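The look-up and averaging step can be sketched in a few lines of NumPy. The sizes, the word indices, and the random weights below are toy assumptions; in the real model both $A$ and the classifier weights are learned.

```python
import numpy as np

vocab_size, m = 10, 4          # toy sizes, chosen only for illustration
rng = np.random.default_rng(0)
A = rng.uniform(-0.05, 0.05, size=(vocab_size, m))  # randomly initialised look-up table

x = np.array([3, 7, 7, 1])     # a comment as a sequence of word indices x_1..x_N
word_vectors = A[x]            # shape (N, m): one embedding per word
h_mean = word_vectors.mean(axis=0)  # averaged text representation, shape (m,)

# Linear classifier on top of h: one sigmoid output per label.
W = rng.normal(scale=0.1, size=(m, 6))
b = np.zeros(6)
probs = 1.0 / (1.0 + np.exp(-(h_mean @ W + b)))
print(h_mean.shape, probs.shape)
```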

We adjust this architecture slightly and compute the text representation as $h_j = \max_{1 \le i \le N} A[x_i]_j$ for $j \in \{1,\dots,m\}$. We do this because the maximum collects the strongest features from the text, which is what we want in our use case. For sentiment analysis, as in Facebook’s paper, the average is the more suitable approach.
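A small sketch with hand-picked toy embeddings shows the difference between the two pooling choices; the numbers are made up for illustration.

```python
import numpy as np

# Hand-picked toy embeddings for a three-word comment (N = 3, m = 2).
word_vectors = np.array([
    [0.1, -0.2],
    [0.9,  0.0],   # one word with a very strong first feature
    [0.2, -0.1],
])

h_max = word_vectors.max(axis=0)    # h_j = max_i A[x_i]_j
h_mean = word_vectors.mean(axis=0)  # the averaged variant, for comparison

print(h_max)   # the strong feature 0.9 survives unchanged
print(h_mean)  # the same feature is diluted by the other words
```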

## Now to the implementation part:

First we do the necessary keras imports…

from keras.preprocessing import sequence
from keras.models import Model
from keras.layers import Dense, Embedding, GlobalMaxPooling1D, Input
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


… and set some parameters.

max_features = 20000  # number of words we want to keep
maxlen = 100  # max length of the comments in the model
batch_size = 64  # batch size for the model
embedding_dims = 20  # dimension of the hidden variable, i.e. the embedding dimension


Next we have to tokenize the comments.

tok = Tokenizer(num_words=max_features)
tok.fit_on_texts(list(X_train) + list(X_test))
x_train = tok.texts_to_sequences(X_train)
x_test = tok.texts_to_sequences(X_test)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')
print('Average train sequence length: {}'.format(np.mean(list(map(len, x_train)), dtype=int)))
print('Average test sequence length: {}'.format(np.mean(list(map(len, x_test)), dtype=int)))

95851 train sequences
226998 test sequences
Average train sequence length: 65
Average test sequence length: 75
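To see roughly what the tokenization step does, here is a simplified pure-Python sketch. It is not the Keras implementation: the helper names are hypothetical, and punctuation filtering is left out. Words are ranked by frequency, the most frequent word gets index 1, and only indices below `num_words` are kept.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace words by indices and drop words whose index is not below num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

toy_texts = ["please do not vandalize pages", "do not do that"]
word_index = fit_word_index(toy_texts)
sequences = texts_to_sequences(toy_texts, word_index, num_words=3)
print(sequences)
```

With `num_words=3` only the two most frequent words ("do" and "not") survive, so rare words simply vanish from the sequences.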


Then we pad the sequences to our desired length.

x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)  # the test set must be padded as well
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

x_train shape: (95851, 100)
x_test shape: (226998, 100)
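With its default settings, `pad_sequences` pads short sequences with zeros at the front and cuts long sequences at the front. The following NumPy sketch mimics that behaviour with a hypothetical helper `pad_pre`; it is an illustration, not the Keras code.

```python
import numpy as np

def pad_pre(seqs, maxlen):
    """Left-pad with zeros and keep only the last maxlen indices of longer sequences."""
    out = np.zeros((len(seqs), maxlen), dtype=int)
    for row, seq in enumerate(seqs):
        trimmed = seq[-maxlen:]                 # truncate from the front
        if trimmed:
            out[row, -len(trimmed):] = trimmed  # align to the right
    return out

padded = pad_pre([[5, 3], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

Every row now has the same length, which is what lets the comments be stacked into a single array of shape (number of comments, maxlen).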


Now we can set up the model. Note that the GlobalMaxPooling1D layer computes our hidden variable $h$, and the sigmoid output layer with six units computes a multi-label multi-class linear model on top of the hidden variable.

comment_input = Input((maxlen,))

# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
comment_emb = Embedding(max_features, embedding_dims, input_length=maxlen,
                        embeddings_initializer="uniform")(comment_input)

# we add a GlobalMaxPooling1D, which will extract features from the embeddings
# of all words in the comment
h = GlobalMaxPooling1D()(comment_emb)

# We project onto a six-unit output layer, and squash it with a sigmoid:
output = Dense(6, activation='sigmoid')(h)

model = Model(inputs=comment_input, outputs=output)

# no optimizer was given here; 'adam' is an assumed choice
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
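What the sigmoid output layer and the binary cross-entropy loss compute for a single comment can be sketched in NumPy. The hidden variable, the weights, and the target below are toy assumptions, and the functions are simplified stand-ins rather than the Keras implementations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean over all labels of -[y*log(p) + (1-y)*log(1-p)]."""
    p = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

h = np.array([0.5, -1.0, 2.0])          # toy hidden variable (m = 3)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 6))  # weights of the six-unit output layer
b = np.zeros(6)

y_prob = sigmoid(h @ W + b)             # six independent probabilities
y_true = np.array([1, 0, 0, 0, 1, 0])   # made-up multi-hot target
print(binary_crossentropy(y_true, y_prob))
```

Each of the six outputs is treated as its own binary problem, which is why a sigmoid with binary cross-entropy is used here instead of a softmax.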


Now we train the model for three epochs.

hist = model.fit(x_train, y_train, batch_size=batch_size, epochs=3, validation_split=0.1)

Train on 86265 samples, validate on 9586 samples
Epoch 1/3
86265/86265 [==============================] - 24s 278us/step - loss: 0.0837 - acc: 0.9739 - val_loss: 0.0592 - val_acc: 0.9798
Epoch 2/3
86265/86265 [==============================] - 25s 285us/step - loss: 0.0531 - acc: 0.9810 - val_loss: 0.0568 - val_acc: 0.9801
Epoch 3/3
86265/86265 [==============================] - 24s 279us/step - loss: 0.0491 - acc: 0.9821 - val_loss: 0.0570 - val_acc: 0.9799


That looks good. This provides a strong first baseline, so feel free to try it and join the competition. Some ideas to improve this baseline can be found in this article.