This time we’re going to discuss a current machine learning competition on Kaggle. In this competition, you’re challenged to build a multi-headed model that’s capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate. You’ll be using a dataset of comments from Wikipedia’s talk page edits. I will show you how to create a strong baseline based on fastText using Python and Keras.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use("ggplot")
%matplotlib inline
Let’s have a look at the data first and fill missing comment_text values with a placeholder string.
train_df = pd.read_csv("data/train.csv").fillna("sterby")
test_df = pd.read_csv("data/test.csv").fillna("sterby")
train_df.head()
So we have the comment in the field “comment_text” and six different labels. Note that this is a multi-label classification problem: each comment can carry several of the six labels at once, or none of them. You can find further information about this type of problem here.
X_train = train_df["comment_text"].values
y_train = train_df[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]].values
X_test = test_df["comment_text"].values
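To make the multi-label structure concrete, here is a minimal sketch with a made-up label matrix. The three rows are invented for illustration and are not taken from the dataset. Each row may contain several ones, or none at all.

```python
import numpy as np

labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical label rows for three comments (not real data).
y_toy = np.array([
    [0, 0, 0, 0, 0, 0],   # clean comment: no label set
    [1, 0, 1, 0, 1, 0],   # toxic, obscene and insulting at once
    [1, 0, 0, 0, 0, 0],   # only toxic
])

labels_per_comment = y_toy.sum(axis=1)                        # active labels per row
positives_per_label = dict(zip(labels, y_toy.sum(axis=0).tolist()))

print(labels_per_comment)
print(positives_per_label)
```

The same two sums applied to y_train give a quick picture of how often labels co-occur and how rare each label is.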
So how do we approach the problem?
A fasttext-like model
A simple and efficient baseline for sentence classification is to represent sentences as bag of words and train a linear classifier, e.g., a logistic regression or an SVM. However, linear classifiers do not share parameters among features and classes, especially in a multi-label setting like ours. This possibly limits their generalization ability. Common solutions to this problem are to factorize the linear classifier into low rank matrices or to use multilayer neural networks.
We represent each sample as a sequence of words $w_1, \dots, w_N$. For each word we have a look-up table $E$ for so-called word embeddings. These $d$-dimensional embeddings are initialized randomly and updated during training. The word representations are then averaged into a text representation $h$, which is in turn fed to a linear classifier. The text representation $h$ is a hidden variable with the dimension of the embedding, which can potentially be reused. This architecture is similar to the CBOW model of Mikolov et al., where the middle word is replaced by a label. But here, the embeddings are trained to solve a specific problem.
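The look-up, averaging and linear classification steps can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the toy sizes, the names E, W and b, and the random initialization are made up here, and the sigmoid output is chosen to match the six-label setting of this post rather than taken from the original paper.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 10      # toy vocabulary size (the post uses max_features = 20000)
embedding_dims = 4   # toy embedding dimension (the post uses 20)
n_labels = 6

# Look-up table: one randomly initialized embedding vector per word index.
E = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dims))

# Linear classifier on top of the text representation (hypothetical weights).
W = rng.normal(size=(embedding_dims, n_labels))
b = np.zeros(n_labels)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

word_ids = np.array([3, 7, 7, 1])   # a comment as a sequence of word indices
x = E[word_ids]                     # shape (4, embedding_dims): one row per word
h = x.mean(axis=0)                  # averaged text representation
p = sigmoid(h @ W + b)              # one independent probability per label

print(h.shape, p.shape)
```

In training, E, W and b would all be updated by gradient descent; here they stay at their random initial values.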

Figure: fastText model architecture, from A. Joulin et al.
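We make one adjustment to this architecture: instead of averaging, we compute the text representation as the element-wise maximum over the word embeddings, $h_j = \max_{i=1,\dots,N} x_{i,j}$ for $j = 1, \dots, d$, where $x_i \in \mathbb{R}^d$ is the embedding of the $i$-th word. We do this because the maximum collects the strongest features from the text, which is what we want in our use case. For sentiment analysis, as in Facebook’s paper, the average is a more suitable approach.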
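A small numeric sketch shows the difference between the two pooling choices. The embedding values below are invented for illustration: only the fourth word carries a strong signal in the first feature.

```python
import numpy as np

# Hypothetical embeddings of a five-word comment with d = 3 (made-up values).
x = np.array([
    [0.1, 0.0, 0.2],
    [0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0],
    [2.0, 0.0, 0.1],   # e.g. one strongly offensive word
    [0.0, 0.2, 0.1],
])

h_mean = x.mean(axis=0)   # fastText-style average over the words
h_max = x.max(axis=0)     # element-wise maximum over the words

print(h_mean)
print(h_max)
```

The average dilutes the single strong feature across all five words, while the maximum passes it on unchanged. That is the behaviour we want when one word can make a whole comment toxic.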
Now to the implementation part:
First we do the necessary keras imports…
from keras.preprocessing import sequence
from keras.models import Model
from keras.layers import Input, Dense, Embedding, GlobalMaxPooling1D
from keras.preprocessing.text import Tokenizer
from keras.optimizers import Adam
… and set some parameters.
max_features = 20000 # number of words we want to keep
maxlen = 100 # max length of the comments in the model
batch_size = 64 # batch size for the model
embedding_dims = 20 # dimension of the hidden variable, i.e. the embedding dimension
Next we have to tokenize the comments.
tok = Tokenizer(num_words=max_features)
tok.fit_on_texts(list(X_train) + list(X_test))
x_train = tok.texts_to_sequences(X_train)
x_test = tok.texts_to_sequences(X_test)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')
print('Average train sequence length: {}'.format(np.mean(list(map(len, x_train)), dtype=int)))
print('Average test sequence length: {}'.format(np.mean(list(map(len, x_test)), dtype=int)))
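To get a feeling for what the tokenizer does, here is a simplified pure-Python stand-in. It is not the Keras implementation: the function names are made up, and it only lowercases and splits on whitespace without filtering punctuation. It does mimic two properties of the Keras Tokenizer: indices are assigned by word frequency starting at 1, and only words with an index below num_words are kept.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign indices 1, 2, ... to words, most frequent first (0 stays free for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown words and words with index >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

toy_texts = ["you are great", "you are you", "you great great work"]
word_index = fit_word_index(toy_texts)
seqs = texts_to_sequences(toy_texts, word_index, num_words=3)

print(word_index)
print(seqs)
```

With num_words=3 only the two most frequent words survive, so rare words simply disappear from the sequences.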
Then we pad the sequences to our desired length.
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
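By default, pad_sequences pads and truncates at the front of each sequence (padding='pre', truncating='pre'). The following NumPy sketch mimics that default behaviour on two toy sequences; pad_pre is a hypothetical helper written for illustration, not part of Keras.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` tokens of long ones."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        trunc = seq[-maxlen:]
        if len(trunc) > 0:
            out[row, -len(trunc):] = trunc
    return out

padded = pad_pre([[5, 3], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

Note that with these defaults a long comment keeps its end, not its beginning.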
Now we can set up the model. Note that the GlobalMaxPooling1D layer will compute our hidden variable, and the sigmoid output layer with six units will compute a multi-label linear model on top of the hidden variable.
comment_input = Input((maxlen,))
# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
comment_emb = Embedding(max_features, embedding_dims, input_length=maxlen,
embeddings_initializer="uniform")(comment_input)
# we add a GlobalMaxPooling1D, which will extract features from the embeddings
# of all words in the comment
h = GlobalMaxPooling1D()(comment_emb)
# We project onto a six-unit output layer, and squash it with a sigmoid:
output = Dense(6, activation='sigmoid')(h)
model = Model(inputs=comment_input, outputs=output)
model.compile(loss='binary_crossentropy',
optimizer=Adam(0.01),
metrics=['accuracy'])
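As a quick sanity check, the number of trainable parameters of this model can be worked out by hand. The sketch below only does the arithmetic for the parameter values set above; the pooling layer has no parameters of its own.

```python
max_features = 20000
embedding_dims = 20
n_labels = 6

embedding_params = max_features * embedding_dims      # look-up table
dense_params = embedding_dims * n_labels + n_labels   # weights + biases of the output layer
total_params = embedding_params + dense_params

print(embedding_params, dense_params, total_params)
```

Almost all parameters sit in the embedding table, which is why max_features and embedding_dims are the main knobs for model size.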
Now we train the model for three epochs.
hist = model.fit(x_train, y_train, batch_size=batch_size, epochs=3, validation_split=0.1)
That looks good. This provides a strong first baseline. So feel free to try this and join the competition. An important part of text classification is to understand why your model makes a decision. I wrote about it here, check it out.
References and further reading:
- Joulin et al.: Bag of Tricks for Efficient Text Classification [https://arxiv.org/pdf/1607.01759.pdf]
01/05/2018 at 11:02 pm
Hi! I am doing something very similar (but I am using transfer learning). However I run into a problem when I try to test my model. I am getting strange results like: [7.64048770e-02 1.20464265e-01 6.57446861e-01 4.18934338e-02]
but this should be: [0 1 1 0].
My training and validation accuracy is at 93% – do you have suggestions on testing these multilabel multiclassifier models? Is there something similar to model.predict_classes() for the multilabel case?
01/06/2018 at 7:58 am
Are you sure you use binary_crossentropy loss and sigmoid output units? Also note that neural nets output probabilities. If you want binary values, threshold the predictions at 0.5 (in the multilabel or binary case) or pick the most probable class (in the multiclass case).
Does this answer your question?
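As a minimal sketch, both options look like this in NumPy when applied to the probabilities quoted in the question. The 0.5 threshold is the one suggested above and may need tuning per label.

```python
import numpy as np

probs = np.array([7.64048770e-02, 1.20464265e-01, 6.57446861e-01, 4.18934338e-02])

multilabel = (probs >= 0.5).astype(int)   # one independent yes/no decision per label
multiclass = int(np.argmax(probs))        # index of the single most probable class

print(multilabel)   # [0 0 1 0]
print(multiclass)   # 2
```

For a batch of predictions with shape (n_samples, n_labels), the same threshold expression works unchanged, and np.argmax needs axis=1.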
02/20/2018 at 3:25 am
Does it make sense to threshold the model prediction by 0.5 and then applying the sklearn accuracy_score,recall, precision and f1-score function to gauge the model’s performance ?
I’m a novice so I am having a hard time understanding how to implement a metric for multi label with keras/python properly.
02/24/2018 at 8:25 am
This totally makes sense! You can implement a keras callback that does this evaluation after every epoch on a validation set. Does this help you?
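The evaluation itself can be sketched in plain NumPy. The helper micro_prf and the toy arrays below are made up for illustration. It thresholds the probabilities and computes micro-averaged precision, recall and F1 over all labels. In a callback you would call something like this on the model predictions for the validation set at the end of each epoch.

```python
import numpy as np

def micro_prf(y_true, y_prob, threshold=0.5):
    """Micro-averaged precision, recall and F1 for multi-label predictions."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_prob) >= threshold
    tp = np.sum(y_true & y_pred)     # predicted and actually set
    fp = np.sum(~y_true & y_pred)    # predicted but not set
    fn = np.sum(y_true & ~y_pred)    # set but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return float(precision), float(recall), float(f1)

# Toy example with three samples and three labels (invented values).
y_true = [[1, 0, 1], [0, 0, 1], [0, 0, 0]]
y_prob = [[0.9, 0.2, 0.4], [0.1, 0.7, 0.8], [0.3, 0.1, 0.2]]

precision, recall, f1 = micro_prf(y_true, y_prob)
print(precision, recall, f1)
```

scikit-learn's precision_recall_fscore_support with average="micro" should give the same quantities on the thresholded predictions, and it also offers per-label and macro averages.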
02/26/2018 at 3:03 pm
Hi, Yes it does. Thanks
BTW, can I apply this multi label approach for tagging data columns in tables. Example, Column A represents water level values from a particular river in Canada and Column B represents water level values from a river in the USA. I add the tags to column A with ‘canada river’ and ‘canada water level’ and column B with ‘usa river’ and ‘usa water level’. If I see a new column in a table I would then like to apply the appropriate tags based on the values in the new column . What features would I need to use as input ?
03/13/2018 at 8:41 am
Hi Dai,
I’m not sure if I get your question. Of course you can use this approach to predict multiple outputs at the same time. Your case sounds more like regression, so you would use a linear output layer and an appropriate loss function (e.g. mean squared error).
Can you describe your problem again?
01/10/2018 at 2:11 pm
Isn’t class imbalance affecting your result ?
01/10/2018 at 6:57 pm
First note, that this is just a baseline approach. I will show a more sophisticated method soon.
In general I can say, I found class imbalance is not a big problem here. I’m not sure why yet and I might be wrong. What are your experiences with the dataset?
01/22/2018 at 7:45 pm
Hi Tobias, great work. Do you now have your follow-up post “In the next post we will look into a more sophisticated concept and extend the current model by using an attention mechanism.”? Look forward to reading it.
01/27/2018 at 12:01 pm
Hi Gu, I’m happy that you like it!
I’m working on it. But it will take some time since I’m experimenting with different concepts and also have other work to do 🙂
But you will get it as soon as possible.
02/21/2018 at 5:53 pm
Why you are not using something like
model = Sequential()
model.add(—- write code line here —–)
just for reference see how the person has written the code in the answer section
i am newbie to Keras
https://stackoverflow.com/questions/43728235/what-is-the-difference-between-keras-maxpooling1d-and-globalmaxpooling1d-functi
02/24/2018 at 8:28 am
This is also a valid way to write keras models. Note that keras has two APIs. The sequential (that is used in your reference) and the functional (which I use). I agree, that the sequential is easier to use in the beginning, but if you want to build more complicated models, then you will need to use the functional API.
02/24/2018 at 8:05 pm
Hi,
Thanks for your previous reply.
my next question is why we are not using and setting “embeddings_regularizer” .
Can we use l1 or l2 regularization in multiclass & multi label classification
03/13/2018 at 8:30 am
Hi Arsalan,
I had no success with using regularizers for embeddings, but you can try it.
Of course you can use l1 and l2 regularization also for multiclass and multilabel classification.
Tell me how it works! 🙂
02/25/2018 at 6:01 am
can you please explain the theory behind setting the values of these parameters
max_features = 20000 # number of words we want to keep
maxlen = 100 # max length of the comments in the model
embedding_dims = 20 # dimension of the hidden variable, i.e. the embedding dimension
03/13/2018 at 8:37 am
-max_features is basically the upper limit of words in your vocabulary. This acts as a kind of regularizer because it prevents the model from overfitting to uncommon words.
-maxlen is the length to which you pad (or truncate) your input sequences. This is not necessary from a theoretical point of view, but static computation graphs like tensorflow (and keras) require a fixed input length for recurrent networks.
All these values need to be tuned or set by inspecting the dataset. Try a few values and see how they affect the performance.
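One way to inspect the dataset for this purpose is sketched below. The two helper functions are hypothetical and work on already tokenized comments (lists of words). The first shows which maxlen would cover most comments, the second shows which share of all word occurrences a given max_features would keep.

```python
from collections import Counter

import numpy as np

def length_percentiles(token_lists, qs=(50, 90, 95, 99)):
    """Percentiles of the comment lengths, as a guide for choosing maxlen."""
    lengths = [len(tokens) for tokens in token_lists]
    return {q: float(np.percentile(lengths, q)) for q in qs}

def vocab_coverage(token_lists, max_features):
    """Share of all word occurrences covered by the max_features most frequent words."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    total = sum(counts.values())
    kept = sum(c for _, c in counts.most_common(max_features))
    return kept / total

# Toy tokenized comments (invented for illustration).
toy = [["a", "b", "a"], ["a", "c"], ["a", "b", "d", "e", "a", "b"]]

print(length_percentiles(toy))
print(vocab_coverage(toy, max_features=2))
```

If, say, the 95th percentile of the lengths is far below your maxlen, you are mostly feeding padding to the model; if the coverage is already close to 1, a larger vocabulary is unlikely to help much.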
03/03/2018 at 2:33 pm
Great post ! something i was looking for . But i get the below error when i tried to replicate your code using my train and test files. The error comes for the last input line where we train the model for 3 epochs
AttributeError: ‘list’ object has no attribute ‘ndim’
Do you have any idea why ? My input file is in same structure as you have posted as in input line 3.
03/13/2018 at 8:44 am
You have to pass a numpy array to the keras fit method. So wrap the list in a numpy array. Then it should work. You may also need to reshape it.
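A minimal sketch of the conversion, assuming your inputs and labels are plain Python lists of equal-length rows (the variable names and values here are made up):

```python
import numpy as np

# Hypothetical padded inputs and labels held in plain Python lists.
x_list = [[0, 0, 5, 3], [3, 4, 5, 6]]
y_list = [[1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]

x_arr = np.asarray(x_list)   # shape (n_samples, maxlen)
y_arr = np.asarray(y_list)   # shape (n_samples, n_labels)

print(x_arr.ndim, x_arr.shape, y_arr.shape)
```

Unlike a list, the resulting arrays have the ndim attribute that the error message complains about. If the rows have different lengths, pad them first, otherwise you do not get a regular 2D array.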
09/21/2018 at 3:09 am
Thanks! It works well, but I am not able to predict an unseen comment.
how ?
09/21/2018 at 7:10 am
You have to process the new comment the same way you do for training and just pass it to the model. You might want to limit the number of tokens used in training and replace all remaining tokens by an unknown token. After that you just apply the model to your new comment.
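Put together, this could look like the following sketch. The function predict_comment is hypothetical. It assumes that tok is the fitted Tokenizer, model the trained model and pad_sequences the padding function from the post, and that maxlen matches the value used in training.

```python
import numpy as np

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def predict_comment(text, tok, model, pad_sequences, maxlen=100):
    """Return a {label: probability} dict for a single raw comment."""
    seq = tok.texts_to_sequences([text])       # same tokenizer as in training
    x = pad_sequences(seq, maxlen=maxlen)      # same padding as in training
    probs = np.asarray(model.predict(x))[0]    # one probability per label
    return dict(zip(LABELS, probs.tolist()))

# Usage with the objects from the post (not executed here):
# predict_comment("some new comment", tok, model, sequence.pad_sequences, maxlen=maxlen)
```

Words that the tokenizer has not seen during fitting are dropped by texts_to_sequences unless the Tokenizer was created with an oov_token, so a comment made up entirely of unknown words ends up as pure padding.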