This is the third post in my series about named entity recognition. If you haven’t seen the last two, have a look now. Last time we used a conditional random field to model the sequence structure of our sentences. This time we use an LSTM model to do the tagging. At the end of this guide, you will know how to use neural networks in keras to tag sequences of words. For more details on neural networks and LSTMs in particular, I suggest reading this excellent blog post.

Now we want to apply this model. Let’s start by loading the data.

In [1]:
import pandas as pd
import numpy as np

data = pd.read_csv("ner_dataset.csv", encoding="latin1")
In [2]:
data = data.fillna(method="ffill")
In [3]:
data.tail(10)
Out[3]:
         Sentence #       Word       POS  Tag
1048565  Sentence: 47958  impact     NN   O
1048566  Sentence: 47958  .          .    O
1048567  Sentence: 47959  Indian     JJ   B-gpe
1048568  Sentence: 47959  forces     NNS  O
1048569  Sentence: 47959  said       VBD  O
1048570  Sentence: 47959  they       PRP  O
1048571  Sentence: 47959  responded  VBD  O
1048572  Sentence: 47959  to         TO   O
1048573  Sentence: 47959  the        DT   O
1048574  Sentence: 47959  attack     NN   O
In [4]:
words = list(set(data["Word"].values))
words.append("ENDPAD")
In [5]:
n_words = len(words); n_words
Out[5]:
35179
In [6]:
tags = list(set(data["Tag"].values))
In [7]:
n_tags = len(tags); n_tags
Out[7]:
17
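As a quick sanity check (my addition, not in the original notebook), the number of distinct sentences can also be read directly off the dataframe:

# Count the distinct sentence identifiers; this should match the
# 47959 sentences mentioned below.
data["Sentence #"].nunique()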

So we have 47959 sentences containing 35178 different words with 17 different tags. We use the SentenceGetter class from the last post to retrieve sentences with their labels.

In [8]:
class SentenceGetter(object):
    
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]
    
    def get_next(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except:
            return None
In [9]:
getter = SentenceGetter(data)
In [10]:
sent = getter.get_next()

This is how a sentence looks now.

In [11]:
print(sent)
[('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ('demonstrators', 'NNS', 'O'), ('have', 'VBP', 'O'), ('marched', 'VBN', 'O'), ('through', 'IN', 'O'), ('London', 'NNP', 'B-geo'), ('to', 'TO', 'O'), ('protest', 'VB', 'O'), ('the', 'DT', 'O'), ('war', 'NN', 'O'), ('in', 'IN', 'O'), ('Iraq', 'NNP', 'B-geo'), ('and', 'CC', 'O'), ('demand', 'VB', 'O'), ('the', 'DT', 'O'), ('withdrawal', 'NN', 'O'), ('of', 'IN', 'O'), ('British', 'JJ', 'B-gpe'), ('troops', 'NNS', 'O'), ('from', 'IN', 'O'), ('that', 'DT', 'O'), ('country', 'NN', 'O'), ('.', '.', 'O')]

Okay, that looks as expected; now we get all sentences.

In [12]:
sentences = getter.sentences

Now let's check how long the sentences are.

In [13]:
import matplotlib.pyplot as plt
plt.style.use("ggplot")
In [14]:
plt.hist([len(s) for s in sentences], bins=50)
plt.show()
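Besides the histogram, a quick numeric check (my addition, not in the original notebook) shows how many sentences would actually be affected by the padding length of 50 chosen below:

# How long is the longest sentence, and how many sentences exceed 50 tokens?
lengths = [len(s) for s in sentences]
print(max(lengths), sum(l > 50 for l in lengths))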

To use neural networks (at least in keras; there is no theoretical reason for this) we need input sequences of equal length. So we are going to pad our sentences to a length of 50. But first we need dictionaries mapping words and tags to indices.

In [15]:
max_len = 50
word2idx = {w: i for i, w in enumerate(words)}
tag2idx = {t: i for i, t in enumerate(tags)}
In [16]:
word2idx["Obama"]
Out[16]:
25292
In [17]:
tag2idx["B-geo"]
Out[17]:
11

Now we map the sentences to sequences of numbers and then pad them.

In [18]:
from keras.preprocessing.sequence import pad_sequences
X = [[word2idx[w[0]] for w in s] for s in sentences]
Using TensorFlow backend.
In [19]:
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=n_words - 1)
In [20]:
X[1]
Out[20]:
array([ 8461, 33837, 21771,  6080, 20007, 11069, 10139, 24950, 11069,
       30594,  3989, 17574, 13378, 15727,  3808, 20200,   230, 27681,
       23901, 21530, 12892,  9370, 12368, 16610, 33447, 35178, 35178,
       35178, 35178, 35178, 35178, 35178, 35178, 35178, 35178, 35178,
       35178, 35178, 35178, 35178, 35178, 35178, 35178, 35178, 35178,
       35178, 35178, 35178, 35178, 35178], dtype=int32)

And we need to do the same for our tag sequences.

In [21]:
y = [[tag2idx[w[2]] for w in s] for s in sentences]
In [22]:
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag2idx["O"])
In [23]:
y[1]
Out[23]:
array([ 2,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5, 16,  5,
        5,  5, 14,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,
        5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5], dtype=int32)

For training the network we also need to change the labels y to categorical (one-hot encoded) vectors.

In [24]:
from keras.utils import to_categorical
In [25]:
y = [to_categorical(i, num_classes=n_tags) for i in y]

We split the data into a train and a test set.

In [26]:
from sklearn.model_selection import train_test_split
In [27]:
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1)

Now we can fit an LSTM network with an embedding layer. Note that we use the functional API of keras here, as it is better suited for more complicated architectures.

In [28]:
from keras.models import Model, Input
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional
In [29]:
input = Input(shape=(max_len,))
model = Embedding(input_dim=n_words, output_dim=50, input_length=max_len)(input)
model = Dropout(0.1)(model)
model = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(n_tags, activation="softmax"))(model)  # softmax output layer
In [30]:
model = Model(input, out)
In [31]:
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
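For comparison, here is a sketch (my addition, not from the original post) of how the same stack would look with the Sequential API; the variable name seq_model is just illustrative, and we continue with the functional model below. The functional API pays off once the architecture stops being a simple layer stack, for example when we combine the LSTM with a CRF in the next post.

# Sequential-API sketch of the same architecture (illustrative only).
from keras.models import Sequential
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional

seq_model = Sequential()
seq_model.add(Embedding(input_dim=n_words, output_dim=50, input_length=max_len))
seq_model.add(Dropout(0.1))
seq_model.add(Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1)))
seq_model.add(TimeDistributed(Dense(n_tags, activation="softmax")))
seq_model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])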
In [32]:
history = model.fit(X_tr, np.array(y_tr), batch_size=32, epochs=5, validation_split=0.1, verbose=1)
Train on 38846 samples, validate on 4317 samples
Epoch 1/5
38846/38846 [==============================] - 247s - loss: 0.1419 - acc: 0.9640 - val_loss: 0.0630 - val_acc: 0.9815
Epoch 2/5
38846/38846 [==============================] - 250s - loss: 0.0552 - acc: 0.9840 - val_loss: 0.0513 - val_acc: 0.9847
Epoch 3/5
38846/38846 [==============================] - 245s - loss: 0.0462 - acc: 0.9865 - val_loss: 0.0480 - val_acc: 0.9857
Epoch 4/5
38846/38846 [==============================] - 245s - loss: 0.0417 - acc: 0.9878 - val_loss: 0.0462 - val_acc: 0.9859
Epoch 5/5
38846/38846 [==============================] - 246s - loss: 0.0388 - acc: 0.9886 - val_loss: 0.0446 - val_acc: 0.9864
In [33]:
hist = pd.DataFrame(history.history)
In [34]:
plt.figure(figsize=(12,12))
plt.plot(hist["acc"])
plt.plot(hist["val_acc"])
plt.show()
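The same history object also holds the loss values, so one could plot those as well (a small addition of mine):

# Training and validation loss over the five epochs.
plt.figure(figsize=(12, 12))
plt.plot(hist["loss"], label="train loss")
plt.plot(hist["val_loss"], label="validation loss")
plt.legend()
plt.show()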

Now look at some predictions.

In [50]:
i = 2318
p = model.predict(np.array([X_te[i]]))
p = np.argmax(p, axis=-1)
print("{:15} ({:5}): {}".format("Word", "True", "Pred"))
for w, pred in zip(X_te[i], p[0]):
    print("{:15}: {}".format(words[w], tags[pred]))
Word           : Pred
The            : O
State          : B-org
Department     : I-org
said           : O
Friday         : B-tim
Washington     : B-geo
is             : O
working        : O
with           : O
the            : O
Ethiopian      : B-gpe
government     : O
,              : O
international  : O
partners       : O
and            : O
non-governmental: O
organizations  : O
in             : O
responding     : O
to             : O
concerns       : O
over           : O
humanitarian   : O
conditions     : O
in             : O
the            : O
eastern        : O
region         : O
.              : O
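Beyond eyeballing single sentences, one could also compute aggregate tag-level scores on the test set. The sketch below is my own addition, not part of the original notebook; note that the padded positions were labelled "O" above, so they inflate the numbers somewhat.

# Rough aggregate evaluation (my addition): flatten predictions and gold
# labels over the padded test sequences and report per-tag metrics.
from sklearn.metrics import classification_report

test_pred = np.argmax(model.predict(X_te), axis=-1).ravel()
test_true = np.argmax(np.array(y_te), axis=-1).ravel()
print(classification_report(test_true, test_pred,
                            labels=list(range(n_tags)), target_names=tags))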

This looks pretty good, and it did not require any feature engineering. So feel free to try some of the proposed methods yourself. In the next post I’ll show you a combination of a CRF and an LSTM for sequence tagging.

 
