This is the fourth post in my series about named entity recognition. If you haven’t seen the last three, have a look at them now. Last time we used a recurrent neural network to model the sequence structure of our sentences. Now we use a hybrid approach that combines a bidirectional LSTM with a CRF. The so-called LSTM-CRF is a state-of-the-art approach to named entity recognition.

Let’s recall the situation from my post about conditional random fields. We are given an input sequence x = (x_1,\dots, x_m), i.e. the words of a sentence, and a sequence of output states s = (s_1,\dots, s_m), i.e. the named entity tags. In conditional random fields we modeled the conditional probability

    \[p(s_1,\dots,s_m|x_1,\dots,x_m)\]

of the output state sequence given an input sequence. We did this by defining a feature map

    \[\Phi(x_1,\dots,x_m,s_1,\dots,s_m)\in\mathbb{R}^d\]

that maps an entire input sequence x paired with an entire state sequence s to a d-dimensional feature vector. Then we can model the probability as a log-linear model with a parameter vector w\in\mathbb{R}^d

    \[p(s|x; w) = \frac{\exp(w\cdot\Phi(x, s))}{\sum_{s^\prime} \exp(w\cdot\Phi(x, s^\prime))},\]

where s^\prime ranges over all possible output sequences. We can view the expression w\cdot\Phi(x, s) = \text{score}_{crf}(x,s) as a score of how well the state sequence fits the given input sequence. The idea is now to replace this linear scoring function by a non-linear neural network. So we define the score

    \[\text{score}_{lstm-crf}(x,s) = \sum_{i=1}^{m} W_{s_{i-1}, s_i} \cdot \text{LSTM}(x)_i + b_{s_{i-1},s_i},\]

where W_{s_{i-1}, s_i} and b_{s_{i-1}, s_i} are the weight vector and the bias corresponding to the transition from s_{i-1} to s_i (with s_0 a fixed start state), and \text{LSTM}(x)_i is the output of the bidirectional LSTM at position i.
Note that such score functions are also called potential functions. After constructing this score function, we can optimize the conditional probability p(s|x; W, b) just like in the usual CRF and backpropagate through the network. We are going to use the implementation provided by the keras-contrib package, which contains useful extensions to the official keras package.
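
To make this score function a bit more concrete, here is a tiny numpy sketch (purely illustrative, with made-up sizes and random values; it is not part of the model we build below) that computes \text{score}_{lstm-crf} for every possible tag sequence of a toy example and normalizes the scores into the CRF probabilities:

import numpy as np
from itertools import product

# Toy sizes (all made up): 3 tokens, 2 possible tags, LSTM output dimension 4.
n_tokens, n_states, lstm_dim = 3, 2, 4

lstm_out = np.random.randn(n_tokens, lstm_dim)     # stands in for LSTM(x)_i
W = np.random.randn(n_states, n_states, lstm_dim)  # W_{s_{i-1}, s_i}
b = np.random.randn(n_states, n_states)            # b_{s_{i-1}, s_i}

def lstm_crf_score(states, start_state=0):
    """Sum the transition-dependent scores along one candidate tag sequence."""
    score, prev = 0.0, start_state
    for i, s in enumerate(states):
        score += W[prev, s] @ lstm_out[i] + b[prev, s]
        prev = s
    return score

# The CRF turns scores into probabilities by normalizing over all
# possible tag sequences (2**3 = 8 in this toy example).
scores = np.array([lstm_crf_score(seq)
                   for seq in product(range(n_states), repeat=n_tokens)])
probs = np.exp(scores) / np.exp(scores).sum()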

Now let’s dive into the applied part. We start, as always, by loading the data.

In [1]:
import pandas as pd
import numpy as np

data = pd.read_csv("ner_dataset.csv", encoding="latin1")
In [2]:
data = data.fillna(method="ffill")
In [3]:
data.tail(10)
Out[3]:
               Sentence #       Word  POS    Tag
1048565   Sentence: 47958     impact   NN      O
1048566   Sentence: 47958          .    .      O
1048567   Sentence: 47959     Indian   JJ  B-gpe
1048568   Sentence: 47959     forces  NNS      O
1048569   Sentence: 47959       said  VBD      O
1048570   Sentence: 47959       they  PRP      O
1048571   Sentence: 47959  responded  VBD      O
1048572   Sentence: 47959         to   TO      O
1048573   Sentence: 47959        the   DT      O
1048574   Sentence: 47959     attack   NN      O
In [4]:
words = list(set(data["Word"].values))
words.append("ENDPAD")
n_words = len(words); n_words
Out[4]:
35179
In [5]:
tags = list(set(data["Tag"].values))
n_tags = len(tags); n_tags
Out[5]:
17

So we have 47959 sentences containing 35178 different words with 17 different tags. We use the SentenceGetter class from the last post to retrieve sentences with their labels.

In [6]:
class SentenceGetter(object):
    """Collects the (word, POS, tag) triples of every sentence in the dataset."""

    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        # Group the flat token table by sentence and collect (word, POS, tag) triples.
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        """Return the next sentence, or None once all sentences have been consumed."""
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None
In [7]:
getter = SentenceGetter(data)
In [8]:
sent = getter.get_next()

This is how a sentence looks now.

In [9]:
print(sent)
[('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ('demonstrators', 'NNS', 'O'), ('have', 'VBP', 'O'), ('marched', 'VBN', 'O'), ('through', 'IN', 'O'), ('London', 'NNP', 'B-geo'), ('to', 'TO', 'O'), ('protest', 'VB', 'O'), ('the', 'DT', 'O'), ('war', 'NN', 'O'), ('in', 'IN', 'O'), ('Iraq', 'NNP', 'B-geo'), ('and', 'CC', 'O'), ('demand', 'VB', 'O'), ('the', 'DT', 'O'), ('withdrawal', 'NN', 'O'), ('of', 'IN', 'O'), ('British', 'JJ', 'B-gpe'), ('troops', 'NNS', 'O'), ('from', 'IN', 'O'), ('that', 'DT', 'O'), ('country', 'NN', 'O'), ('.', '.', 'O')]

Okay, that looks as expected. Now we get all sentences.

In [10]:
sentences = getter.sentences

Prepare the data

Now we introduce dictionaries that map words and tags to indices.

In [11]:
max_len = 75
word2idx = {w: i + 1 for i, w in enumerate(words)}
tag2idx = {t: i for i, t in enumerate(tags)}
In [12]:
word2idx["Obama"]
Out[12]:
21450
In [13]:
tag2idx["B-geo"]
Out[13]:
12

Now we map the sentences to sequences of numbers and then pad the sequences. Note that we increased the index of the words by one to reserve zero as the padding value. This is done because we want to use the mask_zero parameter of the embedding layer to ignore inputs with value zero.

In [14]:
from keras.preprocessing.sequence import pad_sequences
X = [[word2idx[w[0]] for w in s] for s in sentences]
Using TensorFlow backend.
In [15]:
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=0)

And we need to do the same for our tag sequence.

In [16]:
y = [[tag2idx[w[2]] for w in s] for s in sentences]
In [17]:
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag2idx["O"])

For training the network we also need to change the labels y to categorical (one-hot) format.

In [18]:
from keras.utils import to_categorical
In [19]:
y = [to_categorical(i, num_classes=n_tags) for i in y]
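
If you want to sanity-check the data at this point, a quick optional look at the shapes (assuming the variables defined above) should give something like this:

print(X.shape)            # (47959, 75): one padded index sequence per sentence
print(np.array(y).shape)  # (47959, 75, 17): one one-hot tag matrix per sentence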

We split the data into a training and a test set.

In [20]:
from sklearn.model_selection import train_test_split
In [21]:
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1)

Set up the LSTM-CRF

Now we can fit an LSTM-CRF network with an embedding layer.

In [22]:
from keras.models import Model, Input
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional
from keras_contrib.layers import CRF
In [23]:
input = Input(shape=(max_len,))
model = Embedding(input_dim=n_words + 1, output_dim=20,
                  input_length=max_len, mask_zero=True)(input)  # 20-dim embedding
model = Bidirectional(LSTM(units=50, return_sequences=True,
                           recurrent_dropout=0.1))(model)  # variational biLSTM
model = TimeDistributed(Dense(50, activation="relu"))(model)  # a dense layer as suggested by neuralNer
crf = CRF(n_tags)  # CRF layer
out = crf(model)  # output
In [24]:
model = Model(input, out)
In [25]:
model.compile(optimizer="rmsprop", loss=crf.loss_function, metrics=[crf.accuracy])
In [26]:
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 75)                0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 75, 20)            703600    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 75, 100)           28400     
_________________________________________________________________
time_distributed_1 (TimeDist (None, 75, 50)            5050      
_________________________________________________________________
crf_1 (CRF)                  (None, 75, 17)            1190      
=================================================================
Total params: 738,240
Trainable params: 738,240
Non-trainable params: 0
_________________________________________________________________
In [27]:
history = model.fit(X_tr, np.array(y_tr), batch_size=32, epochs=5,
                    validation_split=0.1, verbose=1)
Train on 38846 samples, validate on 4317 samples
Epoch 1/5
38846/38846 [==============================] - 276s 7ms/step - loss: 8.9402 - acc: 0.9071 - val_loss: 8.6667 - val_acc: 0.9508
Epoch 2/5
38846/38846 [==============================] - 269s 7ms/step - loss: 8.7123 - acc: 0.9602 - val_loss: 8.6239 - val_acc: 0.9618
Epoch 3/5
38846/38846 [==============================] - 258s 7ms/step - loss: 8.6843 - acc: 0.9676 - val_loss: 8.6114 - val_acc: 0.9652
Epoch 4/5
38846/38846 [==============================] - 256s 7ms/step - loss: 8.6737 - acc: 0.9706 - val_loss: 8.6062 - val_acc: 0.9654
Epoch 5/5
38846/38846 [==============================] - 251s 6ms/step - loss: 8.6674 - acc: 0.9724 - val_loss: 8.6042 - val_acc: 0.9668
In [28]:
hist = pd.DataFrame(history.history)
In [29]:
import matplotlib.pyplot as plt
plt.style.use("ggplot")
plt.figure(figsize=(12,12))
plt.plot(hist["acc"])
plt.plot(hist["val_acc"])
plt.show()
Learning curve of the LSTM-CRF

Evaluation

Now we can evaluate our model systematically. You can find the details in this post; here we just apply it.

In [30]:
from seqeval.metrics import precision_score, recall_score, f1_score, classification_report
In [32]:
test_pred = model.predict(X_te, verbose=1)
4796/4796 [==============================] - 11s 2ms/step
In [34]:
idx2tag = {i: w for w, i in tag2idx.items()}

def pred2label(pred):
    """Convert a batch of per-token probability vectors back to tag strings."""
    out = []
    for pred_i in pred:
        out_i = []
        for p in pred_i:
            p_i = np.argmax(p)
            # Replace a possible padding tag by "O" (a no-op with the 17 tags used here).
            out_i.append(idx2tag[p_i].replace("PAD", "O"))
        out.append(out_i)
    return out
    
pred_labels = pred2label(test_pred)
test_labels = pred2label(y_te)
In [38]:
print("F1-score: {:.1%}".format(f1_score(test_labels, pred_labels)))
F1-score: 67.1%
In [37]:
print(classification_report(test_labels, pred_labels))
             precision    recall  f1-score   support

        geo       0.84      0.88      0.86      3603
        per       0.77      0.76      0.77      1734
        gpe       0.97      0.91      0.94      1520
        tim       0.90      0.84      0.87      2013
        org       0.70      0.66      0.68      1988
        eve       0.50      0.03      0.05        37
        nat       0.00      0.00      0.00        14
        art       0.00      0.00      0.00        49

avg / total       0.83      0.81      0.82     10958

Finally, we look at some predictions.

In [52]:
i = 1927
p = model.predict(np.array([X_te[i]]))
p = np.argmax(p, axis=-1)
true = np.argmax(y_te[i], -1)
print("{:15}||{:5}||{}".format("Word", "True", "Pred"))
print(30 * "=")
for w, t, pred in zip(X_te[i], true, p[0]):
    if w != 0:
        print("{:15}: {:5} {}".format(words[w-1], tags[t], tags[pred]))
Word           ||True ||Pred
==============================
Egyptian       : B-gpe B-gpe
police         : O     O
have           : O     O
arrested       : O     O
at             : O     O
least          : O     O
16             : O     O
members        : O     O
of             : O     O
the            : O     O
opposition     : O     O
Muslim         : B-org B-org
Brotherhood    : I-org I-org
as             : O     O
parts          : O     O
of             : O     O
the            : O     O
country        : O     O
prepare        : O     O
for            : O     O
parliamentary  : O     O
runoff         : O     O
elections      : O     O
Saturday       : B-tim B-tim
.              : O     O

This looks pretty good, and it didn’t require any feature engineering. The power of the CRF is not really visible here, but if we had a dataset with more complicated named entities, this approach would be quite strong.

Inference with the LSTM-CRF

UPDATE: Since many people asked how to use this model with new data, which is obviously important, I’ll show you now how to do it. To stay within the scope of this tutorial, I will assume the text is already tokenized. If you don’t know how to do this, I recommend looking at my post about word vectors.
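
If you just need a quick way to turn raw text into tokens, one simple option (only a convenience sketch here, not the method from that post) is nltk’s word_tokenize:

# Requires nltk to be installed and the "punkt" tokenizer data downloaded once:
# import nltk; nltk.download("punkt")
from nltk.tokenize import word_tokenize

raw_text = "Hawking was a Fellow of the Royal Society."
print(word_tokenize(raw_text))
# ['Hawking', 'was', 'a', 'Fellow', 'of', 'the', 'Royal', 'Society', '.']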

In [33]:
test_sentence = ["Hawking", "was", "a", "Fellow", "of", "the", "Royal", "Society", ",", "a", "lifetime", "member",
                 "of", "the", "Pontifical", "Academy", "of", "Sciences", ",", "and", "a", "recipient", "of",
                 "the", "Presidential", "Medal", "of", "Freedom", ",", "the", "highest", "civilian", "award",
                 "in", "the", "United", "States", "."]

Now we transform every word to its integer index. Note that we map unknown words to zero. Normally you would want to add an UNKNOWN token to your vocabulary, cut the vocabulary you train the model on to the more frequent words, and replace all uncommon words by the UNKNOWN token. We haven’t done this here for simplicity; a sketch of this approach follows after the next cell.

In [37]:
x_test_sent = pad_sequences(sequences=[[word2idx.get(w, 0) for w in test_sentence]],
                            padding="post", value=0, maxlen=max_len)

And now we can predict with the model and see what we got.

In [44]:
p = model.predict(np.array([x_test_sent[0]]))
p = np.argmax(p, axis=-1)
print("{:15}||{}".format("Word", "Prediction"))
print(30 * "=")
for w, pred in zip(test_sentence, p[0]):
    print("{:15}: {:5}".format(w, tags[pred]))
Word           ||Prediction
==============================
Hawking        : B-per
was            : O    
a              : O    
Fellow         : O    
of             : O    
the            : O    
Royal          : B-org
Society        : I-org
,              : O    
a              : O    
lifetime       : O    
member         : O    
of             : O    
the            : O    
Pontifical     : B-org
Academy        : I-org
of             : I-org
Sciences       : I-org
,              : O    
and            : O    
a              : O    
recipient      : O    
of             : O    
the            : O    
Presidential   : O    
Medal          : I-gpe
of             : O    
Freedom        : B-geo
,              : O    
the            : O    
highest        : O    
civilian       : O    
award          : O    
in             : O    
the            : O    
United         : B-geo
States         : I-geo
.              : O    

I hope you enjoyed this post and learned something that you can apply in your daily work. Next time I’ll show you how to improve the model even further by using character-level embeddings.
