This is the sixth post in my series about named entity recognition. If you haven’t seen the last five, have a look now. Last time we used character embeddings and an LSTM to model the sequence structure of our sentences and predict the named entities. This time I’m going to show you some cutting-edge stuff. We will use a residual LSTM network together with ELMo embeddings [1], developed at AllenNLP. You will learn how to wrap a pre-trained tensorflow hub model to work with keras. The resulting model will give you state-of-the-art performance on the named entity recognition task.

What are ELMo embeddings?

ELMo embeddings are embeddings from a language model trained on the 1 Billion Word Benchmark, and the pretrained version is available on tensorflow hub. Unlike most widely used word embeddings, ELMo word representations are functions of the entire input sentence. They are computed on top of a two-layer bidirectional language model (biLM) with character convolutions, as a linear function of the internal network states. Concretely, ELMo uses a pre-trained, multi-layer, bidirectional, LSTM-based language model and extracts the hidden state of each layer for the input sequence of words. It then computes a weighted sum of those hidden states to obtain an embedding for each word. The weight of each hidden state is task-dependent and is learned. ELMo improves the performance of models across a wide range of tasks, spanning from question answering and sentiment analysis to named entity recognition. This setup allows us to do semi-supervised learning, where the biLM is pre-trained at a large scale and easily incorporated into a wide range of existing neural NLP architectures.
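
The weighted sum described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `scalar_mix` and the toy layer outputs are made up for this example, and the softmax over raw weights plus the scale factor `gamma` follow the formulation in the paper [1], not the code of the tensorflow hub module.

```python
import math

def scalar_mix(layer_states, raw_weights, gamma=1.0):
    """Combine the per-layer vectors of one token into a single embedding.

    layer_states: list of L vectors (one per biLM layer), all the same length.
    raw_weights:  list of L unnormalised, task-specific scalars (learned in practice).
    gamma:        task-specific scale factor (learned in practice).
    """
    # Softmax-normalise the layer weights so that they sum to 1.
    m = max(raw_weights)
    exps = [math.exp(w - m) for w in raw_weights]
    total = sum(exps)
    s = [e / total for e in exps]
    dim = len(layer_states[0])
    # Weighted sum over layers, computed per dimension.
    return [gamma * sum(s[j] * layer_states[j][d] for j in range(len(s)))
            for d in range(dim)]

# Toy example: three layers with 4-dimensional states
# (the real ELMo vectors used below have 1024 dimensions).
layers = [[1.0, 0.0, 0.0, 2.0],
          [0.0, 1.0, 0.0, 2.0],
          [0.0, 0.0, 1.0, 2.0]]
embedding = scalar_mix(layers, raw_weights=[0.0, 0.0, 0.0])
print(embedding)
```

With equal raw weights every layer contributes one third. During training the weights shift towards the layers that are most useful for the task.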

I suggest having a look at the great paper “Deep contextualized word representations” [1].

Data preparation

Let’s start by loading and preparing the data. If you are familiar with the last post of this series, you can skip this part and jump directly to the model setup.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use("ggplot")

data = pd.read_csv("ner_dataset.csv", encoding="latin1")
data = data.fillna(method="ffill")
data.tail(10)
         Sentence #       Word       POS  Tag
1048565  Sentence: 47958  impact     NN   O
1048566  Sentence: 47958  .          .    O
1048567  Sentence: 47959  Indian     JJ   B-gpe
1048568  Sentence: 47959  forces     NNS  O
1048569  Sentence: 47959  said       VBD  O
1048570  Sentence: 47959  they       PRP  O
1048571  Sentence: 47959  responded  VBD  O
1048572  Sentence: 47959  to         TO   O
1048573  Sentence: 47959  the        DT   O
1048574  Sentence: 47959  attack     NN   O
In [2]:
words = list(set(data["Word"].values))
n_words = len(words); n_words
In [3]:
tags = list(set(data["Tag"].values))
n_tags = len(tags); n_tags

So we have 47959 sentences containing 35178 different words with 17 different tags. We use the SentenceGetter class from the last post to retrieve sentences with their labels.

In [4]:
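class SentenceGetter(object):
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        # Collect (word, POS, tag) triples for every sentence
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            # No sentence with this number
            return None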
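
To make the forward fill and the grouping more tangible, here is a minimal sketch of the same idea in plain Python, without pandas. The helper `group_sentences` and the toy rows are assumptions made for illustration. In the toy rows the sentence id appears only on the first row of each sentence and is `None` otherwise.

```python
from collections import OrderedDict

def group_sentences(rows):
    """rows: iterable of (sentence_id_or_None, word, pos, tag) tuples."""
    sentences = OrderedDict()
    current = None
    for sent_id, word, pos, tag in rows:
        if sent_id is not None:
            # Forward fill: remember the last sentence id we have seen.
            current = sent_id
        sentences.setdefault(current, []).append((word, pos, tag))
    return list(sentences.values())

# Toy rows built from words of the table above.
rows = [
    ("Sentence: 47958", "impact", "NN", "O"),
    (None, ".", ".", "O"),
    ("Sentence: 47959", "Indian", "JJ", "B-gpe"),
    (None, "forces", "NNS", "O"),
]
grouped = group_sentences(rows)
print(grouped)
```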
In [5]:
getter = SentenceGetter(data)
In [6]:
sent = getter.get_next()

This is how a sentence looks now.

In [7]:
print(sent)
[('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ('demonstrators', 'NNS', 'O'), ('have', 'VBP', 'O'), ('marched', 'VBN', 'O'), ('through', 'IN', 'O'), ('London', 'NNP', 'B-geo'), ('to', 'TO', 'O'), ('protest', 'VB', 'O'), ('the', 'DT', 'O'), ('war', 'NN', 'O'), ('in', 'IN', 'O'), ('Iraq', 'NNP', 'B-geo'), ('and', 'CC', 'O'), ('demand', 'VB', 'O'), ('the', 'DT', 'O'), ('withdrawal', 'NN', 'O'), ('of', 'IN', 'O'), ('British', 'JJ', 'B-gpe'), ('troops', 'NNS', 'O'), ('from', 'IN', 'O'), ('that', 'DT', 'O'), ('country', 'NN', 'O'), ('.', '.', 'O')]

Okay, that looks as expected. Now we get all sentences.

In [8]:
sentences = getter.sentences

To train neural nets we need input sequences of equal length (this is a keras requirement, not a theoretical one). So we are going to pad our sentences to a length of 50. But first we need a dictionary of tags to map our labels to numbers.

In [9]:
max_len = 50
tag2idx = {t: i for i, t in enumerate(tags)}
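
Because `tags` comes from a `set`, its order, and therefore the integer assigned to each tag, can change between Python sessions. If you want to save a model and decode its predictions later, a sorted tag list keeps the mapping stable. A minimal sketch with a toy subset of the tags follows. The names `toy_tags`, `tags_sorted` and `idx2tag` are introduced here for illustration only.

```python
# Toy subset of the 17 tags in the dataset.
toy_tags = {"O", "B-geo", "I-geo", "B-per", "I-per"}

# Sorting makes the tag -> index mapping independent of set ordering.
tags_sorted = sorted(toy_tags)
tag2idx = {t: i for i, t in enumerate(tags_sorted)}
# Inverse mapping for turning predicted indices back into tag strings.
idx2tag = {i: t for t, i in tag2idx.items()}
print(tag2idx)
```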

To apply the ELMo embedding from tensorflow hub, we have to use strings as input. So we take the tokenized sentences and pad them to the desired length.

In [11]:
X = [[w[0] for w in s] for s in sentences]
In [12]:
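# Keep the first max_len tokens and fill shorter sentences with "__PAD__"
new_X = []
for seq in X:
    new_seq = []
    for i in range(max_len):
        new_seq.append(seq[i] if i < len(seq) else "__PAD__")
    new_X.append(new_seq)
X = new_X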

This is what an input sample looks like now.

In [13]:

And we need to do the same for our tag sequence, but map the string to an integer.

In [14]:
y = [[tag2idx[w[2]] for w in s] for s in sentences]
In [15]:
from keras.preprocessing.sequence import pad_sequences
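# truncating="post" keeps the first max_len tags, matching the token padding above
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", truncating="post", value=tag2idx["O"])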
Using TensorFlow backend.
In [16]:
array([ 3,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6, 13,  6,
        6,  6,  7,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,
        6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6],
      dtype=int32)
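
One detail is worth checking here: by default, `pad_sequences` truncates long sequences from the front (`truncating="pre"`), while the token padding keeps the first `max_len` tokens. For sentences longer than `max_len`, tokens and tags only stay aligned if both are cut at the same end. The following sketch shows the idea in plain Python. The helper `pad_or_truncate` and the toy tag indices are assumptions made for this example.

```python
def pad_or_truncate(seq, max_len, pad_value):
    """Keep the first max_len items and pad at the end up to max_len."""
    clipped = list(seq[:max_len])
    return clipped + [pad_value] * (max_len - len(clipped))

# Toy sentence that is longer than the toy max_len of 3.
tokens = ["Indian", "forces", "said", "they"]
labels = [7, 6, 6, 6]  # hypothetical indices: 7 = "B-gpe", 6 = "O"

padded_tokens = pad_or_truncate(tokens, 3, "__PAD__")
padded_labels = pad_or_truncate(labels, 3, 6)
# Truncating from the front instead would drop the label of "Indian".
front_truncated = labels[-3:]

print(list(zip(padded_tokens, padded_labels)))
print(front_truncated)
```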

We split the data into a training and a test set.

In [17]:
from sklearn.model_selection import train_test_split
In [18]:
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=2018)

The ELMo residual LSTM model

In [19]:
batch_size = 32

Now we can initialize the ELMo embedding from tensorflow hub.

In [20]:
import tensorflow as tf
import tensorflow_hub as hub
from keras import backend as K

Initialize the tensorflow session.

In [21]:
sess = tf.Session()
K.set_session(sess)  # make keras use this session

If you run the following code for the first time, it will download the pretrained model. This might take a while.

In [22]:
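elmo_model = hub.Module("", trainable=True)
sess.run(tf.global_variables_initializer())  # initialize the module's variables
sess.run(tf.tables_initializer())            # initialize its lookup tables
INFO:tensorflow:Using /tmp/tfhub_modules to cache modules.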

Now we create a function that takes a sequence of strings and returns a sequence of 1024-dimensional vectors of the ELMo embedding. We will later use this function with the Lambda layer of keras to get the embedding sequence.

In [23]:
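def ElmoEmbedding(x):
    # The "tokens" signature expects padded token strings and their lengths
    return elmo_model(inputs={
                            "tokens": tf.squeeze(tf.cast(x, tf.string)),
                            "sequence_len": tf.constant(batch_size*[max_len])
                      },
                      signature="tokens",
                      as_dict=True)["elmo"]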

Next, we can create a residual LSTM network with an ELMo embedding layer.

In [24]:
from keras.models import Model, Input
from keras.layers.merge import add
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional, Lambda
In [25]:
input_text = Input(shape=(max_len,), dtype=tf.string)
embedding = Lambda(ElmoEmbedding, output_shape=(None, max_len, 1024))(input_text)
x = Bidirectional(LSTM(units=512, return_sequences=True,
                       recurrent_dropout=0.2, dropout=0.2))(embedding)
x_rnn = Bidirectional(LSTM(units=512, return_sequences=True,
                           recurrent_dropout=0.2, dropout=0.2))(x)
x = add([x, x_rnn])  # residual connection to the first biLSTM
out = TimeDistributed(Dense(n_tags, activation="softmax"))(x)
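
The residual connection only works because both bidirectional LSTM layers emit tensors of the same shape: each one concatenates a forward and a backward LSTM with 512 units, so every token gets 2 * 512 = 1024 features. Here is a toy sketch of the element-wise sum in plain Python. The helper `residual_add` and the small vectors are made up for illustration and are not how keras implements `add`.

```python
def residual_add(x, fx):
    """Element-wise sum of two sequences of feature vectors with identical shapes."""
    if len(x) != len(fx) or any(len(a) != len(b) for a, b in zip(x, fx)):
        raise ValueError("residual branches must have identical shapes")
    return [[a + b for a, b in zip(row_x, row_fx)]
            for row_x, row_fx in zip(x, fx)]

# Toy sequence: 2 time steps with 3 features each.
first_bilstm = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
second_bilstm = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]
summed = residual_add(first_bilstm, second_bilstm)
print(summed)
```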
In [26]:
model = Model(input_text, out)
In [27]:
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

We need to make the number of samples divisible by batch_size to make this work. ElmoEmbedding builds its sequence_len tensor for exactly batch_size sentences, so a smaller last batch would break the model. I haven’t found a fix for this yet. Please tell me if you have an idea.

In [28]:
X_tr, X_val = X_tr[:1213*batch_size], X_tr[-135*batch_size:]
y_tr, y_val = y_tr[:1213*batch_size], y_tr[-135*batch_size:]
y_tr = y_tr.reshape(y_tr.shape[0], y_tr.shape[1], 1)
y_val = y_val.reshape(y_val.shape[0], y_val.shape[1], 1)
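
The hard-coded batch counts 1213 and 135 can also be derived from the data size. Here is a minimal sketch, assuming the training split holds 43163 sentences (47959 sentences minus the 10% test split). The helper `full_batch_sizes` and its `val_fraction` parameter are hypothetical names introduced for this example.

```python
def full_batch_sizes(n_samples, batch_size, val_fraction=0.1):
    """Return (n_train, n_val), both multiples of batch_size.

    Samples that do not fill a complete batch are dropped.
    """
    n_batches = n_samples // batch_size
    n_val_batches = max(1, round(n_batches * val_fraction))
    n_train_batches = n_batches - n_val_batches
    return n_train_batches * batch_size, n_val_batches * batch_size

# Assumed size of the training split, see the note above.
n_train, n_val = full_batch_sizes(n_samples=43163, batch_size=32)
print(n_train, n_val)
```

The slices would then read `X_tr[:n_train]` and `X_tr[-n_val:]`, and the same for `y_tr`.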

And now we can finally fit the model. Since computing ELMo embeddings is computationally expensive, you should fit the model on a GPU.

In [29]:
history = model.fit(np.array(X_tr), y_tr, validation_data=(np.array(X_val), y_val),
                    batch_size=batch_size, epochs=5, verbose=1)
Train on 38816 samples, validate on 4320 samples
Epoch 1/5
38816/38816 [==============================] - 433s 11ms/step - loss: 0.0625 - acc: 0.9818 - val_loss: 0.0459 - val_acc: 0.9858
Epoch 2/5
38816/38816 [==============================] - 430s 11ms/step - loss: 0.0404 - acc: 0.9869 - val_loss: 0.0421 - val_acc: 0.9865
Epoch 3/5
38816/38816 [==============================] - 429s 11ms/step - loss: 0.0334 - acc: 0.9886 - val_loss: 0.0426 - val_acc: 0.9868
Epoch 4/5
38816/38816 [==============================] - 429s 11ms/step - loss: 0.0275 - acc: 0.9904 - val_loss: 0.0431 - val_acc: 0.9868
Epoch 5/5
38816/38816 [==============================] - 430s 11ms/step - loss: 0.0227 - acc: 0.9920 - val_loss: 0.0461 - val_acc: 0.9867
In [30]:
hist = pd.DataFrame(history.history)
In [31]:
plt.plot(hist["acc"], label="acc")
plt.plot(hist["val_acc"], label="val_acc")
plt.title("Learning curves")
plt.legend()

Now look at some predictions.

In [35]:
i = 19
p = model.predict(np.array(X_te[i:i+batch_size]))[0]
p = np.argmax(p, axis=-1)
print("{:15} {:5}: ({})".format("Word", "Pred", "True"))
for w, true, pred in zip(X_te[i], y_te[i], p):
    if w != "__PAD__":
        print("{:15}:{:5} ({})".format(w, tags[pred], tags[true]))
Word            Pred : (True)
Meanwhile      :O     (O)
,              :O     (O)
in             :O     (O)
Belgrade       :B-geo (B-geo)
,              :O     (O)
Serbia         :B-geo (B-geo)
's             :O     (O)
extreme        :O     (O)
nationalist    :O     (O)
Radical        :B-geo (B-org)
Party          :I-geo (I-org)
has            :O     (O)
filed          :O     (O)
a              :O     (O)
motion         :O     (O)
of             :O     (O)
no-confidence  :O     (O)
in             :O     (O)
the            :O     (O)
government     :O     (O)
of             :O     (O)
Prime          :B-per (B-per)
Minister       :I-per (O)
Vojislav       :B-per (B-per)
Kostunica      :I-per (I-per)
to             :O     (O)
protest        :O     (O)
the            :O     (O)
extradition    :O     (O)
of             :O     (O)
11             :O     (O)
suspects       :O     (O)
to             :O     (O)
the            :O     (O)
court          :O     (O)
since          :B-tim (B-tim)
October        :I-tim (I-tim)
.              :O     (O)
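
Since most tokens are tagged `O`, it is easier to judge a prediction by listing only the tokens where the predicted and the true tag differ. Here is a small sketch. The helper `tag_mismatches` is a name introduced for illustration, and the toy input is an excerpt of the output above.

```python
def tag_mismatches(words, pred_tags, true_tags, pad_token="__PAD__"):
    """Return (word, predicted, true) for every non-padding token that is mislabeled."""
    return [(w, p, t) for w, p, t in zip(words, pred_tags, true_tags)
            if w != pad_token and p != t]

# Excerpt from the prediction shown above.
words = ["Serbia", "'s", "extreme", "nationalist", "Radical", "Party", "has"]
pred = ["B-geo", "O", "O", "O", "B-geo", "I-geo", "O"]
true = ["B-geo", "O", "O", "O", "B-org", "I-org", "O"]

errors = tag_mismatches(words, pred, true)
for w, p, t in errors:
    print("{:15}:{:5} ({})".format(w, p, t))
```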

This looks pretty good! The model only mislabels the organization “Radical Party” as a location and extends the person tag to “Minister”. And it didn’t require any feature engineering! With this architecture you should be able to achieve state-of-the-art results in multiple language-related sequence tagging problems. While ELMo uses a feature-based approach, have a look at the recent BERT model, which uses a fine-tuning-based approach. Stay tuned for more NLP posts and try some of the proposed methods yourself.

Further readings:

  1. Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer. “Deep contextualized word representations”.
