In this post, I will introduce you to Named Entity Recognition (NER). NER is a subtask of natural language processing (NLP) and information retrieval (IR). The task in NER is to assign an entity type to each word in a text. Entities can, for example, be locations, time expressions or names. I got a dataset from Kaggle. Now we load it and peek at a few examples.

In [13]:
import pandas as pd
import numpy as np

data = pd.read_csv("ner_dataset.csv", encoding="latin1")
In [15]:
# Forward-fill missing values; fillna(method="ffill") is deprecated in recent pandas.
data = data.ffill()
In [191]:
data.tail(10)
Out[191]:
              Sentence #       Word  POS    Tag
1048565  Sentence: 47958     impact   NN      O
1048566  Sentence: 47958          .    .      O
1048567  Sentence: 47959     Indian   JJ  B-gpe
1048568  Sentence: 47959     forces  NNS      O
1048569  Sentence: 47959       said  VBD      O
1048570  Sentence: 47959       they  PRP      O
1048571  Sentence: 47959  responded  VBD      O
1048572  Sentence: 47959         to   TO      O
1048573  Sentence: 47959        the   DT      O
1048574  Sentence: 47959     attack   NN      O
In [17]:
words = list(set(data["Word"].values))
In [19]:
n_words = len(words); n_words
Out[19]:
35178

So we have 47959 sentences containing 35178 different words.

We start by writing a small class to retrieve a sentence from the dataset.

In [83]:
class SentenceGetter(object):
    
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
    
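    def get_next(self):
        # Filtering never raises: an unknown sentence id simply yields an empty
        # frame, so we check for emptiness instead of catching an exception.
        s = self.data[self.data["Sentence #"] == "Sentence: {}".format(self.n_sent)]
        if s.empty:
            self.empty = True
            return None, None, None
        self.n_sent += 1
        return s["Word"].values.tolist(), s["POS"].values.tolist(), s["Tag"].values.tolist()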
In [84]:
getter = SentenceGetter(data)
In [85]:
sent, pos, tag = getter.get_next()

This is how a sentence looks.

In [65]:
print(sent); print(pos); print(tag)
['They', 'marched', 'from', 'the', 'Houses', 'of', 'Parliament', 'to', 'a', 'rally', 'in', 'Hyde', 'Park', '.']
['PRP', 'VBD', 'IN', 'DT', 'NNS', 'IN', 'NN', 'TO', 'DT', 'NN', 'IN', 'NNP', 'NNP', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'I-geo', 'O']
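
The tags appear to follow the common BIO convention: `B-` marks the first token of an entity, `I-` a continuation of it, and `O` a token outside any entity. As a minimal sketch of how such tags can be read back as entities (the helper `bio_to_spans` is hypothetical and not part of the dataset or any library), we can group consecutive tokens into spans:

```python
def bio_to_spans(words, tags):
    """Group tokens into (entity_type, text) spans from BIO-style tags."""
    spans = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that is still open.
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            # Continuation of the currently open entity.
            current_words.append(word)
        else:
            # 'O' (or a stray I- tag) closes any open entity.
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans


example_words = ['They', 'marched', 'from', 'the', 'Houses', 'of', 'Parliament',
                 'to', 'a', 'rally', 'in', 'Hyde', 'Park', '.']
example_tags = ['O', 'O', 'O', 'O', 'O', 'O', 'O',
                'O', 'O', 'O', 'O', 'B-geo', 'I-geo', 'O']

spans = bio_to_spans(example_words, example_tags)
print(spans)  # [('geo', 'Hyde Park')]
```

For the sentence above, the two tokens tagged `B-geo` and `I-geo` form one geographical entity, "Hyde Park".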

A first idea: Memorization

The first simple idea and baseline is to remember the most common named-entity tag for every word and predict that. If we don’t know a word, we just predict ‘O’. The following class does that. It inherits from scikit-learn base classes so that it can be used with the built-in cross-validation.

In [142]:
from sklearn.base import BaseEstimator, TransformerMixin


class MemoryTagger(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y):
        '''
        Expects a list of words as X and a list of tags as y.
        '''
        voc = {}
        self.tags = []
        for x, t in zip(X, y):
            if t not in self.tags:
                self.tags.append(t)
            if x in voc:
                if t in voc[x]:
                    voc[x][t] += 1
                else:
                    voc[x][t] = 1
            else:
                voc[x] = {t: 1}
        self.memory = {}
        for k, d in voc.items():
            self.memory[k] = max(d, key=d.get)
        return self  # scikit-learn convention: fit returns the estimator
    
    def predict(self, X, y=None):
        '''
        Predict the tag from memory. If word is unknown, predict 'O'.
        '''
        return [self.memory.get(x, 'O') for x in X]
In [128]:
tagger = MemoryTagger()
In [129]:
tagger.fit(sent, tag)
In [139]:
print(tagger.predict(sent))
['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']
In [131]:
tagger.tags
Out[131]:
['O', 'B-geo', 'B-gpe']
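
The same memorization idea can be sketched more compactly with `collections.Counter`. This is only an illustration on made-up toy data, not a replacement for the class above; the function name `most_common_tag_per_word` and the toy lists are assumptions for the example.

```python
from collections import Counter, defaultdict


def most_common_tag_per_word(words, tags):
    """Map every word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for word, tag in zip(words, tags):
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


# Toy training data (hypothetical, not taken from the dataset).
train_words = ["Paris", "in", "Paris", "Paris", "Hilton"]
train_tags = ["B-geo", "O", "B-geo", "B-per", "I-per"]

memory = most_common_tag_per_word(train_words, train_tags)

# "London" was never seen during training, so it falls back to 'O'.
predicted = [memory.get(w, "O") for w in ["Paris", "in", "London"]]
print(predicted)  # ['B-geo', 'O', 'O']
```

Note how the unseen word "London" is tagged `O`: a memorizing tagger cannot recognize entities it has never seen.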

Okay, it looks like it basically works. Now we do a 5-fold cross-validation.

In [94]:
from sklearn.model_selection import cross_val_predict  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import classification_report
In [98]:
words = data["Word"].values.tolist()
tags = data["Tag"].values.tolist()
In [99]:
pred = cross_val_predict(estimator=MemoryTagger(), X=words, y=tags, cv=5)

We will use the scikit-learn classification report to evaluate the tagger, because we are mainly interested in precision, recall and the F1-score. These metrics are common in NLP tasks; if you are not familiar with them, check out the Wikipedia articles.
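
For readers who prefer code to definitions, here is a minimal sketch of how the three metrics are computed for a single tag. The helper `precision_recall_f1` and the toy labels are made up for illustration; in practice the scikit-learn report does this for every tag.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall and F1 for one label from two tag lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: the tagger finds only one of three B-geo tokens,
# but never predicts B-geo wrongly.
y_true = ["B-geo", "O", "B-geo", "O", "B-geo", "O"]
y_pred = ["B-geo", "O", "O", "O", "O", "O"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, "B-geo")
print(precision, recall, f1)
```

Here precision is 1.0 (every predicted `B-geo` is correct), recall is 1/3 (two of three true `B-geo` tokens are missed), and the F1-score, their harmonic mean, is 0.5.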

In [100]:
report = classification_report(y_pred=pred, y_true=tags)
print(report)
             precision    recall  f1-score   support

      B-art       0.23      0.06      0.10       402
      B-eve       0.50      0.25      0.33       308
      B-geo       0.78      0.84      0.81     37644
      B-gpe       0.94      0.93      0.94     15870
      B-nat       0.41      0.28      0.33       201
      B-org       0.66      0.49      0.56     20143
      B-per       0.79      0.64      0.71     16990
      B-tim       0.87      0.77      0.82     20333
      I-art       0.04      0.01      0.01       297
      I-eve       0.36      0.12      0.18       253
      I-geo       0.73      0.58      0.65      7414
      I-gpe       0.61      0.45      0.52       198
      I-nat       0.00      0.00      0.00        51
      I-org       0.69      0.53      0.60     16784
      I-per       0.73      0.66      0.69     17251
      I-tim       0.56      0.13      0.21      6528
          O       0.97      0.99      0.98    887908

avg / total       0.94      0.95      0.94   1048575

This doesn’t look too bad! The precision is quite reasonable, but as you might have guessed, the recall is pretty weak. This is because we cannot predict a tag for words we don’t know. To overcome this issue, we will now introduce a simple machine learning model to predict the named entities.

A simple machine learning approach

To do machine learning, we convert the data to a simple feature vector for every word and then use a random forest to classify the words.

In [119]:
from sklearn.ensemble import RandomForestClassifier

The simplest feature map only contains information about the word itself.

In [114]:
def feature_map(word):
    '''Simple feature map.'''
    return np.array([word.istitle(), word.islower(), word.isupper(), len(word),
                     word.isdigit(),  word.isalpha()])
In [116]:
words = [feature_map(w) for w in data["Word"].values.tolist()]
In [120]:
pred = cross_val_predict(RandomForestClassifier(n_estimators=20),
                         X=words, y=tags, cv=5)
In [121]:
report = classification_report(y_pred=pred, y_true=tags)
print(report)
             precision    recall  f1-score   support

      B-art       0.00      0.00      0.00       402
      B-eve       0.00      0.00      0.00       308
      B-geo       0.26      0.80      0.40     37644
      B-gpe       0.25      0.04      0.07     15870
      B-nat       0.00      0.00      0.00       201
      B-org       0.65      0.17      0.27     20143
      B-per       0.97      0.20      0.33     16990
      B-tim       0.29      0.32      0.30     20333
      I-art       0.00      0.00      0.00       297
      I-eve       0.00      0.00      0.00       253
      I-geo       0.00      0.00      0.00      7414
      I-gpe       0.00      0.00      0.00       198
      I-nat       0.00      0.00      0.00        51
      I-org       0.36      0.03      0.06     16784
      I-per       0.47      0.02      0.04     17251
      I-tim       0.50      0.06      0.11      6528
          O       0.97      0.98      0.97    887908

avg / total       0.88      0.87      0.86   1048575

Wow, that looks really bad. This is expected, since the features lack a lot of the information needed for the decision. So now we enhance our simple features, on the one hand with memory and on the other hand with context information.
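
To see why the word-only features fall short, consider the first sentence of the dataset. The sketch below re-states the feature map without NumPy under the hypothetical name `word_features` and shows that "They" (tagged `O`), "Hyde" (`B-geo`) and "Park" (`I-geo`) all map to the same feature vector, so no classifier can tell them apart from these features alone.

```python
def word_features(word):
    """Word-only features, mirroring the simple feature map (returned as a tuple)."""
    return (word.istitle(), word.islower(), word.isupper(), len(word),
            word.isdigit(), word.isalpha())


# Three words with three different gold tags in the example sentence.
vectors = {w: word_features(w) for w in ["They", "Hyde", "Park"]}
for w, v in vectors.items():
    print(w, v)

# All three vectors are identical.
all_identical = len(set(vectors.values())) == 1
print(all_identical)  # True
```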

In [198]:
from sklearn.preprocessing import LabelEncoder

class FeatureTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        self.memory_tagger = MemoryTagger()
        self.tag_encoder = LabelEncoder()
        self.pos_encoder = LabelEncoder()
        
    def fit(self, X, y):
        words = X["Word"].values.tolist()
        self.pos = X["POS"].values.tolist()
        tags = X["Tag"].values.tolist()
        self.memory_tagger.fit(words, tags)
        self.tag_encoder.fit(tags)
        self.pos_encoder.fit(self.pos)
        return self
    
    def transform(self, X, y=None):
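        def pos_default(p):
            # Check against the encoder's known classes (a short array) instead
            # of the full training list, which is much slower to search.
            if p in self.pos_encoder.classes_:
                return self.pos_encoder.transform([p])[0]
            else:
                return -1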
        
        pos = X["POS"].values.tolist()
        words = X["Word"].values.tolist()
        out = []
        for i in range(len(words)):
            w = words[i]
            p = pos[i]
            if i < len(words) - 1:
                wp = self.tag_encoder.transform(self.memory_tagger.predict([words[i+1]]))[0]
                posp = pos_default(pos[i+1])
            else:
                wp = self.tag_encoder.transform(['O'])[0]
                posp = pos_default(".")
            if i > 0:
                if words[i-1] != ".":
                    wm = self.tag_encoder.transform(self.memory_tagger.predict([words[i-1]]))[0]
                    posm = pos_default(pos[i-1])
                else:
                    wm = self.tag_encoder.transform(['O'])[0]
                    posm = pos_default(".")
            else:
                posm = pos_default(".")
                wm = self.tag_encoder.transform(['O'])[0]
            out.append(np.array([w.istitle(), w.islower(), w.isupper(), len(w), w.isdigit(), w.isalpha(),
                                 self.tag_encoder.transform(self.memory_tagger.predict([w]))[0],
                                 pos_default(p), wp, wm, posp, posm]))
        return out
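
Stripped of the encoders, the context part of the transformer boils down to looking one token to the left and one to the right, with a default value at the boundaries. A simplified sketch of that windowing idea follows; the helper `context_window` and the `<PAD>` placeholder are assumptions for illustration, whereas the class above uses the memorized tag and POS tag of the neighbours and falls back to 'O' and '.'.

```python
def context_window(tokens, i, pad="<PAD>"):
    """Return (previous, current, next) for position i, padding at the edges."""
    prev_tok = tokens[i - 1] if i > 0 else pad
    next_tok = tokens[i + 1] if i < len(tokens) - 1 else pad
    return prev_tok, tokens[i], next_tok


tokens = ["in", "Hyde", "Park", "."]
windows = [context_window(tokens, i) for i in range(len(tokens))]
for w in windows:
    print(w)
```

With such a window, "Park" is seen together with its left neighbour "Hyde", which is exactly the kind of information the word-only features were missing.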
In [199]:
from sklearn.pipeline import Pipeline
In [200]:
pred = cross_val_predict(Pipeline([("feature_map", FeatureTransformer()), 
                                   ("clf", RandomForestClassifier(n_estimators=20, n_jobs=3))]),
                         X=data, y=tags, cv=5)
In [201]:
report = classification_report(y_pred=pred, y_true=tags)
print(report)
             precision    recall  f1-score   support

      B-art       0.17      0.08      0.11       402
      B-eve       0.40      0.28      0.33       308
      B-geo       0.83      0.85      0.84     37644
      B-gpe       0.98      0.93      0.95     15870
      B-nat       0.20      0.23      0.22       201
      B-org       0.73      0.64      0.68     20143
      B-per       0.82      0.75      0.78     16990
      B-tim       0.89      0.80      0.84     20333
      I-art       0.03      0.01      0.01       297
      I-eve       0.28      0.13      0.18       253
      I-geo       0.76      0.67      0.71      7414
      I-gpe       0.78      0.47      0.59       198
      I-nat       0.38      0.22      0.28        51
      I-org       0.73      0.67      0.70     16784
      I-per       0.85      0.74      0.79     17251
      I-tim       0.81      0.53      0.64      6528
          O       0.98      0.99      0.99    887908

avg / total       0.96      0.96      0.96   1048575

This improved the result a bit, but this is still not very convincing. In the next post, I will show how to do better with more sophisticated algorithms.
