Named entity recognition with conditional random fields in Python

This is the second post in my series about named entity recognition. If you haven’t seen the first one, have a look now. Last time we started by memorizing entities for words and then used a simple classification model to improve the results a bit. This model also used context properties and the structure of the word in question. But the results were not overwhelmingly good, so now we’re going to look into a more sophisticated algorithm, a so-called conditional random field (CRF).

We denote by $x = (x_1,\dots, x_m)$ the input sequence, i.e. the words of a sentence, and by $s = (s_1,\dots, s_m)$ the sequence of output states, i.e. the named entity tags. In conditional random fields we model the conditional probability $$p(s_1,\dots,s_m|x_1,\dots,x_m).$$ We do this by defining a feature map $$\Phi(x_1,\dots,x_m,s_1,\dots,s_m)\in\mathbb{R}^d$$ that maps an entire input sequence $x$ paired with an entire state sequence $s$ to some $d$-dimensional feature vector. Then we can model the probability as a log-linear model with the parameter vector $w\in\mathbb{R}^d$ $$p(s|x; w) = \frac{\exp(w\cdot\Phi(x, s))}{\sum_{s^\prime} \exp(w\cdot\Phi(x, s^\prime))},$$ where $s^\prime$ ranges over all possible output sequences.

For the estimation of $w$, we assume that we have a set of $n$ labeled examples $\{(x^i, s^i)\}_{i=1}^n$. Now we define the regularized log-likelihood function $L$ $$L(w) = \sum_{i=1}^n \log p(s^i|x^i; w) - \frac{\lambda_2}{2}\|w\|_2^2 - \lambda_1 \|w\|_1.$$ The terms $\frac{\lambda_2}{2}\|w\|_2^2$ and $\lambda_1 \|w\|_1$ force the parameter vector to be small in the respective norm. This penalizes model complexity and is known as regularization. The parameters $\lambda_2$ and $\lambda_1$ control how strongly we regularize. The parameter vector $w^*$ is then estimated as $$w^* = \text{arg max}_{w\in \mathbb{R}^d} L(w).$$ Once we have estimated the vector $w^*$, we can find the most likely tag sequence $s^*$ for a sentence $x$ by $$s^* = \text{arg max}_{s} p(s|x; w^*).$$ For more details we refer to M. Collins [http://www.cs.columbia.edu/~mcollins/crf.pdf].
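To make the log-linear model concrete, here is a minimal sketch that evaluates $p(s|x; w)$ and the arg max by brute force over all tag sequences. The two-tag set, the three hand-picked features and the weight vector are assumptions chosen purely for illustration; a real CRF implementation uses dynamic programming instead of enumerating every sequence.

```python
import itertools
import math

TAGS = ["O", "B-geo"]  # hypothetical two-tag set, for illustration only


def feature_map(x, s):
    """Toy Phi(x, s): three feature counts accumulated over the whole sequence."""
    phi = [0.0, 0.0, 0.0]
    for i, (word, tag) in enumerate(zip(x, s)):
        if word.istitle() and tag == "B-geo":
            phi[0] += 1.0  # capitalized word tagged as a location
        if not word.istitle() and tag == "O":
            phi[1] += 1.0  # non-capitalized word tagged as outside
        if i > 0 and s[i - 1] == "B-geo" and tag == "B-geo":
            phi[2] += 1.0  # two location tags in a row
    return phi


def score(w, x, s):
    """The inner product w . Phi(x, s)."""
    return sum(wi * fi for wi, fi in zip(w, feature_map(x, s)))


def probability(w, x, s):
    """p(s | x; w): exp(score) normalized over all possible tag sequences."""
    z = sum(math.exp(score(w, x, s2))
            for s2 in itertools.product(TAGS, repeat=len(x)))
    return math.exp(score(w, x, s)) / z


def decode(w, x):
    """s* = arg max_s p(s | x; w); the normalizer does not change the arg max."""
    return max(itertools.product(TAGS, repeat=len(x)),
               key=lambda s2: score(w, x, s2))


w = [2.0, 1.0, -1.0]  # assumed weights, not learned from data
x = ["marched", "through", "London"]
best = decode(w, x)
print(best, round(probability(w, x, best), 3))
```

The enumeration grows exponentially with the sentence length, which is why this only works for toy inputs.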

Load the dataset

If you want to run the tutorial yourself, you can find the dataset here. Now we want to apply this model. Let’s start by loading the data.

import pandas as pd
import numpy as np

data = pd.read_csv("ner_dataset.csv", encoding="latin1")

# Forward-fill missing values, so every token row carries its sentence id.
data = data.ffill()

data.tail(10)

         Sentence #       Word       POS  Tag
1048565  Sentence: 47958  impact     NN   O
1048566  Sentence: 47958  .          .    O
1048567  Sentence: 47959  Indian     JJ   B-gpe
1048568  Sentence: 47959  forces     NNS  O
1048569  Sentence: 47959  said       VBD  O
1048570  Sentence: 47959  they       PRP  O
1048571  Sentence: 47959  responded  VBD  O
1048572  Sentence: 47959  to         TO   O
1048573  Sentence: 47959  the        DT   O
1048574  Sentence: 47959  attack     NN   O

words = list(set(data["Word"].values))

n_words = len(words); n_words

35178


So we have 47959 sentences containing 35178 different words. We change the SentenceGetter class from the last post a little and use it to retrieve sentences with their labels.

class SentenceGetter(object):

    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        # Turn the rows of one sentence into a list of (word, POS, tag) triples.
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        # Return the next sentence, or None once there is no such sentence id.
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None

getter = SentenceGetter(data)

sent = getter.get_next()


This is how a sentence looks now.

print(sent)

[('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ('demonstrators', 'NNS', 'O'), ('have', 'VBP', 'O'), ('marched', 'VBN', 'O'), ('through', 'IN', 'O'), ('London', 'NNP', 'B-geo'), ('to', 'TO', 'O'), ('protest', 'VB', 'O'), ('the', 'DT', 'O'), ('war', 'NN', 'O'), ('in', 'IN', 'O'), ('Iraq', 'NNP', 'B-geo'), ('and', 'CC', 'O'), ('demand', 'VB', 'O'), ('the', 'DT', 'O'), ('withdrawal', 'NN', 'O'), ('of', 'IN', 'O'), ('British', 'JJ', 'B-gpe'), ('troops', 'NNS', 'O'), ('from', 'IN', 'O'), ('that', 'DT', 'O'), ('country', 'NN', 'O'), ('.', '.', 'O')]


Okay, that looks as expected. Now we get all sentences.

sentences = getter.sentences


Craft features

Now we craft a set of features and prepare the dataset.

def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]
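To see what a single feature dictionary looks like, the following self-contained sketch builds the same keys on a two-token toy sentence. The helper `token_features`, the function `compact_word2features` and the toy sentence are hypothetical names introduced here; it is a condensed variant for illustration, not a replacement for the functions above.

```python
def token_features(word, postag, prefix=""):
    """Features shared by the current token and its neighbours.

    `prefix` is '' for the current token and '-1:' or '+1:' for neighbours.
    """
    return {
        prefix + "word.lower()": word.lower(),
        prefix + "word.istitle()": word.istitle(),
        prefix + "word.isupper()": word.isupper(),
        prefix + "postag": postag,
        prefix + "postag[:2]": postag[:2],
    }


def compact_word2features(sent, i):
    """Build the feature dict for token i of a list of (word, POS, tag) triples."""
    word, postag = sent[i][0], sent[i][1]
    features = {
        "bias": 1.0,
        "word[-3:]": word[-3:],
        "word[-2:]": word[-2:],
        "word.isdigit()": word.isdigit(),
    }
    features.update(token_features(word, postag))
    if i > 0:
        features.update(token_features(sent[i - 1][0], sent[i - 1][1], prefix="-1:"))
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        features.update(token_features(sent[i + 1][0], sent[i + 1][1], prefix="+1:"))
    else:
        features["EOS"] = True  # end of sentence
    return features


toy_sent = [("London", "NNP", "B-geo"), ("calling", "VBG", "O")]  # made-up example
first = compact_word2features(toy_sent, 0)
for name in sorted(first):
    print(name, "=", first[name])
```

The first token gets the `BOS` marker and the `+1:` features of its right neighbour, but no `-1:` features, because there is no token to its left.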


Fit the CRF

Now we can initialize the algorithm. We use the conditional random field (CRF) implementation provided by sklearn-crfsuite.

from sklearn_crfsuite import CRF

crf = CRF(algorithm='lbfgs',
          c1=0.1,
          c2=0.1,
          max_iterations=100,
          all_possible_transitions=False)


Okay, let’s see if it works. Like last time, we perform a 5-fold cross-validation.

Evaluate the model

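# sklearn.cross_validation was removed from scikit-learn; the function now lives in model_selection.
from sklearn.model_selection import cross_val_predict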
from sklearn_crfsuite.metrics import flat_classification_report

pred = cross_val_predict(estimator=crf, X=X, y=y, cv=5)
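Conceptually, the cross-validated prediction works roughly as in the following pure-Python sketch: the sentences are split into five folds, and each fold is predicted by a model trained on the other four. The function names and the majority-tag stand-in model are assumptions for illustration; this is not the scikit-learn implementation.

```python
from collections import Counter


def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous, nearly equal test folds."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold_predict(fit, predict, X, y, n_folds=5):
    """Predict every sample with a model that never saw it during training."""
    predictions = [None] * len(X)
    for test_idx in k_fold_indices(len(X), n_folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            predictions[i] = predict(model, X[i])
    return predictions


# Hypothetical stand-in model: always predict the most frequent training tag.
def fit_majority(X_train, y_train):
    return Counter(tag for sent in y_train for tag in sent).most_common(1)[0][0]


def predict_majority(model, sent):
    return [model] * len(sent)


toy_X = [["w"] * n for n in (1, 2, 3, 1, 2, 3, 1, 2, 3, 1)]  # ten dummy sentences
toy_y = [["O"] * len(sent) for sent in toy_X]
toy_pred = out_of_fold_predict(fit_majority, predict_majority, toy_X, toy_y)
print(len(toy_pred), "sentences predicted out of fold")
```

Because every sentence is predicted by a model that did not train on it, the resulting scores estimate how the tagger behaves on unseen text.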


We use the flat classification report from sklearn-crfsuite, which wraps the scikit-learn classification report, to evaluate the tagger, because we are mainly interested in precision, recall and the f1-score. These metrics are common in NLP tasks; if you are not familiar with them, check out the Wikipedia articles.
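As a reminder of what these numbers mean, here is a minimal sketch that computes precision, recall and f1-score for one tag after flattening the per-sentence tag lists. The function name `flat_prf` and the toy labels are made up for illustration.

```python
def flat_prf(y_true, y_pred, tag):
    """Precision, recall and f1 for one tag over flattened tag sequences."""
    true_flat = [t for sent in y_true for t in sent]
    pred_flat = [p for sent in y_pred for p in sent]
    pairs = list(zip(true_flat, pred_flat))
    tp = sum(1 for t, p in pairs if t == tag and p == tag)  # correctly found
    fp = sum(1 for t, p in pairs if t != tag and p == tag)  # predicted but wrong
    fn = sum(1 for t, p in pairs if t == tag and p != tag)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


toy_true = [["O", "B-geo", "O"], ["B-geo", "O"]]
toy_pred = [["O", "B-geo", "B-geo"], ["O", "O"]]
print(flat_prf(toy_true, toy_pred, "B-geo"))  # (0.5, 0.5, 0.5)
```

In the toy data one of the two predicted B-geo tags is correct (precision 0.5) and one of the two true B-geo tags is found (recall 0.5).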

report = flat_classification_report(y_pred=pred, y_true=y)
print(report)

             precision    recall  f1-score   support

      B-art       0.37      0.11      0.17       402
      B-eve       0.52      0.35      0.42       308
      B-geo       0.85      0.90      0.88     37644
      B-gpe       0.97      0.94      0.95     15870
      B-nat       0.66      0.37      0.47       201
      B-org       0.78      0.72      0.75     20143
      B-per       0.84      0.81      0.82     16990
      B-tim       0.93      0.88      0.90     20333
      I-art       0.11      0.03      0.04       297
      I-eve       0.34      0.21      0.26       253
      I-geo       0.82      0.79      0.80      7414
      I-gpe       0.92      0.55      0.69       198
      I-nat       0.61      0.27      0.38        51
      I-org       0.81      0.79      0.80     16784
      I-per       0.84      0.89      0.87     17251
      I-tim       0.83      0.76      0.80      6528
          O       0.99      0.99      0.99    887908

avg / total       0.97      0.97      0.97   1048575


This looks like a good start. We easily beat the results from the last post.

crf.fit(X, y)

CRF(algorithm='lbfgs', all_possible_states=None,
all_possible_transitions=False, averaging=None, c=None, c1=0.1, c2=0.1,
calibration_candidates=None, calibration_eta=None,
calibration_max_trials=None, calibration_rate=None,
calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
max_linesearch=None, min_freq=None, model_filename=None,
num_memories=None, pa_type=None, period=None, trainer_cls=None,
variance=None, verbose=False)


Inspect the model

The nice thing about CRFs is that we can look into the model and inspect the learned transition weights from one tag to another. We can also see which features are important for predicting a certain tag. We use the eli5 library to perform the investigation.

import eli5

eli5.show_weights(crf, top=30)

From \ To        O   B-art   I-art   B-eve   I-eve   B-geo   I-geo   B-gpe   I-gpe   B-nat   I-nat   B-org   I-org   B-per   I-per   B-tim   I-tim
O             4.29   0.879     0.0   1.575     0.0   2.092     0.0   1.387     0.0   1.605     0.0   2.497     0.0    4.17     0.0   2.986     0.0
B-art       -0.014     0.0   8.442     0.0     0.0  -0.398     0.0     0.0     0.0     0.0     0.0   0.516     0.0  -0.844     0.0   0.336     0.0
I-art       -0.651     0.0    8.04     0.0     0.0  -0.702     0.0     0.0     0.0     0.0     0.0     0.0     0.0   0.016     0.0  -0.684     0.0
B-eve       -0.753     0.0     0.0     0.0   7.956     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   0.572     0.0
I-eve       -0.324     0.0     0.0     0.0   7.341     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0  -0.621     0.0
B-geo        0.677   0.752     0.0   0.545     0.0     0.0   8.752   0.579     0.0     0.0     0.0   1.155     0.0   1.143     0.0   2.344     0.0
I-geo       -0.469   0.822     0.0     0.0     0.0     0.0   7.424  -1.366     0.0     0.0     0.0  -0.074     0.0   1.331     0.0   1.033     0.0
B-gpe        0.679  -1.609     0.0   -0.32     0.0   0.681     0.0     0.0   7.485     0.0     0.0    2.05     0.0   1.459     0.0   0.767     0.0
I-gpe       -0.298     0.0     0.0     0.0     0.0  -1.087     0.0     0.0   6.337     0.0     0.0     0.0     0.0   0.148     0.0     0.0     0.0
B-nat       -1.108     0.0     0.0     0.0     0.0   0.625     0.0     0.0     0.0     0.0   7.067     0.0     0.0  -0.305     0.0  -0.413     0.0
I-nat       -1.979     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   5.197     0.0     0.0   1.188     0.0     0.0     0.0
B-org        0.051    1.32     0.0     0.0     0.0  -0.331     0.0   0.447     0.0     0.0     0.0     0.0   7.109   1.054     0.0   0.255     0.0
I-org       -0.242     0.0     0.0     0.0     0.0  -1.562     0.0   0.573     0.0     0.0     0.0     0.0   7.236   1.639     0.0   0.421     0.0
B-per        0.364     0.0     0.0     0.0     0.0   0.723     0.0   0.734     0.0   2.176     0.0   2.405     0.0     0.0   7.146   1.165     0.0
I-per         0.18     0.0     0.0     0.0     0.0  -2.072     0.0  -1.568     0.0     0.0     0.0  -0.341     0.0     0.0   6.299   1.055     0.0
B-tim        0.286  -1.079     0.0   0.249     0.0  -0.083     0.0  -1.338     0.0   0.061     0.0  -0.148     0.0   1.338     0.0     0.0   7.245
I-tim       -0.263     0.0     0.0   0.072     0.0   -0.11     0.0  -1.437     0.0     0.0     0.0  -0.374     0.0   1.854     0.0     0.0   7.069

y=O top features
Weight  Feature
+8.012  word.lower():last
+7.999  word.lower():month
+5.813  word.lower():chairman
+5.612  word.lower():columbia
+5.555  word.lower():year
+5.232  word.lower():week
+5.146  word.lower():months
+5.067  word.lower():internet
+4.833  word.lower():weeks
+4.726  word.lower():after
+4.684  word.lower():republicans
+4.558  word[-3:]:And
+4.436  word.lower():ambassador
+4.406  word.lower():chief
+4.383  word.lower():trade
+4.344  word.lower():early
+4.272  word.lower():years
+4.216  +1:word.lower():americans
+4.140  word.lower():tourism
+4.127  +1:word.lower():american
+4.079  word.lower():christian
+4.075  word.lower():spokesman
+4.060  word[-3:]:De
+4.060  word[-2:]:De
… 9204 more positive …
… 5208 more negative …
-4.091  word[-2:]:0s
-4.158  word.lower():afternoon
-4.447  word.lower():palestinian
-4.515  word.lower():summer
-4.607  word.lower():morning
-4.801  word.lower():multi-party

y=B-art top features
Weight  Feature
+5.369  word.lower():twitter
+4.858  word.lower():spaceshipone
+4.294  word.lower():nevirapine
+4.271  +1:word.lower():enkhbayar
+4.263  +1:word.lower():boots
+3.893  word.lower():english
+3.802  -1:word.lower():engine
+3.655  word[-3:]:One
+3.588  -1:word.lower():film
+3.540  word.lower():russian
+3.499  word.lower():canal
+3.397  +1:word.lower():al-arabiya
+3.345  -1:word.lower():adumim
+3.237  word.lower():sopranos
+3.186  -1:word.lower():to
+3.150  word.lower():spanish
+3.130  -1:word.lower():shown
+3.014  word.lower():economics
+3.006  -1:word.lower():tamilnet
+2.997  word.lower():frankenstadion
+2.973  word.lower():settlement
+2.936  word[-2:]:00
+2.919  word.lower():dollar
+2.889  -1:word.lower():republic
+2.889  +1:word.lower():helicopters
+2.877  +1:word.lower():search
+2.875  -1:word.lower():program
+2.831  word.lower():endeavor
+2.711  word[-3:]:vor
+2.685  word.lower():sidnaya
… 957 more positive …
… 81 more negative …

y=I-art top features
Weight  Feature
+3.025  -1:word.lower():boeing
+2.553  +1:word.lower():gained
+2.473  +1:word.lower():came
+2.418  -1:word.lower():cajun
+2.297  word.lower():notice
+2.260  word.lower():constitution
+2.112  word.lower():flowers
+2.109  +1:word.lower():times
+2.072  +1:word.lower():marks
+2.056  word.lower():a
+2.048  +1:word.lower():teshome
+1.980  +1:word.lower():treaty
+1.876  +1:word.lower():expands
+1.875  +1:word.lower():reports
+1.866  -1:word.lower():dignity
+1.859  word.lower():dome
+1.852  +1:word.lower():early
+1.844  +1:word.lower():roses
+1.805  -1:word.lower():jerusalem
+1.800  -1:word.lower():balad
+1.793  +1:word.lower():outside
+1.779  word.lower():monument
+1.774  -1:word.lower():baghdad
+1.765  -1:word.lower():beijing
+1.757  +1:word.lower():rival
+1.747  -1:word.lower():hitler
+1.668  word[-3:]:One
+1.667  word.lower():lies
+1.660  word.lower():declaration
+1.645  word.lower():mustard
… 882 more positive …
… 81 more negative …

y=B-eve top features
Weight  Feature
+4.333  word.lower():games
+4.263  word.lower():ramadan
+4.160  -1:word.lower():falklands
+3.501  -1:word.lower():typhoon
+3.484  word[-3:]:mes
+3.050  +1:word.lower():dean
+3.046  +1:word.lower():men
+3.028  -1:word.lower():wars
+2.942  -1:word.lower():happy
+2.938  -1:word.lower():solemn
+2.915  word.lower():hopman
+2.899  word.lower():katrina
+2.846  word.lower():olympic
+2.843  word[-3:]:pic
+2.758  -1:word.lower():war
+2.745  word.lower():parma
+2.714  -1:word.lower():midnight
+2.596  word.lower():australian
+2.570  -1:word.lower():2002
+2.547  +1:word.lower():security
+2.518  +1:word.lower():sabbath
+2.454  +1:word.lower():open
+2.446  +1:word.lower():event
+2.442  word.lower():passover
+2.433  -1:word.lower():nazi
+2.409  +1:word.lower():ends
+2.390  word.lower():holocaust
+2.350  -1:word.lower():reigning
+2.262  word[-3:]:mme
+2.262  word.lower():somme
… 437 more positive …
… 49 more negative …

y=I-eve top features
Weight  Feature
+4.329  +1:word.lower():mascots
+3.603  word.lower():games
+3.022  +1:word.lower():era
+2.756  word.lower():series
+2.577  word.lower():dean
+2.509  +1:word.lower():rally
+2.508  +1:word.lower():caused
+2.504  +1:word.lower():disaster
+2.441  word.lower():sabbath
+2.426  +1:word.lower():tore
+2.420  +1:word.lower():without
+2.230  -1:word.lower():jewish
+2.220  +1:word.lower():now
+2.216  +1:word.lower():project
+2.164  +1:word.lower():suicide
+2.112  -1:word.lower():awareness
+1.940  +1:word.lower():holiday
+1.916  +1:word.lower():peace
+1.880  word[-3:]:ean
+1.861  -1:word.lower():hurricane
+1.831  +1:word.lower():even
+1.828  +1:word.lower():finals
+1.762  word.lower():conference
+1.760  -1:word.lower():typhoon
+1.753  -1:word.lower():may
+1.743  +1:word.lower():tennis
+1.712  -1:word.lower():rights
+1.702  word.lower():year
+1.699  +1:word.lower():olympics
+1.696  word.lower():awareness
… 393 more positive …
… 64 more negative …

y=B-geo top features
Weight  Feature
+6.238  word.lower():mid-march
+6.002  word.lower():caribbean
+5.503  word.lower():martian
+5.446  word.lower():beijing
+5.086  word.lower():persian
+4.737  -1:word.lower():hamas
+4.521  -1:word.lower():mr.
+4.509  word.lower():balkans
+4.362  -1:word.lower():serb
+4.310  word.lower():quake-zone
+4.224  word.lower():philippines
+4.192  word.lower():burma
+4.169  +1:word.lower():phoned
+4.167  word.lower():washington
+4.152  word.lower():france
+4.137  word.lower():paris
+4.131  -1:word.lower():taleban
+4.016  -1:word.lower():bordeaux
+3.943  word.lower():mars
+3.900  +1:word.lower():moqtada
+3.886  -1:word.lower():cypriot
+3.870  word.lower():mid-june
+3.837  word.lower():wheeler
+3.788  word.lower():pearl
+3.744  -1:word.lower():malaysian
+3.698  word.lower():athens
+3.616  word.lower():séances
+3.616  word.lower():port-au-prince
+3.589  word.lower():christians
… 5949 more positive …
… 1365 more negative …
-4.659  word[-3:]:The

y=I-geo top features
Weight  Feature
+4.211  word.lower():led-invasion
+4.151  word.lower():holiday
+4.065  word.lower():caribbean
+3.651  +1:word.lower():possessions
+3.446  +1:word.lower():regional
+3.430  +1:word.lower():french
+3.374  -1:word.lower():nahr
+3.296  -1:word.lower():tokugawa
+3.296  word.lower():shogunate
+3.232  word.lower():restaurant
+3.127  word.lower():island
+3.063  word.lower():autonomy
+3.059  +1:word.lower():produced
+3.054  -1:word.lower():kennedy
+2.992  -1:word.lower():christmas
+2.890  word.lower():ocean
+2.885  word.lower():east
+2.852  +1:word.lower():block
+2.826  -1:word.lower():sumatran
+2.745  -1:word.lower():surma
+2.721  -1:word.lower():john
+2.675  word.lower():subway
+2.645  +1:word.lower():crude
+2.635  +1:word.lower():service
+2.623  +1:word.lower():holidays
+2.593  word.lower():lions
+2.482  +1:word.lower():islamic
+2.409  +1:word.lower():crisis
… 2989 more positive …
… 525 more negative …
-2.367  word[-3:]:ost
-2.493  word[-3:]:day

y=B-gpe top features
Weight  Feature
+6.735  word.lower():afghan
+6.602  word.lower():niger
+6.219  word.lower():nepal
+5.432  word.lower():spaniard
+5.391  word.lower():azerbaijan
+5.138  word.lower():iranian
+5.127  word.lower():mexican
+5.080  word.lower():argentine
+4.926  word.lower():gibraltar
+4.829  word.lower():iraqi
+4.706  word.lower():spaniards
+4.662  word.lower():croats
+4.638  word.lower():venezuelan
+4.599  word.lower():cuban
+4.526  word.lower():korean
+4.526  word.lower():polish
+4.480  word.lower():aussies
+4.313  word.lower():bahamas
+4.301  word.lower():syrian
+4.280  word.lower():andorra
+4.278  word.lower():jordan
+4.271  word.lower():turkish
+4.234  word.lower():madagonia
+4.226  word.lower():chechen
+4.224  word.lower():chilean
+4.215  word.lower():kenyan
+4.209  word.lower():irish
+4.206  word.lower():egyptian
+4.191  word.lower():palestinians
+4.147  word.istitle()
… 1434 more positive …
… 505 more negative …

y=I-gpe top features
Weight  Feature
+5.622  +1:word.lower():mayor
+4.073  -1:word.lower():democratic
+3.844  -1:word.lower():bosnian
+3.602  +1:word.lower():developed
+3.543  word.lower():korean
+3.308  word[-3:]:can
+3.226  -1:word.lower():soviet
+3.217  word.lower():city
+3.179  +1:word.lower():health
+3.172  word.lower():cypriots
+3.000  word.lower():britons
+2.857  +1:word.lower():under
+2.841  +1:word.lower():iraq
+2.737  +1:word.lower():invaded
+2.619  +1:word.lower():man
+2.601  +1:word.lower():returned
+2.547  -1:word.lower():islamic
+2.532  +1:word.lower():did
+2.471  +1:word.lower():also
+2.449  word[-2:]:bs
+2.327  word.lower():indians
+2.307  word.lower():cypriot
+2.294  word[-3:]:iot
+2.220  word[-3:]:ots
+2.188  -1:word.lower():panama
+2.159  +1:word.lower():began
+2.109  word[-3:]:ovy
+2.109  word.lower():muscovy
+2.095  +1:word.lower():countries
+2.067  word[-2:]:ot
… 207 more positive …
… 40 more negative …

y=B-nat top features
Weight  Feature
+6.149  word.lower():katrina
+5.371  word.lower():marburg
+4.334  word.lower():rita
+3.535  +1:word.lower():shot
+2.959  word[-3:]:ita
+2.791  word.lower():leukemia
+2.769  word[-3:]:urg
+2.759  word[-3:]:mia
+2.665  word.lower():paul
+2.647  +1:word.lower():strain
+2.595  word[-2:]:N1
+2.552  word.lower():ebola
+2.505  word[-3:]:5N1
+2.505  word.lower():h5n1
+2.505  +1:word.lower():immunization
+2.454  word[-3:]:aul
+2.444  word.lower():danielle
+2.379  +1:word.lower():lives
+2.349  word.lower():acc
+2.349  word[-3:]:ACC
+2.337  -1:word.lower():often-deadly
+2.322  -1:word.lower():7,000
+2.280  word[-2:]:TB
+2.222  +1:word.lower():epidemics
+2.174  word[-2:]:rg
+2.158  word.isupper()
+2.147  +1:word.lower():should
+2.140  -1:word.lower():case
+2.133  word.lower():amur
+2.121  +1:word.lower():correctly
… 242 more positive …
… 39 more negative …

y=I-nat top features
Weight  Feature
+2.681  word.lower():rita
+2.327  word[-3:]:ita
+2.315  +1:word.lower():outbreak
+1.944  -1:word.lower():hurricanes
+1.909  word[-2:]:ta
+1.747  word.lower():flu
+1.670  word[-2:]:lu
+1.654  -1:word.lower():type
+1.624  +1:word.lower():relief
+1.613  -1:postag:NN
+1.572  -1:word.istitle()
+1.570  -1:word.lower():heart
+1.471  +1:word.lower():last
+1.422  +1:word.lower():slammed
+1.421  -1:word.lower():jing
+1.421  word.lower():jing
+1.400  word.lower():katrina
+1.280  +1:word.lower():says
+1.171  word.lower():disease
+1.170  -1:word.lower():hurricane
+1.153  -1:word.lower():avian
+1.137  word.lower():circumpolar
+1.121  word[-3:]:Flu
+1.092  -1:word.lower():antarctic
+1.068  -1:word.lower():circumpolar
+1.066  word[-3:]:ase
+1.051  +1:word.lower():current
+1.050  word[-3:]:ina
+1.045  word.lower():current
+1.036  word[-2:]:ba
… 91 more positive …
… 20 more negative …

y=B-org top features
Weight  Feature
+7.344  word.lower():philippine
+6.075  word.lower():mid-march
+5.812  word.lower():hamas
+5.779  -1:word.lower():rice
+5.629  word.lower():al-qaida
+5.071  word.lower():taleban
+4.756  word.lower():taliban
+4.729  -1:word.lower():senator
+4.723  word.lower():reuters
+4.662  word.lower():hezbollah
+4.618  word.lower():university
+4.565  word.lower():conocophillips
+4.295  word.lower():boeing
+4.269  word.lower():senate
+4.244  word.lower():constantinople
+4.240  word.lower():kindhearts
+4.141  word.lower():boers
+4.092  -1:word.lower():singh
+4.061  word.lower():exxonmobil
+4.054  -1:word.lower():nepal
+4.002  word.lower():yukos
+3.997  word.lower():munich
+3.969  -1:word.lower():niger
+3.943  word.lower():congress
+3.920  word.lower():xinhua
+3.909  word.lower():mcdonald
+3.907  word.lower():daimlerchrysler
+3.845  word.lower():convergence
+3.845  -1:word.lower():israel
+3.824  -1:word.lower():semi-autonomous
… 6796 more positive …
… 1476 more negative …

y=I-org top features
Weight  Feature
+3.981  +1:word.lower():attained
+3.785  +1:word.lower():reporter
+3.486  -1:word.lower():associated
+3.463  word.lower():singapore
+3.400  word.lower():member-countries
+3.365  -1:word.lower():decathlon
+3.360  +1:word.lower():ohlmert
+3.343  word.lower():times
+3.335  word.lower():member-states
+3.282  +1:word.lower():separating
+3.264  -1:word.lower():&
+3.156  +1:word.lower():mulgueta
+3.127  word.lower():nations
+3.126  word.lower():holiday
+3.099  word.lower():decathlon
+3.067  +1:word.lower():ms.
+3.063  +1:word.lower():1947
+3.041  word.lower():airlines
+3.029  word.lower():washington
+2.900  +1:word.lower():post
+2.884  word.lower():relief
+2.880  word.lower():protests
+2.877  +1:word.lower():mil
+2.855  word.lower():ohlmert
… 6749 more positive …
… 1545 more negative …
-3.007  -1:word.lower():hamas
-3.224  -1:word.lower():minister
-3.233  word[-2:]:hn
-3.909  word[-2:]:lf
-4.079  word.lower():city
-4.283  word.lower():secretary

y=B-per top features
Weight  Feature
+7.301  word.lower():president
+6.125  word.lower():obama
+5.647  word.lower():senator
+5.367  word.lower():greenspan
+5.325  word.lower():vice
+4.824  word.lower():western
+4.721  word.lower():hall
+4.600  word.lower():prime
+4.541  word.lower():clinton
+4.510  word.lower():frank
+4.383  word.lower():cobain
+4.318  word.lower():milosevic
+4.177  word.lower():brent
+4.002  word[-2:]:r.
+3.953  word.lower():johnston
+3.919  word.lower():spears
+3.823  word.lower():zidane
+3.811  word.lower():al-zarqawi
+3.796  word.lower():mccain
+3.771  word.lower():toure
+3.722  word.lower():barghouti
+3.670  word.lower():rice
+3.660  +1:word.lower():extra
+3.641  word.lower():friedan
+3.614  word.lower():whittington
+3.596  -1:word.lower():spain
+3.589  word.lower():larose
+3.559  word.lower():preval
+3.555  word.lower():enkhbayar
… 6345 more positive …
… 1308 more negative …
-3.533  word.lower():venezuela

y=I-per top features
Weight  Feature
+4.163  word.lower():obama
+3.625  +1:word.lower():advisor
+3.517  word.lower():pressewednesday
+3.464  +1:word.lower():timothy
+3.230  +1:word.lower():gao
+3.191  +1:word.lower():fighters
+3.102  -1:word.lower():michael
+3.079  word.lower():gates
+2.944  -1:word.lower():david
+2.912  -1:word.lower():davis
+2.906  word.lower():ahmed
+2.879  -1:word.lower():condoleezza
+2.850  word.lower():laden
+2.782  +1:word.lower():hui
+2.743  -1:word.lower():bashar
+2.710  +1:word.lower():atal
+2.675  -1:word.lower():viktor
+2.572  -1:word.lower():paul
+2.569  word.lower():christians
+2.561  +1:word.lower():convoy
+2.559  word.lower():rice
+2.541  +1:word.lower():legally
+2.525  -1:word.lower():donald
+2.493  word.lower():milosevic
+2.477  word.lower():gration
+2.459  +1:word.lower():saeb
+2.450  word.lower():mcalpine
+2.441  +1:word.lower():udi
… 5553 more positive …
… 1380 more negative …
-3.158  -1:word.lower():sri
-4.190  word[-3:]:day

y=B-tim top features
Weight  Feature
+7.226  word.lower():multi-candidate
+6.381  word.lower():february
+6.335  word.lower():january
+6.181  word.lower():2000
+6.126  word.lower():one-year
+5.950  word.lower():weekend
+5.557  +1:word.lower():week
+5.225  word.lower():august
+5.199  word.lower():december
+4.961  word.lower():september
+4.783  word.lower():april
+4.752  word.lower():june
+4.652  word.lower():1980s
+4.591  word[-3:]:Day
+4.549  word.lower():october
+4.548  word.lower():november
+4.519  word.lower():eucharist
+4.388  -1:word.lower():week
+4.344  word.lower():titan
+4.286  word.lower():half-hour
+4.273  word.lower():mid-afternoon
+4.251  +1:word.lower():year
+4.237  word.lower():midnight
+4.117  word.lower():one-fourth
+4.097  +1:word.lower():working-age
+4.016  word.lower():quarter-century
+3.977  word.lower():pre-season
+3.910  word.lower():non-residents
+3.880  word.lower():mid-week
+3.853  -1:word.lower():cannes
… 3173 more positive …
… 856 more negative …

y=I-tim top features
Weight  Feature
+4.467  +1:word.lower():stocky
+4.098  +1:word.lower():old
+4.080  word.lower():working-age
+3.831  word.lower():2000
+3.821  word.lower():april
+3.654  +1:word.lower():jose
+3.597  -1:word.lower():this
+3.468  +1:word.lower():reflected
+3.407  +1:word.lower():month
+3.403  -1:word.lower():past
+3.230  word.lower():weekend
+3.164  word.lower():evening
+3.053  +1:word.lower():katrina
+3.035  word.lower():january
+3.020  +1:word.lower():population
+3.009  +1:word.lower():ago
+2.806  -1:word.lower():nov.
+2.757  word[-3:]:.m.
+2.757  word[-2:]:m.
+2.754  +1:word.lower():removed
+2.748  -1:word.lower():earlier
+2.734  +1:word.lower():ukrainian
+2.730  +1:word.lower():early
+2.727  -1:word.lower():ecuador
+2.726  -1:word.lower():uganda
+2.700  word.lower():august
+2.695  +1:word.lower():year
+2.650  -1:word.lower():second
… 2169 more positive …
… 459 more negative …
-2.776  word[-3:]:way
-2.900  +1:word.lower():3

Improve the model with regularization

Phew, it looks like the CRF is just memorizing a lot of words. For example, for the tag ‘B-per’ the algorithm remembers ‘president’ and ‘obama’. To overcome this issue we can tune the parameters of the CRF algorithm, especially the regularization parameters. The $c_1$ and $c_2$ parameters of the CRF algorithm are the regularization parameters $\lambda_1$ and $\lambda_2$: $c_1$ weights the $l_1$ regularization, while $c_2$ weights the $l_2$ regularization. We now limit the number of features used by enforcing sparsity on the parameter vector $w$. To do this we increase the $l_1$-regularization parameter $c_1$.
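Why a larger $l_1$ penalty leads to sparsity can be illustrated with soft-thresholding, the one-dimensional minimizer of $\frac{1}{2}(w - v)^2 + \lambda|w|$. This is only a sketch of the shrinkage effect: the optimizer inside the CRF library is more involved, and the feature names and weights below are made-up values.

```python
def soft_threshold(weight, penalty):
    """Shrink a weight towards zero by `penalty`.

    Weights whose magnitude is at most the penalty become exactly zero,
    which is what makes l1-regularized models sparse.
    """
    if weight > penalty:
        return weight - penalty
    if weight < -penalty:
        return weight + penalty
    return 0.0


# Hypothetical unregularized weights for three features.
raw_weights = {"feature_a": 6.0, "feature_b": -1.5, "feature_c": 0.3}

for penalty in (0.1, 2.0):
    shrunk = {name: soft_threshold(value, penalty)
              for name, value in raw_weights.items()}
    kept = [name for name, value in shrunk.items() if value != 0.0]
    print("penalty", penalty, "keeps", kept)
```

With the small penalty all three toy features survive, while the larger penalty removes everything except the strongest one.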

crf = CRF(algorithm='lbfgs',
          c1=10,
          c2=0.1,
          max_iterations=100,
          all_possible_transitions=False)

pred = cross_val_predict(estimator=crf, X=X, y=y, cv=5)

report = flat_classification_report(y_pred=pred, y_true=y)
print(report)

             precision    recall  f1-score   support

      B-art       0.00      0.00      0.00       402
      B-eve       0.80      0.27      0.40       308
      B-geo       0.82      0.90      0.86     37644
      B-gpe       0.95      0.92      0.94     15870
      B-nat       0.69      0.09      0.16       201
      B-org       0.78      0.67      0.72     20143
      B-per       0.80      0.76      0.78     16990
      B-tim       0.93      0.83      0.88     20333
      I-art       0.00      0.00      0.00       297
      I-eve       0.64      0.12      0.20       253
      I-geo       0.81      0.73      0.77      7414
      I-gpe       0.93      0.37      0.53       198
      I-nat       0.00      0.00      0.00        51
      I-org       0.75      0.76      0.75     16784
      I-per       0.80      0.90      0.85     17251
      I-tim       0.84      0.67      0.74      6528
          O       0.99      0.99      0.99    887908

avg / total       0.96      0.97      0.96   1048575


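This still looks quite nice. The average f1-score drops only slightly, from 0.97 to 0.96, although the rare tags B-art, I-art and I-nat are no longer recognized at all.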

crf.fit(X, y)

CRF(algorithm='lbfgs', all_possible_states=None,
all_possible_transitions=False, averaging=None, c=None, c1=10, c2=0.1,
calibration_candidates=None, calibration_eta=None,
calibration_max_trials=None, calibration_rate=None,
calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
max_linesearch=None, min_freq=None, model_filename=None,
num_memories=None, pa_type=None, period=None, trainer_cls=None,
variance=None, verbose=False)


Now we look again at the features.

eli5.show_weights(crf, top=30)

From \ To        O   B-art   I-art   B-eve   I-eve   B-geo   I-geo   B-gpe   I-gpe   B-nat   I-nat   B-org   I-org   B-per   I-per   B-tim   I-tim
O            4.037   2.614     0.0   2.167     0.0   2.069     0.0    1.64     0.0   1.788     0.0   2.589     0.0   4.301     0.0   2.546     0.0
B-art       -0.185     0.0   7.041     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
I-art       -0.398     0.0   7.378     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
B-eve       -0.422     0.0     0.0     0.0   8.084     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
I-eve          0.0     0.0     0.0     0.0    7.19     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
B-geo        1.012     0.0     0.0     0.0     0.0     0.0  10.604   0.969     0.0     0.0     0.0   0.788     0.0   0.502     0.0   2.172     0.0
I-geo       -0.991     0.0     0.0     0.0     0.0     0.0   7.889    -0.0     0.0     0.0     0.0  -0.005     0.0    -0.2     0.0  -0.144     0.0
B-gpe        1.064     0.0     0.0     0.0     0.0     0.0     0.0     0.0   4.568     0.0     0.0   1.227     0.0   1.479     0.0     0.0     0.0
I-gpe       -0.259     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
B-nat       -0.363     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   6.942     0.0     0.0     0.0     0.0     0.0     0.0
I-nat          0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
B-org        0.597     0.0     0.0     0.0     0.0     0.0     0.0   0.769     0.0     0.0     0.0     0.0   8.056   1.782     0.0   0.003     0.0
I-org       -0.344     0.0     0.0     0.0     0.0  -0.835     0.0     0.0     0.0     0.0     0.0     0.0   7.078   1.399     0.0    0.27     0.0
B-per       -0.102     0.0     0.0     0.0     0.0   0.762     0.0   0.526     0.0     0.0     0.0   1.458     0.0     0.0   6.393     0.0     0.0
I-per        0.138     0.0     0.0     0.0     0.0  -0.095     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   6.135   1.062     0.0
B-tim         1.03     0.0     0.0     0.0     0.0   0.017     0.0  -0.573     0.0     0.0     0.0     0.0     0.0   0.084     0.0     0.0   8.243
I-tim        -0.08     0.0     0.0     0.0     0.0     0.0     0.0  -0.903     0.0     0.0     0.0  -0.133     0.0     0.0     0.0     0.0   7.534

y=O top featuresy=B-art top featuresy=I-art top featuresy=B-eve top featuresy=I-eve top featuresy=B-geo top featuresy=I-geo top featuresy=B-gpe top featuresy=I-gpe top featuresy=B-nat top featuresy=I-nat top featuresy=B-org top featuresy=I-org top featuresy=B-per top featuresy=I-per top featuresy=B-tim top featuresy=I-tim top features
Weight?Feature
+5.227word[-2:]:N1
+4.445word.lower():last
+4.243word.lower():jewish
+4.026word.lower():hurricane
+3.933EOS
+3.923bias
+3.844word.lower():month
+3.764word.lower():trade
+3.480BOS
+3.445word.lower():year
+3.426word.lower():secretary
+3.419word.lower():kurdish
+3.313word.lower():christian
+3.313word.lower():israeli-palestinian
+3.191word.lower():chairman
+3.069postag[:2]:VB
+2.976word.lower():internet
+2.959-1:word.lower():prime
… 477 more positive …
… 338 more negative …
-2.819+1:word.lower():next
-2.977+1:word.lower():last
-2.980+1:word.lower():years
-2.990word.lower():afternoon
-3.013+1:word.lower():months
-3.197word.lower():quarter
-3.219word.lower():night
-3.325postag:NNP
-3.991word.lower():decade
-4.257word.lower():summer
-4.278word[-2:]:0s
-4.728word.lower():morning
Weight?Feature
+1.371word.lower():english
+1.043word[-3:]:ish
+0.760word[-3:]:ook
+0.706-1:postag[:2]:
+0.706-1:postag:
+0.600-1:postag:NN
+0.573postag:NNP
+0.464postag[:2]:NN
+0.455word.istitle()
+0.442word[-2:]:ok
+0.395-1:word.lower():"
+0.379word.isupper()
+0.376+1:word.isupper()
+0.255-1:postag:DT
+0.255-1:postag[:2]:DT
+0.197-1:word.lower():the
+0.161+1:postag:NNP
+0.094+1:postag[:2]:NN
+0.093bias
+0.062+1:word.istitle()
+0.001+1:postag[:2]:,
+0.001+1:postag:,
+0.001+1:word.lower():,
-0.029+1:postag[:2]:VB
-0.067-1:word.istitle()
-0.115-1:postag:IN
-0.115-1:postag[:2]:IN
-0.248word[-2:]:an
-0.348-1:postag:NNP
-0.512BOS
Weight?Feature
+0.675-1:postag:NNP
+0.653-1:word.istitle()
+0.520-1:postag[:2]:NN
+0.271word.istitle()
+0.171+1:postag[:2]:NN
+0.166+1:word.istitle()
+0.096-1:word.isupper()
+0.029bias
+0.028word.isupper()
+0.009postag[:2]:NN
+0.006+1:word.lower():.
+0.006+1:postag:.
+0.006+1:postag[:2]:.
+0.001+1:postag:NN
+0.000+1:postag[:2]:,
+0.000+1:word.lower():,
+0.000+1:postag:,
-0.087+1:postag:NNP
-0.287+1:postag[:2]:VB
Weight?Feature
+4.325-1:word.lower():war
+1.416word[-3:]:II
+1.416word.lower():ii
+1.415word[-2:]:II
+1.045+1:word.lower():open
+0.733+1:word.lower():war
+0.559word.isupper()
+0.547+1:word.istitle()
+0.515word.lower():world
+0.515word[-3:]:rld
+0.460word[-2:]:ld
+0.388word.istitle()
+0.369word[-2:]:an
+0.363-1:word.lower():the
+0.311-1:postag[:2]:VB
+0.282postag[:2]:NN
+0.192postag:NNP
+0.126-1:postag:CD
+0.126-1:postag[:2]:CD
+0.038+1:postag:NNP
+0.024word.lower():australian
+0.021-1:postag[:2]:DT
+0.021-1:postag:DT
+0.000-1:word.istitle()
+0.000bias
+0.000-1:postag:NNP
-0.028-1:postag[:2]:IN
-0.028-1:postag:IN
Weight?  Feature
+1.639   word.lower():games
+1.185   word.lower():open
+1.182   word[-3:]:pen
+0.934   word[-3:]:Day
+0.906   -1:word.lower():war
+0.888   word.lower():day
+0.863   -1:word.istitle()
+0.725   word.isupper()
+0.674   -1:word.lower():world
+0.579   word[-3:]:mes
+0.537   word[-3:]:War
+0.530   word.lower():war
+0.496   +1:word.lower():in
+0.469   word[-2:]:ar
+0.450   -1:postag[:2]:NN
+0.404   word[-2:]:en
+0.270   word.istitle()
+0.204   postag:NNPS
+0.189   -1:postag:NNP
+0.187   +1:postag[:2]:NN
+0.153   word[-3:]:Cup
+0.153   word.lower():cup
+0.141   word[-2:]:up
+0.011   postag[:2]:NN
+0.002   word[-2:]:ay
+0.000   +1:word.istitle()
-0.003   postag:NNP
-0.093   +1:postag[:2]:VB
-0.236   bias
Weight?  Feature
+3.396   word.lower():beijing
+2.833   -1:word.lower():mr.
+2.719   word.lower():israel
+2.656   word.lower():iran
+2.425   word.lower():britain
+2.231   word.lower():east
+2.228   word.lower():paris
+2.157   word.lower():caribbean
+2.096   word.lower():washington
+2.029   word.lower():london
+2.005   word.lower():ukraine
+1.956   word.lower():republic
+1.943   word.lower():france
+1.922   word.lower():lebanon
+1.854   word.lower():europe
+1.770   word[-3:]:the
+1.744   word[-2:]:ia
+1.672   word.lower():taiwan
+1.649   word.lower():china
+1.617   word.lower():burma
+1.586   word.istitle()
+1.486   -1:word.lower():northern
+1.456   word.lower():iraq
+1.455   -1:word.lower():in
+1.443   +1:word.lower():province
+1.435   word.lower():u.s.
+1.415   word[-3:]:.S.
+1.415   word[-2:]:S.
… 313 more positive …
… 102 more negative …
-1.483   word[-3:]:ion
-1.733   word[-3:]:May
Weight?  Feature
+2.929   -1:word.lower():san
+2.782   word.lower():airport
+2.180   word.lower():island
+1.887   -1:word.lower():of
+1.823   -1:word.lower():gulf
+1.817   word.lower():city
+1.770   word[-3:]:ast
+1.655   -1:word.lower():middle
+1.646   -1:word.lower():new
+1.515   word.lower():republic
+1.455   word.lower():station
+1.429   -1:word.lower():hong
+1.360   word.lower():ocean
+1.357   -1:word.lower():south
+1.229   word[-3:]:nds
+1.101   -1:word.lower():southern
+1.010   -1:word.lower():eastern
+0.979   word.lower():union
+0.919   word[-2:]:ia
+0.908   word.lower():states
+0.883   word.istitle()
+0.877   word[-2:]:ea
+0.829   word.lower():kong
+0.814   postag:NNP
+0.809   -1:word.lower():western
+0.805   word.lower():east
… 113 more positive …
… 46 more negative …
-0.802   word[-2:]:ar
-1.001   postag:NNS
-1.029   -1:word.isupper()
-1.068   -1:word.lower():u.s.
Weight?  Feature
+4.359   word.lower():niger
+4.050   word.istitle()
+3.524   word.lower():nepal
+3.114   word.lower():afghan
+3.101   word[-3:]:pal
+3.004   word.lower():jordan
+2.819   word[-3:]:ans
+2.661   word.lower():poland
+2.626   word.lower():korean
+2.461   word[-3:]:ese
+2.401   postag:NNS
+2.391   word.lower():iraqi
+2.245   word[-2:]:is
+2.237   word.lower():german
+2.225   word.lower():azerbaijan
+1.947   word.lower():venezuelan
+1.911   word.lower():spain
+1.903   word.lower():lanka
+1.902   word[-3:]:nka
+1.860   word[-3:]:ger
+1.838   word.lower():chechen
+1.831   word.lower():american
+1.825   word.lower():turkish
+1.825   word[-2:]:an
+1.790   word[-2:]:li
… 136 more positive …
… 52 more negative …
-1.830   word[-2:]:ic
-1.981   postag:NNPS
-2.021   word.lower():jordanian
-2.826   word.lower():european
-2.904   word.lower():asian
Weight?  Feature
+2.738   -1:word.lower():bosnian
+2.330   -1:postag:NNP
+2.036   word.istitle()
+1.426   -1:word.lower():north
+1.357   word.lower():cypriots
+0.748   postag[:2]:JJ
+0.617   postag:JJ
+0.556   -1:word.istitle()
+0.435   word[-2:]:ot
+0.420   word.lower():cypriot
+0.418   -1:word.lower():turkish
+0.415   word[-3:]:iot
+0.369   word.lower():korea
+0.308   -1:postag[:2]:NN
+0.296   word[-3:]:rea
+0.284   -1:word.lower():united
+0.163   word[-2:]:ea
+0.152   postag:NNS
+0.010   +1:postag:NNP
+0.000   word[-3:]:ots
+0.000   +1:word.istitle()
-0.121   -1:postag:JJ
-0.121   -1:postag[:2]:JJ
-0.212   postag[:2]:NN
-0.640   bias
-1.454   postag:NNP
Weight?  Feature
+3.889   word.lower():katrina
+2.306   word[-2:]:N1
+2.063   word.isupper()
+1.973   word.lower():rita
+1.904   word.lower():marburg
+1.710   word[-3:]:urg
+1.702   word[-3:]:ita
+1.686   word[-2:]:rg
+1.580   word[-3:]:5N1
+1.580   word.lower():h5n1
+1.464   word[-3:]:ina
+1.298   word[-2:]:ta
+0.742   word.lower():hurricane
+0.721   word[-3:]:ane
+0.643   word[-2:]:na
+0.636   +1:word.lower():katrina
+0.617   word[-2:]:ne
+0.536   word.lower():aids
+0.535   word[-3:]:IDS
+0.535   word[-2:]:DS
+0.475   +1:postag[:2]:NN
+0.461   postag:NNP
+0.243   postag[:2]:NN
+0.187   bias
+0.148   word.istitle()
+0.071   +1:postag:NN
+0.067   +1:postag[:2]:VB
+0.033   -1:postag:IN
+0.033   -1:postag[:2]:IN
Weight?  Feature
+1.087   -1:postag[:2]:NN
+0.750   -1:word.lower():hurricane
+0.627   word.lower():katrina
+0.538   word[-3:]:ina
+0.497   word[-2:]:na
+0.385   -1:word.istitle()
Weight?  Feature
+4.945   word.lower():al-qaida
+4.719   word.lower():philippine
+4.166   word.lower():hamas
+3.536   -1:word.lower():niger
+2.756   -1:word.lower():senator
+2.640   word.lower():xinhua
+2.626   word[-3:]:The
+2.624   word.lower():hezbollah
+2.621   word.lower():western
+2.526   -1:word.lower():mr.
+2.488   word.lower():taleban
+2.245   word[-3:]:ban
+2.212   word.lower():congress
+2.182   word.isupper()
+2.144   word.lower():singapore
+2.128   -1:word.lower():nepal
+2.049   word.lower():reuters
+2.024   -1:word.lower():olympic
+1.937   word.lower():parliament
+1.854   word.lower():justice
+1.794   word[-3:]:sed
+1.776   word.lower():taliban
+1.770   word.lower():european
+1.727   word.lower():navy
+1.683   -1:word.lower():john
+1.675   postag:NNPS
+1.668   word.lower():boeing
+1.655   word.lower():yukos
+1.637   word.lower():senate
+1.576   -1:word.lower():iraq
… 271 more positive …
… 99 more negative …
Weight?  Feature
+2.216   -1:word.lower():&
+2.035   +1:word.lower():post
+1.987   word[-3:]:for
+1.889   word.lower():ministry
+1.712   word.lower():department
+1.664   -1:word.lower():european
+1.631   -1:word.lower():u.s.
+1.559   -1:word.lower():group
+1.464   word[-3:]:ons
+1.393   -1:word.lower():militant
+1.349   -1:word.lower():india
+1.348   word.lower():council
+1.348   word[-3:]:cil
+1.286   word[-3:]:try
+1.249   word.lower():court
+1.233   word.lower():bank
+1.225   -1:word.lower():for
+1.108   word.lower():group
+1.082   word.lower():nations
+0.998   +1:word.lower():hamas
+0.977   word.lower():for
+0.972   word.lower():times
+0.925   -1:word.lower():people
+0.900   -1:word.lower():democratic
+0.899   word.lower():airlines
+0.884   -1:word.lower():of
… 176 more positive …
… 89 more negative …
-1.085   -1:word.lower():minister
-1.205   +1:word.lower():station
-1.554   word.lower():city
-2.368   word.lower():secretary
Weight?  Feature
+5.499   word.lower():prime
+3.681   word.lower():president
+3.665   word.lower():western
+3.521   word.lower():vice
+3.017   BOS
+2.741   word.lower():obama
+2.622   word.lower():al-zarqawi
+2.514   word.lower():senator
+2.491   word[-2:]:r.
+2.287   word[-2:]:s.
+2.089   word.lower():secretary
+1.751   +1:word.lower():administration
+1.722   word.lower():clinton
+1.647   word.lower():mr.
+1.647   word[-3:]:Mr.
+1.555   +1:word.lower():government
+1.345   word.lower():rice
+1.326   -1:word.lower():state
+1.313   -1:word.lower():u.s.
+1.233   word.lower():peter
+1.187   word[-2:]:ez
+1.167   postag:NNP
+1.154   word[-2:]:ri
+1.123   word[-3:]:yor
+1.093   word[-2:]:ll
… 150 more positive …
… 121 more negative …
-1.080   word[-2:]:th
-1.084   word.lower():venezuela
-1.108   word[-2:]:ew
-1.214   word[-2:]:st
-1.460   -1:word.lower():in
Weight?  Feature
+1.624   -1:word.lower():condoleezza
+1.445   word.lower():rice
+1.322   +1:word.lower():of
+1.309   -1:postag:NN
+1.291   +1:word.lower():reports
+1.135   word[-2:]:ez
+1.123   word.lower():obama
+1.102   postag:NNP
+1.077   -1:postag:NNP
+0.989   -1:word.lower():minister
+0.951   -1:postag[:2]:NN
+0.915   -1:word.lower():bin
+0.687   -1:word.lower():david
+0.684   word[-3:]:son
+0.681   word.lower():annan
+0.656   word[-3:]:med
+0.644   -1:word.lower():paul
+0.641   word[-3:]:nan
+0.606   -1:word.lower():michael
+0.580   -1:word.lower():jose
+0.571   -1:word.lower():secretary
+0.536   word.lower():laden
… 80 more positive …
… 56 more negative …
-0.640   word[-2:]:st
-0.663   word[-2:]:ka
-0.753   word[-2:]:ty
-0.950   word[-3:]:nal
-0.968   +1:postag[:2]:NN
-1.003   bias
-1.215   -1:word.lower():sri
-1.792   word[-3:]:ion
Weight?  Feature
+6.509   word[-3:]:day
+4.569   -1:word.lower():week
+4.254   word[-3:]:Day
+3.817   word.lower():february
+3.798   -1:word.lower():month
+3.610   word.lower():january
+3.455   -1:word.lower():months
+3.421   word.lower():august
+3.411   word[-2:]:0s
+3.234   +1:word.lower():week
+2.974   +1:word.lower():year
+2.929   word.lower():march
+2.928   -1:word.lower():last
+2.785   word.lower():christmas
+2.728   word.lower():june
+2.636   word[-2:]:ay
+2.635   -1:word.lower():early
+2.459   word[-3:]:ber
+2.447   word.lower():later
+2.410   -1:word.lower():next
+2.401   word[-3:]:uly
+2.398   -1:word.lower():days
+2.386   word.lower():several
+2.358   word[-2:]:th
+2.349   +1:word.lower():weeks
+2.345   +1:word.lower():years
+2.290   -1:word.lower():years
+2.204   -1:word.lower():late
+2.186   word.lower():since
+2.155   word[-3:]:ear
… 230 more positive …
… 76 more negative …
Weight?  Feature
+5.068   word[-3:]:day
+2.456   word[-3:]:.m.
+2.456   word[-2:]:m.
+2.219   word[-3:]:Day
+2.142   word.lower():decades
+2.097   word[-3:]:ber
+1.821   word[-2:]:ay
+1.709   word.lower():august
+1.650   word.lower():march
+1.629   +1:word.lower():months
+1.626   word.isdigit()
+1.530   -1:word.lower():january
+1.381   +1:word.lower():years
+1.342   word[-2:]:ry
+1.316   postag:CD
+1.316   postag[:2]:CD
+1.314   word.lower():april
+1.308   -1:word.lower():world
+1.254   word.lower():evening
+1.218   +1:word.lower():days
+1.194   -1:word.lower():july
+1.189   -1:word.lower():first
+1.074   word[-3:]:des
+1.067   word.lower():june
+1.065   -1:word.lower():end
+0.977   -1:word.lower():june
+0.929   word[-3:]:alf
+0.924   word[-2:]:ly
+0.893   word.lower():january
+0.891   -1:word.lower():december
… 112 more positive …
… 42 more negative …

As expected, the model now relies less on individual words and more on the context, since context features generalize better across training instances. This is an effect of the $l_1$-regularization, which pushes the weights of features that are only rarely useful to exactly zero.
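
To get a feeling for why the $l_1$ penalty removes features, here is a minimal sketch of the soft-thresholding operator that belongs to the $l_1$ norm. It is not the optimizer that trained the model above. The function name `soft_threshold`, the penalty value and the example weights are assumptions made up for illustration, only loosely modelled on the tables.

```python
def soft_threshold(weight, penalty):
    """Shrink a weight towards zero by `penalty`; weights inside [-penalty, penalty] become 0."""
    if weight > penalty:
        return weight - penalty
    if weight < -penalty:
        return weight + penalty
    return 0.0


# Hypothetical weights for illustration only (not actual model output).
weights = {
    "word.lower():australian": 0.024,
    "-1:postag:NNP": 0.675,
    "-1:word.istitle()": 0.653,
    "+1:postag[:2]:VB": -0.287,
}
penalty = 0.1  # assumed l1 strength, chosen arbitrarily

# Apply the shrinkage to every weight and keep only the non-zero ones.
shrunk = {name: soft_threshold(w, penalty) for name, w in weights.items()}
surviving = {name: w for name, w in shrunk.items() if w != 0.0}

for name, w in surviving.items():
    print(f"{w:+.3f}   {name}")
```

The weakly weighted word-identity feature drops out completely, while the stronger context features only lose a bit of magnitude. A squared $l_2$ penalty would instead scale all weights down without setting any of them exactly to zero.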

That's it for this time, but stay tuned for the next post, where we will look at named entity recognition with recurrent neural networks.

© depends-on-the-definition 2017-2022