This is the second post in my series about named entity recognition. If you haven’t seen the first one, have a look now. Last time we started by memorizing entities for words and then used a simple classification model to improve the results a bit. That model also used context properties and the structure of the word in question. But the results were not overwhelmingly good, so now we’re going to look into a more sophisticated algorithm, a so-called conditional random field (CRF).

We denote by x = (x_1,\dots, x_m) the input sequence, i.e. the words of a sentence, and by s = (s_1,\dots, s_m) the sequence of output states, i.e. the named entity tags. In conditional random fields we model the conditional probability

    \[p(s_1,\dots,s_m|x_1,\dots,x_m).\]

We do this by defining a feature map

    \[\Phi(x_1,\dots,x_m,s_1,\dots,s_m)\in\mathbb{R}^d\]

that maps an entire input sequence x paired with an entire state sequence s to some d-dimensional feature vector. Then we can model the probability as a log-linear model with the parameter vector w\in\mathbb{R}^d

    \[p(s|x; w) = \frac{\exp(w\cdot\Phi(x, s))}{\sum_{s^\prime} \exp(w\cdot\Phi(x, s^\prime))},\]

where s^\prime ranges over all possible output sequences. For the estimation of w, we assume that we have a set of n labeled examples \{(x^i, s^i)\}_{i=1}^n. Now we define the regularized log-likelihood function L

    \[L(w) = \sum_{i=1}^n \log p(s^i|x^i; w) - \frac{\lambda_2}{2}\|w\|_2^2 - \lambda_1 \|w\|_1.\]

The terms \frac{\lambda_2}{2}\|w\|_2^2 and \lambda_1 \|w\|_1 force the parameter vector to be small in the respective norm. This penalizes the model complexity and is known as regularization. The parameters \lambda_2 and \lambda_1 allow us to enforce more or less regularization. The parameter vector w^* is then estimated as

    \[w^* = \text{arg max}_{w\in \mathbb{R}^d} L(w).\]
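
There is no closed-form solution for this maximization, but L is concave, and the gradient of its log-likelihood part takes a well-known form for log-linear models: the observed feature counts minus the expected feature counts,

    \[\nabla_w \sum_{i=1}^n \log p(s^i|x^i; w) = \sum_{i=1}^n \Phi(x^i, s^i) - \sum_{i=1}^n \mathbb{E}_{s\sim p(s|x^i; w)}\left[\Phi(x^i, s)\right],\]

plus the derivatives of the regularization terms. We can therefore maximize L with gradient-based optimizers such as L-BFGS, which is exactly the algorithm we will hand to crfsuite below.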

Once we have estimated the vector w^*, we can find the most likely tag sequence s^* for a sentence x by

    \[s^* = \text{arg max}_{s} p(s|x; w^*).\]
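
Both the normalization in p(s|x; w) and this arg max range over exponentially many sequences s. For linear-chain CRFs this stays tractable because the feature map decomposes into local features that only look at neighboring tag pairs,

    \[\Phi(x, s) = \sum_{j=1}^m \phi(x, j, s_{j-1}, s_j),\]

so the arg max can be computed with the Viterbi algorithm and the normalization with the forward algorithm, both in O(mk^2) time for k possible tags.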

Now we want to apply this model. Let’s start by loading the data.

In [1]:
import pandas as pd
import numpy as np

data = pd.read_csv("ner_dataset.csv", encoding="latin1")
In [2]:
data = data.fillna(method="ffill")  # the "Sentence #" column is only set for the first word of each sentence
In [3]:
data.tail(10)
Out[3]:
         Sentence #         Word       POS  Tag
1048565  Sentence: 47958    impact     NN   O
1048566  Sentence: 47958    .          .    O
1048567  Sentence: 47959    Indian     JJ   B-gpe
1048568  Sentence: 47959    forces     NNS  O
1048569  Sentence: 47959    said       VBD  O
1048570  Sentence: 47959    they       PRP  O
1048571  Sentence: 47959    responded  VBD  O
1048572  Sentence: 47959    to         TO   O
1048573  Sentence: 47959    the        DT   O
1048574  Sentence: 47959    attack     NN   O
In [4]:
words = list(set(data["Word"].values))
In [5]:
n_words = len(words); n_words
Out[5]:
35178

So we have 47959 sentences containing 35178 different words. We change the SentenceGetter class from the last post a little and use it to retrieve sentences with their labels.
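
As a quick sanity check, the sentence count can also be read off the dataframe directly:

data["Sentence #"].nunique()  # 47959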

In [16]:
class SentenceGetter(object):
    
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        # Collect all (word, POS, tag) triples of a sentence into one list.
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]
    
    def get_next(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None
In [17]:
getter = SentenceGetter(data)
In [19]:
sent = getter.get_next()

This is how a sentence looks now.

In [20]:
print(sent)
[('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ('demonstrators', 'NNS', 'O'), ('have', 'VBP', 'O'), ('marched', 'VBN', 'O'), ('through', 'IN', 'O'), ('London', 'NNP', 'B-geo'), ('to', 'TO', 'O'), ('protest', 'VB', 'O'), ('the', 'DT', 'O'), ('war', 'NN', 'O'), ('in', 'IN', 'O'), ('Iraq', 'NNP', 'B-geo'), ('and', 'CC', 'O'), ('demand', 'VB', 'O'), ('the', 'DT', 'O'), ('withdrawal', 'NN', 'O'), ('of', 'IN', 'O'), ('British', 'JJ', 'B-gpe'), ('troops', 'NNS', 'O'), ('from', 'IN', 'O'), ('that', 'DT', 'O'), ('country', 'NN', 'O'), ('.', '.', 'O')]

Okay, that looks as expected; now we get all sentences.

In [21]:
sentences = getter.sentences

Now we craft a set of features and prepare the dataset.

In [22]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    # Features of the current word: identity, suffixes, shape and POS tag.
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        # Context features of the previous word.
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True  # beginning of sentence

    if i < len(sent)-1:
        # Context features of the next word.
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True  # end of sentence

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]
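
Before transforming the whole corpus, it is worth inspecting the features of a single token:

sent2features(sent)[0]

For the first word of our example sentence, ‘Thousands’, this dictionary contains, among others, 'word.lower()': 'thousands', 'word.istitle()': True, 'BOS': True, and the context features of the following word ‘of’.
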
In [23]:
X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]

Now we can initialize the algorithm. We use the conditional random field (CRF) implementation provided by sklearn-crfsuite. The parameters c1 and c2 set the strength of the l_1 and l_2 regularization; we will come back to them later.

In [24]:
from sklearn_crfsuite import CRF

crf = CRF(algorithm='lbfgs',
          c1=0.1,
          c2=0.1,
          max_iterations=100,
          all_possible_transitions=False)

Okay, let’s see if it works. Like last time, we perform a 5-fold cross-validation.

In [27]:
from sklearn.model_selection import cross_val_predict
from sklearn_crfsuite.metrics import flat_classification_report
In [26]:
pred = cross_val_predict(estimator=crf, X=X, y=y, cv=5)

We will use the scikit-learn classification report to evaluate the tagger, because we are mainly interested in precision, recall and the f1-score. These metrics are common in NLP tasks; if you are not familiar with them, check out the Wikipedia articles.
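
For reference, these metrics are computed per tag from the true positives (TP), false positives (FP) and false negatives (FN):

    \[\text{precision} = \frac{TP}{TP + FP},\qquad \text{recall} = \frac{TP}{TP + FN},\qquad F_1 = 2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision} + \text{recall}}.\]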

In [28]:
report = flat_classification_report(y_pred=pred, y_true=y)
print(report)
             precision    recall  f1-score   support

      B-art       0.37      0.11      0.17       402
      B-eve       0.52      0.35      0.42       308
      B-geo       0.85      0.90      0.88     37644
      B-gpe       0.97      0.94      0.95     15870
      B-nat       0.66      0.37      0.47       201
      B-org       0.78      0.72      0.75     20143
      B-per       0.84      0.81      0.82     16990
      B-tim       0.93      0.88      0.90     20333
      I-art       0.11      0.03      0.04       297
      I-eve       0.34      0.21      0.26       253
      I-geo       0.82      0.79      0.80      7414
      I-gpe       0.92      0.55      0.69       198
      I-nat       0.61      0.27      0.38        51
      I-org       0.81      0.79      0.80     16784
      I-per       0.84      0.89      0.87     17251
      I-tim       0.83      0.76      0.80      6528
          O       0.99      0.99      0.99    887908

avg / total       0.97      0.97      0.97   1048575

This looks like a good start. We easily beat the results from the last post.
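
Note that the averages are dominated by the huge O class. To judge the entity classes alone, you can restrict the report to the entity labels; a small sketch (the labels argument is passed through to scikit-learn’s classification_report):

# Score only the entity tags, ignoring the dominant O class.
labels = sorted(set(tag for row in y for tag in row) - {"O"})
print(flat_classification_report(y_pred=pred, y_true=y, labels=labels))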

In [29]:
crf.fit(X, y)
Out[29]:
CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=False, averaging=None, c=None, c1=0.1, c2=0.1,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
  gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
  max_linesearch=None, min_freq=None, model_filename=None,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose=False)

The nice thing about CRFs is that we can look inside the model and visualize the transition weights from one tag to another. We can also see which features are important for predicting a certain tag. We use the eli5 library to perform this investigation.

In [30]:
import eli5
In [31]:
eli5.show_weights(crf, top=30)
Out[31]:
Transition weights learned by the CRF (From \ To), strongest entries:

From O:  O → O 4.29,  O → B-per 4.17,  O → B-tim 2.986,  O → B-org 2.497,  O → B-geo 2.092,  O → B-nat 1.605,  O → B-eve 1.575,  O → B-gpe 1.387,  O → B-art 0.879;  all O → I-* weights are 0.0.

Within entities:

B-art → I-art  8.442      I-art → I-art  8.04
B-eve → I-eve  7.956      I-eve → I-eve  7.341
B-geo → I-geo  8.752      I-geo → I-geo  7.424
B-gpe → I-gpe  7.485      I-gpe → I-gpe  6.337
B-nat → I-nat  7.067      I-nat → I-nat  5.197
B-org → I-org  7.109      I-org → I-org  7.236
B-per → I-per  7.146      I-per → I-per  6.299
B-tim → I-tim  7.245      I-tim → I-tim  7.069

Top features per tag (weight, feature; excerpt):

y=O      +8.012 word.lower():last,  +7.999 word.lower():month,  +5.813 word.lower():chairman,  +5.612 word.lower():columbia,  +5.555 word.lower():year,  …
y=B-art  +5.369 word.lower():twitter,  +4.858 word.lower():spaceshipone,  +4.294 word.lower():nevirapine,  +4.271 +1:word.lower():enkhbayar,  +4.263 +1:word.lower():boots,  …
y=I-art  +3.025 -1:word.lower():boeing,  +2.553 +1:word.lower():gained,  +2.473 +1:word.lower():came,  +2.418 -1:word.lower():cajun,  +2.297 word.lower():notice,  …
y=B-eve  +4.333 word.lower():games,  +4.263 word.lower():ramadan,  +4.160 -1:word.lower():falklands,  +3.501 -1:word.lower():typhoon,  +3.484 word[-3:]:mes,  …
y=I-eve  +4.329 +1:word.lower():mascots,  +3.603 word.lower():games,  +3.022 +1:word.lower():era,  +2.756 word.lower():series,  +2.577 word.lower():dean,  …
y=B-geo  +6.238 word.lower():mid-march,  +6.002 word.lower():caribbean,  +5.503 word.lower():martian,  +5.446 word.lower():beijing,  +5.086 word.lower():persian,  …
y=I-geo  +4.211 word.lower():led-invasion,  +4.151 word.lower():holiday,  +4.065 word.lower():caribbean,  +3.651 +1:word.lower():possessions,  +3.446 +1:word.lower():regional,  …
y=B-gpe  +6.735 word.lower():afghan,  +6.602 word.lower():niger,  +6.219 word.lower():nepal,  +5.432 word.lower():spaniard,  +5.391 word.lower():azerbaijan,  …
y=I-gpe  +5.622 +1:word.lower():mayor,  +4.073 -1:word.lower():democratic,  +3.844 -1:word.lower():bosnian,  +3.602 +1:word.lower():developed,  +3.543 word.lower():korean,  …
y=B-nat  +6.149 word.lower():katrina,  +5.371 word.lower():marburg,  +4.334 word.lower():rita,  +3.535 +1:word.lower():shot,  +2.959 word[-3:]:ita,  …
y=I-nat  +2.681 word.lower():rita,  +2.327 word[-3:]:ita,  +2.315 +1:word.lower():outbreak,  +1.944 -1:word.lower():hurricanes,  +1.909 word[-2:]:ta,  …
y=B-org  +7.344 word.lower():philippine,  +6.075 word.lower():mid-march,  +5.812 word.lower():hamas,  +5.779 -1:word.lower():rice,  +5.629 word.lower():al-qaida,  …
y=I-org  +3.981 +1:word.lower():attained,  +3.785 +1:word.lower():reporter,  +3.486 -1:word.lower():associated,  +3.463 word.lower():singapore,  +3.400 word.lower():member-countries,  …
y=B-per  +7.301 word.lower():president,  +6.125 word.lower():obama,  +5.647 word.lower():senator,  +5.367 word.lower():greenspan,  +5.325 word.lower():vice,  …
y=I-per  +4.163 word.lower():obama,  +3.625 +1:word.lower():advisor,  +3.517 word.lower():pressewednesday,  +3.464 +1:word.lower():timothy,  +3.230 +1:word.lower():gao,  …
y=B-tim  +7.226 word.lower():multi-candidate,  +6.381 word.lower():february,  +6.335 word.lower():january,  +6.181 word.lower():2000,  +6.126 word.lower():one-year,  …
y=I-tim  +4.467 +1:word.lower():stocky,  +4.098 +1:word.lower():old,  +4.080 word.lower():working-age,  +3.831 word.lower():2000,  +3.821 word.lower():april,  …

Phew, it looks like the CRF is just memorizing a lot of words. For example, for the tag ‘B-per’ the algorithm remembers ‘president’ and ‘obama’. To overcome this issue we can tune the parameters, especially the regularization parameters of the CRF algorithm. The c1 and c2 parameters of the CRF algorithm are the regularization parameters \lambda_1 and \lambda_2: c1 weights the l_1 regularization and c2 weights the l_2 regularization. We now limit the number of features used by enforcing sparsity on the parameter vector w. To do this, we increase the l_1-regularization parameter c1.

In [42]:
crf = CRF(algorithm='lbfgs',
          c1=10,
          c2=0.1,
          max_iterations=100,
          all_possible_transitions=False)
In [43]:
pred = cross_val_predict(estimator=crf, X=X, y=y, cv=5)
In [47]:
report = flat_classification_report(y_pred=pred, y_true=y)
print(report)
             precision    recall  f1-score   support

      B-art       0.00      0.00      0.00       402
      B-eve       0.80      0.27      0.40       308
      B-geo       0.82      0.90      0.86     37644
      B-gpe       0.95      0.92      0.94     15870
      B-nat       0.69      0.09      0.16       201
      B-org       0.78      0.67      0.72     20143
      B-per       0.80      0.76      0.78     16990
      B-tim       0.93      0.83      0.88     20333
      I-art       0.00      0.00      0.00       297
      I-eve       0.64      0.12      0.20       253
      I-geo       0.81      0.73      0.77      7414
      I-gpe       0.93      0.37      0.53       198
      I-nat       0.00      0.00      0.00        51
      I-org       0.75      0.76      0.75     16784
      I-per       0.80      0.90      0.85     17251
      I-tim       0.84      0.67      0.74      6528
          O       0.99      0.99      0.99    887908

avg / total       0.96      0.97      0.96   1048575

/home/private/envs/dev/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

This still looks quite good, although the rare classes like B-art, I-art and I-nat now get no predictions at all, which is what the warning above is about.

In [45]:
crf.fit(X, y)
Out[45]:
CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=False, averaging=None, c=None, c1=10, c2=0.1,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
  gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
  max_linesearch=None, min_freq=None, model_filename=None,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose=False)
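
Since we now have a fitted model, we can also tag a sentence that is not part of the dataset. A minimal sketch; the example sentence and its POS tags are hand-made here, while in a real pipeline they would come from a POS tagger:

# A toy (word, POS) sequence; word2features only uses these two entries.
new_sent = [("President", "NNP"), ("Obama", "NNP"), ("visited", "VBD"),
            ("Paris", "NNP"), ("last", "JJ"), ("week", "NN"), (".", ".")]
print(crf.predict_single(sent2features(new_sent)))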

Now we look again at the features.

In [46]:
eli5.show_weights(crf, top=30)
Out[46]:
Transition weights learned by the CRF (From \ To), strongest entries:

From O:  O → O 4.037,  O → B-per 4.301,  O → B-art 2.614,  O → B-org 2.589,  O → B-tim 2.546,  O → B-eve 2.167,  O → B-geo 2.069,  O → B-nat 1.788,  O → B-gpe 1.64;  all O → I-* weights are again 0.0.

Within entities:

B-art → I-art  7.041      I-art → I-art  7.378
B-eve → I-eve  8.084      I-eve → I-eve  7.19
B-geo → I-geo  10.604     I-geo → I-geo  7.889
B-gpe → I-gpe  4.568      I-gpe → I-gpe  0.0
B-nat → I-nat  6.942      I-nat → I-nat  0.0
B-org → I-org  8.056      I-org → I-org  7.078
B-per → I-per  6.393      I-per → I-per  6.135
B-tim → I-tim  8.243      I-tim → I-tim  7.534

The stronger regularization has pushed the I-gpe and I-nat rows almost entirely to zero.

Top features per tag (weight, feature; excerpt):

y=O      +5.227 word[-2:]:N1,  +4.445 word.lower():last,  +4.243 word.lower():jewish,  +4.026 word.lower():hurricane,  +3.933 EOS,  …
y=B-art  +1.371 word.lower():english,  +1.043 word[-3:]:ish,  +0.760 word[-3:]:ook,  +0.706 -1:postag[:2]:“,  +0.706 -1:postag:“,  …
y=I-art  +0.675 -1:postag:NNP,  +0.653 -1:word.istitle(),  +0.520 -1:postag[:2]:NN,  +0.271 word.istitle(),  +0.171 +1:postag[:2]:NN,  …
y=B-eve  +4.325 -1:word.lower():war,  +1.416 word[-3:]:II,  +1.416 word.lower():ii,  +1.415 word[-2:]:II,  +1.045 +1:word.lower():open,  …
y=I-eve  +1.639 word.lower():games,  +1.185 word.lower():open,  +1.182 word[-3:]:pen,  +0.934 word[-3:]:Day,  +0.906 -1:word.lower():war,  …
y=B-geo  +3.396 word.lower():beijing,  +2.833 -1:word.lower():mr.,  +2.719 word.lower():israel,  +2.656 word.lower():iran,  +2.425 word.lower():britain,  …
y=I-geo  +2.929 -1:word.lower():san,  +2.782 word.lower():airport,  +2.180 word.lower():island,  +1.887 -1:word.lower():of,  +1.823 -1:word.lower():gulf,  …
y=B-gpe  +4.359 word.lower():niger,  +4.050 word.istitle(),  +3.524 word.lower():nepal,  +3.114 word.lower():afghan,  +3.101 word[-3:]:pal,  …
y=I-gpe  +2.738 -1:word.lower():bosnian,  +2.330 -1:postag:NNP,  +2.036 word.istitle(),  +1.426 -1:word.lower():north,  +1.357 word.lower():cypriots,  …
y=B-nat  +3.889 word.lower():katrina,  +2.306 word[-2:]:N1,  +2.063 word.isupper(),  +1.973 word.lower():rita,  +1.904 word.lower():marburg,  …
y=I-nat  +1.087 -1:postag[:2]:NN,  +0.750 -1:word.lower():hurricane,  +0.627 word.lower():katrina,  +0.538 word[-3:]:ina,  +0.497 word[-2:]:na,  …
y=B-org  +4.945 word.lower():al-qaida,  +4.719 word.lower():philippine,  +4.166 word.lower():hamas,  +3.536 -1:word.lower():niger,  +2.756 -1:word.lower():senator,  …
y=I-org  +2.216 -1:word.lower():&,  +2.035 +1:word.lower():post,  +1.987 word[-3:]:for,  +1.889 word.lower():ministry,  +1.712 word.lower():department,  …
y=B-per  +5.499 word.lower():prime,  +3.681 word.lower():president,  +3.665 word.lower():western,  +3.521 word.lower():vice,  +3.017 BOS,  …
y=I-per  +1.624 -1:word.lower():condoleezza,  +1.445 word.lower():rice,  +1.322 +1:word.lower():of,  +1.309 -1:postag:NN,  +1.291 +1:word.lower():reports,  …
y=B-tim  +6.509 word[-3:]:day,  +4.569 -1:word.lower():week,  +4.254 word[-3:]:Day,  +3.817 word.lower():february,  +3.798 -1:word.lower():month,  …
y=I-tim  +5.068 word[-3:]:day,  +2.456 word[-3:]:.m.,  +2.456 word[-2:]:m.,  +2.219 word[-3:]:Day,  +2.142 word.lower():decades,  …

As expected, we see that the model relies less on memorized words and more on context and word shape, since these features generalize better across training instances. This is an effect of the l_1 regularization.
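
Instead of picking c1 and c2 by hand, you can also search over them automatically, as the sklearn-crfsuite documentation suggests. A minimal sketch; the search distributions, the number of iterations and the weighted-F1 scorer are choices made here for illustration:

from scipy.stats import expon
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn_crfsuite import CRF
from sklearn_crfsuite.metrics import flat_f1_score

crf = CRF(algorithm='lbfgs', max_iterations=100, all_possible_transitions=False)

# Draw the regularization strengths from exponential distributions.
param_space = {'c1': expon(scale=0.5), 'c2': expon(scale=0.05)}

# Select the model by the flattened, weighted F1 score.
f1_scorer = make_scorer(flat_f1_score, average='weighted')

rs = RandomizedSearchCV(crf, param_space, cv=3, n_iter=10,
                        scoring=f1_scorer, verbose=1, n_jobs=-1)
rs.fit(X, y)
print(rs.best_params_, rs.best_score_)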

This is it for this time, but stay tuned for the next post, where we will look at named entity recognition with recurrent neural networks.
