An important part of every machine learning project is the proper evaluation of the system's performance. In this post I will show you how to evaluate sequence models with token-based labels. This is especially tricky because:

  • some entity types occur more often than others
  • entities can span multiple tokens.

The first problem is solved by picking the right metric. We will see which kinds of metrics are suitable for this. The second problem is solved by aggregating the token-level predictions in the right way. For this, we first learn about label schemas for multi-token entity recognition.

The label schema

Suppose we are given the following tokenized sentence:

Hawking was a Fellow of the Royal Society, a lifetime member of the Pontifical Academy of Sciences, and a recipient of the Presidential Medal of Freedom, the highest civilian award in the United States.

In [1]:
sentence = ["Hawking", "was", "a", "Fellow", "of", "the", "Royal", "Society", ",", "a", "lifetime", "member",
            "of", "the", "Pontifical", "Academy", "of", "Sciences", ",", "and", "a", "recipient", "of",
            "the", "Presidential", "Medal", "of", "Freedom", ",", "the", "highest", "civilian", "award",
            "in", "the", "United", "States", "."]

We also have the tags “per” for person, “org” for organization, “geo” for geo-political unit and “O” for no entity. So we would probably label the sentence as follows:

In [12]:
labels = ["per", "O", "O", "O", "O", "O", "org", "org", "O", "O", "O", "O", "O", "O", "org",
          "org", "org", "org", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O",
          "O", "O", "O", "O", "O", "geo", "geo", "O"]

But how would we now identify which tokens belong to the same entity? Is Pontifical Academy of Sciences one entity or several? This is why one modifies the label schema to capture the beginning and the continuation of an entity. We prefix the entity type with B- if the token begins an entity and with I- if the token continues one. Every I- label has to be preceded by a B- or I- labeled token of the same type. This means we don’t allow discontinuous entities. Handling those would require relationship extraction, which we don’t cover here. The described label schema is called the BIO schema (B for beginning, I for inside, O for outside). So our label sequence now looks like this:

In [27]:
labels_bio = ["B-per", "O", "O", "O", "O", "O", "B-org", "I-org", "O", "O", "O", "O", "O", "O",
              "B-org", "I-org", "I-org", "I-org", "O", "O", "O", "O", "O", "O", "O", "O", "O",
              "O", "O", "O", "O", "O", "O", "O", "O", "B-geo", "I-geo", "O"]

The right metrics

In the above example, you will notice that the “O” label is by far the most common. So what accuracy would we get if we always predicted “O”?

In [20]:
pred_O = ["O"] * len(labels_bio)
print(pred_O)
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
In [21]:
correct_cnt = 0
for t, p in zip(labels_bio, pred_O):
    if t == p:
        correct_cnt +=1
accuracy = correct_cnt/len(labels_bio)
print("Accuracy: {:.1%}".format(accuracy))
Accuracy: 76.3%

This is a pretty high score and quite misleading! Often the accuracy would even be 95% or higher without the model learning anything. To cope with this problem, one uses precision, recall and F1-score per class. Let’s see how to compute them. They are computed using the following three numbers:

  • true positives (tp): number of items labeled as a class that are also predicted as that class
  • false positives (fp): number of items predicted as a class that are not labeled as belonging to it
  • false negatives (fn): number of items labeled as a class that are not predicted as belonging to it

Then precision is defined as:

    \[precision = \frac{tp}{tp+fp}\]

and recall is defined as:

    \[recall = \frac{tp}{tp+fn}.\]

Furthermore, the F1-score is defined as the harmonic mean of precision and recall:

    \[\text{f1-score} = 2\cdot\frac{precision \cdot recall}{precision + recall}.\]
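The three formulas can be sketched in a few lines of Python. The helper name `precision_recall_f1` and the counts in the example are hypothetical. Returning 0.0 when a denominator is zero is an assumption of this sketch, chosen so that a class that is never predicted does not raise an error.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1-score from raw counts."""
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts for one class: 3 hits, 1 false alarm, 2 misses.
p, r, f = precision_recall_f1(tp=3, fp=1, fn=2)
print("precision={:.2f} recall={:.2f} f1={:.2f}".format(p, r, f))
# precision=0.75 recall=0.60 f1=0.67
```

Note that a model which never predicts a class has tp = 0 and fp = 0 for it, so all three scores are 0 no matter how high the overall accuracy is.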

Putting it all together

To understand our model's performance, we now count a prediction as correct only if the full entity matches, that is, its type and all of its BIO tags, and then compute the class-wise precision, recall and F1-score. Luckily, there is the neat Python package seqeval that does this for us in a standardized way.

In [24]:
from seqeval.metrics import precision_score, recall_score, f1_score, classification_report
In [28]:
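# seqeval works on a list of tag sequences (one per sentence),
# so we wrap our single sentence in a list.
print(classification_report([labels_bio], [pred_O]))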
             precision    recall  f1-score   support

        per       0.00      0.00      0.00         1
        org       0.00      0.00      0.00         2
        geo       0.00      0.00      0.00         1

avg / total       0.00      0.00      0.00         4

Let’s look at a slightly smarter model:

In [35]:
pred_smarter = ["B-per", "O", "O", "O", "O", "O", "B-org", "B-org", "O", "O", "O", "O", "O",
                "O", "B-org", "I-org", "B-org", "I-org", "O", "O", "O", "O", "O", "O", "O",
                "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B-geo", "I-geo", "O"]
In [36]:
# Again, wrap the single sentence in a list of sequences.
print(classification_report([labels_bio], [pred_smarter]))
             precision    recall  f1-score   support

        per       1.00      1.00      1.00         1
        org       0.00      0.00      0.00         2
        geo       1.00      1.00      1.00         1

avg / total       0.50      0.50      0.50         4
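To see where the zeros for org come from, the entity-level counts can be recomputed by hand. The following is a sketch of exact-match counting as described in this post, not seqeval's actual implementation. The helper `spans` and the compact way of writing the two tag lists are assumptions made for illustration.

```python
def spans(tags):
    """Return the set of (type, start, end) entity spans in a BIO tag list."""
    found, start = set(), None
    for i, tag in enumerate(tags + ["O"]):  # the extra "O" closes a trailing entity
        inside = (tag.startswith("I-") and start is not None
                  and tag[2:] == tags[start][2:])
        if not inside:
            if start is not None:
                found.add((tags[start][2:], start, i - 1))
            # Any non-"O" tag that does not continue an entity opens a new one.
            start = i if tag != "O" else None
    return found


# The same 38 tags as labels_bio and pred_smarter, written compactly.
true = (["B-per"] + ["O"] * 5 + ["B-org", "I-org"] + ["O"] * 6
        + ["B-org", "I-org", "I-org", "I-org"] + ["O"] * 17
        + ["B-geo", "I-geo", "O"])
pred = (["B-per"] + ["O"] * 5 + ["B-org", "B-org"] + ["O"] * 6
        + ["B-org", "I-org", "B-org", "I-org"] + ["O"] * 17
        + ["B-geo", "I-geo", "O"])

true_spans, pred_spans = spans(true), spans(pred)
counts = {}
for etype in ["per", "org", "geo"]:
    t = {s for s in true_spans if s[0] == etype}
    p = {s for s in pred_spans if s[0] == etype}
    # An entity counts as correct only if type, start and end all match.
    counts[etype] = {"tp": len(t & p), "fp": len(p - t), "fn": len(t - p)}
    print(etype, counts[etype])
# per {'tp': 1, 'fp': 0, 'fn': 0}
# org {'tp': 0, 'fp': 4, 'fn': 2}
# geo {'tp': 1, 'fp': 0, 'fn': 0}
```

The model predicts four org entities, none of which has the same boundaries as one of the two labeled organizations, so org ends up with tp = 0 and therefore precision and recall of 0.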

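Although the smarter model tags every organization token with some org label, it splits both organizations into pieces, so neither counts as a correctly predicted entity. This way you can get a proper understanding of your sequence model's performance. Here you can see how this approach is applied to named entity recognition.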
