In 2018 we saw the rise of pretraining and fine-tuning in natural language processing. Large neural networks are trained on general tasks like language modelling and then fine-tuned for classification tasks. One of the latest milestones in this development is the release of BERT, a model that broke several records for how well models can handle language-based tasks. The model is based on the transformer architecture from “Attention is all you need” and was pre-trained in a bidirectional way on several language modelling tasks. So the new slogan should probably read “Attention and pre-training is all you need”. If you want more details about the model and its pre-training, you will find some resources at the end of this post.

This is a new post in my NER series. I will show you how you can fine-tune the BERT model to do state-of-the-art named entity recognition (NER) in Python with PyTorch. First, install the PyTorch BERT package by Hugging Face with:

pip install pytorch-pretrained-bert==0.4.0

Now you have access to the pre-trained BERT models and the PyTorch wrappers we will use here. Since some of you reported problems with the code below when using newer versions of pytorch-pretrained-bert, I recommend using version 0.4.0 and Python >= 3.6 for now.

We use the data set you already know from my previous posts about named entity recognition.

In [1]:
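import pandas as pd
import numpy as np
from tqdm import tqdm, trange

# Assumption: the file name and encoding are not given in this post; adjust them to your copy of the data.
# Forward-fill so that every row carries its "Sentence #" value.
data = pd.read_csv("ner_dataset.csv", encoding="latin1").ffill()

data.tail(10)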

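Out[1]:

| | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 1048565 | Sentence: 47958 | impact | NN | O |
| 1048566 | Sentence: 47958 | . | . | O |
| 1048567 | Sentence: 47959 | Indian | JJ | B-gpe |
| 1048568 | Sentence: 47959 | forces | NNS | O |
| 1048569 | Sentence: 47959 | said | VBD | O |
| 1048570 | Sentence: 47959 | they | PRP | O |
| 1048571 | Sentence: 47959 | responded | VBD | O |
| 1048572 | Sentence: 47959 | to | TO | O |
| 1048573 | Sentence: 47959 | the | DT | O |
| 1048574 | Sentence: 47959 | attack | NN | O |

In [2]: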
class SentenceGetter(object):

    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        # collect (word, POS, tag) triples for every sentence
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            # no sentence with this number is left
            return None

In [3]:
getter = SentenceGetter(data)


This is what the sentences in the dataset look like.

In [4]:
sentences = [" ".join([s[0] for s in sent]) for sent in getter.sentences]
sentences[0]

Out[4]:
'Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .'

The sentences are annotated with the BIO scheme, and the labels look like this.

In [5]:
labels = [[s[2] for s in sent] for sent in getter.sentences]
print(labels[0])

['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']
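To make the BIO scheme concrete, here is a minimal sketch of how such a tag sequence can be turned back into entity spans. The function name `bio_to_spans` and the lenient handling of a stray `I-` tag are my own assumptions for illustration, not part of the dataset or of any library.

```python
def bio_to_spans(tags):
    """Collect (entity_type, start, end) spans from a BIO tag sequence.

    `end` is exclusive. An I- tag without a matching open span is treated
    as the start of a new span (a lenient convention, assumed here).
    """
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current
        if not continues:
            # close the span that was open, if any
            if current is not None:
                spans.append((current, start, i))
            # open a new span on B- (or a stray I-), otherwise stay outside
            start, current = (i, etype) if prefix in ("B", "I") else (None, None)
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


example_tags = ["O", "B-geo", "I-geo", "O", "B-gpe", "O"]
example_spans = bio_to_spans(example_tags)
print(example_spans)  # [('geo', 1, 3), ('gpe', 4, 5)]
```

Each span covers the tokens from `start` up to, but not including, `end`.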

In [6]:
# sort the tags so that the tag-to-index mapping is the same in every run
tags_vals = sorted(set(data["Tag"].values))
tag2idx = {t: i for i, t in enumerate(tags_vals)}


# Apply BERT

## Prepare the sentences and labels

Before we can start fine-tuning the model, we have to prepare the data set for use with PyTorch and BERT.

In [7]:
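import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
# pad_sequences is used below to cut and pad the sequences
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from pytorch_pretrained_bert import BertTokenizer, BertConfig
# the token classification model is loaded further down
from pytorch_pretrained_bert import BertForTokenClassification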

Using TensorFlow backend.


Here we fix some configurations. We limit the sequence length to 75 tokens and use a batch size of 32, as suggested by the BERT paper. Note that BERT natively supports sequences of up to 512 tokens.

In [8]:
MAX_LEN = 75
bs = 32

In [9]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()

In [10]:
torch.cuda.get_device_name(0)

Out[10]:
'GeForce GTX 1080 Ti'

The BERT implementation comes with a pretrained tokenizer and a defined vocabulary. We load the one for the smallest pre-trained model, bert-base-uncased. Also try the cased variant, since it is well suited for NER.

In [11]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

12/09/2018 21:12:26 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /home/tobias/.pytorch_pretrained_bert/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084


Now we tokenize all sentences.

In [12]:
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
print(tokenized_texts[0])

['thousands', 'of', 'demonstrators', 'have', 'marched', 'through', 'london', 'to', 'protest', 'the', 'war', 'in', 'iraq', 'and', 'demand', 'the', 'withdrawal', 'of', 'british', 'troops', 'from', 'that', 'country', '.']
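In this example every word maps to exactly one token, but a WordPiece tokenizer may split rarer words into several pieces, and then the word-level labels no longer line up with the tokens. The following is a minimal sketch of one way to keep them aligned. `align_labels` and `toy_tokenize` are hypothetical helpers; `toy_tokenize` is only a stand-in so the example runs without downloading a vocabulary, and in practice you would pass `tokenizer.tokenize` instead.

```python
def align_labels(words, labels, tokenize):
    """Expand word-level BIO labels to sub-token level.

    `tokenize` is any callable that maps one word to a list of sub-tokens.
    The first piece keeps the word's label; the remaining pieces of a
    B-X word are labelled I-X so the entity is not split in two.
    """
    tokens, token_labels = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        continuation = "I-" + label[2:] if label.startswith("B-") else label
        tokens.extend(pieces)
        token_labels.extend([label] + [continuation] * (len(pieces) - 1))
    return tokens, token_labels


def toy_tokenize(word):
    # Hypothetical stand-in for WordPiece: lower-case the word and cut it
    # into 4-character chunks, marking continuations with "##".
    word = word.lower()
    chunks = [word[i:i + 4] for i in range(0, len(word), 4)]
    return [chunks[0]] + ["##" + c for c in chunks[1:]]


aligned_tokens, aligned_labels = align_labels(
    ["London", "withdrawal"], ["B-geo", "O"], toy_tokenize
)
print(aligned_tokens)  # ['lond', '##on', 'with', '##draw', '##al']
print(aligned_labels)  # ['B-geo', 'I-geo', 'O', 'O', 'O']
```

Labelling continuation pieces with `I-` is a design choice. Another common option is to label only the first piece of each word and ignore the rest during training and evaluation.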


Next, we cut and pad the token and label sequences to our desired length.

In [13]:
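input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],
                          maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

In [14]:
# Pad the labels with the "O" tag (an assumption: the padding value is not shown in this post).
tags = pad_sequences([[tag2idx.get(l) for l in lab] for lab in labels],
                     maxlen=MAX_LEN, value=tag2idx["O"], padding="post",
                     dtype="long", truncating="post")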


The BERT model supports an attention_mask, which is similar to masking in Keras. Here we create the mask so that the padded elements in the sequences are ignored.

In [15]:
attention_masks = [[float(i>0) for i in ii] for ii in input_ids]
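The two steps above (cutting or padding to a fixed length, then masking the padding) can be sketched in plain Python as follows. This is only an illustration of the idea, not the Keras implementation; `pad_and_mask` is a hypothetical helper. It builds the mask from the sequence lengths, whereas the `i > 0` test above relies on the padding id being 0.

```python
def pad_and_mask(id_sequences, max_len, pad_id=0):
    """Truncate or pad each id sequence at the end and build its attention mask."""
    padded, masks = [], []
    for ids in id_sequences:
        ids = list(ids)[:max_len]          # cut sequences that are too long
        n_pad = max_len - len(ids)         # number of padding positions needed
        padded.append(ids + [pad_id] * n_pad)
        masks.append([1.0] * len(ids) + [0.0] * n_pad)  # 1.0 = real token
    return padded, masks


demo_ids, demo_masks = pad_and_mask([[7, 8, 9], [5, 6, 7, 8, 9, 10]], max_len=5)
print(demo_ids)    # [[7, 8, 9, 0, 0], [5, 6, 7, 8, 9]]
print(demo_masks)  # [[1.0, 1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0, 1.0]]
```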


Now we split the dataset to use 10% to validate the model.

In [16]:
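tr_inputs, val_inputs, tr_tags, val_tags = train_test_split(input_ids, tags,
                                                            random_state=2018, test_size=0.1)
# split the masks with the same random_state so they stay aligned with the inputs
tr_masks, val_masks, _, _ = train_test_split(attention_masks, input_ids,
                                             random_state=2018, test_size=0.1)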
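Inputs, masks and tags must be split in exactly the same way, otherwise a sentence ends up paired with the wrong labels. One way to make that explicit is to shuffle the indices once and reuse them for every array. The sketch below uses only the standard library; `split_indices` and `take` are hypothetical helpers, and the split sizes are not guaranteed to match what scikit-learn would produce.

```python
import random


def split_indices(n_items, test_size=0.1, seed=2018):
    """Return (train_indices, validation_indices) from one seeded shuffle."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(n_items * test_size))
    return indices[n_val:], indices[:n_val]


def take(rows, indices):
    """Select the rows at the given positions, in that order."""
    return [rows[i] for i in indices]


# toy data: row i of every list belongs to the same sentence
toy_inputs = [[i, i + 100] for i in range(10)]
toy_masks = [[1.0, float(i % 2)] for i in range(10)]
toy_tags = [[i * 10, 0] for i in range(10)]

train_idx, val_idx = split_indices(len(toy_inputs))
toy_tr_inputs, toy_val_inputs = take(toy_inputs, train_idx), take(toy_inputs, val_idx)
toy_tr_masks, toy_val_masks = take(toy_masks, train_idx), take(toy_masks, val_idx)
toy_tr_tags, toy_val_tags = take(toy_tags, train_idx), take(toy_tags, val_idx)
print(len(toy_tr_inputs), len(toy_val_inputs))  # 9 1
```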


Since we’re operating in pytorch, we have to convert the dataset to torch tensors.

In [17]:
tr_inputs = torch.tensor(tr_inputs)
val_inputs = torch.tensor(val_inputs)
tr_tags = torch.tensor(tr_tags)
val_tags = torch.tensor(val_tags)
tr_masks = torch.tensor(tr_masks)
val_masks = torch.tensor(val_masks)


The last step is to define the dataloaders. We shuffle the data at training time with the RandomSampler and at test time we just pass them sequentially with the SequentialSampler.

In [18]:
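train_data = TensorDataset(tr_inputs, tr_masks, tr_tags)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=bs)

valid_data = TensorDataset(val_inputs, val_masks, val_tags)
valid_sampler = SequentialSampler(valid_data)
valid_dataloader = DataLoader(valid_data, sampler=valid_sampler, batch_size=bs)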


## Set up the BERT model for fine-tuning

The pytorch-pretrained-bert package provides a BertForTokenClassification class for token-level predictions. BertForTokenClassification is a fine-tuning model that wraps BertModel and adds a token-level classifier on top of it. The token-level classifier is a linear layer that takes the last hidden state of each token in the sequence as input. We load the pre-trained bert-base-uncased model and provide the number of possible labels.

In [19]:
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=len(tag2idx))

12/09/2018 21:12:46 - INFO - pytorch_pretrained_bert.modeling -   loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz from cache at /home/tobias/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
12/09/2018 21:12:46 - INFO - pytorch_pretrained_bert.modeling -   extracting archive file /home/tobias/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba to temp dir /tmp/tmpsexefyt8
12/09/2018 21:12:48 - INFO - pytorch_pretrained_bert.modeling -   Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 30522
}

12/09/2018 21:12:50 - INFO - pytorch_pretrained_bert.modeling -   Weights of BertForTokenClassification not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
12/09/2018 21:12:50 - INFO - pytorch_pretrained_bert.modeling -   Weights from pretrained model not used in BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.gamma', 'cls.predictions.transform.LayerNorm.beta', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']


Now we have to pass the model parameters to the GPU.

In [20]:
model.cuda();


Before we can start the fine-tuning process, we have to set up the optimizer and add the parameters it should update. A common choice is the Adam optimizer. We also add some weight decay as regularization to the main weight matrices. If you have limited resources, you can also try to train only the linear classifier on top of BERT and keep all other weights fixed. This will still give you good performance.

In [21]:
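from torch.optim import Adam

FULL_FINETUNING = True
if FULL_FINETUNING:
    param_optimizer = list(model.named_parameters())
    # no weight decay for biases and LayerNorm parameters
    no_decay = ['bias', 'gamma', 'beta']
    # torch.optim.Adam reads the per-group option "weight_decay"
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay': 0.0}
    ]
else:
    # only train the linear classifier on top of BERT
    param_optimizer = list(model.classifier.named_parameters())
    optimizer_grouped_parameters = [{"params": [p for n, p in param_optimizer]}]

# Assumption: the learning rate is not shown in this post; 3e-5 is one of the values suggested in the BERT paper.
optimizer = Adam(optimizer_grouped_parameters, lr=3e-5)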
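The grouping logic is easier to follow on parameter names alone. The sketch below applies the same substring test to a few names taken from the loading log above; `split_by_decay` is a hypothetical helper used only for illustration, and it returns names rather than tensors.

```python
def split_by_decay(names, no_decay=("bias", "gamma", "beta")):
    """Split parameter names into (decayed, not_decayed) by substring match."""
    decayed = [n for n in names if not any(nd in n for nd in no_decay)]
    not_decayed = [n for n in names if any(nd in n for nd in no_decay)]
    return decayed, not_decayed


# a few parameter names as they appear in the model loading log
param_names = [
    "classifier.weight",
    "classifier.bias",
    "cls.predictions.transform.dense.weight",
    "cls.predictions.transform.LayerNorm.gamma",
    "cls.predictions.transform.LayerNorm.beta",
]

decayed, not_decayed = split_by_decay(param_names)
print(decayed)      # ['classifier.weight', 'cls.predictions.transform.dense.weight']
print(not_decayed)  # the bias and the two LayerNorm parameters
```

Because the test is a plain substring match, any parameter whose name happens to contain one of the three strings is excluded from weight decay, so it is worth checking the two lists once for your model.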


## Fine-tune BERT

First we define some metrics we want to track while training. We use the f1_score from the seqeval package. You can find more details here. We also use simple token-level accuracy, comparable to the accuracy in Keras.

In [22]:
from seqeval.metrics import f1_score

def flat_accuracy(preds, labels):
    # token-level accuracy over all positions, including padding
    pred_flat = np.argmax(preds, axis=2).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
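Note that flat_accuracy also counts the padded positions, which are easy to get right and can therefore inflate the score. As a point of comparison, here is a minimal pure-Python sketch of an accuracy that only looks at positions where the attention mask is set. `masked_accuracy` is a hypothetical helper and works on nested lists of label ids rather than on logits.

```python
def masked_accuracy(pred_ids, label_ids, masks):
    """Token-level accuracy over the positions where the mask is non-zero."""
    correct, total = 0, 0
    for preds, labels, mask in zip(pred_ids, label_ids, masks):
        for p, l, m in zip(preds, labels, mask):
            if m:  # skip padded positions
                total += 1
                correct += int(p == l)
    return correct / total if total else 0.0


acc_preds = [[1, 2, 0, 0]]
acc_labels = [[1, 3, 0, 0]]
acc_masks = [[1.0, 1.0, 0.0, 0.0]]
acc = masked_accuracy(acc_preds, acc_labels, acc_masks)
print(acc)  # 0.5, while counting the two padded positions would give 0.75
```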


Finally, we can fine-tune the model. A few epochs should be enough. The paper suggests 3-4 epochs.

In [23]:
epochs = 5

for _ in trange(epochs, desc="Epoch"):
    # TRAIN loop
    model.train()
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    for batch in train_dataloader:
        # move the batch to the device
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        # reset the gradients from the previous step
        optimizer.zero_grad()
        # forward pass
        loss = model(b_input_ids, token_type_ids=None,
                     attention_mask=b_input_mask, labels=b_labels)
        # backward pass
        loss.backward()
        # track train loss
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1
        # update parameters
        optimizer.step()
    # print train loss per epoch
    print("Train loss: {}".format(tr_loss/nb_tr_steps))
    # VALIDATION on validation set
    model.eval()
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    predictions , true_labels = [], []
    for batch in valid_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch

        # no gradients are needed for evaluation
        with torch.no_grad():
            tmp_eval_loss = model(b_input_ids, token_type_ids=None,
                                  attention_mask=b_input_mask, labels=b_labels)
            logits = model(b_input_ids, token_type_ids=None,
                           attention_mask=b_input_mask)
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        predictions.extend([list(p) for p in np.argmax(logits, axis=2)])
        true_labels.append(label_ids)

        tmp_eval_accuracy = flat_accuracy(logits, label_ids)

        eval_loss += tmp_eval_loss.mean().item()
        eval_accuracy += tmp_eval_accuracy

        nb_eval_examples += b_input_ids.size(0)
        nb_eval_steps += 1
    eval_loss = eval_loss/nb_eval_steps
    print("Validation loss: {}".format(eval_loss))
    print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))
    # one list of tags per sentence; seqeval expects (y_true, y_pred)
    pred_tags = [[tags_vals[p_i] for p_i in p] for p in predictions]
    valid_tags = [[tags_vals[l_ii] for l_ii in l_i] for l in true_labels for l_i in l]
    print("F1-Score: {}".format(f1_score(valid_tags, pred_tags)))

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]
Train loss: 0.09842512894109764
Validation loss: 0.05414908993989229
Validation Accuracy: 0.9827158730158724

Epoch:  20%|██        | 1/5 [06:33<26:12, 393.01s/it]
F1-Score: 0.6959586046926577
Train loss: 0.048297671556991946
Validation loss: 0.04440684294948975
Validation Accuracy: 0.9861757936507938

Epoch:  40%|████      | 2/5 [13:06<19:39, 393.02s/it]
F1-Score: 0.7563591786699357
Train loss: 0.03808436373537275
Validation loss: 0.04050858244299889
Validation Accuracy: 0.9878690476190479

Epoch:  60%|██████    | 3/5 [19:39<13:06, 393.04s/it]
F1-Score: 0.7774269928966061
Train loss: 0.031488433757105125
Validation loss: 0.040281765162944794
Validation Accuracy: 0.9882630952380962

Epoch:  80%|████████  | 4/5 [26:13<06:33, 393.28s/it]
F1-Score: 0.7858928806421897
Train loss: 0.026683800776466298
Validation loss: 0.04160652831196785
Validation Accuracy: 0.9879075396825402

Epoch: 100%|██████████| 5/5 [32:47<00:00, 393.43s/it]
F1-Score: 0.7854737023922619


Note that already after the first epoch we get better performance than in all my previous posts on the topic.

## Evaluate the model

In [24]:
model.eval()
predictions = []
true_labels = []
eval_loss, eval_accuracy = 0, 0
nb_eval_steps, nb_eval_examples = 0, 0
for batch in valid_dataloader:
    batch = tuple(t.to(device) for t in batch)
    b_input_ids, b_input_mask, b_labels = batch

    # no gradients are needed for evaluation
    with torch.no_grad():
        tmp_eval_loss = model(b_input_ids, token_type_ids=None,
                              attention_mask=b_input_mask, labels=b_labels)
        logits = model(b_input_ids, token_type_ids=None,
                       attention_mask=b_input_mask)

    logits = logits.detach().cpu().numpy()
    predictions.extend([list(p) for p in np.argmax(logits, axis=2)])
    label_ids = b_labels.to('cpu').numpy()
    true_labels.append(label_ids)
    tmp_eval_accuracy = flat_accuracy(logits, label_ids)

    eval_loss += tmp_eval_loss.mean().item()
    eval_accuracy += tmp_eval_accuracy

    nb_eval_examples += b_input_ids.size(0)
    nb_eval_steps += 1

pred_tags = [[tags_vals[p_i] for p_i in p] for p in predictions]
valid_tags = [[tags_vals[l_ii] for l_ii in l_i] for l in true_labels for l_i in l ]
print("Validation loss: {}".format(eval_loss/nb_eval_steps))
print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))
# seqeval expects the true tags first, then the predictions
print("Validation F1-Score: {}".format(f1_score(valid_tags, pred_tags)))

Validation loss: 0.04160652831196785
Validation Accuracy: 0.9879075396825402
Validation F1-Score: 0.7854737023922619
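When you apply the model to new text, the predictions come back per WordPiece token, and continuation pieces are marked with a leading "##". A minimal sketch of how such pieces could be merged back into words, keeping the tag of the first piece, is shown here. `merge_wordpieces` is a hypothetical helper and the token split in the example is made up for illustration.

```python
def merge_wordpieces(tokens, tags):
    """Join "##" continuation pieces to the previous token and keep the
    tag predicted for the first piece of every word."""
    words, word_tags = [], []
    for token, tag in zip(tokens, tags):
        if token.startswith("##") and words:
            words[-1] += token[2:]   # glue the piece onto the current word
        else:
            words.append(token)
            word_tags.append(tag)
    return words, word_tags


merged_words, merged_tags = merge_wordpieces(
    ["demonstr", "##ators", "in", "london"], ["O", "O", "O", "B-geo"]
)
print(merged_words)  # ['demonstrators', 'in', 'london']
print(merged_tags)   # ['O', 'O', 'B-geo']
```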


As you can see, this works amazingly well! This approach will give you very strong models for named entity recognition. Since BERT is also available as a multilingual model covering 102 languages, you can use it for a wide variety of tasks. Try it on your own problems and let me know how it works for you. If you want to interpret your named entity system, check out this post.