Named entity recognition series:
- Introduction To Named Entity Recognition In Python
- Named Entity Recognition With Conditional Random Fields In Python
- Guide To Sequence Tagging With Neural Networks In Python
- Sequence Tagging With A LSTM-CRF
- Enhancing LSTMs With Character Embeddings For Named Entity Recognition
- State-Of-The-Art Named Entity Recognition With Residual LSTM And ELMo
- Evaluate Sequence Models In Python
- Named Entity Recognition With BERT
- Interpretable Named Entity Recognition With Keras And LIME
In 2018 we saw the rise of pre-training and fine-tuning in natural language processing. Large neural networks have been trained on general tasks like language modelling and then fine-tuned for classification tasks. One of the latest milestones in this development is the release of BERT, a model that broke several records for how well models can handle language-based tasks. The model is based on the transformer architecture from “Attention Is All You Need” and was pre-trained in a bidirectional way on several language modelling tasks. So probably the new slogan should read “Attention and pre-training is all you need”. If you want more details about the model and its pre-training, you will find some resources at the end of this post.
This is a new post in my NER series. I will show you how you can fine-tune the BERT model to do state-of-the-art named entity recognition (NER) in Python with PyTorch. First, install the PyTorch BERT package by Hugging Face with:
pip install pytorch-pretrained-bert==0.4.0
Now you have access to the pre-trained BERT models and the PyTorch wrappers we will use here. Since some of you noticed problems with the code below when using newer versions of pytorch-pretrained-bert, I recommend using version 0.4.0 and Python >= 3.6 for now.
Load the data
We use the dataset you already know from my previous posts about named entity recognition.
import pandas as pd
import numpy as np
from tqdm import tqdm, trange
data = pd.read_csv("ner_dataset.csv", encoding="latin1").fillna(method="ffill")
data.tail(10)
class SentenceGetter(object):

    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except:
            return None
getter = SentenceGetter(data)
This is what the sentences in the dataset look like.
sentences = [" ".join([s[0] for s in sent]) for sent in getter.sentences]
sentences[0]
The sentences are annotated with the BIO scheme, and the labels look like this.
labels = [[s[2] for s in sent] for sent in getter.sentences]
print(labels[0])
tags_vals = list(set(data["Tag"].values))
tag2idx = {t: i for i, t in enumerate(tags_vals)}
Apply Bert
Prepare the sentences and labels
Before we can start fine-tuning the model, we have to prepare the dataset for use with PyTorch and BERT.
import torch
from torch.optim import Adam
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from pytorch_pretrained_bert import BertTokenizer, BertConfig
from pytorch_pretrained_bert import BertForTokenClassification, BertAdam
Here we fix some configurations. We will limit our sequence length to 75 tokens and use a batch size of 32, as suggested by the BERT paper. Note that BERT natively supports sequences of up to 512 tokens.
MAX_LEN = 75
bs = 32
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)
The BERT implementation comes with a pre-trained tokenizer and a defined vocabulary. We load the one related to the smallest pre-trained model, bert-base-uncased. Also try the cased variant, since it is well suited for NER.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
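If you want to try the cased variant instead, here is a minimal sketch (just an illustration: keep the original casing and remember to also load the matching bert-base-cased model later on):
# sketch: the cased tokenizer keeps the original casing of the words
cased_tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)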
Now we tokenize all sentences.
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
print(tokenized_texts[0])
Next, we cut and pad the token and label sequences to our desired length.
input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],
                          maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

tags = pad_sequences([[tag2idx.get(l) for l in lab] for lab in labels],
                     maxlen=MAX_LEN, value=tag2idx["O"], padding="post",
                     dtype="long", truncating="post")
The BERT model supports something called an attention_mask, which is similar to masking in Keras. So here we create the mask to ignore the padded elements in the sequences.
attention_masks = [[float(i>0) for i in ii] for ii in input_ids]
Now we split the dataset to use 10% to validate the model.
tr_inputs, val_inputs, tr_tags, val_tags = train_test_split(input_ids, tags,
                                                            random_state=2018, test_size=0.1)
tr_masks, val_masks, _, _ = train_test_split(attention_masks, input_ids,
                                             random_state=2018, test_size=0.1)
Since we’re operating in pytorch, we have to convert the dataset to torch tensors.
tr_inputs = torch.tensor(tr_inputs)
val_inputs = torch.tensor(val_inputs)
tr_tags = torch.tensor(tr_tags)
val_tags = torch.tensor(val_tags)
tr_masks = torch.tensor(tr_masks)
val_masks = torch.tensor(val_masks)
The last step is to define the dataloaders. We shuffle the data at training time with the RandomSampler and at test time we just pass them sequentially with the SequentialSampler.
train_data = TensorDataset(tr_inputs, tr_masks, tr_tags)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=bs)
valid_data = TensorDataset(val_inputs, val_masks, val_tags)
valid_sampler = SequentialSampler(valid_data)
valid_dataloader = DataLoader(valid_data, sampler=valid_sampler, batch_size=bs)
Set up the BERT model for fine-tuning
The pytorch-pretrained-bert package provides a BertForTokenClassification class for token-level predictions. BertForTokenClassification is a fine-tuning model that wraps BertModel and adds a token-level classifier on top of it. The token-level classifier is a linear layer that takes the last hidden state of the sequence as input. We load the pre-trained bert-base-uncased model and provide the number of possible labels.
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=len(tag2idx))
Now we have to pass the model parameters to the GPU.
model.cuda();
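If you don't have a GPU, you can simply skip this step; alternatively, a minimal sketch that uses the device object defined above:
# moves the model to the GPU if one was detected above, otherwise it stays on the CPU
model.to(device)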
Before we can start the fine-tuning process, we have to set up the optimizer and add the parameters it should update. A common choice is the Adam optimizer. We also add some weight_decay as regularization to the main weight matrices. If you have limited resources, you can also try to train just the linear classifier on top of BERT and keep all other weights fixed. This will still give you good performance.
FULL_FINETUNING = True
if FULL_FINETUNING:
    param_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'gamma', 'beta']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay_rate': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay_rate': 0.0}
    ]
else:
    param_optimizer = list(model.classifier.named_parameters())
    optimizer_grouped_parameters = [{"params": [p for n, p in param_optimizer]}]
optimizer = Adam(optimizer_grouped_parameters, lr=3e-5)
Fine-tune BERT
First, we define some metrics we want to track while training. We use the f1_score from the seqeval package; you can find more details here. And we use simple accuracy on a token level, comparable to the accuracy in Keras.
from seqeval.metrics import f1_score
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=2).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
Finally, we can fine-tune the model. A few epochs should be enough; the paper suggests 3-4 epochs.
epochs = 5
max_grad_norm = 1.0
for _ in trange(epochs, desc="Epoch"):
    # TRAIN loop
    model.train()
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    for step, batch in enumerate(train_dataloader):
        # add batch to gpu
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        # forward pass
        loss = model(b_input_ids, token_type_ids=None,
                     attention_mask=b_input_mask, labels=b_labels)
        # backward pass
        loss.backward()
        # track train loss
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1
        # gradient clipping
        torch.nn.utils.clip_grad_norm_(parameters=model.parameters(), max_norm=max_grad_norm)
        # update parameters
        optimizer.step()
        model.zero_grad()
    # print train loss per epoch
    print("Train loss: {}".format(tr_loss/nb_tr_steps))
    # VALIDATION on validation set
    model.eval()
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    predictions, true_labels = [], []
    for batch in valid_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        with torch.no_grad():
            tmp_eval_loss = model(b_input_ids, token_type_ids=None,
                                  attention_mask=b_input_mask, labels=b_labels)
            logits = model(b_input_ids, token_type_ids=None,
                           attention_mask=b_input_mask)
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        predictions.extend([list(p) for p in np.argmax(logits, axis=2)])
        true_labels.append(label_ids)
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        eval_loss += tmp_eval_loss.mean().item()
        eval_accuracy += tmp_eval_accuracy
        nb_eval_examples += b_input_ids.size(0)
        nb_eval_steps += 1
    eval_loss = eval_loss/nb_eval_steps
    print("Validation loss: {}".format(eval_loss))
    print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))
    pred_tags = [tags_vals[p_i] for p in predictions for p_i in p]
    valid_tags = [tags_vals[l_ii] for l in true_labels for l_i in l for l_ii in l_i]
    print("F1-Score: {}".format(f1_score(pred_tags, valid_tags)))
Note that already after the first epoch we get better performance than in all my previous posts on the topic.
Evaluate the model
model.eval()
predictions = []
true_labels = []
eval_loss, eval_accuracy = 0, 0
nb_eval_steps, nb_eval_examples = 0, 0
for batch in valid_dataloader:
    batch = tuple(t.to(device) for t in batch)
    b_input_ids, b_input_mask, b_labels = batch
    with torch.no_grad():
        tmp_eval_loss = model(b_input_ids, token_type_ids=None,
                              attention_mask=b_input_mask, labels=b_labels)
        logits = model(b_input_ids, token_type_ids=None,
                       attention_mask=b_input_mask)
    logits = logits.detach().cpu().numpy()
    predictions.extend([list(p) for p in np.argmax(logits, axis=2)])
    label_ids = b_labels.to('cpu').numpy()
    true_labels.append(label_ids)
    tmp_eval_accuracy = flat_accuracy(logits, label_ids)
    eval_loss += tmp_eval_loss.mean().item()
    eval_accuracy += tmp_eval_accuracy
    nb_eval_examples += b_input_ids.size(0)
    nb_eval_steps += 1
pred_tags = [[tags_vals[p_i] for p_i in p] for p in predictions]
valid_tags = [[tags_vals[l_ii] for l_ii in l_i] for l in true_labels for l_i in l ]
print("Validation loss: {}".format(eval_loss/nb_eval_steps))
print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))
print("Validation F1-Score: {}".format(f1_score(pred_tags, valid_tags)))
As you can see, this works amazingly well! This approach will give you very strong models for named entity recognition. Since BERT is available as a multilingual model in 102 languages, you can use it for a wide variety of tasks. Try it on your problems and let me know how it works for you. If you want to interpret your named entity system, check out this post.
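Before wrapping up: a question that comes up a lot is how to keep the fine-tuned model around. One simple way, sketched here under the assumption that you reuse the same tag2idx mapping (the file name is just an example), is to save the state dict and load it back into a freshly constructed model:
# save the fine-tuned weights (the file name is an arbitrary example)
torch.save(model.state_dict(), "bert_ner_model.pt")
# note: also keep the tag2idx/tags_vals mapping around, since it has to match at load time

# later: rebuild the same architecture, load the saved weights and switch to eval mode
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=len(tag2idx))
model.load_state_dict(torch.load("bert_ner_model.pt"))
model.eval()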
Resources:
- Beautifully illustrated explanation of BERT, ELMo and ULMFiT: https://jalammar.github.io/illustrated-bert/
- The original BERT paper: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (https://arxiv.org/pdf/1810.04805.pdf)
- Documentation of pytorch-pretrained-bert: https://github.com/huggingface/pytorch-pretrained-BERT
License:
All code is released under MIT License.
12/18/2018 at 1:44 am
How is getting 78% validation f1 equivalent to the architecture “working perfectly” for NER? I thought state-of the art NER f1 is around 92%
12/19/2018 at 3:22 pm
Hi Octavia,
you are right, maybe this is not the best possible result on this data set. I updated it in the article. But please note that this is not the CoNLL-2003 NER data set, where the state of the art is around 92%. You can try to tune the model to achieve better performance. But for me it was better than everything else on this data set. Let me know your experiences.
12/30/2018 at 4:59 pm
Hi Tobias,
Thanks for sharing such a state-of-the-art implementation with us.
I really like the way you have explained everything in all of the posts.
Actually, I have collected a lot of legal data, so I was thinking to fine-tune the BERT encoder part with free text and then do the NER part for the legal data, but I only have an 8 GB GPU… so I want to know how much memory your model took.
And please share more BERT posts (‘-‘)
Once again, thanks for sharing.
12/31/2018 at 10:12 am
Hi Amit,
I’m happy you like it!
The model provided by Huggingface is already pretrained on free text and you can fine-tune it for your dataset with the code in the post. 8gb GPU memory should be enough. If you also want to pretrain it on your texts, you can check the documentation by huggingface. Maybe I’ll do a post about it soon.
Let me know how it works for you! 🙂
01/03/2019 at 6:04 pm
Hi,
I would like to try this for Tweets. will this work?
Let me know your suggestions.
01/04/2019 at 7:33 am
Hi,
I don’t know exactly what you want to do with tweets, but to some extent it will work. You could also try to fine-tune the language model on your tweets first and then fine-tune for the actual task.
01/08/2019 at 9:20 am
Why not continue to use Keras to implement this project? I'm not familiar with PyTorch.
01/08/2019 at 7:15 pm
I wanted to do a post with pytorch. It’s more flexible than keras and it’s more straightforward to do certain things. Maybe I’ll do a post with keras in the future.
06/26/2019 at 1:19 pm
please do!
01/08/2019 at 6:38 pm
hi,
Thanks for sharing this great work!
Is it possible to save this model as .h5 standalone model?
Thanks
01/08/2019 at 7:12 pm
Hi, I’m happy you like it!
I don’t know how to save it as an .h5 file. But you can use torch.save(model.bert.state_dict(), output_model_file) to save the model weights.
01/14/2019 at 5:18 pm
I get a total size of ~438 Mb for the state dictionary file. Just saying.
01/09/2019 at 10:12 am
Hey! Thanks for this cool tutorial!
I think there might be a problem with your mapping of the tokens to the labels. The BertTokenizer chunks words differently than just on whitespace. E.g. “Henson” could be chunked into [“Hen”, “##son”]. Your current implementation would thus create an offset to the next label which might be the reason for the bad results mentioned above. In the paper they add an additional label for these cases: ‘X’.
I have adjusted your code to tokenize on the data-token level and then subsequently join the lists:
“`
Jim I-PER
Henson I-PER
was O
a O
puppeteer O
=>
[
  [['Jim'], ['I-PER']]
  [['Hen', '##son'], ['I-PER', 'X']]
  [['was'], ['O']]
  [['a'], ['O']]
  [['puppet', '##eer'], ['O', 'X']]
]
=>
['Jim', 'Hen', '##son', 'was', 'a', 'puppet', '##eer'],
['I-PER', 'I-PER', 'X', 'O', 'O', 'O', 'X']
“`
as mentioned in the paper
01/09/2019 at 4:27 pm
Hey! I’m happy you like it! I used the uncased model and thus the uncased BertTokenizer which in fact tokenizes by whitespace. But your implementation would work for the cased bert model.
Nice!
01/21/2019 at 3:37 pm
BertTokenizer does NOT tokenize by whitespace, see https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/tokenization.py#L95
It tokenizes using sub-word units. As Jonas noted, your code does not correctly align labels with Bert tokens.
01/21/2019 at 4:02 pm
The basic version does not tokenize by whitespace. If you use the pretrained BertTokenizer for the uncased model (as I did) by
BertTokenizer.from_pretrained('bert-base-uncased')
you get tokenization by whitespace as you can see in my post. If you use the cased model then it uses sub-word units.
02/13/2019 at 11:26 am
Hi Tobias,
thank you very much for sharing this great tutorial. I am trying to fine-tune the model for a NER dataset in Italian. I have checked the tokenization and it looks like
“`
BertTokenizer.from_pretrained('bert-base-uncased')
“`
does not tokenize by whitespace! I have checked it by
“`
check = [len(sent) == len(lab) for sent, lab in zip(tokenized_texts, labels)]
all(check)
“`
which return `False`. Any clue about how to deal with this problem?
Thanks!
02/13/2019 at 2:48 pm
I guess huggingface changed the behavior of the tokenizer. With my version of their implementation your check returns true. I would suggest you check out the WordpieceTokenizer and adjust your labels accordingly. I hope this helps you.
01/10/2019 at 12:36 am
Hi Jonas,
Do you mind sharing the code you used to do this? I am facing the label-offset issue you mention. Thanks!
02/20/2019 at 7:58 am
Hi, I have solved it by doing the following:

from pytorch_pretrained_bert import WordpieceTokenizer

mytexts = []
mylabels = []
sentences_clean = [sent.lower() for sent in sentences]
for sent, tags in zip(sentences_clean, labels):
    new_tags = []
    new_text = []
    for word, tag in zip(sent.split(), tags):
        # print('splitting: ', word)
        sub_words = tokenizer.wordpiece_tokenizer.tokenize(word)
        for count, sub_word in enumerate(sub_words):
            # print('subword: ', sub_word)
            if count > 0:
                tag = 'X'
            new_tags.append(tag)
            new_text.append(sub_word)
    mytexts.append(new_text)
    mylabels.append(new_tags)

hope it helps!
hope it helps!
02/26/2019 at 3:29 pm
Hi are you sure about this fix ? When I tried to run it I got error. Could you, if possible, share the full fix ?
03/27/2019 at 11:32 pm
It’s probably due to the "if count > 0:" line – it might not have been converted correctly in the HTML; just replace the HTML-escaped &gt; with a plain > sign.
07/09/2019 at 8:12 am
Thanks Michele! I changed this slightly to the following:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

tokenized_texts = []
mylabels = []
for sent, tags in zip(sentences, labels):
    BERT_texts = []
    BERT_labels = np.array([])
    for word, tag in zip(sent.split(), tags):
        sub_words = tokenizer.wordpiece_tokenizer.tokenize(word)
        tags = np.array([tag for x in sub_words])
        tags[1:] = 'X'
        BERT_texts += sub_words
        BERT_labels = np.append(BERT_labels, tags)
    tokenized_texts.append(BERT_texts)
    mylabels.append(BERT_labels)
print(tokenized_texts[1])
print(mylabels[1])
which then yields
['Jim', 'Hen', '##son', 'was', 'a', 'puppet', '##eer']
['I-PER', 'I-PER', 'X', 'O', 'O', 'O', 'X']
04/11/2019 at 4:21 pm
Thanks Michele for your code snippet. Don’t forget to throw away the "X"-labelled tokens during the computation of the F1 score, as they are not important. I’ve also slightly changed the BertForTokenClassification class to not take the X-labelled tokens into account for the computation of the loss during training.
Thus, the model achieved an F1-score of 0.84 with this dataset on BERT base uncased.
06/10/2019 at 2:32 pm
Here is an alternative solution
num_sent = len(labels)
for sent_id in range(num_sent):
    tokens_len = len(tokenized_texts[sent_id])
    for i in range(tokens_len):
        if tokenized_texts[sent_id][i][:2] == "##":
            labels[sent_id].insert(i, "X")
06/11/2019 at 1:59 am
Hi, I think if the sentence contains some symbols, like 9:00 or old-style, the code will not be suitable, because the tokenizer will split these words at the symbols without a "##" prefix.
01/31/2019 at 6:44 am
Hey Jonas,
I’m facing this exact problem. Do you mind publishing the code on how you offset the labels with an ‘X’ correctly?
Thanks much!
11/10/2019 at 9:31 pm
Hi Jonas, would you be able to share your code?
01/09/2019 at 10:25 pm
Hello, Can you show how to do this without GPU? Thanks
01/10/2019 at 7:13 am
Just leave out model.cuda(). Then it should work on CPU.
01/10/2019 at 4:13 pm
I excluded the model.cuda() and I got all the way to line 23 but it fails at the first Epoch with the following error
RuntimeError Traceback (most recent call last)
in ()
13 # forward pass
14 loss = model(b_input_ids, token_type_ids=None,
—> 15 attention_mask=b_input_mask, labels=b_labels)
16 # backward pass
17 loss.backward()
D:\Anaconda3\envs\kerasenv\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
–> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)
D:\Anaconda3\envs\kerasenv\lib\site-packages\pytorch_pretrained_bert\modeling.py in forward(self, input_ids, token_type_ids, attention_mask, labels)
1018
1019 def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
-> 1020 sequence_output, _ = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
1021 sequence_output = self.dropout(sequence_output)
1022 logits = self.classifier(sequence_output)
D:\Anaconda3\envs\kerasenv\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
–> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)
D:\Anaconda3\envs\kerasenv\lib\site-packages\pytorch_pretrained_bert\modeling.py in forward(self, input_ids, token_type_ids, attention_mask, output_all_encoded_layers)
624 extended_attention_mask = (1.0 – extended_attention_mask) * -10000.0
625
–> 626 embedding_output = self.embeddings(input_ids, token_type_ids)
627 encoded_layers = self.encoder(embedding_output,
628 extended_attention_mask,
D:\Anaconda3\envs\kerasenv\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
–> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)
D:\Anaconda3\envs\kerasenv\lib\site-packages\pytorch_pretrained_bert\modeling.py in forward(self, input_ids, token_type_ids)
191 token_type_ids = torch.zeros_like(input_ids)
192
–> 193 words_embeddings = self.word_embeddings(input_ids)
194 position_embeddings = self.position_embeddings(position_ids)
195 token_type_embeddings = self.token_type_embeddings(token_type_ids)
D:\Anaconda3\envs\kerasenv\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
–> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)
D:\Anaconda3\envs\kerasenv\lib\site-packages\torch\nn\modules\sparse.py in forward(self, input)
116 return F.embedding(
117 input, self.weight, self.padding_idx, self.max_norm,
–> 118 self.norm_type, self.scale_grad_by_freq, self.sparse)
119
120 def extra_repr(self):
D:\Anaconda3\envs\kerasenv\lib\site-packages\torch\nn\functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1452 # remove once script supports set_grad_enabled
1453 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1454 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1455
1456
RuntimeError: Expected tensor for argument #1 ‘indices’ to have scalar type Long; but got torch.IntTensor instead (while checking arguments for embedding)
01/10/2019 at 5:19 pm
This is not related to using the GPU, as far as I can tell. You have to convert your input tensors to LongTensors in pytorch. It does not work with mere ints.
01/10/2019 at 8:12 pm
How come that isn’t necessary in the above code? I copied the code exactly as you have it.
01/10/2019 at 8:55 pm
Are you sure? For me it works with CPU and pytorch 1.0.
01/29/2019 at 4:02 pm
Hi Tobias, love this post!
I also got a Runtime Error and had to change the code in [17] to
xxx = torch.LongTensor(yyy)
to make it work.
Also, just FYI was not able to run the code on one of my machines that has only a 6GB CPU due to memory problems. BERT is great, but very memory hungry.
06/26/2019 at 2:05 pm
I had the same problem on Windows. Depending on what device you choose, you may get an CPU/GPU input error. The reason is that on Windows, tensors need to be cast to long (sourcetensor.long()) before fed to the input, while on Linux (where I guess OP tested it) this is not needed.
It has to do with a int32 representation of the values on Windows vs a int64 (already without the cast) on Linux
Now the only error I get is because of my 4gbs VRAM.
Source: https://github.com/dmlc/dgl/issues/672 (similar case)
04/20/2019 at 9:32 am
When running In[23] , I get the following error , please help !
C:\Users\Houssemus\Desktop\ner-prog>py -3.6 testing1.py
Using TensorFlow backend.
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
Epoch: 0%| | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
File "testing1.py", line 109, in
loss = model(b_input_ids, token_type_ids=None,attention_mask=b_input_mask, labels=b_labels)
File “C:\Users\Houssemus\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py”, line 477, in __call__
result = self.forward(*input, **kwargs)
File “C:\Users\Houssemus\AppData\Local\Programs\Python\Python36\lib\site-packages\pytorch_pretrained_bert\modeling.py”, line 1020, in forward
sequence_output, _ = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
File “C:\Users\Houssemus\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py”, line 477, in __call__
result = self.forward(*input, **kwargs)
File “C:\Users\Houssemus\AppData\Local\Programs\Python\Python36\lib\site-packages\pytorch_pretrained_bert\modeling.py”, line 626, in forward
embedding_output = self.embeddings(input_ids, token_type_ids)
File “C:\Users\Houssemus\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py”, line 477, in __call__
result = self.forward(*input, **kwargs)
File “C:\Users\Houssemus\AppData\Local\Programs\Python\Python36\lib\site-packages\pytorch_pretrained_bert\modeling.py”, line 193, in forward
words_embeddings = self.word_embeddings(input_ids)
File “C:\Users\Houssemus\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py”, line 477, in __call__
result = self.forward(*input, **kwargs)
File “C:\Users\Houssemus\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\sparse.py”, line 110, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File “C:\Users\Houssemus\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\functional.py”, line 1110, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 ‘indices’ to have scalar type Long; but got CUDAIntTensor instead (while checking arguments for embedding)
HI JOSEPH ,
I am facing the same error you had , could you please tell me how to fix it if you know how ? thanks !
01/16/2019 at 5:10 pm
A summary of the resource usage/speed/accuracy of the various approaches might be of interest to your base. It is still relatively expensive to use a GPU/TPU.
01/17/2019 at 6:58 am
That’s true! I will compile a list as soon as possible.
01/16/2019 at 10:36 pm
Thanks for this great post. I have a question. Did you try adding a BiLSTM-CRF model on top of the BERT model during fine-tuning? Have you thought of that? I have implemented the same thing using TensorFlow and I found out that the performance is worse when adding the BiLSTM-CRF model. I don’t know whether it is true or not. Thanks!
01/22/2019 at 5:35 pm
Hi thanks for the code. I take it I save at the end of the training with
torch.save(model.state_dict(), 'ner_model')
then:
Load as:
model.load_state_dict(torch.load('ner_model'))
But do I need to do this:
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=len(tag2idx))
+ FULL_FINETUNING = True ?
The validation accuracy when loading the model is vastly different to that outputted from training, making me think I didn’t load it properly.
Thanks
03/30/2019 at 5:35 am
This might be caused by sampling different train/test split. Is loss smaller at beginning?
01/22/2019 at 5:42 pm
Hi, thanks for the tutorial,
what’s the method to load the saved weights? This:
model.load_state_dict(torch.load('PATH'))
doesn’t replicate the accuracy from training. Making me think I’ve loaded it wrongly.
03/20/2019 at 11:25 am
Hi,
Thanks for this great tutorial!
I have the same issue!
Did you find any solution?
Thanks!
08/20/2019 at 7:28 am
Hi,
I had the same issue, and get it solved with the following methods for saving/loading.
Saving:
torch.save(model.state_dict(), PATH)
Loading:
model_state_dict = torch.load(PATH)
model = BertForTokenClassification.from_pretrained("bert-base-multilingual-cased", state_dict=model_state_dict, num_labels=len(tag2idx))
I used the multilingual model for Danish training. The accuracy can be replicated after loading the saved model.
Thanks for the great tutorial!
01/25/2019 at 4:03 am
Hi, thanks for sharing,
Have you ever tried to apply FP16_Optimizer to your codes ? I used your source-code and applied fp16 follow the logic given by “https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_lm_finetuning.py”
, so I would use FP16_Optimizer and optimizer.backward(loss) as well. But found the error in optimizer.step():
…
File “/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/optimizers/fp16_optimizer.py”, line 147, in step
grads_groups_flat.append(_flatten_dense_tensors([p.grad for p in group]))
File “/usr/local/lib/python3.6/dist-packages/torch/_utils.py”, line 194, in _flatten_dense_tensors
flat = torch.cat([t.contiguous().view(-1) for t in tensors], dim=0)
File “/usr/local/lib/python3.6/dist-packages/torch/_utils.py”, line 194, in
flat = torch.cat([t.contiguous().view(-1) for t in tensors], dim=0)
AttributeError: ‘NoneType’ object has no attribute ‘contiguous’
I went as deeply as I could, but still couldn’t fix. Could you please try and give me the ideas ?.
Thanks lots.
01/25/2019 at 8:27 am
Hi Anh,
I haven’t tried training at fp16 and I have no idea how to help you. Maybe someone else here can help you. I would be interested in the results!
01/25/2019 at 9:35 am
I already fixed this issue, and will note it here in case someone needs it. Using fp16 will not only reduce RAM but also accelerate the computation, so, why not 😀
# hack to remove the pooler, which is not used
# and thus produces None grads that break apex
param_optimizer = [n for n in param_optimizer if 'pooler' not in n[0]]
01/28/2019 at 10:56 pm
Hey, thanks a lot for the fix! I was thinking about trying it myself soon. What were your memory savings and speedup?
01/30/2019 at 3:26 pm
I noticed in the BERT paper (and the figure from that paper that you display in this article) there is an extra ‘[CLS]’ token at the beginning of an input sequence. Is this something we need to add manually?
02/13/2019 at 2:49 pm
You can try adding it and find out if it changes the performance. But it looks like it is relevant for the model training procedure. I’m looking forward to hearing about your experiment!
03/30/2019 at 5:41 am
This [CLS] token is important in sequence classification, it is used as convenience placeholder for comparison embedding. Here it isn’t required.
02/12/2019 at 6:16 pm
Hi Tobias,
I used the same dataset and parameters as you, but the performance is very poor. I want to know if there’s anything I’m missing.
The following are the result from my training:
Epoch: 0%| | 0/5 [00:00<?, ?it/s]Train loss: 0.14929957927830842
Validation loss: 0.1385537704328696
Validation Accuracy: 0.9221349206349209
Epoch: 20%|██ | 1/5 [23:02<1:32:09, 1382.35s/it]F1-Score: 0.41764101560589006
Train loss: 0.11814392791347118
Validation loss: 0.12850322524706523
Validation Accuracy: 0.9272111111111115
Epoch: 40%|████ | 2/5 [46:05<1:09:07, 1382.64s/it]F1-Score: 0.4333100558659218
Train loss: 0.09879241194479371
Validation loss: 0.12466114262739818
Validation Accuracy: 0.9192996031746031
Epoch: 60%|██████ | 3/5 [1:09:05<46:03, 1381.72s/it]F1-Score: 0.4177507882751372
Train loss: 0.08402714759161112
Validation loss: 0.12018664911389351
Validation Accuracy: 0.9205309523809526
Epoch: 80%|████████ | 4/5 [1:32:07<23:01, 1381.80s/it]F1-Score: 0.4267321539268514
Train loss: 0.07209687749665365
Validation loss: 0.1262998983512322
Validation Accuracy: 0.9167210317460324
Epoch: 100%|██████████| 5/5 [1:55:09<00:00, 1382.00s/it]
F1-Score: 0.4178773530430766
Validation loss: 0.1262998983512322
Validation Accuracy: 0.9167210317460324
Validation F1-Score: 0.4178773530430766
02/12/2019 at 6:59 pm
Very hard to tell what’s going on without seeing the code. Are you using the correct tokenizer and preparing the labels in the right way? Have you checked your inputs?
02/26/2019 at 10:20 pm
Hi,
Thank you for your post
I have the same issue as Juan. I got the same results.
I just copied the code that you provided and I got these results. The F1-score is very low.
Epoch: 0%| | 0/5 [00:00<?, ?it/s]Train loss: 0.27496489635036997
Validation loss: 0.16450211400787035
Validation Accuracy: 0.9232158730158733
Epoch: 20%|██ | 1/5 [23:16<1:33:05, 1396.31s/it]F1-Score: 0.38974259065338623
Train loss: 0.14994709562645567
Validation loss: 0.14076075688004494
Validation Accuracy: 0.9270134920634923
Epoch: 40%|████ | 2/5 [46:33<1:09:49, 1396.51s/it]F1-Score: 0.4148827472773943
Train loss: 0.11850443832559705
Validation loss: 0.12456354605654875
Validation Accuracy: 0.9266436507936514
Epoch: 60%|██████ | 3/5 [1:09:50<46:33, 1396.77s/it]F1-Score: 0.43069575065242016
Train loss: 0.09963840262493528
Validation loss: 0.15225832387804986
Validation Accuracy: 0.9053071428571426
Epoch: 80%|████████ | 4/5 [1:33:07<23:16, 1396.68s/it]F1-Score: 0.3555979643765903
Train loss: 0.08547445007077273
Validation loss: 0.12115591257810593
Validation Accuracy: 0.9198857142857141
Epoch: 100%|██████████| 5/5 [1:56:24<00:00, 1396.81s/it]F1-Score: 0.41601942859761265
I will appreciate if you can check it.
Thank you 🙂
03/25/2019 at 1:15 pm
I experienced the same issue using the latest version 0.6.1 of pytorch-pretrained-bert. F1 is about 80% after returning to the recommended 0.4.0 version.
Can anybody suggest any explanation?
bin file is the same.
06/16/2019 at 11:42 am
I got the same result when FULL_FINETUNING was set to false.
02/13/2019 at 12:30 pm
Hi Tobias,
I just copy and paste your code. I didn’t change anything.
Every step was the same before the performance part.
-Juan
02/13/2019 at 2:39 pm
Hi Juan,
I have no idea what went wrong. You can send me your code if you want.
02/15/2019 at 8:09 am
Dear Tobias,
thanks for the great post, I really enjoyed it! I face the same issue as Juan. Upon checking the tokenizer I think, this is related to the previous posts of Michele, Mike K and Jonas.
I use the uncased version of the BERT tokenizer and it seems to be tokenizing by whitespace for tokens available in its vocabulary. For tokens not available in its vocabulary it seems to split the tokens into subtokens, that are available in its vocabulary.
For example: The word “gunships”, which occurs in the third sentence of the corpus:
tokenizer.vocab['gunships'] returns a KeyError with my setup. But if you tokenize the word using:
tokenizer.tokenize('gunships') it returns ['guns', '##hips'] with my setup.
The tokenizer seems to use BPE as explained in “Tokenisation” section of the following Blogpost from HuggingFace: https://medium.com/huggingface/multi-label-text-classification-using-bert-the-mighty-transformer-69714fa3fb3d
I will try the workaround Jonas suggested and will let you know about the results.
Cheers
Alex
02/17/2019 at 11:28 am
Hi,
Thanks for the great post, it’s really clear and easy to read.
I am too facing similar issues – the NER performance of the model is poor even on the dataset used in your blog…
My code is similar to yours, except for the following modifications:
1. According to its documentation, the f1_score function receives two input params – ground truth (y_true) and prediction (y_pred), in that order, and not as demonstrated in the blog.
2. After each epoch I’ve printed the classification report to allow label-based analysis.
Notice the result for 5 epochs:
# epoch 1
Train loss: 2.1515635643844253
Validation loss: 1.5441952099402745
Validation Accuracy: 0.9520636574074074
F1-Score: 0
precision recall f1-score support
org 0.00 0.00 0.00 379
per 0.00 0.00 0.00 272
tim 0.00 0.00 0.00 352
geo 0.00 0.00 0.00 595
gpe 0.00 0.00 0.00 289
nat 0.00 0.00 0.00 3
art 0.00 0.00 0.00 8
eve 0.00 0.00 0.00 8
avg / total 0.00 0.00 0.00 1906
———————————————-
# epoch 2
Train loss: 0.7686076947936306
Validation loss: 0.5796454784770807
Validation Accuracy: 0.9522557870370368
F1-Score: 0
precision recall f1-score support
org 0.00 0.00 0.00 379
per 0.00 0.00 0.00 272
tim 0.00 0.00 0.00 352
geo 0.00 0.00 0.00 595
gpe 0.00 0.00 0.00 289
nat 0.00 0.00 0.00 3
art 0.00 0.00 0.00 8
eve 0.00 0.00 0.00 8
avg / total 0.00 0.00 0.00 1906
———————————————-
# epoch 3
Train loss: 1.2535134771907772
Validation loss: 0.8845297023653984
Validation Accuracy: 0.9522557870370368
F1-Score: 0
precision recall f1-score support
org 0.00 0.00 0.00 379
per 0.00 0.00 0.00 272
tim 0.00 0.00 0.00 352
geo 0.00 0.00 0.00 595
gpe 0.00 0.00 0.00 289
nat 0.00 0.00 0.00 3
art 0.00 0.00 0.00 8
eve 0.00 0.00 0.00 8
avg / total 0.00 0.00 0.00 1906
———————————————-
# epoch 4
Train loss: 0.7686076947936306
Validation loss: 0.5796454784770807
Validation Accuracy: 0.9522557870370368
F1-Score: 0
precision recall f1-score support
org 0.00 0.00 0.00 379
per 0.00 0.00 0.00 272
tim 0.00 0.00 0.00 352
geo 0.00 0.00 0.00 595
gpe 0.00 0.00 0.00 289
nat 0.00 0.00 0.00 3
art 0.00 0.00 0.00 8
eve 0.00 0.00 0.00 8
avg / total 0.00 0.00 0.00 1906
———————————————-
# epoch 5
Train loss: 0.5286654621638633
Validation loss: 0.44472165405750275
Validation Accuracy: 0.9522557870370368
F1-Score: 0
precision recall f1-score support
org 0.00 0.00 0.00 379
per 0.00 0.00 0.00 272
tim 0.00 0.00 0.00 352
geo 0.00 0.00 0.00 595
gpe 0.00 0.00 0.00 289
nat 0.00 0.00 0.00 3
art 0.00 0.00 0.00 8
eve 0.00 0.00 0.00 8
avg / total 0.00 0.00 0.00 1906
Notice that model did not classify tokens to one of the target classes.
I suspect that the validation accuracy improved (very slightly) over time only because the model classified everything as “O”, while the other metrics did not improve.
I then tried training the model for an additional 30 epochs and noticed that the performance per class improves only slightly… It seems as if it learns from scratch… which contradicts BERT’s promise to generate a semi-SOTA model in only a few epochs.
Please advise,
Danny
08/25/2019 at 5:23 am
Hi Danny, did you find the root cause and solution for this? Thanks.
02/25/2019 at 7:58 pm
I have been using the PyTorch implementation of Google’s [BERT][1] by [HuggingFace][2] for the MADE 1.0 dataset for quite some time now. Up until last time (11-Feb), I had been using the library and getting an **F-Score** of **0.81** for my Named Entity Recognition task by Fine Tuning the model. But this week when I ran the exact same code which had compiled and run earlier, it threw an error when executing this statement:
input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts], maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
> ValueError: Token indices sequence length is longer than the specified
> maximum sequence length for this BERT model (632 > 512). Running this
> sequence through BERT will result in indexing errors
The full code is available in this [colab notebook][3].
To get around this error I modified the above statement to the one below by taking the first 512 tokens of any sequence and made the necessary changes to add the index of [SEP] to the end of the truncated/padded sequence as required by BERT.
input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt[:512]) for txt in tokenized_texts], maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
The result shouldn’t have changed because I am only considering the first 512 tokens in the sequence and later truncating to 75 as my (MAX_LEN=75) but my **F-Score** has dropped to **0.40** and my **precision** to **0.27** while the **Recall** remains the same **(0.85)**. I am unable to share the dataset as I have signed a confidentiality clause but I can assure all the preprocessing as required by BERT has been done and all extended tokens like (Johanson –> Johan ##son) have been tagged with X and replaced later after the prediction as said in the [BERT Paper][4].
Has anyone else faced a similar issue or can elaborate on what might be the issue or what changes the PyTorch (Huggingface) people have done on their end recently?
[1]: https://github.com/google-research/bert#fine-tuning-with-bert
[2]: https://github.com/huggingface/pytorch-pretrained-BERT
[3]: https://colab.research.google.com/drive/1JxWdw1BjXZCFC2a8IwtZxvvq4rFGcxas
[4]: https://arxiv.org/abs/1810.04805
02/26/2019 at 7:56 pm
Hi Ashwin,
I am trying to understand what you did in the notebook. Could you please explain why you’re adding value of 102 to each tokenized input sequence ?
02/27/2019 at 8:48 pm
Hi Christian,
The BERT Model requires us to have a [SEP] token at the end of each sentence as a part of its preprocessing. 102 is the index BERT recognizes as the index of [SEP]. Hence, I am adding it to the end of the sentence after padding/truncating to be compatible with BERT’s requirement. I didn’t have it in the beginning and I thought it would be the reason for the poor results but changing it didn’t help and I chose to keep it anyways as it felt right. 🙂
03/01/2019 at 5:42 pm
Awesome post man, thanks a lot for sharing that, really great work!
I noticed that the code in your post runs fine with a good F1 score if you run it with pytorch-pretrained-bert==0.4.0. It should be something related either to the tokenizer or to BertForTokenClassification.
When I run w/ newer versions of pytorch-pretrained-bert (e.g. pytorch-pretrained-bert==0.6.1), F1 Score goes down as people is reporting.
**pytorch-pretrained-bert==0.4.0**
– F1-Score: 0.7806423229212494
– Train loss: 0.02669612845037273
– Validation loss: 0.039759145801266035
– Validation Accuracy: 0.9882277777777783
**pytorch-pretrained-bert==0.6.1**
– F1-Score: 0.4406215316315205
– Train loss: 0.09970482976470636
– Validation loss: 0.12539973077674707
– Validation Accuracy: 0.9194174603174605
03/01/2019 at 8:48 pm
You are right. That solves the issue.
There is something changed in the new update which affects the performance of the model.
Thanks anyways. 🙂
Cheers
03/02/2019 at 8:34 am
Thanks a lot for trying it! I updated the post and recommend using version 0.4.0 now.
11/24/2019 at 12:16 pm
The problem with different versions arises, because devs of pytorch-pretrained-bert fixed the issue with dealing with padded tokens since version 0.5.0. Previously, they summed up the loss on padded tokens, which is not correct.
But in this implementation, this fix results in performance drop because you need to pad sequence not with ‘O’ but with ‘[PAD]’ token.
The problem is here:
tags = pad_sequences([[tag2idx.get(l) for l in lab] for lab in labels], maxlen=MAX_LEN, value=tag2idx["O"], padding="post", dtype="long", truncating="post")
# Just switch here "O" to "[PAD]".
If you use “O” in versions >= 0.5.0, the problem happens because loss on padded tokens is ignored, then any wrong output of the model on padded tokens will not be penalized and the model will learn wrong signal for labels “O”.
The full fixed version of the code that does sequence tagging with BERT and newest version of pytorch pretrained bert is here:
https://github.com/IINemo/bert_sequence_tagger
There is a class SequenceTaggerBert that works with tokenized sequences (e.g., nltk tokenizer) and does all the necessary preprocessing under the hood.
Best
03/08/2019 at 9:21 am
Hi Tobias,
i tried to use your code for conll-2003 dataset, but unfortunately i’m facing very much low accuracy.
why did you choose “finding evaluation accuracy for each batch and then finally computing overall accuracy w.r.t total batch size” over “from seqeval.metrics import accuracy_score” ?
because for me accuracy_score is coming more for seqeval compare to sum(flat_accuracy)/total_batch.
any reason ?
for more details (https://stackoverflow.com/questions/55058998/what-is-causing-f1-score-high-but-accuracy-low-in-a-deep-learning-model)
03/08/2019 at 12:04 pm
Try using version 0.4.0 of pytorch-pretrained-bert. This should solve your problem.
03/11/2019 at 3:36 am
Hi, Does BERT create word embeddings as part of model architecture? As I see we convert tokens to ids as input.
03/11/2019 at 9:30 am
Yes, BERT creates token embeddings for inputs.
03/13/2019 at 5:45 pm
Hi
Thanks for sharing your work.
I am able to get the predictions just fine.
But i am finding it difficult to map the input batch to the original word form, after getting the predictions.
Any suggestions?
Thanks in advance.
03/13/2019 at 6:03 pm
You can use the method convert_ids_to_tokens(tokens) of the BertTokenizer class.
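For example, something along these lines should work (just a sketch, reusing the variable names from the evaluation loop in the post):
# map one batch of input ids back to wordpiece tokens (padded positions show up as "[PAD]")
batch_tokens = [tokenizer.convert_ids_to_tokens(list(ids)) for ids in b_input_ids.cpu().numpy()]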
08/03/2019 at 9:02 pm
I am not able to understand it. After I have
model.eval
Should I make a call like below to get my words out for prediction data
predicted_token = tokenizer.convert_ids_to_tokens([tr_inputs])[0]
03/19/2019 at 7:33 am
Replicated results from BERT paper on CoNLL-2003 dataset
https://github.com/kamalkraj/BERT-NER
03/19/2019 at 8:34 am
Nicely done!
03/21/2019 at 2:23 pm
Hi Tobias,
Thanks for the wonderful tutorials and great work on BERT NER. Just curious to know the training time for each epoch: I tried it locally with 16 GB RAM on an i5 but didn’t see anything after 1 hour and 15 minutes, so I had to kill the process manually. I tried it on Google Colab with GPU acceleration and saw the first epoch completing after 45 minutes and showing the metrics. Is this normal behavior? Please let me know if I need to change anything.
03/22/2019 at 4:50 pm
Hi and thanks for the post.
I followed the tutorial on this data set and got very promising results (F1 score 80%). However, when trying to train the NER model on another dataset, I get much lower F1 scores (45% is my best result). I noticed something strange about the tokenizer and labeling and how to distribute the labels after the tokenization of the text.
I don’t understand how this example handles sub-word pieces such as hyperdimesia = hyper + ##dimen + ##sia.
In the BERT paper, they suggest that sub-word pieces starting with ## must be labelled as X and we do not train or perform predictions on them. I’m wondering how to do that and whether that may be the reason my model is failing to train properly, considering my other data set has a large number of words that are unknown to BERT and hence will be split up.
Thanks for your help
03/25/2019 at 8:26 am
Which one (of the 8) would be the best to try first and then tune the model, if the training dataset is very small [with ~1000 labeled sentences and ~10 tags]?
03/28/2019 at 3:28 pm
Hello, Is there a way to get this to run on the 1060 with 6GB of memory? I tried and it gave the following error
RuntimeError: CUDA out of memory. Tried to allocate 7.13 MiB (GPU 0; 6.00 GiB total capacity; 4.61 GiB already allocated; 140.80 KiB free; 58.77 MiB cached)
Thanks
03/28/2019 at 3:48 pm
You can try reducing the batch size or reduce the input sequence length. But I have only tried it on my 1080Ti with 11GB.
04/07/2019 at 11:05 am
How to perform prediction of the bert model like on elmo embedding
like model.predict(test_data)
04/08/2019 at 5:26 am
Even I wanted to know how to use the trained BERT model to predict on custom sentences. If anybody could help me out with this it would be of great help. Thanks in advance.
04/09/2019 at 11:58 am
I have tried using the above evaluation method, but with a little bit of change, like:

model.eval()
prediction = []
true_label = []
eval_loss, eval_accuracy = 0, 0
nb_eval_steps, nb_eval_examples = 0, 0
for batches in pred_dataloader:
    batches = tuple(t.to(device) for t in batches)
    b_pred_ids, b_pred_mask = batches
    with torch.no_grad():
        logits_pred = model(b_pred_ids, token_type_ids=None,
                            attention_mask=b_pred_mask)
    logits_pred = logits_pred.detach().cpu().numpy()
    prediction.extend([list(p) for p in np.argmax(logits_pred, axis=-1)])
    # print(len(prediction))
pred_tag = [[tags[p_i] for p_i in p] for p in prediction]
print(len(pred_tag))
and pred_dataloader is:

pred_data = TensorDataset(tr_inputs_pred, tr_masked)
# pred_sampler = RandomSampler(pred_data)
pred_dataloader = DataLoader(pred_data, batch_size=bs)

but the result was very bad. If you have gotten good results, please share.
04/08/2019 at 5:33 am
Even I wanted some help regarding how to use the trained BERT model to do the prediction. If someone could help with this it would be very useful. Thanks in advance.
04/08/2019 at 2:11 pm
I’m experiencing strange “non deterministic” effect. Running BertForTokenClassification for the same sentence several times I’ve got different results.
Any thoughts?
04/08/2019 at 3:55 pm
This is a probabilistic model so some randomness has to be expected, especially with small datasets.
04/08/2019 at 2:33 pm
Hi Tobias, Thank you for article.
I’ve got some issues.
– running classifier several times I’ve got different results.
– how to clean context between documents?
04/08/2019 at 3:55 pm
What do you mean by “context between documents”?
04/10/2019 at 8:09 am
Sorry for naive questions. I’m beginner in NLU.
I guess tags like [CLS] should instruct BERT to clean internal state between documents and paragraphs?
Is it correct?
I was trying to use it. But it seems that the default never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]") is not working in the tokenizer. And I don’t see those tags in the file vocab.txt.
04/15/2019 at 5:16 pm
Hello Peter,
Did you activate the train mode of pytorch running model.train() ? This could explain the different results.
04/16/2019 at 8:45 am
My bad, I should have said: did you activate the eval mode (model.eval()) instead of the train mode? In fact, if you do predictions while in train mode, layers like dropout are activated and can lead to bad results.
04/16/2019 at 1:44 pm
print(model.training)
False
04/28/2019 at 2:35 pm
Hi! Great tutorial! Just to add:
While I was creating pad_sequences I was getting an error that translated to "the argument given to pad_sequences is empty". On further inspection, I found that this line:
[[tag2idx.get(l) for l in lab] for lab in labels]
was breaking tokens down to character level: B-GEO became 'b', '-', 'g', 'e', 'o', and thus its tag2idx lookup was None everywhere.
To solve this I used the code from your previous LSTM tutorial:
[[tag2idx[w[2]] for w in s] for s in sentences]
Making the code this:
tags = pad_sequences([[tag2idx[w[2]] for w in s] for s in sentences],
                     maxlen=MAX_LEN, value=tag2idx["O"], padding="post",
                     dtype="long", truncating="post")
It works fine now. Thanks again for this blog post!
11/22/2019 at 1:34 am
Hey Buddy,
I’m having the same issues it’s thinking it’s a none type, but when i use your code it tells me string index out of range.
any ideas?
04/29/2019 at 10:00 am
Hey! Great post! I’ve a question on the max_norm parameter in clipping gradients. I see that it’s set to 1.0, how do you decide the right parameter to set here and what does it exactly mean to clip parameters to 1.0?
Would really appreciate your help here, thanks again!
04/29/2019 at 10:36 am
The 1.0 for clipping was picked because the researchers that created BERT used it for fine-tuning on NER tasks. In my experience it is more important to clip the gradient at all (just not with a too small value) than which exact value you use. 10.0 could also be used IMO.
One difficulty when training neural networks with the full gradient is that the derivatives sometimes become excessively large, leading to numerical problems. To prevent this, we clip the gradients to lie within a predefined range. In this case we shrink the gradient by `clip_coef = max_norm / (total_norm + 1e-6)` if clip_coef < 1; for example, with max_norm=1.0 and a total gradient norm of 5.0, all gradients are scaled by roughly 0.2. You can find more details about the implementation here: https://pytorch.org/docs/stable/_modules/torch/nn/utils/clip_grad.html
Does this help you?
05/01/2019 at 9:32 am
After the BERT tokenizer, the lengths of the tokenized sentences and their respective labels do not match,
so now the ground truth is not valid.
05/01/2019 at 4:17 pm
See other comments. You have to use a specific version of the package.
05/06/2019 at 8:12 am
I wonder, does this approach change the weights of the pretrained BERT or just the classification layer?
According to the code below
no_decay = ['bias', 'gamma', 'beta']
the optimizer regulates the other layers as well.
05/07/2019 at 5:46 am
Hi Tobias,
Great Post! I want to know if there is a way to train custom entities using BERT. I have large amount of unsupervised text corpus with a limited number of labelled data-set (40-50 examples usually) and i want my custom entities to be trained using that data-set.
How can i achieve this using BERT ? I’ve been looking for this but unable to find an appropriate solution for this.
Any help is highly appreciated!
Thanks
05/07/2019 at 6:13 am
This is definitely possible. Check out the github of pretrained-bert-pytorch for more information.
05/07/2019 at 9:34 am
I have already seen that, and it states that the fine-tuning requires a labelled dataset. I do not have much labelled data, as I already mentioned, but I do have unsupervised raw text. I don’t know if I can train (or how I can train) it in an unsupervised manner.
Can you point to a more specific resource on how to do this (preferably in Keras or TensorFlow)?
Thanks for reply!
05/07/2019 at 10:15 am
Hi Usman,
you can also fine-tune on the unlabeled data first and then fine-tune for the supervised task. I would recommend doing this with pytorch, but there should be a tensorflow implementation available since it was released in tensorflow first. You can find all of this information in the pretrained-bert-pytorch github readme.
05/15/2019 at 6:56 am
That Helps.
Thank you!
05/11/2019 at 2:56 pm
Could you fix the indentation for In [24] and [23]? Or at least clarify?
Thanks for the excellent post
05/11/2019 at 3:00 pm
Now it’s ok! thanks!
05/11/2019 at 3:45 pm
Haven’t changed anything, but I’m happy you like it. 🙂
05/12/2019 at 5:40 pm
Tobias I like it a lot, and it is much appreciated!
Regarding my comment, about cells 23,24, and the indentation that is missing I can say that when I normally see the page via https://www.depends-on-the-definition.com/named-entity-recognition-with-bert/ I have problem with the identation, but there is no issue when I tried to add a comment seeing the post via the following link: https://www.depends-on-the-definition.com/named-entity-recognition-with-bert/?unapproved=839&moderation-hash=37d4f8fb7dd65dd979f59df9bb8c1c5e#comment-839
05/13/2019 at 12:17 pm
Thanks a lot Antonis. I will investigate this 🙂
05/14/2019 at 9:16 am
Hi Tobias,
How can we calculate training accuracy? In the below code is there any way to get the training accuracy along with the loss?
loss = model(b_input_ids, token_type_ids=None,
attention_mask=b_input_mask, labels=b_labels)
05/19/2019 at 6:48 pm
Hi, I’m trying to run your code. But in the training section (specifically at this line:
loss = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
I get the following error:
RuntimeError: Expected tensor for argument #1 ‘indices’ to have scalar type Long; but got CUDAType instead (while checking arguments for embedding)
If I don’t use CUDA I’ll get the same error but it finds CPUType instead.
05/29/2019 at 4:40 pm
Hi Daniel.
There is another comment about this problem. Try to replace in code cell [17] the term ".tensor" with ".LongTensor" and run everything again.
Example:
xxx = torch.LongTensor(yyy)
05/29/2019 at 3:57 pm
Very nice work! Unbelievable!
Can I ask you if there’s a problem of indentation in the last snippet of code? I was trying to copy and paste but of course there’s a problem of indentation, I tried to do it by myself, but given that I’m a noob I made some mistakes for sure (even if the code is running).
05/31/2019 at 12:38 pm
Thanks for sharing your work.
I am able to get the predictions just fine.
But I am finding it difficult to map the input batch to the original word form after getting the predictions.
Can you please give any suggestions, because I am working on a different dataset?
07/21/2019 at 1:10 pm
How do I save the trained model? And also, can you please share the test code for testing this model?
Thanks in advance
07/25/2019 at 6:48 am
I want to make a model which finds entities (company name, person name, address, phone number, email, etc.) on business cards, but I don’t know where to start. We are facing some problems: our business card text is unstructured, and the person name and the company name may be the same. So how should I start? Thanks.
07/30/2019 at 7:52 pm
A cool and informative tutorial and equally enlightening discussions.
Won’t it be a problem treating B- and I- separately ? How do we interpret the result if the sequence of labels predicted by the model is “O I- B-” or “O I- I- O”.
Does demarcating the beginning and intermediate tokens of a mention span add any advantage ?
08/01/2019 at 2:11 pm
Hi Tobias,
Very nice work !
I have a question about the tokenization.
I know the BERT tokenizer uses wordpiece tokenization, but when it splits a word into a prefix and a suffix, what happens to the tag of the word?
For example :
The word indian is tagged with B-gpe, let’s say it is tokenized as “in” and “##dian”.
Then what is the label corresponding to “in” and what is the label corresponding to “##dian” ?
08/18/2019 at 5:58 am
Hi,
When you tokenize a word and the BERT tokenizer splits a given word into more than one token (e.g. “aardvark” to [‘aa’, ‘##rd’, ‘##var’, ‘##k’]), then wouldn’t you need to extend the tags/labels accordingly? Otherwise you may be mapping other unrelated labels to the sub-tokens ‘##rd’, ‘##var’, ‘##k’.
Best,
Alvaro
09/11/2019 at 10:23 am
Hello,
thank you for the code and the explanation,
in my case I have multi-word targets; how can I tag a multi-word target after word tokenization?
For example, for the sentence “I’m playing video-games”, after tokenization I will have “video”, “-”, “games”, while my target is “video-games”.
Thank you in advance.
10/04/2019 at 7:08 am
Please let me know how to test with any input
11/19/2019 at 5:44 am
Yes please
11/19/2019 at 10:05 am
You have to prepare the data as in the evaluation case and put them through the model. Basically you do tokenization and indexation, transform the integer sequences to torch tensors and pass them through the model.
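A minimal sketch of this, assuming the tokenizer, model, MAX_LEN, device and tags_vals from the post are still available (the example sentence is made up):
test_sentence = "Mr. Trump visited the United Nations in New York."  # made-up example
tokenized = tokenizer.tokenize(test_sentence)
ids = pad_sequences([tokenizer.convert_tokens_to_ids(tokenized)],
                    maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
input_ids = torch.tensor(ids).to(device)
input_mask = torch.tensor([[float(i > 0) for i in ids[0]]]).to(device)
model.eval()
with torch.no_grad():
    logits = model(input_ids, token_type_ids=None, attention_mask=input_mask)
preds = np.argmax(logits.detach().cpu().numpy(), axis=2)[0]
# zip stops at the real tokens, so the padded positions are dropped again
print([(token, tags_vals[p]) for token, p in zip(tokenized, preds)])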
10/11/2019 at 7:00 am
I am facing this issue. Can anybody help?
Can’t find a vocabulary file at path ‘models/bert-base-cased/vocab.txt’. To load the vocabulary from a Google pretrained model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`
10/20/2019 at 1:36 pm
Hi, why do you use RandomSampler during training? Will this add anything, or can be omitted?
11/11/2019 at 2:33 pm
Hi Tobias Sterbak,
Thanks for such a detailed post. I am trying this out for my project.
I am using it for Entity Detection hence using B-ENT,I-ENT, O format.
The only issue that I am now facing is while retrieving back the word text and labels.
I am using something like this :
val_inputtext = [tokenizer.convert_ids_to_tokens(ids) for ids in val_inputs.to('cpu').numpy()]
Once I get back the words from this I am seeing that a word like ecg has now split into two words like
ec ,
##g
This certainly is an issue due to the masks that we have applied. I dont think this was our intended behaviour . Did you also face this ?Could you please look into this?
11/11/2019 at 2:46 pm
This is related to the wordpiece tokenizer and expected behavior.
11/11/2019 at 3:36 pm
Thanks Tobias , how do I retrieve the right words and labels ? Could you please suggest?
1. How do I correctly revert the PAD and mask behaviour?
val_inputtext = [tokenizer.convert_ids_to_tokens(ids) for ids in val_inputs.to('cpu').numpy()]
this code is giving me masks and pads.
2 . Also , my data set has lots of ‘O’ and very less entities , what tuning do you suggest to get better results?
Thanks again 🙂
11/17/2019 at 9:30 am
Could you update the tutorial for the new version of the library?
11/19/2019 at 10:06 am
There shouldn’t be much to change here, if I have time i’ll do it. Until then, this might help you: https://huggingface.co/transformers/migration.html
11/24/2019 at 4:20 pm
Don’t we need to add [CLS] symbol at the beginning of each sentence and [SEP] at the end? Also do we need to pad the sentences with pad_sequences keras function? I mean do we need to pad at all? Doesn’t the library does it for us? Thanks!
For your info, I see huge degradation in performance once I migrated to transformers. I suspect, that if it is not any bug in my changes, this could be due to the (non trivial?) changes we have to do for the optimizers policy.
https://github.com/huggingface/transformers/blob/master/docs/source/migration.md#optimizers-bertadam–openaiadam-are-now-adamw-schedules-are-standard-pytorch-schedules
11/17/2019 at 9:35 am
I have a dataset with BIO labels for each word, but where sentences constitute essays. And maybe there are long term dependencies. Do you think that I could benefit, if I used an essay (4-5 sentences each) as a sentence? First of all is this practically possible? I was thinking to give to the network 1) the sentences, as you do AND also 2) the essays (4-5 sentences). This means that the sentences will be seen two times from the network. But maybe there are issues with the gpu memory. What do you think? Could you provide some feedback with your thoughts?
12/03/2019 at 10:43 am
Google is now working more towards quality content, and easily search-able content and I think BERT update will enforce the voice optimization, even more.
12/04/2019 at 4:28 pm
Nice explanation, however you should use `bert_base_cased` instead of the uncased model. Unless your training data are lower-cased.
The performance will reach 91 F1 for CoNLL once you fix it.
12/08/2019 at 5:50 pm
Hi, I wonder about these wordpieces and their labels, for example: Indian => Ind + ##ian. What happens with the output label? What will happen to their labels (B-gpe)? Do we have to convert the label to B-gpe(Ind) I-gpe(##ian)? Hope to see your answer! Thanks so much!!
12/10/2019 at 8:06 am
Hi Duc,
I would go with converting the label as you did. Depending on the performance I would try other options.
Let me know how it works for you 🙂