Today, I want to show you how to build an NLP application without explicitly labeled data. I use the “German Recipes Dataset” I recently published on Kaggle to build a neural network model that can identify ingredients in cooking recipes. We have more than 12,000 German recipes together with their ingredient lists. First we generate a label for every word in the recipe instructions, indicating whether it is an ingredient or not. Then we use a sequence-to-sequence neural network to tag every word, as in a named entity recognition task. Finally, we pseudo-label the training set and update the model with the new labels.
import numpy as np
import pandas as pd
df = pd.read_json("../input/recipes.json")
df.Instructions[2]
We put some recipes aside for later evaluation.
eval_df = df[11000:]
eval_df.shape
df = df[:11000]
df.shape
!python -m spacy download de_core_news_sm
import spacy
nlp = spacy.load('de_core_news_sm', disable=['parser', 'tagger', 'ner'])
We run the spaCy tokenizer on all instructions.
tokenized = [nlp(t) for t in df.Instructions.values]
And now we build a vocabulary of known tokens.
vocab = {"<UNK>": 1, "<PAD>": 0}
for txt in tokenized:
for token in txt:
if token.text not in vocab.keys():
vocab[token.text] = len(vocab)
print("Number of unique tokens: {}".format(len(vocab)))
What is missing now are the labels. We need to know where in the text the ingredients occur. We will try to bootstrap this information from the provided ingredient lists.
ingredients = df.Ingredients
ingredients[0]
We first clean the ingredient lists of stopwords, numbers and other noise.
def _filter(token):
    # keep only capitalized, non-stopword, non-numeric tokens of at least two
    # characters -- German nouns are capitalized, so this keeps mostly nouns
    if len(token) < 2:
        return False
    if token.is_stop:
        return False
    if token.text[0].islower():
        return False
    if token.is_digit:
        return False
    if token.like_num:
        return False
    return True

def _clean(text):
    # drop opening parentheses and keep only the part before a "/"
    text = text.replace("(", "")
    text = text.split("/")[0]
    return text
clean = [_clean(t.text) for i in ingredients[214] for t in nlp(i) if _filter(t) and len(_clean(t.text)) >= 2]
clean
def get_labels(ingredients, tokenized_instructions):
    labels = []
    for ing, ti in zip(ingredients, tokenized_instructions):
        l_i = []
        # clean the ingredient list of this recipe
        ci = [_clean(t.text) for i in ing for t in nlp(i) if _filter(t) and len(_clean(t.text)) >= 2]
        for token in ti:
            # a token counts as an ingredient if it matches a cleaned ingredient word,
            # allowing a one-character difference at the end to catch simple inflections
            l_i.append(any((c == token.text or c == token.text[:-1] or c[:-1] == token.text) for c in ci))
        labels.append(l_i)
    return labels
labels = get_labels(ingredients, tokenized)
set([t.text for t, l in zip(tokenized[214], labels[214]) if l])
First we look at the length of our recipes to determine the length to which we want to pad our inputs for the network.
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist([len([t for t in tokens]) for tokens in tokenized], bins=20);
We picked a maximum length of 400 tokens.
MAX_LEN = 400
Now we pad the sequences and map the words to integers.
from keras.preprocessing.sequence import pad_sequences
def prepare_sequences(texts, max_len, vocab={"<UNK>": 1, "<PAD>": 0}):
    X = [[vocab.get(w.text, vocab["<UNK>"]) for w in s] for s in texts]
    return pad_sequences(maxlen=max_len, sequences=X, padding="post", value=vocab["<PAD>"])
X_seq = prepare_sequences(tokenized, max_len=MAX_LEN, vocab=vocab)
X_seq[1]
y_seq = []
for l in labels:
    y_i = []
    for i in range(MAX_LEN):
        try:
            y_i.append(float(l[i]))
        except IndexError:
            y_i.append(0.0)
    y_seq.append(np.array(y_i))
y_seq = np.array(y_seq)
y_seq = y_seq.reshape(y_seq.shape[0], y_seq.shape[1], 1)
Now we can start to set up the model.
import tensorflow as tf
from tensorflow.keras import layers
print(tf.VERSION)
print(tf.keras.__version__)
We build a simple 2-layer LSTM-based sequence tagger with tensorflow.keras.
model = tf.keras.Sequential()
model.add(layers.Embedding(input_dim=len(vocab), mask_zero=True, output_dim=50))
model.add(layers.SpatialDropout1D(0.2))
model.add(layers.Bidirectional(layers.LSTM(units=64, return_sequences=True)))
model.add(layers.SpatialDropout1D(0.2))
model.add(layers.Bidirectional(layers.LSTM(units=64, return_sequences=True)))
model.add(layers.TimeDistributed(layers.Dense(1, activation='sigmoid')))
model.compile(optimizer=tf.train.AdamOptimizer(0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
And now we fit it.
history = model.fit(X_seq, y_seq, epochs=10, batch_size=256, validation_split=0.1)
plt.plot(history.history["loss"], label="trn_loss");
plt.plot(history.history["val_loss"], label="val_loss");
plt.legend();
plt.title("Loss");
plt.plot(history.history["acc"], label="trn_acc");
plt.plot(history.history["val_acc"], label="val_acc");
plt.legend();
plt.title("Accuracy");
Now that the model is trained, we can look at some predictions on the training set.
y_pred = model.predict(X_seq, verbose=1, batch_size=1024)
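Before inspecting individual recipes, a quick sanity check: token-level accuracy is dominated by the many non-ingredient and padding positions, so a rough precision/recall computation against the bootstrapped labels is more informative. This is just a sketch; the 0.5 cut-off is an arbitrary choice, not tuned.
# flatten predictions and bootstrapped labels to token level
# (padding positions simply count as negatives here)
pred_bin = (y_pred > 0.5).astype(float).flatten()
true_bin = y_seq.flatten()
tp = ((pred_bin == 1) & (true_bin == 1)).sum()
fp = ((pred_bin == 1) & (true_bin == 0)).sum()
fn = ((pred_bin == 0) & (true_bin == 1)).sum()
print("precision: {:.3f}".format(tp / (tp + fp)))
print("recall:    {:.3f}".format(tp / (tp + fn)))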
i = 3343
pred_i = y_pred[i] > 0.05
tokenized[i]
# ingredients predicted by the model
ingreds = [t.text for t, p in zip(tokenized[i], pred_i) if p]
print(set(ingreds))
# ingredients according to the bootstrapped training labels
ingreds = [t.text for t, p in zip(tokenized[i], y_seq[i]) if p]
set(ingreds)
# the original ingredient list of this recipe
ingredients[i]
This looks very good! Our model seems to identify the ingredients better than our bootstrapped training labels do. So we now use the predicted labels to fine-tune the network.
new_labels = []
for pred_i, ti in zip(y_pred, tokenized):
    l_i = []
    # tokens the model predicts to be ingredients in this recipe
    ci = [t.text for t, p in zip(ti, pred_i > 0.05) if p]
    for token in ti:
        l_i.append(any((c == token.text or c == token.text[:-1] or c[:-1] == token.text) for c in ci))
    new_labels.append(l_i)
y_seq_new = []
for l in new_labels:
    y_i = []
    for i in range(MAX_LEN):
        try:
            y_i.append(float(l[i]))
        except IndexError:
            y_i.append(0.0)
    y_seq_new.append(np.array(y_i))
y_seq_new = np.array(y_seq_new)
y_seq_new = y_seq_new.reshape(y_seq_new.shape[0], y_seq_new.shape[1], 1)
We fit the network again for one epoch with the new labels.
history = model.fit(X_seq, y_seq_new, epochs=1, batch_size=256, validation_split=0.1)
Now we can look at the test data we set aside at the beginning.
eval_ingredients = eval_df.Ingredients.values
eval_tokenized = [nlp(t) for t in eval_df.Instructions.values]
X_seq_test = prepare_sequences(eval_tokenized, max_len=MAX_LEN, vocab=vocab)
y_pred_test = model.predict(X_seq_test, verbose=1, batch_size=1024)
i = 893
pred_i = y_pred_test[i] > 0.05
print(eval_tokenized[i])
print()
print(eval_ingredients[i])
print()
ingreds = [t.text for t, p in zip(eval_tokenized[i], pred_i) if p]
print(set(ingreds))
i = 26
pred_i = y_pred_test[i] > 0.05
print(eval_tokenized[i])
print()
print(eval_ingredients[i])
print()
ingreds = [t.text for t, p in zip(eval_tokenized[i], pred_i) if p]
print(set(ingreds))
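To tag a completely new recipe text, the whole pipeline can be wrapped into a small helper. This is only a convenience sketch reusing the nlp, prepare_sequences, vocab, MAX_LEN and model objects defined above; the 0.05 threshold is the same ad-hoc value used earlier.
def extract_ingredients(text, threshold=0.05):
    # tokenize and map to padded vocabulary ids, exactly as for training
    tokens = nlp(text)
    x = prepare_sequences([tokens], max_len=MAX_LEN, vocab=vocab)
    # predict a per-token ingredient probability and apply the threshold
    pred = model.predict(x)[0]
    return set(t.text for t, p in zip(tokens, pred > threshold) if p)

extract_ingredients(eval_df.Instructions.values[0])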
This looks quite good! We built a fairly strong model for identifying ingredients in recipes. I hope you learned something and had some fun. You can try to improve the model by manual labeling or by adding labels from a dictionary of ingredients; a rough sketch of the dictionary idea follows below. Or you could try more advanced sequence labeling models such as character-level LSTMs or BERT.
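As a starting point for the dictionary idea, the bootstrapped labels could simply be OR-ed with a lookup in a hand-curated set of ingredient words. The word list below is only a hypothetical example, not part of the dataset.
# hypothetical, hand-curated ingredient words -- in practice a much longer list
ingredient_dict = {"Mehl", "Zucker", "Butter", "Salz", "Zwiebel"}

def get_labels_with_dict(ingredients, tokenized_instructions, dictionary):
    labels = get_labels(ingredients, tokenized_instructions)
    # a token counts as an ingredient if either the bootstrapping or the dictionary says so
    return [[l or (t.text in dictionary) for t, l in zip(ti, l_i)]
            for ti, l_i in zip(tokenized_instructions, labels)]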