Today, I want to show you how you can build an NLP application without explicitly labeled data. I use the “German Recipes Dataset” I recently published on kaggle, to build a neural network model, that can identify ingredients in cooking recipes. We have more than 12000 German recipes and their ingredients list. First we will generate labels for every word in the recipe, if it is an ingredient or not. Then we use a sequence-to-sequence neural network to tag every word like in a named entity recognition task. Then we pseudo-label the training set and update the model with the new labels.

In [1]:
import numpy as np
import pandas as pd

Load the recipes

In [2]:
df = pd.read_json("../input/recipes.json")
In [67]:
df.Instructions[2]
Out[67]:
'Die Kirschen abtropfen lassen, dabei den Saft auffangen. Das Puddingpulver mit dem Vanillezucker mischen und mit 6 EL Saft glatt rühren. Den übrigen Kirschsaft aufkochen und vom Herd nehmen. Das angerührte Puddingpulver einrühren und unter Rühren ca. eine Minute köcheln. Die Kirschen unter den angedickten Saft geben. Milch, 40 g Zucker, Vanillemark und Butter aufkochen. Den Topf vom Herd ziehen und den Grieß unter Rühren einstreuen. Unter Rühren einmal aufkochen lassen und zugedeckt ca. fünf Minuten quellen lassen.In der Zeit das Ei trennen. Das Eiweiß mit einer Prise Salz steif schlagen und dabei die restlichen 20 g Zucker einrieseln lassen. Das Eigelb unter den Brei rühren und dann das Eiweiß unterheben.Den Grießbrei mit dem Kompott servieren.'

We put some recipes aside for later evaluation.

In [3]:
eval_df = df[11000:]
eval_df.shape
Out[3]:
(1190, 8)
In [4]:
df = df[:11000]
df.shape
Out[4]:
(11000, 8)

Tokenize the texts with spacy

In [5]:
!python -m spacy download de_core_news_sm
Collecting de_core_news_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.0.0/de_core_news_sm-2.0.0.tar.gz#egg=de_core_news_sm==2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.0.0/de_core_news_sm-2.0.0.tar.gz (38.2MB)
    100% |████████████████████████████████| 38.2MB 85.4MB/s ta 0:00:011  19% |██████▏                         | 7.3MB 87.8MB/s eta 0:00:01
Installing collected packages: de-core-news-sm
  Running setup.py install for de-core-news-sm ... done
Successfully installed de-core-news-sm-2.0.0

    Linking successful
    /opt/conda/lib/python3.6/site-packages/de_core_news_sm -->
    /opt/conda/lib/python3.6/site-packages/spacy/data/de_core_news_sm

    You can now load the model via spacy.load('de_core_news_sm')

In [6]:
import spacy
nlp = spacy.load('de_core_news_sm', disable=['parser', 'tagger', 'ner'])

We run the spacy tokenizer on all instructions.

In [7]:
tokenized = [nlp(t) for t in df.Instructions.values]

And now we build a vocabulary of known tokens.

In [8]:
vocab = {"<UNK>": 1, "<PAD>": 0}
for txt in tokenized:
    for token in txt:
        if token.text not in vocab.keys():
            vocab[token.text] = len(vocab)
print("Number of unique tokens: {}".format(len(vocab)))
Number of unique tokens: 17687

Create the labels

What is missing now, are the labels. We need to know where in the text are ingredients. We will try to bootstrap this information from the provided ingredints list.

In [9]:
ingredients = df.Ingredients
In [10]:
ingredients[0]
Out[10]:
['600 g Hackfleisch, halb und halb',
 '800 g Sauerkraut',
 '200 g Wurst, geräucherte (Csabai Kolbász)',
 '150 g Speck, durchwachsener, geräucherter',
 '100 g Reis',
 '1 m.-große Zwiebel(n)',
 '1 Zehe/n Knoblauch',
 '2 Becher Schmand',
 '1/2TL Kümmel, ganzer',
 '2 Lorbeerblätter',
 'Salz und Pfeffer',
 '4 Ei(er) (bei Bedarf)',
 'Paprikapulver',
 'etwas Wasser',
 'Öl']

We first clean the ingredients lists from stopwords, numbers and other stuff.

In [11]:
def _filter(token):
    if len(token) < 2:
        return False
    if token.is_stop:
        return False
    if token.text[0].islower():
        return False
    if token.is_digit:
        return False
    if token.like_num:
        return False
    return True
In [12]:
def _clean(text):
    text = text.replace("(", "")
    text = text.split("/")[0]
    return text
In [13]:
clean = [_clean(t.text) for i in ingredients[214] for t in nlp(i) if _filter(t) and len(_clean(t.text)) >= 2]
clean
Out[13]:
['Rosenkohl',
 'Schalotten',
 'Hühnerbrühe',
 'Milch',
 'EL',
 'Crème',
 'Speck',
 'Kartoffelgnocchi']
In [15]:
def get_labels(ingredients, tokenized_instructions):
    labels = []
    for ing, ti in zip(ingredients, tokenized_instructions):
        l_i = []
        ci = [_clean(t.text) for i in ing for t in nlp(i) if _filter(t) and len(_clean(t.text)) >= 2]
        label = []
        for token in ti:
            l_i.append(any((c == token.text or c == token.text[:-1] or c[:-1] == token.text) for c in ci))
        labels.append(l_i)
    return labels
In [16]:
labels = get_labels(ingredients, tokenized)
In [17]:
set([t.text for t, l in zip(tokenized[214], labels[214]) if l])
Out[17]:
{'Crème', 'Hühnerbrühe', 'Milch', 'Rosenkohl', 'Schalotten', 'Speck'}

Modelling with a LSTM network

First we have to look at the length of our recipes, to determin the length we want to pad our inputs for the network to.

In [18]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.hist([len([t for t in tokens]) for tokens in tokenized], bins=20);

We picked a maximum length of 400 tokens.

In [19]:
MAX_LEN = 400

Prepare the sequences by padding

Now we pad the sequences and map the words to integers.

In [20]:
from keras.preprocessing.sequence import pad_sequences

def prepare_sequences(texts, max_len, vocab={"<UNK>": 1, "<PAD>": 0}):
    X = [[vocab.get(w.text, vocab["<UNK>"]) for w in s] for s in texts]
    return pad_sequences(maxlen=max_len, sequences=X, padding="post", value=vocab["<PAD>"])
Using TensorFlow backend.
In [21]:
X_seq = prepare_sequences(tokenized, max_len=MAX_LEN, vocab=vocab)
In [22]:
X_seq[1]
Out[22]:
array([192, 193, 194, 183, 195, 196, 128, 197,   9, 198, 199, 200, 201,
       202, 203,  60, 204, 205,   9,  13, 206,  15,  23,  98, 207, 208,
        51, 209,  68, 202, 203,  25,   6, 195, 125, 202, 210, 211, 212,
        33,  45, 213, 214, 100, 196,  13, 215, 216, 217,  33,   9,  68,
       218, 219, 213, 169,  35,  82, 100, 220, 221, 202,   6, 222,  45,
       223,  48, 224,  33,  67, 225, 100, 226,   6, 227, 228, 229, 130,
        45,  92,  85, 230, 211, 231,   6, 232, 233, 234, 235, 145, 157,
       236,   9, 237, 238, 104, 239, 210, 240, 157, 241,  54,   6, 109,
       242, 243, 244, 245, 246, 187, 247,   6, 248, 183, 249, 250,  33,
       129,  13, 251, 252, 101, 253,  33, 254,   9,  31, 255,  40, 172,
         6,   2, 256, 257, 177, 258, 259, 260,  33,  42, 261, 262, 263,
       131, 264, 265, 266,  33, 267,  74, 268, 269,  68, 270,   6,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0], dtype=int32)
In [23]:
y_seq = []
for l in labels:
    y_i = []
    for i in range(MAX_LEN):
        try:
            y_i.append(float(l[i]))
        except:
            y_i.append(0.0)
    y_seq.append(np.array(y_i))
y_seq = np.array(y_seq)
y_seq = y_seq.reshape(y_seq.shape[0], y_seq.shape[1], 1)

Setup the network

Now we can start to setup the model.

In [24]:
import tensorflow as tf
from tensorflow.keras import layers

print(tf.VERSION)
print(tf.keras.__version__)
1.12.0
2.1.6-tf

We build a simple 2-layer LSTM-based sequence tagger with tensorflow.keras.

In [25]:
model = tf.keras.Sequential()
model.add(layers.Embedding(input_dim=len(vocab), mask_zero=True, output_dim=50))
model.add(layers.SpatialDropout1D(0.2))
model.add(layers.Bidirectional(layers.LSTM(units=64, return_sequences=True)))
model.add(layers.SpatialDropout1D(0.2))
model.add(layers.Bidirectional(layers.LSTM(units=64, return_sequences=True)))
model.add(layers.TimeDistributed(layers.Dense(1, activation='sigmoid')))
In [26]:
model.compile(optimizer=tf.train.AdamOptimizer(0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
In [27]:
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 50)          884350    
_________________________________________________________________
spatial_dropout1d (SpatialDr (None, None, 50)          0         
_________________________________________________________________
bidirectional (Bidirectional (None, None, 128)         58880     
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, None, 128)         0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, None, 128)         98816     
_________________________________________________________________
time_distributed (TimeDistri (None, None, 1)           129       
=================================================================
Total params: 1,042,175
Trainable params: 1,042,175
Non-trainable params: 0
_________________________________________________________________

And now we fit it.

In [28]:
history = model.fit(X_seq, y_seq, epochs=10, batch_size=256, validation_split=0.1)
Train on 9900 samples, validate on 1100 samples
Epoch 1/10
9900/9900 [==============================] - 131s 13ms/step - loss: 0.3855 - acc: 0.9019 - val_loss: 0.2982 - val_acc: 0.9108
Epoch 2/10
9900/9900 [==============================] - 127s 13ms/step - loss: 0.2846 - acc: 0.9105 - val_loss: 0.2620 - val_acc: 0.9108
Epoch 3/10
9900/9900 [==============================] - 127s 13ms/step - loss: 0.2379 - acc: 0.9112 - val_loss: 0.1950 - val_acc: 0.9160
Epoch 4/10
9900/9900 [==============================] - 128s 13ms/step - loss: 0.1214 - acc: 0.9528 - val_loss: 0.0729 - val_acc: 0.9746
Epoch 5/10
9900/9900 [==============================] - 130s 13ms/step - loss: 0.0663 - acc: 0.9757 - val_loss: 0.0668 - val_acc: 0.9763
Epoch 6/10
9900/9900 [==============================] - 128s 13ms/step - loss: 0.0616 - acc: 0.9774 - val_loss: 0.0645 - val_acc: 0.9774
Epoch 7/10
9900/9900 [==============================] - 128s 13ms/step - loss: 0.0590 - acc: 0.9785 - val_loss: 0.0625 - val_acc: 0.9782
Epoch 8/10
9900/9900 [==============================] - 126s 13ms/step - loss: 0.0570 - acc: 0.9794 - val_loss: 0.0608 - val_acc: 0.9789
Epoch 9/10
9900/9900 [==============================] - 127s 13ms/step - loss: 0.0545 - acc: 0.9806 - val_loss: 0.0595 - val_acc: 0.9794
Epoch 10/10
9900/9900 [==============================] - 126s 13ms/step - loss: 0.0521 - acc: 0.9815 - val_loss: 0.0564 - val_acc: 0.9808
In [29]:
plt.plot(history.history["loss"], label="trn_loss");
plt.plot(history.history["val_loss"], label="val_loss");
plt.legend();
plt.title("Loss");
In [30]:
plt.plot(history.history["acc"], label="trn_acc");
plt.plot(history.history["val_acc"], label="val_acc");
plt.legend();
plt.title("Accuracy");

Analyse the predictions of the model

Now that the model is trained, we can look at some predictions on the training set.

In [49]:
y_pred = model.predict(X_seq, verbose=1, batch_size=1024)
11000/11000 [==============================] - 10s 945us/step
In [50]:
i = 3343
pred_i = y_pred[i] > 0.05
In [51]:
tokenized[i]
Out[51]:
Kohlrabi schälen, waschen und in Stifte schneiden. Brühe und Milch ankochen, Kohlrabi dazugeben, aufkochen lassen und 10 Minuten kochen. Dann herausnehmen und abtropfen lassen, die Brühe aufheben.Butter erhitzen, das Mehl darin anschwitzen, mit Kohlrabibrühe ablöschen und aufkochen lassen. Mit den Gewürzen abschmecken. Kohlrabi wieder dazugeben.Hähnchenbrust schnetzeln, kräftig anbraten und würzen. Das Fleisch in eine Auflaufform geben, die Speckwürfel darüber verteilen. Mit Käse bestreuen. Nun das Gemüse darüber schichten und alles bei 180 °C ca. 25 Minuten überbacken.Tipp:Man kann auch gut gekochte, in Würfel geschnittene Kartoffeln unter die Kohlrabi mischen. Ebenso kann man auch Kohlrabi und Möhren für den Auflauf nehmen. Schmeckt auch sehr lecker!
In [52]:
ingreds = [t.text for t, p in zip(tokenized[i], pred_i) if p]
print(set(ingreds))
{'Milch', 'Kartoffeln', 'Möhren', 'Kohlrabi', 'Butter', 'Gemüse', 'Brühe', 'Käse', 'Speckwürfel', 'Mehl'}
In [53]:
ingreds = [t.text for t, p in zip(tokenized[i], y_seq[i]) if p]
set(ingreds)
Out[53]:
{'Butter', 'Kohlrabi', 'Käse', 'Mehl', 'Milch'}
In [37]:
ingredients[i]
Out[37]:
['500 g Kohlrabi',
 '1/4Liter Hühnerbrühe',
 '1/4Liter Milch',
 '1 EL Butter',
 '30 g Mehl',
 '300 g Hähnchenbrustfilet(s)',
 'Salz und Pfeffer',
 'Muskat',
 '50 g Käse, gerieben',
 '50 g Speck, gewürfelt']

This looks very good! Our model seems to be able to identify the ingredients better than our training labels. So we now use the produced labels for fine-tuning the network.

In [39]:
new_labels = []
for pred_i, ti in zip(y_pred, tokenized):
    l_i = []
    ci = [t.text for t, p in zip(tokenized[i], pred_i > 0.05) if p]
    label = []
    for token in ti:
        l_i.append(any((c == token.text or c == token.text[:-1] or c[:-1] == token.text) for c in ci))
    new_labels.append(l_i)
In [40]:
y_seq_new = []
for l in new_labels:
    y_i = []
    for i in range(MAX_LEN):
        try:
            y_i.append(float(l[i]))
        except:
            y_i.append(0.0)
    y_seq_new.append(np.array(y_i))
y_seq_new = np.array(y_seq_new)
y_seq_new = y_seq.reshape(y_seq_new.shape[0], y_seq_new.shape[1], 1)

We fit the network again for one epoch with the new labels.

In [41]:
history = model.fit(X_seq, y_seq_new, epochs=1, batch_size=256, validation_split=0.1)
Train on 9900 samples, validate on 1100 samples
Epoch 1/1
9900/9900 [==============================] - 127s 13ms/step - loss: 0.0479 - acc: 0.9834 - val_loss: 0.0533 - val_acc: 0.9824

Look at test data

Now we can look at the test data we put aside in the beginning.

In [62]:
eval_ingredients = eval_df.Ingredients.values
In [56]:
eval_tokenized = [nlp(t) for t in eval_df.Instructions.values]

X_seq_test = prepare_sequences(eval_tokenized, max_len=MAX_LEN, vocab=vocab)
In [57]:
y_pred_test = model.predict(X_seq_test, verbose=1, batch_size=1024)
1190/1190 [==============================] - 2s 2ms/step
In [73]:
i = 893
pred_i = y_pred_test[i] > 0.05
print(eval_tokenized[i])
print()
print(eval_ingredients[i])
print()
ingreds = [t.text for t, p in zip(eval_tokenized[i], pred_i) if p]
print(set(ingreds))
Den Quark durch ein Sieb in eine tiefe Schüssel streichen.Das Mehl, den Zucker, Salz, Vanillezucker und das rohe Ei/er gut verrühren.Diese Masse auf einem mit Mehl bestreuten Backbrett zu einer dicken Wurst rollen und in 10 gleichgroße Scheiben schneiden. In heißer Butter von beiden Seiten goldbraun braten.Die fertigen Tworoshniki werden mit Puderzucker bestreut oder warm mit saurer Sahne oder Obstsirup zu Tisch gebracht.

['500 g Quark, sehr trockenen', '80 g Mehl', '2 EL Zucker', '1 Pck. Vanillezucker', 'Salz', '1 Ei(er), evt. 2', '4 EL Butter oder Margarine', 'Puderzucker', '125 ml Sirup (Obstsirup) oder saure Sahne', 'Mehl für die Arbeitsfläche']

{'Quark', 'Obstsirup', 'Butter', 'Vanillezucker', 'Puderzucker', 'Sahne', 'Zucker', 'Mehl', 'Salz'}
In [76]:
i = 26
pred_i = y_pred_test[i] > 0.05
print(eval_tokenized[i])
print()
print(eval_ingredients[i])
print()
ingreds = [t.text for t, p in zip(eval_tokenized[i], pred_i) if p]
print(set(ingreds))
Spargel putzen und bissfest garen. Herausnehmen, abschrecken und warm stellen.Fisch mit Salz und Pfeffer würzen. Öl in einer Pfanne erhitzen und den Lachs darin 3-4 Min. je Seite braten. Butter schmelzen, Mandeln hinzufügen und leicht bräunen. Schale der Limette mit einem Zestenreißer abziehen, den Saft auspressen, beides in die Butter geben. Mit Salz und Pfeffer würzen.Spargel abtropfen lassen, mit Lachs anrichten und mit Mandelbutter beträufeln.Dazu passen Salzkartoffeln.

['500 g Spargel, weißer', '500 g Spargel, grüner', 'Salz und Pfeffer', '4 Scheibe/n Lachsfilet(s) (à ca. 200g)', '2 EL Öl', '100 g Butter', '30 g Mandel(n) in Blättchen', '1 Limette(n), unbehandelt']

{'Pfeffer', 'Öl', 'Fisch', 'Saft', 'Butter', 'Limette', 'Lachs', 'Spargel', 'Mandeln', 'Salz'}

This looks quite good! We build a quite strong model to identify ingredients in recipes. I hope you learned something and had some fun. You can try to improve the model by manual labeling or adding labels from a dictionary of ingredients. Or you try some more advanced sequence labeling models like Character-LSTMs or Bert.

You might also be interested in: