# Guide to word vectors with gensim and keras ## Word vectors

Today, I tell you what word vectors are, how you create them in python and finally how you can use them with neural networks in keras. For a long time, NLP methods use a vectorspace model to represent words. Commonly one-hot encoded vectors are used. This traditional, so called Bag of Words approach is pretty successful for a lot of tasks.

Recently, new methods for representing words in a vectorspace have been proposed and yielded big improvements in a lot of different NLP tasks. We will discuss some of these methods and see how to create this vectors in python.

### Continous Bag of Words

The basic idea grounds in the distributional hypothises, which states that the meaning of a word can be inferred from it’s surrounding words. So the idea of the continuous bag of words (CBOW) is to use the context of a word to predict the probability that this word appears. The image shows a simple CBOW model with only one word in the context window. So the input is a one-hot encoded context word and a hidden size of $N$ and a vocabulary size of $V$. After training this little network the columns of $W^\prime$ contains a $N$-dimensional vector representation for every word in the vocabulary. This is what we use as word vector.

For more details I recommend to read the following:

Or directly the main papers:

A great blog post for this topic you can find here.

We use the dataset from the “Toxic Comment Classification Challenge”, a recent kaggle competition, where you’re challenged to build a multi-headed model that’s capable of detecting different types of of toxicity like threats, obscenity, insults, and identity-based hate. We first preprocess the comments, and train word vectors. Then we initialize a keras embedding layer with the pretrained word vectors and compare the performance with an randomly initialized embedding. On top of the embeddings an LSTM with dropout is used.

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
plt.style.use("ggplot")

path = 'data/'

TRAIN_DATA_FILE = path + 'train.csv'
TEST_DATA_FILE = path + 'test.csv'

train_df = pd.read_csv(TRAIN_DATA_FILE)

train_df.head(10)


idcomment_texttoxicsevere_toxicobscenethreatinsultidentity_hate
00000997932d777bfExplanation\nWhy the edits made under my usern...000000
1000103f0d9cfb60fD'aww! He matches this background colour I'm s...000000
2000113f07ec002fdHey man, I'm really not trying to edit war. It...000000
30001b41b1c6bb37e"\nMore\nI can't make any real suggestions on ...000000
40001d958c54c6e35You, sir, are my hero. Any chance you remember...000000
500025465d4725e87"\n\nCongratulations from me as well, use the ...000000
60002bcb3da6cb337COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK111010
700031b1e95af7921Your vandalism to the Matt Shirvington article...000000
800037261f536c51dSorry if the word 'nonsense' was offensive to ...000000
900040093b2687caaalignment on this subject and which are contra...000000

## Preprocess the text

Now we preprocess the comments. For this we replace URLs by “URL” and IP addresses by “IPADDRESS”. Then we tokenize the text and lowercase it.

########################################
## process texts in datasets
########################################
print('Processing text dataset')
from nltk.tokenize import WordPunctTokenizer
from collections import Counter
from string import punctuation, ascii_lowercase
import regex as re
from tqdm import tqdm

# replace urls
re_url = re.compile(r"((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\
.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*",
re.MULTILINE|re.UNICODE)
# replace ips
re_ip = re.compile("\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

# setup tokenizer
tokenizer = WordPunctTokenizer()

vocab = Counter()

def text_to_wordlist(text, lower=False):
# replace URLs
text = re_url.sub("URL", text)

# replace IPs

# Tokenize
text = tokenizer.tokenize(text)

# optional: lower case
if lower:
text = [t.lower() for t in text]

# Return a list of words
vocab.update(text)
return text

for text in tqdm(list_sentences):
txt = text_to_wordlist(text, lower=lower)

list_sentences_train = list(train_df["comment_text"].fillna("NAN_WORD").values)
list_sentences_test = list(test_df["comment_text"].fillna("NAN_WORD").values)


Processing text dataset

100%|██████████| 312735/312735 [00:17<00:00, 18030.28it/s]

print("The vocabulary contains {} unique tokens".format(len(vocab)))

The vocabulary contains 365516 unique tokens


## Model the word vectors with Gensim

Now we are ready to train the word vectors. We use the gensim library in python which supports a bunch of classes for NLP applications. As discussed, we use a CBOW model with negative sampling and 100 dimensional word vectors.

from gensim.models import Word2Vec

model = Word2Vec(comments, size=100, window=5, min_count=5, workers=16, sg=0, negative=5)

word_vectors = model.wv

print("Number of word vectors: {}".format(len(word_vectors.vocab)))

Number of word vectors: 70056


Let’s see if we have trained semantically reasonable word vectors.

model.wv.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])

[('prince', 0.9849334359169006),
('queen', 0.9684107899665833),
('princess', 0.9518582820892334),
('bishop', 0.9380313158035278),
('duke', 0.9368391633033752),
('duchess', 0.9353090524673462),
('victoria', 0.920809805393219),
('mary', 0.9180552363395691),
('mayor', 0.912704348564148),
('prussia', 0.9100297093391418)]


That looks really good! Now we can go on to the text classification model.

## Initialize the embeddings in keras

First we have to pad or cut the tokenized comments to a certain length.

MAX_NB_WORDS = len(word_vectors.vocab)
MAX_SEQUENCE_LENGTH = 200

from keras.preprocessing.sequence import pad_sequences

word_index = {t: i+1 for i,t in enumerate(vocab.most_common(MAX_NB_WORDS))}
sequences = [[word_index.get(t, 0) for t in comment]
test_sequences = [[word_index.get(t, 0)  for t in comment]

list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train_df[list_classes].values
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', y.shape)

truncating="post")
print('Shape of test_data tensor:', test_data.shape)

Using TensorFlow backend.

Shape of data tensor: (159571, 200)
Shape of label tensor: (159571, 6)
Shape of test_data tensor: (153164, 200)


Now we finally create the embedding matrix. This is what we will feed to the keras embedding layer. Note, that you can use the same code to easily initialize the embeddings with Glove or other pretrained word vectors.

WV_DIM = 100
nb_words = min(MAX_NB_WORDS, len(word_vectors.vocab))
# we initialize the matrix with random numbers
wv_matrix = (np.random.rand(nb_words, WV_DIM) - 0.5) / 5.0
for word, i in word_index.items():
if i >= MAX_NB_WORDS:
continue
try:
embedding_vector = word_vectors[word]
wv_matrix[i] = embedding_vector
except:
pass


## Setup the comment classifier

from keras.layers import Dense, Input, CuDNNLSTM, Embedding, Dropout,SpatialDropout1D, Bidirectional
from keras.models import Model
from keras.layers.normalization import BatchNormalization


We use a bidirectional LSTM with dropout and batch normalization. Note that we use the CUDA implementation of keras, which runs much faster on GPU than the normal LSTM layer.

wv_layer = Embedding(nb_words,
WV_DIM,
weights=[wv_matrix],
input_length=MAX_SEQUENCE_LENGTH,
trainable=False)

# Inputs
comment_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = wv_layer(comment_input)

# biGRU
embedded_sequences = SpatialDropout1D(0.2)(embedded_sequences)
x = Bidirectional(CuDNNLSTM(64, return_sequences=False))(embedded_sequences)

# Output
x = Dropout(0.2)(x)
x = BatchNormalization()(x)
preds = Dense(6, activation='sigmoid')(x)

# build the model
model = Model(inputs=[comment_input], outputs=preds)
model.compile(loss='binary_crossentropy',
metrics=[])

hist = model.fit([data], y, validation_split=0.1,
epochs=10, batch_size=256, shuffle=True)

Train on 143613 samples, validate on 15958 samples
Epoch 1/10
143613/143613 [==============================] - 15s 103us/step - loss: 0.1681 - val_loss: 0.0551
Epoch 2/10
143613/143613 [==============================] - 14s 97us/step - loss: 0.0574 - val_loss: 0.0502
Epoch 3/10
143613/143613 [==============================] - 14s 97us/step - loss: 0.0522 - val_loss: 0.0486
Epoch 4/10
143613/143613 [==============================] - 14s 96us/step - loss: 0.0501 - val_loss: 0.0466
Epoch 5/10
143613/143613 [==============================] - 14s 97us/step - loss: 0.0472 - val_loss: 0.0464
Epoch 6/10
143613/143613 [==============================] - 14s 97us/step - loss: 0.0460 - val_loss: 0.0448
Epoch 7/10
143613/143613 [==============================] - 14s 97us/step - loss: 0.0448 - val_loss: 0.0448
Epoch 8/10
143613/143613 [==============================] - 14s 97us/step - loss: 0.0438 - val_loss: 0.0453
Epoch 9/10
143613/143613 [==============================] - 14s 98us/step - loss: 0.0431 - val_loss: 0.0446
Epoch 10/10
143613/143613 [==============================] - 14s 97us/step - loss: 0.0425 - val_loss: 0.0454


Let’s look at the training loss compared to the validation loss.

history = pd.DataFrame(hist.history)
plt.figure(figsize=(12,12));
plt.plot(history["loss"]);
plt.plot(history["val_loss"]);
plt.title("Loss with pretrained word vectors");
plt.show(); It looks like the loss is decreasing nicely, but there is still room for improvement. You can try to play with the embeddings, the dropout and the architecture of the network. Now we want to compare the pretrained word vectors with randomly initialized embeddings.

wv_layer = Embedding(nb_words,
WV_DIM,
# weights=[wv_matrix],
input_length=MAX_SEQUENCE_LENGTH,
trainable=False)

# Inputs
comment_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = wv_layer(comment_input)

# biGRU
embedded_sequences = SpatialDropout1D(0.2)(embedded_sequences)
x = Bidirectional(CuDNNLSTM(64, return_sequences=False))(embedded_sequences)

# Output
x = Dropout(0.2)(x)
x = BatchNormalization()(x)
preds = Dense(6, activation='sigmoid')(x)

# build the model
model = Model(inputs=[comment_input], outputs=preds)
model.compile(loss='binary_crossentropy',
metrics=[])

hist = model.fit([data], y, validation_split=0.1,
epochs=10, batch_size=256, shuffle=True)

Train on 143613 samples, validate on 15958 samples
Epoch 1/10
143613/143613 [==============================] - 14s 99us/step - loss: 0.1800 - val_loss: 0.1085
Epoch 2/10
143613/143613 [==============================] - 14s 97us/step - loss: 0.1117 - val_loss: 0.1063
Epoch 3/10
143613/143613 [==============================] - 14s 98us/step - loss: 0.1063 - val_loss: 0.0990
Epoch 4/10
143613/143613 [==============================] - 14s 97us/step - loss: 0.0993 - val_loss: 0.1131
Epoch 5/10
143613/143613 [==============================] - 14s 98us/step - loss: 0.0951 - val_loss: 0.0999
Epoch 6/10
143613/143613 [==============================] - 14s 97us/step - loss: 0.0922 - val_loss: 0.0907
Epoch 7/10
143613/143613 [==============================] - 14s 98us/step - loss: 0.0902 - val_loss: 0.0867
Epoch 8/10
143613/143613 [==============================] - 14s 97us/step - loss: 0.0868 - val_loss: 0.0850
Epoch 9/10
143613/143613 [==============================] - 14s 97us/step - loss: 0.0838 - val_loss: 0.1060
Epoch 10/10
143613/143613 [==============================] - 14s 97us/step - loss: 0.0821 - val_loss: 0.0860

history = pd.DataFrame(hist.history)
plt.figure(figsize=(12,12));
plt.plot(history["loss"]);
plt.plot(history["val_loss"]);
plt.title("Loss with random word vectors");
plt.show(); We see the loss decreasing much slower and the validation loss is pretty unstable.

I hope you enjoyed this post and it is useful to you. Have fun with word vectors! 