Deep learning models have shown amazing performance in a lot of fields such as autonomous driving, manufacturing, and medicine, to name a few. However, these are fields in which representing model uncertainty is of crucial importance. The standard deep learning tools for regression and classification do not capture model uncertainty. In classification, predictive probabilities obtained at the end of the pipeline (the softmax output) are often erroneously interpreted as model confidence. Gal et. al argue, that a model can be uncertain in its predictions even with a high softmax output. Passing a point estimate of a function through a softmax results in extrapolations with unjustified high confidence for points far from the training data.

Model uncertainty is indispensable for the deep learning practitioner as well. With model confidence at hand we can treat uncertain inputs and special cases explicitly. I recommend Vincent Warmerdam amazing talk “How to Constrain Artificial Stupidit” from PyData London 2019 for a general view on the topic. For example, in the case of classification, a model might return a result with high uncertainty. In this case we might decide to pass the input to a human for classification.

Gal et. al show that the use of dropout in neural networks can be interpreted as a Bayesian approximation of a Gaussian process, a well known probabilistic model. Dropout is used in many models in deep learning as a way to avoid over-fitting, and they show that dropout approximately integrates over the models weights. In this article we will see how to represent model uncertainty of existing dropout neural networks with keras. This approach, called Monte Carlo dropout, will mitigates the problem of representing model uncertainty in deep learning without sacrificing either computational complexity or test accuracy and can be used for all kind of models trained with dropout.

# Load the MNIST data

We will use the MNIST dataset to explore the proposed method of Monte Carlo dropout. We can conveniently load it with keras.

```
from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential, Model, Input
from keras.layers import Dense, Dropout, Flatten from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
plt.style.use("ggplot")
%matplotlib inline
```

```
batch_size = 128
num_classes = 10
epochs = 12
# input image dimensions
img_rows, img_cols = 28, 28
```

```
# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()
```

```
if K.image_data_format() == 'channels_first':
x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
input_shape = (1, img_rows, img_cols)
else:
x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
input_shape = (img_rows, img_cols, 1)
```

```
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
```

```
# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
```

# Build Monte Carlo Model and run experiments

Modelling uncertainty with Monte Carlo dropout works by running multiple forward passes trough the model with a different dropout masks every time. Let’s say we are given a trained neural network model with dropout

By computing the average and the variance of this sample we get an ensemble prediction, which is the mean of the models posterior distribution for this sample and an estimate of the uncertainty of the model regarding

Note that the dropout NN model itself is not changed. To estimate the predictive mean and predictive uncertainty we simply collect the results of stochastic forward passes through the model. As a result, this information can be used with existing NN models trained with dropout.

To achieve this in keras, we have to use the functional api and setup dropout this way: `Dropout(p)(input_tensor, training=True)`

. We build a simple convolutional network with a without Monte Carlo dropout and compare their properties.

```
def get_dropout(input_tensor, p=0.5, mc=False):
if mc:
return Dropout(p)(input_tensor, training=True)
else:
return Dropout(p)(input_tensor)
def get_model(mc=False, act="relu"):
inp = Input(input_shape)
x = Conv2D(32, kernel_size=(3, 3), activation=act)(inp)
x = Conv2D(64, kernel_size=(3, 3), activation=act)(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = get_dropout(x, p=0.25, mc=mc)
x = Flatten()(x)
x = Dense(128, activation=act)(x)
x = get_dropout(x, p=0.5, mc=mc)
out = Dense(num_classes, activation='softmax')(x)
model = Model(inputs=inp, outputs=out)
model.compile(loss=keras.losses.categorical_crossentropy,
optimizer=keras.optimizers.Adadelta(),
metrics=['accuracy'])
return model
```

```
model = get_model(mc=False, act="relu")
mc_model = get_model(mc=True, act="relu")
```

```
h = model.fit(x_train, y_train,
batch_size=batch_size,
epochs=10,
verbose=1,
validation_data=(x_test, y_test))
```

```
# score of the normal model
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
```

```
h_mc = mc_model.fit(x_train, y_train,
batch_size=batch_size,
epochs=10,
verbose=1,
validation_data=(x_test, y_test))
```

```
import tqdm
mc_predictions = []
for i in tqdm.tqdm(range(500)):
y_p = mc_model.predict(x_test, batch_size=1000)
mc_predictions.append(y_p)
```

```
# score of the mc model
accs = []
for y_p in mc_predictions:
acc = accuracy_score(y_test.argmax(axis=1), y_p.argmax(axis=1))
accs.append(acc)
print("MC accuracy: {:.1%}".format(sum(accs)/len(accs)))
```

```
mc_ensemble_pred = np.array(mc_predictions).mean(axis=0).argmax(axis=1)
ensemble_acc = accuracy_score(y_test.argmax(axis=1), mc_ensemble_pred)
print("MC-ensemble accuracy: {:.1%}".format(ensemble_acc))
```

Look at the distributions of the monte carlo predictions and in blue you see the prediction of the ensemble.

```
plt.hist(accs);
plt.axvline(x=ensemble_acc, color="b");
```

# Explore the Monte Carlo predictions

We look at images and their assigned predictions with probability and uncertainty.

```
idx = 247
plt.imshow(x_test[idx][:,:,0])
```

```
p0 = np.array([p[idx] for p in mc_predictions])
print("posterior mean: {}".format(p0.mean(axis=0).argmax()))
print("true label: {}".format(y_test[idx].argmax()))
print()
# probability + variance
for i, (prob, var) in enumerate(zip(p0.mean(axis=0), p0.std(axis=0))):
print("class: {}; proba: {:.1%}; var: {:.2%} ".format(i, prob, var))
```

```
x, y = list(range(len(p0.mean(axis=0)))), p0.mean(axis=0)
plt.plot(x, y);
```

```
fig, axes = plt.subplots(5, 2, figsize=(12,12))
for i, ax in enumerate(fig.get_axes()):
ax.hist(p0[:,i], bins=100, range=(0,1));
ax.set_title(f"class {i}")
ax.label_outer()
```

We see, that the model is correct but fairly uncertain. This seems to be a hard example. We would probably let a human decide what to do with this example.

# Find the most uncertain examples

Next, we find the most uncertain examples. This can be useful to understand your dataset or where the model has problems.

### 1. Selection by probability

First we select images by the predictive mean, our probability.

```
max_means = []
preds = []
for idx in range(x_test.shape[0]):
px = np.array([p[idx] for p in mc_predictions])
preds.append(px.mean(axis=0).argmax())
max_means.append(px.mean(axis=0).max())
```

```
(np.array(max_means)).argsort()[:10]
```

```
plt.imshow(x_test[247][:,:,0])
```

It seems the image from above was the one the model was most uncertain about by probability.

### 2. Selection by variance

Now we can select the images by the variance of the predictions.

```
max_vars = []
for idx in range(x_test.shape[0]):
px = np.array([p[idx] for p in mc_predictions])
max_vars.append(px.std(axis=0)[px.mean(axis=0).argmax()])
```

```
(-np.array(max_vars)).argsort()[:10]
```

```
plt.imshow(x_test[259][:,:,0])
```

# How is the uncertainty measure behaved in random regions of the feature space?

One important test is how well can the uncertainty estimate identify out-of-scope samples. Here we just create random images and see what the model predicts.

```
random_img = np.random.random(input_shape)
```

```
plt.imshow(random_img[:,:,0]);
```

```
random_predictions = []
for i in tqdm.tqdm(range(500)):
y_p = mc_model.predict(np.array([random_img]))
random_predictions.append(y_p)
```

```
p0 = np.array([p[0] for p in random_predictions])
print("posterior mean: {}".format(p0.mean(axis=0).argmax()))
print()
# probability + variance
for i, (prob, var) in enumerate(zip(p0.mean(axis=0), p0.std(axis=0))):
print("class: {}; proba: {:.1%}; var: {:.2%} ".format(i, prob, var))
```

```
x, y = list(range(len(p0.mean(axis=0)))), p0.mean(axis=0)
plt.plot(x, y);
```

Wow, this is bad! The model is pretty certain that this is an eight. But it is clearly just random noise. If you try different random images, you will find that the model always predicts them as eight. So there might be something wrong with the “understanding” of eight in our model. This is good to know and keep in mind when using the model.

```
fig, axes = plt.subplots(5, 2, figsize=(12,12))
for i, ax in enumerate(fig.get_axes()):
ax.hist(p0[:,i], bins=100, range=(0,1));
ax.set_title(f"class {i}")
ax.label_outer()
```

Looking at the probability distributions of the predictions for different classes we can probably find a way to identify out-of-scope samples by using higher moments than just mean and variance. Try it and let me know what you find.

That’s all for now. We saw a simple but effective way to derive uncertainty estimates for deep learning models, a useful tool in your machine learning toolbox. I hope you liked this tutorial and it will be helpful to you. Watch out for more to come on this topic!

### You might also be interested in:

- Image segmentation with test time augmentation with keras: In the last post, I introduced the U-Net model for segmenting salt depots in seismic images. This ti …
- Guide to sequence tagging with neural networks in python: Named entity recognition series: Introduction To Named Entity Recognition In Python Named Entity Re …
- U-Net for segmenting seismic images with keras: Today I’m going to write about a kaggle competition I started working on recently. I will show you …

08/30/2019 at 4:02 am

Hi!

This was super useful.

I’ve come across many papers such as “Dropout as a bayesian approximation”, “Selective Classification for DNNs”, “SelecitiveNet”, etc., which employ MC dropout on image classification tasks.

Thanks! for a comprehensive explanation.

I have 2 question:

1. How do higher moments help with knowing more about the out-of-scope samples? Till now, I just had the idea that, variance is a pretty good estimator of uncertainty (it directly gives the spread of the distribution), hence I always used this as a measure of uncertainty (or negative variance as confidence).

2. This particular line:

y_p = mc_model.predict(x_test, batch_size=1000)

You are actually running all the samples together through the network for a number of times. However, the papers state running each sample one by one, a number of times through the network. In such case, model.predict won’t work. Any reason I should not run samples one by one and rather stick with all at once like you have done?

12/23/2019 at 11:44 am

This is an interesting post, and I will buy you coffee if you can answer some of my queries:

1) Why are you trying to find the most uncertain model using prediction probabilities ? Using Prediction Variance appropriate (which I know you have mentioned)

2) About the noisy input, you haven’t tried explaining why the model thinks it is class “8”. Does this mean MCDropout is not useful for finding predictive uncertainty ? Is adversarial training the solution to this kind of problem ?

3) Have you looked at Gaussian Process Classifier (GPC) ?

Thanks!

12/24/2019 at 10:31 am

Hi Malhar, I’m happy you like it 🙂

1) I’m trying to find the most uncertain predictions using the predicted probabilities. You can also use the variance and you will notice, that they have different properties. The idea is, that a well calibrated probability estimate (first moment) should be a good indicator of how certain the model is.

2) I have no idea why it is always predicting “8” on noise. The take away for me here is, that MCDropout is not particularly well suited to work with out-of-distribution samples. At least it works well, on close-to-distribution samples, which is probably enough in some applications. I guess it is hard to fix the problem of completely out-of-distribution samples, but I’m not aware of most of the literature in this field.

3) I used GPCs before, but I found them rather hard to train on high dimensional data. On the other hand, they have nice statistical properties, but it is unclear to me, how they would preform on high-dimensional out-of-distribution samples. You could try it and report the results 🙂

01/31/2020 at 6:51 pm

Hey, I was just wondering… I think you made a little mistake in your selection (by probability and variance): your index should go through all test images you made predictions for, i.e. idx in range(x_test.shape[0]) and not idx in range(len(mc_predictions)).

02/01/2020 at 10:00 am

Hi Julia,

you are right. I’ll fix it immediately. Thanks a lot!

Tobias

02/15/2020 at 2:17 am

by variance do you mean standard deviation?

02/15/2020 at 9:43 am

Yes, while not totally statistically correct. Would you name it differently?