Often in machine learning tasks, you have multiple possible labels for one sample that are not mutually exclusive. This is called a multi-class, multi-label classification problem. Obvious examples are image classification and text classification, where a document can have multiple topics. Both of these tasks are well tackled by neural networks. A popular Python framework for such tasks is Keras. We will discuss how to use Keras to solve this problem. If you are not familiar with Keras, check out the excellent documentation.

```
from keras.models import Sequential
from keras.layers import Dense
```

To begin with, we discuss the general problem; in the next post, I show you an example. We assume a classification problem with 5 different labels. This means we are given samples

x_1, …, x_n

and labels

y_1, …, y_n

with y_i ∈ {0, 1}^5. We use a simple neural network as an example to model the probability P(c | x) of a class c given a sample x. We then estimate our prediction as

ŷ = argmax_c P(c | x).
Now we set up a simple neural net with 5 output nodes, one output node for each possible class.

```
nn = Sequential()
nn.add(Dense(10, activation="relu", input_shape=(10,)))
nn.add(Dense(5))
```

## Multi-class classification

Now the important part is the choice of the output layer. The usual choice for multi-class classification is the softmax layer. The softmax function is a generalization of the logistic function that “squashes” a K-dimensional vector of arbitrary real values to a K-dimensional vector of real values in the range (0, 1) that add up to 1.

```
import math

def softmax(z):
    """Map a list of real numbers to probabilities that sum to 1."""
    z_exp = [math.exp(i) for i in z]
    sum_z_exp = sum(z_exp)
    return [i / sum_z_exp for i in z_exp]
```

Assume our last layer (before the activation) returns the numbers z = (1.0, 2.0, 3.0, 4.0, 1.0). Every number is the value for one class. Let's see what happens if we apply the softmax activation.

```
z = [1.0, 2.0, 3.0, 4.0, 1.0]
softmax(z)
# [0.031, 0.084, 0.230, 0.624, 0.031] (rounded)
```

So we would predict class 4. But let's understand what we are modeling here. Using the softmax activation function at the output layer results in a neural network that models the probability of a class as a multinomial distribution.

A consequence of using the softmax function is that the probability of a class is not independent of the other class probabilities. This is fine as long as we only want to predict a single label per sample.
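A tiny demonstration of this coupling (the logit values are chosen arbitrarily for illustration): raising one logit necessarily lowers every other class probability, because the outputs must sum to 1.

```python
import math

def softmax(z):
    # Same softmax as above, inlined so this snippet runs on its own.
    z_exp = [math.exp(i) for i in z]
    s = sum(z_exp)
    return [e / s for e in z_exp]

p_before = softmax([1.0, 2.0, 3.0])
p_after = softmax([1.0, 2.0, 5.0])   # only the third logit was raised

# The first two probabilities drop even though their logits are unchanged.
assert p_after[0] < p_before[0] and p_after[1] < p_before[1]
```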

## Multi-class multi-label classification

But now assume we want to predict multiple labels, for example which objects an image contains.

Say our network returns z = (-1.0, 5.0, -0.5, 4.7, -0.5) for a sample (e.g. an image).

```
z = [-1.0, 5.0, -0.5, 4.7, -0.5]
softmax(z)
```

By using softmax, we would clearly pick classes 2 and 4. But then we would have to know how many labels we want for a sample, or pick a threshold. This is clearly not what we want. If we stick to our image example, the probability that there is a cat in the image should be independent of the probability that there is a dog. Both could be present at the same time.

A common activation function for binary classification is the sigmoid function

σ(z) = 1 / (1 + e^(−z))

for z ∈ ℝ.

```
def sigmoid(z):
    """Element-wise logistic function; each output lies in (0, 1)."""
    return [1 / (1 + math.exp(-n)) for n in z]
```

```
z = [-1.0, 5.0, -0.5, 5.0, -0.5]
sigmoid(z)
# [0.269, 0.993, 0.378, 0.993, 0.378] (rounded)
```

With the sigmoid activation function at the output layer, the neural network models the probability of each class as a Bernoulli distribution.

Now the probability of each class is independent of the other class probabilities. So we can use a threshold (usually 0.5) as usual. This is exactly what we want. So we set the output activation accordingly.

```
nn = Sequential()
nn.add(Dense(10, activation="relu", input_shape=(10,)))
nn.add(Dense(5, activation="sigmoid"))
```
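As a quick sketch of how prediction then works (the probability values and the 0.5 threshold are assumptions for illustration, not outputs of the network above):

```python
import numpy as np

# Hypothetical sigmoid outputs of the network for one sample.
probs = np.array([0.10, 0.93, 0.40, 0.88, 0.05])

# Accept every label whose probability exceeds the threshold.
threshold = 0.5
predicted = (probs > threshold).astype(int)
print(predicted)  # [0 1 0 1 0]
```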

To make this work in Keras, we need to compile the model. An important choice is the loss function. We use the binary_crossentropy loss, not the categorical_crossentropy loss that is usually used for multi-class classification. This might seem unreasonable, but we want to penalize each output node independently. So we pick a binary loss and model the output of the network as independent Bernoulli distributions, one per label.

```
nn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```

To get everything running, you now need to get the labels into a “multi-hot encoding”. A label vector should look like

y = (0, 1, 0, 1, 0)

if class 2 and class 4 are present for the sample. We will see how to do this in the next post, where we will try to classify movie genres by movie posters, or in this post about a Kaggle challenge applying this. Note that you can view image segmentation, like in this post, as an extreme case of multi-label classification.
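A minimal way to build such multi-hot label vectors (the helper function and the example label lists are my own sketch, not from the post):

```python
import numpy as np

def multi_hot(label_lists, num_classes):
    """Turn lists of class indices into a multi-hot label matrix."""
    y = np.zeros((len(label_lists), num_classes), dtype=int)
    for i, labels in enumerate(label_lists):
        y[i, labels] = 1  # set a one at every present class index
    return y

# Sample 1 has classes 1 and 3, sample 2 has only class 0.
print(multi_hot([[1, 3], [0]], num_classes=5))
# [[0 1 0 1 0]
#  [1 0 0 0 0]]
```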

### You might also be interested in:

- Classifying genres of movies by looking at the poster – A neural approach: Today we will apply the concept of multi-label multi-class classification with neural networks from …
- A strong baseline to classify toxic comments on Wikipedia with fasttext in keras: This time we’re going to discuss a current machine learning competion on kaggle. In this competitio …
- Guide to sequence tagging with neural networks in python: Named entity recognition series: Introduction To Named Entity Recognition In Python Named Entity Rec …

12/17/2017 at 4:15 pm

In the current case, you are trying to pick class 2 AND 4. How would you proceed if you want to minimize the error for the prediction “2 OR 4”? By this I mean, you would like the error for the given sample to be nil if the prediction is “CAT” or “DOG”. For example, maybe you consider that cat and dog are two animals and you want to classify animals vs. other categories.

One would use the usual softmax cross entropy to get the prediction for the class, but then you can't just use softmax cross entropy to calculate the error, because if you predict “CAT”, the cat category will produce a low error, but the dog category will produce a high error, whereas you want the dog category to also produce a low error (because DOG or CAT is the same for you; it's what you are looking for as an answer).

Any ideas? Thanks

12/17/2017 at 6:14 pm

Thanks for your question. That really depends on your use-case. If you want to predict animal versus other categories you would simply join the classes for cat and dog (and probably other animals in your data) to the label “animal”. And then use the cross entropy loss with softmax as usual.

Does this solve your problem?

01/03/2018 at 5:14 am

This is mind-opening! I have been searching for multi-class + multi-label for ages! Lots of confusion about whether we should use softmax vs sigmoid AND especially binary_crossentropy OR categorical_crossentropy.

Your post is really life-saver 🙂 I will try it out and let you know.

One question please: if I would like to do “repeat targets” where I will normalize losses at EACH STEPS of a sample sequence in combination with the final output of a sample sequence, how to I manage the loss function differently ?

Thanks again.

Steve (thusinh1969@yahoo.com)

01/04/2018 at 7:17 pm

Thanks a lot for your support! I really appreciate it. Let me know how it works 🙂

I don't really get your question. Can you tell me about a use-case and what exactly you are trying to achieve? You can also contact me via email if you prefer, I'm happy to help. 🙂

01/03/2018 at 9:51 pm

Nice Article man I had a question what is the best performance metric for this case ?

01/04/2018 at 7:13 pm

Thanks, happy that you like it!

I would say the choice of the performance measure always depends on the task you are trying to solve. A good first approach is to measure the quality of each label independently. Then you can average the result. For example you could pick accuracy or logarithmic loss. Measure the metric for the first class (which is then a binary task) and so on. If you then need a number you can just average the results. Keras does this automatically if you use accuracy or log_loss as a metric. You can see it here for example.
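To make that per-label averaging concrete, here is a small sketch (the label matrices are made up for illustration):

```python
import numpy as np

# Hypothetical multi-hot ground truth and thresholded predictions
# for four samples and three labels.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

# Accuracy of each label treated as its own binary task ...
per_label_acc = (y_true == y_pred).mean(axis=0)
# ... then averaged into a single number.
macro_acc = per_label_acc.mean()
print(per_label_acc, macro_acc)  # [1.   0.75 0.75] 0.8333...
```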

02/12/2018 at 9:34 pm

Hi,

Thank you so much for the great post.

I am trying to do a multi-class multi-label classification, but I need a weighted version of the labeling: instead of one-hot encoding I have a weighted estimate of the possible labels, like [0.5, 0.8, 0.7, 0].

Do you have any experience with similar setting or how we can integrate softmax with temperature in Keras for such a problem?

Thanks,-Shek

02/13/2018 at 7:53 pm

Happy you like it 🙂

I have no experience with this so far. But here is what I would try, without knowing much about your task.

I would try

a) to fit the data with a linear (or relu) output layer and binary crossentropy.

b) to fit it with a softmax output layer as usual.

c) to model the problem differently

Hope that helps you. What are you trying to model?

05/05/2018 at 11:19 am

Hi ! Lovely Post. Am pretty much new to ML and getting a feel of things. However I came here to find out how to create a multi label input for a sample.

say the picture has both a dog and a cat, how do I create a one hot vector like {1,1}

05/05/2018 at 3:07 pm

Hi Praveen, I’m really happy you enjoyed the post. I think the most sane way (assuming you have a larger dataset) would be, to take the binary vectors for each class and concatenate it with numpy.vstack to get the desired labels. Does this help you? Feel free to ask again 🙂

05/08/2018 at 6:41 am

Hi

Very informative post Thank You

but I have a problem to identify if my project is multi labels or multi classes problem. if you could please guide me I will be extremely grateful.

I have these labels: ocean, grass, fire,snow, mildew and dirt I want to do image recognition… Can you help me please to identify the type of the problem I’m newbie in this and it’s my first project to work with deep learning.

05/08/2018 at 6:48 am

Depends on what kind of question you want to answer. Are the labels mutually exclusive? If yes, then it’s a multiclass problem, otherwise it’s a multilabel situation.

06/19/2018 at 5:41 am

Hi Tobias!

First of all, thank you so much for your amazing tutorials!

I need to classify attributes in a face like colour of eye, hair, skin; facial hair, lighting and so on. Each has few sub-categories in it. So should I directly apply sigmoid on all the labels or separately apply softmax on each subcategory like hair/eye colour etc?

Which one will be better in this case?

Or should I combine both as some subclasses are binary?

So I should choose binary cross entropy for binary-class classification and categorical-cross entropy for multi-class classification? And combine them together afterwards in the same model?

Moreover, should I approach this a multi-output problem or a multi-label classification problem?

Thanks for your help!

Sarthak

06/19/2018 at 1:51 pm

Hi Sarthak,

I’m happy that you liked it 🙂

It’s hard to tell what will work best. The first thing I would try is a full multi-label model. That means you have a binary decision for every possible sub-category. Other than that, you can try a multi-headed model that has different classification layers and shares the feature extractors. Or you build a separate multi-class model for every category. But in the end, it’s an empirical question.

Let me know what works for you.

06/29/2018 at 8:17 pm

Great post, I have done the same setting for my text classification problem which is multi-class, multi-label.

When I run the code, in some cases the probabilities are small numbers (all less than 0.2), for example: [0.09, 0.00, 0.00, 0.00, 0.14]. So should I just rank the probabilities and select the top n? In the given example, the predicted classes would be 5 and 1.

then I am wondering how the metrics such as accuracy, precision and recall will be calculated?

Thanks

06/30/2018 at 11:19 am

Hi Alex, I’m happy that you like it 🙂

That’s of course the hard problem in this case. You can do several things depending on your use-case. If you know how many labels a sample can have you just pick that number of top ranked labels. Another option is, to take a holdout set and optimize a class-wise threshold to accept a label. For this you can use metrics like MAP (mean average precision) or class-wise AUC score.
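Both options can be sketched in a few lines (the threshold values below are made up; in practice they would come from tuning on the hold-out set):

```python
import numpy as np

probs = np.array([0.09, 0.00, 0.00, 0.00, 0.14])

# Option 1: if each sample is known to have exactly k labels, take the top k.
k = 2
top_k = np.argsort(probs)[::-1][:k]
print(sorted(top_k.tolist()))  # [0, 4], i.e. classes 1 and 5 counting from 1

# Option 2: a separate threshold per class, tuned on a hold-out set.
thresholds = np.array([0.05, 0.50, 0.50, 0.50, 0.10])
accepted = probs > thresholds
print(accepted)  # [ True False False False  True]
```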

or you checkout this metrics: http://scikit-learn.org/stable/modules/model_evaluation.html#multilabel-ranking-metrics

Does this help you?

07/23/2018 at 12:58 pm

Nice article!

Can you give me the scientific paper you used in order to explain the multi-label classification. I am looking for references to reinforce every thing you mentioned concerning that subject (the choice of sigmoid, the cross entropy, etc..)

Thanks.

07/23/2018 at 2:44 pm

You can find the described setup here for example.

https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-017-1898-z

10/24/2018 at 7:51 pm

Nice and clear!

Another q: is there a way to do this multi-dimensional?

By that I mean that I have a vector of variables, for each I want a probability per label. So the output of the network is the tensor: sample X variables X class_prob

Is that at all possible?

10/24/2018 at 8:10 pm

Thanks Yoav!

Of course you can do that (and that’s what you do in general). I just reduced it to the single output case to explain it more clearly.

10/25/2018 at 6:26 am

Than how would I do that?

Need to define somehow that the output is a 3D matrix (samplex by variable by classes) But I do not see how.

Also need somehow to say that dimension 1 is independent and dimention 2 is dependent,

So that y.sum(axis=2) =1 as in class probability.

I realy do not see an example to do that, can you help? Do I need to define output_shape , output_dim in the last Dense?

Thanks a lot, Yoav

10/25/2018 at 6:42 am

I will try to be more precise.

This is my (simplified) multi-variable binary classification:

```
X = Input(shape=(500,))
X_sm = Dense(1000, activation='relu')(X)
X_sm = Dense(1000, activation='relu')(X_sm)
out = Dense(250, activation='sigmoid')(X_sm)
model = Model(inputs=X, outputs=out)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```

For which I get vector of 250 independent binary probabilities.

However my problem has 5 classes foe each element, not 2.

So what I need is ‘softmax’, not ‘sigmoid’.

But when using ‘softmax’ I couldn’t find a way to describe the dimension of the classifier is different than the dimension of the input.

If you can help, it would be amazing.

11/09/2018 at 6:52 am

Hey man You are really great. Such an explanation is all what I need. The fonts are more soothing for my eyes to read <3

11/13/2018 at 3:04 pm

Hi, thanks for the great article!

I would like to extend the problem: my input for the NN is not only multi-label, but every label has a different weight. That is, assume that we have 3 possible labels: dog, cat, car. Every sample in my data has a probability being each one of them, i.e., a sample could be 50% dog, 49% cat, and 1% car.

Eventually, my goal is to determine a single label, but I understand how to do that.

Does anyone know how to account for multi-class weighted labels in the training set?

Thanks!

11/13/2018 at 5:34 pm

Thanks Shiran,

If I understand correctly, every sample has a percentage label that you try to predict. Or your training samples have only one label?

In the first case you could do regression and limit the range to [0,1] and in the second case you are doing regular multi-class classification. You can get the highest label by numpy.argmax.

Does this help you?

11/29/2018 at 3:23 pm

Dear Tobias thank you very much , for sharing such a nice post. I tried and it works fine.

But to fully implement this method i got a serious problem with the hot encoding process. Let me explain two cases:

case 1: I have {+1, -1} labels and i simply changed this to (1,0) sequences. for example 4 classes with true labels = (1,1,-1,1).

i coded as labels = (1,1,0,1) and used sigmoid just like the your example and it works.

case 2: labels = (-1,3,1,-3). Here is the problem: how can I map to multi-hot-encoded labels to fit into sigmoid? I tried label encoding, but it only converts this to positive class indices like (1,3,2,0). One-hot or multi-hot encoding converts this to a 4×4 binary matrix like,

[[1 ,0, 0, 0][0,1,0,0],[0,0,1,0],[0,0,0,1]]

But our output layer is expecting (4,1) labels right ?

I couldn’t figure out where i missed the point. please any help on this will be very appreciated !

11/30/2018 at 6:45 am

Hi, thanks a lot!

In case 2, can you make an example of what you want to classify? I don't really get what you want to achieve. To work with the shown approach, you have to flatten your labels. You have a label vector whose dimension is the number of possible labels. If a label is present for a sample, there's a one in the vector, otherwise a zero.

Does this help you?

11/30/2018 at 3:00 am

Hi Tobias !

Thank you very much for such a nice post. It is very helpful.

But I stack with the multi-hot encoding. In my use-case I have a desired outputs like {-1,+1},{-1,1, -3,3} etc. It comes from digital communication modulations.

In the case of {-1,1}, I used only label encoder to get l = [1,0]. for example if i have an output s= [1,-1,-1,1] my label becomes l= [1,0,0,1]. Then following your steps it works fine.

But when I want to train with outputs {-1,1,-3,3} how to get the binary representation of it ? let for example s=[3,-1,1,3,-3].

It tried options like one-hot encoding, MultiLabelBinarizer(). They give 5×5 binary matrix in this case, while i want equivalent 5×1 labels.

What did I miss here please, and how to work with this kind of problems? I really appreciate any help on this.

12/03/2018 at 6:57 pm

Hi Tobias again. Thank you very much for your response, and sorry for the second post. Please ignore it.

Let me explain a little on case 2.

My use case is signal detection in MIMO wireless communication. For example, four independent users are transmitting data (like making a phone call, …) and multiple base-station antennas are receiving.

Let the number of base-station antennas be 10, and the four users transmit the symbols s = (-1,3,1,-3). Now at the base station we have an erroneously received signal y of shape (10,1). Our goal is to detect the four transmitted symbols given the observation signal y.

I modeled the problem using a DNN: the signal y is the input feature, and the transmitted symbols are the true labels. So I have four labels = [-1,3,1,-3], encoded as mentioned before.

If I flatten the one-hot label it will become (16,1). But my label is (4,1). How can I handle this in keras please ?

12/04/2018 at 11:17 am

Hey, that sounds like an interesting problem!

How is the dependency structure between the symbols s? Could there be symbols s_1 = (1,3,1,3) and s_2 = (-1,-1,-1,-1) and does the ordering matter? If the four possible labels are independent, you can just map your label to a multiple-hot-vector of dimension 16. Then you can use 16 sigmoid nodes to model the symbols independently and then pick the 4 symbols with the highest probability.
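A sketch of that 16-dimensional encoding (the block layout, one group of 4 positions per user, and the per-user argmax decoding are my own assumptions about how to arrange it):

```python
import numpy as np

alphabet = [-3, -1, 1, 3]          # reference symbols
s = [-1, 3, 1, -3]                 # transmitted symbols of the 4 users

# One multi-hot label of dimension 4 users * 4 symbols = 16.
y = np.zeros(len(s) * len(alphabet), dtype=int)
for user, sym in enumerate(s):
    y[user * len(alphabet) + alphabet.index(sym)] = 1
print(y.tolist())
# [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]

# Decoding: within each user's block of 4 sigmoid outputs,
# pick the symbol with the highest probability.
probs = y + 0.1                    # stand-in for the network's outputs
decoded = [alphabet[int(np.argmax(probs[u * 4:(u + 1) * 4]))] for u in range(4)]
print(decoded)  # [-1, 3, 1, -3]
```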

Does this make sense?

12/04/2018 at 12:18 pm

Oh yes, I think it makes sense. The symbols are chosen independently at random from the reference array {-1,1,3,-3}. So any combination is possible and the ordering doesn't matter. But the length of the symbol vector can be any number; it depends on the number of users a base station can serve.

I will try the way you suggest, it may work. Thank you for that,sir !

In case there is another way you would suggest , I will be happy to know.

12/04/2018 at 10:15 pm

Hi Tobias!

I hate to be a “me too”, but I, too, really enjoyed your post! It was exactly what I was looking for. I’m planning to use this method immediately.

Would you do anything differently if you were trying to predict membership in, say, 30 classes?

Cheers,

-Maashu

12/05/2018 at 8:33 am

Hi Maashu,

I’m happy you like it! 🙂 Let me know how it works for you.

30 classes seem to be no problem for this approach. Maybe you want to use class-dependent thresholds tuned on a hold-out-set. Especially, if your classes are imbalanced. Otherwise, if you have some hierarchy or grouping in your classes, you can try to make use of this information.

Best,

Tobias

12/12/2018 at 12:53 am

Hi Tobias, this tutorial has been very helpful! Any tips or ideas on how to normalize the loss function for severely imbalanced training inputs for my 8 output classes? I have training labels for the 8 classes similar to this (34470467, 1004, 18, 733, 561, 3522, 68, 175, 235) — with the largest group being the “None” class. I looked into class_weights in Keras but inputs are 3D arrays, so I’m unable to use them.

12/12/2018 at 7:34 am

Hey, I’m happy it helps you. You can set the class weight like this:

1. Define a dictionary with your labels and their associated weights:

```
class_weight = {1: 50.,
                2: 2.}
```

2. Feed the dictionary as a parameter:

```
model.fit(X, y, class_weight=class_weight)
```
12/12/2018 at 2:19 pm

I got the following since my inputs are more than 2 dimensions:

Exception: class_weight not supported for 3+ dimensional targets.

My inputs are:

```
__________________________________________________________________________________________________
Layer (type)            Output Shape      Param #   Connected to
==================================================================================================
input_72 (InputLayer)   (None, 20, 20)    0
__________________________________________________________________________________________________
input_71 (InputLayer)   (None, 20)        0
__________________________________________________________________________________________________
```

Basically what I’ve been trying to accomplish is converting the word and character-level NER NN into a multilabel classifier. My task involves labeling 100+ attributes in a document, with some words in the document relating to one or more of my custom output labels, so softmax on the output isn’t really a good choice for this. I also wanted to avoid training a large number of models (perhaps one model per attribute) for scalability and efficiency reasons. Thoughts?

12/12/2018 at 4:32 pm

Ah yes, it’s about the labels. Weighting is not supported for sequences with this API. I don’t know a way in keras to do the desired weighting. If you use sigmoid activations at the output layer, you can just tune the thresholds of the classes to account for the imbalance. Or you try to use the sample_weight API of keras.

12/12/2018 at 4:49 pm

I’m going to give redistributing the sigmoid output classes a shot first, dropping the “padded” values (empty input words/tokens) that for certain lines return false-positive high values. Perhaps I’ll start with something like a bell-curve for the redistribution and keep the top 50% and hand-tune from there. If I’m able to achieve some good results, I’d be happy to share what worked.

01/07/2019 at 4:51 pm

This is just multi-label, not multi-class multi-label, since all the labels are binary.