Explain cross entropy loss.
Suppose I build a neural network for classification. The last layer is a dense layer with Softmax activation. I have five different classes to classify. Suppose for a single training example, the true label is [1 0 0 0 0] while the predictions are [0.1 0.5 0.1 0.1 0.2]. How would I calculate the cross entropy loss for this example?
The cross entropy loss formula takes in two distributions, p(x), the true distribution, and q(x), the estimated distribution, defined over the discrete variable x and is given by
$$H(p,q) = -\sum_{\forall x} p(x) \log(q(x))$$
For a neural network, the calculation is independent of the following:
What kind of layer was used?
What kind of activation was used, although many activations will not be compatible with the calculation because their outputs are not interpretable as probabilities (i.e., some outputs are negative, greater than 1, or the outputs do not sum to 1). Softmax is often used for multiclass classification because it guarantees a valid probability distribution: every output lies between 0 and 1, and the outputs sum to 1.
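As a rough illustration of why softmax outputs are compatible with the calculation, here is a minimal sketch in plain Python. The `softmax` helper and the logit values are made up for this example and are not taken from any particular library.

```python
import math

def softmax(logits):
    """Map raw scores to a probability distribution."""
    # Subtracting the maximum improves numerical stability
    # and does not change the result.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs of a final dense layer (made-up values).
logits = [-1.0, 2.0, 0.5, 0.0, 3.0]
probs = softmax(logits)

print(probs)       # every entry lies strictly between 0 and 1
print(sum(probs))  # approximately 1.0
```

Negative or large raw scores are fine as inputs; it is only the softmax outputs that are read as probabilities.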
For a neural network, you will usually see the equation written in a form where $y$ is the ground truth vector and $\hat{y}$ (or some other value taken directly from the last layer output) is the estimate. For a single example, it would look like this:
$$L = -y \cdot \log(\hat{y})$$
where $\cdot$ is the inner product.
Your example ground truth $y$ gives all probability to the first value, and the other values are zero, so we can ignore them and just use the matching term from your estimates $\hat{y}$. Using the natural logarithm:
$$L = -(1 \times \log(0.1) + 0 \times \log(0.5) + \dots)$$
$$L = -\log(0.1) \approx 2.303$$
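The same arithmetic can be checked with a few lines of plain Python. This is only a sketch: `cross_entropy` is a helper name chosen for this example, and it assumes the natural logarithm.

```python
import math

def cross_entropy(y_true, y_pred):
    """Cross entropy -sum(p * log(q)), using the natural logarithm."""
    # Terms with zero true probability contribute nothing, so skip them.
    # Skipping also avoids evaluating log(0) for a zero prediction.
    return -sum(p * math.log(q) for p, q in zip(y_true, y_pred) if p > 0)

y_true = [1, 0, 0, 0, 0]
y_pred = [0.1, 0.5, 0.1, 0.1, 0.2]

loss = cross_entropy(y_true, y_pred)
print(round(loss, 3))  # 2.303
```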
An important point from comments
That means, the loss would be the same no matter if the predictions are [0.1,0.5,0.1,0.1,0.2] or [0.1,0.6,0.1,0.1,0.1]?
Yes, this is a key feature of multiclass log loss: it rewards/penalises probabilities of correct classes only. The value is independent of how the remaining probability is split between incorrect classes.
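A quick check of this point, as a sketch in plain Python. The `cross_entropy` helper is a name chosen for this example, and the third prediction vector is made up to show the contrasting case where the probability of the correct class changes.

```python
import math

def cross_entropy(y_true, y_pred):
    """Cross entropy -sum(p * log(q)), using the natural logarithm."""
    return -sum(p * math.log(q) for p, q in zip(y_true, y_pred) if p > 0)

y_true = [1, 0, 0, 0, 0]

# The two prediction vectors from the comment: same probability (0.1)
# for the correct class, different split among the incorrect classes.
loss_a = cross_entropy(y_true, [0.1, 0.5, 0.1, 0.1, 0.2])
loss_b = cross_entropy(y_true, [0.1, 0.6, 0.1, 0.1, 0.1])
print(math.isclose(loss_a, loss_b))  # True

# Made-up vector with a higher probability for the correct class.
loss_c = cross_entropy(y_true, [0.7, 0.1, 0.1, 0.05, 0.05])
print(loss_c < loss_a)  # True
```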
You will often see this equation averaged over all examples as a cost function. The distinction is not always strictly adhered to in descriptions, but usually a loss function is lower level and describes how a single instance or component determines an error value, whilst a cost function is higher level and describes how a complete system is evaluated for optimisation. A cost function based on multiclass log loss for a data set of size $N$ might look like this:
$$J = -\frac{1}{N}\left(\sum_{i=1}^{N} y_i \cdot \log(\hat{y}_i)\right)$$
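A minimal sketch of such a cost function in plain Python follows. The helper names are chosen for this example, and only the first row of data comes from the question; the other two rows are made up for illustration.

```python
import math

def cross_entropy(y_true, y_pred):
    """Per-example loss: -sum(p * log(q)), natural logarithm."""
    return -sum(p * math.log(q) for p, q in zip(y_true, y_pred) if p > 0)

def mean_cross_entropy(targets, predictions):
    """Cost J: the per-example losses averaged over the data set."""
    losses = [cross_entropy(y, y_hat) for y, y_hat in zip(targets, predictions)]
    return sum(losses) / len(losses)

targets = [
    [1, 0, 0, 0, 0],  # example from the question
    [0, 1, 0, 0, 0],  # made up
    [0, 0, 0, 0, 1],  # made up
]
predictions = [
    [0.1, 0.5, 0.1, 0.1, 0.2],    # example from the question
    [0.2, 0.6, 0.1, 0.05, 0.05],  # made up
    [0.1, 0.1, 0.1, 0.2, 0.5],    # made up
]

cost = mean_cross_entropy(targets, predictions)
print(cost)  # the mean of -log(0.1), -log(0.6) and -log(0.5)
```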
Many implementations will require your ground truth values to be one-hot encoded (with a single true class), because that allows for some extra optimisation. However, in principle the cross entropy loss can be calculated - and optimised - when this is not the case.
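To illustrate both halves of that statement, here is a sketch in plain Python. With a one-hot target the loss reduces to a single lookup by class index, which is roughly the kind of shortcut an implementation can take. The general formula still works for a target that spreads probability over several classes. The function names and the soft target values are made up for this example.

```python
import math

def cross_entropy(y_true, y_pred):
    """General form: works for any target distribution."""
    return -sum(p * math.log(q) for p, q in zip(y_true, y_pred) if p > 0)

def cross_entropy_from_index(class_index, y_pred):
    """Shortcut for a one-hot target: only the true class matters."""
    return -math.log(y_pred[class_index])

y_pred = [0.1, 0.5, 0.1, 0.1, 0.2]

# One-hot target: the shortcut and the general form agree.
one_hot_loss = cross_entropy([1, 0, 0, 0, 0], y_pred)
index_loss = cross_entropy_from_index(0, y_pred)
print(math.isclose(one_hot_loss, index_loss))  # True

# Hypothetical soft target: 70% class 0, 30% class 1.
soft_target = [0.7, 0.3, 0.0, 0.0, 0.0]
soft_loss = cross_entropy(soft_target, y_pred)
print(soft_loss)  # equals -(0.7 * log(0.1) + 0.3 * log(0.5))
```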