A quick derivation of the gradient of the cross-entropy (CE) loss with a Softmax activation.
The cross-entropy loss is defined as follows:
\[ L = -\sum_j^C y_j \log p_j \\ = \sum_j^C \ell_j, \] where \(\ell_j = -y_j \log p_j\) and \(y\) is the one-hot target vector. Here, we’ve used \(p_j\) to denote the \(j\)th entry of the Softmax output,
\[ p_j = \frac{\exp( a_j)}{\sum_k^C \exp(a_k)}. \] To compute the derivative \(\frac{\partial L}{\partial a_i}\), we first expand it using the chain rule,
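As a quick sanity check of these two definitions, here is a minimal NumPy sketch (the logits and the target class below are made up for illustration):

```python
import numpy as np

def softmax(a):
    """Softmax of logits a. Subtracting max(a) is the standard
    numerical-stability trick; it leaves the output unchanged."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def cross_entropy(p, y):
    """Cross-entropy L = -sum_j y_j * log(p_j) for a one-hot target y."""
    return -np.sum(y * np.log(p))

a = np.array([2.0, 1.0, 0.1])   # made-up logits, C = 3 classes
y = np.array([1.0, 0.0, 0.0])   # made-up one-hot target
p = softmax(a)
print(p, cross_entropy(p, y))
```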
\[ \frac{\partial L}{\partial a_i} = \sum_j^C \frac{\partial \ell_j}{\partial p_j} \frac{\partial p_j}{\partial a_i}. \] Starting with the first term in the chain rule,
\[ \frac{\partial \ell_j}{\partial p_j} = \frac{\partial}{\partial p_j} \left( - y_j \log p_j \right) \\ = -\frac{y_j}{p_j}. \] Next, we compute the second term of the chain rule,
\[ \frac{\partial p_j}{\partial a_i} = \frac{\partial }{\partial a_i} \frac{\exp( a_j)}{\sum_k^C \exp(a_k)} \] If \(i=j\),
\[ = \frac{\partial }{\partial a_j} \frac{\exp( a_j)}{\sum_k^C \exp(a_k)} \\ = \frac{\partial }{\partial a_j} \exp( a_j) \left(\sum_k^C \exp(a_k) \right)^{-1} \\ = \exp( a_j) \left(\sum_k^C \exp(a_k) \right)^{-1} - \exp( a_j) \left(\sum_k^C \exp(a_k) \right)^{-2} \exp( a_j) \\ = p_j -p_j^2 \\ = p_j(1-p_j). \] And if \(i \neq j\),
\[ = \frac{\partial }{\partial a_i} \frac{\exp( a_j)}{\sum_k^C \exp(a_k)} \\ = \exp( a_j) \frac{\partial }{\partial a_i}\left(\sum_k^C \exp(a_k) \right)^{-1} \\ = -\exp( a_j) \left(\sum_k^C \exp(a_k) \right)^{-2} \exp(a_i) \\ = -p_jp_i. \]
Both cases can be combined using the Kronecker delta \(\delta_{ij}\), \[ \frac{\partial p_j}{\partial a_i} = p_j(\delta_{ij} -p_i). \]
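This combined formula is easy to verify numerically. The sketch below (again with made-up logits) builds the analytic Jacobian, whose \((j, i)\) entry is \(p_j(\delta_{ij} - p_i)\), and compares it against central finite differences:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])  # made-up logits
p = softmax(a)

# Analytic Jacobian: entry (j, i) is p_j * (delta_ij - p_i).
J_analytic = np.diag(p) - np.outer(p, p)

# Numerical Jacobian via central differences.
eps = 1e-6
J_numeric = np.zeros((a.size, a.size))
for i in range(a.size):
    d = np.zeros_like(a)
    d[i] = eps
    J_numeric[:, i] = (softmax(a + d) - softmax(a - d)) / (2 * eps)

print(np.abs(J_analytic - J_numeric).max())  # tiny, limited by finite-difference error
```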
Substituting these expressions back into the chain rule gives the final derivative,
\[ \frac{\partial L}{\partial a_i} = \sum_j^C \frac{\partial \ell_j}{\partial p_j} \frac{\partial p_j}{\partial a_i} \\ = \sum_j^C -\frac{y_j}{p_j} p_j(\delta_{ij} - p_i) \\ = \sum_j^C y_j (p_i -\delta_{ij}). \] Suppose the true class occurs at index \(j=c\). Since \(y\) is a one-hot vector, every term with \(j \neq c\) vanishes (\(y_{j\neq c}=0\)), leaving only the \(j=c\) term with \(y_c = 1\):
\[ \frac{\partial L}{\partial a_i} = p_i - \delta_{ic}. \] That is, the gradient is \(p_i\) if \(i\neq c\), and \(p_c - 1\) if \(i = c\). But \(\delta_{ic}\) is exactly \(y_i\) (\(y_i = 0\) when \(i\neq c\) and \(y_i=1\) when \(i=c\)), so we can simply write,
\[ \frac{\partial L}{\partial a_i} = p_i - y_i. \]
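The result \(p - y\) makes a finite-difference gradient check trivial; a minimal sketch, with the same made-up values as above:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def loss(a, y):
    return -np.sum(y * np.log(softmax(a)))

a = np.array([2.0, 1.0, 0.1])  # made-up logits
y = np.array([0.0, 1.0, 0.0])  # made-up one-hot target

grad_analytic = softmax(a) - y  # the derived gradient, p - y

# Central-difference gradient for comparison.
eps = 1e-6
grad_numeric = np.zeros_like(a)
for i in range(a.size):
    d = np.zeros_like(a)
    d[i] = eps
    grad_numeric[i] = (loss(a + d, y) - loss(a - d, y)) / (2 * eps)

print(np.abs(grad_analytic - grad_numeric).max())  # tiny, limited by finite-difference error
```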