A quick derivation of the cross-entropy (CE) loss with a softmax activation.
Compressing neural network weights for efficient inference.
A look into the architecture that powers most LLMs.