Vish Sangale
NOTE · April 2020

Loss functions in machine learning

Loss functions are the heartbeat of machine learning models. They quantify the discrepancy between a model’s predictions and the ground truth, guiding the optimization process. Choosing the right loss function is often as critical as the choice of model architecture.

In this guide, we’ll explore the essential loss functions used in modern ML, categorized by their primary use cases.

1. Regression Losses

Used when the target variable is continuous.

Mean Squared Error (MSE) / L2 Loss

The most common loss for regression. It penalizes larger errors more heavily by squaring them.
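
A rough NumPy sketch (the function name and signature are just for illustration):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of squared differences; squaring makes large errors dominate the loss.
    return np.mean((y_true - y_pred) ** 2)
```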

Mean Absolute Error (MAE) / L1 Loss

Measures the average magnitude of errors without considering their direction.
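
A minimal sketch along the same lines:

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean of absolute differences; every unit of error is weighted equally.
    return np.mean(np.abs(y_true - y_pred))
```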

Huber Loss

A hybrid approach that’s quadratic for small errors and linear for large ones, which makes it less sensitive to outliers than MSE.
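
A sketch, with delta = 1.0 as an illustrative threshold between the quadratic and linear regimes:

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    # Quadratic inside the delta threshold, linear outside it.
    err = y_true - y_pred
    small = np.abs(err) <= delta
    return np.mean(np.where(small,
                            0.5 * err ** 2,
                            delta * (np.abs(err) - 0.5 * delta)))
```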

Log-Cosh Loss

The logarithm of the hyperbolic cosine of the prediction error. It behaves roughly like MSE for small errors and like MAE for large ones, while staying smooth everywhere.
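
A naive sketch (np.cosh can overflow for very large errors, so production code usually uses a numerically stabler rewrite):

```python
import numpy as np

def log_cosh(y_true, y_pred):
    # log(cosh(x)) is roughly x**2 / 2 for small x and |x| - log(2) for large x.
    return np.mean(np.log(np.cosh(y_pred - y_true)))
```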


2. Classification Losses

Used for predicting categorical outcomes.

Binary Cross-Entropy (BCE) / Log Loss

The standard loss for binary classification. It measures the performance of a model whose output is a probability between 0 and 1.
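
A sketch that assumes the model already outputs probabilities, with clipping to keep the logarithm finite:

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # y_true in {0, 1}; clip predicted probabilities to avoid log(0).
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```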

Categorical Cross-Entropy Loss

Used for multi-class classification where each example belongs to exactly one class.
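
A sketch assuming one-hot targets and a row of predicted class probabilities per example:

```python
import numpy as np

def categorical_cross_entropy(y_true_onehot, p_pred, eps=1e-12):
    # Negative log-probability assigned to the correct class, averaged over examples.
    p = np.clip(p_pred, eps, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(p), axis=1))
```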

Hinge Loss (SVM Loss)

Used for training maximum-margin classifiers, most notably Support Vector Machines (SVMs).
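
A sketch assuming labels in {-1, +1} and raw (unsquashed) decision scores:

```python
import numpy as np

def hinge(y_true, scores):
    # Zero loss once the margin y * score reaches 1; linear penalty otherwise.
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))
```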

Focal Loss

An evolution of Cross-Entropy designed to address extreme class imbalance (e.g., in object detection where background examples far outnumber objects).
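
A sketch of the binary form; gamma = 2 and alpha = 0.25 are the commonly quoted defaults, used here purely for illustration:

```python
import numpy as np

def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-12):
    # (1 - p_t) ** gamma down-weights examples the model already classifies well.
    p = np.clip(p_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t))
```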


3. Probability and Similarity Losses

KL Divergence (Kullback-Leibler)

Measures how one probability distribution differs from a second, reference distribution.
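
A sketch for two discrete distributions represented as probability vectors:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(P || Q) = sum_i p_i * log(p_i / q_i); note it is not symmetric in p and q.
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))
```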

Dice Loss

Based on the Dice coefficient, it measures the overlap between the predicted and ground-truth masks. Frequently used in medical image segmentation to handle class imbalance between foreground and background.
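
A sketch of the soft Dice loss, with a small smoothing term to keep it defined on empty masks:

```python
import numpy as np

def dice_loss(y_true, p_pred, smooth=1.0):
    # 1 minus the (soft) Dice coefficient between the target mask and predicted probabilities.
    intersection = np.sum(y_true * p_pred)
    return 1.0 - (2.0 * intersection + smooth) / (np.sum(y_true) + np.sum(p_pred) + smooth)
```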


4. Ranking and Retrieval Losses

Crucial for recommendation systems and search where relative order matters more than absolute scores.

Triplet Loss

Learns an embedding in which an anchor example ends up closer to a positive (matching) example than to a negative one by at least a margin. Commonly used in face recognition and metric learning.
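
A sketch using squared Euclidean distances between batches of embeddings, with 0.2 as an illustrative margin:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Each argument is a (batch, embedding_dim) array of embeddings.
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    # Loss is zero once the negative is farther away than the positive by at least `margin`.
    return np.mean(np.maximum(0.0, d_pos - d_neg + margin))
```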

Quantile Loss

Strictly a regression loss rather than a ranking one: it is used when we are interested in predicting a specific quantile (e.g., the 90th percentile) rather than the mean. Useful for estimating uncertainty and prediction intervals.
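
A sketch of the pinball form, with q = 0.9 as an illustrative quantile:

```python
import numpy as np

def quantile_loss(y_true, y_pred, q=0.9):
    # Under-prediction is penalized by q, over-prediction by (1 - q).
    err = y_true - y_pred
    return np.mean(np.maximum(q * err, (q - 1) * err))
```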


Choosing the right loss function is an iterative process. For large-scale systems, we often see hybrid losses that combine multiple objectives to balance precision, recall, and specific business constraints.