Loss functions in machine learning
Loss functions are the heartbeat of machine learning models. They quantify the discrepancy between a model’s predictions and the actual ground truth, guiding the optimization process. Choosing the right loss function is often as critical as the choice of model architecture.
In this guide, we’ll explore the essential loss functions used in modern ML, categorized by their primary use cases.
1. Regression Losses
Used when the target variable is continuous.
Mean Squared Error (MSE) / L2 Loss
The most common loss for regression. It penalizes larger errors more heavily by squaring them.
- Formula: L = (1/n) * sum((y_i - y_hat_i)^2)
- Typical Activation: Linear
- Pros: Smooth and differentiable; easy to optimize with Gradient Descent.
- Cons: Highly sensitive to outliers due to the squaring effect.
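A minimal NumPy sketch of MSE (the function name and signature are illustrative, not from any particular library):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)
```

Note how a single residual of 3 contributes 9 to the sum, which is the squaring effect that makes MSE outlier-sensitive.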
Mean Absolute Error (MAE) / L1 Loss
Measures the average magnitude of errors without considering their direction.
- Formula: L = (1/n) * sum(|y_i - y_hat_i|)
- Typical Activation: Linear
- Pros: More robust to outliers than MSE.
- Cons: Not differentiable at zero; gradients are constant regardless of error size.
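The same sketch for MAE (again an illustrative implementation, not a library API):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: the average magnitude of the residuals."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred))
```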
Huber Loss
A hybrid approach that’s quadratic for small errors and linear for large ones.
- Pros: Combines the stability of MSE for small errors and the robustness of MAE for outliers.
- Typical Use: Robust regression tasks where outliers are expected.
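A sketch of the piecewise definition, assuming the standard parameterization with threshold `delta` (a hyperparameter you would tune):

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond it."""
    r = np.asarray(y_true, float) - np.asarray(y_pred, float)
    quadratic = 0.5 * r ** 2                      # MSE-like branch for small errors
    linear = delta * (np.abs(r) - 0.5 * delta)    # MAE-like branch for outliers
    return np.mean(np.where(np.abs(r) <= delta, quadratic, linear))
```

The `0.5 * delta` offset in the linear branch makes the two pieces meet smoothly at |r| = delta.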
Log-Cosh Loss
Logarithm of the hyperbolic cosine of the prediction error.
- Pros: Smooth like MSE but less sensitive to outliers, similar to Huber loss.
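Computing log(cosh(x)) directly overflows for large residuals, so a numerically stable sketch uses the identity log(cosh(x)) = logaddexp(x, -x) - log(2):

```python
import numpy as np

def log_cosh(y_true, y_pred):
    """Log-cosh loss: mean of log(cosh(residual)), computed stably."""
    r = np.asarray(y_true, float) - np.asarray(y_pred, float)
    # For large |r| this tends to |r| - log(2), i.e. roughly linear like MAE.
    return np.mean(np.logaddexp(r, -r) - np.log(2.0))
```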
2. Classification Losses
Used for predicting categorical outcomes.
Binary Cross-Entropy (BCE) / Log Loss
The standard loss for binary classification. It measures the performance of a model whose output is a probability between 0 and 1.
- Typical Activation: Sigmoid
- Use Case: Predicting one of two possible outcomes (e.g., Spam vs. Not Spam).
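A sketch of BCE over predicted probabilities (the `eps` clipping, a common implementation detail, guards against log(0)):

```python
import numpy as np

def bce(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy over probabilities in (0, 1)."""
    y = np.asarray(y_true, float)
    p = np.clip(np.asarray(p_pred, float), eps, 1 - eps)  # avoid log(0)
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))
```

A model that outputs 0.5 for every example incurs a loss of log(2) ≈ 0.693, a useful baseline to sanity-check training runs.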
Categorical Cross-Entropy Loss
Used for multi-class classification where each example belongs to exactly one class.
- Typical Activation: Softmax
- Use Case: Predicting one of many possible labels (e.g., Digit recognition).
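A sketch computing the loss from raw logits via a numerically stable softmax, assuming integer class labels (both helpers are illustrative):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, float)
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def categorical_ce(labels, logits):
    """Cross-entropy: negative mean log-probability of each true class."""
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))
```

With uniform logits over k classes the loss is log(k), the standard "untrained model" baseline.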
Hinge Loss (SVM Loss)
Used for training maximum-margin classifiers, most notably Support Vector Machines (SVMs).
- Goal: To ensure that the correct class score is greater than all other scores by a safety margin.
- Typical Activation: Linear (before sign function).
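A sketch of the multi-class hinge loss for a single example (Weston-Watkins formulation; names are illustrative):

```python
import numpy as np

def multiclass_hinge(scores, label, margin=1.0):
    """Sum of margin violations: other scores must trail the true score by `margin`."""
    scores = np.asarray(scores, float)
    margins = np.maximum(0.0, scores - scores[label] + margin)
    margins[label] = 0.0  # the true class does not compete with itself
    return margins.sum()
```

The loss is exactly zero once every wrong class trails the correct one by at least `margin`, which is what "maximum-margin" means in practice.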
Focal Loss
An evolution of Cross-Entropy designed to address extreme class imbalance (e.g., in object detection where background examples far outnumber objects).
- Core Idea: Adds a modulating factor to down-weight the loss for “easy” examples, focusing training on hard, misclassified ones.
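A sketch of the binary focal loss with the usual `gamma` (focusing) and `alpha` (class-balance) hyperparameters; the modulating factor is the `(1 - p_t) ** gamma` term:

```python
import numpy as np

def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss: cross-entropy down-weighted for confident predictions."""
    y = np.asarray(y_true, float)
    p = np.clip(np.asarray(p_pred, float), eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-balance weight
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t))
```

With gamma = 0 this reduces to (alpha-weighted) cross-entropy; larger gamma shrinks the contribution of already-correct "easy" examples.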
3. Probability and Similarity Losses
KL Divergence (Kullback-Leibler)
Measures how one probability distribution differs from a second, reference distribution.
- Common Use: Variational Autoencoders (VAEs) and Generative models where we want to match a predicted distribution to a known prior.
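A sketch for discrete distributions; note that KL divergence is asymmetric, so KL(P || Q) and KL(Q || P) generally differ:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) = sum p * log(p / q) for discrete distributions."""
    p = np.asarray(p, float)
    q = np.clip(np.asarray(q, float), eps, None)  # guard against q = 0
    return np.sum(p * np.log(p / q))
```

It is zero only when the two distributions match, and it grows without bound where P places mass that Q considers (near-)impossible.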
Dice Loss
Measures the overlap between two samples. Frequently used in medical image segmentation to handle class imbalance between foreground and background.
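A sketch of soft Dice loss over flattened masks; the `smooth` term is a common implementation detail that avoids division by zero on empty masks:

```python
import numpy as np

def dice_loss(y_true, y_pred, smooth=1.0):
    """1 - Dice coefficient: penalizes low overlap between two masks."""
    y_true = np.asarray(y_true, float).ravel()
    y_pred = np.asarray(y_pred, float).ravel()
    intersection = np.sum(y_true * y_pred)
    return 1.0 - (2.0 * intersection + smooth) / (y_true.sum() + y_pred.sum() + smooth)
```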
4. Ranking and Retrieval Losses
Crucial for recommendation systems and search where relative order matters more than absolute scores.
Triplet Loss
Commonly used in face recognition and metric learning.
- Mechanism: Minimizes the distance between an “anchor” and a “positive” sample while maximizing the distance between the anchor and a “negative” sample.
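A sketch for a single (anchor, positive, negative) triplet of embedding vectors, using squared Euclidean distance (one common choice):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    a, p, n = (np.asarray(x, float) for x in (anchor, positive, negative))
    d_pos = np.sum((a - p) ** 2)  # squared distance to the positive
    d_neg = np.sum((a - n) ** 2)  # squared distance to the negative
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, so training focuses on triplets that still violate the margin.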
Quantile Loss
Used when we are interested in predicting a specific quantile (e.g., the 90th percentile) rather than the mean. Useful for estimating uncertainty and prediction intervals.
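A sketch of the quantile (pinball) loss; the asymmetry in the two branches is what pushes predictions toward the chosen quantile `q`:

```python
import numpy as np

def quantile_loss(y_true, y_pred, q=0.9):
    """Pinball loss: under-prediction costs q per unit, over-prediction 1 - q."""
    r = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return np.mean(np.maximum(q * r, (q - 1) * r))
```

At q = 0.9, under-predicting by one unit costs 0.9 while over-predicting costs 0.1, so the minimizer lands near the 90th percentile; q = 0.5 recovers (half of) the MAE.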
Choosing the right loss function is an iterative process. For large-scale systems, we often see hybrid losses that combine multiple objectives to balance precision, recall, and specific business constraints.
