Forecast accuracy metrics #
A compilation of various metrics used to measure forecast accuracy
Let $y(t)$ be the actual observation for time period $t$ and let $f(t)$ be the forecast for the same time period. Let $n$ be the length of the training dataset (the number of historical observations) and $h$ the forecasting horizon.
Forecast errors #
A forecast error is the difference between an observed value and its forecast. Several metrics are defined based on how the forecast error $e(t)$ at a time point $t$ is measured.
acronym | error type | equation | scale independent |
---|---|---|---|
*E | error | \(e(t)=y(t)-f(t)\) | no |
*PE | percentage error | \(pe(t)=\frac{y(t)-f(t)}{y(t)}\) | yes |
*RE | relative error | \(re(t)=\frac{y(t)-f(t)}{y(t)-f^*(t)}\) | yes |
*SE | scaled error | \(se(t)=\frac{y(t)-f(t)}{s}\) | yes |
We evaluate the accuracy over $h$ forecasting time steps. The final forecast accuracy metric is then an aggregation (via mean, median, etc.) of these errors over the $h$ time steps.
acronym | aggregation |
---|---|
M* | Mean |
MA* | Mean Absolute |
Md* | Median |
MdA* | Median Absolute |
GM* | Geometric Mean |
GMA* | Geometric Mean Absolute |
MS* | Mean Squared |
RMS* | Root Mean Squared |
Metric names combine an aggregation prefix with an error type: {M|Md|GM|RMS}{-|A|S}{E|PE|RE|SE}. For example, MdA + PE gives MdAPE, the Median Absolute Percentage Error.
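To make the naming scheme concrete, here is a minimal NumPy sketch (the function and variable names are my own, not from any particular library) that builds a few of these metrics from an error type plus an aggregation:

```python
import numpy as np

def forecast_errors(y, f):
    """Raw errors e(t) = y(t) - f(t) over the forecast horizon."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    return y - f

# Illustrative actuals and point forecasts over h = 4 steps.
y = np.array([112.0, 118.0, 132.0, 129.0])
f = np.array([110.0, 120.0, 128.0, 131.0])

e  = forecast_errors(y, f)
pe = e / y                                   # percentage errors

# Compose an error type with an aggregation to obtain a named metric.
mae   = np.mean(np.abs(e))                   # MA  + E  -> MAE
mdape = np.median(np.abs(pe))                # MdA + PE -> MdAPE
rmse  = np.sqrt(np.mean(e ** 2))             # RMS + E  -> RMSE
```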
TLDR #
While there are many metrics to choose from, we recommend that the following six be included in any forecasting toolkit.
acronym | name | comments |
---|---|---|
MAE | Mean Absolute Error | For assessing accuracy on a single time series, use MAE because it is the easiest to interpret. However, it cannot be compared across series because it is scale dependent. |
RMSE | Root Mean Square Error | A forecast method that minimises the MAE will lead to forecasts of the median, while minimising the RMSE will lead to forecasts of the mean. Hence RMSE is also widely used, despite being more difficult to interpret. |
MAPE | Mean Absolute Percentage Error | MAPE has the advantage of being scale independent and hence can be used to compare forecast performance between different time series. However MAPE has the disadvantage of being infinite or undefined if there are zero values in the time series, as is frequent for intermittent demand data. |
MASE | Mean Absolute Scaled Error | MASE is a scale free error metric that can be used to compare forecast methods on a single time series and also to compare forecast accuracy between series. It is also well suited for intermittent demand time series since it never gives infinite or undefined values. |
wMAPE | (volume) weighted Mean Absolute Percentage Error | This is MAPE weighted by volume. |
MSPL | Mean Scaled Pinball Loss | The metric to use for quantile forecasts. |
Categorizing metrics #
category | description |
---|---|
Accuracy | How close are the forecasts to the true value? |
Bias | Is there a systematic under- or over-forecasting? |
Error Variance | The spread of the point forecast error around the mean. |
Accuracy Risk | The variability in the accuracy metric over multiple forecasts. |
Stability | The degree to which forecasts remain unchanged subject to minor variations in the underlying data. |
Scale dependent metrics #
Let $e(t)$ be the one-step forecast error, which is the difference between the observation $y(t)$ and the forecast made using all observations up to but not including $y(t)$.
\( e(t)=y(t)-f(t) \)
By scale dependent we mean that the value of the metric depends on the scale of the data.
ME #
Mean Error
MAE #
Mean Absolute Error
MdAE #
Median Absolute Error
GMAE #
Geometric Mean Absolute Error
MSE #
Mean Square Error
RMSE #
Root Mean Square Error
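As a reference sketch (the helper name is my own, assuming NumPy arrays of actuals and forecasts over the evaluation window), the scale dependent metrics above can be computed as:

```python
import numpy as np

def scale_dependent_metrics(y, f):
    """ME, MAE, MdAE, GMAE, MSE and RMSE on the raw errors e(t) = y(t) - f(t)."""
    e = np.asarray(y, dtype=float) - np.asarray(f, dtype=float)
    abs_e = np.abs(e)
    return {
        "ME":   np.mean(e),
        "MAE":  np.mean(abs_e),
        "MdAE": np.median(abs_e),
        # geometric mean of |e(t)|; a single zero error collapses this to 0
        "GMAE": np.exp(np.mean(np.log(abs_e))) if np.all(abs_e > 0) else 0.0,
        "MSE":  np.mean(e ** 2),
        "RMSE": np.sqrt(np.mean(e ** 2)),
    }
```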
Percentage error metrics #
For scale dependent metrics, the value of the metric depends on the scale of the data. Therefore, they do not facilitate comparison across different time series or across different time intervals.
First we define a relative or percentage error as
\( pe(t)=\frac{y(t)-f(t)}{y(t)} \)
Percentage errors have the advantage of being scale independent and hence useful to compare performance between different time series.
When the time series values are very close to zero (as is common for intermittent demand data), the percentage error becomes meaningless, and it is undefined when the value is exactly zero.
MPE #
Mean Percentage Error
MAPE #
Mean Absolute Percentage Error
MdAPE #
Median Absolute Percentage Error
sMAPE #
Symmetric Mean Absolute Percentage Error
sMdAPE #
Symmetric Median Absolute Percentage Error
MSPE #
Mean Square Percentage Error
RMSPE #
Root Mean Square Percentage Error
MAAPE #
Mean Arctangent Absolute Percentage Error
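A sketch of the percentage-error family (the helper name is my own; note that sMAPE definitions vary across papers, and the variant below is only one common choice):

```python
import numpy as np

def percentage_error_metrics(y, f):
    """MAPE, MdAPE, sMAPE and MAAPE. MAPE and MdAPE blow up when y(t) = 0,
    whereas the arctangent in MAAPE keeps each term bounded."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    ape = np.abs((y - f) / y)                       # absolute percentage errors
    return {
        "MAPE":  100 * np.mean(ape),
        "MdAPE": 100 * np.median(ape),
        # one common sMAPE variant; other papers use slightly different denominators
        "sMAPE": 100 * np.mean(2 * np.abs(y - f) / (np.abs(y) + np.abs(f))),
        # MAAPE (Kim & Kim, 2016): each term lies in [0, pi/2]
        "MAAPE": np.mean(np.arctan(ape)),
    }
```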

Relative error metrics #
An alternative to percentage error for the calculation of scale-independent metrics involves dividing each error by the error obtained using a baseline forecasting algorithm (typically the naive baseline $f^*(t)=y(t-1)$).
\( re(t)=\frac{e(t)}{e^*(t)}=\frac{y(t)-f(t)}{y(t)-f^*(t)} \)
MARE #
Mean Absolute Relative Error
MdARE #
Median Absolute Relative Error
GMARE #
Geometric Mean Absolute Relative Error
U-statistic #
Theil’s $U$-statistic
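A minimal sketch of a relative error metric against the naive baseline $f^*(t)=y(t-1)$ (the function name and sample values are mine; the relative error is undefined whenever the baseline forecast is exact):

```python
import numpy as np

def mare(y, f, f_baseline):
    """Mean Absolute Relative Error against a baseline forecast,
    e.g. the naive forecast f*(t) = y(t-1)."""
    y, f, f_baseline = (np.asarray(a, dtype=float) for a in (y, f, f_baseline))
    re = (y - f) / (y - f_baseline)    # undefined when the baseline error is zero
    return np.mean(np.abs(re))

# Naive baseline over the evaluation window: f*(t) = y(t-1).
y_hist = np.array([100.0, 102.0, 101.0, 105.0, 107.0])
y, f = y_hist[1:], np.array([101.0, 102.0, 104.0, 106.0])
print(mare(y, f, f_baseline=y_hist[:-1]))
```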
Scaled error metrics #
This generally involves scaling the error term by a suitable scale parameter $s$, for example the in-sample mean absolute error of a naive forecast.
\( se(t)=\frac{e(t)}{s}=\frac{y(t)-f(t)}{s} \)
MAD/Mean #
MAD/Mean ratio
wMAPE #
(volume) weighted Mean Absolute Percentage Error
MASE #
Mean Absolute Scaled Error
RMSSE #
Root Mean Squared Scaled Error
MSLE #
Mean Squared Log Error
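A sketch of the two most common scaled error metrics, assuming a naive scale computed on the training data (the seasonal variant uses lag $m$). Function names are my own, and the RMSSE below ignores M5-specific details such as the treatment of leading zeros:

```python
import numpy as np

def mase(y_train, y_test, f_test, m=1):
    """Mean Absolute Scaled Error (Hyndman & Koehler, 2006).
    The scale s is the in-sample MAE of the (seasonal-)naive forecast with lag m."""
    y_train, y_test, f_test = (np.asarray(a, dtype=float)
                               for a in (y_train, y_test, f_test))
    s = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_test - f_test)) / s

def rmsse(y_train, y_test, f_test, m=1):
    """Root Mean Squared Scaled Error, scaled by the in-sample MSE of the naive forecast."""
    y_train, y_test, f_test = (np.asarray(a, dtype=float)
                               for a in (y_train, y_test, f_test))
    s = np.mean((y_train[m:] - y_train[:-m]) ** 2)
    return np.sqrt(np.mean((y_test - f_test) ** 2) / s)
```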
Correlation based metrics #
PCC #
Pearson correlation coefficient ($r$)
KRCC #
Kendall rank correlation coefficient ($\tau $)
A rank correlation coefficient measures the degree of similarity between two rankings, and can be used to assess the significance of the relation between them.
For $t_i < t_j$, the pair of observations $\left(y(t_i), f(t_i)\right)$ and $\left(y(t_j), f(t_j)\right)$ from the two sequences $y(t)$ and $f(t)$ is said to be concordant if the ranks agree: that is, if both $f(t_i)>f(t_j)$ and $y(t_i)>y(t_j)$, or if both $f(t_i)<f(t_j)$ and $y(t_i)<y(t_j)$.
The pair is said to be discordant if $f(t_i)>f(t_j)$ and $y(t_i)<y(t_j)$, or if $f(t_i)<f(t_j)$ and $y(t_i)>y(t_j)$.
If $f(t_i)=f(t_j)$ or $y(t_i)=y(t_j)$, the pair is neither concordant nor discordant.
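For example, Kendall's $\tau$ between actuals and forecasts over an evaluation window can be computed directly with SciPy (the values below are illustrative only):

```python
import numpy as np
from scipy.stats import kendalltau

y = np.array([12.0, 15.0, 11.0, 18.0, 20.0])   # actuals
f = np.array([13.0, 14.0, 12.0, 19.0, 17.0])   # forecasts

tau, p_value = kendalltau(y, f)    # tau close to 1 => rankings largely agree
print(tau, p_value)
```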
NGini #
Normalized Gini Coefficient
Measures how far the true values, sorted by the predictions, are from a random sorting, in terms of the number of swaps. Let $y(t)$ represent the “True Values” and $f(t)$ the corresponding predictions, and let $y_f(t)$ represent the “Sorted True” values, i.e. $y(t)$ sorted by the values of $f(t)$.
Let $\text{swaps}_{\text{sorted}}$ represent the number of swaps of adjacent elements (as in bubble sort) it would take to get from the “Sorted True” state to the “True Values” state.
And let $\text{swaps}_{\text{random}}$ represent the number of swaps it would take, on average, to get from a random ordering to the “True Values” state.
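A common implementation (the variant popularised in Kaggle competitions) computes the normalized coefficient from the cumulative-gain curve of the true values sorted by the predictions, rather than by counting swaps explicitly; the sketch below assumes that definition and uses my own helper names:

```python
import numpy as np

def gini(y_true, y_pred):
    """Gini coefficient from the cumulative-gain curve of y_true sorted by y_pred."""
    y_true = np.asarray(y_true, dtype=float)
    order = np.argsort(-np.asarray(y_pred), kind="mergesort")  # sort by prediction, descending
    sorted_true = y_true[order]
    n = len(y_true)
    cum_gain = np.cumsum(sorted_true) / sorted_true.sum()
    return cum_gain.sum() / n - (n + 1) / (2.0 * n)

def normalized_gini(y_true, y_pred):
    """Normalized Gini: 1.0 for a perfect ordering, near 0 for a random ordering."""
    return gini(y_true, y_pred) / gini(y_true, y_true)
```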
Poisson metrics #
These are mainly suitable for sparse intermittent sales data, or count data in general, and are essentially based on Poisson goodness-of-fit statistics.
PNLL #
Poisson Negative Log Likelihood
P-DEV #
Poisson Deviance
CHI^2 #
Pearson’s chi-squared statistic
D^2 #
Percent of deviance explained
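A sketch of the Poisson likelihood-based metrics, assuming the predicted rates $f(t)$ are strictly positive (helper names are mine):

```python
import numpy as np
from scipy.special import gammaln, xlogy

def poisson_nll(y, f):
    """Mean Poisson negative log-likelihood of counts y under predicted rates f."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    # -log P(y | f) = f - y*log(f) + log(y!)
    return np.mean(f - xlogy(y, f) + gammaln(y + 1))

def poisson_deviance(y, f):
    """Poisson deviance; the term y*log(y/f) is taken as 0 when y = 0."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    return 2.0 * np.sum(xlogy(y, y / f) - (y - f))
```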
Metrics for Quantile Forecasts #
Point forecast $f(t)$ and the observed/real value $y(t)$ at time $t$ can be expressed as \( y(t) = f(t) + \epsilon_t \) where $\epsilon_t$ is the corresponding error. The most common extension from point to probabilistic forecasts is to construct prediction intervals. A number of methods can be used for this purpose; the most popular ones take into account both the point forecast and the corresponding error: the center of the prediction interval at the $(1-\alpha)$ confidence level is set equal to $f(t)$, and its bounds $[f_L(t),f_U(t)]$ are defined by the $\alpha/2$ and $(1-\alpha/2)$ quantiles of the cumulative distribution function (CDF) of $\epsilon_t$.
Another popular alternative to this approach is to find the distribution of the point forecast $f(t)$ itself, denoted $F_t$, where $F_t$ is an explicit predictive distribution function, that is
\( F_{n+h}(x) = P (f(n+h) \leq x \,\,|\,\, \{y(t)\}_{t=1, \dots, n}) \)
From a practical perspective, a probabilistic forecast $F_{n+h}$ is usually represented as a histogram where each bin represents a range of future demand, and where the bin height represents the estimated probability that future demand will fall within the range associated with that bin.
When evaluating a probabilistic forecast, the main challenge is that we never observe the true distribution of the underlying process. In other words, we cannot compare the predictive distribution, $F_{t}$, or the prediction interval, $[f_L(t),f_U(t)]$, with the actual distribution of the $y(t)$, but only with observed past values $y(t)$, $t< n$.
Scoring rules provide summary measures for the evaluation of probabilistic forecasts by assigning a numerical score, $S(F_{t}, y(t))$, based on the predictive distribution $F_{t}$ and the actual observed value $y(t)$. With the convention that lower scores are better, a scoring rule is called proper if, given the true distribution of $y(t)$ as $Y_t$, the expected score is minimised by forecasting $Y_t$ itself: \( \mathbb{E}\left[S(Y_{t}, y(t))\right] \leq \mathbb{E}\left[S(F_{t}, y(t))\right] \) for every predictive distribution $F_t$. We now present two proper scoring rules: the pinball loss and the Continuous Ranked Probability Score.
Pinball Loss #
In general $y(t)$ is a random variable with its own distribution, and what is typically predicted as the point forecast, $f(t)$, is the mean $\mathbb{E}[y(t)]$ of this distribution. A quantile forecast $f_p(t)$ is an estimate of the $p^{th}$ quantile of the distribution. The $p^{th}$ quantile of $y(t)$ is defined as \( \mathbb{Q}_p[y(t)] = \left\{x : \text{Pr}[y(t) \leq x] = p \right\} \)
The pinball loss function is a metric used to assess the accuracy of a quantile forecast. \( pl(t,p) = \begin{cases} \ p\left(y(t)-f_p(t)\right)\quad\text{if}\quad f_p(t) \leq y(t)\\ (1-p)\left(f_p(t)-y(t)\right)\quad\text{if}\quad f_p(t) > y(t) \end{cases} \)

- The pinball loss is always non-negative, and the farther the forecast is from the real value, the larger the loss.
- The slope is used to reflect the desired imbalance in the quantile forecast.
- You want the truth $y(t)$ to be less than your prediction 100$p$% of the time. For example, your prediction for the 0.25 quantile should be such that the true value is less than your prediction 25% of the time. The loss function enforces this by penalizing cases when the truth is on the more unusual side of your prediction more heavily than when it is on the expected side.
- The metric further encourages you to predict as narrow a range of possibilities as possible by weighting the penalty by the absolute difference between the truth and your quantile.
- The asymmetric penalties mirror the trade-off between stockout costs and inventory holding costs when the quantile forecast is used to set stock levels.
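Putting the definition above into code, a minimal sketch of the (mean) pinball loss for a single quantile level $p$ (helper name and sample values are mine):

```python
import numpy as np

def pinball_loss(y, f_p, p):
    """Average pinball (quantile) loss of the p-quantile forecast f_p."""
    y, f_p = np.asarray(y, dtype=float), np.asarray(f_p, dtype=float)
    loss = np.where(f_p <= y, p * (y - f_p), (1 - p) * (f_p - y))
    return np.mean(loss)

y = np.array([10.0, 12.0, 9.0, 15.0])
print(pinball_loss(y, f_p=np.array([11.0, 11.5, 10.0, 13.0]), p=0.25))
```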
MSPL #
Mean Scaled Pinball Loss
Metrics for Prediction Intervals #
Prediction intervals express the uncertainty in the forecasts. This is useful because it provides the user of the forecasts with worst and best case estimates and a sense of how dependable the forecast is.
While we would like to estimate the actual distribution from the training data, as a surrogate we typically estimate a 95% prediction interval $[f_L(t),f_U(t)]$ such that $\text{Pr}[f_L(t) \leq y(t) \leq f_U(t)]=0.95$.
MSPL values can be aggregated over different values of $p$. For example, for a 95% prediction interval, $p$ can be set to $p_1=0.025$, $p_2=0.5$, $p_3=0.975$. The aggregate MSPL then measures the accuracy of the prediction interval.
AgMSPL #
Aggregate Mean Scaled Pinball Loss
CRPS #
Continuous Ranked Probability Score
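When the predictive distribution is available as Monte-Carlo samples, the CRPS can be estimated from the identity $\text{CRPS}(F, y) = \mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$, where $X, X'$ are independent draws from $F$. A sketch (helper name and sample values are mine, for illustration only):

```python
import numpy as np

def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    x = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(x - y))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2

# e.g. 1000 Monte-Carlo samples from the predictive distribution for one time step
rng = np.random.default_rng(0)
samples = rng.normal(loc=100.0, scale=5.0, size=1000)
print(crps_from_samples(samples, y=104.0))
```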
Metrics for multiple time series #
When forecasting multiple time series, an aggregate metric can be computed via a weighted combination of the corresponding metric ($\text{metric}_j$) for each time series $y_j(t)$. \( \text{W-metric}= \frac{\sum_{j=1}^K w_j \text{metric}_j}{\sum_{j=1}^K w_j} \) For example, \( \text{W-MASE}= \frac{\sum_{j=1}^K w_j \text{MASE}_j}{\sum_{j=1}^K w_j} \) The weights $w_j$ represent the relative importance of the $j^{th}$ time series $y_j(t)$ (see the sketch after the list below).
- The simplest weights are $w_j=1/K$ which gives a uniform weight to all the time series.
- For example, the M5 competition used the sum of units sold in the last 28 observations multiplied by their respective price (cumulative actual dollar sales). It is an objective proxy of the value of a unit for the company in monetary terms, hence it can be used as a weight.
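A minimal sketch of the weighted combination (the helper name is mine; the dollar-sales weights below are illustrative only, mimicking the M5-style weighting):

```python
import numpy as np

def weighted_metric(metric_values, weights=None):
    """Weighted combination of a per-series metric, e.g. W-MASE.
    With weights=None every series gets the same weight 1/K."""
    metric_values = np.asarray(metric_values, dtype=float)
    if weights is None:
        weights = np.ones_like(metric_values)
    weights = np.asarray(weights, dtype=float)
    return np.sum(weights * metric_values) / np.sum(weights)

# e.g. weight each series by its recent dollar sales
mase_per_series = [0.8, 1.1, 0.95]
dollar_sales    = [1200.0, 300.0, 4500.0]
print(weighted_metric(mase_per_series, dollar_sales))
```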
Metrics for hierarchical forecasting #

Node level metrics: The forecast accuracy metric can be computed at any node in the specified hierarchy. However, there are different ways to generate an accuracy metric at a node.
- aggregate-metrics: For any node, compute the metric for all the children and then aggregate the metrics via a suitable weighted aggregation method, as earlier.
- aggregate-forecasts: Forecast for all the children at a node, aggregate the forecasts, and then evaluate the metric.
- aggregate-data: Aggregate the data for all the children at a node, forecast on the aggregate, and then evaluate the metric.
Aggregate across all nodes at a level
Aggregate across all levels
Metrics for new products #
- Cold start evaluation
- Forecast adjustments after $k$ time periods
References #
https://otexts.com/fpp2/accuracy.html
https://www.relexsolutions.com/resources/measuring-forecast-accuracy/
Rob J. Hyndman and Anne B. Koehler, Another look at measures of forecast accuracy, International Journal of Forecasting, Volume 22, Issue 4, 2006.
Rob J Hyndman, Another look at forecast-accuracy metrics for intermittent demand. Chapter 3.4, pages 204-211, in “Business Forecasting: Practical Problems and Solutions”, John Wiley & Sons, 2015.
Sungil Kim and Heeyoung Kim, A new metric of absolute percentage error for intermittent demand forecasts, International Journal of Forecasting, Volume 32, Issue 3, 2016.
Makridakis, S., Spiliotis, E. and Assimakopoulos, V., The M4 Competition: 100,000 time series and 61 forecasting methods, International Journal of Forecasting, 36, 54–74, 2020.
Spyros Makridakis and Michèle Hibon, The M3-Competition: results, conclusions and implications, International Journal of Forecasting, Volume 16, Issue 4, 2000.
Fotios Petropoulos, Xun Wang and Stephen M. Disney, The inventory performance of forecasting methods: Evidence from the M3 competition data, International Journal of Forecasting, Volume 35, Issue 1, 2019.
Gneiting, Tilmann, Fadoua Balabdaoui, and Adrian E. Raftery, Probabilistic forecasts, calibration and sharpness, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2), 243-268, 2007.