Forecast accuracy metrics #
A compilation of various metrics used to measure forecast accuracy
Let $y(t)$ be the actual observation for time period $t$ and let $f(t)$ be the forecast for the same time period. Let $n$ be the length of the training dataset (the number of historical observations) and $h$ the forecasting horizon.
Forecast errors #
A forecast error is the difference between an observed value and its forecast. Several metrics are defined based on how the forecast error $e(t)$ at a time point $t$ is measured.
acronym | error type | equation | scale independent |
---|---|---|---|
*E | error | \(e(t)=y(t)-f(t)\) | no |
*PE | percentage error | \(pe(t)=\frac{y(t)-f(t)}{y(t)}\) | yes |
*RE | relative error | \(re(t)=\frac{y(t)-f(t)}{y(t)-f^*(t)}\) | yes |
*SE | scaled error | \(se(t)=\frac{y(t)-f(t)}{s}\) | yes |
We evaluate the accuracy over $h$ forecasting time steps. The final forecast accuracy metric is then an aggregation (via mean, median, etc.) of these errors over the $h$ time steps.
acronym | aggregation |
---|---|
M* | Mean |
MA* | Mean Absolute |
Md* | Median |
MdA* | Median Absolute |
GM* | Geometric Mean |
GMA* | Geometric Mean Absolute |
MS* | Mean Squared |
RMS* | Root Mean Squared |
Metric names combine an aggregation prefix with an error type: {M|Md|GM|RMS}{-|A|S}{E|PE|RE|SE}. For example, MdA + PE gives MdAPE, the Median Absolute Percentage Error.
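To make the naming scheme concrete, here is a minimal NumPy sketch (the function and variable names are my own, not from any particular library) that builds a few of these metrics from an error type plus an aggregation:

```python
import numpy as np

def forecast_errors(y, f):
    """Raw errors e(t) = y(t) - f(t) over the forecast horizon."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    return y - f

# Illustrative actuals and point forecasts over h = 4 steps.
y = np.array([112.0, 118.0, 132.0, 129.0])
f = np.array([110.0, 120.0, 128.0, 131.0])

e  = forecast_errors(y, f)
pe = e / y                                   # percentage errors

# Compose an error type with an aggregation to obtain a named metric.
mae   = np.mean(np.abs(e))                   # MA  + E  -> MAE
mdape = np.median(np.abs(pe))                # MdA + PE -> MdAPE
rmse  = np.sqrt(np.mean(e ** 2))             # RMS + E  -> RMSE
```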
TLDR #
While there are many metrics to choose from, we recommend that the following six be included in any forecasting toolkit.
acronym | name | comments |
---|---|---|
MAE | Mean Absolute Error | For assessing accuracy on a single time series, use MAE because it is the easiest to interpret. However, it cannot be compared across series because it is scale dependent. |
RMSE | Root Mean Square Error | A forecast method that minimises the MAE will lead to forecasts of the median, while minimising the RMSE will lead to forecasts of the mean. Hence RMSE is also widely used, despite being more difficult to interpret. |
MAPE | Mean Absolute Percentage Error | MAPE has the advantage of being scale independent and hence can be used to compare forecast performance between different time series. However MAPE has the disadvantage of being infinite or undefined if there are zero values in the time series, as is frequent for intermittent demand data. |
MASE | Mean Absolute Scaled Error | MASE is a scale free error metric that can be used to compare forecast methods on a single time series and also to compare forecast accuracy between series. It is also well suited for intermittent demand time series since it never gives infinite or undefined values. |
wMAPE | (volume) weighted Mean Absolute Percentage Error | This is MAPE weighted by volume. |
MSPL | Mean Scaled Pinball Loss | The metric to use for quantile forecasts. |
Categorizing metrics #
category | description |
---|---|
Accuracy | How close are the forecasts to the true value? |
Bias | Is there a systematic under- or over-forecasting? |
Error Variance | The spread of the point forecast error around the mean. |
Accuracy Risk | The variability in the accuracy metric over multiple forecasts. |
Stability | The degree to which forecasts remain unchanged subject to minor variations in the underlying data. |
Scale dependent metrics #
Let $e(t)$ be the one-step forecast error, which is the difference between the observation $y(t)$ and the forecast made using all observations up to but not including $y(t)$.
\( e(t)=y(t)-f(t) \)
By scale dependent we mean that the value of the metric depends on the scale of the data.
ME #
Mean Error
MAE #
Mean Absolute Error
MdAE #
Median Absolute Error
GMAE #
Geometric Mean Absolute Error
MSE #
Mean Square Error
RMSE #
Root Mean Square Error
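As a reference sketch (the helper name is my own, assuming NumPy arrays of actuals and forecasts over the evaluation window), the scale dependent metrics above can be computed as:

```python
import numpy as np

def scale_dependent_metrics(y, f):
    """ME, MAE, MdAE, GMAE, MSE and RMSE on the raw errors e(t) = y(t) - f(t)."""
    e = np.asarray(y, dtype=float) - np.asarray(f, dtype=float)
    abs_e = np.abs(e)
    return {
        "ME":   np.mean(e),
        "MAE":  np.mean(abs_e),
        "MdAE": np.median(abs_e),
        # geometric mean of |e(t)|; a single zero error collapses this to 0
        "GMAE": np.exp(np.mean(np.log(abs_e))) if np.all(abs_e > 0) else 0.0,
        "MSE":  np.mean(e ** 2),
        "RMSE": np.sqrt(np.mean(e ** 2)),
    }
```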
Percentage error metrics #
For scale dependent metrics, the value of the metric depends on the scale of the data. Therefore, they do not facilitate comparison across different time series or across different time intervals.
First we define a relative or percentage error as
\( pe(t)=\frac{y(t)-f(t)}{y(t)} \)
Percentage errors have the advantage of being scale independent and hence useful to compare performance between different time series.
When the time series values are very close to zero (as is common for intermittent demand data), the percentage error becomes meaningless, and it is undefined when the value is exactly zero.
MPE #
Mean Percentage Error
MAPE #
Mean Absolute Percentage Error
MdAPE #
Median Absolute Percentage Error
sMAPE #
Symmetric Mean Absolute Percentage Error
sMdAPE #
Symmetric Median Absolute Percentage Error
MSPE #
Mean Square Percentage Error
RMSPE #
Root Mean Square Percentage Error
MAAPE #
Mean Arctangent Absolute Percentage Error
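A sketch of the percentage-error family (the helper name is my own; note that sMAPE definitions vary across papers, and the variant below is only one common choice):

```python
import numpy as np

def percentage_error_metrics(y, f):
    """MAPE, MdAPE, sMAPE and MAAPE. MAPE and MdAPE blow up when y(t) = 0,
    whereas the arctangent in MAAPE keeps each term bounded."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    ape = np.abs((y - f) / y)                       # absolute percentage errors
    return {
        "MAPE":  100 * np.mean(ape),
        "MdAPE": 100 * np.median(ape),
        # one common sMAPE variant; other papers use slightly different denominators
        "sMAPE": 100 * np.mean(2 * np.abs(y - f) / (np.abs(y) + np.abs(f))),
        # MAAPE (Kim & Kim, 2016): each term lies in [0, pi/2]
        "MAAPE": np.mean(np.arctan(ape)),
    }
```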

Relative error metrics #
An alternative to percentage error for the calculation of scale-independent metrics involves dividing each error by the error obtained using a baseline forecasting algorithm (typically the naive baseline $f^*(t)=y(t-1)$).
\( re(t)=\frac{e(t)}{e^*(t)}=\frac{y(t)-f(t)}{y(t)-f^*(t)} \)
MARE #
Mean Absolute Relative Error
MdARE #
Median Absolute Relative Error
GMARE #
Geometric Mean Absolute Relative Error
U-statistic #
Theil’s $U$-statistic
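A minimal sketch of a relative error metric against the naive baseline $f^*(t)=y(t-1)$ (the function name and sample values are mine; the relative error is undefined whenever the baseline forecast is exact):

```python
import numpy as np

def mare(y, f, f_baseline):
    """Mean Absolute Relative Error against a baseline forecast,
    e.g. the naive forecast f*(t) = y(t-1)."""
    y, f, f_baseline = (np.asarray(a, dtype=float) for a in (y, f, f_baseline))
    re = (y - f) / (y - f_baseline)    # undefined when the baseline error is zero
    return np.mean(np.abs(re))

# Naive baseline over the evaluation window: f*(t) = y(t-1).
y_hist = np.array([100.0, 102.0, 101.0, 105.0, 107.0])
y, f = y_hist[1:], np.array([101.0, 102.0, 104.0, 106.0])
print(mare(y, f, f_baseline=y_hist[:-1]))
```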
Scaled error metrics #
This generally involves scaling the error term by a suitable scale parameter $s$, for example the in-sample mean absolute error of a naive forecast.
\( se(t)=\frac{e(t)}{s}=\frac{y(t)-f(t)}{s} \)
MAD/Mean #
MAD/Mean ratio
wMAPE #
(volume) weighted Mean Absolute Percentage Error
MASE #
Mean Absolute Scaled Error
RMSSE #
Root Mean Squared Scaled Error
MSLE #
Mean Squared Log Error
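A sketch of the two most common scaled error metrics, assuming a naive scale computed on the training data (the seasonal variant uses lag $m$). Function names are my own, and the RMSSE below ignores M5-specific details such as the treatment of leading zeros:

```python
import numpy as np

def mase(y_train, y_test, f_test, m=1):
    """Mean Absolute Scaled Error (Hyndman & Koehler, 2006).
    The scale s is the in-sample MAE of the (seasonal-)naive forecast with lag m."""
    y_train, y_test, f_test = (np.asarray(a, dtype=float)
                               for a in (y_train, y_test, f_test))
    s = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_test - f_test)) / s

def rmsse(y_train, y_test, f_test, m=1):
    """Root Mean Squared Scaled Error, scaled by the in-sample MSE of the naive forecast."""
    y_train, y_test, f_test = (np.asarray(a, dtype=float)
                               for a in (y_train, y_test, f_test))
    s = np.mean((y_train[m:] - y_train[:-m]) ** 2)
    return np.sqrt(np.mean((y_test - f_test) ** 2) / s)
```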
Correlation based metrics #
PCC #
Pearson correlation coefficient ($r$)
KRCC #
Kendall rank correlation coefficient ($\tau $)
A rank correlation coefficient measures the degree of similarity between two rankings, and can be used to assess the significance of the relation between them.
For $t_i < t_j$, the pair of observations $\left(y(t_i), f(t_i)\right)$ and $\left(y(t_j), f(t_j)\right)$ from the two sequences $y(t)$ and $f(t)$ is said to be concordant if the ranks agree: that is, if both $f(t_i)>f(t_j)$ and $y(t_i)>y(t_j)$, or if both $f(t_i)<f(t_j)$ and $y(t_i)<y(t_j)$.
The pair is said to be discordant if $f(t_i)>f(t_j)$ and $y(t_i)<y(t_j)$, or if $f(t_i)<f(t_j)$ and $y(t_i)>y(t_j)$.
If $f(t_i)=f(t_j)$ or $y(t_i)=y(t_j)$, the pair is neither concordant nor discordant.
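For example, Kendall's $\tau$ between actuals and forecasts over an evaluation window can be computed directly with SciPy (the values below are illustrative only):

```python
import numpy as np
from scipy.stats import kendalltau

y = np.array([12.0, 15.0, 11.0, 18.0, 20.0])   # actuals
f = np.array([13.0, 14.0, 12.0, 19.0, 17.0])   # forecasts

tau, p_value = kendalltau(y, f)    # tau close to 1 => rankings largely agree
print(tau, p_value)
```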
NGini #
Normalized Gini Coefficient
Measures how far the true values, sorted by the predictions, are from a random sorting, in terms of the number of swaps. Let $y(t)$ represent the “True Values” and $f(t)$ the corresponding predictions, and let $y_f(t)$ represent the “Sorted True” values, i.e. $y(t)$ sorted by the values of $f(t)$.
Let $\text{swaps}_{\text{sorted}}$ represent the number of swaps of adjacent elements (as in bubble sort) it would take to get from the “Sorted True” state to the “True Values” state.
And let $\text{swaps}_{\text{random}}$ represent the number of swaps it would take, on average, to get from a random ordering to the “True Values” state.
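A common implementation (the variant popularised in Kaggle competitions) computes the normalized coefficient from the cumulative-gain curve of the true values sorted by the predictions, rather than by counting swaps explicitly; the sketch below assumes that definition and uses my own helper names:

```python
import numpy as np

def gini(y_true, y_pred):
    """Gini coefficient from the cumulative-gain curve of y_true sorted by y_pred."""
    y_true = np.asarray(y_true, dtype=float)
    order = np.argsort(-np.asarray(y_pred), kind="mergesort")  # sort by prediction, descending
    sorted_true = y_true[order]
    n = len(y_true)
    cum_gain = np.cumsum(sorted_true) / sorted_true.sum()
    return cum_gain.sum() / n - (n + 1) / (2.0 * n)

def normalized_gini(y_true, y_pred):
    """Normalized Gini: 1.0 for a perfect ordering, near 0 for a random ordering."""
    return gini(y_true, y_pred) / gini(y_true, y_true)
```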
Poisson metrics #
These are mainly suitable for sparse intermittent sales data, or count data in general, and are essentially based on Poisson goodness-of-fit statistics.
PNLL #
Poisson Negative Log Likelihood
P-DEV #
Poisson Deviance
CHI^2 #
Pearson’s chi-squared statistic
D^2 #
Percent of deviance explained
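A sketch of the Poisson likelihood-based metrics, assuming the predicted rates $f(t)$ are strictly positive (helper names are mine):

```python
import numpy as np
from scipy.special import gammaln, xlogy

def poisson_nll(y, f):
    """Mean Poisson negative log-likelihood of counts y under predicted rates f."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    # -log P(y | f) = f - y*log(f) + log(y!)
    return np.mean(f - xlogy(y, f) + gammaln(y + 1))

def poisson_deviance(y, f):
    """Poisson deviance; the term y*log(y/f) is taken as 0 when y = 0."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    return 2.0 * np.sum(xlogy(y, y / f) - (y - f))
```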
Metrics for Quantile Forecasts #
Point forecast $f(t)$ and the observed/real value $y(t)$ at time $t$ can be expressed as \( y(t) = f(t) + \epsilon_t \) where $\epsilon_t$ is the corresponding error. The most common extension from point to probabilistic forecasts is to construct prediction intervals. A number of methods can be used for this purpose; the most popular ones take into account both the point forecast and the corresponding error: the center of the prediction interval at the $(1-\alpha)$ confidence level is set equal to $f(t)$, and its bounds $[f_L(t),f_U(t)]$ are defined by the $\alpha/2$ and $(1-\alpha/2)$ quantiles of the cumulative distribution function (CDF) of $\epsilon_t$.
Another popular alternative to this approach is to find the distribution of the point forecast $f(t)$ itself, denoted $F_t$, where $F_t$ is an explicit predictive distribution function, that is
\( F_{n+h}(x) = P (f(n+h) \leq x \,\,|\,\, \{y(t)\}_{t=1, \dots, n}) \)
From a practical perspective, a probabilistic forecast $F_{n+h}$ is usually represented as a histogram where each bin represents a range of future demand, and where the bin height represents the estimated probability that future demand will fall within the range associated with that bin.
When evaluating a probabilistic forecast, the main challenge is that we never observe the true distribution of the underlying process. In other words, we cannot compare the predictive distribution, $F_{t}$, or the prediction interval, $[f_L(t),f_U(t)]$, with the actual distribution of the $y(t)$, but only with observed past values $y(t)$, $t< n$.
Scoring rules provide summary measures for the evaluation of probabilistic forecasts by assigning a numerical score, $S(F_{t}, y(t))$, based on the predictive distribution $F_{t}$ and the actual observed value $y(t)$. With the convention that lower scores are better, a scoring rule is called proper if, given the true distribution of $y(t)$ as $Y_t$, the expected score is minimised by forecasting $Y_t$ itself: \( \mathbb{E}\left[S(Y_{t}, y(t))\right] \leq \mathbb{E}\left[S(F_{t}, y(t))\right] \) for every predictive distribution $F_t$. We now present two proper scoring rules: the pinball loss and the Continuous Ranked Probability Score.
Pinball Loss #
In general $y(t)$ is a random variable with its own distribution, and what is typically predicted as the point forecast, $f(t)$, is the mean $\mathbb{E}[y(t)]$ of this distribution. A quantile forecast $f_p(t)$ is an estimate of the $p^{th}$ quantile of the distribution. The $p^{th}$ quantile of $y(t)$ is defined as \( \mathbb{Q}_p[y(t)] = \left\{x : \text{Pr}[y(t) \leq x] = p \right\} \)
The pinball loss function is a metric used to assess the accuracy of a quantile forecast. \( pl(t,p) = \begin{cases} \ p\left(y(t)-f_p(t)\right)\quad\text{if}\quad f_p(t) \leq y(t)\\ (1-p)\left(f_p(t)-y(t)\right)\quad\text{if}\quad f_p(t) > y(t) \end{cases} \)

- The pinball loss is always non-negative, and the farther the forecast is from the real value, the larger the loss.
- The slope is used to reflect the desired imbalance in the quantile forecast.
- You want the truth $y(t)$ to be less than your prediction 100$p$% of the time. For example, your prediction for the 0.25 quantile should be such that the true value is less than your prediction 25% of the time. The loss function enforces this by penalizing cases when the truth is on the more unusual side of your prediction more heavily than when it is on the expected side.
- The metric further encourages you to predict as narrow a range of possibilities as possible by weighting the penalty by the absolute difference between the truth and your quantile.
- The asymmetric penalties mirror the trade-off between stockout costs and inventory holding costs when the quantile forecast is used to set stock levels.
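Putting the definition above into code, a minimal sketch of the (mean) pinball loss for a single quantile level $p$ (helper name and sample values are mine):

```python
import numpy as np

def pinball_loss(y, f_p, p):
    """Average pinball (quantile) loss of the p-quantile forecast f_p."""
    y, f_p = np.asarray(y, dtype=float), np.asarray(f_p, dtype=float)
    loss = np.where(f_p <= y, p * (y - f_p), (1 - p) * (f_p - y))
    return np.mean(loss)

y = np.array([10.0, 12.0, 9.0, 15.0])
print(pinball_loss(y, f_p=np.array([11.0, 11.5, 10.0, 13.0]), p=0.25))
```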
MSPL #
Mean Scaled Pinball Loss
Metrics for Prediction Intervals #
Prediction intervals express the uncertainty in the forecasts. This is useful because it provides the user of the forecasts with worst and best case estimates and a sense of how dependable the forecast is.
While we would like to estimate the actual distribution from the training data, as a surrogate we typically estimate a 95% prediction interval $[f_L(t),f_U(t)]$ such that $\text{Pr}[f_L(t) \leq y(t) \leq f_U(t)]=0.95$.
MSPL values can be aggregated over different values of $p$. For example, for a 95% prediction interval, $p$ can be set to $p_1=0.025$, $p_2=0.5$, $p_3=0.975$. The aggregate MSPL then measures the accuracy of the prediction interval.
AgMSPL #
Aggregate Mean Scaled Pinball Loss
CRPS #
Continuous Ranked Probability Score
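When the predictive distribution is available as Monte-Carlo samples, the CRPS can be estimated from the identity $\text{CRPS}(F, y) = \mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$, where $X, X'$ are independent draws from $F$. A sketch (helper name and sample values are mine, for illustration only):

```python
import numpy as np

def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    x = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(x - y))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2

# e.g. 1000 Monte-Carlo samples from the predictive distribution for one time step
rng = np.random.default_rng(0)
samples = rng.normal(loc=100.0, scale=5.0, size=1000)
print(crps_from_samples(samples, y=104.0))
```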
Metrics for multiple time series #
When forecasting multiple time series, an aggregate metric can be computed via a weighted combination of the corresponding metric ($\text{metric}_j$) for each time series $y_j(t)$. \( \text{W-metric}= \frac{\sum_{j=1}^K w_j \text{metric}_j}{\sum_{j=1}^K w_j} \) For example, \( \text{W-MASE}= \frac{\sum_{j=1}^K w_j \text{MASE}_j}{\sum_{j=1}^K w_j} \) The weights $w_j$ represent the relative importance of the $j^{th}$ time series $y_j(t)$ (see the sketch after the list below).
- The simplest weights are $w_j=1/K$ which gives a uniform weight to all the time series.
- For example, the M5 competition used the sum of units sold in the last 28 observations multiplied by their respective price (cumulative actual dollar sales). It is an objective proxy of the value of a unit for the company in monetary terms, hence it can be used as a weight.
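A minimal sketch of the weighted combination (the helper name is mine; the dollar-sales weights below are illustrative only, mimicking the M5-style weighting):

```python
import numpy as np

def weighted_metric(metric_values, weights=None):
    """Weighted combination of a per-series metric, e.g. W-MASE.
    With weights=None every series gets the same weight 1/K."""
    metric_values = np.asarray(metric_values, dtype=float)
    if weights is None:
        weights = np.ones_like(metric_values)
    weights = np.asarray(weights, dtype=float)
    return np.sum(weights * metric_values) / np.sum(weights)

# e.g. weight each series by its recent dollar sales
mase_per_series = [0.8, 1.1, 0.95]
dollar_sales    = [1200.0, 300.0, 4500.0]
print(weighted_metric(mase_per_series, dollar_sales))
```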
Metrics for hierarchical forecasting #

Node level metrics: The forecast accuracy metric can be computed at any node in the specified hierarchy. However, there are different ways to generate an accuracy metric at a node.
- aggregate-metrics: For any node, compute the metric for all the children and then aggregate the metrics via a suitable weighted aggregation method, as earlier.
- aggregate-forecasts: Forecast for all the children at a node, aggregate the forecasts, and then evaluate the metric.
- aggregate-data: Aggregate the data for all the children at a node, forecast on the aggregate, and then evaluate the metric.
Aggregate across all nodes at a level
Aggregate across all levels
Metrics for new products #
- Cold start evaluation
- Forecast adjustments after $k$ time periods
References #
https://otexts.com/fpp2/accuracy.html
https://www.relexsolutions.com/resources/measuring-forecast-accuracy/
Rob J. Hyndman and Anne B. Koehler, Another look at measures of forecast accuracy, International Journal of Forecasting, Volume 22, Issue 4, 2006.
Rob J Hyndman, Another look at forecast-accuracy metrics for intermittent demand. Chapter 3.4, pages 204-211, in “Business Forecasting: Practical Problems and Solutions”, John Wiley & Sons, 2015.
Sungil Kim and Heeyoung Kim, A new metric of absolute percentage error for intermittent demand forecasts, International Journal of Forecasting, Volume 32, Issue 3, 2016.
Makridakis, S., Spiliotis, E. and Assimakopoulos, V., The M4 Competition: 100,000 time series and 61 forecasting methods, International Journal of Forecasting, 36, 54–74, 2020.
Spyros Makridakis and Michèle Hibon, The M3-Competition: results, conclusions and implications, International Journal of Forecasting, Volume 16, Issue 4, 2000.
Fotios Petropoulos, Xun Wang and Stephen M. Disney, The inventory performance of forecasting methods: Evidence from the M3 competition data, International Journal of Forecasting, Volume 35, Issue 1, 2019.
Gneiting, Tilmann, Fadoua Balabdaoui, and Adrian E. Raftery, Probabilistic forecasts, calibration and sharpness, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2), 243-268, 2007.