TimeSHAP #
code API docs example notebook
TimeSHAP is a feature based post-hoc black box explainer to explain the forecast of any univariate time series forecaster using tree-based regressors to build the surrogate model and SHAP(SHapley Additive exPlanations) values for the explanations.
For any given univariate time series we first generate a sequence of backtested
historical forecasts using an (expanding window) splitter. Using the splitter we
split the time series into a sequence of train an test splits. The expanding window
splitter uses more and more training data, while keeping the test window size fixed
to the forecast horizon. The method is model agnostic in the sense that it needs
access to only the fit and predict methods of the forecaster. The forecaster
is trained on the train split and evaluated on the test split and all the test split
predictions are concatenated to get the backtested historical forecasts for each
step of the forecast horizon. We the construct a surrogate time series forecasting
task by reducing it to a standard supervised regression problem. For each time point
we generate a set of interpretable features (lag features, seasonal features,
date time encodings etc) based on which we need to predict the backtested forecasted
time series values. The surrogate model is fitted using tree-based regressors like
XGBoost, CatBoost,LightGBM etc. Explanation is in now in terms of features that
encode the time series. We mainly rely on the TreeSHAP
(SHapley Additive exPlanations) algorithm to explain the output of
ensemble tree models. In order to improve the sensitivity we extend the above
approach by aggregating multiple explanations from bootstrapped versions of
the time series.
Time series forecasting #
A univariate time series is a series with a single time-dependent variable. Let \(f(t):\mathbb{Z}\to\mathbb{R}^1\) represent a latent univariate time series for any discrete time index \(t \in \mathbb{Z}\) . We observe a sequence of historical noisy (and potentially missing) values \(y(t)\) for \(t \in [1,\dots,T]\) such that in expectation \(\mathbb{E}[y(t)]=f(t)\) . For example, in the retail domain \(y(t)\) could represent the daily sales of a product and \(f(t)\) the true latent demand for the product.
The task of time series forecasting is to estimate \(f(t)\) for all \(t >T\) based on the observed historical time series \(y(t)\) for \(t \in [1,\dots,T]\) . When we talk about the forecast, we usually mean the average value of the forecast distribution also known as the point forecast, \(f(t)\) , which is the mean ( \(\mathbb{E}[y(t)]\) ) of the forecast distribution. The time series forecast is typically done for a fixed number or periods in the future, refered to as the forecast horizon, \(h\) . Let \(\hat{f}(T+h)\) for \(h \in [1,\dots,H]\) be the forecasted time series for the forecast horizon \(h\) based on the historical observed time series \(y(1),...,y(T)\) . \( \textbf{forecaster model}\:\:\hat{f}(T+h|y(1),...,y(T))\:\text{for}\:h=1,...,H \)
Using the language of supervised learning, \(y(1),...,y(T)\) is the training data of \(T\) samples based on which we learn/train a forecaster model \(\hat{f}\) . The trained model \(\hat{f}\) is then used to predict/forecast on the test set of \(H\) time points in the future. Unlike supervised learing where the model is usually fit once in time series forecasting for many algorithms we will have to continously fit the model before forecasting as more recent data arrives.

Explainability for forecasting #
Explainability is the degree to which a human can understand the cause of a decision (or prediction) made by a prediction model 123.Various notions of explainability (local and global explanations) and explainer algorithms has been studied in classical supervised learning paradigms like classification and regression.
Scope of explanations #
An explanation is the answer to either a why-question or a what if-scenario. We define the following three notions of explanations in the context of time series forecasting.
- A local explantion explains the forecast made by a forecaster at a certain point in time.- Why is the forecasted sales on July 22, 2019 much higher than the average sales?
- Will the forecast increase if I increase the offered discount?
 
- A global explanation explains the forecaster trained on the historical time series.- What are the most important attributes the forecaster relies on to make the forecast?
- What is the impact of diccount on the sales forecast?
 
- A semi-local explanation explains the overall forecast made by a forecaster in a certain time interval. In general this returns one (semi-local) explanation aggregated over all the multiple time steps in the forecast horizon.- Why is the forecasted sales over the next 4 weeks much higher?
 
Type of explanations #
In the context of time series the explanations can boradly of the following two types.
- Features based Explanation is in terms of features that encode the time series (lag features, date encodings etc.) and external regressors.
- Instance based Explanation is in terms of the importance or certain time points in the historical time series.
Type of explainers #
An explainer is an algorithm the generates local, semi-local and global explanations for a forecasting algorithm. We can boradly categorize explainers into the following 3 types.
- Directly interpretable explainers Some time series forecasting algorithms are inherently interpretable by desgin. For example, for an autoregressive model of order \(p\) ( \(AR(p)\) ) the coefficients correponds to the importance associated with the past \(p\) values of the time series. Another example is, Prophet which uses a decomposable additive time series model with three main model components: trend, seasonality, and holidays. The explanation is directly in terms of these 3 components.
- White-box explainers Though not directly interpretable, some forecasting algorithms can be explained if we have access to the internals of the corresponding forecasting algorithm. For example, for deep neural forecasting algorithms if we have access to the model we can compute saliency maps and activations to explain the inner working of the model. Tree based regression ensembles like XGBoost, CatBoost and LightGBM gprovide global explanations in terms of the feature importance scores.
- Black-box explainers Such explainers are model agnostic and generally require access to model’s predict(and sometimesfit) functions. Given a source black box model, such explainers generally train a surrogate model that is explainable.
TimeSHAP #
TimeSHAP is a feature based post-hoc black box explainer to explain the forecast of any univariate time series forecaster using tree-based regressors to build the surrogate model and SHAP(SHapley Additive exPlanations) values for the explanations.
Surrogate model #
The method is model agnostic in the sense that it needs access to only the fit and predict methods of the forecaster.
At any time \(t\)
using all the historical data available so far (that its, \(y(1),...,y(t)\)
)
we can train the forecaster model \(f\)
using the fit method. Once the forecaster is trained we can
then generate the corresponding \(H\)
(forecast horizon) forecasts
\(
f(t+h|t)=f(t+h|y(1),...,y(t))\:\text{for}\:h=1,...,H,
\)
time steps ahead using the predict method. Our goal now is to learn a surrogate model(s) \(g\)
to predict the forecasts from the forecaster.
\(
g(t+h|t)=g(t+h|y(1),...,y(t))\:\text{for}\:h=1,...,H,
\)
While the original forecaster learns to predict \(y(t+h)\)
based on \(y(1),...,y(t)\)
the surrogate model is trained to predict the forecasts \(f(t+h)\)
made by the forecaster. Essentially we want to mimic the forecaster using a surrogate model. We choose the surrogate model that can be easily interpreted.

Backtested historical forecasts #
In order to generate data to train the surrogate model we use backtesting. For any given univariate time series we first generate a sequence of backtested historical forecasts using an expanding window splitter. Using the splitter we split the time series into a sequence of train and test splits. The expanding window splitter uses more and more training data, while keeping the test window size fixed to the forecast horizon. The forecaster is trained on the train split and evaluated on the test split and all the test split predictions are concatenated to get the backtested historical forecasts for each step of the forecast horizon.

This is one of the most computationally expensive steps since for a time series of length \(n\) we will have to potentially invoke `fit` and `predict` roughly \(O(n)\) times to generate the backtested forecasts. - This can be effficiently parallelized since each task can be executed independently. In our implementation we use [ray](https://github.com/ray-project/ray) to parallelize this. - For computationally expensive `fit` models it may be cheaper to forecast without full refitting. However we may need to pass the context/historical time series so far to do the forecast. Some forecasters have a light weight `update` method has the same signature as train but does not do refitting and only does minimal context updates to make the prediction. - Certain forecasting algorithms train one large model based on large number of multiple univariate time series. In such scenarios we completely avoid refitting the model and rely only on the `predict` method to generate the in-sample forecasts and build a surrogate model. While this avoids backtesting and can potentially overfit, since the model is trained or large number of time series this may be fine.
Regressor reduction #
We now have an original time series and the corresponding backtested forecast time series for each step of the forecast horizon. The goal of the surrogate model is to learn to predict the backtested forecast time series based on on the original time series. We construct a surrogate time series forecasting task by reducing it to a standard supervised regression problem.
A common machine learning approach to time series forecasting is to reduce it to a standard supervised regression problem. A standard supervised regression task takes as input a \(d\) -dimensional feature vector \(\mathbf{x}\in\mathbb{R}^d\) and predicts a scalar \(y \in \mathbb{R}\) . The regressor \(y = f(\mathbf{x})\) is learnt based on a labelled training dataset \(\left(\mathbf{x}_i,y_i\right)\) , for \(i=1,..,n\) samples. However there is no direct concept of input features ( \(\mathbf{x}\) ) and output target ( \(y\) ) for a time series. Instead, we must choose the backtested time series forecast values as the variable to be predicted and use various time series feature engineering techniques (like lag features, date time encodings etc.) to construct the features from the original time series.
Interpretable features #
Essentially, for each time point \(t\) we generate a feature vector \(\mathbf{x}(t) \in\mathbb{R}^d\) based on the original time series values observed so far based on which we need to predict the backtested forecast time series for each step in the forecast horizon \(y(t) \in\mathbb{R}\) . The table below is a list of features we use. See here for more details. We generally like the features to be interpretable in the sense that the end user of the explanation should be able to comprehend the meaning of these features.
The feature vector \(\mathbf{x}(t)\) needs to be constructed only based on the time step \(t\) and the historical values of the time series \(y(1),...,y(t-1)\) and should not use the current time series value \(y(t)\) .
| feature name | description | 
|---|---|
| sales(t-3) | The value of the time series (sales) at the (t-3) previous time step. | 
| sales(t-2) | The value of the time series (sales) at the (t-2) previous time step. | 
| sales(t-1) | The value of the time series (sales) at the (t-1) previous time step. | 
| sales(t-2*365) | The value of the time series (sales) at the (t-2*365) previous time step. | 
| sales(t-1*365) | The value of the time series (sales) at the (t-1*365) previous time step. | 
| sales_min(t-1,t-3) | The min of the past 3 values in the sales time series. | 
| sales_mean(t-1,t-3) | The mean of the past 3 values in the sales time series. | 
| sales_max(t-1,t-3) | The max of the past 3 values in the sales time series. | 
| sales_min(0,t-1) | The min of all the values so far in the sales time series. | 
| sales_mean(0,t-1) | The mean of all the values so far in the sales time series. | 
| sales_max(0,t-1) | The max of all the values so far in the sales time series. | 
| year | The year. | 
| month | The month name of the year from January to December. | 
| day_of_year | The ordinal day of the year from 1 to 365. | 
| day_of_month | The ordinal day of the month from 1 to 31. | 
| week_of_year | The ordinal week of the year from 1 to 52. | 
| week_of_month | The ordinal week of the month from 1 to 4. | 
| day_of_week | The day of the week from Monday to Sunday. | 
| is_weekend | Indicates whether the date is a weekend or not. | 
| quarter | The ordinal quarter of the date from 1 to 4. | 
| season | The season Spring/Summer/Fall/Winter. | 
| fashion_season | The fashion season Spring/Summer (January to June) or Fall/Winter (July to December). | 
| is_month_start | Indicates whether the date is the first day of the month. | 
| is_month_end | Indicates whether the date is the last day of the month. | 
| is_quarter_start | Indicates whether the date is the first day of the quarter. | 
| is_quarter_end | Indicates whether the date is the last day of the quarter. | 
| is_year_start | Indicates whether the date is the first day of the year. | 
| is_year_end | Indicates whether the date is the last day of the year. | 
| is_leap_year | Indicates whether the date belongs to a leap year. | 
| hour | The hours of the day. | 
| minute | The minutes of the hour. | 
| second | The seconds of the minute. | 
| holiday-IN | Indicates whether the date is a IN holiday or not. | 
| t | Feature to model simple polynomial (of degree 1) trend in sales. | 
| t^2 | Feature to model simple polynomial (of degree 2) trend in sales. | 
**External regressors** Classical time series forecasting algorithms esentially learn a model to forecast based on the historical values of the time series. In many domains, the value of the time series depends on several external time series which we refer to as related external regressors. For example in the retail domain, the sales is potentially influence by discount, promotion, events, weather etc. Each external regressor in itself is a time series and some methods allow to explicity include external regressors to improve forecasting. If external regressors are avaiable we can also encode them as interpretable features using similar features as above. Typically some exeternal regressors can be **forwad looking**. For example, the discount that will be used iin the future is typically planned by the retailer in advance.
| feature name | description | 
|---|---|
| discount(t-3) | The value of the time series (discount) at the (t-3) previous time step. | 
| discount(t-2) | The value of the time series (discount) at the (t-2) previous time step. | 
| discount(t-1) | The value of the time series (discount) at the (t-1) previous time step. | 
| discount(t) | The value of the time series (discount) at the current time step. | 
Multi-step forecasting #
In the last section we described some of the commonly used methods to transform the original time series to a set of interpretable features. Recall that we have \(H\) backtested time series forecats which are to be used as targets to learn the surrogate regressor model. Here we will desxribe common strategies4 that can be used used for multi-step forecasting to regression reduction.
Recursive #
A single regressor model is fit for one-step-ahead forecast horizon and then called recursively to predict multiple steps ahead. Let \(\mathcal{G}(y(1),...,y(t))\) be the one-step ahead surrogate forecaster that has been learnt based on the training data, where the forecaster predicts the one-step ahead forecast using features based on the time series values till \(t\) , that is, \(y(1),...,y(t)\) . The forecasts from the surrogate models are made recusively as follows, \( g(t+1|t) = \mathcal{G}(y(1),...,y(t)). \) For \(h=2,3,...,H\) \( g(t+h|t) = \mathcal{G}(y(1),...,y(t),g(t+1|t),...,g(t+h-1|t)). \)
For example,
| horizon | forecast | strategy | 
|---|---|---|
| 1 | \( g(t+1|t) \) | \( \mathcal{G}(y(1),...,y(t)) \) | 
| 2 | \( g(t+2|t) \) | \( \mathcal{G}(y(1),...,y(t),g(t+1|t)) \) | 
| h | \( g(t+h|t) \) | \( \mathcal{G}(y(1),...,y(t),g(t+1|t),...,g(t+h-1|t)) \) | 
A well-known drawback of the recursive method is its sensitivity to the estimation error, since estimated values, instead of actual ones, are more and more used when we get further in the future forecasts.

Direct #
A separate regressor model is fit for each step ahead in the forecast horizon and then independently applied to predict multiple steps ahead. Let \(\mathcal{G}_h(y(1),...,y(t))\) be the h-step ahead surrogate forecaster that has been learnt based on the training data, where the forecaster predicts the h-step ahead forecast using features based on the time series values till \(t\) , that is, \(y(1),...,y(t)\) . The forecasts are made directly as follows, \( g(t+h|t) = \mathcal{G}_h(y(1),...,y(t))\text{ for }h=1,2,...,H \)
For example,
| horizon | forecast | strategy | 
|---|---|---|
| 1 | \( g(t+1|t) \) | \( \mathcal{G}_1(y(1),...,y(t)) \) | 
| 2 | \( g(t+2|t) \) | \( \mathcal{G}_2(y(1),...,y(t)) \) | 
| h | \( g(t+h|t) \) | \( \mathcal{G}_h(y(1),...,y(t)) \) | 
Since the Direct strategy does not use any approximated values to compute the forecasts, it is not prone to any accumulation of errors. However since the models are learned independently no statistical dependencies between the predictions is considered. This strategy also demands a large computational time since the number of models to learn is equal to the size of the forecast horizon.

DirRec #
The DirRec strategy combines the architectures and the principles underlying the Direct and the Recursive strategies. DirRec computes the forecasts with different models for every horizon (like the Direct strategy) and, at each time step, it enlarges the set of inputs by adding variables corresponding to the forecasts of the previous step (like the Recursive strategy).
\( g(t+1|t) = \mathcal{G}_1(y(1),...,y(t)) \)For \(h=2,3,...,H\) \( g(t+h|t) = \mathcal{G}_h(y(1),...,y(t),g(t+1|t),...,g(t+h-1|t)) \)
For example,
| horizon | forecast | strategy | 
|---|---|---|
| 1 | \( g(t+1|t) \) | \( \mathcal{G}_1(y(1),...,y(t)) \) | 
| 2 | \( g(t+2|t) \) | \( \mathcal{G}_2(y(1),...,y(t),g(t+1|t)) \) | 
| h | \( g(t+h|t) \) | \( \mathcal{G}_h(y(1),...,y(t),g(t+1|t),...,g(t+h-1|t)) \) | 
Tree ensemble regressors #
So far, for each time point we generate a set of interpretable features (lag features, seasonal features, date time encodings etc) based on which we need to predict the backtested forecasted time series values. The surrogate model is then fitted using tree-based regressors like XGBoost, CatBoost,LightGBM etc.
SHapley Additive exPlanations #
While in general we can use any regressor we prefer tree-based ensembles like XGBoost, CatBoost and LightGBM since they are resonably accurate and more importantly support fast explaninablity algorithms like TreeSHAP which we reply on for the various explanations. Explanation is in now in terms of features that encode the time series. We mainly rely on the TreeSHAP (SHapley Additive exPlanations) algorithm to explain the output of ensemble tree models. SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions. 56
Global explanations #
A global explanation explains the forecaster trained on the historical time series. The explanation type can be one of following four types.
Note that the recurisve strategy has only one model, while the direct and the dirrec strategies have \(H\) models corresponding to each step of the forecasting horizon. Hence there will be a possible seprate global explanation for each model based on the forecast horizon.
SHAP feature importance #
The importance score for each features based on the shap values. Specifically this is the mean absolute value of the shap values for each feature across the entire dataset. To get an overview of which features are most important for a model we can compute the SHAP values of every feature for every sample. We can then take the mean absolute value of the SHAP values for each feature to get a feature importance score for each feature.

Feature importance #
The relative contribution of each feature to the model. A higher value of this score when compared to another feature implies it is more important for generating a forecast. Feature importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions with decision trees, the higher its relative importance. Feature importance is calculated for a single decision tree by the amount that each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for. The feature importances are then averaged across all of the the decision trees within the model. The gain is the most relevant attribute to interpret the relative importance of each feature. The gain implies the relative contribution of the corresponding feature to the model calculated by taking each feature’s contribution for each tree in the model. A higher value of this metric when compared to another feature implies it is more important for generating a prediction.

Partial dependence plot #
The partial dependence plot (PDP) for a feature shows the marginal effect the feature has on the forecast (from the surrogate model). The PDP shows how the average prediction in your dataset changes when a particular feature is changed. The partial dependence function at a particular feature value represents the average prediction if we force all data points to assume that feature value.
The calculation for the partial dependence plots has a causal interpretation. One way to think about PDP is that it is an **intervention query**. We intervene on a feature and measure the changes in the predictions. In doing so, we analyze the causal relationship between the feature and the prediction.

The assumption of independence is the biggest issue with PDP plots. It is assumed that the feature(s) for which the partial dependence is computed are not correlated with other features.
SHAP dependence plot #
The shap dependence plot (SDP) for each feature shows the mean shap value for a particular features across the entire dataset. This shows how the model depends on the given feature, and is like a richer extenstion of the classical partial dependence plots.

Local explanations #
A local explantion explains the forecast made by a forecaster at a certain point in time.
SHAP explanation #
SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations. While SHAP values can explain the output of any machine learning model, high-speed exact algorithms are available for tree ensemble methods
The SHAP explanation shows features contributing to push the forecasted sales from the base value (the average sales) to the forecaster model output. Features pushing the forecast higher are shown in blue and those pushing the forecast lower are in red.

This time instance (Mon Jul 1 00:00:00 2019) has a forecasted sales (5028.76), 2579.59 units higher than the average (2449.17) mainly because of the discount(t) (40.50) and sales(t-1) (4242.54) while sales(t-2*52) (4375.00) was trying to push it lower.
Local partial dependence plot #
The (local) PDP for a given feature shows how the forecast (from the surrogate model) varies as the feature value changes.

Local SHAP dependence plot #
The (local) SDP for a given feature shows how the shap value varies as the feature value changes.

Semi-local explanations #
A semi-local explanation explains the overall forecast made by a forecaster in a certain time interval. In general this returns one (semi-local) explanation aggregated over all the multiple time steps in the forecast horizon.
SHAP feature importance #
The importance score for each feature which is the corresponding shap value for that feature.

Partial dependence plot #
The PDP for a given feature shows how the forecast (from the surrogate model) varies as the feature value changes.

SHAP dependence plot #
The SDP for a given feature shows how the shap value varies as the feature value changes.

Explaining prediction intervals #
The same setup can also be used to explain the width of the prediction interval. Instead of regressing on the mean forecast from the forecaster we can regress on the width of the prediction interval.
Bootstrapped ensemble #
In order to improve the sensitivity we extend the above approach by aggregating multiple explanations from bootstrapped versions of the time series.
Examples #
Naive #
The forecast is the value of the last observation. \( f(t+h|t) = y(t),\text{ for }h=1,2,... \)
from aix360ts.forecasting.forecasters import Naive
from aix360ts.forecasting.explainers import TimeSHAP
forecaster = Naive()
explainer = TimeSHAP(forecaster=forecaster)





SeasonalNaive #
The forecast is the value of the last observation from the same season of the year. For example, with monthly data, the forecast for all future February values is equal to the last observed February value.
from aix360ts.forecasting.forecasters import SeasonalNaive
from aix360ts.forecasting.explainers import TimeSHAP
forecaster = SeasonalNaive(m=52)
explainer = TimeSHAP(forecaster=forecaster)





MovingAverage #
A moving average forecast of order \(k\) , or, \(MA(k)\) , is the mean of the last \(k\) observations of the time series.
from aix360ts.forecasting.forecasters import MovingAverage
from aix360ts.forecasting.explainers import TimeSHAP
forecaster = MovingAverage(k=6)
explainer = TimeSHAP(forecaster=forecaster)





Simple Exponential Smoothing #
The forecast is the exponentially weighted average of its past values. The forecast can also be interpreted as a weighted average between the most recent observation and the previous forecast.
from aix360ts.forecasting.forecasters import SES
from aix360ts.forecasting.explainers import TimeSHAP
forecaster = SES(alpha=0.5)
explainer = TimeSHAP(forecaster=forecaster)





Prophet #
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects.
from aix360ts.forecasting.forecasters import Prophet
from aix360ts.forecasting.explainers import TimeSHAP
forecaster = Prophet()
explainer = TimeSHAP(forecaster=forecaster)





XGBoost #
Forecasting to Regression reduction using XGBoost.
from aix360ts.forecasting.forecasters import RegressorReduction
from aix360ts.forecasting.explainers import TimeSHAP
forecaster = RegressorReduction()
explainer = TimeSHAP(forecaster=forecaster)





References #
- Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, Christoph Molnar ↩︎ 
- Explanation in Artificial Intelligence: Insights from the Social Sciences, Tim Miller ↩︎ 
- Towards A Rigorous Science of Interpretable Machine Learning, Finale Doshi-Velez, Been Kim ↩︎ 
- Bontempi G., Ben Taieb S., Le Borgne YA. (2013) Machine Learning Strategies for Time Series Forecasting. In: Aufaure MA., Zimányi E. (eds) Business Intelligence. eBISS 2012. Lecture Notes in Business Information Processing, vol 138. Springer, Berlin, Heidelberg. ↩︎ 
- Lundberg, S.M., Erion, G., Chen, H. et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell 2, 56–67 (2020). ↩︎