Weighted MAPE

Most classical statistical signal processing forecasting techniques (Wiener filtering, Kalman filtering, etc.) minimize the mean squared error (MSE). The problem with the MSE is that it is very sensitive to scale. In many situations, we want a scale-free error metric. By scale-free, I mean a metric that can be used to compare the errors of two forecasts for two different series, even if one series typically has values between 0 and 0.1 and the other has values between 1000 and 10000. If I just used the MSE, I would get small MSEs for the former and big MSEs for the latter, even though the estimator on the latter could be doing better.
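
To make the scale problem concrete, here is a minimal sketch. The two series and their forecasts are made-up numbers, not real data, and `mse` is just a helper I define for the illustration. The small series is forecast terribly (every point off by 100%) and the large series is forecast well (every point off by 1%), yet the MSE ranks them the other way around.

```python
def mse(actual, forecast):
    """Mean squared error between two equal-length sequences."""
    return sum((f - a) ** 2 for a, f in zip(actual, forecast)) / len(actual)

# Hypothetical small-valued series: every forecast is off by 100%.
small_actual = [0.02, 0.05, 0.08]
small_forecast = [0.04, 0.10, 0.16]

# Hypothetical large-valued series: every forecast is off by only 1%.
large_actual = [2000.0, 5000.0, 8000.0]
large_forecast = [2020.0, 5050.0, 8080.0]

small_mse = mse(small_actual, small_forecast)
large_mse = mse(large_actual, large_forecast)

# The badly forecast small series still gets the much smaller MSE.
print(small_mse, large_mse)
```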

What error metric can we use? There are two choices that a lot of sales/inventory modelling people use:

Mean absolute percentage error (MAPE)
Weighted MAPE (wMAPE)

The MAPE is defined as:

\[\mathbf{MAPE} = \frac{1}{N} \sum_{n} \mathrm{APE}[n] = \frac{1}{N} \sum_{n} \frac{|\hat{y}[n] - y[n]|}{|y[n]|}\]
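
As a rough sketch, the definition translates almost directly into code. The function name `mape` and the sample numbers are my own choices for illustration. Note that this plain version does nothing to protect against a zero in `actual`; Python will simply raise a `ZeroDivisionError`.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, returned as a fraction (0.1 == 10%)."""
    # APE[n] = |y_hat[n] - y[n]| / |y[n]|
    apes = [abs(f - a) / abs(a) for a, f in zip(actual, forecast)]
    # Every APE gets the same weight, 1/N.
    return sum(apes) / len(apes)

# Made-up example: the APEs are 0.1, 0.1 and 0.0.
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]
mape_value = mape(actual, forecast)
print(mape_value)
```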

There are problems with the MAPE though. Most prominently, the APE can be very large when the value of the time-series is really small. And if it's exactly zero, the APE is a division by zero and the MAPE blows up to infinity. This is not “good” behavior for energy forecasting. In fact, I don’t really care that much about errors made when the consumption is low. I care more about errors made when the consumption is high. Which brings us to the weighted MAPE, which fixes some of the shortcomings of straight MAPE.

As the name implies, it is a weighted version of the mean absolute percentage error (MAPE). Instead of weighting each APE equally by 1/N, it weights each one by an index-specific weight:

\[\mathbf{wMAPE} = \sum_{n} w[n] \mathrm{APE}[n]\]

Typically, we weight each APE by the relative size of the time-series at that point compared to the entire time-series. This is expressed mathematically as:

\[w[n] = \frac{|y[n]|}{\sum_{n} |y[n]|}\]

Intuitively, this choice of weight down-weights errors that are made when the time-series is small. For example, we could have a 50% APE when the time-series is really close to zero, but we don’t really care because the time-series is so small anyway. This is especially important if the time-series takes values close to zero. Without this weighting, the APE there could blow up to something huge and would totally dominate the MAPE. We care more about the errors made when the time-series is large, and that is what this error metric emphasizes.
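
Here is a small sketch of that effect, using the weighted form directly (each APE multiplied by its weight w[n]). The series is made up: one point sits near zero and is forecast badly, while the two large points are forecast reasonably well. The variable names are mine.

```python
# Made-up series with one near-zero point that is forecast badly.
actual = [0.1, 100.0, 200.0]
forecast = [1.1, 110.0, 190.0]

# APE[n] = |y_hat[n] - y[n]| / |y[n]|; the first one is roughly 10 (1000%).
apes = [abs(f - a) / abs(a) for a, f in zip(actual, forecast)]

# Plain MAPE: equal weights 1/N, so the near-zero point dominates.
mape_value = sum(apes) / len(apes)

# wMAPE: w[n] = |y[n]| / sum of |y|, so the near-zero point barely counts.
total_abs_actual = sum(abs(a) for a in actual)
weights = [abs(a) / total_abs_actual for a in actual]
wmape_value = sum(w * ape for w, ape in zip(weights, apes))

print(mape_value, wmape_value)
```

On these numbers the MAPE comes out above 300% while the wMAPE stays below 10%, because almost all of the weight sits on the two large points.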

The expression above can be simplified further because the numerator of w[n] cancels the denominator of APE[n], giving us the expression we use to compute the wMAPE:

\[\mathbf{wMAPE} = \sum_{n} \underbrace{\frac{|y[n]|}{\sum_{n} |y[n]|}}_{w[n]} \cdot \underbrace{\frac{|\hat{y}[n] - y[n]|}{|y[n]|}}_{\mathrm{APE}[n]} = \frac{\sum_{n} |\hat{y}[n] - y[n]| }{\sum_{n} |y[n]|}\]
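
A quick sketch to check the cancellation numerically: `wmape` uses the simplified right-hand side and `wmape_weighted` uses the weighted sum on the left. Both function names and the sample numbers are mine, chosen just for illustration. A nice side effect of the simplified form is that individual zeros in the series are no longer a problem, since we only ever divide by the total.

```python
def wmape(actual, forecast):
    """Simplified form: total absolute error / total absolute actuals."""
    total_abs_error = sum(abs(f - a) for a, f in zip(actual, forecast))
    total_abs_actual = sum(abs(a) for a in actual)
    return total_abs_error / total_abs_actual

def wmape_weighted(actual, forecast):
    """Weighted form: sum of w[n] * APE[n]. Needs every actual to be non-zero."""
    total_abs_actual = sum(abs(a) for a in actual)
    return sum(
        (abs(a) / total_abs_actual) * (abs(f - a) / abs(a))
        for a, f in zip(actual, forecast)
    )

# Made-up series without zeros: both forms should agree up to rounding.
actual = [50.0, 150.0, 200.0]
forecast = [40.0, 160.0, 210.0]
simplified = wmape(actual, forecast)        # 30 / 400
weighted = wmape_weighted(actual, forecast)

# Made-up series containing a zero: only the simplified form can handle it.
actual_with_zero = [0.0, 50.0, 150.0]
forecast_with_zero = [5.0, 40.0, 160.0]
with_zero = wmape(actual_with_zero, forecast_with_zero)  # 25 / 200

print(simplified, weighted, with_zero)
```

The simplified form still fails if every value in the series is zero, since the denominator is then zero.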

This is the typical wMAPE expression that you see around the internet. I find it very confusing without the derivation. Hopefully this makes more sense to you all as well.