By Rick Durham
· Building data mining models is one thing
· Determining the primary factors that contribute to the final predicted data point is quite another
One of the most useful algorithms in the SSAS suite of data mining tools is the Time Series algorithm. Its forecasting method is really a combination of two other algorithms (ARIMA and ARTXP): it takes the historic data values in a series and assigns weights to them to make future predictions regarding that series. One of the most important features of this method is that it allows data from other series in the model to be incorporated into the final prediction values.
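To make the idea of "assigning weights to past data values" concrete, here is a minimal sketch of a one-step-ahead forecast that combines weighted past values of the target series with weighted past values of a second series. This is not the SSAS implementation; the function name, the weights, and the sample numbers are all hypothetical and chosen only for illustration. A real model would estimate the weights from the data.

```python
def forecast_next(target, other, target_weights, other_weights, intercept=0.0):
    """One-step-ahead forecast from weighted past values of two series.

    target_weights[0] applies to the most recent target value,
    target_weights[1] to the value before it, and so on.
    other_weights works the same way for the second series.
    """
    if len(target) < len(target_weights) or len(other) < len(other_weights):
        raise ValueError("not enough history for the requested lags")
    prediction = intercept
    # Contribution of the target series' own history.
    for lag, weight in enumerate(target_weights, start=1):
        prediction += weight * target[-lag]
    # Contribution of the related (cross) series.
    for lag, weight in enumerate(other_weights, start=1):
        prediction += weight * other[-lag]
    return prediction


# Made-up values and weights, purely illustrative (not real prices).
gas = [3.0, 3.2, 3.5]
oil = [100.0, 110.0, 120.0]
next_gas = forecast_next(
    gas, oil,
    target_weights=[0.6, 0.2],  # hypothetical weights on the last two gas values
    other_weights=[0.01],       # hypothetical weight on the last oil value
    intercept=0.1,
)
```

The point of the sketch is only that the second series (oil) enters the prediction for the first (gas) through its own set of weights, which is the cross-series behavior described above.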
For example, in the model below, we can see a direct historic relationship between the price of gas and the price of oil. When the Time Series algorithm was used to build the model, it looked at all of the past data points for both gas and oil to make the final prediction for the price of gas.
We can infer from this model, based on historic data points, that the price of oil affects the price of gas (no big surprise here). Please note that the oil price is normalized against the February 2008 oil price and shown as a percentage.
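The normalization mentioned above can be sketched as follows: every value is divided by the value in the base period and multiplied by 100, so the base period reads 100%. The function name and the prices are placeholders I have made up for illustration, not the data behind the chart.

```python
def normalize_to_base(prices, base_period):
    """Express each price as a percentage of the price in base_period."""
    base = prices[base_period]
    if base == 0:
        raise ValueError("base price must be non-zero")
    return {period: 100.0 * value / base for period, value in prices.items()}


# Placeholder values only; these are not actual oil prices.
oil_prices = {"2008-02": 80.0, "2008-03": 88.0, "2008-07": 120.0}
oil_index = normalize_to_base(oil_prices, "2008-02")
# oil_index["2008-02"] is 100.0 by construction; the other periods are
# read as a percentage of the February 2008 value.
```

Putting both series on a percentage scale like this makes it easier to compare their movements on one chart even though their raw units differ.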
But what affected oil prices? What caused the huge spike in oil prices in July 2008 in our time series forecast? My experience has been that when we create these types of models, we are always going to be asked what factors caused or drove certain data points in the series to extreme values.
In this case, we might casually answer, “It was a lack of oil supply combined with high demand,” but this is not true, as the following set of historic data charts shows. The chart below shows that in July 2008 production of oil was at an all-time high, as suppliers were willing to increase production when the price point was high. Again, no surprise; it’s simple Econ 101.
We would think that demand for oil during this period would also be high, driving up the price point, but that is not the case. In fact, demand for oil during this period was very low, as this historic data chart reveals. It started dropping in 2007 and hit a low in the summer of 2008, when the prices of oil and gas were both at all-time highs.
So how can we explain what was driving up the price of oil, and thus the price of gas, in July 2008? It turns out that one of the primary factors driving up the price of oil was the value of the dollar against other currencies. In effect, because the value of the dollar was low, oil suppliers wanted more dollars for the same units of oil, thus driving up the price.
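One rough way to check whether a candidate driver, such as the value of the dollar, moves with the series we are forecasting is to compute the correlation between the two. The sketch below uses a plain Pearson correlation; the function name and both series are made up for illustration and are not the actual dollar or oil figures. A strong correlation only suggests a relationship worth investigating; it does not by itself prove that one series drives the other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up index values: the dollar falls while oil rises.
dollar_index = [100.0, 98.0, 95.0, 92.0, 90.0]
oil_index = [100.0, 105.0, 112.0, 120.0, 125.0]
r = pearson(dollar_index, oil_index)
# r is close to -1 here: the two made-up series move in opposite directions.
```

A value near -1 means the series move in opposite directions, which is the pattern described above: a weaker dollar alongside a higher oil price.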
Many types of data that are typically used in time series analysis (think commodities, stock prices, long-term weather forecasts…) are driven by complex factors that are often changing and may be non-stationary. The factors that drive a forecast today are not necessarily the same as those that will drive it tomorrow. This is what makes it tricky to understand which series need to be included in the mining models and what indirect factors drive them.
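A common general technique for dealing with a non-stationary series is differencing, which is the "integrated" part of ARIMA: instead of modeling the raw values, we model the change from one period to the next. The sketch below is a generic illustration with a hypothetical helper name and made-up numbers; it says nothing about how SSAS applies differencing internally.

```python
def difference(series, order=1):
    """Return the series differenced `order` times (each pass shortens it by one)."""
    result = list(series)
    for _ in range(order):
        # Replace each value with its change from the previous value.
        result = [later - earlier for earlier, later in zip(result, result[1:])]
    return result


# Made-up series with an accelerating upward trend.
trend = [10, 12, 15, 19, 24]
first_diff = difference(trend)            # period-to-period changes
second_diff = difference(trend, order=2)  # changes of the changes
```

In this made-up series the raw values and the first differences both keep growing, while the second differences are constant, which is the kind of stable pattern a forecasting model can work with.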
If we take our oil example, many factors have driven the price around historically. These include supply, demand, war, strikes, geopolitical tension (or the lack of it), weather… In some cases, the input factors are so complex and varied that the only way to predict future values is to use the historic values of the series itself, as there is no way to determine all of the factors that move the data. This is certainly true of the stock market as well.
In this blog post, I have taken a simple example of time series analysis based on historic oil and gas prices to show how, once we have developed our mining model, we can potentially dive deeper to understand what factors are driving the forecasts and ultimately provide better insight to the business. This is the essence of what BI should be about, and it is often overlooked by technicians who are preoccupied with the complexity of the tools they are using rather than with how to use the results they generate to impact the organization.