The problem with distance from Boston, for instance, is that as one first drives out into the suburbs, home penetration goes up with distance from Boston. After a while, however, distance from Boston becomes negatively correlated with penetration as people far from Boston do not care as much about what goes on there. Home price is a better predictor because its distribution resembles that of the target variable, increasing in the first few miles and then declining. The decision tree provides guidance about which variables to think about as well as which variables to use.
Applying Decision-Tree Methods to Sequential Events
Predicting the future is one of the most important applications of data mining. The task of analyzing trends in historical data in order to predict future behavior recurs in every domain we have examined.
One of our clients, a major bank, looked at the detailed transaction data from its customers in order to spot earlier warning signs for attrition in its checking accounts. ATM withdrawals, payroll-direct deposits, balance inquiries, visits to the teller, and hundreds of other transaction types and customer attributes were tracked over time to find signatures that allow the bank to recognize that a customer's loyalty is beginning to weaken while there is still time to take corrective action.
Another client, a manufacturer of diesel engines, used the decision tree component of SPSS's Clementine data mining suite to forecast diesel engine sales based on historical truck registration data. The goal was to identify individual owner-operators who were likely to be ready to trade in the engines of their big rigs.
Sales, profits, failure modes, fashion trends, commodity prices, operating temperatures, interest rates, call volumes, response rates, and return rates: People are trying to predict them all. In some fields, notably economics, the analysis of time-series data is a central preoccupation of statistical analysts, so you might expect there to be a large collection of ready-made techniques available to be applied to predictive data mining on time-ordered data. Unfortunately, this is not the case.