During the data exploration phase of a data mining project, decision trees are a useful tool for picking the variables that are likely to be important for predicting particular targets. One of our newspaper clients, The Boston Globe, was interested in estimating a town's expected home delivery circulation level based on various demographic and geographic characteristics. Armed with such estimates, they would, among other things, be able to spot towns with untapped potential, where the actual circulation was lower than the expected circulation.
The final model would be a regression equation based on a handful of variables. But which variables? And what exactly would the regression attempt to estimate? Before building the regression model, we used decision trees to help explore these questions.
Although the newspaper was ultimately interested in predicting the actual number of subscribing households in a given city or town, that number does not make a good target for a regression model because towns and cities vary so much in size. It is not useful to waste modeling power on discovering that there are more subscribers in large towns than in small ones.
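To make the point about town size concrete, here is a minimal Python sketch. It assumes the target is normalized as the share of households that subscribe, which is one plausible reading of "home delivery penetration"; the town names, household counts, and the `penetration` helper are all invented for illustration and are not the newspaper's data.

```python
# Hypothetical towns: (name, total households, subscribing households).
# All figures are invented for illustration only.
towns = [
    ("Town A", 20000, 3000),
    ("Town B", 2000, 500),
    ("Town C", 8000, 800),
]


def penetration(subscribers, households):
    """Share of households that subscribe (an assumed, size-independent target)."""
    return subscribers / households


# Ranking by raw subscriber count mostly rediscovers which towns are large.
by_count = sorted(towns, key=lambda t: t[2], reverse=True)

# Ranking by penetration compares towns on an equal footing.
by_rate = sorted(towns, key=lambda t: penetration(t[2], t[1]), reverse=True)

count_ranking = [name for name, _, _ in by_count]
rate_ranking = [name for name, _, _ in by_rate]

print("By subscriber count:", count_ranking)
print("By penetration:     ", rate_ranking)
```

In this toy data the smallest town has the fewest subscribers but the highest penetration, which is exactly the kind of signal a raw count would hide.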
The tree shows that median home value is the best first split. Towns where the median home value (in a region with some of the most expensive housing in the country) is less than $226,000 are poor prospects for this paper. The split at the next level is more surprising. The variable chosen for the split is one of a family of derived variables comparing the subscriber base in the town to the town population as a whole. Towns where the subscribers are similar to the general population are better, in terms of home delivery penetration, than towns where the subscribers are farther from the mean.
Other variables that were important for distinguishing good from bad towns included the mean years of school completed, the percentage of the population in blue-collar occupations, and the percentage of the population in high-status occupations. All of these ended up as inputs to the regression model. Some other variables that we had expected to be important, such as distance from Boston and household income, turned out to be less powerful. Once the decision tree has thrown a spotlight on a variable, by either including it or failing to use it, the reason often becomes clear with a little thought.
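The idea of letting a tree pick candidate regression inputs can be sketched in plain Python. The example below is only an illustration under stated assumptions: it scores each variable by the best single binary split it offers, using reduction in the sum of squared errors, which is one common criterion for regression trees and not necessarily the one used in the study. The variable names, the data values, and the `best_split` helper are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, sse_reduction) for the best binary split of ys on xs."""
    pairs = sorted(zip(xs, ys))
    total = sse([y for _, y in pairs])
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot place a threshold between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        reduction = total - sse(left) - sse(right)
        if reduction > best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, reduction)
    return best


# Invented town-level data; home values are in thousands of dollars.
candidates = {
    "median_home_value": [150, 180, 210, 240, 300, 350],
    "distance_miles": [5, 30, 12, 25, 8, 20],
}
penetration = [0.05, 0.06, 0.07, 0.20, 0.22, 0.25]

# Rank variables by how much their best first split reduces the error.
ranking = sorted(
    ((name, *best_split(values, penetration)) for name, values in candidates.items()),
    key=lambda row: row[2],
    reverse=True,
)

for name, threshold, reduction in ranking:
    print(f"{name}: split at {threshold}, SSE reduction {reduction:.4f}")
```

Variables that rank near the top are candidates for the regression, while a variable that was expected to matter but ranks low, as the hypothetical distance variable does here, is a prompt to ask why.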