In order to compete better with the suburban papers, the Globe introduced geographically distinct versions of the paper with specialized editorial content for each of 12 geographically defined zones. Two days a week, readers are treated to a few pages of local coverage for their area. The editorial zones were drawn up using data available to the Globe, common sense, and a map, but no formal statistical analysis. There were some constraints on the composition of the editorial zones: The zones had to be geographically contiguous so that the trucks carrying the localized editions from the central printing plant in Boston could take sensible routes. The zones had to be reasonably compact and contain sufficient population to justify specialized editorial content. The editorial zones had to be closely aligned with the geographic zones used to sell advertising. Within these constraints, the Globe wished to design editorial zones that would group similar towns together. Sounds sensible, but which towns are similar? That is the question that The Boston Globe brought to us at Data Miners.
Before deciding which towns belonged together, there needed to be a way of describing the towns: a town signature with a column for every feature that might be useful for characterizing a town and comparing it with its neighbors. As it happened, Data Miners had already defined town signatures in an earlier project to find towns with good prospects for future circulation growth. Those signatures, which had been developed for a regression model to predict Globe home delivery penetration, turned out to be equally useful for undirected clustering. This is a fairly common occurrence; once a useful set of descriptive attributes has been collected, it can be used for all sorts of things. In another example, a long-distance company developed customer signatures based on call detail data in order to predict fraud and later found that the same variables were useful for distinguishing between business and residential users. Although the time and effort it takes to create a good customer signature can seem daunting, the effort is repaid over time because the same attributes often turn out to be predictive for many different target variables. The oft-quoted rule of thumb that 80 percent of the time spent on a data mining project goes into data preparation becomes less true when the data preparation effort can be amortized over several predictive modeling efforts.
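To make the idea of a reusable signature concrete, the following is a minimal sketch in Python, not the procedure actually used for the Globe. It assumes a signature is simply one row per town with one column per feature. The town names, feature names, and values are all invented for illustration, as are the helper functions `split_for_regression`, `standardize`, `distance`, and `most_similar`. The same table is used twice: once split into inputs and a target, as a directed model would need, and once whole, to compare towns with one another.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical signatures: town names, feature names, and values are invented.
SIGNATURES = {
    "Town A": {"median_income": 52000.0, "pct_college": 0.31, "penetration": 0.18},
    "Town B": {"median_income": 55000.0, "pct_college": 0.35, "penetration": 0.22},
    "Town C": {"median_income": 98000.0, "pct_college": 0.62, "penetration": 0.41},
}


def split_for_regression(signatures, target):
    """Directed use: set one column aside as the target, keep the rest as inputs."""
    inputs = {
        town: {f: v for f, v in row.items() if f != target}
        for town, row in signatures.items()
    }
    targets = {town: row[target] for town, row in signatures.items()}
    return inputs, targets


def standardize(signatures):
    """Rescale each column to z-scores so no single feature dominates distances."""
    features = list(next(iter(signatures.values())))
    scaled = {town: {} for town in signatures}
    for f in features:
        column = [row[f] for row in signatures.values()]
        mu, sigma = mean(column), pstdev(column)
        for town, row in signatures.items():
            # A constant column carries no information; map it to zero.
            scaled[town][f] = (row[f] - mu) / sigma if sigma else 0.0
    return scaled


def distance(a, b):
    """Euclidean distance between two standardized signatures."""
    return sqrt(sum((a[f] - b[f]) ** 2 for f in a))


def most_similar(town, scaled):
    """Undirected use: find the town whose signature is closest to the given one."""
    others = (t for t in scaled if t != town)
    return min(others, key=lambda t: distance(scaled[town], scaled[t]))


# The same signature table serves both purposes.
inputs, targets = split_for_regression(SIGNATURES, "penetration")
scaled = standardize(SIGNATURES)
print(most_similar("Town A", scaled))
```

Standardizing the columns before measuring distance is a design choice of this sketch: without it, a feature measured in dollars would swamp one measured as a fraction. Nothing in the data preparation had to change between the two uses, which is the sense in which the effort is amortized.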