The technical term for finding patterns that fail to generalize is overfitting. Overfitting leads to unstable models that work one day but not the next. Building stable models is the primary goal of the data mining methodology.

The Model Set May Not Reflect the Relevant Population

The model set is the collection of historical data that is used to develop data mining models. For inferences drawn from the model set to be valid, the model set must reflect the population that the model is meant to describe, classify, or score. A sample that does not properly reflect its parent population is biased. Using a biased sample as a model set is a recipe for learning things that are not true. It is also hard to avoid.
- Customers are not like prospects.
- Survey responders are not like nonresponders.
- People who read email are not like people who do not read email.
- People who register on a Web site are not like people who fail to register.
- After an acquisition, customers from the acquired company are not necessarily like customers from the acquirer.
- Records with no missing values reflect a different population from records with missing values.

Customers are not like prospects because they represent people who responded positively to whatever messages, offers, and promotions were made to attract customers in the past. A study of current customers is likely to suggest more of the same. If past campaigns have gone after wealthy, urban consumers, then any comparison of current customers with the general population will likely show that customers tend to be wealthy and urban. Such a model may miss opportunities in middle-income suburbs.
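
One simple way to spot this kind of bias is to compare how a segment is distributed in the model set with how it is distributed in the population the model is meant to score. The sketch below is only an illustration: the function names (segment_shares, share_gaps), the "area" field, and the record counts are all hypothetical and made up for the example, not taken from any real campaign.

```python
from collections import Counter


def segment_shares(records, key):
    """Return the fraction of records that fall in each value of `key`."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {segment: n / total for segment, n in counts.items()}


def share_gaps(model_set, population, key):
    """Return model-set share minus population share for each segment.

    A gap far from zero suggests the model set over- or under-represents
    that segment relative to the population.
    """
    model = segment_shares(model_set, key)
    pop = segment_shares(population, key)
    segments = sorted(set(model) | set(pop))
    return {s: model.get(s, 0.0) - pop.get(s, 0.0) for s in segments}


# Hypothetical, made-up data: current customers skew urban because
# past campaigns targeted urban consumers.
customers = [{"area": "urban"}] * 8 + [{"area": "suburban"}] * 2
general_population = [{"area": "urban"}] * 4 + [{"area": "suburban"}] * 6

gaps = share_gaps(customers, general_population, "area")
for segment, gap in gaps.items():
    print(f"{segment}: {gap:+.2f}")
```

In this made-up data, urban consumers are over-represented among customers and suburban consumers are under-represented. That gap reflects where past campaigns were aimed, and it says nothing about whether suburban prospects would respond.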