SAMPLING THEORY WITH STATISTICS
 

 

 

 

 

 

 

 

 

 

 

 

 

Sampling theory is an important part of statistics. This area explains how results on a subset of data (a sample) relate to the whole. This is very important when planning to do a poll, because it is not possible to ask everyone a ques­ tion; rather, pollsters ask a very small sample and derive overall opinion. However, this is much less important when all the data is available. Usually, it is best to use all the data available, rather than a small subset of it. There are a few cases when this is not necessarily true. There might simply be too much data. Instead of building models on tens of millions of customers; build models on hundreds of thousands—at least to learn how to build better models. Another reason is to get an unrepresentative sample. Such a sample, for instance, might have an equal number of churners and nonchurners, although the original data had different proportions. However, it is generally better to use more data rather than sample down and use less, unless there is a good reason for sampling down.

 

 

 

 

 

 

.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Time Dependency Pops Up Everywhere Almost all data used in data mining has a time dependency associated with it. Customers’ reactions to marketing efforts change over time. Prospects’ reac­ tions to competitive offers change over time. Comparing results from a mar­ keting campaign one year to the previous year is rarely going to yield exactly the same result. We do not expect the same results. On the other hand, we do expect scientific experiments to yield similar results regardless of when the experiment takes place. The laws of science are consid­ ered immutable; they do not change over time. By contrast, the business climate changes daily. Statistics often considers repeated observations to be indepen­ dent observations. That is, one observation does not resemble another. Data mining, on the other hand, must often consider the time component of the data. Experimentation is Hard Data mining has to work within the constraints of existing business practices. This can make it difficult to set up experiments, for several reasons