COMMERCIAL DATA MINING

There is always a lag between the time when new algorithms first appear in academic journals and excite discussion at conferences and the time when commercial software incorporating those algorithms becomes available. There is another lag between the initial availability of the first products and the time that they achieve wide acceptance. For data mining, the period of widespread availability and acceptance has arrived.

Many of the techniques discussed in this book started out in the fields of statistics, artificial intelligence, or machine learning. After a few years in universities and government labs, a new technique starts to be used by a few early adopters in the commercial sector. At this point in the evolution of a new technique, the software is typically available in source code to the intrepid user willing to retrieve it via FTP, compile it, and figure out how to use it by reading the author's Ph.D. thesis. Only after a few pioneers become successful with a new technique does it start to appear in real products that come with user's manuals and help lines. New techniques are still being developed, but much work is also devoted to extending and improving existing ones. All the techniques discussed here are available in commercial software products, although no single product incorporates all of them.

Partial years might introduce inadvertent trends, so include only entire years when fitting a best-fit line. The trend in this figure shows an increase in stops. This may be nothing to worry about, especially since the number of customers is also increasing over the same period. A better measure would therefore be the stop rate, rather than the raw number of stops.

[Figure: time series of stops by day, May through June of the following year, showing overall stops, price complaint stops, and a best-fit line with an increasing trend in overall stops.]

Standardized Values

A time series chart provides useful information. However, it does not show whether the changes over time are expected or unexpected. For this, we need some tools from statistics. One way of looking at a time series is as a partition of all the data, with a little bit on each day. The statistician now wants to ask a skeptical question: “Is it possible that the differences seen on each day are strictly due to chance?” The null hypothesis is that they are. It is tested by calculating the p-value, the probability that variation at least as large as that observed would arise by chance alone.
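
The points above about best-fit lines, stop rates, and standardized values can be illustrated with a minimal sketch. It is not taken from the book: the monthly figures are invented, the function names best_fit_slope and standardize are hypothetical, and "standardized value" is assumed to mean the usual z-score (distance from the mean, measured in standard deviations). The sketch fits a least-squares line to the raw number of stops and to the stop rate, then standardizes the raw counts.

```python
from statistics import mean, pstdev


def best_fit_slope(ys):
    """Least-squares slope of ys against their positions 0, 1, 2, ..."""
    xs = list(range(len(ys)))
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def standardize(values):
    """Assumed definition: z-score, (value - mean) / population std dev."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Hypothetical monthly figures, invented for illustration only.
customers = [1000, 1100, 1200, 1300, 1400, 1500]
stops = [50, 55, 60, 65, 70, 75]

# Stop rate: stops as a fraction of the customer base in each period.
stop_rate = [s / c for s, c in zip(stops, customers)]

raw_slope = best_fit_slope(stops)       # positive: raw stops are rising
rate_slope = best_fit_slope(stop_rate)  # essentially zero: the rate is flat
z_scores = standardize(stops)           # centered on zero by construction

print(raw_slope, rate_slope, z_scores)
```

In this made-up data the raw count of stops trends upward only because the customer base grows at the same pace, so the stop rate shows no trend. Real data would rarely be this clean, and deciding whether a nonzero slope or a large standardized value is surprising still requires a significance test of the kind the text describes.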