DATA IS CENSORED AND TRUNCATED

The data used for data mining is often incomplete in one of two special ways. Censored values are incomplete because whatever is being measured has not yet finished. One example is customer tenure. For active customers, we know that the final tenure is at least the current tenure; however, we do not know which customers are going to stop tomorrow and which are going to stop 10 years from now. The actual tenure is greater than the observed value and cannot be known until the customer actually stops at some unknown point in the future.

The accompanying figure shows another situation with the same result. Its curve shows sales and inventory for a retailer for one product. Sales are always less than or equal to the inventory. On the days marked with Xs, though, the inventory sold out. What were the potential sales on these days? The potential sales are greater than or equal to the observed sales, which makes this another example of censored data.

Truncated data poses another problem in terms of biasing samples. Truncated data is not included in databases, often because it is too old.
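The two censoring examples above (tenure of active customers and sales on sold-out days) can be made explicit by storing a flag next to each observed value. The following is a minimal Python sketch, not a prescribed method; the `Customer` class and the function names `observed_tenure` and `censored_sales` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class Customer:
    start: date
    stop: Optional[date]  # None means the customer is still active


def observed_tenure(c: Customer, as_of: date) -> Tuple[int, bool]:
    """Return (tenure in days, censored flag) as seen on the as_of date."""
    if c.stop is None or c.stop > as_of:
        # Still active: the true tenure is at least this long.
        return (as_of - c.start).days, True
    # Already stopped: the tenure is fully observed.
    return (c.stop - c.start).days, False


def censored_sales(sales: List[int], inventory: List[int]) -> List[Tuple[int, bool]]:
    """Pair each day's sales with a flag that is True when the stock sold out."""
    # When sales reach the inventory level, potential sales may have been higher.
    return [(s, s >= inv) for s, inv in zip(sales, inventory)]
```

Keeping the flag alongside the value lets later analysis treat a censored observation as a lower bound rather than as the true figure.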

As an example of truncation, consider what happens when Company A purchases Company B and their systems are merged. Often, only the active customers from Company B are moved into the data warehouse for Company A. That is, all customers active on a given date are moved over. Customers who had stopped the day before are not moved over. This is an example of left truncation, and it pops up throughout corporate databases, usually with no warning (unless the documentation is very good about saying what is not in the warehouse as well as what is). This can cause confusion when looking at when customers started and discovering that all customers who started 5 years before the merger were mysteriously active for at least 5 years. This is not due to a miraculous acquisition program. It is because all the ones who stopped earlier were excluded.

LESSONS LEARNED

This chapter talks about some basic statistical methods that are useful for analyzing data. When looking at data, it is useful to look at histograms and cumulative histograms to see what values are most common. More important, though, is looking at values over time. One of the big questions addressed by statistics is whether observed values are expected or not. For this, the number of standard deviations from the mean (z-score) can be used to calculate the probability of the value being due to chance (the p-value).
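The step from z-score to p-value can be sketched in a few lines of Python. This is an illustrative sketch that assumes the values follow a normal distribution and that a two-sided p-value is wanted; the function names `z_score` and `two_sided_p_value` are hypothetical.

```python
import math


def z_score(value: float, mean: float, std_dev: float) -> float:
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / std_dev


def two_sided_p_value(z: float) -> float:
    """Probability of landing at least |z| standard deviations from the mean.

    Assumes a standard normal distribution: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    """
    return math.erfc(abs(z) / math.sqrt(2))
```

For example, `two_sided_p_value(z_score(130, 100, 15))` gives the chance of seeing a value at least two standard deviations from the mean in either direction, roughly 4.6 percent under the normality assumption.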