ERRORS IN BASIC DATA

Statistics originally derived from measuring scientific quantities, such as the width of a skull or the brightness of a star. These measurements are quantitative, and the precise measured value depends on factors such as the type of measuring device and the ambient temperature. In particular, two people taking the same measurement at the same time are going to produce slightly different results. The results might differ by 5 percent or 0.05 percent, but there is a difference. Traditionally, statistics looks at observed values as falling into a confidence interval.

On the other hand, the amount of money a customer paid last January is quite well understood, down to the last penny. The definition of customer may be a little bit fuzzy; the definition of January may be fuzzy (consider 5-4-4 accounting cycles). However, the amount of the payment is precise. There is no measurement error.
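
The contrast between the two kinds of values can be made concrete with a minimal sketch. The measurements and payments below are invented for illustration, as are the function names; the sketch assumes repeated measurements are summarized by their mean and standard error, and that payments are stored as exact decimal amounts rather than binary floating-point numbers.

```python
from decimal import Decimal
from math import sqrt
from statistics import mean, stdev

# Hypothetical repeated measurements of one skull width, in millimeters.
# Each reading differs slightly because of the observer and the instrument.
measurements_mm = [141.8, 142.3, 142.0, 141.6, 142.4]


def mean_with_standard_error(values):
    """Return (mean, standard error of the mean) for repeated measurements."""
    n = len(values)
    return mean(values), stdev(values) / sqrt(n)


# Hypothetical payments as a billing system might record them: exact to the penny.
payments = [Decimal("19.99"), Decimal("5.00"), Decimal("102.35")]


def total_paid(amounts):
    """Sum payments exactly; Decimal avoids binary floating-point rounding."""
    return sum(amounts, Decimal("0.00"))


center, std_err = mean_with_standard_error(measurements_mm)
print(f"measured width: {center:.2f} mm +/- {std_err:.2f} mm (one standard error)")
print(f"total paid: {total_paid(payments)}")
```

The measured width is only meaningful together with its uncertainty, whereas the payment total is a single exact figure with nothing to estimate.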

There are still sources of error in business data. Of particular concern is operational error, which can cause systematic bias in what is being collected. For instance, clock skew may mean that two events that seem to happen in one sequence actually happened in another. A database record may carry a Tuesday update date when it was really updated on Monday, because the updating process runs just after midnight. Such forms of bias are systematic, and they can create spurious patterns that data mining algorithms might pick up.

One major difference between business data and scientific data is that scientific data has many continuous values, whereas business data has many discrete values. Even monetary amounts are discrete: two values can differ only by multiples of pennies (or some similar amount), even though the values might be represented by real numbers.

There Is a Lot of Data

Traditionally, statistics has been applied to smallish data sets (at most a few thousand rows) with few columns (fewer than a dozen). The goal has been to squeeze as much information as possible out of the data. This is still important in problems where collecting data is expensive or arduous, such as market research, crash testing cars, or tests of the chemical composition of Martian soil.

Business data, on the other hand, is very voluminous. The challenge is understanding anything about what is happening, rather than every possible thing. Fortunately, there is also enough computing power available to handle the large volumes of data.
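
Returning to the update-date example earlier in this section, one way to correct that kind of operational bias is to shift timestamps that fall inside the nightly batch window back to the previous day. The sketch below is illustrative only: the two-hour window, the function name, and the use of naive (timezone-free) timestamps are all assumptions, and the right cutoff depends on how a particular updating process actually runs.

```python
from datetime import date, datetime, timedelta

# Assumption for illustration: the nightly batch finishes within this window
# after midnight, so stamps inside it really describe the previous day.
BATCH_WINDOW = timedelta(hours=2)


def effective_update_date(stamped: datetime,
                          window: timedelta = BATCH_WINDOW) -> date:
    """Attribute a naive update timestamp to the day the change really occurred."""
    midnight = datetime.combine(stamped.date(), datetime.min.time())
    if stamped - midnight < window:
        # Stamped just after midnight by the batch job: count it as yesterday.
        return stamped.date() - timedelta(days=1)
    return stamped.date()


# A record stamped early on Tuesday, 6 January 2004 is attributed to Monday.
batch_stamp = datetime(2004, 1, 6, 0, 15)
print(effective_update_date(batch_stamp).strftime("%A %Y-%m-%d"))
```

Applying a correction like this before analysis keeps day-of-week patterns from reflecting the batch schedule instead of customer behavior.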