CONVERTING COUNTS TO PROPORTIONS

Many datasets contain counts or dollar values that are not particularly interesting in themselves because they vary according to some other value. Larger households spend more money on groceries than smaller households. They spend more money on produce, more money on meat, more money on packaged goods, more money on cleaning products, more money on everything. So comparing the dollar amount spent by different households in any one category, such as bakery, will only reveal that large households spend more. It is much more interesting to compare the proportion of each household's spending that goes to each category.
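
As a minimal sketch of this conversion, the Python below divides each household's spending in a category by that household's total. The two households, the category names, and the dollar amounts are invented for illustration, and `to_proportions` is a hypothetical helper rather than anything defined in the text.

```python
# Hypothetical household grocery spending, in dollars, by category.
# All names and amounts are made up for illustration.
households = {
    "small": {"produce": 40.0, "meat": 30.0, "bakery": 10.0, "cleaning": 20.0},
    "large": {"produce": 120.0, "meat": 150.0, "bakery": 15.0, "cleaning": 15.0},
}


def to_proportions(spending):
    """Divide each category amount by the household's total spending."""
    total = sum(spending.values())
    if total == 0:
        # A household with no spending has no meaningful proportions.
        return {category: 0.0 for category in spending}
    return {category: amount / total for category, amount in spending.items()}


proportions = {name: to_proportions(spend) for name, spend in households.items()}

for name, shares in proportions.items():
    print(name, {category: round(share, 2) for category, share in shares.items()})
```

In this invented data the large household spends more dollars on bakery (15 versus 10), yet bakery is a smaller share of its spending (5 percent versus 10 percent). The raw dollar comparison hides that difference.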

The value of converting counts to proportions can be seen by comparing two charts based on the NY State towns dataset. The first compares the count of houses with bad plumbing to the prevalence of heating with wood. A relationship is visible, but it is not strong. In the second, where the count of houses with bad plumbing has been converted into the proportion of houses with bad plumbing, the relationship is much stronger. Towns where many houses have bad plumbing also have many houses heated by wood. Does this mean that wood smoke destroys plumbing? It is important to remember that the patterns we find demonstrate correlation, not causation.

The details of the model-building step vary from technique to technique and are described in the chapters devoted to each data mining method. In general terms, this is the step where most of the work of creating a model occurs. In directed data mining, the training set is used to generate an explanation of the dependent or target variable in terms of the independent or input variables. This explanation may take the form of a neural network, a decision tree, a linkage graph, or some other representation of the relationship between the target and the other fields in the database.

In undirected data mining, there is no target variable. The model finds relationships between records and expresses them as association rules or by assigning them to common clusters.
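
To make the directed case concrete, here is a minimal sketch of the simplest possible decision tree: a single split on one input variable, chosen to minimize errors on the training set. The function `fit_stump`, the input values, and the target labels are all hypothetical and invented for illustration. This is not the tree-building method described in the later chapters.

```python
def fit_stump(inputs, target):
    """Find the one threshold on a numeric input that best predicts a yes/no target.

    inputs : list of numbers, one input value per training record
    target : list of booleans, the known outcome for each training record
    Returns (threshold, predict_true_above, errors), or None when the input
    has fewer than two distinct values and no split is possible.
    """
    best = None
    values = sorted(set(inputs))
    # Candidate thresholds sit halfway between adjacent distinct input values.
    candidates = [(low + high) / 2 for low, high in zip(values, values[1:])]
    for threshold in candidates:
        # Try both orientations: predict True above the threshold, or below it.
        for predict_true_above in (True, False):
            errors = sum(
                ((x > threshold) == predict_true_above) != y
                for x, y in zip(inputs, target)
            )
            if best is None or errors < best[2]:
                best = (threshold, predict_true_above, errors)
    return best


# Made-up training set: an input proportion and a boolean target per record.
input_values = [0.01, 0.02, 0.03, 0.08, 0.10, 0.12]
target_values = [False, False, False, True, True, True]

threshold, predict_true_above, errors = fit_stump(input_values, target_values)
print(threshold, predict_true_above, errors)
```

The training set supplies both the input and the known target, which is what makes this directed. An undirected method would receive only the inputs and would have to group or relate the records without any target to explain.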