DATA CLASSIFICATION IN DATA MINING
 
Alert readers may notice that some of the splits in the decision tree appear to make no difference. For example, nodes 17 and 18 are differentiated by the number of orders they have made that included items in the food category, but both nodes are labeled as responders. That is because although the probability of response is higher in node 18 than in node 17, in both cases it is above the threshold that has been set for classifying a record as a responder. As a classifier, the model has only two outputs, one and zero. This binary classification throws away useful information, which brings us to the next topic, using decision trees to produce scores and probabilities.
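
The effect can be illustrated with a minimal sketch. The leaf counts and the 0.5 threshold below are hypothetical values chosen for illustration, not figures from the tree discussed here; the point is only that two leaves with different response proportions receive the same label once a threshold is applied.

```python
# Hypothetical leaf statistics: the counts are illustrative assumptions,
# not values taken from the tree described in the text.
leaves = {
    "node 17": {"responders": 60, "total": 100},
    "node 18": {"responders": 85, "total": 100},
}

THRESHOLD = 0.5  # assumed cutoff for labeling a record as a responder


def leaf_score(leaf):
    """Proportion of records in the leaf that belong to the responder class."""
    return leaf["responders"] / leaf["total"]


def classify(leaf, threshold=THRESHOLD):
    """Binary label: 1 (responder) if the leaf proportion exceeds the threshold."""
    return 1 if leaf_score(leaf) > threshold else 0


labels = {name: classify(leaf) for name, leaf in leaves.items()}
scores = {name: leaf_score(leaf) for name, leaf in leaves.items()}

for name in leaves:
    print(name, "label:", labels[name], "score:", scores[name])
```

Both leaves get the label 1, so the classifier cannot tell them apart, while the scores still preserve the ordering between them.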

Clearly, a record in node 18 is more likely to represent a responder than a record in node 17. The proportion of records in the desired class can be used as a score, which is often more useful than just the classification. For a binary outcome, a classification merely splits records into two groups. A score allows the records to be sorted from most likely to least likely to be members of the desired class.

A decision tree annotated with the proportion of records in class 1 at each node shows the probability of the classification.

For many applications, a score capable of rank-ordering a list is all that is required. This is sufficient to choose the top N percent for a mailing and to calculate lift at various depths in the list. For some applications, however, it is not sufficient to know that A is more likely to respond than B; we want to know the actual likelihood of a response from A. Assuming that the prior probability of a response is known, it can be used to calculate the probability of response from the score generated on the oversampled data used to build the tree.

Alternatively, the model can be applied to preclassified data that has a distribution of responses that reflects the true population. This method, called backfitting, creates scores using the class proportions at the tree's leaves to represent the probability that a record drawn from a similar population is a member of the class.
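
The two uses of a score described above can be sketched as follows. The text does not give a formula for the prior-probability adjustment, so this sketch assumes the standard rescaling of the odds by the ratio of true to sample class rates. The function names, the 50 percent oversampling rate, the 5 percent true response rate, and the leaf scores are all hypothetical.

```python
def adjust_for_oversampling(score, sample_rate, true_rate):
    """Convert a leaf score from oversampled data into a probability
    under the true prior (assumed standard prior-shift correction).

    score       -- proportion of responders in the leaf on the oversampled data
    sample_rate -- proportion of responders in the oversampled model set
    true_rate   -- prior probability of response in the real population
    """
    pos = score * (true_rate / sample_rate)
    neg = (1.0 - score) * ((1.0 - true_rate) / (1.0 - sample_rate))
    return pos / (pos + neg)


def top_fraction(scored_records, fraction):
    """Rank (name, score) pairs from highest to lowest score and keep
    the top fraction of the list (at least one record)."""
    ranked = sorted(scored_records, key=lambda r: r[1], reverse=True)
    cutoff = max(1, round(len(ranked) * fraction))
    return ranked[:cutoff]


# Hypothetical values, chosen only for illustration.
SAMPLE_RATE = 0.50   # responders oversampled to half of the model set
TRUE_RATE = 0.05     # assumed prior probability of response
leaf_scores = [("A", 0.85), ("B", 0.60), ("C", 0.30), ("D", 0.10)]

probabilities = [
    (name, adjust_for_oversampling(s, SAMPLE_RATE, TRUE_RATE))
    for name, s in leaf_scores
]
mailing = top_fraction(probabilities, 0.25)

for name, p in probabilities:
    print(name, round(p, 4))
print("selected for mailing:", [name for name, _ in mailing])
```

Because the adjustment is monotonic, it changes the estimated probabilities but not the rank order, which is why an unadjusted score is enough when only the top N percent of the list is needed.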