THE PRESENCE OF MISSING VALUES
 
One of the nicest things about decision trees is their ability to handle missing values in either numeric or categorical input fields by simply considering null to be a possible value with its own branch. This approach is preferable to throwing out records with missing values or trying to impute missing values. Throwing out records due to missing values is likely to create a biased training set because the records with missing values are not likely to be a random sample of the population. Replacing missing values with imputed values risks hiding from the model the important information carried by the fact that a value is missing. We have seen many cases where the fact that a particular value is null has predictive value. In one such case, the count of non-null values in appended household-level demographic data was positively correlated with response to an offer of term life insurance.
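
The two ideas in this paragraph, giving null its own branch and using the count of non-null fields as an input, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the toy records, and the `MISSING` label are hypothetical and do not come from the life insurance case described above.

```python
from collections import defaultdict

MISSING = "<null>"  # hypothetical label for the branch that receives nulls


def split_with_null_branch(records, field):
    """Partition records by the value of `field`; None gets its own branch."""
    branches = defaultdict(list)
    for rec in records:
        value = rec.get(field)
        key = MISSING if value is None else value
        branches[key].append(rec)
    return dict(branches)


def non_null_count(record, fields):
    """Count how many of the given fields are populated in one record."""
    return sum(1 for f in fields if record.get(f) is not None)


# Made-up household records; None marks a missing value.
households = [
    {"home_owner": "Y", "marital_status": "M", "responded": 1},
    {"home_owner": None, "marital_status": None, "responded": 0},
    {"home_owner": "N", "marital_status": None, "responded": 0},
    {"home_owner": "Y", "marital_status": "S", "responded": 1},
]

# No record is discarded: the null record lands in its own branch.
branches = split_with_null_branch(households, "home_owner")

# A derived input: how many demographic fields each household has filled in.
counts = [non_null_count(h, ["home_owner", "marital_status"]) for h in households]
```

Because no record is dropped and no value is imputed, the fact that a field is missing remains visible to whatever splitting criterion is applied afterward.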

Apparently, people who leave many traces in Acxiom’s household database (by buying houses, getting married, registering products, and subscribing to magazines) are more likely to be interested in life insurance than those whose lifestyles leave more fields null.

Decision trees can produce splits based on missing values of an input variable. The fact that a value is null can often have predictive value, so do not be hasty to filter out records with missing values or to try to replace them with imputed values.

Although splitting on null as a separate class is often quite valuable, at least one data mining product offers an alternative approach as well. In Enterprise Miner, each node stores several possible splitting rules, each one based on a different input field. When a null value is encountered in the field that yields the best split, the software uses the surrogate split based on the next best available input variable.

Growing the Full Tree

The initial split produces two or more child nodes, each of which is then split in the same manner as the root node.
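
Splitting each child "in the same manner as the root node" is naturally written as a recursive function. The sketch below is a minimal illustration, not any product's algorithm. It assumes categorical inputs, uses Gini impurity as the purity measure (an assumption; this passage does not name one), and treats null as its own branch. The names `grow`, `partition`, and `gini`, the `min_size` parameter, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def gini(records, target):
    """Gini impurity of the target values in a set of records."""
    n = len(records)
    counts = Counter(r[target] for r in records)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def partition(records, field):
    """Group records by the value of `field`; None is kept as its own group."""
    groups = defaultdict(list)
    for r in records:
        groups[r.get(field)].append(r)
    return groups


def grow(records, fields, target, min_size=2):
    """Recursively split records, treating every child like the root."""
    majority = Counter(r[target] for r in records).most_common(1)[0][0]
    node = {"prediction": majority, "size": len(records)}
    impurity = gini(records, target)
    # Stop when the node is too small or already pure.
    if len(records) < min_size or impurity == 0.0:
        return node
    best_field, best_score, best_groups = None, impurity, None
    for field in fields:
        groups = partition(records, field)
        if len(groups) < 2:
            continue  # this field does not separate the records at all
        score = sum(len(g) / len(records) * gini(g, target)
                    for g in groups.values())
        if score < best_score:
            best_field, best_score, best_groups = field, score, groups
    if best_field is None:
        return node  # no split improves purity, so this stays a leaf
    node["field"] = best_field
    node["children"] = {value: grow(group, fields, target, min_size)
                        for value, group in best_groups.items()}
    return node


# Made-up records; None marks a missing value.
toy = [
    {"owner": "Y", "region": "E", "resp": 1},
    {"owner": "Y", "region": "W", "resp": 1},
    {"owner": None, "region": "E", "resp": 0},
    {"owner": "N", "region": "W", "resp": 0},
    {"owner": None, "region": "W", "resp": 0},
]

tree = grow(toy, ["owner", "region"], "resp")
```

In this toy run the root splits on `owner`, and the records with a null `owner` form a child of their own rather than being dropped or imputed.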