NUMERIC INPUT VARIABLES
 
With a categorical target variable, a test such as Gini, information gain, or chi-square is appropriate whether the input variable providing the split is numeric or categorical. Similarly, with a continuous, numeric target variable, a test such as variance reduction or the F-test is appropriate for evaluating the split regardless of whether the input variable providing the split is categorical or numeric.

Splitting on a Numeric Input Variable

When searching for a binary split on a numeric input variable, each value that the variable takes on in the training set is treated as a candidate value for the split. Splits on a numeric variable take the form X < N. All records where the value of X (the splitting variable) is less than some constant N are sent to one child, and all records where the value of X is greater than or equal to N are sent to the other. After each trial split, the increase in purity, if any, due to the split is measured. In the interests of efficiency, some implementations of the splitting algorithm do not evaluate every value; they evaluate a sample of the values instead.
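The search for a split of the form X < N can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular software package: it assumes a categorical target, uses Gini impurity as the purity measure, and tries every distinct training value as a candidate. The names gini and best_numeric_split and the toy data are hypothetical, chosen only for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_numeric_split(values, labels):
    """Return (N, gain) for the best split X < N, or (None, 0.0) if none helps."""
    n = len(values)
    parent_impurity = gini(labels)
    best_n, best_gain = None, 0.0
    # Each distinct value seen in the training set is a candidate split point.
    for candidate in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x < candidate]
        right = [y for x, y in zip(values, labels) if x >= candidate]
        if not left or not right:
            continue  # a split that sends every record to one child is useless
        # Impurity after the split, weighted by the size of each child.
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent_impurity - weighted
        if gain > best_gain:
            best_n, best_gain = candidate, gain
    return best_n, best_gain


# Toy training set (made up for illustration): age versus response.
ages = [22, 25, 31, 38, 45, 52]
responded = ["no", "no", "no", "yes", "yes", "yes"]
split_point, gain = best_numeric_split(ages, responded)
print(split_point, gain)
```

On this toy data the chosen split is age < 38, which separates the two classes completely. An implementation concerned with efficiency could pass only a sample of the distinct values into the loop rather than all of them.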

When the decision tree is scored, the only use that it makes of numeric inputs is to compare their values with the split points. They are never multiplied by weights or added together as they are in many other types of models. This has the important consequence that decision trees are not sensitive to outliers or skewed distributions of numeric input variables, because the tree uses only the rank of numeric variables and not their absolute values.

Splitting on a Categorical Input Variable

The simplest algorithm for splitting on a categorical input variable is to create a new branch for each class that the variable can take on. So, if color is chosen as the best field on which to split the root node, and the training set includes records that take on the values red, orange, yellow, green, blue, indigo, and violet, then there will be seven nodes in the next level of the tree. This approach is actually used by some software packages, but it often yields poor results. High branching factors quickly reduce the population of training records available at each node in lower levels of the tree, making further splits less reliable.

A more common approach is to group together the classes that, taken individually, predict similar outcomes. More precisely, if two classes of the input variable yield distributions of the classes of the output variable that do not differ significantly from one another, the two classes can be merged. The usual test for whether the distributions differ significantly is the chi-square test.
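The merging test can be illustrated with a short Python sketch. It rests on several assumptions that are not in the text: the target is binary (so the test has one degree of freedom), the significance level is 0.05 (critical value 3.841), and the class counts are invented. The names chi_square_2xk, can_merge, and counts are hypothetical.

```python
def chi_square_2xk(row_a, row_b):
    """Pearson chi-square statistic for a 2 x k table of target-class counts."""
    total_a, total_b = sum(row_a), sum(row_b)
    grand_total = total_a + total_b
    statistic = 0.0
    for a, b in zip(row_a, row_b):
        column_total = a + b
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column_total / grand_total
            if expected > 0:  # skip empty cells to avoid dividing by zero
                statistic += (observed - expected) ** 2 / expected
    return statistic


# Critical value for 1 degree of freedom at the 0.05 level
# (assumption: binary target, so df = (2 - 1) * (2 - 1) = 1).
CRITICAL_DF1_ALPHA_05 = 3.841

# Invented counts of [responders, non-responders] for three color classes.
counts = {"red": [30, 70], "orange": [32, 68], "blue": [70, 30]}


def can_merge(class_a, class_b):
    """True if the two classes' target distributions do not differ significantly."""
    return chi_square_2xk(counts[class_a], counts[class_b]) < CRITICAL_DF1_ALPHA_05


print(can_merge("red", "orange"))  # similar response rates
print(can_merge("red", "blue"))    # very different response rates
```

With these counts, red and orange have nearly identical response rates and would be merged into one branch, while red and blue differ strongly and would stay separate. A fuller implementation would repeat the test over all pairs, merging the most similar pair each time until no remaining pair passes.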