Extracting Rules from Trees When a decision tree is used primarily to generate scores, it is easy to forget that a decision tree is actually a collection of rules. If one of the purposes of the data mining effort is to gain understanding of the problem domain, it can be useful to reduce the huge tangle of rules in a decision tree to a smaller, more comprehensible collection. There are other situations where the desired output is a set of rules. In Mastering Data Mining, we describe the application of decision trees to an industrial process improvement problem, namely the prevention of a certain type of printing defect. In that case, the end product of the data mining project was a small collection of simple rules that could be posted on the wall next to each press. When a decision tree is used for producing scores, having a large number of leaves is advantageous because each leaf generates a different score. When the object is to generate rules, the fewer rules the better. Fortunately, it is often pos sible to collapse a complex tree into a smaller set of rules.
The first step in that direction is to combine paths that lead to leaves that make the same classification. The partial decision tree yields the following rules: Watch the game and home team wins and out with friends then beer. Watch the game and home team wins and sitting at home then diet soda. Watch the game and home team loses and out with friends then beer. Watch the game and home team loses and sitting at home then milk. The two rules that predict beer can be combined by eliminating the test for whether the home team wins or loses. That test is important for discriminating between milk and diet soda, but has no bearing on beer consumption. The new, simpler rule is: Watch the game and out with friends then beer. 194 Chapter 6 Watch the game? No Yes Home team wins? No Yes Out with friends? Out with friends? No Yes No Yes Diet soda Beer Milk Beer Multiple paths lead to the same conclusion. Up to this point, nothing is controversial because no information has been lost, but C5’s rule generator goes farther. It attempts to generalize each rule by removing clauses, then comparing the predicted error rate of the new, briefer rule to that of the original using the same pessimistic error rate assumption used for pruning the tree in the first place. Often, the rules for several different leaves generalize to the same rule, so this process results in fewer rules than the decision tree had leaves. In the decision tree, every record ends up at exactly one leaf, so every record has a definitive classification. After the rule-generalization process, however, there may be rules that are not mutually exclusive and records that are not cov ered by any rule. Simply picking one rule when more than one is applicable can solve the first problem. The second problem requires the introduction of a default class assigned to any record not covered by any of the rules. Typically, the most frequently occurring class is chosen as the default.