Multiple Comparisons

The discussion so far has used examples with only one comparison, such as the difference between two presidential candidates or between a test and control group. Often, we are running multiple tests at the same time. For instance, we might try out three different challenger messages to determine whether one of them produces better results than the business-as-usual message. Because handling multiple tests does affect the underlying statistics, it is important to understand what happens.

The Confidence Level with Multiple Comparisons

Consider that two groups have been tested, and you are told that the difference between the responses in the two groups is 95 percent certain to be due to factors other than sampling variation. A reasonable conclusion is that there is a difference between the two groups. In a well-designed test, the most likely reason would be the difference in message, offer, or treatment.
Occam’s Razor says that we should take the simplest explanation, and not add anything extra. The simplest hypothesis for the difference in response rates is that the difference is not significant, that the response rates are really approximations of the same number. If the difference is significant, then we need to search for the reason why.

Now consider the same situation, except that you are told that there were actually 20 groups being tested, and you were shown only one pair. You might now reach a very different conclusion. If 20 groups are being tested, then you should expect one of them to exceed the 95 percent confidence bound due only to chance, since 95 percent means 19 times out of 20. You can no longer conclude that the difference is due to the testing parameters. Instead, because the difference is likely to be due to sampling variation, sampling variation is the simplest hypothesis.

The confidence level is based on only one comparison. When there are multiple comparisons, that condition is not true, so the confidence as calculated previously is not quite sufficient.
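
The arithmetic behind the 20-group example can be sketched in a few lines of Python. This is only an illustration, not a method from the text: the function names are hypothetical, and the second calculation assumes the comparisons are independent of one another, which the text does not state.

```python
def expected_false_positives(num_comparisons, confidence=0.95):
    """Expected number of comparisons that exceed the confidence
    bound due only to chance, when no real difference exists."""
    return num_comparisons * (1.0 - confidence)


def prob_at_least_one_false_positive(num_comparisons, confidence=0.95):
    """Chance that at least one comparison exceeds the bound by luck.

    Assumption (not from the text): the comparisons are independent,
    so the chance that none of them exceeds the bound is
    confidence ** num_comparisons.
    """
    return 1.0 - confidence ** num_comparisons


if __name__ == "__main__":
    # 1 comparison, 3 challenger messages, and the 20 groups in the text
    for k in (1, 3, 20):
        expected = expected_false_positives(k)
        at_least_one = prob_at_least_one_false_positive(k)
        print(f"{k:2d} comparisons: expect {expected:.2f} by chance, "
              f"P(at least one) = {at_least_one:.3f}")
```

With 20 comparisons, the expected number of chance results is 1, matching the "19 times out of 20" reasoning above. Under the independence assumption, the chance of seeing at least one such result is roughly 64 percent, which is why a single "95 percent" difference picked out of 20 groups is weak evidence.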