PARTITIONING THE MODEL SET

Once the preclassified data has been obtained from the appropriate time frames, the methodology calls for dividing it into three parts. The first part, the training set, is used to build the initial model. The second part, the validation set, is used to adjust the initial model to make it more general and less tied to the idiosyncrasies of the training set. The third part, the test set, is used to gauge the likely effectiveness of the model when applied to unseen data. Three sets are necessary because once data has been used for one step in the process, the information it contains has already become part of the model, so it can no longer be used to correct or judge that model in the next step.
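The three-way division described above can be sketched in a few lines of Python. This is only an illustration, not the methodology's prescribed procedure: the function name `partition_model_set`, the 60/30/10 proportions, and the fixed random seed are all assumptions made for the example, and the text does not specify how large each part should be.

```python
import random


def partition_model_set(records, fractions=(0.6, 0.3, 0.1), seed=0):
    """Shuffle records and split them into (training, validation, test).

    The default fractions are illustrative assumptions, not values
    taken from the text.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    # Copy and shuffle so the three parts are random samples of the
    # model set rather than, say, the oldest and newest records.
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    n_train = round(n * fractions[0])
    n_valid = round(n * fractions[1])

    # Each record lands in exactly one part, so no record used to build
    # the model is later reused to adjust or judge it.
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return training, validation, test


# Hypothetical model set of 1,000 record identifiers.
training, validation, test = partition_model_set(range(1000))
```

The point of the sketch is that the three parts are disjoint: the training set builds the model, the validation set adjusts it, and the test set is left untouched until the final assessment.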

People often find it hard to understand why the training set and validation set are “tainted” once they have been used to build a model. An analogy may help: Imagine yourself back in the fifth grade. The class is taking a spelling test. Suppose that, at the end of the test period, the teacher asks you to estimate your own grade on the quiz by marking the words you got wrong. You will give yourself a very good grade, but your spelling will not improve. If, at the beginning of the period, you thought there should be an ‘e’ at the end of “tomato,” nothing will have happened to change your mind when you grade your paper. No new information has entered the system. You need a validation set! Now, imagine that at the end of the test the teacher allows you to look at the papers of several neighbors before grading your own. If they all agree that “tomato” has no final ‘e,’ you may decide to mark your own answer wrong. If the teacher gives the same quiz tomorrow, you will do better. But how much better?