SURVIVAL ANALYSIS AND STATISTICS
 
The discussion of survival analysis in this chapter assumes that time is discrete. In particular, things happen on particular days, and the time of day is not important. This assumption is reasonable for the problems addressed by data mining, and it is also more intuitive and simplifies the mathematics.

In statistics, though, survival analysis makes the opposite assumption, that time is continuous. Instead of hazard probabilities, statisticians work with hazard rates, which are turned into survival curves by using exponentiation and integration. One difference between a rate and a probability is that a rate can exceed 1, whereas a probability never does. Also, a rate seems less intuitive for many survival problems encountered with customers.

The method for calculating hazards in this chapter is called the life table method, and it works well with discrete time data. A very similar method, called Kaplan-Meier, is used for continuous time data. The two techniques produce almost exactly the same results when events occur at discrete times.

An important part of statistical survival analysis is the estimation of hazards using parameterized regression, that is, trying to find the best functional form for the hazards. This is an alternative to calculating the hazards directly from the data.
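Calculating hazards directly from discrete time data can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation; the function name `life_table`, its arguments, and the sample data are hypothetical. It assumes that tenure is measured in whole days, that the hazard at tenure t is the number of customers who stopped at t divided by the number whose observed tenure is at least t, and that survival at tenure t is the product of one minus the hazards for all earlier tenures.

```python
from collections import Counter


def life_table(tenures, stopped):
    """Return (hazards, survival), two lists indexed by tenure in days.

    tenures: observed tenure in whole days for each customer.
    stopped: True if the customer stopped at that tenure,
             False if the customer was still active (censored).
    """
    # Count the stops observed at each tenure; censored customers are excluded.
    stops = Counter(t for t, s in zip(tenures, stopped) if s)

    hazards, survival = [], []
    surviving = 1.0
    for t in range(max(tenures) + 1):
        # Population at risk: everyone observed for at least t days.
        at_risk = sum(1 for u in tenures if u >= t)
        hazard = stops[t] / at_risk if at_risk else 0.0
        hazards.append(hazard)
        # Survival at tenure t uses only the hazards for tenures before t.
        survival.append(surviving)
        surviving *= 1.0 - hazard
    return hazards, survival


# Hypothetical data: six customers, three of whom are still active.
tenures = [1, 2, 2, 3, 3, 3]
stopped = [True, True, False, True, False, False]
hazards, survival = life_table(tenures, stopped)
for t, (h, s) in enumerate(zip(hazards, survival)):
    print(f"tenure {t}: hazard {h:.3f}, survival {s:.3f}")
```

Censored customers count in the population at risk for every tenure up to their observed tenure, but they never count as stops. Handling them this way is what lets the calculation use customers who are still active.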

The parameterized approach can more easily include covariates in the process. Later in this chapter, there is an example based on such a parameterized model. Unfortunately, the hazard function rarely follows a form that would be familiar to nonstatisticians. The hazards do such a good job of describing the customer life cycle that it would be shocking if a simple function captured that rich complexity. We strongly encourage interested readers who have a mathematical or statistical background to investigate the area further.

Proportional Hazards

Sir David Cox is one of the most cited statisticians of the past century; his work comprises numerous books and over 250 articles. He has received many awards, including a knighthood bestowed on him by Queen Elizabeth in 1985. Much of his research centered on understanding hazard functions, and his work has been particularly important in the world of medical research.

His seminal paper was about determining the effect of initial factors (time-zero covariates) on hazards. By assuming that these initial factors have a uniform proportional effect on hazards, he was able to figure out how to measure this effect for different factors. The purpose of this section is to introduce proportional hazards and to suggest how they are useful for understanding customers.
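To make the idea of a uniform proportional effect concrete, the following Python sketch scales a set of baseline hazards by a single factor and compares the resulting survival curves. It only illustrates the assumption; it is not Cox's estimation procedure. The function names, the baseline hazards, and the factor of 1.5 are hypothetical values chosen for illustration.

```python
def survival_from_hazards(hazards):
    """Survival at each tenure: product of (1 - hazard) over earlier tenures."""
    surviving, curve = 1.0, []
    for h in hazards:
        curve.append(surviving)
        surviving *= 1.0 - h
    return curve


def scale_hazards(baseline, factor):
    """Apply the same proportional effect to the hazard at every tenure.

    Discrete hazards are probabilities, so each result is capped at 1.
    The cap is a choice made for this sketch; where it applies, the
    effect is no longer exactly proportional.
    """
    return [min(1.0, h * factor) for h in baseline]


# Hypothetical baseline hazards for tenures 0 through 3.
baseline = [0.00, 0.10, 0.05, 0.05]

# Hypothetical initial factor that raises every hazard by 50 percent.
factor = 1.5
group = scale_hazards(baseline, factor)

baseline_survival = survival_from_hazards(baseline)
group_survival = survival_from_hazards(group)
for t, (b, g) in enumerate(zip(baseline_survival, group_survival)):
    print(f"tenure {t}: baseline survival {b:.4f}, group survival {g:.4f}")
```

Under this assumption, the ratio of the group's hazard to the baseline hazard is the same at every tenure. The whole effect of the initial factor is therefore summarized by one number.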