Do you understand the difference between logistic regression and linear regression? If your eyes just glazed over, you’re exactly the person who should be reading this post. If you understand these concepts fully, please read the companion piece from my colleague Dr. David Herberich, which will be posted soon and is the geeky version of this post.
What is a logistic regression? It is a way to predict the probability of something happening. It answers questions like “How likely is this customer to cancel their account?” or “How likely is this customer to use a coupon?” This kind of analysis is very common in academia, but after ten years of doing analyses at hundreds of companies, in dozens of industries, I have never found a case where it made sense for business operations to use that information directly.
Let’s take the coupon example to get to the first reason you should never use logistic regression. Do you just want to know whether the customer will use the coupon, or do you actually want to know how much more the customer will spend if they use it? In the churn example, it may be somewhat useful to know that a customer might cancel their account, but if you don’t know when they will cancel, you can’t really do much about it. For example, if two customers both have a 60% probability of churn, but one is expected to cancel in the next day and the other in 30 days, wouldn’t you want to focus your attention on the customer who is about to leave?
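To make the churn example concrete, here is a minimal sketch in Python. The customer names, scores, and day counts are invented for illustration, and `logistic` is just the standard function a logistic regression uses to turn a score into a probability. It is not a fitted model.

```python
import math

def logistic(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical customers: scores and day counts are made up for illustration.
customers = [
    {"name": "B", "score": 0.405, "expected_days_to_cancel": 30},
    {"name": "A", "score": 0.405, "expected_days_to_cancel": 1},
]

# A logistic model only gives us a probability of churn per customer.
for c in customers:
    c["churn_probability"] = round(logistic(c["score"]), 2)

# Both customers share the same probability, so it cannot rank them.
# A predicted number (days until cancellation) can.
by_urgency = sorted(customers, key=lambda c: c["expected_days_to_cancel"])

for c in by_urgency:
    print(c["name"], c["churn_probability"], c["expected_days_to_cancel"])
```

Both customers come out at the same probability, and only the predicted number of days tells the team whom to call first.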
So that’s the primary reason you shouldn’t use logistic regression, and why I urge customers to always predict a number that directly affects how they will act: not information for the sake of information, but information that leads to ROI. Still not convinced? Here are three additional reasons you should never use logistic regression.
Let’s take an example where you are trying to predict whether a customer will cancel their account after a customer support problem. But how do you define whether churn happened? Say we are looking at a customer support interaction and analyzing whether the person cancelled their account soon after it. In a logistic regression analysis, we would come up with some magical cutoff point, say 30 days: anyone who cancelled within 30 days would count as a case of churn related to that customer complaint, and a cancellation after 30 days would not. A different analyst might say the cutoff should be 180 days, or one week. There is obviously no objectively correct answer to where the cutoff should be. But once you have established the cutoff, customers on the two sides of it are treated as separate classes. So someone who cancelled 30 days after the call is treated as identical to someone who cancelled 30 minutes after the call, and as completely different from someone who cancelled 31 days after the call. This may make academic sense, but it certainly does not make sense in the world of business, where the operational team needs more granular resolution to figure out how to focus its scarce resources. They would definitely want to prioritize the guy who would cancel in 30 minutes over the guy who would cancel in 30 days.
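The cutoff problem is easy to see in a few lines of Python. This is only a sketch: `churn_label` is a hypothetical helper, and the three cancellation times are the ones from the example above, expressed in days.

```python
def churn_label(days_to_cancel, cutoff_days):
    """Return 1 if the cancellation falls within the cutoff window, else 0."""
    return 1 if days_to_cancel <= cutoff_days else 0

# Time between the support call and the cancellation, in days.
days_after_call = {
    "30 minutes": 30 / (24 * 60),
    "30 days": 30,
    "31 days": 31,
}

# Label the same three customers under three different analysts' cutoffs.
labels_by_cutoff = {
    cutoff: {who: churn_label(d, cutoff) for who, d in days_after_call.items()}
    for cutoff in (7, 30, 180)
}

for cutoff, labels in labels_by_cutoff.items():
    print(f"cutoff = {cutoff} days: {labels}")
```

With a 30-day cutoff, the 30-minute and 30-day cancellations get the same label while the 31-day one gets the opposite label. Change the cutoff to 7 or 180 days and the same customers are classified differently, even though nothing about their behavior changed.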
These are the four primary reasons I give executives for why they should focus on linear regression rather than logistic regression. But the reason closest to my heart is that most business users can easily understand the results of a linear regression if the appropriate effort is made to explain it to them. Because business users are not used to dealing in probabilities, they often fail to fully grasp what a logistic regression is trying to tell them. If we want to succeed in business, we have to empower the business user and ensure that they can easily overlay their years of experience and domain knowledge on the analysis. Disagree? Please make your argument in the comment section.