
The Case for Linear Regression vs. Logistic Regression

Do you understand the difference between logistic regression and linear regression? If your eyes just glazed over, you’re exactly the person who should be reading this post.

What is a logistic regression?

A logistic regression is a way to predict the probability of something happening. It answers questions such as how likely a customer is to cancel an account or to use a coupon. This kind of analysis is very common in academia, but after 10 years of doing analyses at hundreds of companies, in dozens of industries, I have never found a case where it made sense for business operations to use the logistic model directly. In almost all cases, the linear model is better than the logistic model.

What is a linear regression?

A linear regression has a dependent variable (or outcome) that is continuous. In other words, the dependent variable can take any one of an infinite number of possible values, such as dollars spent or days until cancellation. Logistic regression, by contrast, has a dependent variable with only a limited number of possible values, most often two (for example, canceled or did not cancel).
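To make the contrast concrete, here is a minimal sketch of the two prediction formulas. The coefficients are hypothetical and were not fitted to any data. The point is only that the linear model returns an unbounded number, while the logistic model passes the same score through a sigmoid and returns a probability between 0 and 1 for a yes/no outcome.

```python
import math

# Hypothetical coefficients, chosen only for illustration (not fitted to data).
INTERCEPT = -2.0
SLOPE = 0.5


def linear_prediction(x: float) -> float:
    """Linear regression: the output is an unbounded continuous number."""
    return INTERCEPT + SLOPE * x


def logistic_prediction(x: float) -> float:
    """Logistic regression: the same linear score squeezed into (0, 1)."""
    score = INTERCEPT + SLOPE * x
    return 1.0 / (1.0 + math.exp(-score))


# Print both predictions side by side for a few input values.
for x in (0, 4, 20):
    print(x, linear_prediction(x), round(logistic_prediction(x), 3))
```

As the input grows, the linear prediction keeps growing, while the logistic prediction levels off just below 1.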

Why you shouldn’t use logistic regression.

Let’s take the coupon example to get to the first reason you should never use logistic regression. Do you just want to know whether the customer will use the coupon, or do you actually want to know how much more the customer will spend if they use it? In the churn example, it may be somewhat useful to know that a customer might cancel an account, but if you don’t know when the customer will cancel, you can’t really do much about it. For example, if two customers both have a 60% probability of churn, but one is expected to cancel tomorrow and the other in 30 days, would you not want to focus your attention on the customer who is about to leave?

That’s the primary reason you shouldn’t use logistic regression, and it is why I urge customers to always predict a number that directly affects how they will act. The goal is not information for the sake of information, but information that leads to ROI.
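The two-customer comparison can be sketched in a few lines. The customers, field names, and numbers below are hypothetical and simply mirror the example in the text. The sketch assumes you already have both a churn probability and an expected time to cancel for each customer.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    churn_probability: float       # what a logistic model would report
    expected_days_to_churn: float  # what a model of time-to-cancel would report


# Hypothetical customers matching the example in the text.
customers = [
    Customer("B", churn_probability=0.60, expected_days_to_churn=30.0),
    Customer("A", churn_probability=0.60, expected_days_to_churn=1.0),
]

# Sorting by probability cannot separate the two customers: the values tie,
# so the original list order is kept.
by_probability = sorted(customers, key=lambda c: -c.churn_probability)

# Sorting by expected days puts the most urgent customer first.
by_urgency = sorted(customers, key=lambda c: c.expected_days_to_churn)

print([c.name for c in by_probability])
print([c.name for c in by_urgency])
```

The probability gives the retention team no basis for choosing between the two customers. The expected number of days tells them whom to call first.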

More reasons you shouldn’t use logistic regression.

Still not convinced? Here are three additional reasons you should never use logistic regression. Let’s take an example where you are trying to predict whether a customer will cancel an account after a customer support problem. How do you define whether churn happened? Let’s say we are in a situation in which we are looking at a customer support interaction and analyzing whether a person canceled the account soon after that interaction. In a logistic regression analysis, we would come up with some magical cutoff point, say, 30 days, and anyone who canceled within 30 days would be considered a case of churn related to that customer complaint, while a cancellation after 30 days wouldn’t be considered churn.

A different analyst might say the cutoff should be 180 days or the cutoff should be one week. There is obviously no objectively correct answer to where the cutoff should be. But once you have established this cutoff point, customers on the two sides of the cutoff point are treated as separate classes. So someone who canceled 30 days after the call is identical to someone who canceled 30 minutes after the call and completely different from someone who canceled 31 days after the call. This may make academic sense, but it certainly does not make sense in the world of business where the operational team would need more granular resolution so they could figure out how to focus their scarce resources. They would definitely want to prioritize the guy who would cancel in 30 minutes over the guy who would cancel in 30 days.
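Here is a small sketch of how the cutoff shapes the binary target. The three customers and the helper names are hypothetical and mirror the 30-minute, 30-day, and 31-day cancellations described above. The sketch assumes a cancellation exactly at the cutoff counts as churn.

```python
MINUTES_PER_DAY = 24 * 60

# Hypothetical days between the support call and the cancellation.
days_to_cancel = {
    "canceled after 30 minutes": 30 / MINUTES_PER_DAY,
    "canceled after 30 days": 30.0,
    "canceled after 31 days": 31.0,
}


def churn_label(days: float, cutoff_days: float) -> int:
    """Binary target: 1 if the cancellation fell within the cutoff, else 0."""
    return 1 if days <= cutoff_days else 0


def labels_for(cutoff_days: float) -> list:
    """Labels for all three customers, in the order they were defined."""
    return [churn_label(d, cutoff_days) for d in days_to_cancel.values()]


# The same three customers get different labels under each analyst's cutoff.
for cutoff in (7, 30, 180):
    print(cutoff, labels_for(cutoff))
```

With a 30-day cutoff, the first two customers get the same label even though their cancellations are almost a month apart. Moving the cutoff to one week or 180 days relabels the same customers, while the raw number of days keeps all three distinct.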