You will frequently hear the phrase “Correlation Not Causation” in Big Data discussions. What exactly does this mean and why is it important?
UPS provided a nice illustration of “Correlation Not Causation” when it employed a classic Big Data solution to save money on truck maintenance.
UPS cannot have its trucks breaking down on the road. So historically, it would replace key parts on a schedule rather than when needed. The problem is that this approach is expensive. You are often replacing parts that are working just fine and will continue to work fine well into the future.
Eventually, UPS was able to analyze its vehicle records digitally, and one thing it discovered was that one expensive component was likely to fail when a particular sequence of events occurred. While there was a high correlation between the sequence of events and the component failure, there was no clear reason for this correlation. In short, UPS had found a correlation but could not determine the causation.
Nonetheless, UPS switched its maintenance strategy for that component. It began to replace the part when the sequence of events was detected. This resulted in substantial cost savings with no degradation in the performance of its vehicles.
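To make the idea concrete, here is a minimal Python sketch of acting on a correlation without knowing its cause. It is not how UPS actually did it; the record fields, the made-up data, and the 0.5 threshold are all assumptions invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class TruckRecord:
    """One hypothetical vehicle record (all fields are illustrative)."""
    truck_id: str
    saw_event_sequence: bool  # was the warning sequence of events logged?
    component_failed: bool    # did the component fail afterwards?


def failure_rate(records, saw_sequence):
    """Fraction of trucks whose component failed, among those that did
    (or did not) show the event sequence."""
    group = [r for r in records if r.saw_event_sequence == saw_sequence]
    if not group:
        return 0.0
    return sum(r.component_failed for r in group) / len(group)


def should_replace(records, truck_saw_sequence, threshold=0.5):
    """Replace the part only when the sequence was seen AND history shows
    that failure usually follows it. No causal explanation is required."""
    return truck_saw_sequence and failure_rate(records, True) >= threshold


# Made-up records, purely for illustration.
records = [
    TruckRecord("A", True, True),
    TruckRecord("B", True, True),
    TruckRecord("C", True, False),
    TruckRecord("D", False, False),
    TruckRecord("E", False, False),
    TruckRecord("F", False, True),
]

print("Failure rate with sequence:   ", failure_rate(records, True))
print("Failure rate without sequence:", failure_rate(records, False))
print("Replace part on a truck showing the sequence?", should_replace(records, True))
```

Notice that nothing in the sketch tries to explain why the sequence precedes failure. It only checks whether failure has followed the sequence often enough in the past to justify replacing the part.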
Here is another, more personal example:
A close friend teaching at Georgetown University during the late eighties collapsed with a combination of rarely seen health issues. When her husband arrived and asked what could be done, the doctor told him the following:
- He had access to a system being developed by DARPA, which they were calling an “Internet,” that enabled him to communicate with other doctors across the United States.
- He had found sixteen similar cases, and all but five had died.
- The only thing the five survivors had in common was that their spleens had been removed. None of the physicians could explain what connection the removal of the spleen had with the ailment, only that there was an apparent correlation between the removal of the spleen and survival.
The husband gave permission to operate, my friend’s spleen was removed, and she is alive today.
This need for us to explain correlation rather than simply accept it is interesting.
We have talked in another blog post about living in a Small Data World where sampling and statistics are needed to make inferences about populations. In that world, anything which can help us make predictions without having large quantities of data is critical. Physical relationships based on cause and effect have been an effective tool for doing this.
A consequence of this is that we instinctively seek the underlying causal relationships whenever we encounter some form of correlation. But Big Data is about making predictions.
There is less emphasis in a big data world on determining underlying causes. In time, a connection may emerge, but that does not mean we need to understand that connection before we take action as a result of a correlation. This is the way UPS proceeded, as did the Georgetown University doctor who saved my friend’s life.
Now you’ve seen how Big Data principles and practices can be applied to vehicle maintenance and healthcare, and soon they will apply to every aspect of our lives. As a result, “Correlation Not Causation” represents an important shift in how we view our realities and our ability to make sense of them.
If this type of information is interesting or useful to you, you’ll enjoy Omnikron’s Big Data Fundamentals course.
Enterprise data collection and analytics are undergoing a revolution that is here to stay, and you will need to be fluent in these topics if you do not want to be left behind.