SPURIOUS CORRELATIONS, STATISTICAL HURDLES AND MATHEMATICAL HICCUPS
There’s a risk with Big Data: encourage everyone to use the tools to explore the company’s big data lake, and people will do exactly that.
It’s what you want, right?
But the risk is that people may believe they have found some significant, industry-changing insight that will not only catapult the business to the pinnacle of its industry, but will also give them the chance to make a name for themselves, one they so richly deserve.
Spoiler Alert: Every data scientist will agree that finding truly valuable and realistic insights is never easy. Actually, it can be very hard.
After all, while the numbers never lie, they can be misleading.
PREPARE TO BE DISAPPOINTED
Data Analytics requires a great deal of exploration, trial and error, disappointment and perhaps grey hair. Yet, the business expects you to deliver the holy grail, week after week. They must think you’re Gandalf.
The answers aren’t just there, waiting to be found. They need to be teased out and when discovered, they need to be tested. They also have to be rock solid:
- They must stand up to the rigour of daily use
- They must be reliable and trustworthy
- Their boundaries must be discovered
Yet, the excitement of finding something that appears to be incredible is hard to suppress. When you’ve spent days, weeks or even months exploring data and finding nothing, then come across results that appear mind-blowing, it’s hard not to think about the possibilities.
Even for the most seasoned Data Scientists.
But they know there’s more to do to prove or disprove these findings. As most statisticians will attest, correlation does not imply causation. Just because two sets of data appear to be related doesn’t mean that they are, and spurious correlations can be the undoing of many hypotheses.
Spurious correlations are statistical anomalies in which two data sets move together closely enough to look related, yet have no real connection. The match comes from coincidence or from some hidden third factor.
There are many examples around us, and some real clangers are worth sharing.
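To see how easily coincidence can produce a convincing match, here is a minimal Python sketch, offered as an illustration only. It generates a batch of random series that have nothing to do with one another, then goes looking for the best-correlated pair. The function names (`pearson`, `random_walk`, `strongest_pair`) and the default parameters are hypothetical choices made for this example.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def random_walk(rng, length):
    """A series built from independent random steps; nothing drives it."""
    value, series = 0.0, []
    for _ in range(length):
        value += rng.gauss(0, 1)
        series.append(value)
    return series


def strongest_pair(num_series=40, length=30, seed=1):
    """Return (|r|, i, j) for the most strongly correlated pair of walks."""
    rng = random.Random(seed)
    walks = [random_walk(rng, length) for _ in range(num_series)]
    return max(
        (abs(pearson(walks[i], walks[j])), i, j)
        for i, j in combinations(range(num_series), 2)
    )
```

Calling `strongest_pair()` returns the highest absolute correlation along with the positions of the two series that produced it. With several hundred pairs to choose from, a strong-looking match is very likely to turn up even though every series is pure noise.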
There’s even a website dedicated to spurious correlations: tylervigen.com
PEOPLE WHO DROWNED AFTER FALLING OUT OF A FISHING BOAT CORRELATES WITH THE MARRIAGE RATE OF KENTUCKY
PEOPLE WHO DIED FALLING OUT OF THEIR BED CORRELATES WITH THE NUMBER OF LAWYERS IN PUERTO RICO
MY FAVOURITE: CHEESE CONSUMPTION AND DEATH BY BEDSHEET ENTANGLEMENT
THE PITFALLS AND PROBLEMS WITH STATISTICAL ANALYSIS
It’s obvious that these examples are spurious, yet many correlated data sets are not so obviously spurious.
Any analytical result has to be tested rigorously before reliable conclusions can be drawn. There are statistical tests that help check whether a correlation reflects a genuine relationship, but spurious correlations are just one of the pitfalls of working with big data.
It’s great to encourage the organisation to explore your big data, but make sure the data scientists are on hand, either to help avoid these statistical and mathematical pitfalls or to console the budding Data Analysts when they fall back down to earth.
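One simple sanity check for data collected over time is to correlate the period-to-period changes rather than the raw values. The Python sketch below assumes two made-up series that share nothing except an upward trend. All names and parameters in it (`trending_series`, `changes`, `compare_levels_and_changes`, the slope and noise values) are hypothetical. It is one check among many, and passing it says nothing about causation on its own.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def trending_series(rng, length, slope=1.0, noise=1.0):
    """Hypothetical data: a steady upward trend plus independent noise."""
    return [slope * t + rng.gauss(0, noise) for t in range(length)]


def changes(series):
    """Period-to-period differences, which strip out a shared linear trend."""
    return [b - a for a, b in zip(series, series[1:])]


def compare_levels_and_changes(length=400, seed=7):
    """Return (correlation of raw values, correlation of changes)."""
    rng = random.Random(seed)
    a = trending_series(rng, length)
    b = trending_series(rng, length)
    return pearson(a, b), pearson(changes(a), changes(b))
```

The first number returned by `compare_levels_and_changes()` is close to 1 because both series simply rise over time. The second is close to 0, which is the hint that the two series are not actually moving together once the shared trend is removed.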