If you have taken any statistics class, the very first thing you were taught was most likely:
Correlation does not imply causation
The next thing you learn is that:
Correlation does not imply causation. But sometimes it does.
If it wasn’t for the last bit, statistics courses would be very short, since finding correlations is useless if they don’t explain causation.
However, the first bit is always the most important since bad statistics is often worse than no statistics, and the world is full of bad statistics.
The problem is that bad statistics can have a lot of real-life consequences, from people believing that eating chocolate will earn you a Nobel Prize, to justifying racism, to under- or overestimating the effects and side effects of medical treatments.
Thus learning a few tools in causal modelling may help you to find the real explanation behind the numbers, or call bullshit when someone points at a correlation and calls for action.
So let me demonstrate with a textbook example.
Basic Causal modelling
If you look at the graph you will see that ice-cream sales and drowning accidents per month are highly correlated. This may lead to two theories:
A: Selling ice-cream somehow causes more drowning accidents
B: People drowning increases ice-cream sales
In today’s world, someone would probably post theory A online in an emotionally engaging post, and people would rally in protest and demand that ice-cream be made illegal and everyone involved in the trade be sent to jail.
But as scientists, we are committed to the truth. And the only way to find the truth is to design an experiment. To investigate hypothesis A we could close all ice-cream stands or give away free ice-cream, to see if it changes the number of drowning accidents. Then, given that we have learned, and can still remember, how to correctly perform a t-test, we will probably find that changing ice-cream sales makes no significant difference to the number of people drowning.
This will falsify hypothesis A, and we should congratulate ourselves on good science (falsifying a hypothesis is just as important as verifying one, even though it doesn’t make the same headlines). But we can’t call it a day, because falsifying A says nothing about B.
The correct scientific way is to tie some people to a rock and drop them in the ocean, to see if we sell more ice-cream. The ethical complications of this experiment should be discussed in another article; right now we are doing ice-cold, ice-cream science. After performing this experiment we would most likely find that ice-cream sales, unlike our conscience, were unaffected.
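Going back to the ice-cream stand experiment for a moment, here is a minimal sketch of what that analysis could look like on simulated data. Everything in it is an assumption made for illustration: the function names, the made-up numbers for temperature and drownings, and the `effect` parameter, which lets us build one world where ice-cream has no causal effect and one where it does. It uses Welch's version of the t-test, written out by hand so that it only needs the standard library.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

def monthly_drownings(rng, stands_open, effect):
    """One simulated month; `effect` is the assumed causal effect of open stands."""
    temperature = rng.gauss(20, 5)  # weather varies regardless of the experiment
    baseline = 30 + 1.5 * temperature + rng.gauss(0, 5)
    return baseline + (effect if stands_open else 0.0)

def run_experiment(effect, n=200, seed=0):
    """Compare n months with open stands against n months with closed stands."""
    rng = random.Random(seed)
    open_months = [monthly_drownings(rng, True, effect) for _ in range(n)]
    closed_months = [monthly_drownings(rng, False, effect) for _ in range(n)]
    return welch_t(open_months, closed_months)

t_null = run_experiment(effect=0.0)     # a world where hypothesis A is false
t_causal = run_experiment(effect=10.0)  # a world where hypothesis A is true
print(f"no causal effect: t = {t_null:.2f}")
print(f"causal effect:    t = {t_causal:.2f}")
```

Because we decide which months the stands are open, the weather cannot sneak in as an explanation. When there is no causal effect the t statistic should stay close to zero, and when there is one it should be far from zero.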
Now suppose instead that our intervention study on stopping ice-cream sales had shown a small, but significant, effect on the number of people drowning. Then we may wish to investigate this phenomenon further. In this case, we could come up with hypothesis A.1 to explain why.
A.1: Eating ice-cream makes you a worse swimmer. And bad swimmers have a higher chance of drowning.
In this case, swimming ability is the mediator. The beauty of this hypothesis is that it is built on two sub-hypotheses that can be tested separately, in two different studies. Each link should also be easier to detect than the overall effect. In a simple linear chain, the correlation between ice-cream and drowning is the correlation between ice-cream and swimming ability multiplied by the correlation between swimming ability and a person’s chance of drowning, and since neither factor can exceed 1 in magnitude, each link is at least as strong as the product.
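The multiplication rule for a mediator can be checked on simulated data. The sketch below assumes the simplest possible setting: each link is linear, the noise is Gaussian and independent, and there is no other path between ice-cream and drowning. The coefficients and variable names are invented for illustration.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

rng = random.Random(1)
n = 20_000

# Assumed chain: ice-cream -> swimming ability -> drowning risk.
ice_cream = [rng.gauss(0, 1) for _ in range(n)]
swimming = [-0.5 * i + rng.gauss(0, 1) for i in ice_cream]  # more ice-cream, worse swimmer
drowning = [-0.5 * s + rng.gauss(0, 1) for s in swimming]   # worse swimmer, higher risk

r_ice_swim = pearson(ice_cream, swimming)
r_swim_drown = pearson(swimming, drowning)
r_ice_drown = pearson(ice_cream, drowning)

print(f"ice-cream vs swimming: {r_ice_swim:.3f}")
print(f"swimming vs drowning:  {r_swim_drown:.3f}")
print(f"product of the links:  {r_ice_swim * r_swim_drown:.3f}")
print(f"ice-cream vs drowning: {r_ice_drown:.3f}")
```

The last two numbers should agree up to sampling noise, and both should be smaller in magnitude than either individual link, which is why the two sub-studies have an easier job than the overall one.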
It is worth noting that when scientists investigated this, no link was found between eating and swimming ability.
So we can conclude that both hypotheses A and B are false, and we should look elsewhere for an explanation. And I am sure it has been screaming in your mind this whole time: it is the season that causes both.
In this case, the season variable is called a confounder, and ice-cream sales and drowning are both children of this confounder. A simple mathematical trick to see this is to condition on the confounder; for example, we could look only at drowning and ice-cream sales for the same month in different years.
Now the correlation is completely gone.
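To see the conditioning trick in action without the real data, here is a small simulation. It is only a sketch: the monthly temperatures, the coefficients, and the noise levels are all made up, and the only thing built into it is the assumption that temperature drives both ice-cream sales and drownings while the two never influence each other.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

rng = random.Random(2)

# Hypothetical average temperature for each calendar month, January to December.
monthly_temp = [2, 4, 8, 13, 18, 22, 25, 24, 20, 14, 8, 3]

rows = []  # one (month_index, ice_cream_sales, drownings) tuple per simulated month
for year in range(500):
    for month, temp in enumerate(monthly_temp):
        ice_cream_sales = 10 * temp + rng.gauss(0, 20)  # depends on temperature only
        drownings = 2 * temp + rng.gauss(0, 5)          # depends on temperature only
        rows.append((month, ice_cream_sales, drownings))

# Correlation across all months, ignoring the season.
overall = pearson([r[1] for r in rows], [r[2] for r in rows])

# Conditioning on the confounder: compare only July with July.
july = [r for r in rows if r[0] == 6]
within_july = pearson([r[1] for r in july], [r[2] for r in july])

print(f"all months:  r = {overall:.2f}")
print(f"July only:   r = {within_july:.2f}")
```

Across all months the correlation should come out strongly positive, while within a single month it should hover around zero, because once the season is fixed there is nothing left that the two variables share.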
Since you found out about the season, I hope you took a moment to feel clever, because you’re wrong. Few people turn the calendar, see it is the first of June, and think: time to increase my ice-cream consumption by 10%, and then try to hold my breath underwater 10 seconds longer than in April.
The season is defined by the tilt of the earth’s axis as it orbits the sun, which causes more sunshine on one part of the planet, which in turn causes the temperature to rise. Both the amount of sunshine and the temperature affect ice-cream sales and the number of people going swimming. While this is just another example of mediators, it’s also an example of a collider, since more than one arrow points into the same variable: ice-cream sales are affected by both sunshine and temperature.
The short point is that conditioning can be a powerful tool, but it doesn’t tell you what causes what; only interventions do.
Putting it all together
Trying to put what we have learned today into a model gives something that looks like this.
This, however, is still a very simplified model that fails to capture much of the complexity of our little ice-cream problem. For example, if we added geography, culture, or economics we would see a much different story, but right now that’s all considered noise. Generally, in our attempt to fully understand the world, we can keep adding nodes to our causal model until we don’t have any noise unaccounted for.
We also haven’t discussed how nodes affect each other: would one extra degree of average temperature translate to 10% more ice-cream sales (multiplier), or make everyone buy one more ice-cream independent of all other factors (addition)?
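The difference between the two is easy to see with a toy calculation. The function names, the 10% rate, and the number of customers below are placeholders I picked for illustration, not anything estimated from data.

```python
def multiplicative_sales(baseline, extra_degrees, rate=0.10):
    """Multiplier model: each extra degree scales sales by (1 + rate)."""
    return baseline * (1 + rate) ** extra_degrees

def additive_sales(baseline, extra_degrees, customers, per_customer=1):
    """Additive model: each extra degree makes every customer buy `per_customer` more."""
    return baseline + extra_degrees * customers * per_customer

# Two hypothetical stands with the same 50 customers but very different baseline sales.
for baseline in (100, 1000):
    mult = multiplicative_sales(baseline, extra_degrees=1)
    add = additive_sales(baseline, extra_degrees=1, customers=50)
    print(f"baseline {baseline}: multiplier -> {mult:.0f}, addition -> {add:.0f}")
```

Under the multiplier model the small stand goes from 100 to 110 and the large one from 1000 to 1100, so the gain depends on where you started. Under the additive model both stands gain the same 50, no matter how much they were selling before.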
However, I think I have been rambling enough about this problem, so I will end with the point all scientists wish to make: the world is more complicated than it seems.
Drowning data: Injury Facts
Ice-cream data: United States Department of Agriculture