The Illusion of Causality in Charts
How charts can mislead us by depicting causes that may not exist.
A while back, I wrote an article here titled “Implied Causality in Line Charts.” The article examined the notion that certain charts imply a causal relationship between an event and an outcome when such a relationship may not actually exist. In that post, I used line charts as a running example and identified three ways in which line charts can suggest this type of relationship: the change in a trend after an event, the comparison of the temporal evolution of two or more groups of entities that differ by a factor of interest, and the temporal correlation between two quantities. The picture below summarizes these three patterns.
When I finished writing that post, I knew I wanted to generalize the idea to a much broader set of charts, and I suspected that some overarching principles must exist that cover most, if not all, of them. After several months, I feel ready to take a step forward. In this post, I’ll pick up where I left off, cover more chart types, and attempt to generalize the idea to any chart.
The key concept I will keep coming back to is the idea of “implied causality,” which refers to the fact that we tend to infer causality when we observe specific trends and patterns, even when, upon deeper analysis, the causal relationship may not exist.
Bar charts
This is probably the most common and straightforward scenario. A simple chart with two bars comparing two situations or entities can communicate a presumed causal relationship. The toy example below compares the incidence of heart disease in people over 50 who are either vegans or omnivores.
Consider how quickly we can come to conclusions. Vegans score so much better. We should all change our diet! Not so fast … It turns out that being vegan is associated with many other healthy behaviors, so the observed difference may be due to all these other healthy behaviors rather than the vegan diet alone. This problem is so common that it even has a specific name in healthcare: it’s called the “healthy user bias,” and it’s one of the many biases that can exist in data.
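To make the mechanism concrete, here is a minimal simulation sketch in Python (the numbers are invented purely for illustration): a hidden “healthy behaviors” variable drives both the choice to be vegan and the disease risk, while the diet itself has no effect by construction.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hidden confounder: overall healthy behaviors (exercise, no smoking, ...)
healthy = rng.random(n) < 0.3

# Healthy people are more likely to be vegan, but in this simulation
# the diet has NO direct effect on disease.
vegan = rng.random(n) < np.where(healthy, 0.25, 0.05)

# Disease risk depends ONLY on the hidden confounder.
disease = rng.random(n) < np.where(healthy, 0.05, 0.15)

print(f"Incidence among vegans:    {disease[vegan].mean():.3f}")
print(f"Incidence among omnivores: {disease[~vegan].mean():.3f}")
```

A naive two-bar chart of these incidences makes the vegan diet look protective, even though, by construction, it does nothing: the confounder alone produces the gap.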
Of course, we can observe the same effect in charts with more than two bars and in areas beyond healthcare. For example, we can imagine a chart that shows the performance of different departments within a company, comparing departments that use a particular new agile management method with those that do not.
The departments that use the new method tend to perform better, but the real reason is that they all manage products that have experienced a surge in market demand during the same period.
Another common situation is when the bars measure something before and after a given event. This is equivalent to the line chart case, where causality is implied by how a line changes around an event; the only difference is that time is collapsed into just two states, before and after, and the chart consists of bars rather than lines. A good example is the comparison of a given metric before and after introducing a new policy, say, a new law intended to reduce vehicle collisions in a given city.
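Here, too, a quick sketch with synthetic data shows how the illusion arises: collisions were already declining month after month for unrelated reasons, so a naive before/after comparison credits the new law with an effect it does not have in this simulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic monthly collision counts over four years, with a steady
# downward trend that has nothing to do with the new law.
months = np.arange(48)
collisions = rng.poisson(200 - 1.5 * months)

law_month = 24  # the law takes effect halfway through
print(f"Mean monthly collisions before the law: {collisions[:law_month].mean():.1f}")
print(f"Mean monthly collisions after the law:  {collisions[law_month:].mean():.1f}")
```

Two bars built from these means show a substantial drop, yet it is entirely the continuation of the pre-existing trend; plotting the full time series would make that obvious.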
In all these examples, I used simple bar charts; however, all variations of bar charts, including stacked, diverging, and grouped bars, can potentially lead to implied causality through similar mechanisms.
Scatter plots
Scatter plots are perhaps the most iconic type of plot when discussing causality. They are often used to show the presence of a strong correlation between two measures, and when such a strong correlation exists, it’s easy to conclude that one thing causes the other. For example, a scatter plot that shows a strong correlation between saturated fat consumption and heart disease incidence can lead the reader to quickly conclude that saturated fats are the main cause of cardiovascular disease.
However, similarly to what we have seen above, other factors can be correlated with fat consumption, so we cannot single it out as the main or only cause. Scatter plots can readily be used persuasively to suggest that more of something (drugs, money, resources, etc.) leads to more (or less) of something else (some performance metric). Whether this is true depends on several factors, above all whether the data come from a controlled, randomized experiment or simply from observational records. In any case, when examining a scatter plot that illustrates a correlation between an intervention and an outcome, we should always ask ourselves whether the relationship could be the result of a spurious association rather than a direct causal link.
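A small simulation makes the point (again, synthetic data and invented effect sizes): a hidden lifestyle variable drives both saturated fat intake and disease risk, and the resulting scatter plot shows a strong correlation with no direct causal link between the two.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Hidden lifestyle variable drives BOTH quantities; neither causes the other.
lifestyle = rng.normal(0, 1, n)
fat_intake = 50 + 10 * lifestyle + rng.normal(0, 5, n)          # g/day
disease_risk = 0.10 + 0.03 * lifestyle + rng.normal(0, 0.02, n)

r = np.corrcoef(fat_intake, disease_risk)[0, 1]
print(f"Correlation between fat intake and disease risk: r = {r:.2f}")
```

Plotted as a scatter plot, these points form a clean upward-sloping cloud (r is around 0.74 here), and the intuitive causal reading would be wrong by construction.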
Maps
Maps can induce implied causality in some very interesting ways. The first one is equivalent to what we have observed with bar charts: a comparison of regions that differ according to a given characteristic or intervention. For example, a map could compare regions that implemented a given policy with those that did not, highlighting that they generally have different outcomes.
Another case is equivalent to what we observed with time series and with bar charts that represent outcomes before and after an event: maps can produce the same effect by placing two maps side by side, or by illustrating the amount of change between two time steps.
There is an additional case that’s unique to maps: spatial proximity to a causal source. The classic example is proximity to a polluting element, like a contaminated river or spring or some other toxic agent. The most famous instance of this effect is John Snow’s cholera map, which shows how deaths clustered around a contaminated water pump.
The cholera map is a case in which the cause depicted on the map was identified correctly, but that was demonstrated only after the pump handle was removed and the cases declined. In general, the mere proximity of a series of events to a suspected source is not proof of a causal effect, not least because we may not know whether those events predate the suspected cause (in other words, spatial proximity without temporal ordering is not enough of an indicator).
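If you have the raw data behind such a map, one simple sanity check is to look at the timestamps of the events near the suspected source. Here is a purely illustrative sketch (all names and data are made up): it counts how many nearby events predate the source, which would rule them out as its effects.

```python
import numpy as np

def proximity_check(event_xy, event_day, source_xy, source_start, radius):
    """Count events near a suspected source, and how many predate it.

    Spatial clustering alone is not evidence of causation: events that
    occurred before the source existed cannot have been caused by it.
    """
    dist = np.linalg.norm(event_xy - source_xy, axis=1)
    near = dist <= radius
    predate = near & (event_day < source_start)
    print(f"Events within radius: {near.sum()}, "
          f"of which {predate.sum()} predate the source")

# Hypothetical data: 200 events clustered around a source that
# only appeared on day 100.
rng = np.random.default_rng(7)
proximity_check(
    event_xy=rng.normal(0.0, 1.0, size=(200, 2)),
    event_day=rng.integers(0, 365, size=200),
    source_xy=np.array([0.0, 0.0]),
    source_start=100,
    radius=1.0,
)
```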
Interlude … Want to get better at spotting misleading data and developing data thinking skills? Join my Rhetorical Data Visualization course: 5 video lectures, 6 live meetings, plenty of hands-on practical work. If you'd like to learn more, schedule a 15-minute meeting with me, and I’ll walk you through it.
Generalization
If you think about it, a few basic mechanisms recur across the charts we have covered. For example, all charts can imply causality by comparing two groups that differ by a given factor. Are there general mechanisms that we can identify and apply to all charts? The answer is yes. Here are the four main patterns I have identified:
Factor: The causality effect derives from the comparison of two or more elements that differ by one factor of interest.
Event: The causality effect derives from the comparison of the state of the world before and after an event takes place.
Covariation: The causality effect derives from two quantities changing in related ways.
Proximity: The causality effect derives from the proximity to a given source that affects the outcome.
Notice that each of these patterns corresponds to a specific combination of variable types:
Factor → Category(ies) + Quantity
Event → Time + Quantity
Covariation → Quantity + Quantity
Proximity → Space + Quantity
This generalization enables us to consider implied causality for all types of charts. If a chart contains one of these combinations and shows variation according to a factor, an event, covariation, or proximity, then there is potential for an implied causality effect.
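To show how operational this generalization can be, here is a minimal sketch that encodes the four patterns as combinations of variable types and, given the variables a chart combines, reports which implied-causality effects to watch for (the names and the typing scheme are my own invention, not an established taxonomy):

```python
from collections import Counter

# Each pattern arises from a specific combination of variable types.
PATTERNS = {
    "factor":      Counter({"category": 1, "quantity": 1}),
    "event":       Counter({"time": 1, "quantity": 1}),
    "covariation": Counter({"quantity": 2}),
    "proximity":   Counter({"space": 1, "quantity": 1}),
}

def implied_causality_risks(variable_types):
    """Patterns a chart combining these variable types could imply."""
    have = Counter(variable_types)
    return [
        name for name, need in PATTERNS.items()
        if all(have[t] >= n for t, n in need.items())
    ]

print(implied_causality_risks(["category", "quantity"]))  # ['factor']
print(implied_causality_risks(["quantity", "quantity"]))  # ['covariation']
print(implied_causality_risks(["space", "quantity"]))     # ['proximity']
```

A line chart of one quantity over time would flag “event,” and adding a second quantity would flag “covariation” as well.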
—
Let me know what you think about this idea by leaving a comment below.
Thanks for reading!