If you dabble in data analysis and visualization long enough, you will inevitably come across the problem of “misleading visualization.” The literature is vast, fun, and interesting. I have contributed to it myself through scientific research, many posts in this newsletter, and, more recently, courses that teach how to detect misleading visualizations.
One problem I see with all these approaches is that they focus almost exclusively on teaching consumers of data visualizations how to spot problems; they rarely address producers, who could avoid many of these problems in the first place. It’s important to recognize that 1) developing truthful data visualizations is really hard, and 2) many issues with misleading data visualization stem from a limited ability to reason effectively with data.
As a first step toward teaching data professionals how to create more truthful data visualizations, I developed an initial set of 10 rules for creating truthful data visualizations. This is a work in progress, and I am not yet offering many details. My goal here is to gather your feedback and see if this resonates with you. Let me know what you think!
1. Develop an attitude
This is absolutely crucial. If you don’t start with an explicit intent to be truthful and skeptical, you will sooner or later be fooled by data. Skepticism is particularly important when directed towards yourself: it is remarkably easy to be fooled by data, and just as easy to believe something simply because the data seems to indicate it. After many years of conducting data analysis and research, I have learned to be extremely skeptical. Most large effects are either obvious or due to data errors, so my prior is always that something else explains the trend I am observing.
2. Understand the domain
It’s so tempting to dive into the data and start drawing conclusions. But the reality (and probably the most important secret of data analysis) is that drawing knowledge from data requires domain expertise. There is simply no way a person without domain knowledge can develop the same interpretive and evaluative skills as someone who understands the reality described by the data. Remember: we work with data because we are interested in the reality it (often partially and coarsely) represents. Yeah … dabbling with numbers and code is fun, but what matters are the real-world facts and the actions that stem from them, not the numbers.
3. Understand the data
This is so often overlooked. Maybe it’s the most troubling problem I see. People tend to trust the data they find or receive from others. I have an exercise in class where students must select data from Our World in Data, a reputable source, and conduct research to understand where the data originates, how it was collected, and potential limitations associated with it. It’s always surprising to see how, even from such a reputable source, it’s hard to verify how the data was created, and it’s not rare to find several issues. This is important because there is simply nothing you can do at the data processing and visualization level to correct for problems that exist in the data. This is a classic case of garbage in, garbage out. If your data is not reliable or, more importantly, you have a wrong mental model of what it represents, you are at high risk of misrepresenting reality, no matter how careful you are with your visual representations.
4. Use appropriate numbers and calculations
You might ask, “What do you mean by appropriate?” When you work with data, the output of the process is (implicitly or explicitly) a series of “data facts”: statements that represent your interpretation of the patterns and trends you extracted from the data. This is often reflected in the titles and accompanying text you write, as well as in the spoken words you use when presenting the results. Your interpretation, however, can be incorrect, and this is where insidious gaps can reside. So, appropriate numbers are those that accurately reflect your interpretations. If you think a number represents a given concept when in fact it does not, you are in trouble.
Imagine you want to analyze the NYC vehicle collision data set to determine the “most dangerous areas to drive in NYC.” You could use the number of collisions or the average number of people injured in each zip code area. However, if you use the number of collisions, you are in trouble: those numbers depend on how many cars circulate in a given area, so they capture how trafficked an area is rather than how “dangerous” it is. Virtually all numbers have a gap between the concept you want to capture and what they actually measure, and being aware of these gaps is essential to producing truthful visualizations.
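To make the gap concrete, here is a minimal sketch of raw counts versus an exposure-adjusted rate. The column names and the traffic figures below are made up for illustration; the real collision data set does not come with an exposure measure, which is precisely the difficulty.

```python
import pandas as pd

# Hypothetical example: raw collision counts vs. an exposure-adjusted rate.
# Zip codes, counts, and traffic volumes are invented for illustration.
collisions = pd.DataFrame({
    "zip_code": ["10001", "10002", "10451"],
    "n_collisions": [1200, 300, 450],
})
traffic = pd.DataFrame({
    "zip_code": ["10001", "10002", "10451"],
    "vehicle_trips_per_year": [4_000_000, 500_000, 600_000],
})

df = collisions.merge(traffic, on="zip_code")

# Raw counts answer "where do most collisions happen?"
# (which largely reflects traffic volume).
ranking_by_count = df.sort_values("n_collisions", ascending=False)

# A rate per trip is closer to the concept of "danger for a driver".
df["collisions_per_100k_trips"] = (
    df["n_collisions"] / df["vehicle_trips_per_year"] * 100_000
)
ranking_by_rate = df.sort_values("collisions_per_100k_trips", ascending=False)

print(ranking_by_count[["zip_code", "n_collisions"]])
print(ranking_by_rate[["zip_code", "collisions_per_100k_trips"]])
```

Note how the two rankings can disagree: the area with the most collisions is not necessarily the one with the highest rate per trip.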
This problem becomes even more severe when the numbers we use are the product of complex calculations. So, while complex calculations and statistics are essential for deriving signals from noisy data, it is also crucial to be aware of the numerous limitations these numbers have.
5. Aggregate mindfully
In visualization, we often have the choice to represent the data as statistical aggregates or as individual data points. We can display each vehicle collision on a map or simply count the number of collisions in each zip code area. Aggregation is useful because it reduces clutter and often makes it easier to observe trends and make comparisons. But it also necessarily hides information. There is a constant tension between these two needs: transparency vs. clarity. Higher granularity leads to greater transparency, but often results in less clarity; lower granularity leads to less transparency but more clarity. There is no fixed rule on how to strike a balance between the two, but my experience suggests that people often err on the side of too much, or too casual, aggregation. Exploring data at the lowest level of granularity possible is essential, at the very least as a sanity check. Your highly granular charts may not end up in your slides, but having explored them gives you the peace of mind that you have checked what’s behind those aggregations.
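As a sketch of what checking what’s behind an aggregation can look like in practice (the file name and column names below are assumptions, not the actual schema):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file and column names, for illustration only.
df = pd.read_csv("nyc_collisions.csv", parse_dates=["crash_date"])

# Aggregated view: collisions per zip code (clear, but hides detail).
per_zip = df.groupby("zip_code").size().sort_values(ascending=False)
per_zip.head(20).plot(kind="bar", title="Collisions by zip code (top 20)")
plt.show()

# Granular view: every collision as a point (cluttered, but transparent).
# Even if it never reaches the final slides, it can reveal outliers,
# duplicated records, or geocoding errors that the aggregate hides.
df.plot(kind="scatter", x="longitude", y="latitude", s=1, alpha=0.1,
        title="Individual collisions")
plt.show()
```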
6. Disclose uncertainty (with reason)
There are multiple sources of uncertainty when dealing with data: uncertainty in how the values have been recorded, uncertainty in the summary statistics you use, and uncertainty in the interpretations and explanations you provide (to yourself and others). Being aware of these sources of uncertainty is the first step. You have to be conscious of the fact that data contains errors and inaccuracies, statistics are not exact numbers, and alternative explanations may exist for the same phenomena you observe in the data. Once you become aware of all these sources of uncertainty, you must decide how to disclose them to your readers. Jessica Hullman has a very interesting paper titled “Why Authors Don’t Visualize Uncertainty” on this very problem. She interviewed several professionals and asked them how they show uncertainty in their visualizations, and the results are very revealing: many feel that exposing uncertainty dilutes the message. How do YOU want to deal with this problem? This is harder than it may seem, because communicating all sources of uncertainty can become a daunting task and completely overwhelming for your audience. So, this is another area where a sensible trade-off is needed.
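One lightweight way to disclose at least the statistical part of the uncertainty is to show an interval around a summary statistic instead of a bare point estimate. A minimal sketch with synthetic data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Synthetic data: injuries per collision in three made-up areas,
# with very different sample sizes.
areas = ["Area A", "Area B", "Area C"]
samples = [rng.poisson(lam, size=n)
           for lam, n in [(0.4, 800), (0.5, 120), (0.45, 60)]]

means = [s.mean() for s in samples]
# Standard error of the mean; smaller samples get wider intervals.
sems = [s.std(ddof=1) / np.sqrt(len(s)) for s in samples]

plt.errorbar(areas, means, yerr=[1.96 * se for se in sems],
             fmt="o", capsize=4)
plt.ylabel("Mean injuries per collision (95% CI)")
plt.title("Point estimates alone would hide how uncertain the small areas are")
plt.show()
```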
7. Segment your data
When we observe a trend in a visualization, we are often tempted to take note of it and move on to something else. But when we do that, we miss the opportunity to use one of the most powerful moves in data analysis: data segmentation. Segmentation means that when you see a pattern, you check whether the same pattern holds in specific subsets of your data. When I create a line chart showing the number of collisions by hour of the day, I also want to explore what this trend looks like in different geographical areas, seasons, vehicle classes, and other factors. These are all examples of “segmentations” that can reveal very relevant information. I have a personal habit of always segmenting the patterns I expose by potentially meaningful variables. Many trends change when you look at specific segments of the data, and exploring them mindfully is part of the work necessary to avoid fooling yourself and others.
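In code, a segmentation is often just one extra grouping key. A sketch, assuming hypothetical file and column names (`crash_datetime`, `borough`):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file and column names, for illustration only.
df = pd.read_csv("nyc_collisions.csv", parse_dates=["crash_datetime"])
df["hour"] = df["crash_datetime"].dt.hour

# Overall pattern: collisions by hour of day.
df.groupby("hour").size().plot(title="Collisions by hour (all boroughs)")
plt.show()

# Segmented pattern: the same trend, one line per borough.
(df.groupby(["hour", "borough"]).size()
   .unstack("borough")
   .plot(title="Collisions by hour, by borough"))
plt.show()
```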
8. Ask yourself: “Compared to what?”
Legendary economist Thomas Sowell wrote in Economic Facts and Fallacies, “Virtually nothing is going to be equally beneficial for all people, or equally detrimental. The real question is always: ‘Compared to what?’” In data visualization, we can use a similar principle. Whenever we see a trend and derive a conclusion from it, it’s useful to ask ourselves, “Compared to what?” Data is virtually always partial, and we are tempted to focus exclusively on what is under our eyes rather than on what is missing. If we compare a set of objects, a large difference may appear very relevant within that set, but meaningless when viewed in the context of a much larger set of objects. Similarly, if we analyze a temporal trend for a specific time period, a shift may appear dramatic within that time horizon, but insignificant on a much longer timespan. Being mindful that all comparisons are limited by the objects we have, as opposed to those we could have included, is a good habit and often helps put data in a much broader context.
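One way to build this habit into the workflow is to plot the zoomed-in view next to a wider reference frame. A sketch with synthetic data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic monthly series: ten years of a slowly drifting metric.
months = np.arange(120)
values = 100 + 0.05 * months + rng.normal(0, 2, size=120)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))

# Last 12 months in isolation: month-to-month swings dominate the picture.
ax1.plot(months[-12:], values[-12:])
ax1.set_title("Last 12 months")

# In the full history, the same swings are small relative to the long-run trend.
ax2.plot(months, values)
ax2.axvspan(months[-12], months[-1], alpha=0.2)
ax2.set_title("Full ten-year history (last 12 months shaded)")

plt.tight_layout()
plt.show()
```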
9. Scale visuals mindfully
Visualization is all about mapping abstract numbers onto perceivable graphical properties of objects: size, length, color, and so on. The rules we use (or our tools use as defaults) to map values to visual properties have a strong impact on what we perceive in a chart. Because of that, it is crucial to be mindful of the different ways data can be scaled when mapped to visual properties. Many of the most well-known “misleading visualization” examples are scaling problems at their core. Truncated axis? A scaling problem. Inappropriate area mapping? A scaling problem. Misleading dual-axis charts? Another scaling problem. Often, choosing an appropriate scale is not as simple as avoiding blatant distortions: some patterns are easier to discern under one scaling than another. For example, there are cases where scaling data nonlinearly or truncating an axis reveals important trends that would otherwise be difficult to discern. This is another example of the idea that there are no hard rules, just trade-offs that depend on the specific problem.
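A minimal sketch of the trade-off, using synthetic data: the same series plotted on a linear and on a logarithmic scale.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic series: exponential growth with multiplicative noise.
rng = np.random.default_rng(1)
x = np.arange(50)
y = 10 * np.exp(0.1 * x) * rng.lognormal(0, 0.1, size=50)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))

# Linear scale: late values dwarf early ones; early structure is invisible.
ax1.plot(x, y)
ax1.set_title("Linear scale")

# Log scale: a constant growth rate shows up as a straight line,
# and relative changes are comparable across the whole range.
ax2.plot(x, y)
ax2.set_yscale("log")
ax2.set_title("Log scale")

plt.tight_layout()
plt.show()
```

Neither view is “the truthful one” in the abstract; which scaling is appropriate depends on the data fact you want to communicate.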
10. Use truthful titles
Titles are often the part of a visualization most prone to overclaiming. The need to be succinct, paired with the desire to capture people’s attention, often leads to titles that go beyond what can be inferred from the data. Titles, however, are incredibly important. Research shows that people devote a significant amount of their attention to titles, and often titles are the only thing people remember from a chart. At the same time, titles are also where many authors commit the sin of oversimplifying. One possible solution is to use only titles that describe what the chart is about, but this often results in very boring titles. Titles can also be more interpretive, but then it’s important to ask yourself, “Does the data actually show that?” Another option is to use questions as titles. Questions draw people in and stimulate curiosity, and I think they should be used more often.
That’s all for now, folks. Let me know what you think. Do these guidelines resonate with you? Is there anything you’d like to add? Let me know in the comments below.
💡 Hey … I have a course to teach you these skills!
If you're interested in learning these skills with me, consider signing up for the upcoming cohort of my Rhetorical Data Visualization course. You’ll meet with me and other students for a total of six live online workshops. The course includes:
Recorded video lectures to watch at home
Quizzes to test your knowledge
Six live meetings with hands-on activities
A final project
👉 Book a call with me. I’d be happy to provide you with a comprehensive preview of the course and address any questions you may have. I’d love to learn more about you.
P.S. The dates are tentative, so if they don't work for you, please let me know and I’ll try to make accommodations.
All great points, nicely organized and clearly explained. Two expressions that capture concerns about data's provenance, interpretation, and selective focus are: "data is never raw" and "data are made, not found".
Excellent list!
My reflection is that each point is essentially about applying the scepticism you mention in the first rule. At each stage you're testing out whether the chart could, in fact, say something different.
Also love the idea of using questions in titles to be more equivocal whilst still being engaging.