This idea hit me the other day. Isn’t it funny that we complain about how untrustworthy LLMs are at generating answers to our questions while, at the same time, we seem to take anything coming from humans pretty much at face value as long as it is supported by data? I am not writing this to defend LLMs. I am more concerned with how casually we seem to accept numbers and statistics as long as they come from sources we trust (or think we should trust).
I had the opportunity to think more deeply about this idea while developing my Rhetorical Data Visualization course, specifically in two parts of it: one where we review three data projects from very authoritative data graphics teams, and one where we critically assess the meaning and trustworthiness of development data from the hugely popular Our World In Data, the leading reference for anything related to country-level development statistics. In both exercises, we look at data projects through a critical thinking lens. In the first, we review the evidence in support of the article’s main argument. In the second, we try to understand what the data really means and how it was collected.
All these steps made me realize something important. Every time we absorb information based on data, we participate in a long “trust chain” that starts with data collection and ends with various forms of “digested” data messages. Let me give an example. Researchers often rely on data gathered in pre-existing databases. Then, they publish papers. Somebody starts from these papers and builds a report. A data journalist uses the report to write an article, and the article reaches the reader, who forms a specific idea or mental model of the issues based (in part) on the data graphics developed by the newsroom’s graphics team. Do you see how many levels of indirection there are? How do we know how much scrutiny each actor applied at each level of this pipeline? We don’t. We mostly have to rely on trust.
My experience teaching the course suggests that problems can arise everywhere. In fact, the course starts very early on with the idea of “data-reality gaps” (a term I borrowed from Ben Jones): gaps that may exist between what we think the data represents and what it actually represents. These gaps originate at the level of data generation, and if the actors in between are not careful enough, they are inherited across the subsequent steps.
In my course, I use a four-step pipeline: data generation, data transformation, data representation, and contextual factors. Each of these steps influences the stories we tell and the conclusions we draw. It would be great if we could trust anything produced with data just because it’s based on numbers. However, the reality is that data messages are very malleable, and at each step, different actors can influence what is perceived in the following steps. People may not have the time or skills to evaluate the whole chain. Even worse, some steps may be so buried in a myriad of complications that even the goodwill of a skilled person may not be enough. A perfect example is the exercise I mentioned above, where my students choose a data set from Our World in Data and try to identify data-reality gaps. More often than not, it’s impossible to fully assess the data because there is not enough documentation about how it was collected and integrated from many disparate sources. One has to put a lot of faith in the numbers.
I don’t know what the solution to this problem is. On the one hand, we can’t expect everyone to always check everything, nor can we expect everyone to have the skills to check the integrity of all the steps involved in producing a data visualization. On the other hand, we need to make people aware of this problem and provide learning materials and experiences that increase awareness and build skills. At a minimum, data professionals involved in producing data communication artifacts should be more cautious and aware. My experience reviewing many journalistic pieces is that we use data too casually. Without naming names, even top-class newsrooms often publish data graphics that are so slanted and limited that they make me question the whole enterprise. It does not have to be like that; there is no reason to be overly pessimistic. But it is essential to talk more about this problem and to provide educational experiences that prevent it.
What do you think? Have you ever thought about this problem? Have you ever realized how many different entities we need to trust every time we consume information derived from numbers and statistics?
P.S. If you are interested in my course and want to learn how to think critically about data and data visualization, leave a comment below, and I will add you to the course mailing list.
Hi Enrico,
Good that you describe the chain of trust in data and that it can go wrong at every step. However, you omit a couple of steps right at the start that may have a lot of (hidden) influence on what you see at the end in a paper or news article, such as the funding (and the rules that came with it) provided to collect the data in the first place. See the framework of Heather Krause: https://weallcount.com/the-data-process/
Hi Enrico, what a great article (again). I come across so many people who take their data at face value, as if the data itself were enough to judge its usefulness. Trying to understand why data was collected, by whom, and how is fundamental. But it is also a lot of hard work, as you mentioned. I would love to learn more about this, so please add me to your mailing list.
P.S. Thanks for these inspiring articles!