The Data Questions, Data Answers Model
An initial research agenda to study how good questions can help us become better data visualization designers
I have written about the power of data questions a few times before. Once in this post, which elaborates on what good data questions are. And earlier in my post on how I use “mini-projects” to teach visualization, in the part where I explain how data questions fit into that exercise.
What are data questions? Simple. They are questions we want to have answered with data. Whenever we tackle a problem with data, we fundamentally want to be able to answer some questions. “How did the temperature change last year? How does it compare to previous years? Is the temperature change uniform across all countries?”
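To make the idea concrete, here is a minimal sketch, with purely synthetic numbers, of how the first and third of these questions map onto concrete data operations (the records and values are invented for illustration only):

```python
# Hypothetical toy records: yearly average temperatures for two countries.
# (Synthetic numbers, purely for illustration.)
records = [
    {"country": "Italy",  "year": 2022, "avg_temp_c": 15.1},
    {"country": "Italy",  "year": 2023, "avg_temp_c": 15.9},
    {"country": "Norway", "year": 2022, "avg_temp_c": 6.2},
    {"country": "Norway", "year": 2023, "avg_temp_c": 6.4},
]

def temp_change(records, country, year):
    """Answer: how did the average temperature change in `year`
    relative to the previous year, for one country?"""
    by_year = {r["year"]: r["avg_temp_c"]
               for r in records if r["country"] == country}
    return round(by_year[year] - by_year[year - 1], 2)

# "How did the temperature change last year?" answered per country.
# Comparing the entries also answers "Is the change uniform across countries?"
changes = {c: temp_change(records, c, 2023) for c in ("Italy", "Norway")}
print(changes)  # Italy warmed more than Norway in this toy data
```

The point is not the code itself but the translation step: a question phrased in plain language becomes a specific operation on specific fields, which is exactly what a visualization answering that question must also encode.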
While this is not the only way to think about data analysis and communication, over the years I have found that framing the problem of producing informative data visualizations as the problem of answering data questions is very instructive. Why? For a number of reasons:
It forces us to be explicit about what we want to know
It enables us to verify whether the visualization(s) we produce answer the questions we have
Both are essential, and over many years of teaching I have learned that they offer incredibly useful opportunities for growth when developing data analysis and presentation skills. Two crucial skills are exercised:
Formulating effective questions: When you ask people to formulate data questions, they often come up with questions that are either too ambiguous to be translated into data answers or too specific to be a good match for data visualization solutions. If I ask, “How is global warming affecting Italy?”, I am formulating an important and legitimate question that may well have useful answers through data, but I am still too far from being able to answer it with data: it needs to be translated into something specific that works with my data. Conversely, if I ask, “What is the Italian city with the highest average temperature in 2023?”, the answer is just a single datum: the city with the highest temperature.
Assessing visualization effectiveness: When people decide how to transform the data and develop a graphical solution for a question they have, they have a very large palette of choices to draw from. How do we know if they made good choices? Traditional data visualization theory and practice teach us that we have to choose a solution that is a good match for the data (hence the many chart choosers that exist). This is incredibly limited because it does not help us predict how the reader will interpret the visualization, and it is dominated by the idea that one can choose a good visualization by looking exclusively at what type of data one has. But a good visualization is one that matches the questions one wants to answer (or the message one wants to communicate). When we formulate specific questions, we can examine our visualizations and verify whether the solutions we developed effectively answer the questions we formulated. I have tested this evaluation model in my courses, and it works quite well.
So, formulating good questions and verifying whether the visualizations we develop are a good match seems to be a good way to frame the data analysis and visualization problem.
The problem
All these ideas stem from a specific problem I am confronted with as I design a new version of my Information Visualization course for next semester. How do I teach students to think about the data visualization problem of transforming data into effective visualizations? Traditionally, visualization pedagogy approaches this problem with one or both of these approaches:
Chart choosers: Teaching what types of graphs exist and what type of data is a good match for each. So, when you have a visualization problem, you will ask, “What type of data do I have, and what is the ‘right’ graph for that type of data?” And you’ll choose the one that is a good match.
Theory of visual encoding: Teaching how to deconstruct graphs into their individual parts and encouraging designers to use appropriate and effective channels, that is, channels that are a good match for the type of values one wants to visualize (e.g., it’s not a good idea to visualize quantities with color hue, because we do not perceive hues as quantities) and channels that represent information more precisely (e.g., the vertical or horizontal position of a dot is a way more precise encoding than the area of a circle if I want to communicate a quantity).
Both approaches are centered on the idea that you can decide which visual representation works best by looking only at what type of data you have. But this is not how visualization works. Effective visualizations are visualizations that make the answer to the question you have readily apparent.
So, the problem is, what should I teach instead of, or in addition to, these data-based theories? My answer over the last few years has always been this “data questions” idea (and studying the problem of “visualization affordances,” which I am not going to cover here but covered in this post describing my student Racquel’s research). But after several years of teaching this idea in class, I realize it is not developed enough, and I always end up hand-waving about the details. Yes, we do a lot of exercises in class, where 1) I critique the questions the students create, and 2) I show them mismatches between the questions they formulated and the data visualization answers they developed. But this mostly rests on my ability to perform these two steps, and frankly, I do not have a way to turn it into something more systematic. This is where I think research in this area would be beneficial, and it is the main focus of this post. How can we make progress with this “data questions, data answers” idea?
A research agenda
Maybe the solution to this problem is to do the research necessary to progress in this space. So, here I’ll try to propose an initial research agenda for this problem (if you get to know someone who wants to fund it, let me know!). What do we need in order to make progress in this space? Here are a few tasks I propose.
Study how people ask data questions
From my experience in class, there is a lot to learn by observing and analyzing how people formulate questions when working with data. As a first step, I propose to collect large quantities of data questions and see if it’s possible to a) organize them into a useful taxonomy of data question types and b) identify common pitfalls in formulating effective questions. In fact, we would need first to figure out how to decide if a question is a good question, which is not immediately obvious. If this is done well, by the end of this task, we should be able to teach learners the taxonomy of question types (hopefully helping them to think about questions more systematically), help them think about what is a good question, and teach common pitfalls they can avoid.
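As a sketch of what coding such a corpus might look like, here is one possible data structure. Note that the question types and pitfall labels below are hypothetical placeholders; producing the real taxonomy is precisely what the proposed study would do:

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# NOTE: "qtype" labels and pitfall codes are hypothetical placeholders,
# not results; the actual taxonomy is what the study would produce.

@dataclass
class DataQuestion:
    text: str
    qtype: Optional[str] = None       # e.g., "trend", "comparison", "ranking"
    pitfalls: Set[str] = field(default_factory=set)  # analyst-assigned codes

# Coding a small corpus of collected questions (examples from the post):
corpus = [
    DataQuestion("How is global warming affecting Italy?",
                 pitfalls={"too_broad"}),
    DataQuestion("What is the Italian city with the highest average "
                 "temperature in 2023?",
                 qtype="ranking", pitfalls={"single_datum"}),
    DataQuestion("How does the global surface temperature for 2023 compare "
                 "to previously recorded years?",
                 qtype="comparison"),
]

# Simple statistics the study could report over the corpus:
pitfall_counts = {}
for q in corpus:
    for p in q.pitfalls:
        pitfall_counts[p] = pitfall_counts.get(p, 0) + 1
print(pitfall_counts)  # {'too_broad': 1, 'single_datum': 1}
```

Even this toy encoding makes the two study goals explicit: the `qtype` field corresponds to the taxonomy of question types, and the `pitfalls` field to the catalog of common formulation errors.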
Study how people translate data questions into data answers
This second task also requires an observational qualitative study. Here, we can assign a set of questions and a data set to a group of people and ask them to create visualizations that answer those questions. Once we produce this collection of questions and corresponding data visualizations, we can analyze it to find mismatches between the data questions and the data visualizations. My hope is that if we have enough of them, we will be able to create a taxonomy of data question/data answer mismatches. If we succeed, this can also be used for teaching purposes: such a taxonomy can help learners identify problems when translating their data questions into visualizations.
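A minimal sketch of how the collected question/visualization pairs could be recorded for analysis. The mismatch codes here are hypothetical placeholders; the real codes are what the study would derive:

```python
from dataclasses import dataclass, field
from typing import List

# Sketch only: mismatch codes below are invented placeholders, not
# findings. Deriving the actual mismatch taxonomy is the study's goal.

@dataclass
class Pairing:
    question: str
    chart_type: str                # what the participant produced
    mismatches: List[str] = field(default_factory=list)  # analyst codes

pairings = [
    Pairing("How does the 2023 temperature compare to previous years?",
            chart_type="pie chart",
            mismatches=["encoding_obscures_comparison"]),
    Pairing("How does the 2023 temperature compare to previous years?",
            chart_type="line chart"),
]

# Share of pairings flagged with at least one mismatch:
mismatch_rate = sum(1 for p in pairings if p.mismatches) / len(pairings)
print(mismatch_rate)  # 0.5
```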
Create a data question, data answer model (the DQDA model)
This is probably the hardest part and the one I am least confident about. When I think about the data question, data answer problem, I sense that it should be possible to develop a theoretical model that can help us abstract away from specific guidelines and support more generalizable knowledge and skills. One important problem in this space is to better understand what kinds of questions data visualization helps answer. So far, I have described the problem as one where there is a one-to-one match between a question and a visualization. However, any given visualization can be used to answer many different questions. Also, different visualizations can support answering the same (or a similar) set of questions, and they may differ in terms of which questions they support best. I am not even considering how the effectiveness of a visualization in answering questions depends on choices that go beyond visual encoding and include contextual elements such as emphasis, annotations, titles, captions, grids, legends, etc. I am not sure how to tackle this problem yet, but it seems it should be possible.
If we had the DQDA model available, then we could use it in teaching to directly train people how to think about this problem effectively and creatively.
Assess the validity of the DQDA model
No research proposal is complete without a validation step. In research, we must demonstrate the validity of our ideas. For this reason, another step is to validate these ideas through experimental interventions. Ideally, we should develop experiments that compare state-of-the-art teaching methods with methods that introduce the DQDA taxonomies and theoretical model. We can create three groups of learners and teach group 1 using the chart chooser model, group 2 using the ranking of visual variables model, and group 3 using the DQDA model, and see if we can detect a difference in learning outcomes between the three. (In reality, it would also be useful to look at their combinations, because there are good reasons to teach all of them in a data visualization class; in fact, I believe they complement each other well.)
Regarding how to measure learning outcomes effectively, I am out of my depth, but I am sure that a researcher working in the psychology of learning would be able to propose sensible and reliable metrics and procedures. One challenge I see is designing the experiment in a way that does not artificially favor the DQDA model. Another challenge is how to judge educational outcomes objectively. Again, I am pretty much out of my depth here, but I am sure an education researcher should be able to develop reliable solutions.
Additional interesting questions
There are many other questions that I am not sure how to fit into this agenda. The first is the relationship between data questions and communicative intent, and the second is how to reconcile data questions with task models that are common in data visualization research.
Communicative intent is important because many data visualizations are designed with the intent to communicate something specific to others. This means that visualizations can be designed by asking, “What is the best way to communicate this idea?” My intuition is that questions and communicative intents are just different ways to express the same thing. If my intent is to show that the global surface temperature in 2023 has been higher than in any other year in the recorded data, I can transform this into the question, “How does the global surface temperature for 2023 compare to previously recorded years?” Similarly, it seems possible to translate a question into a communicative intent, mostly by using the answer to the question that the chosen visualization depicts.
Tasks are also related to data questions. In visualization research, we like to talk about which tasks are supported by a given visual representation and what the best visualization technique to carry out a task is. We also like to produce “task taxonomies” to capture the main tasks one wants to perform with either specific types of data or in specific domains. I suspect the difference between questions and tasks might also be just a reformulation problem. It is possible that they are just two different ways to express the same idea. Reusing the same example above, I can translate the question, “How does the global surface temperature for 2023 compare to previously recorded years?” to the task, “Compare the global surface temperature trend of 2023 to the trends of previous years.” I am not completely convinced that it’s just a matter of reformulating these statements, but as a first approximation, it seems reasonable.
Conclusion
That’s all I have to say right now about a potential research agenda on this topic. I hope to receive your feedback and see if I can refine this idea further. I might eventually try to turn it into an actual research project. If you find this idea intriguing or you have something to suggest, please write a comment below. As you can see, this idea is still in its infancy, and I’d love to hear more from my readers. Thanks!
The whole research topic seems really promising, and I was wondering how it could also fit the area of "information discovery".
I mean that it's not uncommon to face real-world situations where the user is not yet ready to think about the right questions to ask, because they have very little information about the context, or very few details are known about the available data.
But that doesn't mean a good information visualization solution isn't useful for identifying "the right data questions" to ask.
Maybe this condition could be a preliminary step of the "Data Question" part of the process.
I will check. Anyway, I am sending you a letter by email.