Asking the right data questions and asking the data questions right
Visualization problems are often data thinking problems
In my last post I lamented the fact that there is too little focus on data thinking in data science and data visualization, and I described how much of a struggle it’s been for me to figure out how to learn and teach these skills.
When I teach my academic visualization course, the thing that most exemplifies this need is the way I struggle to teach how to formulate good data questions, that is, the questions you want to answer with your data and their corresponding visualizations.
When I teach my visualization course, I spend a considerable amount of time talking about “data questions”. I tell my students that it does not really matter how good they are at creating visual representations if they do not have the right information and do not ask the right questions. The way I like to teach this is that good data visualization depends on having the right questions → producing the right information and then (and only then) → having the right visual representation.
But, how do you come up with good questions? And, from the perspective of an instructor, how do you teach students to ask the right questions?
Asking the right questions is about asking the questions that are needed to reach the goal you formulated for your project. So, if your goal is to shed light on a given phenomenon, you want to make sure you ask the questions that do shed light on that phenomenon. In particular you want the questions to be pertinent and exhaustive. Pertinent means that there is a good match between your goal and the questions you ask. Exhaustive means that they cover a set of aspects that is broad enough to give a full picture of that phenomenon.
But what I want to focus more on here is asking the question right. What I found absolutely crucial is to guide the students in asking questions the right way.
There is an assignment where my students have to come up with their own questions starting from a given data set. What is really interesting is to notice how hard it is for them to come up with something that is specific enough. When you ask people to come up with questions they want to answer with their data, they come up with questions that are either impossible to understand or so underspecified that they could be interpreted in many different ways. If you think about it, it’s an interesting process because sooner or later these vague questions will need to be transformed into something that is almost infinitely precise, that is, some actual data derived by some data processing and statistical procedures. If the question is ambiguous, how do you know whether the information extracted from the data is meaningful or not?
The first time I saw this problem described clearly in the context of data visualization was in Danyel Fisher and Miriah Meyer’s nice little book “Making Data Visual”. There is an early chapter in the book called “From Questions to Tasks”, which is a little gem. In that chapter Danyel and Miriah talk about the need to pay attention to how we translate domain questions into actual tasks. This step is related to what in experiments is called “operationalization”, the translation between the concepts you want to capture and the measures you can actually implement to measure those concepts. And the problem is that there is always a gap between the construct and the measure, and being aware of that gap is super important.
But let’s go back to my personal experience with teaching these skills in Data Visualization.
There are a few things I noticed lately about how to teach this skill. The first assignment I give to train this skill is one where I provide a data set and a set of questions that I purposely designed to be ambiguous. For the first one I show how I transform the question from one that is ambiguous to one that is so specific that it can be unambiguously computed from the data. For the rest I ask students to do the same on their own. Let me show you an example. In the course we use the NYC Vehicle Collisions dataset, which collects information about collisions in NYC, including time of collision, location, etc. One question we analyze is the following:
“How did the situation change over time?”
If you analyze the terms used in this sentence you quickly realize it needs some disambiguation before it can be answered with data.
What is the meaning of “situation”? And what is the meaning of “over time”? Some precision is needed here!
So, situation can be interpreted as “number of collisions” or maybe “number of people injured” or maybe something else. And “over time” can be different time spans and at different levels of granularity (daily, weekly, monthly, etc.)
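To make the point concrete, here is a minimal sketch of two of the readings mentioned above, both at monthly granularity. The records are invented toy values, not the real NYC data, and the function names are hypothetical, chosen only for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical toy records standing in for the collisions data:
# (date of collision, number of people injured). All values are invented.
collisions = [
    (date(2020, 1, 5), 0),
    (date(2020, 1, 20), 3),
    (date(2020, 2, 2), 0),
    (date(2020, 2, 10), 0),
    (date(2020, 2, 25), 1),
]

def monthly_collisions(records):
    """'Situation' read as the number of collisions, per calendar month."""
    counts = Counter()
    for day, _injured in records:
        counts[(day.year, day.month)] += 1
    return dict(counts)

def monthly_injuries(records):
    """'Situation' read as the number of people injured, per calendar month."""
    totals = Counter()
    for day, injured in records:
        totals[(day.year, day.month)] += injured
    return dict(totals)

print(monthly_collisions(collisions))
print(monthly_injuries(collisions))
```

In this toy data the number of collisions goes up from January to February while the number of people injured goes down, so the same vague question gets opposite answers depending on how “situation” is read.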
What is interesting about this exercise is that it boils down to doing some sort of lexical semantic analysis of the questions. The key part is to read each word and ask yourself: what do I mean by that? Is it clear enough?
Last week I had an illuminating experience regarding the value of this exercise. I met with a group of students to give feedback about their group project. As most students do, they started explaining why they use a given visualization and what else they thought could work, when I suddenly stopped them and said: “wait, wait, wait, ... what’s the question?” So we started working on the actual question and the students were struggling. I asked them to write down their question in an editor and they shared it with me on Zoom. Then I started asking questions that led them to revise the phrasing of the question multiple times. After 4-5 iterations, we eventually landed on a version we were all happy with and magically everything fell into place. Bang! Beautiful.
What is really revealing is that once we agreed on a well-designed data question, the students could immediately come up with a brilliant idea on how to visualize the data! Which makes me think that once the question is well designed, the rest is much easier to come up with. I have experienced this over and over again: many data visualization problems are not really visual representation problems; they are more about defining more precisely what question you have and what information you need to answer that question. Many data visualization problems are really data question problems, and there’s no amount of “graphical massaging” that is going to solve them if the questions are not well defined.
I am going to end this with suggesting a tentative procedure to use when designing and evaluating a visualization that is based on data questions:
1. Write down the question. (Writing is crucial! Also make sure it is formulated as an actual question with a final question mark.)
2. Analyze verbs and nouns and ask yourself whether there is ambiguity in their interpretation. If yes, revise and try again.
3. Keep going until you are satisfied with the result.
The trick is to ask yourself: if I give this question and the data to somebody else, would this person be able to produce the data that answer this question without me explaining it further? A useful variation that can be used in teams is to have one person create the questions and see if the others can interpret them correctly. If not, some refinement is needed.
Another good thing about data questions is that they are an excellent way to evaluate your visualizations! Once you come up with a solution you can verify it yourself or test it with others: you show the visualization, ask the question, and see if they can answer it with your visualization. If not, it probably means you need to work on it a little bit more.
That’s all for now. Let me know what you think!
The problem itself, as it is framed here, is kind of out of focus. The correct question usually arrives thousands of hours after you started looking into the data. And, in order to get the right questions you need to:
- have domain-specific knowledge
- have a sound statistical background
That is it.