Shape the Data, Shape the Thinking
A mini series on the impact of data transformation on data visualizations and data thinking
(This is the first post of a series I am preparing on the effects of data transformation in visualization.)
UPDATE: At the bottom of this post, I added links to the series posts as they are published in the newsletter.
Over the last two weeks, I have been covering data transformation in my class. I have always felt that data transformation is one of the most essential topics in visualization and, at the same time, one of the most overlooked. Many of the things we care about in visualization depend on how we transform the data, so much so that I am now more convinced than ever that we tend to focus too much on visual representation and too little on data transformation.
Data transformation includes operations like selecting which variables to use, aggregating the data, computing aggregate statistics, deciding on the level of granularity of some values, etc. All these decisions have a significant impact on 1) what visual representations are available, 2) what patterns and trends will be visible in these visual representations, and ultimately, 3) what inferences will be possible to make and what type of inferences readers will be naturally inclined to make.
I have been mulling this over on and off for many years, and over the last few weeks I have been racking my brain to find a good angle to convey what I had in mind. In the end, I am pretty happy with what I came up with. I have a structure that brings together classes of transformations and the potential issues that can stem from the choices one makes when applying them. The issues are all framed as problems of inference and interpretation, mostly borrowing from the literature on statistical fallacies.
I like this framing because it ties choices the designer has to make to potential reasoning issues that may stem from these choices.
Speaking of choices, in this learning module, I stress that you must make choices and that those choices influence what information and messages readers extract from your visualization. Some choices make people more likely to draw the wrong inferences from the data you show, and knowing when and how this can happen is a very powerful tool in your toolbox.
Since this can get quite long, I decided to structure this into a series of posts. It’s also an excellent opportunity for me to experiment with a new format, which, hopefully, you are going to like (let me know what you think!).
Here is a preview of what I plan to cover in the next few posts:
Variable selection: The first step with any data set is to decide which variables (fields) to include in the visualization. Decisions here can lead to entirely different perspectives and messages, and I feel we do not talk enough about what the effect of these decisions can be.
Summary statistics: A large share of visualizations rely on summary statistics such as counts, sums, averages, etc. Knowing the effect of using one statistic over another is a fundamental skill to avoid misinterpretation and to enable effective reasoning.
Filtering and ranges: Often, in visualization, we need to focus on subsets of the data to gain clarity and reduce complexity. However, as we simplify and reduce information, we run the risk of inducing faulty inferences. Reducing information while keeping integrity is a balancing act that requires careful consideration of what to include and exclude.
Granularity: We also have to decide on the right level of granularity at which to present information. Time and space are the most common attributes that can be presented at different levels of granularity. Geographical locations can be aggregated into progressively larger or smaller areas. Similarly, time can be presented at the level of seconds, minutes, days, months, etc. Here, again, different choices can have a massive effect on what will be visible in a given visualization.
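To make the points about summary statistics and granularity a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption made up for illustration: the sales figures are invented, and summarize_by_month is a hypothetical helper, not something from the class material. It aggregates the same daily values to a monthly granularity twice, once with the mean and once with the median.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical daily sales figures (invented numbers, for illustration only).
daily_sales = [
    ("2024-01-03", 100),
    ("2024-01-10", 110),
    ("2024-01-17", 90),
    ("2024-01-24", 900),  # a single unusually large day
    ("2024-02-07", 105),
    ("2024-02-14", 95),
    ("2024-02-21", 100),
]


def summarize_by_month(rows, statistic):
    """Group (ISO date, value) rows by month and apply one summary statistic."""
    groups = defaultdict(list)
    for date, value in rows:
        groups[date[:7]].append(value)  # "YYYY-MM" is the chosen granularity
    return {month: statistic(values) for month, values in groups.items()}


# Same data, same granularity, two different summary statistics.
monthly_mean = summarize_by_month(daily_sales, mean)
monthly_median = summarize_by_month(daily_sales, median)

for month in monthly_mean:
    print(month, "mean:", monthly_mean[month], "median:", monthly_median[month])
```

With these invented numbers, a chart of monthly means would suggest that January was about three times stronger than February, while a chart of monthly medians would show two nearly identical months. Neither chart is wrong, but each invites a different inference, and the single large day that drives the gap disappears from view once the data are aggregated by month.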
In the following posts of the series, I will cover these four aspects of data transformation, their impact on visual representation, and, ultimately, the type of reasoning different solutions can induce or promote.
Stay tuned for more information, and let me know if you find this useful!
Posts of the series
Stay tuned for more to come!