14 Comments

Reminds me of Evan Miller's post "How Not To Sort By Average Rating": https://www.evanmiller.org/how-not-to-sort-by-average-rating.html
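For reference, the approach that post recommends is ranking by the lower bound of the Wilson score confidence interval on the positive-rating proportion, rather than by the raw average. A minimal sketch (function name and example numbers are mine):

```python
from math import sqrt

def wilson_lower_bound(positive: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a proportion.

    Ranking items by this value, instead of by positive/total,
    penalizes items that have only a handful of ratings.
    """
    if total == 0:
        return 0.0
    phat = positive / total
    return (
        phat + z * z / (2 * total)
        - z * sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
    ) / (1 + z * z / total)

# A single positive rating looks "perfect" by raw average (1.0),
# but ranks well below 90 positives out of 100.
print(round(wilson_lower_bound(1, 1), 2))      # ~0.21
print(round(wilson_lower_bound(90, 100), 2))   # ~0.83
```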

Great post!!! Thanks for sharing it.

In the "Data Relationship Description model", I use 5 types of "Metrics" and it is very similar to your type of aggregation.

Sum – the total measurement or count of items:

The total salary, in $

The income size in thousands of $

The number of students

Percentage – a part of the total, measured between 0 and 100 percent:

The percentage of overtime

The percentage of clients who leave

The percentage of struggling students

Difference – the gap between two metrics, expressed in numbers or as a percentage:

The gap between income and expenses

The gap between new clients and clients who leave

The size of the deviation from the average number of students in the class

Calculation – the result of any calculation involving several metrics, except "difference":

The number of employees per square mile

The number of sales per work hour

The average number of students per class

Function – assigning a quantitative value to a certain phenomenon:

The employee satisfaction score, between 1 and 10

The correlation between the size of the product and the number of sales, between -1 and +1.

The level of density in the class, between 1 and 5
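A minimal sketch of how each of these five metric types could be computed, assuming a toy pandas DataFrame; the column names and numbers below are invented for illustration:

```python
import pandas as pd

# Toy data: one row per class of students (all values invented).
df = pd.DataFrame({
    "class": ["A", "B", "C"],
    "students": [25, 32, 18],
    "struggling": [3, 8, 2],
    "room_sqm": [50, 55, 40],
})

# Sum – total measurement or count of items
total_students = df["students"].sum()

# Percentage – part of the total, between 0 and 100
pct_struggling = 100 * df["struggling"].sum() / total_students

# Difference – the gap between two metrics (here, deviation from the mean)
deviation_from_mean = df["students"] - df["students"].mean()

# Calculation – a result derived from several metrics (a density here)
students_per_sqm = df["students"] / df["room_sqm"]

# Function – a quantitative value assigned to a phenomenon (a correlation here)
size_vs_struggling = df["students"].corr(df["struggling"])

print(total_students, round(pct_struggling, 1))
print(deviation_from_mean.round(1).tolist())
print(students_per_sqm.round(2).tolist())
print(round(size_vs_struggling, 2))
```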

Thanks, Bella. It's good to see that you have a similar structure to mine! Difference is important too; I do not have it, except as percent change.

Writing this here as a note to self while reviewing the post's content: I should have added "deviation/change" as a type of statistic.

Interesting commentary. I liked the concrete examples. It would be even better if, after demonstrating potential mistakes in naive visualizations of the NYC accident data, the article showed how to do it right.

Thanks for the comment, Stephen. When you say "how to do it," do you mean how to fix the problems?

... see my answer to Daniele's comment. It includes ideas on how to solve some of these problems, if that is what you meant by your comment. I agree that there is definitely room for explaining more about how to overcome these problems!

Really insightful post, Enrico! It is written in a super clear manner, with an excellent ability to provoke thought on the impact of synthesizing information through statistical aggregations.

It might be interesting to delve deeper into the topic of multidimensional synoptic visualization in a subsequent post. Do you think this approach, unlike the single chart discussed in the current post, could give readers a more informed consumption of the underlying information by addressing the specific limitations of a single-chart representation, which can amplify distortions introduced by statistical aggregation criteria?

Thanks for commenting! Would you mind expanding on the idea of "multidimensional synoptic visualization"? I am not 100% sure I understand what you mean by that.

Certainly, you are correct—my mistake. Allow me to clarify my point. When dealing with statistical aggregation tools, such as means, counts, and normalized ratios, it becomes essential to consider the implications involved. For example, when presenting information like the average number of persons per car accident on a map, the data alone might be misleading due to potential differences in data distributions. Exploring effective solutions to mitigate such implications is crucial. One approach could involve integrating additional interactive visualizations (of types A, B, C, etc.) that are synchronized, in terms of filtering criteria, with the main map. This synchronization may contribute to a clearer understanding of the information at hand.

I hope this clarifies my point.
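One minimal way to prototype this kind of synchronized filtering is with linked (brushed) selections; a sketch assuming Altair 5, with invented accident-like records and hypothetical column names:

```python
import altair as alt
import pandas as pd

# Invented records: accident location plus persons involved per accident.
df = pd.DataFrame({
    "lon": [-73.99, -73.95, -73.92, -73.98, -73.90],
    "lat": [40.73, 40.75, 40.68, 40.71, 40.70],
    "borough": ["Manhattan", "Manhattan", "Brooklyn", "Manhattan", "Brooklyn"],
    "persons": [1, 3, 2, 1, 5],
})

brush = alt.selection_interval()  # rectangular brush on the "map"

# Scatter standing in for the map of accidents.
points = alt.Chart(df).mark_circle(size=80).encode(
    x="lon", y="lat", color="borough",
).add_params(brush)

# Companion view: distribution of persons per accident, filtered by the brush,
# so the spread behind any average computed on the selection stays visible.
histogram = alt.Chart(df).mark_bar().encode(
    x=alt.X("persons:Q", bin=True),
    y="count()",
).transform_filter(brush)

(points | histogram).save("linked_views.html")
```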

You got me thinking now ... my overall point is that we have to look at this from the perspective of the designer/analyst and from the perspective of the reader. The reader needs to be aware of these problems and able to catch them. But analysts and designers need to find ways to overcome the problems. I think this is also what Stephen suggested in his comment here.

Some problems can be avoided simply by not presenting data in certain ways or by using different transformations. For example, for base rate bias one can use a different metric that is not affected by the bias. For averages, one can and should make sure that each element has a sufficient baseline frequency. Other problems require making sure the representation is interpreted correctly. For example, for percentages, it has to be clear what kind of percentage it is.
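As a rough sketch of the "sufficient baseline frequency" point, one could report a group's average only when it is backed by enough observations; the threshold, column names, and numbers below are invented:

```python
import pandas as pd

# Invented example: persons involved per accident, grouped by street.
accidents = pd.DataFrame({
    "street": ["Main St"] * 40 + ["Side Alley"] * 2,
    "persons": [1] * 35 + [2] * 5 + [4, 6],
})

MIN_ACCIDENTS = 10  # arbitrary cut-off for a "sufficient" baseline

stats = accidents.groupby("street")["persons"].agg(["mean", "count"])
reliable = stats[stats["count"] >= MIN_ACCIDENTS]

print(stats)     # Side Alley's mean of 5.0 rests on only 2 accidents
print(reliable)  # keep averages only where the count is large enough
```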

Closer to your suggestion, I think one solution is sometimes to visualize the same data with different metrics in order to avoid faulty interpretations. One can, for instance, show the change over time in both absolute and relative values in two paired plots. The graph here would be a good candidate for doing that: https://www.washingtonpost.com/education/interactive/2023/homeschooling-growth-data-by-district/. The percent change could be paired with a plot that shows the change in the actual values, not the percentages.
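A minimal sketch of that pairing with matplotlib; the district names and enrollment numbers are invented, not the Washington Post data:

```python
import matplotlib.pyplot as plt

# Invented homeschooling counts per district, before and after.
districts = ["District A", "District B", "District C"]
before = [20, 500, 1200]
after = [60, 650, 1350]

pct_change = [100 * (a - b) / b for b, a in zip(before, after)]
abs_change = [a - b for b, a in zip(before, after)]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))

ax1.bar(districts, pct_change)
ax1.set_title("Percent change (%)")          # District A's +200% looks dramatic...

ax2.bar(districts, abs_change)
ax2.set_title("Absolute change (students)")  # ...but it is only +40 students

fig.tight_layout()
plt.show()
```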

In any case, everything starts from awareness. Without being aware of these problems there is simply nothing one can do!

Does my response here address your initial question? I'd be happy to elaborate!

Yes, I believe you understood my initial question. Thank you very much, Enrico.

One aspect that has got me thinking is what "best practices" or "best patterns" could assist designers and analysts in bridging the awareness gap that may exist between them and the "average user/reader."

It would be truly interesting.

I think I should be able to come up with some ... I'll try!
