Data Visualization for Machine Learning
Introducing a new series on how visualization can be used to understand machine learning models and their behavior
(Hello friends! With this post, I am starting a new series. Having completed the Data Transformation for Visualization series, this spring we turn to a new topic: Visualization for Machine Learning. I have been working in this area for quite a few years, and I taught a new course on it at Northeastern a few years ago. The series is organized around the framework I developed when I prepared for that course. I hope you’ll enjoy it. I am nervous and excited at the same time. Let’s go! P.S. As I start this series on visualization for ML, I assume you are familiar with both data visualization and machine learning. If you have a hard time grasping some concepts, please let me know, and I’ll provide more details.)
If you are even vaguely familiar with technology, you must have heard that machine learning (ML) is all the rage right now (note that people often use ML and AI interchangeably even though, in principle, ML is a subset of AI methods). While ML has existed for several decades, recent breakthroughs have made it extremely popular.
In this new series of posts, we will explore connections between data visualization and machine learning. More precisely, I will focus on how data visualization can help practitioners and end-users better understand machine learning data, behavior, structure, and decision mechanisms.
You may ask, “Why use data visualization for machine learning? What is special about data visualization that ML can benefit from?” Also, if you are familiar with data visualization, you may ask, “What is special about ML that requires specific visualization solutions?” Let me answer the two questions separately.
Why use Visualization in ML?
Data visualization is needed because ML models are complex objects increasingly used for applications that impact many people’s lives, directly or indirectly. ML is everywhere. It’s in your phone, in the stores where you shop, in your bank, in your watch, in the computers and devices your doctor uses to take care of you, in the camera you use to take pictures of your family, etc. This is a potent combination: ML is pervasive, operates in sensitive settings, and is very complex. Increasingly complex.
Let me clarify what I mean by complex. The most fundamental feature of machine learning is that 1) the machine learns through a series of examples (often a massive amount of them in modern applications) and 2) the program that is learned is (most often) not intelligible; that is, humans can’t directly observe and comprehend what the model has learned and how it has learned it. In most cases, there is no explicit representation a human can review to understand and predict how a model will behave with new, unseen data. It’s quite amazing if you think about it. We can teach machines to do something without giving them explicit rules. This is a remarkable feature, but it comes at a huge cost: we must either blindly trust what these machines do or find ways to verify what they have learned and how they behave. Since we do not provide explicit rules to the model and its logic is not explicitly encoded, we have to find ways to observe the model that help us draw inferences about its behavior and logic.
In a way, the fundamental problem is one of abstraction. We need to find abstraction layers that translate the language of models into languages humans understand. This, in itself, is not new. Humans often build very complex things that perform complex tasks, and we must find ways to let people interact with such complexity through well-crafted abstractions.
In his legendary “The Design of Everyday Things,” Don Norman explains the problem very well. Even something as simple as a kitchen appliance has complex internal logic and mechanisms that users do not need to know to operate it. However, when designers build interfaces for these appliances, they can make choices that considerably impact how usable and understandable they are.
What is new, however, is the sophistication of the type of decisions and outputs ML systems generate and their potential impact on society. An unusable toaster is not that big of a deal compared to an AI system providing recommendations to medical doctors or hiring managers.
Caruana et al. provide a remarkable example of what can go wrong with ML if a human does not verify what the model has learned. In a landmark paper on intelligible ML models published in 2015 (before the advent of super complex deep learning models!), the authors recount the story of a series of models trained to predict the probability of death of patients affected by pneumonia in a clinical trial. When the researchers trained an intelligible model, they discovered that the model had learned “that patients with pneumonia who have a history of asthma have lower risk of dying from pneumonia than the general population.” This is clearly absurd because asthma is a major risk factor. The researchers later found that patients with asthma were sent directly to the ICU, which in turn gave them a higher probability of survival. This is a good example of what can happen with ML models: they can pick up patterns from the data that do not reflect reality accurately and can lead to dangerous decision logic. For this reason, models often need human supervision, and human supervision needs the careful design of visual representations that make the logic and behavior of models understandable.
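To make the value of intelligibility a bit more concrete, here is a minimal sketch in Python, using scikit-learn and entirely synthetic data. The feature names, effect sizes, and choice of logistic regression are my own illustrative assumptions, not the models from the paper; the point is only that a model which exposes one coefficient per feature lets a human reviewer spot a counterintuitive learned effect such as “asthma lowers risk.”

```python
# Toy illustration (synthetic data, invented numbers, NOT the models from the
# paper): an intelligible model such as logistic regression exposes one
# coefficient per feature, so a counterintuitive learned pattern like
# "asthma lowers risk" is plainly visible to a human reviewer.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000
age = rng.normal(65, 12, n)            # hypothetical patient age
asthma = rng.binomial(1, 0.15, n)      # hypothetical asthma-history flag

# Bake in the confound described in the paper: asthma patients were sent
# straight to the ICU, so their *observed* mortality in the data is lower.
logit = -2.0 + 0.04 * (age - 65) - 1.0 * asthma
died = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

model = LogisticRegression(max_iter=1000).fit(np.column_stack([age, asthma]), died)
for name, coef in zip(["age", "asthma"], model.coef_[0]):
    print(f"{name:>6}: {coef:+.2f}")
# A negative asthma coefficient is the red flag a reviewer can catch here.
```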
Why do we need ML-centric Visualization?
The second question we must address is why we need to study data visualization specifically for machine learning. Can’t we just use what we know about data visualization and apply it to ML? In principle, yes, but ML poses some unique challenges. ML models are complex objects, and it’s not obvious which aspects of model development and use should be visualized. For example, one can visualize the data used to train a model, the model's structural components, or the model's behavior when it’s used in production. In addition, ML models are dynamic objects that respond to inputs and generate outputs (e.g., using the example above, they can take data about a patient and return the probability that the patient will die); they are not just static data. In a way, they are more similar to simulation models, which produce outcomes on demand according to the information and parameters one feeds them.
In turn, this means that visualizing ML models often requires devising non-trivial model probing and querying mechanisms that guide the user towards behaviors of interest. In a way, the data visualization problem we need to solve with ML visualization is not mainly about which visual representation to use, even though this is very important, but about what information to extract in the first place and how to interact with the model so that we can understand how it works and behaves. Models can generate as much data as you want; you “only” have to feed them some information, and they’ll respond with something. Deciding which aspects of a model to investigate is one of the main design decisions one has to make.
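To give a flavor of what such probing can look like, here is a minimal sketch in Python. The random forest, the synthetic training data, and the choice of which feature to sweep are all illustrative assumptions of mine: the sketch holds every input but one at a reference value, sweeps the remaining feature over a grid, and records the model’s predictions, so that the resulting (value, prediction) pairs become ordinary data a line chart can display.

```python
# Minimal probing sketch (hypothetical model and synthetic data): sweep one
# input feature over a grid while holding the others at reference values, and
# record the model's predictions. The (value, prediction) pairs become plain
# data that a line chart or small multiples can display.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X_train = rng.normal(size=(1_000, 3))                # stand-in training data
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

grid = np.linspace(X_train[:, 0].min(), X_train[:, 0].max(), 50)
reference = X_train.mean(axis=0)                     # fix other features at their mean

probes = np.tile(reference, (len(grid), 1))
probes[:, 0] = grid                                  # vary only feature 0
predictions = model.predict_proba(probes)[:, 1]

for value, p in zip(grid[::10], predictions[::10]):
    print(f"feature_0 = {value:+.2f} -> predicted probability {p:.2f}")
```

This is, in essence, a stripped-down version of the familiar partial dependence and individual conditional expectation plots; the real design question is which features, reference points, and ranges are worth probing in the first place.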
For this reason, the main principle I’ll use to organize existing ML visualization techniques in the series is what information these techniques visualize rather than how they visualize it. Accordingly, I plan to organize the content around three main classes of visualizations:
Visualizing ML Data: Every model receives some data as an input and produces some data as an output. What can we learn by visualizing these data?
Visualizing ML Explanations: In ML, there are techniques to create “explanations” of model decisions. How can visualization help us understand and explore these explanations?
Visualizing ML Internals: ML models have an internal structure (and architecture) made of model components. What can we learn by visualizing the behavior of these components?
Of course, this is not carved in stone. I might make changes or add new categories as I develop the series.
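As a small taste of the second class, here is a sketch in Python of one common explanation technique, permutation feature importance, rendered as a crude text bar chart; the dataset and model below are placeholders I picked just for the example. The point is that an explanation is itself data someone still has to decide how to visualize, and later posts will look at richer visual encodings.

```python
# Sketch of one common explanation technique (permutation feature importance)
# rendered as a crude text bar chart. The dataset and model are placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# How much does shuffling each feature hurt held-out accuracy?
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
top = result.importances_mean.argsort()[::-1][:5]
for i in top:
    bar = "#" * max(1, int(result.importances_mean[i] * 200))
    print(f"{X.columns[i]:>25}  {bar}")
```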
Overview
As of now, these are the posts I plan to write and post for the series:
Who Needs Visualization for ML?
What Is There To Visualize?
Visualizing ML Output
Visualizing Model Explanations
Visualizing Model Internals
This is very tentative. There is a high chance that I will need to break these topics down into smaller parts and add or remove pieces as I develop the individual posts.
I am writing this with a bit of trepidation. Starting a new series is a big task, and this one seems bigger than the previous one. Wish me luck! I hope you’ll enjoy what I have to offer in this series. In the meantime, if you have any questions, suggestions, or requests, add a comment below.
Quite an insightful and thoroughly argued topic, even if the field is still in its early days. I'm eager to read all the episodes of this series!
I also dare to submit to your attention a visualization I published a while ago in La Lettura. Basically, I tried to describe an LLM's generative process in contrast with natural (human) language, namely the language we speak every day (no matter which language; as an Italian speaker, you already know that English translates both “lingua” and “linguaggio” with a single word, and I'm referring to the latter here).
Arguably, this work approaches the matter from a slightly different angle than yours. It's a journalistic, popularizing, and simplified visual description, and it relies on my background in General Linguistics much more than on my skills as a designer.
Let me know if you find anything of interest for your further studies.
https://www.behance.net/gallery/182883483/Chat-GPT-and-natural-language-a-comparison
A very interesting and, in general, little-researched topic (perhaps researched only in academia). I'm very curious to read the next installments.