VisML #4: Visualizing Input-Output ML Data
We look at ML models as black boxes and explore what we can learn by visualizing their input and output data.
This is Post #4 of my series on Visualization for Machine Learning. You can find the other articles in the Vis for ML section (now we have a whole section for the series!). In this post, we focus on how to visualize model behavior by looking at the data it receives as an input and produces as an output, that is, by treating the model as a “black box.” As always, your comments are very welcome. I am eager to learn how this can be improved and whether you find it useful. If you like what I am writing, subscribe to the newsletter and let other people know!
In my previous post of the series, I provided a broad-brush description of what kind of data is available in ML and how visualization can help extract useful information from it. In this post, I will focus on the first class of data sources mentioned there: input-output data. In exploring visualization solutions in this area, I will focus first on tabular data, that is, classic data sets with instances in rows and attributes in columns. Narrowing down the focus to this type of data will make it easier to describe the methods and the variations around them. At the end of this post, I will discuss ways in which the techniques I describe may extend or be adapted to other data types and highlight possible challenges we will need to overcome in the future to achieve that goal. The post is organized as follows: First, we cover what type of data we have with ML models. Second, we cover data visualization techniques for different data configurations. Third, we clarify what type of questions one may or may not be able to answer with these methods. Fourth, in the final part, we cover open issues and challenges. Enjoy the reading!
Model Data
When we observe models from the perspective of input and output, we can identify the following primary sources of information:
Data features and values. This is just the data used to train and test the model and the data that the model will eventually receive when used to make the actual prediction in production. If we focus exclusively on tabular data, the features are just nominal, ordinal, or quantitative variables with their associated value domains. For example, a customer data set may have demographic and behavioral data like age, region of residence, years the person has been a customer, visit frequency, etc.
Model output (classes, scores, quantities). This is the output generated by the model when receiving a specific input (often called a “prediction” even if it does not have to be about predicting a future event). There are many possible types of outputs; here we will focus on two basic types, classes and quantities. Classes (nominal or ordinal values) are the output of classifiers, and quantities are the output of regressors. One additional element to consider is model scores: many classifiers do not directly provide a predicted class as an output but rather a set of scores, one per class, that determine the predicted likelihood of each class. This is important because many data visualizations use these scores directly rather than the predicted class.
Ground truth (and errors). Most machine learning models are trained using data that record past outputs so that the model can learn to reproduce them in the future. For this reason, in ML, we often have data sets with “ground truth,” that is, the output the model should produce if it has learned to make predictions correctly. As you can imagine, this piece of information is essential because it’s the one that allows us to discriminate between correct and incorrect predictions. In practice, this means that for each element in the data table (when we use training or test data), we have two fundamental values: the model prediction and the actual (correct) value. The difference between these two is essential to understanding where and when the model makes mistakes.
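To make these three sources concrete, here is a tiny Python sketch of what such a table might look like (the customer data and column names are entirely invented for illustration):

```python
import pandas as pd

# Hypothetical customer data: features, model output (class + score),
# and ground truth, all in one table.
df = pd.DataFrame({
    "age":         [34, 52, 29, 61],                      # data features
    "visit_freq":  [12, 3, 8, 1],
    "predicted":   ["stay", "churn", "stay", "stay"],     # model output
    "churn_score": [0.12, 0.81, 0.35, 0.44],              # classifier score
    "actual":      ["stay", "churn", "churn", "stay"],    # ground truth
})

# The derived column we care most about: where the model is wrong.
df["error"] = df["predicted"] != df["actual"]
print(df)
```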
In addition to these primary sources, we have to also consider what kind of data we can derive from these sources. In other words, how can we transform these primary sources to generate useful information that derives from them? We will consider two main types of transformations: transformations of data features and values and transformations of model output. As we will see later on, transformations of the data input are essential to focus on specific aspects of the data and the model. In particular, the data input can be processed to derive the following structures:
Data cubes: Groups of data formed by segmenting the data according to the values of one or more attributes.
Data clusters: Groups of data formed by using a clustering algorithm that groups data items according to a similarity function.
Data embeddings: Projection of the data into a low-dimensional space characterized by (typically) 2 or 3 axes that capture the main structure and variation of the data.
One final characterization we need is to distinguish between training, test, and production data and also between (output) data coming from different models. These distinctions are important because visualization tools can support the comparison of these different data sources.
Visualization Techniques
When we look at these characteristics, we can start reasoning about what kind of visualizations one could build and for what purpose. We have two main classes of visualizations.
Visualizations that focus on model output (error analysis)
These visualizations focus exclusively on model output and the errors the model generates. Before focusing on visual representations, it’s important to characterize even further what kind of output and errors are possible because these affect what kind of visualization techniques are available for error analysis.
One way to look at this is to identify what kind of information we can gather from the output generated by individual instances. We have three possible cases according to what is predicted and what is the ground truth:
Predicted class + ground truth class
Predicted score(s) + ground truth class
Predicted quantity + ground truth quantity
Data and error distributions
The simplest analysis is to look at how output and errors are distributed. How many data points are there in each class/value? How many errors are there in each one?
When the output is a single class or value, bar charts and histograms (or density plots) are the best representations for these simple cases. A bar chart can show the number of data points in each class and the number of errors (a stacked bar chart can do that). A histogram or density plot can do the same when the output of a model is a quantity. If one wants to look at specific instances rather than aggregate values, the strategy used in Model Tracker is a possible solution. The data items are arranged along the horizontal axis according to their model score. If the model output is discrete, a dot plot or any of the many variants of strip plots can be used for the same purpose.
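As a minimal sketch of the stacked-bar idea, with invented predictions and ground truth standing in for a real model:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented predictions and ground truth for three classes.
df = pd.DataFrame({
    "predicted": ["a", "a", "b", "b", "b", "c", "c", "a", "c", "b"],
    "actual":    ["a", "b", "b", "b", "a", "c", "c", "a", "b", "b"],
})
df["correct"] = df["predicted"] == df["actual"]

# Stacked bar chart: number of data points per predicted class,
# split into correct and incorrect predictions.
pd.crosstab(df["predicted"], df["correct"]).plot(kind="bar", stacked=True)
plt.ylabel("number of instances")
plt.show()
```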
One trickier case is when we want to visualize the quantitative scores generated by a classifier to decide which class is the most probable. In this case, the output is not just one single value but a multidimensional value. For this situation, multidimensional visualization techniques such as parallel coordinates, scatter plot matrices, and multidimensional glyphs can be used to visualize model output. The image below shows the Squares tool, which employs a bespoke version of parallel coordinates that enables the analysis of model output for this specific case.
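If you want to try the parallel coordinates route on classifier scores with off-the-shelf tools, here is a rough sketch; the iris classifier is just a stand-in, not what Squares uses:

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Fit a stand-in classifier and collect its per-class scores.
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)
scores = pd.DataFrame(clf.predict_proba(X),
                      columns=["setosa", "versicolor", "virginica"])
scores["true_class"] = y  # color each line by its ground-truth class

# One polyline per instance, one vertical axis per class score.
parallel_coordinates(scores, class_column="true_class", alpha=0.3)
plt.ylabel("class score")
plt.show()
```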
Actual vs. predicted comparison
Another type of analysis one can do is to look at how the actual and predicted outcomes compare. When the model is a classifier, the output is a matrix of values with actual values in rows and predicted values in columns (or vice-versa). This kind of matrix is commonly called a “confusion matrix” and can be easily visualized as a matrix visualization. When the model is a regressor, one can do the same type of analysis by either binning the values and using the confusion matrix strategy or simply using a scatter plot, with one axis for the actual and one for the predicted values. A useful variant for this last case is a residual plot where instead of visualizing the predicted values, one visualizes the distance between the predicted and actual values.
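Both plots take a few lines with standard tooling. Here is a minimal sketch on synthetic data (the regressor output is simulated rather than coming from a real model):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Classifier case: actual vs. predicted rendered as a confusion matrix.
y_true = np.array(["cat", "dog", "dog", "cat", "bird", "dog"])
y_pred = np.array(["cat", "dog", "cat", "cat", "dog", "dog"])
ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
plt.show()

# Regressor case: residual plot, i.e., predicted minus actual values.
rng = np.random.default_rng(0)
actual = rng.uniform(0, 100, 200)
predicted = actual + rng.normal(0, 5, 200)  # simulated model output
plt.scatter(predicted, predicted - actual, alpha=0.4)
plt.axhline(0, color="gray")
plt.xlabel("predicted value")
plt.ylabel("residual (predicted - actual)")
plt.show()
```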
One complication for confusion matrices is a very high number of classes. In that case, the matrix may not scale very well, and additional strategies are needed to make it more effective. A good example is Blocks, a visualization system developed to address this specific case. In Blocks, the hierarchical nature of the classes is used to group rows and columns together. The tool also employs matrix sorting strategies to make it easier to detect trends in how the model confuses certain groups of classes with others.
Visualizations that relate input and output (inferred logic and subset analysis)
The analysis of model data can also focus on the relationship that exists between input and output data. Relating input to output permits us to learn more about how the model behaves in specific conditions, beyond what type of errors it makes. The existing methods can be grouped according to three types of data organizations.
Feature-outcome plots
The most basic relationship we can inspect is the one between feature values and model outcomes. Depending on what types of features and outcomes one inspects, we can have different types of data arrangements and associated visualizations. In general, we have all the possible combinations between nominal, ordinal, and quantitative features and outcomes, therefore a total of nine types of relationships. But if we collapse nominal and ordinal together (by just remembering to keep the order of categories intact), we have only four possible combinations of quantitative/categorical input and quantitative/categorical output. To visualize these data, it is sufficient to use a set of standard plots (my favorites!); one combination is sketched in code right after the list:
Quantitative input + quantitative output: Scatter plots
Categorical input + quantitative output: Dot plots or bar charts (though these do not scale well with many categories)
Quantitative input + categorical output: Stacked area charts (or stacked histograms)
Categorical input + categorical output: Stacked bar charts or heatmaps/matrices
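Here is the promised sketch for one of these combinations, quantitative input + categorical output, using a stacked histogram over the feature range (the churn data below is invented):

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented example: a quantitative feature (age) and a categorical
# outcome (churn vs. stay) whose probability grows with age.
rng = np.random.default_rng(1)
age = rng.uniform(18, 80, 300)
churn = rng.uniform(0, 1, 300) < (age - 18) / 90

# Stacked histogram over the feature range, one layer per outcome class.
plt.hist([age[churn], age[~churn]], bins=20, stacked=True,
         label=["churn", "stay"])
plt.xlabel("age")
plt.ylabel("count")
plt.legend()
plt.show()
```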
Visualizing these data, however, is more complex than it seems because there are several possible complications. The first is that we often want to distinguish between correct and incorrect predictions, so all the plots above need to accommodate a comparison between these two sets. The second is that categorical inputs or outputs can have a high number of categories. When this happens, some of the visualization techniques outlined above do not scale well. More specifically, bar charts, stacked area charts, stacked bar charts, and (to a lesser extent) heatmaps/matrices do not scale well when they have to accommodate a high number of categories. I will devote a separate post to this problem, but for now, it is important to be aware that this problem is real and common, because many real-world data sets have features and outcomes with a high number of categories.
A special case of input-output plots is the partial dependence plot (PDP). These plots are built through a two-step procedure. In the first step, each data point is modified by changing the value of the feature being plotted (in a range between the domain’s minimum and maximum value), and the output of the model with the modified data point is recorded. In the second step, all the values are aggregated to generate a mean model response for a given value of the input feature. The result is one plot for each feature that depicts the relationship between input and output with a line chart. The image below shows an example with a tool called PDPilot, which we developed in our lab (this is also Daniel’s work).
Each plot in this grid represents one of the features in the data set, and the black lines show the relationship between input and output calculated with the partial dependence plot procedure (the green lines represent the individual data points with the extrapolation across the feature value range described above).
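If you want to see the two-step procedure spelled out, here is a minimal sketch on a synthetic regression task; note that scikit-learn also provides PartialDependenceDisplay, which computes and plots this for you:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-ins for a real data set and model.
X, y = make_regression(n_samples=300, n_features=4, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

feature = 0  # the feature to plot
grid = np.linspace(X[:, feature].min(), X[:, feature].max(), 30)

# Step 1: for each grid value, overwrite the feature in *all* data points
# and record the model's predictions on the modified data.
# Step 2: average those predictions into a mean model response per value.
mean_response = []
for value in grid:
    X_mod = X.copy()
    X_mod[:, feature] = value
    mean_response.append(model.predict(X_mod).mean())

plt.plot(grid, mean_response)
plt.xlabel(f"feature {feature}")
plt.ylabel("mean model response")
plt.show()
```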
Data cubes and clusters
When you combine multiple features at once, you obtain what is called a “data cube” in databases. The idea is that the data space can be partitioned into cells by splitting it according to the values of a set of data attributes. Imagine a data set with demographic data: the space can be partitioned into small cells using the combinations of values of age, gender, and state. The same idea extends to combinations of any number of data features. This is relevant in our context because analyzing the behavior of a model across the cells of a data cube can be insightful. Imagine, for instance, that our goal is to find data subsets where model performance degrades considerably: a customer data set where the model's overall accuracy is high but much lower for a specific subpopulation of interest. That’s certainly a reason for concern. This is exactly the problem addressed in a 2019 paper titled “Slice finder: Automated data slicing for model validation,”1 which introduces an algorithm to look for such subpopulations where performance degrades. A more visualization-oriented approach can address the same problem by creating visualizations that partition the data space according to selected data features. I can count at least three papers where this approach is used (SliceLens was developed in our lab by my student Daniel Kerrigan):
Visual exploration of machine learning results using data cube analysis2.
The what-if tool: Interactive probing of machine learning models3.
SliceLens: Guided Exploration of Machine Learning Datasets4.
Here is an image from the What-If Tool developed by Google, which includes visualizations to carry out this type of analysis.
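The core computation behind this kind of data-cube analysis is simple enough to sketch. Assuming a table with a column that flags correct predictions, per-cell accuracy is a group-by away (the data below is invented, with one deliberately bad slice):

```python
import pandas as pd

# Hypothetical customer table with a precomputed correctness flag.
df = pd.DataFrame({
    "age_group": ["18-30", "18-30", "31-50", "31-50", "51+", "51+"] * 50,
    "region":    ["east", "west", "east", "west", "east", "west"] * 50,
    "correct":   ([True] * 5 + [False]) * 50,  # 51+/west is always wrong
})

# Partition the data space by two attributes (a 2-D "data cube")
# and compute accuracy and size inside each cell.
cube = df.groupby(["age_group", "region"])["correct"].agg(["mean", "size"])
cube = cube.rename(columns={"mean": "accuracy", "size": "n"})
print(cube.sort_values("accuracy"))  # worst-performing slices first
```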
The data cubes approach is one of many ways to build subgroups from data. Another approach is to use clustering algorithms, which group data according to a similarity function: they analyze the data and produce groups in which the data points tend to be similar under the chosen definition of similarity. This is useful when we look for ways to group the data other than through a specific set of features, as with data cubes. Once the clustering method returns the groups, one can perform the same type of analysis used for data cubes, which is mostly about inspecting and comparing model performance within and between groups.
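A minimal sketch of this cluster-then-compare workflow (KMeans and the iris classifier are stand-ins, and errors are computed on the training data for brevity):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)
errors = clf.predict(X) != y

# Group the data with a clustering algorithm instead of hand-picked features.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Then compare model performance between the groups, as with data cubes.
for c in range(4):
    mask = clusters == c
    print(f"cluster {c}: {mask.sum():3d} points, "
          f"error rate {errors[mask].mean():.2f}")
```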
Embeddings
One final option is to visualize data through “embeddings.” An embedding is obtained by transforming the input space into a new space that “describes” the data with a reduced number of synthetic features. Dimensionality reduction methods permit the creation of such embeddings. They take the original data set as an input and produce a user-defined number of new axes (features) one can use to project the data in a 2D or 3D space. An example here will be more evocative than many words. In the image below, you can see an example of an embedding of the classic MNIST data set.
The method receives the pixel values of each image as an input and produces two or three axes to project the data in a point cloud visualization. As you can see the method preserves most of the structure. Interestingly, the visualization also helps detect potentially critical cases. Do you see the red squares in a cloud of green or cyan squares? This is where embeddings can be useful. They can help detect edge cases and areas of the input space where a model might have difficulties in making predictions. These methods can be used with the input data exclusively, even before the data are used to train a model, but they can also be used to see how the model makes decisions in this space and how errors are distributed. For example, a region where errors concentrate could be a hint that the model needs adjustments specific to the set of data points that lie in that region.
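As a rough sketch of how such a picture is produced, here is a t-SNE embedding of scikit-learn's small digits data set (a stand-in for MNIST), colored by class:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Project the 64 pixel values of each digit image down to 2 synthetic axes.
X, y = load_digits(return_X_y=True)
embedding = TSNE(n_components=2, random_state=0).fit_transform(X)

# Point cloud colored by class; coloring by model error works the same way.
plt.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap="tab10", s=8)
plt.colorbar(label="digit class")
plt.show()
```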
What can we learn from these visualizations?
Now that you have a full picture of what information and corresponding visualizations are available it is important to take a step back and reflect on what can and cannot be learned from analyzing models with these techniques.
Visualizations that focus exclusively on model output can answer questions regarding how (training, test, and production) data and errors are distributed across the output space. For example, these methods can help identify predictions that are particularly problematic, either because items of one class are often confused with others or because certain quantities are harder to estimate. In turn, this information can lead to different actions according to what the source of the problem is. In some cases, problems can stem from inaccurate labeling of the data or extreme outliers. In other cases, one may realize that the model does not have enough training data to learn to make a specific kind of prediction.
Visualizations that focus on input and output carry additional information that relates the input and the output space. For example, it is possible to learn that errors concentrate in a specific subpopulation of interest or that a given input feature tends to have a specific relationship with model output. This is where ML visualizations can be helpful in testing the mental model of the user and identifying model behaviors that are counterintuitive or do not make sense. This is also where users can start drawing hypotheses about what cases are problematic for the model and perform additional testing (maybe with synthetic data) to see if the model makes systematic mistakes.
This second class of visualizations is characterized by the fact that it helps users draw inferences about model behavior and logic. What one needs to keep in mind, however, is that inferences made from data can always be inaccurate or even wrong; therefore, further testing of the hypotheses and mental models generated from using these data visualizations is necessary.
Open issues and challenges
The techniques I described do not cover all possible situations, and they can easily break down when applied to more complex situations. One problem I already mentioned is scalability. When the number of classes or data items to show is too high, visualizations can easily reach a visual scalability limit.
Another big issue is how to analyze model data with other data types. In the beginning, I mentioned that all these methods apply to cases where the data handled by the model is tabular. Do these techniques work with other data types? It depends. The techniques that focus on model output can be applied to any other case where the output is a class or a quantity. However, some models have different or more complex types of outputs. For example, time series forecasting has a whole time series as an output. Machine learning methods that produce ranked lists have a whole ranking as an output. Some models also produce a probability and an associated level of uncertainty. All in all, we can’t assume that models produce only the simple outputs we covered, and specific adaptations are needed to cover other cases.
The problem becomes even more complicated when we consider methods that associate the input and the output space. This is where the nature of the data can make a big difference because we can no longer treat data as a collection of features and associated values. Images, videos, text, time series, etc., all have completely different structures. Embeddings can be used seamlessly if one has a way to calculate a distance function between the data objects, but the analysis based on features does not apply easily. Similarly, the data cubes analysis may or may not apply depending on the nature of the data (image metadata can be used to build data cubes, for example).
Finally, I want to mention that more and more ML problems are configured around unbounded input and output spaces, and it is not at all evident how model data visualizations could be applied to these cases. Models that handle unbounded input and/or output spaces require a complete rethinking of the problem.
Thanks for reading until the end! Please like the post and leave a comment. It’s very useful for me to learn about my readers’ thoughts about the articles I post here.
If you are not a subscriber, sign up to receive updates when I post new articles here. If you like what I am writing, please help me spread the word by letting your friends and colleagues know about this series and the newsletter.
Chung, Yeounoh, et al. "Slice finder: Automated data slicing for model validation." 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 2019.
Kahng, Minsuk, Dezhi Fang, and Duen Horng Chau. "Visual exploration of machine learning results using data cube analysis." Proceedings of the Workshop on Human-In-the-Loop Data Analytics. 2016.
Wexler, James, et al. "The what-if tool: Interactive probing of machine learning models." IEEE Transactions on Visualization and Computer Graphics 26.1 (2019): 56-65.
Kerrigan, Daniel, and Enrico Bertini. "SliceLens: Guided Exploration of Machine Learning Datasets." Proceedings of the Workshop on Human-In-the-Loop Data Analytics. 2023.