Verification Is the Bottleneck in AI-driven Data Analysis
Reflections on how to deal with the verification problem when using LLMs for data analysis and visualization
A while back, I posted an article on what AI could do for data visualization. In that article, I used a data pipeline to explain where AI could intervene and what kind of support it could provide. More recently, I challenged myself to perform data analysis with ChatGPT. I recorded myself performing the analysis and then wrote a post reflecting on the opportunities and challenges I faced. Behind the scenes, we have been experimenting with LLMs in our lab and have spent quite some time thinking about how we can advance the state of the art in this space. All these experiences led me to believe that verification is the biggest bottleneck to the effective and valuable use of LLMs in data analysis and visualization. What do I mean by verification? It’s simple. When an LLM produces a result for you, verification is any procedure or tool that helps you check the correctness (or veracity) of that result.
I am certainly not the first to point this out for the general use of LLMs. Hallucinations are a highly researched topic in this space. However, when we focus on using LLMs for data analysis, the goal is more specifically to verify that the data processing steps performed by the system are a valid interpretation of the instructions given by the user. This is made more complicated by the fact that data analysis in this space is driven by natural language expressions, which by their very nature can be ambiguous and pose additional challenges compared with more traditional ways of expressing data operations as computational procedures.
Verification difficulties
One interesting aspect of verification is that not all problems are equally hard. Some problems are trivial to detect, whereas others are much harder. If I expect a given type of plot as output and receive another, the discrepancy is obvious. However, if the problem resides in how the data values are computed, it may be hard to detect.
As things stand now, we don’t have a good understanding of what kind of mistakes LLMs make in this space, and we don’t have a good characterization of the issues we should be aware of. Knowing what kind of mistakes may be possible is crucial for three reasons. First, to educate people and increase their awareness of possible errors. If analysts have a better sense of possible errors, they know what to look for. Second, to develop tools that help perform verification more easily (more on this below). Third, to develop LLMs that make fewer mistakes (even though it’s not evident to me that these errors can be eliminated completely).
Will the need for verification vanish?
Having detailed the problem, a worthy question is: what can we do about it? One possible answer is to build LLMs that do not make mistakes in the first place, but is that even possible? It’s very hard to make predictions because this is an area where major breakthroughs could happen anytime. However, I have a strong intuition that the need for verification is here to stay, for two main reasons. First, it is not evident to me that it’s possible to build “perfect” LLMs. The diminishing returns problem seems relevant here. AI tools may be capable of being 80% or even 90% accurate on a given task, and still, it may be an insurmountable challenge to get from 90% to 95% and beyond. This is not unheard of. It’s precisely the problem self-driving cars have been experiencing for years.

Second, even if we had “perfect” LLMs, we would still have the ambiguity problem. Expressing data needs in natural language means that LLMs will always have to interpret the actual intent of the user. While more skilled users may learn to use less ambiguous expressions, expecting everyone to become an expert in this space would significantly limit the promise of LLMs. In other words, if LLMs represent an opportunity to democratize data work even further, we will have to find a way to make them usable and effective for people who don’t have 20+ years of data analysis experience under their belt. I also have a strong suspicion that, even among data experts, having verification tools will always be preferred to blindly trusting the LLM.
Possible verification strategies
If you try to perform data analysis with major LLM tools today, you’ll notice that they already provide a rudimentary verification tool: they allow you to inspect the code generated from your prompt so that you can get a sense of what data operations were actually performed. While powerful, I think this tool is very blunt. Reviewing code is tedious and limited to people who understand code, which is a (probably small) subset of all potential users. So, if code is not the best option, what else can we do?
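To make the point concrete, here is a hypothetical example of the kind of code such a tool might hand back for a prompt like “show me the average sales per region for 2023.” The dataset, column names, and interpretation choices are invented for illustration; this is not the output of any specific tool:

```python
import pandas as pd

# Hypothetical code an LLM tool might generate for the prompt:
# "Show me the average sales per region for 2023."
df = pd.read_csv("sales.csv", parse_dates=["order_date"])

# Keep only 2023 -- did the user mean calendar year or fiscal year?
df_2023 = df[df["order_date"].dt.year == 2023]

# Group by region and average the "sales" column -- is this the column
# the user had in mind, or did they mean revenue net of returns?
result = df_2023.groupby("region")["sales"].mean().reset_index()

# Plot the result as a bar chart.
result.plot.bar(x="region", y="sales")
```

Even this short snippet asks the reader to know pandas and to spot the interpretation choices (calendar vs. fiscal year, which column counts as “sales”) that may or may not match their intent.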
One solution is to use prompts to ask the LLM to explain how it arrived at the generated output. I have not tried this solution yet, but I suspect it might produce some useful results. One problem with this approach, however, is that the quality of the verification depends on the user’s ability to write good verification prompts, unless standard prompts can be found that reliably produce the desired verification output. Another problem is that the explanations produced for verification may themselves be unreliable.
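As a rough sketch of what a standard verification prompt could look like, the snippet below appends a fixed follow-up question to an existing conversation. It assumes the OpenAI Python client purely for illustration; the template wording and the helper function are my own invention, not a feature of any existing tool:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A fixed "verification prompt" template, so the burden of writing a good
# verification question does not fall entirely on the user.
VERIFY_TEMPLATE = (
    "List, step by step, every data operation you performed to produce the "
    "previous answer (filters, joins, aggregations, column choices), and "
    "state any assumptions you made where my request was ambiguous."
)

def verify(conversation: list[dict]) -> str:
    """Ask the model to explain how it arrived at its last answer."""
    messages = conversation + [{"role": "user", "content": VERIFY_TEMPLATE}]
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content
```

Whatever comes back is itself generated text, so it inherits the reliability problem mentioned above.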
Another solution is to expose elements of the data processing steps the generated code goes through to produce the output. The LLM produces code that is run in a specific environment, and by analyzing that code we can make it easier for the user to understand the data processing steps. Here, two strategies, which are not mutually exclusive, are possible: one focusing on describing the processing steps that have been applied (e.g., selecting, aggregating, filtering) and one on exposing the intermediate data tables those steps produce. In both cases, visualization and interface design can play a major role. By integrating verification elements into the user interface, the user can quickly grasp what kind of data processing steps have been applied to the data and verify that they correspond to the desired outcome. More research is needed in this space.
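As a minimal sketch of these ideas (all class and function names are hypothetical, and the pipeline is the same invented sales example used earlier), the environment that executes the generated code could wrap each operation so that a human-readable description and the intermediate table it produces are both recorded for the interface to display:

```python
import pandas as pd

class TracedPipeline:
    """Runs a sequence of data operations while recording, for each step,
    a human-readable description and the intermediate table it produced."""

    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.steps = []  # list of (description, snapshot) pairs

    def apply(self, description: str, operation):
        """Apply one operation (a function DataFrame -> DataFrame) and log it."""
        self.df = operation(self.df)
        self.steps.append((description, self.df.head()))  # keep a small snapshot
        return self

# Example: the same hypothetical analysis as above, traced step by step.
pipeline = (
    TracedPipeline(pd.read_csv("sales.csv", parse_dates=["order_date"]))
    .apply("Filter rows to orders placed in 2023",
           lambda d: d[d["order_date"].dt.year == 2023])
    .apply("Group by region and compute mean sales",
           lambda d: d.groupby("region")["sales"].mean().reset_index())
)

# A verification UI could render each description next to its intermediate table.
for description, snapshot in pipeline.steps:
    print(description)
    print(snapshot, "\n")
```

The point is not this particular implementation but that step descriptions and intermediate tables become first-class objects a verification interface can show next to the final chart.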
Conclusion
There is a lot more to say about the verification problem. We do not yet have enough information to understand the potential of these powerful tools for data analysis and visualization. Still, I suspect we will see a lot of advancements and an increasing integration of language models in the data pipeline. The challenge is how to develop these tools with user needs in mind. To build tools that empower people and allow them to retain agency, it is crucial to study what these tools can and cannot do and to build interfaces that allow for proper verification of the results. I am sure we will see a lot more happening in this space soon.
In my summer course for master’s students, I ran an in-class activity to demonstrate this by giving students a somewhat messy dataset and a question, and then having them compete to be the first to use gen AI to get the correct answer. Everyone gets it wrong, again and again, until they learn that they have to dissect what the LLM is doing analytically: how it’s pulling variables, whether the analytic definitions are clear, etc. I do think there should be some nice UI designs you can come up with to make analytic output faster / easier to verify...
Good post! I mainly use Claude 3 Sonnet via Perplexity AI, and the growing capability of these models is definitely creating an automation cognitive bias in me, leading me to spend less time verifying the outputs because of how impressive they are most of the time. Most of my use cases are not analytical and are more research around texts.
1. The following prompt for my Perplexity AI account does a fairly good job at AI explainability. There are other technical methods as well, like SHAP and decision trees. https://www.perplexity.ai/search/what-are-common-technical-meth-XjlMH5PyRrunNlZ_hqEKVg
"AI explainability (What were the most important variables and factors impacting this prompt and what are the percentage weights to the variables and factors were used in driving your answer.)
Also summarize the attention weights into a simple table and show which parts of the input the model is focusing on when generating the output. "
2. Another useful approach for verification is just to feed the results of one LLM into another LLM. LLMs have different model weights based on their training data, parameter size, and the fine-tuning work done to them.
3. Lastly, the release of Mistral Agents today shows the rise of agents specifically built to verify the output of other agents, which might help lessen the verification bottleneck you described above.
https://docs.mistral.ai/capabilities/agents/
Use case 4: Data analytical multi-agent workflow