Evaluating AI Is Harder Than Building It


Over the past few months, mentions of AI evaluation by industry leaders have become more and more frequent, with some of the greatest minds tackling the challenges of ensuring AI safety, reliability, and alignment. That got me thinking about the topic, and in this post I'll share my view on it.

The Problem

Creating a robust evaluation system is a tricky engineering challenge. With so many diverse tasks we're trying to solve with AI, getting it right will only become more complex. In the pre-agentic era, most problems were narrow and specific. An example is making sure the user gets better recommended posts, measured by engagement time, likes, and so on. More engagement means better performance.
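To make the contrast concrete, here's a minimal sketch of what such a narrow metric might look like. The field names and weights are purely illustrative assumptions, not from any real system:

```python
# Hypothetical engagement score for a recommendation system.
# Field names and weights are illustrative assumptions.
def engagement_score(session: dict) -> float:
    """Higher is better: here the metric and the goal coincide."""
    return (
        session.get("seconds_viewed", 0) * 1.0
        + session.get("likes", 0) * 30.0
        + session.get("shares", 0) * 60.0
    )

sessions = [
    {"seconds_viewed": 120, "likes": 2, "shares": 0},
    {"seconds_viewed": 45, "likes": 0, "shares": 1},
]
print(sum(engagement_score(s) for s in sessions) / len(sessions))
```

The point is that the score is cheap to compute, reproducible, and directly tied to the business goal.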

But as AI advanced and unlocked new experiences and scenarios, things became much more difficult. Even without agentic systems, we started facing the challenge of getting the measurement right, especially with things like conversational AI. In contrast to the previous example, the exact thing to measure here is practically unknown. Instead, we have criteria like customer satisfaction rate (for customer support applications), "vibe" for creative tasks, benchmarks like SWE-bench for pure coding ability, and so on. The problem is that these criteria are only proxies for the quality we actually care about, which prevents us from achieving the same fidelity of measurement that simpler tasks allowed.
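As a rough illustration of the proxy problem, a conversational eval often ends up blending several stand-in signals. Everything in this sketch is hypothetical: the thumbs-up rate, the stubbed judge, and the 50/50 weighting are all unvalidated assumptions, which is exactly the issue:

```python
# Sketch of a proxy metric for conversational quality.
# `judge_score` stands in for an LLM-as-judge call; here it is
# a hypothetical stub, not a real API.
def judge_score(transcript: str) -> float:
    """Pretend rubric score in [0, 1]; a real system would call a model."""
    return 0.8  # placeholder

def conversation_quality(transcript: str, thumbs_up: int, thumbs_down: int) -> float:
    total = thumbs_up + thumbs_down
    csat_proxy = thumbs_up / total if total else 0.5  # neutral prior when no votes
    # Blending two proxies; the 50/50 weighting is itself an assumption.
    return 0.5 * csat_proxy + 0.5 * judge_score(transcript)

print(conversation_quality("user: hi\nagent: hello!", thumbs_up=7, thumbs_down=3))
```

Neither signal is the thing we want; both merely correlate with it, and we rarely know how well.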

Today’s Main Concerns

As we accelerate into the agentic era, existing eval issues compound. Imagine a multi-step process you're designing an agent system for. For each step you have to create a proper quality control system to prevent points of failure or bottlenecks. Then, since you're working with a pipeline, you must ensure that the chain of interdependent steps completes flawlessly; even if every step looks reliable in isolation, the end-to-end success rate is roughly the product of the per-step rates, as the quick sketch below shows. And what if one of the steps is an automated conversation with the user? That is tricky to evaluate on its own, but once such an open-ended task becomes part of your business pipeline, its uncertainty propagates through the entire thing.
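A back-of-the-envelope sketch of the compounding (the per-step rates are made up, and independence between steps is assumed, which real pipelines rarely satisfy):

```python
# End-to-end reliability of a chain of dependent steps,
# optimistically assuming failures are independent.
from math import prod

step_success = [0.99, 0.97, 0.95, 0.90, 0.98]  # illustrative per-step rates
print(prod(step_success))  # ~0.80: five "good" steps yield a mediocre pipeline
```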

A Proposed Solution

This might seem concerning, and it really is. In my opinion, we can still get it right if we apply systematic thinking to such problems. I propose the following framework:

  1. Decompose the pipeline into small steps
  2. Design a measurable and reproducible evaluation approach
  3. Assess the interactions between steps and adjust accordingly

When we decompose the pipeline, we should try to match each step's complexity to the intellectual capacity of the agentic tools currently available. A good eval design then ensures that the results of each step are reliable and robust. And if we keep the interplay between steps in check, we can harden the integrity of the overall pipeline. With many moving parts, it's important to get this last part right, especially at scale.
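Here's a minimal sketch of what this framework could look like in code. The step names, evaluators, and thresholds are all hypothetical; the point is the shape, where every step carries its own eval gate:

```python
# Sketch: a decomposed pipeline where every step has its own evaluator.
# All names and thresholds are hypothetical, for illustration only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[str], str]               # the step itself
    evaluate: Callable[[str, str], float]   # (input, output) -> score in [0, 1]
    threshold: float                        # minimum acceptable score

def run_pipeline(steps: list[Step], payload: str) -> str:
    for step in steps:
        output = step.run(payload)
        score = step.evaluate(payload, output)
        # Framework step 2: every step is measured, reproducibly.
        if score < step.threshold:
            raise RuntimeError(f"{step.name} failed eval: {score:.2f} < {step.threshold}")
        # Framework step 3: the eval gates the handoff between steps.
        payload = output
    return payload

# Toy steps standing in for real agentic stages (framework step 1).
steps = [
    Step("extract", lambda x: x.strip(), lambda i, o: 1.0 if o else 0.0, 0.9),
    Step("summarize", lambda x: x[:40], lambda i, o: min(1.0, len(o) / 10), 0.5),
]
print(run_pipeline(steps, "  some long user request text  "))
```

Failing fast at the step that broke is what makes the end-to-end behavior debuggable once there are many moving parts.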

Conclusion

Of course, the complexity doesn't end there. There is a huge range of diverse problems that need a careful, thoughtful approach tailored to each specific domain.

An example that excites me personally is applying non-invasive BCI (brain-computer interface) technology to previously unimaginable things. From properly interpreting abstract data like brain states to correctly measuring the effect of incremental changes as the field progresses, this will require far more advanced approaches to evaluation than we have today.

So far things look promising, and with so many great minds dedicating their time to designing better evaluation systems alongside the primary AI research, I'm confident we'll end up with safe and aligned technology. Let me know what you think!
