Know Thy Process — Socrates (on AI automation)
Why evaluation clarity must come before scaffolding
Before we dive deep into specific scaffolding patterns, there is one foundational question you need to get clarity on:
How are you evaluating your AI automation of operational workflows?
This is the second article in a multipart series on AI-driven workflow automation. If you have not read the previous article on scaffolding, I recommend starting there, as this piece builds directly on it.
---
The problem with evaluation metrics
Many pilot programs, including ones I have personally worked on, begin with a simplistic and idealistic view of evaluation:
“Replicate humans. Period.”
By how much?
“Let’s say 99 percent.”
That sounds ambitious. Too ambitious?
“Fine, let’s start with 90 percent.”
Nice round numbers. Comforting. Familiar.
They are chosen much like significance thresholds in statistics and machine learning: rarely questioned. And even when they are, the discussion is usually framed using analogies to surveys or industry benchmarks.
What is missing is a customized, objective view grounded in the reality of the specific process being automated.
An arbitrary evaluation criterion can do real damage.
Set it too low, and you build weak guardrails that create downstream risk.
Set it too high, and the pilot never scales because the targets are unnaturally steep.
Choose the wrong metric, and you gain false confidence.
So how should this be done?
Yes, the intent is often to replicate human decision-making.
But it needs to be done in a deeper and more honest way.
Step 1: Understand the subjectivity of the decision you are trying to measure
Think of any operational decision as lying on a spectrum.
At one extreme, it is completely objective.
For example, counting the number of cheque bounces in a bank statement.
At the other extreme, it is highly subjective.
For example, deciding whether a slightly blurry bank statement is acceptable.
A simple rule of thumb helps here:
Given the same input, is it possible that two competent operators arrive at different outcomes?
If the answer is no, the decision is straightforward and strict matching may be appropriate.
Most of the time, however, the answer is yes.
One important caveat:
If subjectivity stems from an ambiguous policy, no evaluation metric will stabilize automation. Those policy issues must be addressed first.
Step 2: Identify the sources of subjectivity
Once you acknowledge subjectivity, the next step is to understand where it comes from.
Consider a common step in lending workflows: identifying the industry of a business based on its name and description.
Different analysts may arrive at different classifications for several reasons:
1. The business may operate across multiple activities that do not map cleanly to a single industry code.
2. The business description may vary across sources such as the website, directories, and registrations.
3. The bank statement may reveal a revenue pattern that conflicts with the stated business activity.
4. Analysts may prioritize different sources or interpret the same information differently based on experience.
These are not edge cases. They are the norm.
The best way to uncover these nuances is by combining process intelligence, such as process mining and task mining, with Gemba walks, that is, direct observation of the work where it happens. Watching operators in action builds intuition about both the causes and the extent of variation in your evaluation metric.
Step 3: Choose an evaluation criterion that reflects reality
With this understanding, you can now choose an evaluation approach that mirrors how the process actually works.
In the industry classification example, is a strict one-to-one match the right metric?
It can be measured. But it may not be the right gating criterion.
Alternatives such as top-k matches or common-intersection matches between humans and AI often better represent alignment with human reasoning.
If your downstream system insists on a single auditable value, you can still enforce deterministic resolution rules, while using top-k or intersection-based alignment as the evaluation gate.
There is no universally correct answer here.
The right metric is the one that best reflects your process reality, risk tolerance, and decision context.
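To make the contrast concrete, here is a minimal sketch of the three gating criteria discussed above: strict one-to-one match, top-k match, and intersection-based match. The function names, labels, and data are illustrative assumptions, not part of any real classification taxonomy.

```python
# Sketch: three ways to score an AI industry classification against a
# human reference. All names and labels below are illustrative only.

def strict_match(ai_label: str, human_label: str) -> bool:
    """Exact one-to-one agreement on a single label."""
    return ai_label == human_label

def top_k_match(ai_ranked: list[str], human_label: str, k: int = 3) -> bool:
    """The human's label appears anywhere in the AI's top-k candidates."""
    return human_label in ai_ranked[:k]

def intersection_match(ai_labels: set[str], human_labels: set[str]) -> bool:
    """The AI's and the human's candidate sets share at least one label."""
    return bool(ai_labels & human_labels)

# Illustrative case: a business that both sells and services machinery.
ai_ranked = ["wholesale trade", "repair services", "manufacturing"]
human_label = "repair services"

print(strict_match(ai_ranked[0], human_label))   # False: strict gate fails
print(top_k_match(ai_ranked, human_label, k=3))  # True: top-3 gate passes
print(intersection_match(set(ai_ranked), {"repair services", "wholesale trade"}))  # True
```

Note that the same AI output fails the strict gate yet passes the other two, which is exactly why the choice of gating criterion, not just the target number, determines whether a pilot looks viable.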
Step 4: Build a statistically sound golden dataset
Once the metric is defined, select a statistically significant golden batch.
This is not new territory. Use standard sampling principles to ensure the dataset is representative of real volume, variability, and complexity.
Also remember that golden datasets have an expiry date. They remain golden only as long as they continue to represent the process. As policies, volumes, and behaviors drift, evaluation datasets must be refreshed.
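As a rough sizing aid, the standard sample-size formula for a proportion, with a finite population correction, gives a starting point for how large the golden batch should be. The function and its defaults are a sketch; real sampling should also stratify by case complexity and volume mix, as noted above.

```python
import math

def golden_batch_size(population: int, confidence_z: float = 1.96,
                      margin_of_error: float = 0.05, p: float = 0.5) -> int:
    """Sample size for estimating a proportion: n0 = z^2 * p(1-p) / e^2,
    then a finite population correction. p = 0.5 is the conservative
    worst-case assumption about variability."""
    n0 = (confidence_z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    n = n0 / (1 + (n0 - 1) / population)  # finite population correction
    return math.ceil(n)

# E.g. a monthly volume of 10,000 cases, 95% confidence, ±5% margin:
print(golden_batch_size(10_000))  # 370
```

This only guarantees statistical coverage of the overall error rate; representativeness of variability and complexity still requires deliberate stratified selection.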
Step 5: Measure human-to-human variation
Have multiple operators, with varying levels of expertise, independently execute the process on the golden batch in a blind manner.
Compare their outcomes.
This exercise gives you a clear picture of the natural variation that already exists among humans. This variation defines feasibility bounds and highlights where judgment, ambiguity, or inconsistency lives.
To be clear, this is not an argument to institutionalize human variation in automated systems. The goal is to minimize it. However, once you choose the metric you want to hold your AI system to, you need a well-rounded understanding of current human performance in the system.
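The blind-execution exercise above can be quantified with a simple mean pairwise agreement score, sketched below. This is a deliberately minimal stand-in for more formal inter-rater measures such as Cohen's or Fleiss' kappa; the operator names and labels are invented for illustration.

```python
from itertools import combinations

def pairwise_agreement(ratings: dict[str, list[str]]) -> float:
    """Mean fraction of items on which each pair of operators agrees.
    All operators must have rated the same items in the same order."""
    pairs = list(combinations(ratings.values(), 2))
    total = sum(
        sum(a == b for a, b in zip(r1, r2)) / len(r1) for r1, r2 in pairs
    )
    return total / len(pairs)

# Illustrative: three operators independently classifying five cases.
ratings = {
    "op_a": ["retail", "repair", "logistics", "retail", "manufacturing"],
    "op_b": ["retail", "repair", "retail",    "retail", "manufacturing"],
    "op_c": ["retail", "trade",  "logistics", "retail", "manufacturing"],
}
print(round(pairwise_agreement(ratings), 2))  # 0.73
```

A score like 0.73 here is the feasibility bound the article describes: holding an AI system to 99 percent strict agreement with any single operator would demand more consistency than the humans exhibit among themselves.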
Why this matters
By following these five steps, you achieve three outcomes.
First, you define an evaluation metric that is fair, defensible, and grounded in your specific process.
Second, you quantify existing human variation, which allows you to set realistic and meaningful targets for your GenAI pilot.
Third, you give decision makers confidence that automated performance is comparable to or better than current operations.
Closing thoughts
Replicating humans is not the ultimate goal of automation. Consistency, risk reduction, and scalability often matter more. But human performance remains a reasonable and practical starting reference.
Workflow automation using GenAI is often a deeply reflective exercise. It forces organizations to confront their own process maturity, policy clarity, and decision consistency. Done systematically and objectively, this approach provides the confidence needed to move beyond pilots and into scaled deployment.
With evaluation clarity in place, we can now move forward to assessing AI performance against meaningful metrics and strengthening workflows through scaffolding.
Well begun is half done.
PS: Below are links to some arXiv papers I found useful on evaluation methodologies. Measurement System Analysis (MSA) in Six Sigma also deals extensively with quantifying variation in the measurement system itself.
https://arxiv.org/abs/2307.03109 - A survey on evaluation of large language models
https://arxiv.org/abs/2309.16349 - Human feedback is not gold standard