Measuring What Matters: Consistency Is the Real Test of Automation
Variation is the spice of life (and spoiler for automation) - Wise old elf
This is the fourth article in a five-part series on progressively transforming legacy processes using GenAI.
In the previous articles, we established three foundational ideas.
First, successful automation depends on scaffolding - both technical and human - not just better prompts or models.
Second, evaluation metrics must reflect the real subjectivity in your process.
Third, momentum comes from winnowing - separating straightforward cases from judgment-heavy ones so you can ship safely and learn quickly. This builds automation momentum and, equally important, creates visibility for policy teams to resolve the ambiguities that volume had previously hidden.
Once these foundations are in place, as you winnow more, a new question emerges:
How consistent are your automated decisions?
This is especially pressing in a GenAI-powered system, which can give different responses to the same question.
In the world of operations, consistency is what turns isolated success into dependable execution. Inconsistent automated decisions are not just a technical nuisance. They carry a real business cost - they are a compliance and repricing liability.
Measuring consistency
A useful way to think about consistency is through an analogy familiar to many operations leaders: Gauge R&R studies.
In those studies, the same measurement is repeated multiple times under identical conditions to test repeatability. If the same part produces different measurements each time, the measurement system cannot be trusted—no matter how sophisticated the tool.
GenAI systems require the same discipline.
The principle is simple:
Given the same inputs, how often does your system produce the same decision?
Let us return to the running example from the previous articles - “identifying industry codes from business names”.
To measure consistency, we ran the same few hundred businesses through the system multiple times, often 30 or more iterations, with identical inputs.
We then categorized the outcomes:
- Cases that produced the same code every time
- Cases that alternated between two plausible codes
- Cases that fluctuated across three or more codes
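This repeatability test can be sketched in a few lines, assuming a `classify` function that wraps the real GenAI call. The toy classifier and business names below are illustrative stand-ins, not part of the actual system:

```python
from collections import Counter

def consistency_profile(classify, cases, runs=30):
    """Run each case through the classifier multiple times and
    bucket it by how many distinct codes it produced."""
    buckets = {"stable": [], "bimodal": [], "multimodal": []}
    for case in cases:
        codes = Counter(classify(case) for _ in range(runs))
        if len(codes) == 1:
            buckets["stable"].append(case)
        elif len(codes) == 2:
            buckets["bimodal"].append(case)
        else:
            buckets["multimodal"].append(case)
    return buckets

# Toy stand-in for the real GenAI classifier: deterministic here,
# whereas a production LLM call may return a different code each run.
toy = {"Acme Bakery": "2051", "Smith LLC": "7389"}
profile = consistency_profile(lambda name: toy[name], toy.keys())
```

The same function works unchanged once `classify` is a real LLM call; only the bucket counts change.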
That alone provides useful insight. But operationally, it is still incomplete.
The real question is:
Does this variation actually matter?
So we went one step further. What counts as material variation is not universal - it is defined entirely by your downstream decision engine.
We traced the downstream impact of these variations. In particular:
- How many fluctuating codes led to different pricing outcomes?
- When pricing changed, how large was the impact?
- Did the variation create material risk, or just harmless noise?
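The downstream tracing above can be sketched as follows. The rate table, code values, and business names are hypothetical, purely to illustrate the distinction between harmless noise and material variation:

```python
def pricing_impact(code_runs, price_for):
    """For each fluctuating case, check whether the alternating
    codes actually map to different prices, and how far apart."""
    report = {}
    for case, codes in code_runs.items():
        prices = {price_for(c) for c in codes}
        report[case] = {
            "codes": sorted(set(codes)),
            "prices_differ": len(prices) > 1,
            "max_delta": max(prices) - min(prices),
        }
    return report

# Hypothetical rate table: two codes price identically (harmless
# noise), while the third prices differently (material variation).
rates = {"2051": 100.0, "2052": 100.0, "7389": 145.0}
runs = {"Acme Bakery": ["2051", "2052"], "Smith LLC": ["2051", "7389"]}
impact = pricing_impact(runs, rates.get)
```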
This step is often overlooked. It should not be.
Improving consistency always comes at a cost - more prompts, more validation logic, more tokens, more latency. Before investing in tighter consistency, you must first determine whether the reduction in variation is economically meaningful within the bounds of your system.
Consistency improvement is not merely a technical exercise. It is an ROI decision.
Understanding the sources of variation
Once variation is measured, the next task is to understand why it exists.
GenAI systems are, at their core, probabilistic engines. Even with identical inputs, slight shifts in interpretation can lead to different outputs. In practice, however, most variation is not random; it usually traces back to identifiable causes. In our case, the largest driver of variation was not the model but disagreement between reference sources that humans had silently worked around for years. The list boils down to the usual suspects: conflicting signals in inputs, broad prompts, inconsistent reference material, genuine business ambiguity, and some inherent randomness attributable to LLMs.
Let us make this concrete with the business-code example:
- There could be multiple businesses with the same name, especially common names
- Your reference manual for classifying the business could be inconsistent or overlapping
- The business could claim to do multiple things, which does not fit neatly into a single classification code
Improving consistency - knowing the trade-offs
Once sources of variation are understood, the next step is choosing interventions. Each intervention improves consistency - but at a cost. Operational leaders must choose deliberately.
Some of the approaches that worked for us:
Strengthen input gating
Many inconsistent outcomes originate from ambiguous inputs. By tightening input acceptance criteria—an idea introduced earlier in the winnowing discussion—you prevent problematic cases from entering automated paths.
Trade-off: Reduced coverage.
You improve stability, but fewer cases qualify for automation. In one of our implementations, the strict gating criteria meant we could process only two-thirds of the eligible volume. As the rules become clearer, we expect to raise this number by winnowing more.
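A gating function for this might look like the sketch below. The field names and thresholds are illustrative assumptions, not the real acceptance criteria:

```python
def gate(record):
    """Admit a case to the automated path only when its inputs are
    unambiguous; everything else is routed to human review."""
    name = record.get("business_name", "").strip()
    if len(name) < 4:
        return "human_review"   # too short to disambiguate
    if record.get("name_match_count", 1) > 1:
        return "human_review"   # multiple businesses share this name
    return "automate"
```

The point is that the gate runs before any LLM call, so ambiguous cases never enter the probabilistic path at all.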
Break problems into atomic steps
Complex decisions are often better handled as sequences of smaller, explicit judgments rather than a single large prompt.
Best practices include:
- Using structured prompts
- Providing clear examples
- Encouraging the system to abstain when uncertain
Trade-off: Increased latency and token usage
You gain clarity and control but consume more compute. In our case, each additional call to a reasoning LLM cost more tokens and significantly more time (about 30 seconds per reasoning call).
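As a rough sketch of the decomposition, each step below makes one small judgment and abstains when it cannot decide. The keyword tables are placeholders for what would be individual, tightly scoped LLM calls:

```python
ABSTAIN = object()  # sentinel: the step declines to decide

def step_extract_activity(name):
    """Step 1: pull the claimed activity out of the business name.
    The keyword table is a placeholder for a scoped LLM call."""
    keywords = {"bakery": "baking", "consulting": "consulting"}
    hits = [v for k, v in keywords.items() if k in name.lower()]
    return hits[0] if len(hits) == 1 else ABSTAIN

def step_map_code(activity):
    """Step 2: map exactly one activity to one code; abstain otherwise."""
    table = {"baking": "2051", "consulting": "7389"}
    return table.get(activity, ABSTAIN)

def classify(name):
    activity = step_extract_activity(name)
    if activity is ABSTAIN:
        return ABSTAIN
    return step_map_code(activity)
```

Because each step can abstain, ambiguous cases fall out of the pipeline early instead of producing a confident-looking but unstable answer.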
Stabilize reference material
If your decision depends on external knowledge, the reference layer must be reliable.
If reliance on model memory introduces ambiguity, consider a curated reference layer such as a retrieval-augmented generation (RAG) architecture.
Trade-off: Latency and infrastructure cost
The gain in determinism is often substantial, but it costs more: even with semantic search, this increased our token consumption by 3x.
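A toy version of the retrieval contract, using naive lexical overlap in place of embeddings (the manual entries and codes are invented for illustration):

```python
def retrieve(query, manual, k=2):
    """Naive lexical retrieval over the classification manual.
    A production system would use embeddings, but the contract is
    the same: answer only from retrieved text, not model memory."""
    query_words = set(query.lower().split())
    scored = sorted(
        manual.items(),
        key=lambda kv: -len(query_words & set(kv[1].lower().split())),
    )
    return [code for code, _ in scored[:k]]

manual = {
    "2051": "bread cake and bakery products",
    "7389": "business consulting services",
}
candidates = retrieve("Acme Bakery bread products", manual)
```

Constraining the model to choose among retrieved candidates, rather than recalling codes from memory, is what buys the determinism.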
Introduce corroboration logic
When ambiguity cannot be avoided, narrow the set of possible outcomes and use additional evidence to confirm the final decision. This mimics how experienced operators behave—cross-checking before committing.
Trade-off: Latency and system complexity. This requires access to the web, databases, or other application data, making the system more complex.
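Corroboration can be sketched as follows. Here `evidence_lookup` stands in for a web or database check and is an assumption, not a real API:

```python
def corroborate(candidates, evidence_lookup, case):
    """Accept a candidate code only if a second, independent source
    agrees; otherwise defer the decision."""
    confirmed = [c for c in candidates if c in evidence_lookup(case)]
    if len(confirmed) == 1:
        return confirmed[0]
    return None  # ambiguous or unconfirmed: route onward

# Toy secondary source: a registry keyed by business name.
registry = {"Acme Bakery": {"2051"}}
result = corroborate(["2051", "2052"],
                     lambda case: registry.get(case, set()),
                     "Acme Bakery")
```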
Use consensus mechanisms
In high-risk decisions, multiple models or evaluators can be used to reach consensus.
Agreement across independent evaluations increases reliability.
Trade-off: Token consumption and processing time
This approach should be reserved for decisions where the cost of error is significant.
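A minimal consensus sketch, where each evaluator stands in for an independent model or prompt variant:

```python
from collections import Counter

def consensus(evaluators, case, quorum=2):
    """Ask several independent evaluators and commit only when at
    least `quorum` of them agree on the same code."""
    votes = Counter(ev(case) for ev in evaluators)
    code, count = votes.most_common(1)[0]
    return code if count >= quorum else None

evals = [lambda c: "2051", lambda c: "2051", lambda c: "7389"]
decision = consensus(evals, "Acme Bakery")
```

Every extra evaluator multiplies token cost, which is why this belongs only on the highest-risk decisions.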
To sum it up, all of our consistency improvements over the initial implementation increased token cost by 3x and latency by 6x. But this was a deliberate, conscious choice: the automation forms the bedrock of all the decisions that follow.
It was a deliberate choice to get the foundation right, but not a permanent one. We are actively working on optimizing the system now that the consistency baseline is established.
The sequence matters: stabilize first, then optimize. Doing it the other way around is how you end up optimizing for the wrong thing.
Consistency is not a one-time certification
Many teams treat consistency testing as a milestone. That is a mistake.
Consistency drifts over time.
Inputs evolve. Policies change. Reference data shifts. Models are upgraded.
Without periodic testing, drift goes unnoticed until failures surface in production.
Two practices are essential:
- Repeatability testing at defined intervals
- Continuous monitoring for decision drift
Golden datasets—introduced in earlier discussions on evaluation—remain valuable here. They provide a stable reference point for tracking change across system versions.
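Checking for drift against a golden dataset can be as simple as the sketch below; the 95% agreement threshold is an illustrative choice, not a recommendation:

```python
def drift_check(classify, golden, threshold=0.95):
    """Re-score the golden dataset against the current system and
    flag drift when agreement falls below the threshold."""
    agree = sum(1 for case, expected in golden.items()
                if classify(case) == expected)
    rate = agree / len(golden)
    return {"agreement": rate, "drifted": rate < threshold}

golden = {"Acme Bakery": "2051", "Smith LLC": "7389"}
status = drift_check(lambda n: {"Acme Bakery": "2051"}.get(n, "9999"),
                     golden)
```

Run on a schedule (and after every model or prompt change), this turns consistency from a one-time certification into a monitored property.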
The real lesson
Consistency improvement is not free.
Every additional layer of determinism consumes time, tokens, infrastructure, or coverage.
Before tightening your system, ask:
Does the variation materially affect outcomes?
Does it introduce operational risk?
Does reducing it produce measurable value?
If the answer is yes, invest deliberately.
If not, accept bounded variation and move forward.
Operational excellence is rarely about perfection.
It is about controlled reliability.
The wise old elf was right — variation is the spoiler that forces you to choose what actually matters. You can’t fight all of it. Pick your battles deliberately.
Where this leads next
At this stage in the journey, most organizations have:
- Defined realistic evaluation metrics
- Built momentum through winnowing
- Stabilized decision consistency
What remains is the final challenge in the ‘think big, start small and scale fast’ triad:
How do you scale the automation, measure its impact, and govern these systems? That’s where we go next.

