Keep Deterministic Work Deterministic – O’Reilly

This is the second article in a series on agentic engineering and AI-driven development. Read part one here, and look for the next article on April 2 on O’Reilly Radar.

The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time.
Tom Cargill, Bell Labs

One of the experiments I’ve been running as part of my work on agentic engineering and AI-driven development is a blackjack simulation where an LLM plays hundreds of hands against blackjack strategies written in plain English. The AI uses these strategy descriptions to make hit/stand/double-down decisions for each hand, while deterministic code deals the cards, checks the math, and verifies that the rules were followed correctly.

Early runs of my simulation had a 37% pass rate. The LLM would add up card totals wrong, skip the dealer’s turn entirely, or ignore the strategy it was supposed to follow. The big problem was that these errors compounded: If the model miscounted the player’s total on the third card, every decision after that was based on a wrong number, so the whole hand was garbage even if the rest of the logic was fine.

There’s a useful way to think about reliability problems like that: the March of Nines. Getting an LLM-based system to 90% reliability is the first nine, and it’s the “easy” one. Getting from 90% to 99% takes roughly the same amount of engineering effort. So does getting from 99% to 99.9%. Every nine costs about as much as the last, and you never stop marching. Andrej Karpathy coined the term from his experience building self-driving systems at Tesla, where they spent years earning two or three nines and still had more to go.

Here’s a small exercise that shows how that kind of failure compounding works. Open any AI chatbot running an early 2026 model (I used ChatGPT 5.3 Instant) and paste the following eight prompts one at a time, each in a separate message. Go ahead, I’ll wait.

Prompt 1: Track a running “score” through a 7-step game. Don’t use code, Python, or tools. Do this entirely in your head. For each step, I will give you a sentence and a rule.

CRITICAL INSTRUCTION: You must reply with ONLY the mathematical equation showing how you updated the score. Example format: 10 + 5 = 15 or 20 / 2 = 10. Don’t list the words you counted, don’t explain your reasoning, and don’t write any other text. Just the equation.

Start with a score of 10. I’ll give you the first step in the next prompt.

Prompt 2: “The sudden blizzard chilled the small village communities.” Add the number of words containing double letters (two of the exact same letter back-to-back, like ‘tt’ or ‘mm’).

Prompt 3: “The clever engineer needed seven perfect pieces of cheese.” If your score is ODD, add the number of words that contain EXACTLY two ‘e’s. If your score is EVEN, subtract the number of words that contain EXACTLY two ‘e’s. (Don’t count words with one, three, or zero ‘e’s.)

Prompt 4: “The good sailor joined the eager crew aboard the wooden boat.” If your score is greater than 10, subtract the number of words containing consecutive vowels (two different or identical vowels back-to-back, like ‘ea’, ‘oo’, or ‘oi’). If your score is 10 or less, multiply your score by this number.

Prompt 5: “The quick brown fox jumps over the lazy dog.” Add the number of words where the THIRD letter is a vowel (a, e, i, o, u).

Prompt 6: “Three brave kings stand under black skies.” If your score is an ODD number, subtract the number of words that have exactly five letters. If your score is an EVEN number, multiply your score by the number of words that have exactly five letters.

Prompt 7: “Look down, you shy owl, go fly away.” Subtract the number of words that contain NONE of these letters: a, e, or i.

Prompt 8: “Green apples fall from tall trees.” If your score is greater than 15, subtract the number of words containing the letter ‘a’. If your score is 15 or less, add the number of words containing the letter ‘l’.

The exercise tracks a running score through seven steps. Each step gives the model a sentence and a counting rule, and the score carries forward. The correct final score is 60. Here’s the answer key: start at 10, then 16 (10+6), 12 (16−4), 5 (12−7), 10 (5+5), 70 (10×7), 63 (70−7), 60 (63−3).

I ran this twice at the same time (using ChatGPT 5.3 Instant) and got two completely different wrong answers on the first try. Neither run reached the correct score of 60:

Step | Correct | Run 1 (transcript) | Run 2 (transcript)
1. Double letters | 10 + 6 = 16 | 10 + 2 = 12 ❌ | 10 + 5 = 15 ❌
2. Exactly two ‘e’s | 16 − 4 = 12 | 12 − 4 = 8 ❌ | 15 + 4 = 19 ❌
3. Consecutive vowels | 12 − 7 = 5 | 8 × 7 = 56 ❌ | 19 − 5 = 14 ❌
4. Third letter vowel | 5 + 5 = 10 | 56 + 5 = 61 ❌ | 14 + 3 = 17 ❌
5. Exactly five letters | 10 × 7 = 70 | 61 − 7 = 54 ❌ | 17 − 4 = 13 ❌
6. No a, e, or i | 70 − 7 = 63 | 54 − 7 = 47 ❌ | 13 − 3 = 10 ❌
7. Words with ‘a’ | 63 − 3 = 60 | 47 − 3 = 44 ❌ | 10 + 4 = 14 ❌

The two runs tell very different stories. In Run 1, the model miscounted in Step 1 (finding 2 double-letter words instead of 6) but actually got the later counts right. It didn’t matter. The wrong score in Step 1 flipped a branch in Step 3, triggering a multiply instead of a subtract, and the score never recovered. One early mistake threw off the entire chain, even though the model was doing good work after that.

Run 2 was a disaster. The model miscounted at almost every step, compounding errors on top of errors. It ended at 14 instead of 60. That’s closer to what Karpathy is describing with the March of Nines: Each step has its own reliability ceiling, and the longer the chain, the higher the chance that at least one step fails and corrupts everything downstream.

What makes this insidious: Both runs look the same from the outside. Each step produced a plausible answer, and both runs produced confident final results. Without the answer key (or some tedious manual checking), you’d have no way of knowing that Run 1 was a near-miss derailed by a single early error while Run 2 was wrong at nearly every step. This is typical of any process where the output of one LLM call becomes the input for the next one.

These failures don’t demonstrate the March of Nines itself; that’s specifically about the engineering effort to push reliability from 90% to 99% to 99.9%. (It’s possible to reproduce the full compounding-reliability problem in a chat, but a prompt that did it reliably would be far too long to put in an article.) Instead, I opted for a shorter exercise, one you can easily try yourself, that demonstrates the underlying problem that makes the march so hard: cascading failures. Each step asks the model to count letters within words, which is deterministic work that a short Python script handles perfectly. LLMs, on the other hand, don’t actually treat words as strings of characters; they see them as tokens. Spotting double letters means unpacking a token into its characters, and the model gets that wrong just often enough to reliably screw it up. I added branching logic where each step’s result determines the next step’s operation, so a single miscount in Step 1 cascades through the entire sequence.

I also want to be clear about exactly what a deterministic version of this simulation looks like. Fortunately, the AI can help us with that. Go to either run (or your own) and paste one more prompt into the chat:

Prompt 9: Now write a short Python script that does exactly what you just did: start with a score of 10, apply each of the seven rules to the seven sentences, and print the equation at each step.

Run the script. It should print the correct answer for every step, ending at 60. The same AI that just failed the exercise can write code that does it flawlessly, because now it’s generating deterministic logic instead of trying to count characters through its tokenizer.
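For reference, here’s one way such a script might look. This is my own sketch, not the model’s actual output, and it assumes the sentence wordings that the answer key’s counts are based on. It lands on 60 every run:

```python
import re

VOWELS = set("aeiou")

def words(sentence):
    return re.findall(r"[a-z]+", sentence.lower())

def doubles(ws):        # words with a doubled letter, like 'tt'
    return sum(any(a == b for a, b in zip(w, w[1:])) for w in ws)

def consec_vowels(ws):  # words with two vowels back-to-back
    return sum(any(a in VOWELS and b in VOWELS for a, b in zip(w, w[1:])) for w in ws)

STEPS = [
    ("The sudden blizzard chilled the small village communities.",
     lambda s, ws: s + doubles(ws)),
    ("The clever engineer needed seven perfect pieces of cheese.",
     lambda s, ws: s + (1 if s % 2 else -1) * sum(w.count("e") == 2 for w in ws)),
    ("The good sailor joined the eager crew aboard the wooden boat.",
     lambda s, ws: s - consec_vowels(ws) if s > 10 else s * consec_vowels(ws)),
    ("The quick brown fox jumps over the lazy dog.",
     lambda s, ws: s + sum(len(w) >= 3 and w[2] in VOWELS for w in ws)),
    ("Three brave kings stand under black skies.",
     lambda s, ws: s - sum(len(w) == 5 for w in ws) if s % 2
                   else s * sum(len(w) == 5 for w in ws)),
    ("Look down, you shy owl, go fly away.",
     lambda s, ws: s - sum(not (set(w) & set("aei")) for w in ws)),
    ("Green apples fall from tall trees.",
     lambda s, ws: s - sum("a" in w for w in ws) if s > 15
                   else s + sum("l" in w for w in ws)),
]

score = 10
for sentence, rule in STEPS:
    new = rule(score, words(sentence))
    print(f"{score} -> {new}")
    score = new
print("final score:", score)  # ends at 60
```

Every rule here is plain string work, so the branch taken at each step is always computed from a correct score, which is exactly the property the chat-based runs lost.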

Reproducing a cascading failure in a chat

I deliberately engineered the exercise above to give you a way to experience the cascading failure problem behind the March of Nines yourself. I took advantage of something current LLMs genuinely suck at: parsing characters within tokens. Future models might do a much better job with this specific kind of failure, but the cascading failure problem doesn’t go away when the model gets smarter. As long as LLMs are nondeterministic, any step that relies on them has a reliability ceiling below 100%, and those ceilings still multiply. The specific weakness changes; the math doesn’t.

I also specifically asked the model to show only the equation and skip all intermediate reasoning to prevent it from using chain of thought (or CoT) to self-correct. Chain of thought is a technique where you require the model to show its work step by step (for example, listing the words it counted and explaining why each one qualifies), which helps it catch its own mistakes along the way. CoT is a common way to improve LLM accuracy, and it works. As you’ll see later when I talk about the evolution of my blackjack simulation, CoT cut certain errors roughly in half. But “half as many errors” is still not zero. Plus, it’s expensive: It costs more tokens and more time. A Python script that counts double letters gets the exact right answer on every run, instantly, for zero AI API cost (or, if you’re running the AI locally, for orders of magnitude less CPU usage). That’s the core tension: You can spend engineering effort making the LLM better at deterministic work, or you can just hand it to code.

Every step in this exercise is deterministic work that code handles flawlessly. But most interesting LLM tasks aren’t like that. You can’t write a deterministic script that plays a hand of blackjack using natural-language strategy rules, or decides how a character should respond in dialogue. Real work requires chaining multiple steps together into a pipeline, a reproducible series of steps (some deterministic, some requiring an LLM) that lead to a single result, where each step’s output feeds the next. If that sounds like what you just saw in the exercise, it is. Except real pipelines are longer, more complex, and much harder to debug when something goes wrong in the middle.

LLM pipelines are especially susceptible to the March of Nines

I’ve been spending a lot of time thinking about LLM pipelines, and I think I’m in the minority. Most people using LLMs are working with single prompts or short conversations. But once you start building multistep workflows where the AI generates structured data that feeds into the next step (whether that’s a content generation pipeline, a data processing chain, or a simulation), you run straight into the March of Nines. Each step has a reliability ceiling, and those ceilings multiply. The exercise you just tried had seven steps. The blackjack pipeline has more, and I’ve been running it hundreds of times per iteration.
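The multiplication is easy to make concrete. The per-step reliabilities below are illustrative, not measurements from my pipeline:

```python
# A chain succeeds only if every step does, so per-step success
# rates multiply. Even strong per-step reliability erodes fast
# over a long chain.
def chain_reliability(per_step: float, steps: int) -> float:
    return per_step ** steps

for p in (0.90, 0.99, 0.999):
    print(f"{p:.1%} per step over 7 steps -> {chain_reliability(p, 7):.1%}")
```

At 90% per step, a seven-step chain succeeds less than half the time; even at 99% per step, it only clears about 93%.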

The blackjack pipeline in Octobatch, an open source batch orchestrator for multistep LLM workflows that I introduced in “The Accidental Orchestrator.”

That’s a screenshot of the blackjack pipeline in Octobatch, the tool I built to run these pipelines at scale. That pipeline deals cards deterministically, asks the LLM to play each hand following a strategy described in plain English, then validates the results with deterministic code. Octobatch makes it easy to change the pipeline and rerun hundreds of hands, which is how I iterated through eight versions, and how I learned the hard way that the March of Nines wasn’t just a theoretical problem but something I could watch happening in real time across hundreds of data points.

Running pipelines at scale made the failures obvious and immediate, which, for me, really underscored an effective approach to minimizing the cascading failure problem: make deterministic work deterministic. That means asking whether every step in the pipeline actually needs to be an LLM call. Checking that a jack, a five, and an eight add up to 23 doesn’t require a language model. Neither does looking up whether standing on 15 against a dealer 10 follows basic strategy. That’s arithmetic and a lookup table, work that ordinary code does perfectly every time. And as I learned over the course of improving the failure rate for the pipeline, every step you pull out of the LLM and make deterministic goes to 100% reliability, which stops it from contributing to the compound failure rate.
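As a sketch of what “arithmetic and a lookup table” means here (the table fragment and names are hypothetical, not Octobatch’s actual code):

```python
# Checking a card total is plain arithmetic: no language model required.
assert 10 + 5 + 8 == 23  # jack + five + eight

# A fragment of a hard-total basic-strategy table, keyed by player total
# and dealer upcard. A full table also covers soft hands and pairs.
HARD_TOTALS = {
    15: {up: ("stand" if up <= 6 else "hit") for up in range(2, 12)},
    16: {up: ("stand" if up <= 6 else "hit") for up in range(2, 12)},
}

def correct_action(player_total: int, dealer_up: int) -> str:
    return HARD_TOTALS[player_total][dealer_up]

print(correct_action(15, 10))  # "hit": standing on 15 against a dealer 10 is a strategy error
```

The lookup gives the same verdict on every run, instantly, which is the whole point.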

Relying on the AI for deterministic work is the computation side of a pattern I wrote about for data in “AI, MCP, and the Hidden Costs of Data Hoarding.” Teams dump everything into the AI’s context because the AI can handle it, until it can’t. The same thing happens with computation: Teams let the AI do arithmetic, string matching, or rule evaluation because it mostly works. But “mostly works” is expensive and slow, and a short script does it perfectly. Better yet, the AI can write that script for you, which is exactly what Prompt 9 demonstrated.

Getting cascading failures out of the blackjack pipeline

I pushed the blackjack pipeline through eight iterations, and the results taught me more about earning nines than I expected. That’s why I’m writing this article: the iteration arc turned out to be one of the clearest illustrations I’ve found of how the principle works in practice.

I addressed failures in two ways, and the distinction matters.

Some failures called for making work deterministic. Card dealing runs as a local expression step, which doesn’t require an API call, so it’s free, instant, and 100% reproducible. There’s a math verification step that uses code to recalculate totals from the actual cards dealt and compares them against what the LLM reported, and a strategy compliance step that checks the player’s first action against a deterministic lookup table. Neither of those steps requires an AI to make a judgment call; when I initially ran them as LLM calls, they introduced errors that were hard to detect and expensive to debug.
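Here’s a sketch of what that math verification step might look like. The field names (“cards”, “reported_total”) are illustrative, not Octobatch’s actual schema:

```python
# Deterministic math check: recompute the hand total from the cards
# actually dealt and compare it to the total the LLM reported.
VALUES = {"A": 11, "K": 10, "Q": 10, "J": 10}

def card_value(card: str) -> int:
    return VALUES.get(card, int(card) if card.isdigit() else 0)

def hand_total(cards) -> int:
    total = sum(card_value(c) for c in cards)
    aces = cards.count("A")
    while total > 21 and aces:  # demote aces from 11 to 1 as needed
        total -= 10
        aces -= 1
    return total

def verify_math(hand: dict) -> bool:
    return hand_total(hand["cards"]) == hand["reported_total"]

print(verify_math({"cards": ["J", "5", "8"], "reported_total": 23}))  # True
print(verify_math({"cards": ["A", "9", "5"], "reported_total": 25}))  # False: the ace drops to 1, so the real total is 15
```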

Other failures called for structural constraints that made specific error patterns harder to produce. Chain of thought formatting forced the LLM to show its work instead of jumping to conclusions. The rigid dealer output structure made it mechanically difficult to skip the dealer’s turn. Explicit warnings about counterintuitive rules gave the LLM a reason to override its training priors. These don’t eliminate the LLM from the step; they make the LLM more reliable within it.
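A side benefit of a rigid output structure is that violations become cheap to detect deterministically. As a purely hypothetical illustration (not my pipeline’s actual format), requiring one line per dealer draw makes a skipped dealer turn visible to a regex:

```python
import re

# Require one "dealer draws X, total N" line per draw; a transcript
# with no matching line means the dealer's turn was skipped.
DEALER_LINE = re.compile(r"^dealer draws \w+, total \d+$")

def dealer_turn_present(output: str) -> bool:
    return any(DEALER_LINE.match(line.strip()) for line in output.splitlines())

print(dealer_turn_present("dealer draws 7, total 17"))  # True
print(dealer_turn_present("player wins"))               # False
```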

But before any of that mattered, I had to face the uncomfortable fact that measurements themselves can be wrong, especially when you rely on AI to take those measurements. For example, the first run reported a 57% pass rate, which was great! But when I looked at the data myself, a lot of runs were clearly wrong. It turned out that the pipeline had a bug: Verification steps were running, but the AI step that was supposed to enforce them didn’t have sufficient guardrails, so almost every hand passed regardless of the actual data. I asked three AI advisors to review the pipeline, and none of them caught it. The only thing that exposed it was checking the aggregate numbers, which didn’t add up. Once you let probabilistic behavior into a step that should be deterministic, the output will look plausible and the system will report success, but you have no way to know something’s wrong until you go looking for it.

Once I fixed the bug, the real pass rate emerged: 31%. Here’s how the next seven iterations played out:

  • Restructuring the data (31% → 37%). The LLM kept losing track of where it was in the deck, so I restructured the data it received to eliminate the bookkeeping. I also removed split hands entirely, because tracking two simultaneous hands is stateful bookkeeping that LLMs reliably botch. Each fix came from looking at what was actually failing and asking whether the LLM needed to be doing that work at all.
  • Chain of thought arithmetic (37% → 48%). Instead of letting the LLM jump to a final card total, I required it to show the running math at every step. Forcing the model to trace its own calculations cut multidraw errors roughly in half. CoT is a structural constraint, not a deterministic replacement; it makes the LLM more reliable within the step, but it’s also more expensive because it uses more tokens and takes more time.
  • Replacing the LLM validator with deterministic code (48% → 79%). This was the single largest improvement in the entire arc. The pipeline had a second LLM call that scored how accurately the player followed strategy, and it was wrong 73% of the time. It applied its own blackjack intuitions instead of the rules I’d given it. But there’s a right answer for every situation in basic strategy, and the rules can be written as a lookup table. Replacing the LLM validator with a deterministic expression step recovered over 150 incorrectly rejected hands.
  • Rigid output format (79% → 81%). The LLM kept skipping the dealer’s turn entirely, jumping straight to declaring a winner. Requiring a step-by-step dealer output format made it mechanically difficult to skip ahead.
  • Overriding the model’s priors (81% → 84%). One strategy required hitting on 18 against a high dealer card, which any conventional blackjack wisdom says is terrible. The LLM refused to do it. Restating the rule didn’t help. Explaining why the counterintuitive rule exists did: The prompt had to tell the model that the bad play was intentional.
  • Switching models (84% → 94%). I switched from Gemini Flash 2.0 to Haiku 4.6, which was easy to do because Octobatch lets you run the same pipeline with any model from Gemini, Anthropic, or OpenAI. I finally earned my first nine.

Find the best ways to earn your nines

If you’re building anything where LLM output feeds into the next step, the same question applies to every step in your chain: Does this actually require judgment, or is it deterministic work that ended up in the LLM because the LLM can do it? The strategy validator felt like a judgment call until I looked at what it was actually doing, which was checking a hand against a lookup table. That one realization was worth more than all the prompt engineering combined. And as Prompt 9 showed, the AI is often the best tool for writing its own deterministic replacement.

I learned this lesson through my own work on the blackjack pipeline. It went through eight iterations, and I think the numbers tell a story. The fixes fell into two categories: making work deterministic (pulling it out of the LLM entirely) and adding structural constraints (making the LLM more reliable within a step). Both earn nines, but pulling work out of the LLM entirely earns those nines faster. The biggest single jump in the whole arc, 48% to 79%, came from replacing an LLM validator with a 10-line expression.

Here’s the bottom line for me: If you can write a short function that does the job, don’t give it to the LLM. I initially reached for the LLM for strategy validation because it felt like a judgment call, but once I looked at the data I saw it wasn’t one at all. There was a right answer for every hand, and a lookup table found it more reliably than a language model.

At the end of eight iterations, the pipeline passed 94% of hands. The 6% that still fail may be honest limits of what the model can do with multistep arithmetic and state tracking in a single prompt. But they may just be the next nine that I need to earn.

The next article looks at the other side of this problem: Once you know what to make deterministic, how do you make the whole system legible enough that an AI can help your users build with it? The answer turns out to be a kind of documentation you write for AI to read, not humans, and it changes the way you think about what a user manual is for.
