Defeating Nondeterminism in LLMs? Only Part of the Story
“Reproducibility is the minimum requirement of science. Reliability is the minimum requirement of systems.”
A Personal Observation
When I test LLM-based agents, I sometimes run the exact same prompt multiple times. One run loops endlessly, another shortcuts to a shallow answer, a third succeeds cleanly. The divergence isn’t noise in the arithmetic—it’s something deeper. LLMs are statistical approximators, not causal reasoners. They generate outputs through local, distribution-based prediction rather than global, causal understanding—an architectural form of bounded rationality.
This pattern took me back to my doctoral research in motor neuroscience. Human movement is never identical from trial to trial. Variability is intrinsic. But the nervous system manages it by planning within boundaries and embedding feedback loops to keep actions reliable. Reliability in biology comes not from eliminating variability, but from designing around it.
What Thinking Machines’ Recent Paper Fixes
Thinking Machines recently showed how to “defeat nondeterminism” in LLM inference by fixing a quiet culprit: floating-point arithmetic (https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/). Floating-point addition is not associative, and on GPUs the order of operations inside a kernel shifts with batch size and parallelism, so the same input can yield slightly different outputs depending on what else is being served alongside it.
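To see the arithmetic side concretely, here is a minimal sketch in NumPy (my own illustration, not code from the post): summing the same values in different block sizes, the way a GPU reduction regroups work under different batch and parallelism configurations, gives answers that differ in the low-order bits.

```python
import numpy as np

# Floating-point addition is not associative, so the grouping of a sum matters.
# The same values reduced with different block sizes (as a GPU kernel might do
# under different batch/parallelism configurations) give slightly different answers.
rng = np.random.default_rng(0)
values = rng.standard_normal(100_000).astype(np.float32)

def blocked_sum(x: np.ndarray, block: int) -> np.float32:
    """Sum fixed-size blocks first, then sum the partial results."""
    partials = np.array(
        [x[i:i + block].sum(dtype=np.float32) for i in range(0, len(x), block)],
        dtype=np.float32,
    )
    return partials.sum(dtype=np.float32)

for block in (1, 128, 4096):
    # Typically agrees only to ~6-7 significant digits across block sizes.
    print(block, blocked_sum(values, block))
```

Batch-invariant kernels pin this grouping down so it no longer depends on what else is in the batch.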
Their solution—batch-invariant kernels—removes this source of nondeterminism. For fixed inputs, on fixed hardware, under fixed conditions, the model will now produce bitwise identical outputs.
That’s important for reproducibility in testing, debugging, and compliance. It ensures you can tell whether a change in output was due to a prompt or model change, not some hidden kernel effect.
Bounded Determinism
But here’s the subtle limitation: the guarantee is same input, same output. That only buys you something in bounded cases: template-driven prompts and controlled inputs where the expected output is also narrow and templated.
In those cases, it’s enormously valuable:
Regression testing during model updates (see the sketch after this list).
Compliance audits where fixed prompts must return fixed answers.
Controlled evaluation pipelines.
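As a concrete illustration of the regression-testing case, here is a minimal sketch. The generate function is hypothetical shorthand for your own inference call with greedy decoding and pinned settings; with batch-invariant kernels underneath, a hash mismatch points to a prompt or model change rather than run-to-run arithmetic noise.

```python
import hashlib

def generate(prompt: str) -> str:
    """Hypothetical wrapper around your deployment's inference endpoint,
    called with temperature=0 and all sampling settings pinned."""
    raise NotImplementedError("wire this to your inference stack")

# Golden outputs recorded from the previously approved build:
# prompt -> SHA-256 hex digest of the exact output text.
GOLDEN = {
    "Summarize policy 4.2 in one sentence.": "<digest recorded at approval time>",
}

def regressions() -> list[str]:
    """Return the prompts whose output no longer matches the recorded digest."""
    failed = []
    for prompt, expected in GOLDEN.items():
        actual = hashlib.sha256(generate(prompt).encode("utf-8")).hexdigest()
        if actual != expected:
            failed.append(prompt)
    return failed
```

The same pattern covers the compliance-audit case: the fixed prompt and its approved answer are stored together, and any drift shows up as an exact mismatch rather than a judgment call.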
Outside of those boundaries, the guarantees dissolve. In agentic workflows with diverse, shifting, or open-ended inputs, kernel determinism does not prevent divergence. The nondeterminism shifts from arithmetic to epistemics—from how the model computes tokens to how it interprets, generalizes, and plans.
The Deeper Problem
In my experience, this is where most failures actually arise:
Decision branching: Tiny variations in context cascade into divergent paths.
Generalization limits: The model converges toward statistical means, often ignoring underlying causal structure.
Environmental noise: Prompts evolve, APIs fail, states shift.
The motor system analogy fits again. If you constrain a task—say, tapping a metronome—you get reproducible patterns. But in open-world tasks—walking across uneven ground, catching a falling object—variability dominates, and reliability depends on adaptive scaffolding, not on repeatable execution.
LLMs today are closer to the motor system without the scaffolding. They can generate bitwise identical outputs in bounded tasks, but they lack the causal structure and adaptive feedback that would make them reliable in unbounded ones.
Business Implication
For decision-makers, the distinction is critical:
If you need reproducibility in bounded tests—scientific pipelines, compliance checks, controlled evaluations—the Thinking Machines fix is a true advance.
If you need reliability in real-world, agentic workflows—customer service agents, planning systems, autonomous assistants—the fix doesn’t address the core problem. Your risk is not kernel-level nondeterminism, but bounded rationality and epistemic fragility.
Closing Thought
My work in neuroscience taught me that reliability emerges not from sameness but from structure. Biological systems embrace variability yet stay robust by layering constraints and feedback.
Deterministic kernels sharpen the ruler. But in open systems, the ground itself still shifts. The real challenge remains: designing architectures that can stay reliable when their core components are variable function approximators.
If you found this perspective useful, consider subscribing. Free readers get my essays. Paid subscribers get exclusive analyses where I merge systems neuroscience with AI strategy—practical tools to stress-test decisions in your own organization.
Paying subscribers get access to my AI Reliability Checklist—7 practical questions you can use tomorrow to stress-test whether your LLM systems are reliable, not just deterministic.