What I mean when I say AI assurance

The most dangerous output an AI system produces is not the obviously wrong one. It's the confidently wrong one: the response that looks complete, reads as correct, and contains no signal that anything needs checking.

I've spent twenty years in regulated industries watching a version of this problem play out in different forms. Financial risk systems that satisfied their model validation requirements without satisfying reality. Cybersecurity programmes that achieved certification while remaining insecure. The pattern is consistent: the appearance of assurance fills the space that the substance of it should occupy, because the incentives reward the appearance. AI is the latest and largest instance of that pattern, and this essay is an attempt to say clearly what genuine assurance would require.

What assurance means when it has teeth

The formal definition, drawn from software safety engineering and standards such as ISO/IEC 15026, is this: assurance is a structured argument, supported by evidence, that a system meets its specification. Not a feeling of confidence. Not a vendor's claim. A documented, reasoned case that a system behaves as intended, within defined bounds and under defined conditions, set out so that an independent party can evaluate and challenge it.

Three properties make that definition meaningful in practice.

The first is independence. The argument has to be evaluable by someone outside the team that built the system. Internal confidence is not assurance. It's an opinion, and opinions have a well-documented tendency to align with incentives.

The second is reproducibility. The claimed behaviour has to be demonstrable reliably, under defined conditions, by someone who wasn't present the first time. Reproducibility is the evidentiary foundation of the assurance case. Without it, the argument has nothing to stand on.

The third is defined criteria. A system is not assured in the abstract. It's assured against a particular claim: it prices within these tolerances, it classifies within these error bounds, it fails in these defined ways and no others. Without the criteria, you have no basis for deciding whether the evidence is sufficient.

In industries where this has legal teeth, these aren't aspirational requirements. DO-178C governs airborne software. IEC 61508 covers functional safety in industrial systems. FDA frameworks regulate medical device software with structured pre-market evidence requirements. These industries built rigorous assurance frameworks because the cost of getting it wrong is visible, attributable, and tends to make the news. The frameworks are hard to satisfy, expensive to maintain, and not reducible to a checklist. Which is precisely the point.

In regulated financial services, the frameworks are less mature than aviation but more serious than most of what I now see in AI. They're layered and continuous: process documentation, independent model validation, ongoing monitoring, formal sign-off at defined checkpoints. The checkpoint is not bureaucracy. The checkpoint is the assurance. Without it you have a process that looks like assurance from a distance but doesn't hold up under scrutiny.

Where AI breaks the definition

Apply that definition to an AI system and it breaks almost immediately, at the foundation.

The same input does not guarantee the same output.

This isn't a bug. It's a fundamental property of how current language models work. Sampling at a non-zero temperature means an identical prompt can produce different responses, and small changes in context length or phrasing move the output further still. The system is non-deterministic by design, and that single property cascades into everything downstream.

You can't write a test that reliably passes or fails, because the system's response to a given input isn't fixed. You can't define a meaningful pass/fail threshold across the full input space, because that space is effectively unbounded. And you can't hold anyone straightforwardly accountable for an output that can't be reproduced, because the evidentiary foundation of the assurance case is gone.

A reasonable objection at this point is that non-determinism isn't unique to AI. Aviation software operates in variable environments. Medical devices encounter unanticipated edge cases. Safety-critical industries have dealt with probabilistic systems for decades and built frameworks that accommodate them. This is true and worth taking seriously, rather than waving away.

The difference with AI is one of degree, but the degree matters more than that phrase usually implies. In aviation, the input space is bounded and well-characterised. The behaviour of a flight management system emerges from a specification written by engineers who understood what it needed to do. In a language model, the input space is effectively the full range of human expression, and the behaviour emerges from training on data rather than from a specification anyone wrote. The gap between "here is the output" and "here is why the system produced it" is orders of magnitude larger. The assurance tools built for DO-178C don't transfer cleanly, and pretending they do is one of the main ways compliance theatre gets substituted for genuine assurance.

There's a second problem, and in my view it's the less discussed one.

AI systems are structurally inclined toward confident output. Not because they're deceptive, but because they're optimised to produce fluent, complete-seeming responses. Uncertainty and edge cases don't naturally surface. A model that has handled 80% of cases correctly and is uncertain about the remaining 20% doesn't say so. It produces something that looks like it handled all of them.

This matters enormously for assurance. A traditional system fails in ways that are legible: exceptions, errors, null outputs, boundary violations. An AI system fails in ways that require domain expertise to detect, because the failure looks like success. The output is confidently wrong rather than obviously broken, and confident wrongness is much harder to catch because the normal signal that triggers human review is absent.

This is also, I suspect, a large part of why AI gets trusted so quickly in organisational settings. Confidence is socially legible as competence. A tool that produces fluent, complete-looking output gets treated as a tool that's working. Nothing in the output says otherwise, so the review step gets skipped, not out of carelessness, but because nothing triggered it.

What assurance has to become

The answer isn't to abandon assurance. It's to rebuild it honestly for a non-deterministic system, which requires being explicit about what you are and aren't claiming.

My working position is that AI assurance needs to be probabilistic, layered, and continuous.

Probabilistic means accepting that you're characterising a distribution, not verifying deterministic behaviour. The system behaves correctly in X% of cases within Y bounds, under Z conditions. That's a weaker claim than traditional assurance makes. It's also an honest one, which is more than can be said for applying a deterministic framework to a non-deterministic system and filing the paperwork as though it means something.
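
To make the shape of that claim concrete, here is a minimal sketch of how an evaluation result might be turned into a statement about a distribution. It assumes a fixed test set run under fixed conditions, with each case judged pass or fail independently, and it uses the Wilson score interval as one reasonable way to put a lower bound on the true pass rate. The interval choice and the function names (wilson_interval, supports_claim) are illustrative assumptions, not a prescribed method.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 is roughly 95% confidence).

    Assumes each trial is an independent pass/fail judgement made under
    the same defined conditions.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= passes <= trials:
        raise ValueError("passes must be between 0 and trials")

    p_hat = passes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def supports_claim(passes: int, trials: int, claimed_rate: float) -> bool:
    """True only if the lower bound of the interval clears the claimed rate."""
    lower, _ = wilson_interval(passes, trials)
    return lower >= claimed_rate


# 95 passes out of 100 looks like "95% correct", but the evidence is too thin
# to support a 95% claim; ten times the sample supports a 93% claim.
print(supports_claim(95, 100, 0.95))    # False
print(supports_claim(950, 1000, 0.93))  # True
```

The design choice worth noticing is that the claim is tested against the lower bound, not the observed rate. A small evaluation set can only support a modest claim, however good the headline number looks.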

Layered means no single control is sufficient. Evaluations against defined test sets. Behavioural monitoring in production. Input and output controls. Traditional deterministic testing where it still applies, which is mostly the scaffolding around the model rather than the model itself. Mechanistic interpretability where it's mature enough to be useful (a qualifier that currently does a lot of work). Each layer catches what the others miss, and the combination is what makes the overall case credible.

The point here isn't that every organisation needs to replicate what the major AI laboratories have built for their own model development. Those frameworks exist and organisations should know them. The point is that every organisation deploying AI needs some version of this, sized to their context and their risk. The mature safety-critical industries didn't wait for perfect frameworks. They built what they could with what they had and improved it over time. That approach is available now, and in most organisations it isn't being taken.

Continuous means the assurance doesn't end at deployment. A model's effective behaviour shifts with context: with the prompts and plugins layered on top of it, the users interacting with it, the data it encounters. The system deployed six months ago isn't quite the system running today. The monitoring needs to be permanent, not periodic.
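
One way to picture permanent rather than periodic monitoring is a rolling check over recent production outcomes. The sketch below assumes that some fraction of outputs is being judged pass or fail after the fact, by a human reviewer or an automated check, and that the organisation has already agreed a floor for the acceptable pass rate. The class name RollingPassRate and its parameters are hypothetical, chosen only for illustration.

```python
from collections import deque
from typing import Optional


class RollingPassRate:
    """Tracks the pass rate over the most recent judged outputs."""

    def __init__(self, window: int, floor: float, min_samples: int) -> None:
        if window <= 0 or min_samples <= 0:
            raise ValueError("window and min_samples must be positive")
        self.floor = floor
        self.min_samples = min_samples
        # A bounded deque drops the oldest outcome once the window is full.
        self._outcomes: deque = deque(maxlen=window)

    def record(self, passed: bool) -> None:
        """Record one judged output from production."""
        self._outcomes.append(passed)

    def rate(self) -> Optional[float]:
        """Current pass rate, or None until enough samples have arrived."""
        if len(self._outcomes) < self.min_samples:
            return None
        return sum(self._outcomes) / len(self._outcomes)

    def breached(self) -> bool:
        """True when the recent pass rate has fallen below the agreed floor."""
        current = self.rate()
        return current is not None and current < self.floor


monitor = RollingPassRate(window=5, floor=0.8, min_samples=5)
for outcome in [True, True, True, True, True]:
    monitor.record(outcome)
print(monitor.breached())  # False: 5 of the last 5 passed

for outcome in [False, False]:
    monitor.record(outcome)
print(monitor.breached())  # True: only 3 of the last 5 passed
```

A real deployment would need far larger windows and a considered way of sampling which outputs get judged. The point of the sketch is only that the check runs on every recorded outcome, not at a quarterly review.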

A small and instructive example

Let me make this concrete with something recent and undramatic: not a catastrophic failure, but the kind of quiet structural problem that tends to surface by accident rather than through any review process.

A platform team rolled out a service mesh across a large number of engineering teams. They wrote a guide. Each team was responsible for following it. Partway through the rollout, someone introduced an AI coding tool to automate the bulk of the changes: faster, less manual, and conveniently, it counted toward an AI adoption target the team was being measured against.

The tool was imperfect. It missed cases. It didn't account for all the variations in the codebases it was touching. But because the task was seen as administrative overhead with no direct impact on anyone's own work, and because the output looked complete and coherent, nobody was reading it carefully. There was no checkpoint in the process that required them to. The original workflow had assumed human implementation with human judgement at the point of submission. When the AI tool substituted for that human effort, nobody asked whether the assurance assumptions still held.

The gap surfaced not through any review mechanism but through a demo, a conversation in which someone happened to show the owning team what the tool was doing. The owning team then had to explicitly ask for the generated code to be reviewed. That ask should have been structural from the moment an AI tool entered the workflow. It wasn't.

A sceptical reader might say this is a change management failure. Someone introduced a tool without proper governance and the process lacked a review gate. That would be true of any unsanctioned tool.

Partly, yes. But two things made this worse than an equivalent failure with a conventional tool. The first was the incentive structure: an AI adoption target actively accelerated the introduction in a way that generic tooling rarely does. The second was the output. A misconfigured script throws errors. A broken pipeline fails visibly. The AI tool produced something that looked done in every case, whether it had handled the case correctly or not. Nothing in the output told the engineers they needed to look harder, so the review step was skipped without anyone being careless.

That's the AI-specific property that makes the missing checkpoint load-bearing in a way it isn't for conventional tooling. The system won't surface its own uncertainty. It won't indicate when its output needs checking. That responsibility sits entirely with the process around it, and most processes haven't been designed with that in mind.
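
A structural checkpoint of the kind that was missing can be very small. The sketch below assumes a change record that knows whether an AI tool produced it, and it blocks the change until someone other than the submitter has approved it. The Change fields and the blocking_reasons function are hypothetical illustrations, not a description of what the team in this example actually ran.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    """A proposed change, with just enough metadata to drive a review gate."""

    change_id: str
    submitted_by: str
    generated_by_ai: bool
    approved_by: list[str] = field(default_factory=list)


def blocking_reasons(change: Change) -> list[str]:
    """Return the reasons a change may not merge; an empty list means it may."""
    reasons: list[str] = []
    if change.generated_by_ai:
        # The tool will not flag its own doubtful cases, so the process
        # requires a human reviewer who is not the person who ran the tool.
        independent = [r for r in change.approved_by if r != change.submitted_by]
        if not independent:
            reasons.append(
                "AI-generated change requires approval from someone other than the submitter"
            )
    return reasons


unreviewed = Change("mesh-042", submitted_by="dev-a", generated_by_ai=True)
print(blocking_reasons(unreviewed))  # one blocking reason

reviewed = Change(
    "mesh-042", submitted_by="dev-a", generated_by_ai=True, approved_by=["owner-b"]
)
print(blocking_reasons(reviewed))  # []
```

The gate keys on how the change was produced, not on whether the output looks suspicious, because looking suspicious is exactly what this kind of output does not do.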

The gap, plainly stated

The external assurance that comes with a well-tested foundation model, whatever the developer has done in terms of safety evaluation and red-teaming, doesn't reach your deployment context. It couldn't. The model developer has no knowledge of your workflows, your users, your prompts, your plugins, or your organisational incentives. The application context is yours. The assurance of it is yours.

Most organisations currently have no systematic way of providing it. They have a non-deterministic system, traditional assurance frameworks that don't fit it, an incomplete toolset for building new ones, and incentive structures that reward adoption over governance. The result looks a great deal like what happened in cybersecurity: frameworks that exist but aren't applied seriously, compliance-led approaches that satisfy regulators without addressing the underlying risk, and a growing gap between the systems being deployed and any honest account of their reliability.

I don't have a complete answer to that. Nobody does yet. What I have is a practitioner's pattern recognition and a working suspicion that AI is next in a sequence that has played out before, with stakes that are harder to bound.

The question worth sitting with: if the application context is where the real assurance gap lives, and existing frameworks don't reach it, what would it actually take to build something that does?

That's the problem worth working on.

Seiya Alger-Hilton is an engineering manager with twenty years in regulated financial systems. Hilton Labs is his outlet for thinking carefully about AI assurance and governance. Correspondence welcome: seiya@hiltonlabs.org