AI in Regulated Industries · Healthcare · 11 min read · 2026-03-20

LLM Hallucination in Healthcare: Engineering Risk Mitigation That Satisfies FDA

21 CFR 820: FDA Quality System Regulation requiring documented risk controls for software used in medical devices.
LLM hallucination in healthcare isn't a research problem — it's an FDA Software as a Medical Device (SaMD) post-market surveillance problem. The mitigation architectures that satisfy FDA's Real-World Performance monitoring requirements include RAG with source validation, deterministic guardrails, and human-in-the-loop checkpoints that produce auditable decision records. Deploying an LLM in a clinical workflow without these architectures in place is not a compliance gap — it's a product liability exposure.

The FDA's guidance on Artificial Intelligence and Machine Learning-Based Software as a Medical Device (AI/ML-Based SaMD) addresses predetermined change control plans, real-world performance monitoring, and transparency to users. It does not use the word "hallucination." It does not need to. When an LLM deployed in a clinical decision support system produces a confident, detailed, factually incorrect output about drug interactions, contraindications, or diagnosis criteria, the regulatory consequence is covered by 21 CFR Part 820 Quality System Regulation, 21 CFR Part 803 Medical Device Reporting, and the FDA's post-market surveillance obligations — regardless of what the technical failure mode is called.

What "Hallucination" Means in Regulatory Terms

LLM hallucination — the generation of fluent, confident output that is factually unsupported or directly contradicted by the input context — creates two distinct regulatory problems in healthcare. First, it is a product defect under the FDA's definition: a malfunction that could cause or contribute to serious injury or death is reportable under 21 CFR Part 803. Second, it creates liability exposure under state medical practice acts and professional liability frameworks that may not have been designed with stochastic AI outputs in mind. The engineering response must address both: prevent the defect from reaching clinical use, and produce the evidence trail that demonstrates the prevention mechanism functioned.

The Engineering Reality

The FDA's 2021 AI/ML SaMD Action Plan introduced the concept of "Real-World Performance" monitoring — continuous post-deployment evaluation against real clinical outcomes. For LLMs, this requires collecting ground truth data on AI outputs (which requires clinical workflow integration), computing accuracy metrics against that ground truth, and triggering the predetermined change control process when performance degrades. Most current clinical LLM deployments have none of this infrastructure.
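A minimal sketch of what that missing infrastructure looks like, assuming an exact-match accuracy metric for simplicity (real deployments need task-specific clinical scoring) and hypothetical names throughout:

```python
from dataclasses import dataclass


@dataclass
class OutcomeRecord:
    """One AI output paired with clinician-adjudicated ground truth,
    captured from the clinical workflow rather than inferred."""
    output_id: str
    ai_answer: str
    ground_truth: str


def rolling_accuracy(records: list[OutcomeRecord]) -> float:
    """Exact-match accuracy over the monitoring window (a deliberate
    simplification; clinical tasks need calibrated, task-specific metrics)."""
    if not records:
        return 0.0
    correct = sum(r.ai_answer == r.ground_truth for r in records)
    return correct / len(records)


def check_performance(records: list[OutcomeRecord],
                      baseline: float,
                      tolerance: float = 0.05) -> bool:
    """Return True when observed performance has degraded enough that the
    predetermined change control process should be triggered."""
    return rolling_accuracy(records) < baseline - tolerance
```

The point of the sketch is the shape, not the metric: ground truth flows in from the workflow, accuracy is computed continuously, and the change-control trigger is a predefined threshold rather than an ad hoc judgment call.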

RAG with Source Validation

Retrieval-Augmented Generation (RAG) reduces hallucination rates by grounding LLM outputs in retrieved context. For healthcare applications, the retrieval corpus must be authoritative: clinical guidelines from specialty societies, FDA drug labelling (DailyMed), current ICD-10/CPT code sets, and peer-reviewed clinical evidence. The validation step most RAG implementations skip is verifying that the LLM's output is actually supported by the retrieved context, not merely adjacent to it. This requires either a second LLM pass (expensive) or deterministic extraction of claims from the output and matching them to source passages (complex but auditable).
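The deterministic path can be sketched as follows. This uses naive sentence splitting and token-overlap matching as stand-ins for real claim extraction and entailment checking, so treat the thresholds and helpers as illustrative assumptions, not a production method:

```python
import re


def split_claims(output: str) -> list[str]:
    """Naive sentence split as a stand-in for clinical claim extraction."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", output) if s.strip()]


def supported(claim: str, passages: list[str], threshold: float = 0.6) -> bool:
    """Token-overlap grounding check: is the claim lexically covered by any
    retrieved passage? Real systems would use entailment models or
    structured claim matching; this is the auditable-but-crude baseline."""
    claim_tokens = set(re.findall(r"\w+", claim.lower()))
    if not claim_tokens:
        return False
    for passage in passages:
        passage_tokens = set(re.findall(r"\w+", passage.lower()))
        if len(claim_tokens & passage_tokens) / len(claim_tokens) >= threshold:
            return True
    return False


def validate_output(output: str, passages: list[str]) -> list[tuple[str, bool]]:
    """Return each claim with a supported/unsupported verdict, so unsupported
    claims can be stripped or escalated before the output reaches a clinician."""
    return [(claim, supported(claim, passages)) for claim in split_claims(output)]
```

The value of the deterministic approach is exactly what the crudeness suggests: every verdict is reproducible and explainable, which is what an audit demands.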

The audit trail requirement is specific: for each AI output, the record must show which context documents were retrieved, what the retrieval query was, and whether the output claims are traceable to source passages. This is the evidence the FDA's post-market surveillance process requires, and it is also the evidence a malpractice defence requires if the output contributed to a clinical decision. Build the audit trail into the RAG pipeline, not as a logging afterthought.
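One way to build that record into the pipeline itself, sketched with hypothetical field names (the claims field here would hold per-claim verdicts from whatever validation step the pipeline uses):

```python
import datetime
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RagAuditRecord:
    """Per-output audit record: what was asked, what was retrieved, and
    whether each output claim traced back to a source passage."""
    query: str
    retrieved_doc_ids: list
    claims: list  # e.g. [["claim text", True], ...] from the validation step
    model_version: str
    timestamp: str = field(default_factory=lambda: datetime.datetime.now(
        datetime.timezone.utc).isoformat())

    def record_id(self) -> str:
        """Content-derived identifier, so the record can be referenced and
        tamper-checked later."""
        payload = json.dumps(
            [self.query, self.retrieved_doc_ids, self.claims], sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:16]


def write_audit(record: RagAuditRecord, sink) -> None:
    """Append one JSON line per AI output; sink is any writable file object."""
    entry = asdict(record) | {"record_id": record.record_id()}
    sink.write(json.dumps(entry) + "\n")
```

Because the record is emitted by the same code path that serves the output, there is no output without a record, which is the property a logging afterthought cannot guarantee.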

Deterministic Guardrails

  • Drug interaction checking: route all medication-related LLM outputs through a deterministic drug interaction database (DrFirst, Medi-Span, Multum) before presenting to clinicians — the LLM output is a draft, the database is the authority
  • Diagnosis code validation: any diagnosis or procedure code the LLM generates must be validated against the current ICD-10-CM/PCS or CPT code set before inclusion in clinical documentation
  • Dosing range enforcement: for any dosing recommendation, apply deterministic min/max range checks from the drug label before presenting to prescribers
  • Out-of-scope detection: classify every clinical query against a defined scope boundary — queries about topics outside the validated scope should be refused or escalated, not answered with LLM generation
  • Confidence thresholding: require calibrated confidence scores (not just generation probability) and refuse to present outputs below a threshold calibrated against clinical validation data
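The dosing range check above is the simplest of these to sketch. The limits table below is illustrative only (real limits come from the FDA label via a maintained formulary, not a hard-coded dict), and the three-way verdict reflects the out-of-scope rule from the list: an unvalidated drug is escalated, never guessed:

```python
# Hypothetical label-derived daily dose limits in mg; illustrative values,
# not clinical guidance. Production systems source these from the FDA label.
DOSING_LIMITS_MG = {
    "metformin": (500.0, 2550.0),
}


def check_dose(drug: str, daily_dose_mg: float) -> str:
    """Deterministic min/max check on an LLM-drafted dose.
    Returns 'pass', 'block', or 'out_of_scope'."""
    limits = DOSING_LIMITS_MG.get(drug.lower())
    if limits is None:
        # Drug not in the validated scope: escalate rather than answer.
        return "out_of_scope"
    low, high = limits
    return "pass" if low <= daily_dose_mg <= high else "block"
```

Note the asymmetry: the LLM drafts, the table decides. A "block" or "out_of_scope" verdict stops the output before it reaches a prescriber, which is the entire function of a deterministic guardrail.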

Human-in-the-Loop as an FDA Control

The FDA's guidance on locked versus adaptive AI/ML distinguishes between systems where a human is in the loop and systems where the AI output directly drives clinical action. For SaMD, human-in-the-loop mechanisms are not a UX preference — they are a risk control that affects the device classification and the required level of clinical validation. Engineering human-in-the-loop correctly means ensuring the human actually reviews the output (not just clicks through), that the review is documented in the audit trail, and that the clinical decision maker has the information needed to override the AI recommendation. A checkbox labelled "reviewed" is not a human-in-the-loop control.

Related Articles

  • Architecture: What Happens to Your HIPAA BAAs When You Migrate to Cloud
  • AI in Regulated Industries: Agentic AI in Healthcare: The HIPAA Problems Nobody Is Talking About
  • Compliance Engineering: Why NHS DSPT Failures Are an Engineering Problem, Not a Policy Problem