Clinical decision support systems built on large language models introduce a failure mode that classical rule-based CDS does not have. A rules engine that checks a drug-drug interaction database either has the rule or it does not. An LLM-based CDS system can generate a confident, plausible, well-formatted clinical recommendation that is factually incorrect -- a fabricated contraindication, a misremembered dosing threshold, a hallucinated drug name. In a clinical context, this is not a model quality problem. It is a patient safety event.
Why Hallucination Is a Patient Safety Risk
Hallucination occurs when an LLM generates text that is not grounded in its training data or retrieved context, but is presented with the same confidence as accurately recalled information. In clinical decision support, the consequences map directly to the harm taxonomy the FDA uses to classify medical device risk: incorrect medication recommendations, missed contraindications, and inaccurate diagnostic suggestions can all contribute to patient harm.
The FDA classification of AI-based CDS under the 21st Century Cures Act distinguishes between CDS that a clinician can independently review and CDS that influences a clinical decision without independent review being practical. An LLM-based system that presents a drug dosing recommendation in a workflow where the clinician is expected to act on it without separately verifying against the primary literature falls into the higher-risk category. The FDA's enforcement discretion policy for CDS does not extend to high-risk SaMD functionality.
Retrieval-Augmented Generation as a Mitigation
Retrieval-augmented generation grounds LLM outputs in authoritative source documents rather than relying solely on the model's parametric knowledge. For clinical CDS, this means retrieving relevant content from a curated corpus -- FDA drug labeling, clinical practice guidelines, peer-reviewed literature -- and including that content in the model prompt before generating a response. The model is instructed to base its response on the retrieved context and to indicate when the retrieved context does not support a claim.
RAG substantially reduces hallucination frequency for factual questions covered by the retrieval corpus. It does not eliminate hallucination. The model can still misinterpret retrieved content, generate incorrect inferences from correct premises, or produce hallucinated text in sections of its response that go beyond the retrieved context. RAG is a necessary component of a hallucination mitigation strategy; it is not sufficient on its own.
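The grounding step described above can be made concrete as prompt assembly: filter retrieved passages by relevance, attach source identifiers, and instruct the model to refuse when the context is silent. The following is a minimal sketch, not a production retriever; the `RetrievedPassage` type, the relevance threshold, and the refusal sentinel are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class RetrievedPassage:
    source_id: str    # e.g. a labeling section or guideline identifier (hypothetical scheme)
    text: str
    relevance: float  # retriever similarity score in [0, 1]

def build_grounded_prompt(question: str, passages: list[RetrievedPassage],
                          min_relevance: float = 0.5) -> str:
    """Assemble a prompt that instructs the model to answer only from the
    retrieved context and to say so explicitly when the context is silent."""
    kept = [p for p in passages if p.relevance >= min_relevance]
    context = "\n\n".join(f"[{p.source_id}] {p.text}" for p in kept)
    return (
        "Answer the clinical question using ONLY the context below. "
        "Cite the bracketed source IDs you relied on. If the context does "
        "not support an answer, reply exactly: INSUFFICIENT CONTEXT.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Keeping source IDs in the prompt also makes the downstream citation check possible: a response that cites no retrieved ID is a signal that the model has wandered beyond its context.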
Confidence Scoring and Human Review Workflows
Production clinical CDS systems require a confidence scoring layer that flags low-confidence recommendations for mandatory human review before they are presented to clinicians. Confidence scoring approaches include calibrated probability estimates from the model, consistency sampling (asking the model the same question multiple times and measuring response variance), and retrieval relevance scoring that indicates how well the retrieved context supports the generated recommendation.
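Of the approaches above, consistency sampling is the simplest to sketch: sample the same question several times at nonzero temperature and treat the agreement rate of the modal answer as a confidence proxy. This assumes answers can be normalized to comparable strings, which holds for short factual outputs (a dose, a drug name) but not for free-text rationales.

```python
from collections import Counter

def consistency_confidence(samples: list[str]) -> tuple[str, float]:
    """Consistency sampling: given N sampled answers to the same question,
    return the modal answer and its agreement rate. High response variance
    across samples maps to low confidence."""
    normalized = [s.strip().lower() for s in samples]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)
```

A sampled run of `["40 mg", "40 mg", "40 MG ", "60 mg"]` normalizes to a 0.75 agreement rate on "40 mg", which a calibrated threshold would then map to a review tier.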
Human review workflows must be designed around the clinical workflow, not around the model architecture. A review workflow that adds 30 seconds of latency to every recommendation will be bypassed in high-acuity settings. Tiered review -- automatic presentation for high-confidence recommendations, mandatory review for low-confidence ones -- requires the confidence scoring to be reliable enough that the high-confidence tier is genuinely trustworthy.
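The tiered routing decision itself is a few lines; the hard part the text identifies is choosing the threshold. A minimal sketch, assuming a single calibrated threshold derived from validation data (the 0.85 default here is a placeholder, not a recommendation):

```python
def route_recommendation(confidence: float,
                         review_threshold: float = 0.85) -> str:
    """Tiered review routing: only recommendations at or above the
    calibrated threshold are presented automatically; everything else
    is queued for mandatory human review before reaching the clinician.
    The threshold must come from calibration data, not a guess."""
    return "auto_present" if confidence >= review_threshold else "human_review"
```

The design risk is exactly the one the paragraph names: if the confidence scores feeding this function are poorly calibrated, the auto-present tier silently admits low-quality recommendations.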
Post-Market Surveillance for Hallucination
FDA SaMD requirements include post-market surveillance obligations for cleared devices. For AI-based CDS, this must include mechanisms for detecting hallucination events in production: clinician feedback channels that capture instances where a recommendation was identified as incorrect, systematic comparison of model outputs against authoritative clinical sources, and adverse event monitoring that connects patient harm events to CDS recommendations. This is not a standard MLOps monitoring capability -- it requires integration with clinical workflow systems and incident reporting processes.
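One piece of that surveillance plumbing is the clinician feedback channel: a structured record linking a flagged recommendation back to its audit trail and gating adverse-event escalation. The schema below is illustrative only; field names and the serialization format are assumptions, and a real system would feed an incident-reporting pipeline rather than return a JSON string.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class HallucinationReport:
    recommendation_id: str      # links back to the audit log entry
    reporter_id: str            # clinician who flagged the output
    incorrect_claim: str        # the specific statement identified as wrong
    authoritative_source: str   # what the clinician verified against
    harm_occurred: bool         # gates adverse-event reporting workflow
    reported_at: str = ""

def capture_report(report: HallucinationReport) -> str:
    """Serialize a clinician-flagged hallucination event for the
    surveillance queue; records with harm_occurred=True would also
    be escalated to the incident-reporting process."""
    if not report.reported_at:
        report.reported_at = datetime.now(timezone.utc).isoformat()
    return json.dumps(asdict(report))
```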
The Engineering Minimum for Safe Clinical CDS
A clinical CDS system built on LLMs requires, at minimum: a RAG architecture grounded in a versioned, curated clinical knowledge corpus; confidence scoring with calibrated thresholds; human review workflows for low-confidence recommendations; comprehensive audit logging of every recommendation and its inputs; and post-market surveillance integration. Teams that begin with a general-purpose LLM and add safety features later discover that the architectural changes required are more extensive than building the safety architecture from the start.
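The audit-logging requirement in that minimum list amounts to capturing, per recommendation, every input needed to reproduce it: the question, the exact retrieved passages, the corpus and model versions, and the confidence score. A sketch under those assumptions (the field names and tamper-evidence hash are illustrative, not a prescribed schema):

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(question: str, retrieved_ids: list[str],
                 corpus_version: str, model_id: str,
                 recommendation: str, confidence: float) -> dict:
    """Build one append-only audit entry for a CDS recommendation,
    pinning the versioned corpus and model so the output can be
    reproduced and investigated later."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "retrieved_ids": retrieved_ids,
        "corpus_version": corpus_version,
        "model_id": model_id,
        "recommendation": recommendation,
        "confidence": confidence,
    }
    # A content hash over the canonicalized entry makes later
    # tampering with stored records detectable.
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["record_hash"] = hashlib.sha256(payload).hexdigest()
    return entry
```

Versioning the corpus in every record is what makes post-market investigation tractable: when a labeling update corrects a dose, you can identify exactly which historical recommendations were generated against the stale corpus.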
The engineering behind this article is available as a service.
We have done this work -- not advised on it, not reviewed documentation about it. If the problem in this article is your problem, the first call is with a senior engineer who has solved it.