Synthetic data has been proposed as a route around the most difficult constraint in regulated AI development: you cannot train a model on data you cannot access at scale. The pitch is compelling -- generate synthetic patients, synthetic transactions, synthetic clinical notes, train your model on those, and never touch the regulated source data. The reality is that synthetic data does not eliminate the regulatory problem. It restructures it into a set of questions that most teams are not prepared to answer.
What Synthetic Data Actually Provides
Synthetic data generation produces new data records that statistically resemble the source dataset without directly copying any original record. The spectrum runs from purely statistical methods -- fitting a multivariate distribution and sampling from it -- to generative AI approaches including variational autoencoders, generative adversarial networks, and diffusion models. Statistical methods preserve macro-level distributions but may not capture complex feature interactions. Generative AI methods can produce highly realistic synthetic records but introduce new privacy risks through memorisation.
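The purely statistical end of the spectrum can be sketched in a few lines: fit a multivariate Gaussian to the source table and sample new records from the fitted distribution. The feature names and numbers below are illustrative, not drawn from any real dataset; the point is that macro-level moments carry over while record-level structure does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical source table: 1,000 patients with three numeric features
# (age, systolic BP, cholesterol). Simulated here; in practice this
# would be the regulated source data.
source = rng.multivariate_normal(
    mean=[55.0, 130.0, 200.0],
    cov=[[100.0, 15.0, 20.0],
         [15.0, 120.0, 30.0],
         [20.0, 30.0, 400.0]],
    size=1000,
)

# Purely statistical generation: estimate the mean vector and
# covariance matrix, then sample synthetic records from the fit.
mu = source.mean(axis=0)
cov = np.cov(source, rowvar=False)
synthetic = rng.multivariate_normal(mu, cov, size=1000)

# Macro-level distributions are preserved (means and covariances match
# closely), but any non-Gaussian feature interaction is lost.
print(np.round(mu, 1))
print(np.round(synthetic.mean(axis=0), 1))
```

A Gaussian fit is the simplest possible choice; copula- or tree-based samplers occupy the middle of the spectrum before the generative AI methods discussed above.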
The privacy guarantee that synthetic data provides depends on the generation method and the evaluation methodology applied. It is not a property that follows automatically from calling data synthetic. A GAN trained on rare disease patient records can memorise low-frequency training examples and reproduce them in synthetic output, creating synthetic records that are effectively reconstructed real patients. This has been demonstrated empirically on medical imaging and clinical record datasets.
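One crude but useful memorisation check is a nearest-neighbour distance test: flag any synthetic record whose closest training record is suspiciously near, since a near-duplicate is effectively a reconstructed source record. The sketch below plants three near-copies in simulated generator output to show the mechanic; the threshold and data are illustrative, and a real evaluation would also need holdout baselines and a stated adversary model.

```python
import numpy as np

def memorisation_flags(train, synthetic, threshold):
    """Flag synthetic records whose nearest training record lies closer
    than `threshold` -- a crude proxy for memorised/reconstructed rows."""
    flags = []
    for s in synthetic:
        nearest = np.linalg.norm(train - s, axis=1).min()
        flags.append(nearest < threshold)
    return np.array(flags)

rng = np.random.default_rng(1)
train = rng.normal(size=(200, 5))

# Simulate a leaky generator: most output is fresh, but the first three
# rows are near-copies of training examples (memorisation).
synthetic = rng.normal(size=(50, 5))
synthetic[:3] = train[:3] + rng.normal(scale=1e-3, size=(3, 5))

flags = memorisation_flags(train, synthetic, threshold=0.05)
print(flags.sum())  # 3 -- the planted near-copies are flagged
```

Distance-to-nearest-record is only one signal; membership inference and attribute inference tests probe the same risk from the attacker's side.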
HIPAA De-Identification Requirements for Synthetic Data
HIPAA provides two methods for de-identifying protected health information: Safe Harbor, which requires removing 18 specific identifier categories, and Expert Determination, which requires a qualified expert to certify that the risk of re-identification is very small. Synthetic data claimed to be de-identified under HIPAA must satisfy one of these standards. The Expert Determination method is the more defensible path for generative synthetic data, as Safe Harbor's 18-identifier list does not address re-identification through complex feature combinations that generative models may preserve.
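As a minimal illustration of why Safe Harbor is a checklist rather than a risk assessment, the sketch below screens a record's fields against a subset of the identifier categories. The field names are hypothetical, the set is deliberately incomplete (the rule lists 18 categories, with nuances such as permitting the first three ZIP digits under a population condition), and passing this kind of check says nothing about re-identification through feature combinations.

```python
# Illustrative subset of HIPAA Safe Harbor identifier categories,
# mapped to hypothetical column names. The real rule enumerates 18
# categories (names, dates other than year, SSNs, biometrics, ...).
SAFE_HARBOR_FIELDS = {
    "name", "ssn", "mrn", "email", "phone",
    "street_address", "zip_code", "full_date_of_birth", "ip_address",
}

def safe_harbor_violations(record: dict) -> set:
    """Return the identifier fields a record still carries."""
    return SAFE_HARBOR_FIELDS & set(record)

record = {"age": 67, "zip_code": "02139", "diagnosis": "I10"}
print(safe_harbor_violations(record))  # {'zip_code'}
```

A column-presence check like this is the ceiling of what Safe Harbor automation can do, which is exactly why Expert Determination is the more defensible path for generative output.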
The Expert Determination process requires an expert -- typically a biostatistician or privacy engineer -- to evaluate the synthetic data generation method, assess the re-identification risk given the specific dataset characteristics and threat model, and certify the risk level in writing. This certification carries professional liability. Most organisations that have used synthetic PHI without Expert Determination have done so on the assumption that synthetic data is inherently de-identified. That assumption does not satisfy HIPAA.
FDA Requirements for Synthetic Data in Medical Device Submissions
The FDA draft guidance on the use of synthetic data for AI/ML-based medical devices introduces requirements beyond the HIPAA de-identification standard. For SaMD submissions that include synthetic training or validation data, FDA expects evidence that the synthetic data is representative of the clinical population the device will encounter, that the generation methodology preserves the relevant clinical relationships in the data, and that device performance validated on synthetic data is predictive of performance on real patient data. Passing a HIPAA de-identification standard does not address these representativeness and validity requirements.
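Representativeness claims of this kind are ultimately distributional comparisons. One building block is a two-sample Kolmogorov-Smirnov statistic per feature, comparing synthetic output against a real reference sample. The sketch below implements the statistic with numpy and contrasts a representative generator with a drifted one; the populations are simulated and the feature (patient age) is an assumption for illustration, and a real submission would need multivariate and clinically grounded evidence, not a single univariate test.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of the two samples."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

rng = np.random.default_rng(2)
real_age = rng.normal(62, 12, size=2000)      # reference clinical population
good_synth = rng.normal(62, 12, size=2000)    # representative generator
skewed_synth = rng.normal(48, 12, size=2000)  # generator drifted young

print(round(ks_statistic(real_age, good_synth), 3))    # small: distributions agree
print(round(ks_statistic(real_age, skewed_synth), 3))  # large: population shift
```

The same machinery doubles as the ongoing monitoring signal: run it on every generation batch, not once at submission time.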
GDPR and the Synthetic Data Boundary
GDPR applies to the processing of personal data. Whether synthetic data constitutes personal data under GDPR depends on whether individuals can be identified from it, directly or indirectly, by any means reasonably likely to be used. Synthetic data generated from personal data and retaining statistical properties of the source population may still be personal data under GDPR if re-identification is reasonably possible. The Article 29 Working Party opinion on anonymisation makes clear that the standard for GDPR anonymity is higher than many data teams assume. Legal review of the specific synthetic data methodology and the GDPR status of the output is required before treating synthetic data as outside GDPR scope.
Building a Defensible Synthetic Data Program
A defensible synthetic data program for regulated AI development requires: documented source data provenance and the legal basis for processing the source data; a generation methodology appropriate to the data type and privacy requirements; privacy risk evaluation, including re-identification testing against realistic adversary capabilities; a qualified expert assessment where HIPAA Expert Determination is required; and ongoing monitoring of synthetic data quality and its drift relative to real population distributions. The investment is substantial, but it is the only path to synthetic data use that survives regulatory examination.
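The elements of that program are, in the end, an auditable record per release. A minimal sketch of what such a record might capture is below; the structure, field names, and the `is_defensible` gate are hypothetical illustrations of the documentation burden, not a compliance tool.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class SyntheticDataRelease:
    """Hypothetical audit record for one synthetic data release --
    the artefact a regulator or expert reviewer would examine."""
    source_provenance: str              # where the source data came from
    legal_basis: str                    # e.g. HIPAA authorisation, GDPR Art. 6 basis
    generation_method: str              # e.g. "Gaussian copula", "GAN + DP-SGD"
    privacy_tests: dict                 # test name -> measured risk metric
    expert_certification: Optional[str] # reference to the written determination
    release_date: date = field(default_factory=date.today)

    def is_defensible(self) -> bool:
        """Crude gate: every element of the program is documented."""
        return all([
            self.source_provenance,
            self.legal_basis,
            self.generation_method,
            self.privacy_tests,
            self.expert_certification,
        ])

release = SyntheticDataRelease(
    source_provenance="hospital EHR extract, 2020-2023",
    legal_basis="HIPAA authorisation for research use",
    generation_method="Gaussian copula (illustrative)",
    privacy_tests={"nearest_neighbour_distance": 0.42},
    expert_certification=None,  # not yet obtained
)
print(release.is_defensible())  # False until the certification exists
```

The gate is deliberately simple: documentation presence is necessary, never sufficient. The substance of each field still has to withstand expert and regulatory scrutiny.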