The compliant data platform is not a product that can be purchased from a single vendor. It is an integration of capabilities: storage, processing, access control, lineage, quality enforcement, observability, and retention management. Together, these provide the compliance properties regulated organisations require. Each capability is available from multiple vendors; the platform is the specific combination of products, configurations, and operational disciplines that makes them function as a coherent, auditable, compliance-native system. This guide describes the reference architecture and the integration decisions that turn a collection of data tools into a compliant data platform.
The Five Compliance Properties a Data Platform Must Provide
Before selecting components, the target compliance properties must be specified. A compliant data platform for regulated industries must provide five properties simultaneously. First, data classification: every field containing PHI, PII, financial data, or other regulated content must be classified at ingestion and that classification must persist through every transformation and storage tier, driving access control, masking, and retention decisions. Second, access control: classified fields must be accessible only to authorised identities, enforced at query execution time through column-level and row-level policies that cannot be circumvented by application code. Third, lineage: every data transformation from source to report must be traceable with sufficient granularity to answer regulatory reconstruction questions.
The fourth property is quality enforcement: data quality rules mapped to regulatory requirements must run in the pipeline and produce machine-readable evidence of execution. The fifth is audit logging: every access to regulated data must be logged with identity, timestamp, and access scope, in tamper-evident storage with retention matching regulatory requirements. The platform architecture must demonstrate all five properties simultaneously. A platform that satisfies four of the five fails compliance requirements because the properties depend on one another: access control, for example, cannot enforce minimum necessary access without knowing which fields are regulated.
Ingestion Layer: Classification at the Source
The ingestion layer is where data classification must be applied. Data that enters the platform without classification cannot be governed correctly downstream. The ingestion pipeline must apply classification tags to every field, drawing on three sources: schema-defined classification where a field named patient_ssn is classified PII by convention, content-based detection using NER models that detect PHI and PII values in text fields, and manual classification governance through a data steward review process for datasets that automated methods cannot fully classify.
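The three classification sources can be sketched as a simple resolution chain. This is a minimal illustration, not a production classifier: the category names, the regexes standing in for an NER model, and the review-queue fallback are all assumptions for the sake of the example.

```python
import re

# Illustrative rules only. In production, name rules come from the governance
# team's classification schema and content detection uses an NER model.
NAME_RULES = {
    re.compile(r"(ssn|social_security)", re.I): "PII",
    re.compile(r"(diagnosis|icd10|mrn)", re.I): "PHI",
    re.compile(r"(account_number|routing)", re.I): "FINANCIAL",
}
CONTENT_RULES = {
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"): "PII",  # SSN-shaped values
}

def classify_field(name: str, sample_values: list[str]) -> str:
    # 1. Schema-defined classification by naming convention.
    for pattern, tag in NAME_RULES.items():
        if pattern.search(name):
            return tag
    # 2. Content-based detection on sampled values (an NER model would
    #    replace these regexes in a real pipeline).
    for value in sample_values:
        for pattern, tag in CONTENT_RULES.items():
            if pattern.search(value):
                return tag
    # 3. Anything unresolved goes to the data steward review queue.
    return "UNCLASSIFIED_PENDING_REVIEW"
```

The important property is the explicit fallback: a field that neither rule set can classify is never silently treated as unregulated.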
For streaming ingestion via Kafka, Schema Registry combined with field-level metadata tagging in the Avro or Protobuf schema carries classification through the pipeline. For batch ingestion via Spark or dbt, source-to-target mapping documents annotated with classification metadata propagate classification to the target table metadata. A data catalog such as Apache Atlas or Databricks Unity Catalog stores the classification tags and makes them queryable for access control policy evaluation and retention rule application.
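As a sketch of the Avro approach: Avro preserves unknown field attributes, so a custom property on each field can carry the classification tag through Schema Registry to every consumer. The attribute name "classification", the record shape, and the namespace are conventions assumed for this example, not an Avro or Confluent standard.

```python
# Hypothetical Avro record with per-field classification attributes.
patient_event_schema = {
    "type": "record",
    "name": "PatientEvent",
    "namespace": "example.ingest",  # illustrative namespace
    "fields": [
        {"name": "patient_ssn",    "type": "string", "classification": "PII"},
        {"name": "diagnosis_code", "type": "string", "classification": "PHI"},
        {"name": "visit_count",    "type": "int",    "classification": "INTERNAL"},
    ],
}

def classified_fields(schema: dict, tags: set[str]) -> list[str]:
    # Downstream consumers read the attribute to drive masking and routing.
    return [f["name"] for f in schema["fields"] if f.get("classification") in tags]
```

A consumer that needs to mask regulated fields can then ask the schema itself which fields those are, rather than maintaining a separate list that drifts out of date.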
Storage Layer: The Regulated Lakehouse
Delta Lake or Apache Iceberg on object storage within a BAA-covered cloud account provides the storage foundation. Object storage in the appropriate region for data residency requirements, with customer-managed key encryption, satisfies both data residency and encryption at rest requirements. Delta Lake's transaction log provides immutable mutation history for audit reconstruction. Partitioning by data classification category and ingestion date enables partition-level retention policy enforcement and minimum necessary access query optimisation.
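The partition layout described above can be made concrete with a small path helper: classification category first, then ingestion date, so both retention jobs and access policies can prune at the partition level. The path convention and bucket name are illustrative assumptions.

```python
from datetime import date

def partition_path(base: str, classification: str, ingest_date: date) -> str:
    # Classification leads the hierarchy so a retention or access rule for
    # one category touches a single subtree rather than scanning the table.
    return (
        f"{base}/classification={classification}"
        f"/ingest_date={ingest_date.isoformat()}"
    )

path = partition_path("s3://lake/clinical", "PHI", date(2025, 1, 15))
# -> "s3://lake/clinical/classification=PHI/ingest_date=2025-01-15"
```

With Delta Lake or Iceberg, the equivalent is declaring these two columns as the table's partition columns; the engine then derives the same layout.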
Hot-warm-cold tiering within the storage layer manages cost against retention obligations. The hot tier is queried directly by the analytical layer and holds the operational retention window. The warm tier holds the mid-term retention window for regulatory reconstruction, stored as compressed Parquet in standard object storage and queryable through Athena or BigQuery. The cold tier holds data for the full regulatory retention period at the lowest possible storage cost using compliance lock storage. Automated lifecycle policies at the object storage layer manage tier transitions; the retention rule engine manages the deletion workflow when data reaches the end of its regulatory retention period.
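The tier transitions can be expressed as an object storage lifecycle configuration. The sketch below follows the shape boto3 accepts for `put_bucket_lifecycle_configuration`; the 90-day and 365-day thresholds, prefix, and rule ID are illustrative assumptions, and note that the rule deliberately contains no expiration action.

```python
# Illustrative S3 lifecycle rules for the hot -> warm -> cold transitions.
lifecycle_config = {
    "Rules": [
        {
            "ID": "regulated-data-tiering",          # hypothetical rule name
            "Filter": {"Prefix": "classification=PHI/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90,  "StorageClass": "STANDARD_IA"},  # warm tier
                {"Days": 365, "StorageClass": "GLACIER"},      # cold tier
            ],
            # No "Expiration" action: deletion is owned by the retention
            # rule engine, never by a blind lifecycle expiry.
        }
    ]
}
```

Keeping deletion out of the lifecycle policy is the design choice that matters: tiering is a cost decision the storage layer can automate, while deletion is a regulatory decision that needs the retention engine's evidence trail.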
Query and Processing Layer: Policy Enforcement at Execution
The query layer is where access control policies are enforced. Databricks Unity Catalog, Snowflake Business Critical, or Apache Ranger on a Spark cluster each provide the column-level masking and row-level security enforcement required. The policy enforcement must be transparent to application code: the same query returns differently masked results based on the querying identity's roles, without requiring the application to implement masking logic. This transparency is what makes the access control enforceable: masking applied at query execution cannot be bypassed by a developer who knows the application's database credentials.
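The semantics of a column mask can be modelled in a few lines: the same value resolves differently depending on the caller's group membership. In Unity Catalog or Snowflake this logic lives in a masking policy or UDF attached to the column; the group name "phi_readers" and the mask format here are assumptions for illustration.

```python
def ssn_mask(ssn: str, caller_groups: set[str]) -> str:
    # Models a column-mask policy: evaluated per query, per identity,
    # with no opportunity for application code to opt out.
    if "phi_readers" in caller_groups:
        return ssn                      # authorised group: full value
    return "XXX-XX-" + ssn[-4:]         # everyone else: masked tail only
```

The point of attaching this to the column rather than the application is that every access path, including ad hoc queries, BI tools, and notebooks, passes through the same policy.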
For transformation workloads, the dbt project structure for regulated pipelines must separate models by data classification tier: raw models that expose full regulated data accessible only to platform engineers, staging models with masking applied accessible to data engineers, and mart models with the appropriate access controls for analytical consumers. dbt's meta field in schema.yml files carries data classification metadata that can be read by CI/CD pipelines to enforce that no model promotes regulated data to a tier with insufficient access controls.
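The CI/CD enforcement described above can be sketched as a check over the parsed schema.yml metadata. The tier names, the allowed-classification map, and the model structure below are illustrative assumptions; a real check would load the YAML files from the dbt project.

```python
# Hypothetical tier policy: which classifications each model tier may expose.
TIER_ALLOWED = {
    "raw":     {"PHI", "PII", "FINANCIAL", "INTERNAL", "PUBLIC"},
    "staging": {"INTERNAL", "PUBLIC"},  # regulated fields must be masked out
    "marts":   {"INTERNAL", "PUBLIC"},
}

def check_model(model: dict) -> list[str]:
    # Returns the columns whose classification exceeds what the model's
    # tier permits; a non-empty result fails the CI pipeline.
    allowed = TIER_ALLOWED[model["meta"]["tier"]]
    return [
        col["name"]
        for col in model.get("columns", [])
        if col.get("meta", {}).get("classification", "INTERNAL") not in allowed
    ]

model = {
    "name": "stg_patients",
    "meta": {"tier": "staging"},
    "columns": [
        {"name": "patient_ssn", "meta": {"classification": "PII"}},
        {"name": "visit_count", "meta": {"classification": "INTERNAL"}},
    ],
}
violations = check_model(model)  # -> ["patient_ssn"]
```

Run as a CI gate, this turns the tiering convention into an enforced invariant: a pull request that promotes a PII column into a staging model fails before it merges.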
Observability and Audit Layer
The observability layer serves both operational and compliance purposes. Structured logs from every pipeline component, enriched with data classification context, correlation IDs, and user identity attributes, feed into the SIEM for both security monitoring and compliance audit. OpenTelemetry traces across ingestion, transformation, and query services provide the access pattern reconstruction capability that compliance investigations require. Platform-level query audit logs provide column-level access evidence that standard application logs do not.
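A minimal sketch of the enriched audit record such pipelines emit, assuming a JSON-lines log format; the field names and the example identity are illustrative, not a standard schema.

```python
import uuid
from datetime import datetime, timezone

def audit_event(user, action, table, columns, classifications,
                correlation_id=None):
    # One structured record per access: identity, scope, classification
    # context, and a correlation ID so the SIEM can stitch a single request
    # across ingestion, transformation, and query services.
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "correlation_id": correlation_id or str(uuid.uuid4()),
        "user": user,
        "action": action,
        "table": table,
        "columns": columns,
        "classifications": sorted(classifications),
    }

event = audit_event("analyst@example.org", "SELECT",
                    "clinical.patients", ["visit_count"], {"INTERNAL"})
```

Carrying the classification tags into the log record is the step most pipelines skip, and it is what lets a compliance query ask "who touched PHI last quarter" without joining logs back to the catalog.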
All audit log streams must be routed to a write-once, long-retention audit archive in addition to the operational SIEM. The operational SIEM has retention optimised for detection use cases, typically 90 days hot and one year warm. The compliance audit archive retains all access events for the full regulatory retention period of the data accessed: six years for HIPAA and seven years for SEC broker-dealer records. The two retention windows require two storage tiers: the SIEM for operational use and the immutable archive for compliance evidence preservation.
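For the immutable archive, object storage WORM features are the usual mechanism. The sketch below shows the shape of an S3 Object Lock write in COMPLIANCE mode (as boto3's `put_object` accepts it); the bucket name, key layout, and the six-year HIPAA-aligned retention are illustrative assumptions, and the bucket must have Object Lock enabled at creation.

```python
from datetime import datetime, timedelta, timezone

HIPAA_RETENTION_YEARS = 6  # adjust per the regulation governing the data

# Illustrative parameters for archiving an audit batch to WORM storage.
archive_put = {
    "Bucket": "audit-archive-example",            # hypothetical bucket
    "Key": "audit/2025/10/batch-0001.jsonl.gz",   # hypothetical key layout
    "ObjectLockMode": "COMPLIANCE",               # retention cannot be
                                                  # shortened or removed,
                                                  # even by the root account
    "ObjectLockRetainUntilDate": datetime.now(timezone.utc)
        + timedelta(days=365 * HIPAA_RETENTION_YEARS),
}
```

COMPLIANCE mode, as opposed to GOVERNANCE mode, is the relevant choice here: it removes the privileged-override path, which is exactly the property an auditor will ask about.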
Organisational Operating Model
The compliant data platform requires an operating model that matches its technical architecture. A platform engineering team owns and operates the infrastructure: the storage layer, the query engine configuration, the access control policy framework, the observability pipeline, and the audit archive. A data governance team owns the classification schema, the retention rule catalogue, and the access control policy definitions that the platform enforces. Domain data teams own their data products and are accountable for meeting the platform's compliance standards. Compliance and legal teams own the regulatory requirements that translate into technical requirements for the platform.
The four-team operating model maps accountability to domain expertise: platform engineers build enforcement mechanisms, governance teams define what is enforced, domain teams implement within the framework, and compliance teams verify the framework matches regulatory obligations. No single team is responsible for all of it, and no compliance gap falls through the boundaries between them, provided the accountability model is designed from the outset rather than retrofitted after the first audit finding.