The Algorithm
Data Engineering · Cross-Industry · 10 min read · 2024-06-14

Data Quality Engineering: Great Expectations in Production

Data quality failures in regulated pipelines are compliance failures, not just operational problems. A financial report built on miscounted records violates BCBS 239 Principle 2. A clinical dataset with duplicate patient identifiers undermines HIPAA-compliant care coordination. A GDPR subject access request fulfilled from an incomplete dataset invites enforcement action. Great Expectations provides a framework for codifying data quality rules as executable, version-controlled tests that run in the pipeline rather than after it. Connecting expectation suites to regulatory data quality requirements turns a quality framework into a compliance control, one that produces machine-readable evidence for auditors.

Expectation Suites as Regulatory Control Implementations

Great Expectations organises quality rules into Expectation Suites. For regulated pipelines, each expectation suite should trace directly to a regulatory requirement. The expectation that patient_id is not null and matches the MPI format is an implementation of HIPAA de-identification standard requirements. The expectation that a financial report field falls within a prescribed tolerance is an implementation of BCBS 239 Principle 2 data accuracy controls. Documenting this traceability in expectation suite metadata creates a compliance evidence chain: the regulatory requirement, the expectation that enforces it, the validation result that proves it ran, and the Data Docs report that an auditor can inspect.
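
A suite with this traceability metadata can be sketched in Great Expectations' JSON suite format. The expectation types below are real GE expectations; the control references in the meta blocks, the suite name, and the MPI regex are illustrative conventions, not authoritative citations or a GE requirement.

```python
import json

# Sketch of an expectation suite in Great Expectations' JSON suite format.
# Each expectation carries a meta block tracing it to the regulatory
# requirement it implements; the references and regex are illustrative.
suite = {
    "expectation_suite_name": "patient_intake.regulatory",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "patient_id"},
            "meta": {"control": "HIPAA de-identification / MPI integrity"},
        },
        {
            "expectation_type": "expect_column_values_to_match_regex",
            "kwargs": {"column": "patient_id", "regex": r"^MPI-\d{10}$"},
            "meta": {"control": "MPI identifier format"},
        },
    ],
    "meta": {"owner": "data-quality-engineering", "review_cycle": "quarterly"},
}

print(json.dumps(suite, indent=2))
```

Because the suite is plain JSON, the meta blocks survive into validation results and Data Docs, which is what makes the evidence chain machine-readable end to end.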

Great Expectations' profiling capability bootstraps expectation suites from existing datasets by inferring statistical properties. For a legacy regulated pipeline where quality rules exist in tribal knowledge rather than documented controls, profiling against known-good historical data generates an initial expectation suite that captures the implicit quality contract. The engineering team then reviews and augments the profiled suite with explicitly regulatory-required expectations that statistical profiling cannot infer.
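
The profiling idea can be illustrated with a minimal sketch: infer candidate bounds from a known-good sample and emit them as an expectation. Great Expectations' own profilers do this with far richer statistics; the column names and sample data here are invented for illustration.

```python
# Minimal sketch of profiling-style bootstrapping: infer a candidate
# expectation from known-good historical rows. Column names and sample
# values are illustrative.
known_good = [
    {"patient_id": "MPI-0000000001", "age": 34},
    {"patient_id": "MPI-0000000002", "age": 61},
    {"patient_id": "MPI-0000000003", "age": 47},
]

def profile_numeric(rows, column):
    values = [r[column] for r in rows if r[column] is not None]
    return {
        "expectation_type": "expect_column_values_to_be_between",
        "kwargs": {
            "column": column,
            "min_value": min(values),
            "max_value": max(values),
        },
    }

candidate = profile_numeric(known_good, "age")
print(candidate)
# The profiled bounds are a starting point only: the team then widens or
# tightens them against the documented regulatory requirement.
```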

Integration Points in Regulated Pipelines

Great Expectations integrates natively with Apache Spark, pandas, SQLAlchemy-compatible databases, and cloud data platforms including Snowflake, BigQuery, and Databricks. In a regulated pipeline, validation belongs at every boundary that feeds regulatory output: data entering a capital calculation pipeline should be validated at ingestion, after each major transformation stage, and immediately before submission to the regulatory reporting system. Multi-stage validation creates a quality audit trail that identifies exactly where in the pipeline a quality failure occurred.
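
The stage-localisation property can be sketched as follows: the same check runs at each stage, and every run is recorded with its stage name, so a failure is pinned to the boundary where it first appeared. The stage names and column are illustrative, not GE API.

```python
from datetime import datetime, timezone

# Sketch of stage-tagged validation: the same check runs at each pipeline
# stage and each run is appended to an audit trail with its stage name,
# so a failure is localised to where it first appeared.
def validate_not_null(rows, column):
    failures = [i for i, r in enumerate(rows) if r.get(column) is None]
    return {"success": not failures, "failed_rows": failures}

audit_trail = []

def run_stage(stage, rows):
    result = validate_not_null(rows, "position_id")
    audit_trail.append({
        "stage": stage,
        "run_at": datetime.now(timezone.utc).isoformat(),
        **result,
    })
    return result["success"]

rows = [{"position_id": "P1"}, {"position_id": None}]
run_stage("ingestion", rows)
# The audit trail now shows the null was present at ingestion,
# before any transformation ran.
```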

The Checkpoint abstraction in Great Expectations defines a reusable pipeline integration point: which expectation suites to run, against which data assets, with which action list on failure. Actions can include sending alerts, writing results to an S3 bucket for audit retention, or triggering an Airflow DAG failure that halts the pipeline. For regulated pipelines, halting on quality failure is the correct default. A capital report submitted with known data quality failures is worse than a late report with a documented quality gate hold.
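
The halt-on-failure default can be sketched in plain Python. This mirrors the shape of a GE Checkpoint's action list (alerting, audit persistence, orchestrator failure) but is an illustration of the pattern, not the Great Expectations API; the checkpoint name and action are invented.

```python
# Sketch of the checkpoint pattern: bundle validations and failure
# actions, run the actions unconditionally (so evidence is retained even
# for failures), then halt the pipeline by raising. Illustrative, not
# the GE Checkpoint API.
class QualityGateError(Exception):
    """Raised to halt the pipeline when a quality gate fails."""

def run_checkpoint(name, validations, actions):
    results = [v() for v in validations]
    for action in actions:
        action(name, results)
    if not all(r["success"] for r in results):
        raise QualityGateError(f"checkpoint {name!r} failed; pipeline halted")
    return results

archive = []

def archive_results(name, results):
    # In production this would write to versioned object storage for
    # audit retention; here it just collects in memory.
    archive.append({"checkpoint": name, "results": results})

try:
    run_checkpoint(
        "pre_submission",
        validations=[lambda: {"success": False, "expectation": "totals_match"}],
        actions=[archive_results],
    )
except QualityGateError:
    pass  # the orchestrator would mark the DAG run failed here
```

Running the actions before raising is the important ordering: the audit record of the failure must exist even though the pipeline halts.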

Data Docs and Audit Evidence Generation

Great Expectations automatically generates Data Docs: HTML reports showing expectation suite definitions, validation results, and data profiles. For regulated organisations, Data Docs serve as machine-generated audit evidence. A BCBS 239 auditor asking for evidence of data quality controls receives a link to the Data Docs site showing every validation run, every expectation result, and every failure with timestamp and pipeline context. This is substantively different from a manual data quality log because it is generated directly from the same code that enforces the controls, eliminating documentation-to-reality divergence.

Data Docs can be stored in S3, Azure Blob Storage, or GCS with versioning enabled, creating an immutable archive of every quality validation result. Retention of Data Docs for the full regulatory retention period creates the historical quality evidence trail that long-horizon regulatory investigations require. An SR 11-7 model validation that asks whether the input data for a credit model met quality standards at the time the model was deployed can be answered from the Data Docs archive rather than from reconstructed evidence.
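
The retention mechanics can be sketched with a timestamped, write-once key per validation run. A local directory stands in for versioned object storage here; the suite name and layout are illustrative assumptions, not GE's Data Docs store format.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Sketch of audit-retention archiving: every validation result is written
# to a timestamped, never-overwritten key. A temp directory stands in for
# versioned S3/Azure Blob/GCS; the path layout is illustrative.
def archive_validation(base_dir, suite_name, result):
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    path = Path(base_dir) / suite_name / f"{ts}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Open mode 'x' refuses to overwrite an existing file,
    # approximating the immutability of a versioned bucket.
    with open(path, "x") as f:
        json.dump(result, f, indent=2)
    return path

base = tempfile.mkdtemp()
p = archive_validation(base, "capital_report.suite",
                       {"success": True, "evaluated_expectations": 42})
print(p)
```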

Managing Expectation Suite Evolution

Expectation suites must evolve as regulatory requirements change and as the data being processed changes. A regulatory update that adds a required reporting field means adding an expectation for that field's presence and format. Managing these changes requires that expectation suites live in version control alongside the pipeline code. A suite change is a code change that goes through the same review and approval process as any other pipeline modification.
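
One way to enforce the traceability discipline at review time is a CI guard that rejects suite changes whose expectations lack a control reference. The required meta key ('control') is a team convention assumed for illustration, not a Great Expectations requirement.

```python
# Sketch of a CI guard: reject expectation suite changes that contain
# expectations with no regulatory control reference in their meta block.
# The 'control' key is an assumed team convention.
def untraced_expectations(suite):
    return [
        e["expectation_type"]
        for e in suite["expectations"]
        if "control" not in e.get("meta", {})
    ]

suite = {
    "expectation_suite_name": "revenue_recognition.suite",
    "expectations": [
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "contract_id"},
         "meta": {"control": "SOX revenue recognition ITGC"}},
        {"expectation_type": "expect_column_values_to_be_unique",
         "kwargs": {"column": "contract_id"}},  # missing control reference
    ],
}

missing = untraced_expectations(suite)
if missing:
    # In CI this would exit nonzero and block the merge.
    print(f"FAIL: expectations without a control reference: {missing}")
```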

For SOX-regulated financial pipelines, the change management audit trail for expectation suite modifications is itself a compliance requirement. A pull request that modifies the revenue recognition pipeline's expectation suite must be reviewed and approved by designated reviewers before merging. The pull request history serves as the ITGC change management evidence. Treating expectation suites as infrastructure code closes the gap between code-enforced quality controls and compliance-auditable change management.

Beyond Great Expectations: When to Supplement

Great Expectations handles column-level expectations and cross-column rules within a single dataset well. For complex cross-system data quality rules such as reconciling a reported position against two independent source systems, purpose-built reconciliation logic in the pipeline may be more maintainable than forcing the rule into an expectation suite. Great Expectations is the primary data quality framework for regulated pipelines, not the only tool. The reconciliation layer that validates regulatory report totals against source system snapshots is a complementary control, not a replacement for expectation-based quality enforcement.
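
A minimal sketch of such a reconciliation control, assuming invented system names and a tolerance: the reported total is checked against each independent source snapshot, and any break is attributed to the system that produced it.

```python
# Sketch of a cross-system reconciliation control: the reported total is
# compared against independent source-system snapshots within a tolerance.
# System names and tolerance are illustrative; this complements, rather
# than replaces, expectation-suite checks.
def reconcile(reported_total, source_totals, tolerance=0.01):
    breaks = {
        system: total
        for system, total in source_totals.items()
        if abs(total - reported_total) > tolerance
    }
    return {"success": not breaks, "breaks": breaks}

result = reconcile(
    reported_total=1_000_000.00,
    source_totals={"ledger": 1_000_000.00, "risk_engine": 999_400.00},
)
print(result)  # risk_engine disagrees by 600: a reconciliation break
```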

Related Articles
Compliance Engineering · EU AI Act: What CTOs Actually Need to Do Before August 2026
Vendor Recovery · The Vendor Rescue Pattern: How to Recover a Failed Implementation in 12 Weeks
AI in Regulated Industries · The LLM Hallucination Problem in Regulated Environments: What 'Acceptable Error Rate' Actually Means