AI & Machine Learning · financial-services · 13 min read · 2025-10-29

SR 11-7 Model Risk Management for ML Models in Lending

SR 11-7 was written for traditional statistical models, but the Federal Reserve and OCC have made clear that the guidance applies with full force to machine learning models used in credit decisions. The validation frameworks that ML teams typically apply are insufficient for regulatory purposes: they address predictive performance but not model risk in the supervisory sense. This article explains what bank examiners look for in model risk management programmes for ML-based lending models and how to build validation infrastructure that passes scrutiny.

The Federal Reserve and OCC issued SR 11-7 / OCC 2011-12 in April 2011, establishing supervisory guidance on model risk management that has since become the de facto standard for model governance at US banking organisations. The guidance predates the widespread adoption of machine learning in credit decisioning by several years, but the Fed has been explicit in subsequent communications that ML models are models for purposes of SR 11-7 and must be subject to the same rigorous development, validation, and ongoing monitoring processes. Most banks have model risk management frameworks that are well-designed for traditional statistical models and seriously inadequate for ML models used in consequential credit decisions.

What SR 11-7 Actually Requires

SR 11-7 defines a model as a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates. An ML-based credit scoring model fits this definition clearly. The guidance requires that models be subject to conceptual soundness assessment, data quality review, outcome analysis, and stress testing during development; independent validation before deployment; and ongoing performance monitoring with defined escalation triggers for models that exhibit performance degradation.

The validation requirement is the one that ML teams most consistently underestimate. SR 11-7 validation is not the same as the hold-out test set evaluation that data science teams run during model development. Independent validation under SR 11-7 requires a team that is separate from the model development team, operating with defined independence and authority, conducting their own assessment of conceptual soundness, data representativeness, and the appropriateness of the model for its stated purpose. Statistical performance metrics — AUC, KS statistic, Gini coefficient — are part of the validation report but not the whole of it.
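The discrimination metrics named above are straightforward to compute directly. The sketch below is illustrative (the function name and the higher-score-means-riskier convention are assumptions, and it assumes no tied scores; production code would use average ranks for ties):

```python
import numpy as np

def discrimination_metrics(scores, defaults):
    """AUC, KS statistic, and Gini coefficient for a credit score.

    scores:   model scores, higher = riskier (assumed convention)
    defaults: 1 if the account defaulted, 0 otherwise
    Assumes no tied scores; with ties, use average ranks for the AUC.
    """
    scores = np.asarray(scores, dtype=float)
    defaults = np.asarray(defaults, dtype=int)
    n_bad = defaults.sum()
    n_good = len(defaults) - n_bad

    # Rank-based AUC: probability a random defaulter scores above a
    # random non-defaulter (Mann-Whitney formulation)
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    auc = (ranks[defaults == 1].sum() - n_bad * (n_bad + 1) / 2) / (n_bad * n_good)

    # KS: maximum gap between the cumulative bad-capture and good-capture
    # curves across all score cutoffs
    thresholds = np.sort(np.unique(scores))
    cum_bad = np.array([(defaults[scores >= t] == 1).sum() for t in thresholds]) / n_bad
    cum_good = np.array([(defaults[scores >= t] == 0).sum() for t in thresholds]) / n_good
    ks = float(np.max(np.abs(cum_bad - cum_good)))

    gini = 2 * auc - 1  # Gini is a linear transform of AUC
    return float(auc), ks, float(gini)
```

The point of the article stands regardless of how these are computed: the numbers are necessary inputs to the validation report, not a substitute for it.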

The Engineering Reality

A gradient boosting model with an AUC of 0.78 that was built on two years of origination data from a specific economic environment, with no assessment of performance under stress conditions, no analysis of potential proxy discrimination, and no documentation of the conceptual basis for its feature selection, is not a validated model under SR 11-7 regardless of its predictive accuracy.

Conceptual Soundness for ML Models

Conceptual soundness — SR 11-7's requirement that models be grounded in sound theoretical and empirical principles — is the most difficult requirement to satisfy for ML models, because many ML approaches are explicitly atheoretical: they find statistical patterns in data without requiring those patterns to have a causal mechanism. Bank examiners have raised conceptual soundness findings for ML credit models that could not articulate why their features should be predictive of credit risk on theoretical grounds, even when their statistical performance was strong.

The practical response is to constrain ML model feature selection to variables with defensible theoretical rationale — credit bureau attributes, income and employment history, account tenure and utilisation patterns — and to require documentation of the rationale for each feature retained in the final model. Features that improve statistical performance but lack clear rationale either should not be included or should be subject to additional scrutiny in the validation process. This is a design constraint that data science teams trained in maximising predictive accuracy find counterintuitive, but it is the constraint that SR 11-7 imposes.
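One way to make the rationale requirement enforceable rather than aspirational is to treat the feature list as a reviewed artifact in its own right. A minimal sketch, assuming a hypothetical `FeatureSpec` record and a gate that rejects any candidate lacking documented rationale:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureSpec:
    name: str
    source: str      # e.g. "credit bureau", "core banking"
    rationale: str   # documented theoretical link to credit risk

def validate_feature_list(features):
    """Reject any candidate feature whose rationale field is empty.

    A deliberately blunt gate: a feature cannot enter the final model
    without a written rationale for validators to assess.
    """
    missing = [f.name for f in features if not f.rationale.strip()]
    if missing:
        raise ValueError(f"Features lacking documented rationale: {missing}")
    return features

approved = validate_feature_list([
    FeatureSpec("revolving_utilisation", "credit bureau",
                "Higher utilisation indicates liquidity strain, an established default driver."),
    FeatureSpec("months_on_book", "core banking",
                "Seasoned accounts have demonstrated repayment behaviour; tenure is protective."),
])
```

The gate does not judge whether a rationale is sound — that remains the validation team's job — but it guarantees there is always something on file for them to judge.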

Ongoing Monitoring Requirements

Once deployed, an ML credit model must be subject to ongoing monitoring that assesses both statistical performance and population stability. Population stability indices compare the distribution of model inputs in production scoring populations against the development population, flagging when the two have diverged enough that the model may no longer be calibrated correctly. Performance monitoring tracks model discrimination and calibration on new vintages as they season. Both monitoring processes must have defined thresholds that trigger escalation to model owners and, at higher thresholds, full re-validation.
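The population stability index itself is simple to instrument. A minimal sketch, with bin edges fixed on the development sample's quantiles; the function name and the escalation cutoffs in the comment are illustrative conventions, not SR 11-7 prescriptions:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between development ('expected') and production ('actual') distributions.

    Bins are set on the development sample's quantiles so each development
    bin holds ~1/bins of the population; a small floor avoids log(0) in
    sparse bins.
    """
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

# Commonly used (illustrative) escalation bands:
#   PSI < 0.10       stable
#   0.10 <= PSI <= 0.25  investigate, document findings
#   PSI > 0.25       escalate; consider re-validation
```

The same calculation applies per input feature as well as to the final score, which is what makes it cheap to wire into the scoring pipeline at deployment time rather than retrofitting it after an examination finding.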

Banks that build ML models without instrumenting them for population stability monitoring — which is easy to omit when time-to-deployment is the primary objective — discover the gap during examination. The finding is not merely that the model needs monitoring; it is that the bank has been making credit decisions with a model it cannot demonstrate is still performing as intended, which raises questions about the adequacy of the entire MRM programme.

The Model Inventory Problem

SR 11-7 requires banks to maintain a comprehensive model inventory covering all models in use. Banks that have adopted ML widely — in underwriting, fraud, collections, pricing, and customer retention — often have a model inventory problem: they have more models than their MRM infrastructure was designed to handle, and the models turn over faster than the validation cadence can accommodate. Addressing this requires either scaling the validation function or developing a tiered model risk classification that allows lower-risk models to receive lighter-touch validation, freeing capacity for higher-risk models. Both approaches are acceptable under SR 11-7 if the classification methodology is sound and documented.
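A tiered classification can be as simple as a documented decision rule applied uniformly across the inventory. The sketch below is a hypothetical illustration — the tier names, the `ModelRecord` fields, and the exposure cutoff are assumptions a bank would calibrate and document for itself:

```python
from dataclasses import dataclass

@dataclass
class ModelRecord:
    name: str
    decision_impact: str        # e.g. "credit_decision", "operational", "analytics"
    annual_exposure_usd: float  # approximate dollar volume the model influences
    regulatory_use: bool        # feeds a regulatory report or capital calculation

def risk_tier(m: ModelRecord) -> int:
    """Assign a model risk tier: 1 = full independent validation,
    2 = standard validation, 3 = lighter-touch periodic review.

    Cutoffs here are illustrative; the documented methodology is what
    examiners assess, not the specific numbers.
    """
    if m.regulatory_use or m.decision_impact == "credit_decision":
        return 1  # consequential decisions always get full validation
    if m.annual_exposure_usd > 50_000_000:
        return 2
    return 3
```

The classification rule itself becomes a documented artifact of the MRM programme, which is what makes the lighter-touch treatment of Tier 3 models defensible under SR 11-7.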
