Financial Services Engineering · 11 min read · 2024-07-09

Graph Database Applications in Fraud Detection

Financial fraud increasingly exploits the connections between entities rather than the anomalies within individual transactions. Synthetic identity fraud, first-party fraud rings, and money laundering networks are all graph problems — they involve patterns of relationships between accounts, devices, addresses, and people that are invisible in row-based analysis. Neo4j, Amazon Neptune, and TigerGraph provide the traversal semantics that detect these patterns in near-real-time. Integrating graph database queries into a fraud decision engine requires understanding the latency characteristics of different traversal patterns, the licensing and compliance implications of storing PII in a graph store, and the model risk management requirements for AI-assisted fraud scoring under CFPB examination standards.

Fraud detection based on individual transaction anomalies has a fundamental limitation: the most sophisticated and damaging fraud patterns are not detectable in any single transaction. Synthetic identity fraud, first-party fraud rings, and money laundering networks involve webs of relationships between accounts, devices, addresses, and people that row-based analysis cannot see. Graph databases are the technical infrastructure that makes these relationship patterns computationally tractable at the latency required for real-time fraud decisions.

Graph Data Models for Financial Fraud

A financial graph database for fraud detection models the entity-relationship structure of the financial ecosystem. Nodes represent entities: customers, accounts, devices, IP addresses, email addresses, phone numbers, physical addresses, merchants, and beneficial owners. Edges represent relationships: a customer owns an account, an account transacted with a merchant, a device logged into an account, a phone number was associated with an account, an address is shared by multiple customers. The fraud signal emerges from graph traversal: a cluster of synthetic identities shares phone numbers and IP addresses that link them in the graph even when no single identity appears anomalous in isolation.
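The entity-relationship model above can be sketched in a few lines of Python. This is a minimal, illustrative in-memory version — the node types and identifiers are made up, and a production system would hold this structure in Neo4j, Neptune, or TigerGraph rather than a dictionary — but it shows how a shared phone number links two otherwise-unremarkable identities:

```python
from collections import defaultdict

# Nodes are (type, value) tuples; each edge links a customer node to an
# identity element (phone, device, IP) associated with it. All identifiers
# below are fabricated for illustration.
edges = [
    (("customer", "C1"), ("phone", "555-0100")),
    (("customer", "C1"), ("device", "fp-aa01")),
    (("customer", "C2"), ("phone", "555-0100")),   # shares a phone with C1
    (("customer", "C2"), ("ip", "203.0.113.7")),
    (("customer", "C3"), ("ip", "203.0.113.7")),   # shares an IP with C2
    (("customer", "C4"), ("phone", "555-0199")),   # linked to no one
]

adjacency = defaultdict(set)
for src, dst in edges:
    adjacency[src].add(dst)
    adjacency[dst].add(src)

def linked_customers(customer):
    """Customers reachable in two hops via a shared identity element."""
    linked = set()
    for element in adjacency[("customer", customer)]:
        for node in adjacency[element]:
            if node[0] == "customer" and node[1] != customer:
                linked.add(node[1])
    return linked
```

Here `linked_customers("C1")` returns `{"C2"}` through the shared phone, and `linked_customers("C2")` returns `{"C1", "C3"}` — the cluster emerges from traversal even though no single identity is anomalous on its own.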

The specific Cypher query in Neo4j, Gremlin query in Amazon Neptune or JanusGraph, or GSQL query in TigerGraph that detects a fraud ring traverses the graph to find communities of accounts connected through shared identity elements. A query that finds all accounts sharing a device fingerprint with an account under fraud investigation, and all accounts sharing an email domain with those accounts, then ranks the result by the density of shared elements, returns a ranked candidate list of potentially related fraudulent accounts in milliseconds. The equivalent query in a relational database would require multiple self-joins and significant execution time.
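The logic of that ring-expansion query can be sketched in plain Python (here simplified to a single expansion stage that ranks candidates by the number of identity elements shared with the seed account; the account IDs and elements are invented for illustration, and in production this would be a Cypher, Gremlin, or GSQL query against the graph store):

```python
from collections import Counter, defaultdict

# Each account maps to the identity elements observed for it.
account_elements = {
    "A1": {("device", "fp-01"), ("email_domain", "mail.example")},
    "A2": {("device", "fp-01"), ("email_domain", "mail.example")},
    "A3": {("email_domain", "mail.example")},
    "A4": {("device", "fp-99")},
}

# Reverse index: identity element -> accounts that carry it.
element_accounts = defaultdict(set)
for account, elements in account_elements.items():
    for element in elements:
        element_accounts[element].add(account)

def ring_candidates(seed):
    """Rank accounts linked to `seed` by the density of shared elements."""
    shared = Counter()
    for element in account_elements[seed]:
        for account in element_accounts[element]:
            if account != seed:
                shared[account] += 1
    return shared.most_common()  # highest shared-element density first
```

Starting from `A1` under investigation, `ring_candidates("A1")` returns `[("A2", 2), ("A3", 1)]`: `A2` shares both a device fingerprint and an email domain, `A3` only the domain, and `A4` never appears.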

Real-Time vs. Batch Graph Analytics

Graph fraud detection operates at two timescales with different architectural requirements. Real-time graph lookup at transaction authorisation time requires a graph database with sub-100-millisecond query latency for single-hop and two-hop traversals. Neo4j's native graph storage, TigerGraph's distributed graph engine, and Amazon Neptune with Gremlin can satisfy this latency requirement for the specific traversal patterns involved in real-time authorisation. The graph database must be pre-populated with the current state of entity relationships, updated as new relationships are discovered, and queried with traversal patterns that are indexed for the specific real-time use cases.
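The real-time constraint can be made concrete with a latency-budgeted two-hop lookup. The sketch below runs against a toy in-memory adjacency index (the node names and the 100 ms budget are illustrative; a production path would issue the equivalent traversal to the graph database and enforce the deadline at the client):

```python
import time

# Toy adjacency index standing in for the pre-populated graph store.
ADJACENCY = {
    "acct-1": {"dev-1"},
    "dev-1": {"acct-1", "acct-2"},
    "acct-2": {"dev-1"},
}

def two_hop(node, budget_ms=100):
    """Two-hop neighbourhood of `node`, abandoned if the budget is blown."""
    deadline = time.monotonic() + budget_ms / 1000.0
    seen, frontier = {node}, {node}
    for _ in range(2):  # exactly two hops for the authorisation-time check
        nxt = set()
        for n in frontier:
            if time.monotonic() > deadline:
                return None  # budget exceeded: caller falls back
            nxt |= ADJACENCY.get(n, set()) - seen
        seen |= nxt
        frontier = nxt
    return seen - {node}
```

Returning `None` on a blown budget, rather than blocking, lets the authorisation path degrade to non-graph scoring instead of missing its own SLA.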

Batch graph analytics for ring detection, community discovery, and network topology analysis operate at a different timescale and can use different tooling. GraphX on Apache Spark, graph neural network libraries such as PyTorch Geometric and Deep Graph Library, and in-database graph extensions are used for batch fraud network analysis that does not require sub-second response times. The architectural decision is which analyses must operate in the real-time path and which can operate in the batch path.
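As a minimal example of the batch path, a first-pass ring candidate list can be produced by collapsing accounts that share any identity element into connected components with union-find. The pairs below are illustrative; at scale this computation would run in Spark/GraphX over the full edge set rather than in one process:

```python
from collections import defaultdict

# Pairs of accounts linked by at least one shared identity element.
shared_element_pairs = [
    ("A1", "A2"),  # shared device fingerprint
    ("A2", "A3"),  # shared email domain
    ("A4", "A5"),  # shared physical address
]

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for a, b in shared_element_pairs:
    union(a, b)

components = defaultdict(set)
for account in parent:
    components[find(account)].add(account)

# Each component is a candidate ring for analyst review.
rings = sorted(sorted(c) for c in components.values())
```

The transitive linkage matters: `A1` and `A3` share nothing directly, but land in the same component through `A2` — exactly the pattern that per-transaction analysis misses.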

PII in Graph Stores: Compliance Architecture

Graph databases store PII and financial data that is regulated under GLBA, CCPA, GDPR, and FCRA. The access control models of major graph databases are less mature than those of SQL databases. Neo4j Enterprise provides role-based access control with property-level access restrictions in Neo4j 4.4 and above. Amazon Neptune uses IAM authentication for API access with resource-based policies. TigerGraph provides RBAC for schema object access. None of these approaches match the column-level security maturity of Snowflake or PostgreSQL with row-level security.

The compliance architecture for a regulated graph fraud database typically implements access control at the application service layer rather than relying solely on the database's native controls. A fraud intelligence API that sits between the graph database and consuming applications enforces authorisation decisions, logs every graph query with the requesting analyst's identity, and redacts PII from results returned to analysts who are not authorised to see specific fields. This pattern compensates for graph database access control immaturity and is the pragmatic approach given the current state of graph database security tooling.
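The service-layer control described above reduces to two obligations per request: log who asked, and filter what comes back. A minimal sketch, with invented role names and field lists (a real deployment would pull entitlements from the bank's IAM system and ship audit events to a tamper-evident store):

```python
import logging

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("fraud-intel-audit")

# Illustrative role-to-field entitlements.
ROLE_VISIBLE_FIELDS = {
    "fraud_analyst": {"account_id", "risk_score", "ssn", "phone"},
    "ops_readonly": {"account_id", "risk_score"},
}

def query_graph(analyst_id, role, raw_results):
    """Log the request and redact PII the role is not entitled to see."""
    audit_log.info("analyst=%s role=%s results=%d",
                   analyst_id, role, len(raw_results))
    visible = ROLE_VISIBLE_FIELDS.get(role, set())
    return [
        {k: (v if k in visible else "REDACTED") for k, v in record.items()}
        for record in raw_results
    ]
```

An `ops_readonly` caller querying the same graph result sees `ssn` and `phone` as `REDACTED`, while the audit log retains the analyst identity for every query — the two properties the native graph-store controls cannot yet guarantee on their own.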

Model Risk Management for Graph-Based Fraud Scoring

Graph-based fraud scores that contribute to adverse action decisions such as declining a transaction, closing an account, or filing a Suspicious Activity Report are subject to model risk management requirements under SR 11-7 and CFPB examination scrutiny for fair lending compliance. A graph traversal algorithm that assigns higher fraud risk to accounts connected to certain network topologies must be validated for both predictive accuracy and disparate impact. If the graph connectivity patterns that predict fraud also correlate with protected class membership, the fraud model may have discriminatory impact that requires validation and monitoring.
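One concrete monitoring check for the disparate impact concern is the adverse impact ratio on decline rates between groups. The sketch below uses the conventional four-fifths threshold as a review flag; the numbers are invented, and actual thresholds and group definitions belong to the bank's fair lending programme, not the scoring team:

```python
def adverse_impact_ratio(declines_a, total_a, declines_b, total_b):
    """Ratio of favourable-outcome (approval) rates, group A vs reference B.

    A ratio below ~0.8 -- the "four-fifths rule" -- is a common trigger
    for further fair lending review of the model's decisions.
    """
    approval_a = 1 - declines_a / total_a
    approval_b = 1 - declines_b / total_b
    return approval_a / approval_b

# Illustrative figures: 30% declines in group A vs 10% in the reference group.
ratio = adverse_impact_ratio(declines_a=30, total_a=100,
                             declines_b=10, total_b=100)
flagged = ratio < 0.8
```

With approval rates of 0.70 versus 0.90, the ratio is about 0.778 — below the flag — so the graph features driving the differential would need investigation even if their predictive accuracy is excellent.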

SR 11-7 validation for graph-based fraud models requires documenting the traversal logic and feature engineering, validating the model on out-of-time and out-of-sample test sets, stress-testing against novel fraud patterns not present in the training data, and ongoing monitoring of model performance and protected class impact. The graph model built by a data science team without a model risk management framework creates regulatory exposure that surfaces when the CFPB examines the bank's adverse action model inventory.

Integration with Existing Fraud Infrastructure

Graph fraud detection supplements rather than replaces existing rule-based and ML-based fraud detection infrastructure. The integration architecture must route transaction events to both the existing fraud engine and the graph feature extraction service, combine graph-derived features with traditional features, and produce a composite fraud score within the authorisation latency budget. The graph feature extraction service must meet the same SLAs as the core fraud engine. A degraded-mode implementation that returns a neutral graph feature score when the graph database is unavailable, falling back to non-graph fraud scoring, is the operational resilience pattern for production graph fraud integration.
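The degraded-mode pattern described above can be sketched as a composite scorer that substitutes a neutral graph score whenever the graph feature service fails, so the decision falls back to non-graph signals. The weights and neutral value are illustrative, not recommendations:

```python
NEUTRAL_GRAPH_SCORE = 0.5  # neither raises nor lowers the composite

def composite_fraud_score(txn, base_scorer, graph_scorer,
                          w_base=0.7, w_graph=0.3):
    """Blend traditional and graph-derived scores, degrading gracefully."""
    try:
        graph_score = graph_scorer(txn)
    except Exception:
        # Graph store unavailable or over budget: degraded mode.
        graph_score = NEUTRAL_GRAPH_SCORE
    return w_base * base_scorer(txn) + w_graph * graph_score
```

If the base model scores a transaction at 0.8 and the graph service is down, the composite is 0.7 × 0.8 + 0.3 × 0.5 = 0.71 — authorisation proceeds on traditional signals rather than failing closed or timing out.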
