Reinforcement learning in algorithmic trading is technically appealing because it optimises directly for a reward signal rather than fitting a predictive model that humans then translate into trading rules. An RL agent trained on market microstructure data can discover profitable strategies that human researchers would not have identified. It can also discover strategies that constitute market manipulation, that exceed regulatory position limits, or that create systemic risk -- because the reward function did not encode those constraints, and the agent found the edge cases where reward maximisation conflicts with regulatory compliance.
The Reward Function Problem
The fundamental compliance challenge with RL trading systems is that the agent optimises for whatever the reward function specifies, and reward functions are difficult to specify completely. A reward function based on P&L will cause the agent to discover strategies that maximise P&L without regard to regulatory constraints. Adding penalty terms for position limit violations helps, but a sufficiently capable agent may learn to operate just inside the constraint boundary in ways that technically comply while violating the spirit of the constraint.
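The boundary-hugging behaviour is easy to see in a toy reward function. The sketch below is illustrative, not a real trading reward; `POSITION_LIMIT` and `PENALTY` are assumed constants, and the point is that the penalty is zero everywhere inside the limit, so nothing in the reward discourages the agent from sitting permanently at the boundary.

```python
# Hypothetical reward shaping: P&L minus a soft penalty for breaching
# a position limit. All names and values are illustrative.

POSITION_LIMIT = 10_000   # shares (assumed constraint)
PENALTY = 50.0            # penalty per share of breach (assumed weight)

def reward(pnl: float, position: int) -> float:
    """P&L reward with a soft penalty above the position limit."""
    breach = max(0, abs(position) - POSITION_LIMIT)
    return pnl - PENALTY * breach

# The penalty term is flat at zero inside the limit, so the agent pays
# nothing for operating exactly at the boundary:
assert reward(100.0, POSITION_LIMIT) == reward(100.0, 0)
```

A harder penalty (or a hard constraint in the environment) changes the incentive at the boundary, but the general problem remains: the reward only shapes behaviour along the dimensions it measures.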
MiFID II Article 17 requires investment firms engaged in algorithmic trading to have effective systems and risk controls that prevent the sending of erroneous orders, or the system otherwise functioning in a way that may create or contribute to disorderly market conditions. An RL agent that discovers a momentum-ignition strategy -- placing and cancelling orders to create a false impression of demand -- satisfies its reward function while violating Article 17 and potentially constituting market manipulation under MAR Article 12. The trading firm is liable for the agent's actions.
Kill Switch Architecture
SEC Rule 15c3-5, the Market Access Rule, requires broker-dealers with market access to implement risk management controls and supervisory procedures reasonably designed to prevent the entry of orders that exceed pre-set credit or capital thresholds. For RL trading systems, this implies a kill switch architecture that is independent of the RL agent and that can halt the agent's order flow with sub-millisecond latency when position limits, loss limits, or anomalous behaviour thresholds are breached.
The kill switch must be architecturally independent of the RL agent -- it cannot be a constraint in the reward function or a control learned by the agent, because a sufficiently capable agent may learn to circumvent it. The kill switch is a hardware or firmware-level control that sits between the RL agent and the order management system, with no code path from the agent to bypass it. Position limits must be enforced at the kill switch layer, not in the agent's internal state.
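A software sketch of the gateway pattern follows. It is illustrative only: class and field names are assumptions, the fill handling is simplified, and a production kill switch would run as an independent process or in hardware with thresholds the agent has no code path to modify.

```python
# Minimal sketch of a kill-switch gateway between an RL agent and the
# order management system (OMS). All names are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)  # thresholds are immutable once the gateway starts
class Limits:
    max_position: int
    max_daily_loss: float

class KillSwitchGateway:
    def __init__(self, limits: Limits):
        self.limits = limits
        self.halted = False
        self.position = 0
        self.daily_pnl = 0.0

    def submit(self, side: str, qty: int) -> bool:
        """Forward an order to the OMS only if every pre-trade check passes."""
        if self.halted:
            return False
        projected = self.position + (qty if side == "buy" else -qty)
        if abs(projected) > self.limits.max_position:
            self.halt("position limit")
            return False
        if self.daily_pnl < -self.limits.max_daily_loss:
            self.halt("loss limit")
            return False
        self.position = projected  # simplification: assume immediate fill
        return True                # order would be passed to the OMS here

    def halt(self, reason: str) -> None:
        # In production: cancel open orders, block all new flow, alert humans.
        self.halted = True
```

The design point is that `Limits` is frozen and lives outside the agent: the agent can only call `submit`, and once `halt` fires, no agent action can re-enable order flow.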
Supervisory Explainability
FINRA Rule 3110 requires firms to establish and maintain a system to supervise the activities of each associated person. For RL trading systems, supervision requires the ability to explain, after the fact, why the agent took a specific sequence of actions in a specific market context. This is the RL explainability problem applied to a compliance context: the agent's policy is a high-dimensional function that maps market state to action, and the mapping may not be interpretable by a human supervisor.
The practical approach for heavily regulated trading contexts is to constrain the RL action space to a set of parameterised strategies that are individually understandable, with the RL agent selecting among strategies and their parameters rather than generating unconstrained order flow. This constrains the agent's capability but makes supervisory explanation tractable.
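One way to implement this constraint is a whitelisted strategy space with bounded parameters, validated before any action reaches execution. The strategy names and parameter ranges below are assumptions for illustration, not a recommended catalogue.

```python
# Sketch of a constrained action space: the agent selects a whitelisted
# strategy and bounded parameters rather than emitting raw order flow.
# Strategy names and ranges are illustrative assumptions.
from dataclasses import dataclass

# strategy name -> allowed parameter ranges (min, max)
STRATEGY_SPACE = {
    "twap":  {"duration_s": (60, 3600), "max_participation": (0.01, 0.10)},
    "pov":   {"target_participation": (0.01, 0.15)},
    "limit": {"offset_bps": (0, 25), "qty": (1, 5_000)},
}

@dataclass(frozen=True)
class Action:
    strategy: str
    params: dict

def validate(action: Action) -> Action:
    """Reject any action outside the whitelisted, bounded space."""
    bounds = STRATEGY_SPACE.get(action.strategy)
    if bounds is None:
        raise ValueError(f"unknown strategy: {action.strategy}")
    for name, value in action.params.items():
        lo, hi = bounds[name]  # raises KeyError for unknown parameters
        if not lo <= value <= hi:
            raise ValueError(f"{name}={value} outside [{lo}, {hi}]")
    return action
```

Because every executed action is a named strategy with bounded parameters, a supervisor can explain any episode as "the agent ran TWAP over 300 seconds at 5% participation" rather than reconstructing intent from raw order flow.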
Backtesting and Simulation Risks
RL trading agents are typically developed and validated against historical market data through backtesting. The compliance risks specific to RL backtesting include overfitting to historical market regimes that may not recur, look-ahead bias in training data construction, and the inability of backtesting to surface manipulation strategies that would only be effective at scale in live markets. Under SR 11-7 model risk management expectations, trading models require documented backtesting methodology that addresses these limitations, along with independent validation of backtesting results.
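One standard mitigation for look-ahead bias is a walk-forward split with an embargo gap between training and test windows, so that information near the test period cannot leak into training. The function below is a simplified sketch; window sizes are illustrative, and real pipelines must also purge overlapping labels.

```python
# Sketch of walk-forward splitting with an embargo gap, one mitigation
# for look-ahead bias in backtest construction. Sizes are illustrative.

def walk_forward_splits(n_bars, train_size, test_size, embargo):
    """Yield (train, test) index ranges; the embargo gap keeps data
    adjacent to the test window out of the training set."""
    start = 0
    while start + train_size + embargo + test_size <= n_bars:
        train = range(start, start + train_size)
        test = range(start + train_size + embargo,
                     start + train_size + embargo + test_size)
        yield train, test
        start += test_size

# Every split leaves an embargo-sized gap before the test window:
for train, test in walk_forward_splits(1000, 500, 100, 20):
    assert max(train) + 20 < min(test)
```

This does not address the deeper limitation the paragraph above notes: no split scheme makes a backtest surface strategies whose effects only materialise at scale in live markets.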
Ongoing Monitoring in Production
Post-deployment monitoring of RL trading systems requires surveillance capabilities that go beyond standard model monitoring. The agent's behaviour in live markets may diverge from backtest behaviour due to market impact, regime change, or emergent interactions with other algorithmic traders. Monitoring must track the agent's position-taking patterns against manipulation surveillance rules, its order-to-fill ratios against spoofing detection thresholds, and its P&L attribution against expected strategy performance to detect policy drift. Each monitoring signal must be connected to a human review workflow that can evaluate whether observed behaviour is acceptable before a regulatory inquiry surfaces it first.
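As one concrete example of the surveillance signals above, the sketch below tracks a rolling order-to-fill ratio against a spoofing-style threshold. The window size and threshold are illustrative assumptions; real thresholds come from the firm's surveillance calibration, and a breach would route to the human review workflow rather than simply returning a boolean.

```python
# Sketch of a rolling order-to-fill ratio monitor. Window and threshold
# values are illustrative assumptions, not calibrated surveillance limits.
from collections import deque

class OrderToFillMonitor:
    def __init__(self, window: int = 1000, max_ratio: float = 20.0):
        self.events = deque(maxlen=window)  # rolling window of recent events
        self.max_ratio = max_ratio

    def record(self, event: str) -> None:
        assert event in ("order", "fill")
        self.events.append(event)

    def breached(self) -> bool:
        """True when orders per fill exceed the surveillance threshold."""
        orders = sum(1 for e in self.events if e == "order")
        fills = sum(1 for e in self.events if e == "fill")
        if fills == 0:
            return orders >= self.max_ratio  # many orders, zero fills
        return orders / fills > self.max_ratio
```

The same pattern generalises to the other signals the paragraph lists: position-taking against manipulation rules and P&L attribution against expected strategy performance are each a rolling statistic compared to a threshold, with breaches escalated to humans.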
The engineering behind this article is available as a service.
We have done this work — not advised on it, not reviewed documentation about it. If the problem in this article is your problem, the first call is with a senior engineer who has solved it.