Case Study: Architecture Drift: Architecture Control Plane
- Chandrasekar Jayabharathy
Level 3: Architecture as code

Banking Case Study and Key References
The following case study is a realistic composite, not a description of one specific institution. The platform is a credit-risk system in a bank that computes decisions such as underwriting responses and umbrella credit limits. It uses microservices, Kubernetes, Git-based delivery, CQRS-style separation between command and query paths, event-driven updates, and a service mesh for observability and selective enforcement. Those choices are typical of modern banking platforms, while the governance requirements reflect EBA, MAS, and DORA expectations around change control, monitoring, logging, and evidence.
The intended architecture
The design intent is simple:
risk-command-service accepts applications and publishes decision-created events.
risk-query-service serves read-only decision views and must not call command services directly.
policy-service evaluates product and regulatory policy.
reporting-service may consume masked read models only.
Any PII in logs must be masked.
Changes to these controls must be versioned and approved.
That intent is captured as code.
apiVersion: architecture.bank.io/v1alpha1
kind: ArchitectureDefinition
metadata:
  name: credit-risk-platform
spec:
  owners:
    primary: group:risk-platform
  controls:
    communicationMode: event-first
    piiLogging: masked
  components:
    - name: risk-command-service
      role: command
      allowedCalls:
        - policy-service
        - decision-write-store
      emitsEvents:
        - decision-created
    - name: risk-query-service
      role: query
      allowedCalls:
        - decision-read-store
      forbiddenCalls:
        - risk-command-service
    - name: reporting-service
      role: reporting
      allowedCalls:
        - decision-read-store
      forbiddenCalls:
        - decision-write-store

A supporting control binding compiles part of that model into deploy-time and runtime targets.
apiVersion: architecture.bank.io/v1alpha1
kind: PolicyBinding
metadata:
  name: credit-risk-guardrails
spec:
  sourceRef:
    kind: ArchitectureDefinition
    name: credit-risk-platform
  targets:
    - type: ci
      mode: validate
    - type: kubernetes-admission
      engine: validatingadmissionpolicy
      mode: validate
    - type: service-mesh
      engine: istio-opa
      mode: enforce-selected

The drift that appears in production
To reduce latency under pressure, an engineer adds a synchronous HTTP call from risk-query-service to risk-command-service and caches the response locally. That solves an immediate product issue, but it breaks the CQRS boundary and creates a new hot dependency on the write path. At the same time, reporting-service starts logging a full upstream JSON payload during a troubleshooting exercise, inadvertently including fields that should never appear unmasked in logs.
A representative observed interaction looks like this.
timestamp: "2026-04-15T09:12:44Z"
trace_id: "8b0c6856d2f74aa8b8a8f3b0c8f9e5a1"
span_id: "4f1b7cbac8f9d201"
resource:
  service.name: risk-query-service
  service.namespace: lending
http:
  route: /credit/umbrella-limit/48291
  server.address: risk-command-service.credit-risk.svc.cluster.local
  status.code: OK
latency_ms: 43
traceparent: "00-8b0c6856d2f74aa8b8a8f3b0c8f9e5a1-4f1b7cbac8f9d201-01"

The log looks like this.
timestamp: "2026-04-15T09:12:45Z"
severity: INFO
resource:
  service.name: reporting-service
body: "decision payload forwarded account_number=4389910023419821 risk_band=H"
trace_id: "8b0c6856d2f74aa8b8a8f3b0c8f9e5a1"
span_id: "da41b8f1f4a11209"

The detection rules
The drift engine normalises observed edges and compares them with allowed and forbidden edges in the architecture model. That comparison can be written in a graph engine, SQL, or Rego. A simple policy example is enough to make the logic concrete.
package architecture.drift

deny contains msg if {
  edge := input.observed.edges[_]
  edge.source == "risk-query-service"
  edge.target == "risk-command-service"
  input.model.components[edge.source].role == "query"
  input.model.components[edge.target].role == "command"
  msg := "CQRS boundary violated: query service called command service"
}

deny contains msg if {
  record := input.observed.logs[_]
  record.resource["service.name"] == "reporting-service"
  regex.match("\\b\\d{16}\\b", record.body)
  msg := "Potential sensitive account or card number detected in logs"
}

Because OPA supports decision logs containing the input, the policy, bundle metadata, and a decision ID, the control plane can retain not just the finding but the exact logical basis for it. That is the crucial difference between a “monitoring alert” and a piece of auditable governance evidence.
The enforcement and remediation path
The bank chooses progressive control rather than an immediate hard block for unknown findings.
PR stage: any future code change that introduces a direct generated client or declared dependency from query to command fails validation.
Admission stage: workloads in the domain must carry architecture ownership and component IDs, so that runtime findings always map to named owners.
Runtime stage: only the highest-risk edge—risk-query-service to risk-command-service—is moved into mesh-time enforcement after a short warning period.
Logging stage: a policy blocks releases that disable the standard masking library, while runtime log scanning remains in detect-and-alert mode.
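The PR-stage check in particular is mechanical: compare the dependencies a change declares against the allowed and forbidden edges in the model. A minimal sketch, assuming the ArchitectureDefinition spec above has been parsed into a dict and that declared dependencies have been collected from manifests or generated clients (`validate_declared_calls` is a hypothetical helper, not an existing tool):

```python
def validate_declared_calls(arch_spec, declared):
    """Return validation errors for a proposed set of declared dependencies.

    `arch_spec` is the parsed spec of the ArchitectureDefinition;
    `declared` maps a component name to the services its change would call.
    """
    errors = []
    for comp in arch_spec["components"]:
        forbidden = set(comp.get("forbiddenCalls", []))
        allowed = set(comp.get("allowedCalls", []))
        for target in declared.get(comp["name"], []):
            if target in forbidden:
                # An explicitly forbidden edge fails the pull request outright.
                errors.append(f"{comp['name']} must not call {target}")
            elif target not in allowed:
                # An undeclared edge is flagged as an unapproved dependency.
                errors.append(f"{comp['name']} -> {target} is not an approved edge")
    return errors


# Example: a change that adds the drifted query -> command dependency.
arch_spec = {"components": [
    {"name": "risk-query-service",
     "allowedCalls": ["decision-read-store"],
     "forbiddenCalls": ["risk-command-service"]},
]}
declared = {"risk-query-service": ["decision-read-store", "risk-command-service"]}
print(validate_declared_calls(arch_spec, declared))
```

Run in CI, a non-empty error list fails the build, which is what closes the loop between a runtime finding and prevention of recurrence.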
A simplified mesh enforcement policy might look like this.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: command-path-architecture-check
  namespace: credit-risk
spec:
  selector:
    matchLabels:
      app: risk-command-service
  action: CUSTOM
  provider:
    name: arch-opa
  rules:
    - from:
        - source:
            principals:
              - "cluster.local/ns/credit-risk/sa/risk-query-service"

OPA-Envoy is an appropriate pairing here because Envoy can delegate context-rich request decisions to OPA without modifying the application, and local sidecar evaluation avoids introducing an extra network hop for every decision.
The remediation the control plane proposes is not mystical. It is very ordinary, which is exactly the point:
restore the query path to read from decision-read-store;
republish decision-created events with any additional projection data needed by the query path;
remove the synchronous client from risk-query-service;
re-enable or fix the masking middleware in reporting-service;
open an exception only if product leadership and risk agree on a short-lived deviation, with explicit expiry and compensating controls.
The control plane records the whole lifecycle as one coherent artefact set: the architecture rule version, the finding, the trace evidence, the decision log or admission evidence, the owner, the exception if any, and the remediation action. That is the operational heart of banking-grade architecture governance.
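That artefact set can be sketched as a single record type. The field names below are illustrative rather than a prescribed schema; the point is that the finding, its evidence, its owner, and any exception travel together as one auditable unit.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class DriftFindingRecord:
    """One auditable governance artefact: a finding plus its logical basis."""
    rule_version: str                   # version of the architecture rule that fired
    finding: str                        # human-readable violation message
    trace_ids: list                     # trace evidence supporting the finding
    decision_log_id: str                # OPA decision ID or admission audit reference
    owner: str                          # owning group from mandatory metadata
    exception: Optional[dict] = None    # short-lived deviation, with explicit expiry
    remediation: Optional[str] = None   # the agreed corrective action


# Example record for the CQRS finding from the case study.
record = DriftFindingRecord(
    rule_version="credit-risk-platform@v1alpha1",
    finding="CQRS boundary violated: query service called command service",
    trace_ids=["8b0c6856d2f74aa8b8a8f3b0c8f9e5a1"],
    decision_log_id="decision-log-ref",  # hypothetical reference
    owner="group:risk-platform",
)
print(asdict(record))
```

Serialising such records alongside the versioned model is what lets an audit replay exactly which rule, on which evidence, produced which decision.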
The measured value
For this kind of pilot, the most meaningful outcomes are operational rather than cosmetic:
| Measure | Before the control plane | After the pilot control plane |
| --- | --- | --- |
| Time to discover a forbidden service edge | Often after an incident or manual review | Near-real-time to same day, depending on trace delay |
| Confidence in ownership of a finding | Mixed; often requires detective work | High if owner metadata is mandatory |
| Evidence available for audit or risk review | Screenshots, tickets, oral explanations | Versioned model, decision logs, trace samples, exception record |
| Ability to prevent recurrence | Low; knowledge stays local | High once the finding is compiled into CI or admission policy |
Those pilot outcomes are plausible because every component already exists in the ecosystem: resource APIs and controllers from Kubernetes, declarative delivery from Argo CD, mesh telemetry and ext_authz from Istio and Envoy, policy bundles and decision logs from OPA, and machine-readable metadata from Backstage, OpenAPI, AsyncAPI, and OpenTelemetry. The signature architectural act is to assemble them into one reconciled loop for architecture intent.