Case Study: Architecture Drift: Architecture Control Plane
- Chandrasekar Jayabharathy
Level 3: Architecture as code

Banking Case Study and Key References
The following case study is a realistic composite, not a description of one specific institution. The platform is a credit-risk system in a bank that computes decisions such as underwriting responses and umbrella credit limits. It uses microservices, Kubernetes, Git-based delivery, CQRS-style separation between command and query paths, event-driven updates, and a service mesh for observability and selective enforcement. Those choices are typical of modern banking platforms, while the governance requirements reflect EBA, MAS, and DORA expectations around change control, monitoring, logging, and evidence.
The intended architecture
The design intent is simple:
risk-command-service accepts applications and publishes decision-created events.
risk-query-service serves read-only decision views and must not call command services directly.
policy-service evaluates product and regulatory policy.
reporting-service may consume masked read models only.
Any PII in logs must be masked.
Changes to these controls must be versioned and approved.
That intent is captured as code.
apiVersion: architecture.bank.io/v1alpha1
kind: ArchitectureDefinition
metadata:
  name: credit-risk-platform
spec:
  owners:
    primary: group:risk-platform
  controls:
    communicationMode: event-first
    piiLogging: masked
  components:
    - name: risk-command-service
      role: command
      allowedCalls:
        - policy-service
        - decision-write-store
      emitsEvents:
        - decision-created
    - name: risk-query-service
      role: query
      allowedCalls:
        - decision-read-store
      forbiddenCalls:
        - risk-command-service
    - name: reporting-service
      role: reporting
      allowedCalls:
        - decision-read-store
      forbiddenCalls:
        - decision-write-store

A supporting control binding compiles part of that model into deploy-time and runtime targets.
apiVersion: architecture.bank.io/v1alpha1
kind: PolicyBinding
metadata:
  name: credit-risk-guardrails
spec:
  sourceRef:
    kind: ArchitectureDefinition
    name: credit-risk-platform
  targets:
    - type: ci
      mode: validate
    - type: kubernetes-admission
      engine: validatingadmissionpolicy
      mode: validate
    - type: service-mesh
      engine: istio-opa
      mode: enforce-selected

The drift that appears in production
To reduce latency under pressure, an engineer adds a synchronous HTTP call from risk-query-service to risk-command-service and caches the response locally. That solves an immediate product issue, but it breaks the CQRS boundary and creates a new hot dependency on the write path. At the same time, reporting-service starts logging a full upstream JSON payload during a troubleshooting exercise, inadvertently including fields that should never appear unmasked in logs.
A representative observed interaction looks like this.
timestamp: "2026-04-15T09:12:44Z"
trace_id: "8b0c6856d2f74aa8b8a8f3b0c8f9e5a1"
span_id: "4f1b7cbac8f9d201"
resource:
  service.name: risk-query-service
  service.namespace: lending
http:
  route: /credit/umbrella-limit/48291
  server.address: risk-command-service.credit-risk.svc.cluster.local
  status.code: OK
latency_ms: 43
traceparent: "00-8b0c6856d2f74aa8b8a8f3b0c8f9e5a1-4f1b7cbac8f9d201-01"

The log looks like this.
timestamp: "2026-04-15T09:12:45Z"
severity: INFO
resource:
  service.name: reporting-service
body: "decision payload forwarded account_number=4389910023419821 risk_band=H"
trace_id: "8b0c6856d2f74aa8b8a8f3b0c8f9e5a1"
span_id: "da41b8f1f4a11209"

The detection rules
The drift engine normalises observed edges and compares them with allowed and forbidden edges in the architecture model. That comparison can be written in a graph engine, SQL, or Rego. A simple policy example is enough to make the logic concrete.
package architecture.drift

deny contains msg if {
  edge := input.observed.edges[_]
  edge.source == "risk-query-service"
  edge.target == "risk-command-service"
  input.model.components[edge.source].role == "query"
  input.model.components[edge.target].role == "command"
  msg := "CQRS boundary violated: query service called command service"
}

deny contains msg if {
  record := input.observed.logs[_]
  record.resource["service.name"] == "reporting-service"
  regex.match("\\b\\d{16}\\b", record.body)
  msg := "Potential sensitive account or card number detected in logs"
}

Because OPA supports decision logs containing the input, the policy, bundle metadata, and a decision ID, the control plane can retain not just the finding but the exact logical basis for it. That is the crucial difference between a “monitoring alert” and a piece of auditable governance evidence.
The enforcement and remediation path
The bank chooses progressive control rather than an immediate hard block for unknown findings.
PR stage: any future code change that introduces a direct generated client or declared dependency from query to command fails validation.
Admission stage: workloads in the domain must carry architecture ownership and component IDs, so that runtime findings always map to named owners.
Runtime stage: only the highest-risk edge—risk-query-service to risk-command-service—is moved into mesh-time enforcement after a short warning period.
Logging stage: a policy blocks releases that disable the standard masking library, while runtime log scanning remains in detect-and-alert mode.
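The PR-stage check in particular is mechanical: compare the dependencies a change declares against the allowed and forbidden edges in the model. A minimal sketch, assuming the ArchitectureDefinition spec above has been parsed into a dict and that declared dependencies have been collected from manifests or generated clients (`validate_declared_calls` is a hypothetical helper, not an existing tool):

```python
def validate_declared_calls(arch_spec, declared):
    """Return validation errors for a proposed set of declared dependencies.

    `arch_spec` is the parsed spec of the ArchitectureDefinition;
    `declared` maps a component name to the services its change would call.
    """
    errors = []
    for comp in arch_spec["components"]:
        forbidden = set(comp.get("forbiddenCalls", []))
        allowed = set(comp.get("allowedCalls", []))
        for target in declared.get(comp["name"], []):
            if target in forbidden:
                # An explicitly forbidden edge fails the pull request outright.
                errors.append(f"{comp['name']} must not call {target}")
            elif target not in allowed:
                # An undeclared edge is flagged as an unapproved dependency.
                errors.append(f"{comp['name']} -> {target} is not an approved edge")
    return errors


# Example: a change that adds the drifted query -> command dependency.
arch_spec = {"components": [
    {"name": "risk-query-service",
     "allowedCalls": ["decision-read-store"],
     "forbiddenCalls": ["risk-command-service"]},
]}
declared = {"risk-query-service": ["decision-read-store", "risk-command-service"]}
print(validate_declared_calls(arch_spec, declared))
```

Run in CI, a non-empty error list fails the build, which is what closes the loop between a runtime finding and prevention of recurrence.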
A simplified mesh enforcement policy might look like this.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: command-path-architecture-check
  namespace: credit-risk
spec:
  selector:
    matchLabels:
      app: risk-command-service
  action: CUSTOM
  provider:
    name: arch-opa
  rules:
    - from:
        - source:
            principals:
              - "cluster.local/ns/credit-risk/sa/risk-query-service"

OPA-Envoy is an appropriate pairing here because Envoy can delegate context-rich request decisions to OPA without modifying the application, and local sidecar evaluation avoids introducing an extra network hop for every decision.
The remediation the control plane proposes is not mystical. It is very ordinary, which is exactly the point:
restore the query path to read from decision-read-store;
republish decision-created events with any additional projection data needed by the query path;
remove the synchronous client from risk-query-service;
re-enable or fix the masking middleware in reporting-service;
open an exception only if product leadership and risk agree on a short-lived deviation, with explicit expiry and compensating controls.
The control plane records the whole lifecycle as one coherent artefact set: the architecture rule version, the finding, the trace evidence, the decision log or admission evidence, the owner, the exception if any, and the remediation action. That is the operational heart of banking-grade architecture governance.
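That artefact set can be sketched as a single record type. The field names below are illustrative rather than a prescribed schema; the point is that the finding, its evidence, its owner, and any exception travel together as one auditable unit.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class DriftFindingRecord:
    """One auditable governance artefact: a finding plus its logical basis."""
    rule_version: str                   # version of the architecture rule that fired
    finding: str                        # human-readable violation message
    trace_ids: list                     # trace evidence supporting the finding
    decision_log_id: str                # OPA decision ID or admission audit reference
    owner: str                          # owning group from mandatory metadata
    exception: Optional[dict] = None    # short-lived deviation, with explicit expiry
    remediation: Optional[str] = None   # the agreed corrective action


# Example record for the CQRS finding from the case study.
record = DriftFindingRecord(
    rule_version="credit-risk-platform@v1alpha1",
    finding="CQRS boundary violated: query service called command service",
    trace_ids=["8b0c6856d2f74aa8b8a8f3b0c8f9e5a1"],
    decision_log_id="decision-log-ref",  # hypothetical reference
    owner="group:risk-platform",
)
print(asdict(record))
```

Serialising such records alongside the versioned model is what lets an audit replay exactly which rule, on which evidence, produced which decision.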
The measured value
For this kind of pilot, the most meaningful outcomes are operational rather than cosmetic:
| Measure | Before the control plane | After the pilot control plane |
| --- | --- | --- |
| Time to discover a forbidden service edge | Often after an incident or manual review | Near-real-time to same day, depending on trace delay |
| Confidence in ownership of a finding | Mixed; often requires detective work | High if owner metadata is mandatory |
| Evidence available for audit or risk review | Screenshots, tickets, oral explanations | Versioned model, decision logs, trace samples, exception record |
| Ability to prevent recurrence | Low; knowledge stays local | High once the finding is compiled into CI or admission policy |
Those pilot outcomes are plausible because every component already exists in the ecosystem: resource APIs and controllers from Kubernetes, declarative delivery from Argo CD, mesh telemetry and ext_authz from Istio and Envoy, policy bundles and decision logs from OPA, and machine-readable metadata from Backstage, OpenAPI, AsyncAPI, and OpenTelemetry. The signature architectural act is to assemble them into one reconciled loop for architecture intent.