Architecture Drift: Architecture Control Plane
- Chandrasekar Jayabharathy
Level 3: Architecture as Code

Summary
“Architecture Control Plane” is best understood as a proposed control-plane layer for architecture itself: a resource-oriented system that stores architectural intent as machine-readable objects, reconciles that intent against observed runtime and delivery reality, compiles intent into validation and enforcement controls, and exposes findings, exceptions, evidence, and remediation workflows through APIs. This is not a product category with a single canonical implementation today; it is a design pattern synthesised from established control-plane ideas in Kubernetes, GitOps delivery, service mesh, policy-as-code, software catalogues, and observability.
The motivation is straightforward. Existing control planes manage infrastructure state, deployment state, policy decisions, or traffic behaviour; Kubernetes, for example, manages cluster objects through a REST API and controllers. But none of these systems, by itself, owns the higher-order question that architects care about: is the running system still conforming to the intended service boundaries, dependency rules, data-classification rules, and operating constraints?
The practical conclusion is that an Architecture Control Plane is feasible now, not after some future platform rewrite. If a bank already has Git-based delivery, Kubernetes or an equivalent platform, an observability stack, and a policy engine, an MVP can usually be built in one quarter for a bounded service domain. Scaling it into a bank-wide capability is a six-to-twelve-month programme, mostly because of model standardisation, instrumentation coverage, and operating-model change rather than raw technology difficulty.
Framing the Architecture Control Plane
A useful way to define the idea is by analogy. Kubernetes describes a control plane that manages the overall cluster state through an API server, scheduler, controller manager, and persistent state in etcd. Controllers watch desired and current state, then act to reconcile the two. An Architecture Control Plane applies the same control-loop idea to architecture: the desired state is the Architecture-as-Code model; the current state is the observed dependency graph, contracts, deployment shape, and policy outcomes; and the reconcilers are compilers, validators, detectors, and remediators that move the estate back toward intended structure.
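The reconciliation idea can be sketched as a set diff between the intended and observed graphs. This is a minimal illustration only, assuming a hypothetical `Edge` type and edges already normalised to service names:

```python
from dataclasses import dataclass

# Hypothetical types: a desired edge comes from the Architecture-as-Code
# model, an observed edge from normalised runtime telemetry.
@dataclass(frozen=True)
class Edge:
    source: str
    target: str

def reconcile(desired: set[Edge], observed: set[Edge]) -> dict[str, set[Edge]]:
    """One pass of the architecture control loop: diff intent against reality."""
    return {
        # Traffic that exists but was never declared: candidate drift findings.
        "undeclared": observed - desired,
        # Declared edges with no observed traffic: stale intent or dead paths.
        "unobserved": desired - observed,
    }

findings = reconcile(
    desired={Edge("risk-query-service", "decision-read-store")},
    observed={
        Edge("risk-query-service", "decision-read-store"),
        Edge("risk-query-service", "risk-command-service"),  # not in the model
    },
)
```

Real reconcilers add confidence scoring and time windows on top of this diff, but the control-loop shape stays the same: desired state in, observed state in, findings out.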
Comparison with Existing Control Planes
An Architecture Control Plane should be positioned as an orchestrating layer above existing domain-specific control planes, not as a replacement for them.
| Control plane | Primary managed object | Source of desired state | Main observation surface | Main actuators | Architectural blind spot | Evidence |
| --- | --- | --- | --- | --- | --- | --- |
| Kubernetes infrastructure control plane | Cluster resources such as Pods, Services, Deployments | Resource manifests and custom resources | API server state, watches, controllers | Scheduling, reconciliation, admission, operators | Knows infrastructure and workload topology, not business boundaries or allowed system relationships | |
| GitOps / CI-CD delivery plane | Application release state | Git repository state | Live-vs-target diff, health status | Deploy, sync, rollback | Detects manifest drift, not architecture drift across services and data flows | |
| Policy-as-code control plane | Policy decisions and admission invariants | Rego, CEL, YAML policy definitions | Admission requests, API calls, request context, audits | Allow/deny, mutate, warn, audit | Enforces rules, but typically without a first-class architecture graph as the source of truth | |
| Service-mesh traffic control plane | Proxy behaviour and traffic policy | Mesh config and traffic policy resources | In-mesh traffic telemetry, access logs, traces | Route, retry, circuit-break, authz extension, mTLS | Sees traffic brilliantly, but only traffic; does not model system intent on its own | |
| Architecture Control Plane | Service boundaries, allowed relationships, data and control constraints, exceptions, findings | Architecture-as-Code schema plus contracts, ownership, and policy bindings | Runtime graph, code and contract graph, deployment graph, policy outcomes | Validation, enforcement compilation, findings, exceptions, remediation workflows | Its challenge is model completeness and operational adoption, not lack of technical leverage | This report's synthesis |
The comparison makes the signature point clear: infrastructure, delivery, policy, and traffic control planes govern how a system runs; an Architecture Control Plane governs whether it is still the system you intended to run. That is why it should sit above them and compile intent into them.
Core Capabilities, APIs, and Data Model
A credible Architecture Control Plane needs six capabilities.
- a model registry for Architecture-as-Code definitions;
- a graph normaliser that turns disparate sources (contracts, deployment manifests, service selectors, owners, exceptions) into one consistent graph;
- a compiler that translates architecture intent into downstream control artefacts such as CI checks, admission policies, mesh authz rules, or OPA bundle content;
- signal ingestion so that traces, logs, policy decisions, and runtime state can be attached to the model;
- a drift engine that compares intended and observed state;
- exception and remediation workflows that are time-bounded, auditable, and tied to named owners.

Those ingredients are exactly what makes a control plane a control plane rather than a static registry.
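The compiler capability is what turns the model from documentation into control. A minimal sketch, assuming a simplified in-memory model whose field names mirror the ArchitectureDefinition schema used in this article; the output shape is illustrative, not a real OPA bundle format:

```python
import json

# Simplified, illustrative model fragment.
model = {
    "components": [
        {"name": "risk-command-service",
         "allowedCalls": ["policy-service", "decision-write-store"]},
        {"name": "risk-query-service",
         "allowedCalls": ["decision-read-store"],
         "forbiddenCalls": ["risk-command-service"]},
    ]
}

def compile_deny_rules(model: dict) -> dict:
    """Compile architecture intent into a flat deny-list artefact that a
    downstream policy engine or admission webhook could consume as data."""
    denies = []
    for component in model["components"]:
        for target in component.get("forbiddenCalls", []):
            denies.append({
                "source": component["name"],
                "target": target,
                "reason": "forbidden edge in ArchitectureDefinition",
            })
    return {"deniedEdges": denies}

artefact = compile_deny_rules(model)
print(json.dumps(artefact, indent=2))
```

The design choice worth noting is that the compiler emits data, not hand-written rules: the same architecture intent can then be projected into CI checks, admission policies, or mesh authorisation without duplicating the model.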
A practical resource inventory for an Architecture Control Plane looks as follows.
| Resource kind | Purpose | Minimum schema requirements | Why it exists |
| --- | --- | --- | --- |
| ArchitectureDefinition | Canonical system intent | metadata, spec.components, spec.allowedEdges, spec.forbiddenEdges, spec.dataClasses, spec.owners, status.conditions | Declares what the architecture should be |
| Component | Individual workload or subsystem | stable ID, runtime selectors, owner, criticality, lifecycle tier | Binds design intent to actual workloads |
| ApiContract | HTTP contract reference | OpenAPI ref, provider, consumer, version | Anchors synchronous interfaces in machine-readable form |
| EventContract | Event/message contract reference | AsyncAPI ref, channel/topic, producer, consumers | Anchors asynchronous interfaces in machine-readable form |
| PolicyBinding | Compilation target | target type, enforcement mode, scope, generated artefact refs | Connects architecture rules to CI, admission, mesh, or OPA |
| Observation | Runtime evidence | source, timestamp, normalised edge, resource attributes, confidence | Stores observed reality in a consistent graph |
| DriftFinding | Detected divergence | severity, violating rule, evidence refs, owner, SLA, status | Makes drift actionable rather than anecdotal |
| ArchitectureException | Approved deviation | rationale, approver, expiry, scope, compensating controls | Prevents "temporary" exceptions from becoming invisible debt |
| RemediationAction | Suggested or approved fix | playbook type, approval state, target systems, rollback plan | Enables safe automation |
The schema requirement is not only “be declarative.” It is “be joinable.” If your architecture model cannot be joined to telemetry, catalogue entities, contracts, admission events, and deployment objects, it cannot be reconciled in a reliable control loop.
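Joinability can be tested mechanically: every modelled component must resolve to at least one telemetry identity. A sketch, assuming OpenTelemetry-style `service.name` / `service.namespace` resource attributes and illustrative sample data:

```python
# Illustrative model components and telemetry identities; in practice the
# components come from the registry and the identities from recent telemetry.
model_components = [
    {"name": "risk-command-service", "namespace": "credit-risk"},
    {"name": "risk-query-service", "namespace": "credit-risk"},
]

telemetry_resources = [
    {"service.name": "risk-command-service", "service.namespace": "credit-risk"},
]

def unjoinable(components: list[dict], resources: list[dict]) -> list[str]:
    """Return component names that no telemetry identity joins to."""
    seen = {(r["service.namespace"], r["service.name"]) for r in resources}
    return [c["name"] for c in components
            if (c["namespace"], c["name"]) not in seen]

# Components the control loop cannot reconcile: no evidence joins to them.
orphans = unjoinable(model_components, telemetry_resources)
```

A component that shows up in this orphan list is invisible to the drift engine, which is why join-key hygiene belongs in the earliest phase of the programme.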
```yaml
apiVersion: architecture.bank.io/v1alpha1
kind: ArchitectureDefinition
metadata:
  name: credit-risk-platform
  labels:
    domain: lending
    criticality: critical
spec:
  owners:
    primary: group:risk-platform
    riskControl: group:enterprise-architecture
  dataClasses:
    - confidential
    - pii
  components:
    - name: risk-command-service
      runtime:
        namespace: credit-risk
        selector:
          app: risk-command-service
      role: command
      allowedCalls:
        - policy-service
        - decision-write-store
      emitsEvents:
        - decision-created
    - name: risk-query-service
      runtime:
        namespace: credit-risk
        selector:
          app: risk-query-service
      role: query
      allowedCalls:
        - decision-read-store
      forbiddenCalls:
        - risk-command-service
  contracts:
    apis:
      - ref: openapi://credit-risk/policy-service.yaml
    events:
      - ref: asyncapi://credit-risk/decision-created.yaml
status:
  observedGeneration: 7
  lastCompiledAt: "2026-04-15T19:32:11Z"
  driftSummary:
    openFindings: 2
    expiredExceptions: 1
```

Signals, Integration Points, and Control Patterns
The signal model is the heart of the system. Architecture usually fails because it is declared once and observed poorly thereafter. An Architecture Control Plane works only if it can ingest both design signals and runtime signals.
| Signal source | Typical artefacts | Value to the control plane | Key limitation | Evidence |
| --- | --- | --- | --- | --- |
| Distributed tracing | spans, parent-child relationships, resource attributes, traceparent/trace context | Best source for actual synchronous call graph and request flows | Needs instrumentation and propagation coverage | |
| Service-mesh telemetry and access logs | source workload, destination workload, response code, latency, access record | High-confidence service edge detection and traffic evidence, often without app changes | Only sees traffic that passes through the mesh or proxy | |
| Application and infrastructure logs | structured logs, log bodies, resource context, TraceId/SpanId | Useful for policy evidence, PII leak detection, and post-incident investigation | Unstructured logs create noise; correlation quality matters | |
| Infrastructure state | Kubernetes resources, Argo sync state, namespace labels, selectors | Shows how the system is deployed and whether declared targets are out of sync | Does not prove how distributed systems actually interact at runtime | |
| HTTP contracts | OpenAPI docs | Machine-readable declaration of synchronous interfaces and versions | Tells you what is intended, not what is called in practice | |
| Event contracts | AsyncAPI docs | Machine-readable declaration of event channels and operations | Consumer reality may drift from documentation | |
| Catalogue and ownership metadata | Backstage entities and relations | Connects services to owners, systems, and API relationships | Usually weaker on infrastructure truth and runtime truth | |
| CMDB and service mapping | configuration items and relationships | Strong enterprise configuration context, useful for risk, change and impact analysis | Often lags modern platform reality unless discovery is strong | |
| Policy outcomes | OPA decision logs, admission audit results | Verifiable evidence of what was allowed, denied, or warned | Tells you about policy decisions, not the full runtime graph | |
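Of these sources, distributed tracing yields the synchronous call graph most directly. A sketch of how observed edges can be derived from parent-child span relationships; the span shape is illustrative, loosely following OpenTelemetry conventions:

```python
# Each span records its own service identity plus a parent reference; a
# cross-service parent-child pair becomes one normalised Observation edge.
spans = [
    {"span_id": "a1", "parent_id": None, "service": "risk-query-service"},
    {"span_id": "b2", "parent_id": "a1", "service": "decision-read-store"},
    {"span_id": "c3", "parent_id": "a1", "service": "decision-read-store"},
]

def edges_from_spans(spans: list[dict]) -> set[tuple[str, str]]:
    """Derive deduplicated (caller, callee) service edges from spans."""
    by_id = {s["span_id"]: s for s in spans}
    edges = set()
    for span in spans:
        parent = by_id.get(span["parent_id"])
        # Only cross-service hops become edges; intra-service spans are noise.
        if parent and parent["service"] != span["service"]:
            edges.add((parent["service"], span["service"]))
    return edges

observed = edges_from_spans(spans)
```

A real normaliser would also carry timestamps, confidence, and source attribution into the Observation resource, but edge derivation itself stays this simple.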
This signal mix leads naturally to four distinct control patterns.
| Pattern | When it runs | Typical targets | Best for | Main risk | Evidence |
| --- | --- | --- | --- | --- | --- |
| Validation in PR and CI | Before merge or build promotion | Architecture lints, contract checks, generated policies | Cheap feedback, high developer leverage | False positives if the model is incomplete | OPA is explicitly designed for CI/CD use, and GitOps tools expect declarative sources of truth. |
| Admission-time validation or mutation | At deploy time | ValidatingAdmissionPolicy, Gatekeeper, Kyverno | Preventing bad state from entering the cluster | Can block releases if policies are too brittle or dependencies are unavailable | |
| Proxy-time enforcement | At request time | Istio CUSTOM authz + Envoy ext_authz + OPA-Envoy | Enforcing service-to-service rules and context-aware authz | Added control-path complexity; poor placement can introduce latency/availability issues | |
| Asynchronous runtime drift analysis | After observation, continuously | Graph engine, alerts, tickets, dashboards | Detecting architecture drift that cannot be pre-blocked safely | Findings can become noisy if signal normalisation is weak | |
The right design is not “enforcement everywhere.” The right design is progressive control. Start by validating in PRs and builds. Add admission control for low-regret invariants such as required labels, owners, or network classes. Add proxy-time enforcement only for high-value edges such as forbidden calls between zones, data domains, or CQRS boundaries. Keep the rest in asynchronous drift detection until signal quality and exception processes are mature.
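A "low-regret invariant" of the kind worth enforcing first at admission time can be as small as a required-metadata check. A sketch; the label keys are this article's examples, not any standard, and a real deployment would run this inside an admission webhook or policy engine rather than a plain function:

```python
# Required label keys, taken from the article's examples of low-regret
# invariants: ownership, criticality, and data classification.
REQUIRED_LABELS = {"owner", "criticality", "data-classification"}

def admission_warnings(manifest: dict) -> list[str]:
    """Return one warning per required label missing from a manifest."""
    labels = manifest.get("metadata", {}).get("labels", {})
    missing = sorted(REQUIRED_LABELS - labels.keys())
    return [f"missing required label: {key}" for key in missing]

manifest = {"metadata": {"labels": {"owner": "group:risk-platform"}}}
warnings = admission_warnings(manifest)
```

Starting in warn mode and flipping to deny only after the false-positive rate settles is exactly the progressive-control posture described above.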
AI, Security, Compliance, and Operating Model
AI is not the control plane; it is an amplifier on top of the control plane’s machine-readable evidence. Once the estate is represented as structured resources, contracts, telemetry, and policy outputs, there are three sensible AI roles.
| AI role | Inputs | Safe autonomy level | Good outcomes | Guardrails |
| --- | --- | --- | --- | --- |
| Analysis | topology graph, contracts, traces, decision logs, findings | High | summarise drift, cluster related findings, identify likely root cause | human-readable evidence and traceability back to raw inputs |
| Suggestion | findings, ownership, playbooks, recent changes | Medium | propose remediation steps, exception scopes, rollout plans, missing metadata | no direct production changes without approval |
| Automated remediation | narrow, pre-approved playbooks | Low | open PRs, create tickets, add expired-exception alerts, revert known-bad policy changes | approval gates, rollback, blast-radius control |
This is not speculative hand-waving. OpenTelemetry gives you normalised telemetry and resource attributes, OPA gives you decision logs with full decision context and bundle metadata, Backstage gives you structured ownership and API relationships, and runtime infrastructure gives you resource versions and watchable state.
The operating model should therefore be explicit.
| Function | Recommended owner | Independence expectation |
| --- | --- | --- |
| Architecture model authoring for a service | Product / domain engineering team | Same team that owns the service should maintain the model |
| Platform ownership of the control plane | Platform engineering | Centralised operations and reliability ownership |
| Policy standard definition | Enterprise architecture and security | Independent from day-to-day feature delivery |
| Runtime evidence and dashboards | SRE / observability platform | Shared operational service |
| Exception approval | Architecture governance plus risk/security depending on severity | Independent approval for high-risk deviations |
| Assurance and testing | Internal audit / control functions | Independent review and objective assurance |
This follows both Backstage’s ownership semantics and the EBA’s emphasis on independence and assurance. Ownership must be singular and discoverable; assurance must be independent enough to be credible.
Roadmap, Economics, KPIs, and Actionable Next Steps
The implementation path should be incremental. Control planes fail when teams try to standardise the whole estate before the model has earned trust.
A practical roadmap looks like this.
| Phase | Timeline | Milestones | Recommended KPIs |
| --- | --- | --- | --- |
| Preparation | 3–6 weeks | choose one critical domain; define stable component IDs; standardise owner metadata; align service names with telemetry | model coverage for pilot domain > 70%; owner coverage > 95% |
| MVP | 8–12 weeks | resource API; ArchitectureDefinition schema; PR checks; trace ingestion; first drift dashboard | mean time to detect drift < 1 day; false-positive rate < 20%; PR feedback time < 10 minutes |
| Production pilot | 3–6 months | exception API; approval workflow; admission validation; generated policy bindings; evidence export for risk/audit | exception expiry adherence > 95%; blocking policy precision > 90%; architecture review lead time reduced by 30–50% |
| Scale-out | 6–12 months | multi-cluster coverage; service-mesh integration where needed; CMDB sync; AI suggestions; remediation playbooks | coverage of tier-1 services > 80%; drift MTTD < 1 hour for protected domains; unauthorised edge recurrence down by > 50% |
These KPIs are recommended control objectives for programme design, not industry-standard thresholds.
The effort and cost profile depends much more on estate maturity and signal quality than on raw service count.
| Scenario | Estate assumptions | Team shape | Indicative elapsed time | Relative engineering effort | Relative platform/tool spend |
| --- | --- | --- | --- | --- | --- |
| Low | 20–50 services, one strategic cluster, GitOps already present, good telemetry baseline | 2–3 engineers plus part-time architect | 10–14 weeks | Low to medium | Low |
| Medium | 100–200 services, multiple namespaces/clusters, mixed telemetry quality, one catalogue or CMDB already present | 4–6 engineers, one platform lead, one architect, one security/policy lead | 4–6 months | Medium | Medium |
| High | 300+ services, multiple regions/platforms, strong segregation requirements, partial legacy estate, multiple policy engines | 8–12 engineers, SRE, architect, security engineering, change/risk integration | 9–12 months | High | Medium to high |
The biggest hidden cost is model hygiene. If owners are missing, service names are inconsistent, contracts are absent, or traces do not propagate correct context, the control plane spends its life reconciling broken metadata rather than detecting architecture drift.
Actionable next steps for a bank are therefore very concrete:
1. Pick one domain where architecture drift is already painful and visible, such as credit risk, payments, or customer servicing.
2. Standardise service identity: service.name, service.namespace, owner ID, criticality, data classification.
3. Define the first ArchitectureDefinition schema and keep it intentionally small.
4. Wire the schema into PR checks before adding any hard runtime enforcement.
5. Build one drift detector from traces only; do not start with every signal source.
6. Add an exception resource with owner, expiry, reason, and compensating controls before you add any deny rules.
7. Only after the first two sprints have produced clean findings, compile selected rules into admission or mesh enforcement.
Conclusion
Architecture Control Plane is not a metaphor borrowed for effect. It is a rigorous next step in the same lineage as infrastructure control planes, GitOps reconcilers, service-mesh control planes, and policy control planes. In banking, where architecture needs to be not only elegant but continuously evidenced, it is a more realistic operating model than architecture boards trying to govern distributed systems with static diagrams.