Architecture Drift: Architecture Control Plane
- Chandrasekar Jayabharathy
Level 3: Architecture as Code

Summary
“Architecture Control Plane” is best understood as a proposed control-plane layer for architecture itself: a resource-oriented system that stores architectural intent as machine-readable objects, reconciles that intent against observed runtime and delivery reality, compiles intent into validation and enforcement controls, and exposes findings, exceptions, evidence, and remediation workflows through APIs. This is not a product category with a single canonical implementation today; it is a design pattern synthesised from established control-plane ideas in Kubernetes, GitOps delivery, service mesh, policy-as-code, software catalogues, and observability.
The motivation is straightforward. Existing control planes manage infrastructure state, deployment state, policy decisions, or traffic behaviour; Kubernetes, for example, manages cluster objects through a REST API and controllers. But none of these systems, by itself, owns the higher-order question that architects care about: is the running system still conforming to the intended service boundaries, dependency rules, data-classification rules, and operating constraints?
The practical conclusion is that an Architecture Control Plane is feasible now, not after some future platform rewrite. If a bank already has Git-based delivery, Kubernetes or an equivalent platform, an observability stack, and a policy engine, an MVP can usually be built in one quarter for a bounded service domain. Scaling it into a bank-wide capability is a six-to-twelve-month programme, mostly because of model standardisation, instrumentation coverage, and operating-model change rather than raw technology difficulty.
Framing the Architecture Control Plane
A useful way to define the idea is by analogy. Kubernetes describes a control plane that manages the overall cluster state through an API server, scheduler, controller manager, and persistent state in etcd. Controllers watch desired and current state, then act to reconcile the two. An Architecture Control Plane applies the same control-loop idea to architecture: the desired state is the Architecture-as-Code model; the current state is the observed dependency graph, contracts, deployment shape, and policy outcomes; and the reconcilers are compilers, validators, detectors, and remediators that move the estate back toward intended structure.
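The reconciliation idea can be sketched as a set diff between the intended and observed graphs. This is a minimal illustration only, assuming a hypothetical `Edge` type and edges already normalised to service names:

```python
from dataclasses import dataclass

# Hypothetical types: a desired edge comes from the Architecture-as-Code
# model, an observed edge from normalised runtime telemetry.
@dataclass(frozen=True)
class Edge:
    source: str
    target: str

def reconcile(desired: set[Edge], observed: set[Edge]) -> dict[str, set[Edge]]:
    """One pass of the architecture control loop: diff intent against reality."""
    return {
        # Traffic that exists but was never declared: candidate drift findings.
        "undeclared": observed - desired,
        # Declared edges with no observed traffic: stale intent or dead paths.
        "unobserved": desired - observed,
    }

findings = reconcile(
    desired={Edge("risk-query-service", "decision-read-store")},
    observed={
        Edge("risk-query-service", "decision-read-store"),
        Edge("risk-query-service", "risk-command-service"),  # not in the model
    },
)
```

Real reconcilers add confidence scoring and time windows on top of this diff, but the control-loop shape stays the same: desired state in, observed state in, findings out.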
Comparison with Existing Control Planes
An Architecture Control Plane should be positioned as an orchestrating layer above existing domain-specific control planes, not as a replacement for them.
| Control plane | Primary managed object | Source of desired state | Main observation surface | Main actuators | Architectural blind spot | Evidence |
| --- | --- | --- | --- | --- | --- | --- |
| Kubernetes infrastructure control plane | Cluster resources such as Pods, Services, Deployments | Resource manifests and custom resources | API server state, watches, controllers | Scheduling, reconciliation, admission, operators | Knows infrastructure and workload topology, not business boundaries or allowed system relationships | |
| GitOps / CI-CD delivery plane | Application release state | Git repository state | Live-vs-target diff, health status | Deploy, sync, rollback | Detects manifest drift, not architecture drift across services and data flows | |
| Policy-as-code control plane | Policy decisions and admission invariants | Rego, CEL, YAML policy definitions | Admission requests, API calls, request context, audits | Allow/deny, mutate, warn, audit | Enforces rules, but typically without a first-class architecture graph as the source of truth | |
| Service-mesh traffic control plane | Proxy behaviour and traffic policy | Mesh config and traffic policy resources | In-mesh traffic telemetry, access logs, traces | Route, retry, circuit-break, authz extension, mTLS | Sees traffic brilliantly, but only traffic; does not model system intent on its own | |
| Architecture Control Plane | Service boundaries, allowed relationships, data and control constraints, exceptions, findings | Architecture-as-Code schema plus contracts, ownership, and policy bindings | Runtime graph, code and contract graph, deployment graph, policy outcomes | Validation, enforcement compilation, findings, exceptions, remediation workflows | Its challenge is model completeness and operational adoption, not lack of technical leverage | This report's synthesis |
The comparison makes the signature point clear: infrastructure, delivery, policy, and traffic control planes govern how a system runs; an Architecture Control Plane governs whether it is still the system you intended to run. That is why it should sit above them and compile intent into them.
Core Capabilities, APIs, and Data Model
A credible Architecture Control Plane needs six capabilities.
- a model registry for Architecture-as-Code definitions;
- a graph normaliser that turns disparate sources (contracts, deployment manifests, service selectors, owners, exceptions) into one consistent graph;
- a compiler that translates architecture intent into downstream control artefacts such as CI checks, admission policies, mesh authz rules, or OPA bundle content;
- signal ingestion so that traces, logs, policy decisions, and runtime state can be attached to the model;
- a drift engine that compares intended and observed state;
- exception and remediation workflows that are time-bounded, auditable, and tied to named owners.

Those ingredients are exactly what makes a control plane a control plane rather than a static registry.
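The compiler capability is what turns the model from documentation into control. A minimal sketch, assuming a simplified in-memory model whose field names mirror the ArchitectureDefinition schema used in this article; the output shape is illustrative, not a real OPA bundle format:

```python
import json

# Simplified, illustrative model fragment.
model = {
    "components": [
        {"name": "risk-command-service",
         "allowedCalls": ["policy-service", "decision-write-store"]},
        {"name": "risk-query-service",
         "allowedCalls": ["decision-read-store"],
         "forbiddenCalls": ["risk-command-service"]},
    ]
}

def compile_deny_rules(model: dict) -> dict:
    """Compile architecture intent into a flat deny-list artefact that a
    downstream policy engine or admission webhook could consume as data."""
    denies = []
    for component in model["components"]:
        for target in component.get("forbiddenCalls", []):
            denies.append({
                "source": component["name"],
                "target": target,
                "reason": "forbidden edge in ArchitectureDefinition",
            })
    return {"deniedEdges": denies}

artefact = compile_deny_rules(model)
print(json.dumps(artefact, indent=2))
```

The design choice worth noting is that the compiler emits data, not hand-written rules: the same architecture intent can then be projected into CI checks, admission policies, or mesh authorisation without duplicating the model.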
A practical resource inventory for an Architecture Control Plane looks as follows.
| Resource kind | Purpose | Minimum schema requirements | Why it exists |
| --- | --- | --- | --- |
| ArchitectureDefinition | Canonical system intent | metadata, spec.components, spec.allowedEdges, spec.forbiddenEdges, spec.dataClasses, spec.owners, status.conditions | Declares what the architecture should be |
| Component | Individual workload or subsystem | stable ID, runtime selectors, owner, criticality, lifecycle tier | Binds design intent to actual workloads |
| ApiContract | HTTP contract reference | OpenAPI ref, provider, consumer, version | Anchors synchronous interfaces in machine-readable form |
| EventContract | Event/message contract reference | AsyncAPI ref, channel/topic, producer, consumers | Anchors asynchronous interfaces in machine-readable form |
| PolicyBinding | Compilation target | target type, enforcement mode, scope, generated artefact refs | Connects architecture rules to CI, admission, mesh, or OPA |
| Observation | Runtime evidence | source, timestamp, normalised edge, resource attributes, confidence | Stores observed reality in a consistent graph |
| DriftFinding | Detected divergence | severity, violating rule, evidence refs, owner, SLA, status | Makes drift actionable rather than anecdotal |
| ArchitectureException | Approved deviation | rationale, approver, expiry, scope, compensating controls | Prevents "temporary" exceptions from becoming invisible debt |
| RemediationAction | Suggested or approved fix | playbook type, approval state, target systems, rollback plan | Enables safe automation |
The schema requirement is not only “be declarative.” It is “be joinable.” If your architecture model cannot be joined to telemetry, catalogue entities, contracts, admission events, and deployment objects, it cannot be reconciled in a reliable control loop.
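Joinability can be tested mechanically: every modelled component must resolve to at least one telemetry identity. A sketch, assuming OpenTelemetry-style `service.name` / `service.namespace` resource attributes and illustrative sample data:

```python
# Illustrative model components and telemetry identities; in practice the
# components come from the registry and the identities from recent telemetry.
model_components = [
    {"name": "risk-command-service", "namespace": "credit-risk"},
    {"name": "risk-query-service", "namespace": "credit-risk"},
]

telemetry_resources = [
    {"service.name": "risk-command-service", "service.namespace": "credit-risk"},
]

def unjoinable(components: list[dict], resources: list[dict]) -> list[str]:
    """Return component names that no telemetry identity joins to."""
    seen = {(r["service.namespace"], r["service.name"]) for r in resources}
    return [c["name"] for c in components
            if (c["namespace"], c["name"]) not in seen]

# Components the control loop cannot reconcile: no evidence joins to them.
orphans = unjoinable(model_components, telemetry_resources)
```

A component that shows up in this orphan list is invisible to the drift engine, which is why join-key hygiene belongs in the earliest phase of the programme.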
```yaml
apiVersion: architecture.bank.io/v1alpha1
kind: ArchitectureDefinition
metadata:
  name: credit-risk-platform
  labels:
    domain: lending
    criticality: critical
spec:
  owners:
    primary: group:risk-platform
    riskControl: group:enterprise-architecture
  dataClasses:
    - confidential
    - pii
  components:
    - name: risk-command-service
      runtime:
        namespace: credit-risk
        selector:
          app: risk-command-service
      role: command
      allowedCalls:
        - policy-service
        - decision-write-store
      emitsEvents:
        - decision-created
    - name: risk-query-service
      runtime:
        namespace: credit-risk
        selector:
          app: risk-query-service
      role: query
      allowedCalls:
        - decision-read-store
      forbiddenCalls:
        - risk-command-service
  contracts:
    apis:
      - ref: openapi://credit-risk/policy-service.yaml
    events:
      - ref: asyncapi://credit-risk/decision-created.yaml
status:
  observedGeneration: 7
  lastCompiledAt: "2026-04-15T19:32:11Z"
  driftSummary:
    openFindings: 2
    expiredExceptions: 1
```

Signals, Integration Points, and Control Patterns
The signal model is the heart of the system. Architecture usually fails because it is declared once and observed poorly thereafter. An Architecture Control Plane works only if it can ingest both design signals and runtime signals.
| Signal source | Typical artefacts | Value to the control plane | Key limitation | Evidence |
| --- | --- | --- | --- | --- |
| Distributed tracing | spans, parent-child relationships, resource attributes, traceparent/trace context | Best source for actual synchronous call graph and request flows | Needs instrumentation and propagation coverage | |
| Service-mesh telemetry and access logs | source workload, destination workload, response code, latency, access record | High-confidence service edge detection and traffic evidence, often without app changes | Only sees traffic that passes through the mesh or proxy | |
| Application and infrastructure logs | structured logs, log bodies, resource context, TraceId/SpanId | Useful for policy evidence, PII leak detection, and post-incident investigation | Unstructured logs create noise; correlation quality matters | |
| Infrastructure state | Kubernetes resources, Argo sync state, namespace labels, selectors | Shows how the system is deployed and whether declared targets are out of sync | Does not prove how distributed systems actually interact at runtime | |
| HTTP contracts | OpenAPI docs | Machine-readable declaration of synchronous interfaces and versions | Tells you what is intended, not what is called in practice | |
| Event contracts | AsyncAPI docs | Machine-readable declaration of event channels and operations | Consumer reality may drift from documentation | |
| Catalogue and ownership metadata | Backstage entities and relations | Connects services to owners, systems, and API relationships | Usually weaker on infrastructure truth and runtime truth | |
| CMDB and service mapping | configuration items and relationships | Strong enterprise configuration context, useful for risk, change and impact analysis | Often lags modern platform reality unless discovery is strong | |
| Policy outcomes | OPA decision logs, admission audit results | Verifiable evidence of what was allowed, denied, or warned | Tells you about policy decisions, not the full runtime graph | |
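Of these sources, distributed tracing yields the synchronous call graph most directly. A sketch of how observed edges can be derived from parent-child span relationships; the span shape is illustrative, loosely following OpenTelemetry conventions:

```python
# Each span records its own service identity plus a parent reference; a
# cross-service parent-child pair becomes one normalised Observation edge.
spans = [
    {"span_id": "a1", "parent_id": None, "service": "risk-query-service"},
    {"span_id": "b2", "parent_id": "a1", "service": "decision-read-store"},
    {"span_id": "c3", "parent_id": "a1", "service": "decision-read-store"},
]

def edges_from_spans(spans: list[dict]) -> set[tuple[str, str]]:
    """Derive deduplicated (caller, callee) service edges from spans."""
    by_id = {s["span_id"]: s for s in spans}
    edges = set()
    for span in spans:
        parent = by_id.get(span["parent_id"])
        # Only cross-service hops become edges; intra-service spans are noise.
        if parent and parent["service"] != span["service"]:
            edges.add((parent["service"], span["service"]))
    return edges

observed = edges_from_spans(spans)
```

A real normaliser would also carry timestamps, confidence, and source attribution into the Observation resource, but edge derivation itself stays this simple.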
This signal mix leads naturally to four distinct control patterns.
| Pattern | When it runs | Typical targets | Best for | Main risk | Evidence |
| --- | --- | --- | --- | --- | --- |
| Validation in PR and CI | Before merge or build promotion | Architecture lints, contract checks, generated policies | Cheap feedback, high developer leverage | False positives if the model is incomplete | OPA is explicitly designed for CI/CD use, and GitOps tools expect declarative sources of truth. |
| Admission-time validation or mutation | At deploy time | ValidatingAdmissionPolicy, Gatekeeper, Kyverno | Preventing bad state from entering the cluster | Can block releases if policies are too brittle or dependencies are unavailable | |
| Proxy-time enforcement | At request time | Istio CUSTOM authz + Envoy ext_authz + OPA-Envoy | Enforcing service-to-service rules and context-aware authz | Added control-path complexity; poor placement can introduce latency/availability issues | |
| Asynchronous runtime drift analysis | After observation, continuously | Graph engine, alerts, tickets, dashboards | Detecting architecture drift that cannot be pre-blocked safely | Findings can become noisy if signal normalisation is weak | |
The right design is not “enforcement everywhere.” The right design is progressive control. Start by validating in PRs and builds. Add admission control for low-regret invariants such as required labels, owners, or network classes. Add proxy-time enforcement only for high-value edges such as forbidden calls between zones, data domains, or CQRS boundaries. Keep the rest in asynchronous drift detection until signal quality and exception processes are mature.
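A "low-regret invariant" of the kind worth enforcing first at admission time can be as small as a required-metadata check. A sketch; the label keys are this article's examples, not any standard, and a real deployment would run this inside an admission webhook or policy engine rather than a plain function:

```python
# Required label keys, taken from the article's examples of low-regret
# invariants: ownership, criticality, and data classification.
REQUIRED_LABELS = {"owner", "criticality", "data-classification"}

def admission_warnings(manifest: dict) -> list[str]:
    """Return one warning per required label missing from a manifest."""
    labels = manifest.get("metadata", {}).get("labels", {})
    missing = sorted(REQUIRED_LABELS - labels.keys())
    return [f"missing required label: {key}" for key in missing]

manifest = {"metadata": {"labels": {"owner": "group:risk-platform"}}}
warnings = admission_warnings(manifest)
```

Starting in warn mode and flipping to deny only after the false-positive rate settles is exactly the progressive-control posture described above.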
AI, Security, Compliance, and Operating Model
AI is not the control plane; it is an amplifier on top of the control plane’s machine-readable evidence. Once the estate is represented as structured resources, contracts, telemetry, and policy outputs, there are three sensible AI roles.
| AI role | Inputs | Safe autonomy level | Good outcomes | Guardrails |
| --- | --- | --- | --- | --- |
| Analysis | topology graph, contracts, traces, decision logs, findings | High | summarise drift, cluster related findings, identify likely root cause | human-readable evidence and traceability back to raw inputs |
| Suggestion | findings, ownership, playbooks, recent changes | Medium | propose remediation steps, exception scopes, rollout plans, missing metadata | no direct production changes without approval |
| Automated remediation | narrow, pre-approved playbooks | Low | open PRs, create tickets, add expired-exception alerts, revert known-bad policy changes | approval gates, rollback, blast-radius control |
This is not speculative hand-waving. OpenTelemetry gives you normalised telemetry and resource attributes, OPA gives you decision logs with full decision context and bundle metadata, Backstage gives you structured ownership and API relationships, and runtime infrastructure gives you resource versions and watchable state.
The operating model should therefore be explicit.
| Function | Recommended owner | Independence expectation |
| --- | --- | --- |
| Architecture model authoring for a service | Product / domain engineering team | Same team that owns the service should maintain the model |
| Platform ownership of the control plane | Platform engineering | Centralised operations and reliability ownership |
| Policy standard definition | Enterprise architecture and security | Independent from day-to-day feature delivery |
| Runtime evidence and dashboards | SRE / observability platform | Shared operational service |
| Exception approval | Architecture governance plus risk/security depending on severity | Independent approval for high-risk deviations |
| Assurance and testing | Internal audit / control functions | Independent review and objective assurance |
This follows both Backstage’s ownership semantics and the EBA’s emphasis on independence and assurance. Ownership must be singular and discoverable; assurance must be independent enough to be credible.
Roadmap, Economics, KPIs, and Actionable Next Steps
The implementation path should be incremental. Control planes fail when teams try to standardise the whole estate before the model has earned trust.
A practical roadmap looks like this.
| Phase | Timeline | Milestones | Recommended KPIs |
| --- | --- | --- | --- |
| Preparation | 3–6 weeks | choose one critical domain; define stable component IDs; standardise owner metadata; align service names with telemetry | model coverage for pilot domain > 70%; owner coverage > 95% |
| MVP | 8–12 weeks | resource API; ArchitectureDefinition schema; PR checks; trace ingestion; first drift dashboard | mean time to detect drift < 1 day; false-positive rate < 20%; PR feedback time < 10 minutes |
| Production pilot | 3–6 months | exception API; approval workflow; admission validation; generated policy bindings; evidence export for risk/audit | exception expiry adherence > 95%; blocking policy precision > 90%; architecture review lead time reduced by 30–50% |
| Scale-out | 6–12 months | multi-cluster coverage; service-mesh integration where needed; CMDB sync; AI suggestions; remediation playbooks | coverage of tier-1 services > 80%; drift MTTD < 1 hour for protected domains; unauthorised edge recurrence down by > 50% |
These KPIs are recommended control objectives for programme design, not industry-standard thresholds.
The effort and cost profile depends much more on estate maturity and signal quality than on raw service count.
| Scenario | Estate assumptions | Team shape | Indicative elapsed time | Relative engineering effort | Relative platform/tool spend |
| --- | --- | --- | --- | --- | --- |
| Low | 20–50 services, one strategic cluster, GitOps already present, good telemetry baseline | 2–3 engineers plus part-time architect | 10–14 weeks | Low to medium | Low |
| Medium | 100–200 services, multiple namespaces/clusters, mixed telemetry quality, one catalogue or CMDB already present | 4–6 engineers, one platform lead, one architect, one security/policy lead | 4–6 months | Medium | Medium |
| High | 300+ services, multiple regions/platforms, strong segregation requirements, partial legacy estate, multiple policy engines | 8–12 engineers, SRE, architect, security engineering, change/risk integration | 9–12 months | High | Medium to high |
The biggest hidden cost is model hygiene. If owners are missing, service names are inconsistent, contracts are absent, or traces do not propagate correct context, the control plane spends its life reconciling broken metadata rather than detecting architecture drift.
Actionable next steps for a bank are therefore very concrete:
1. Pick one domain where architecture drift is already painful and visible, such as credit risk, payments, or customer servicing.
2. Standardise service identity: service.name, service.namespace, owner ID, criticality, data classification.
3. Define the first ArchitectureDefinition schema and keep it intentionally small.
4. Wire the schema into PR checks before adding any hard runtime enforcement.
5. Build one drift detector from traces only; do not start with every signal source.
6. Add an exception resource with owner, expiry, reason, and compensating controls before you add any deny rules.
7. Only after the first two sprints have produced clean findings, compile selected rules into admission or mesh enforcement.
Conclusion
Architecture Control Plane is not a metaphor borrowed for effect. It is a rigorous next step in the same lineage as infrastructure control planes, GitOps reconcilers, service-mesh control planes, and policy control planes. In banking, where architecture needs to be not only elegant but continuously evidenced, it is a more realistic operating model than architecture boards trying to govern distributed systems with static diagrams.