
Architecture Drift: Architecture Control Plane

  • Writer: Chandrasekar Jayabharathy
  • 15 hours ago
  • 9 min read

Level 3: Architecture as Code



Summary

“Architecture Control Plane” is best understood as a proposed control-plane layer for architecture itself: a resource-oriented system that stores architectural intent as machine-readable objects, reconciles that intent against observed runtime and delivery reality, compiles intent into validation and enforcement controls, and exposes findings, exceptions, evidence, and remediation workflows through APIs. This is not a product category with a single canonical implementation today; it is a design pattern synthesised from established control-plane ideas in Kubernetes, GitOps delivery, service mesh, policy-as-code, software catalogues, and observability.

The motivation is straightforward. Existing control planes manage infrastructure state, deployment state, policy decisions, or traffic behaviour; Kubernetes, for example, manages cluster objects through a REST API and controllers. But none of these systems, by itself, owns the higher-order question that architects care about: is the running system still conforming to the intended service boundaries, dependency rules, data-classification rules, and operating constraints?

The practical conclusion is that an Architecture Control Plane is feasible now, not after some future platform rewrite. If a bank already has Git-based delivery, Kubernetes or an equivalent platform, an observability stack, and a policy engine, an MVP can usually be built in one quarter for a bounded service domain. Scaling it into a bank-wide capability is a six-to-twelve-month programme, mostly because of model standardisation, instrumentation coverage, and operating-model change rather than raw technology difficulty.


Framing the Architecture Control Plane

A useful way to define the idea is by analogy. Kubernetes describes a control plane that manages the overall cluster state through an API server, scheduler, controller manager, and persistent state in etcd. Controllers watch desired and current state, then act to reconcile the two. An Architecture Control Plane applies the same control-loop idea to architecture: the desired state is the Architecture-as-Code model; the current state is the observed dependency graph, contracts, deployment shape, and policy outcomes; and the reconcilers are compilers, validators, detectors, and remediators that move the estate back toward intended structure.
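The control loop described above can be sketched in a few lines. This is an illustrative Python sketch, not a real API: the Edge type, the reconcile function, and the sample service names are all assumptions made for the example.

```python
# Minimal sketch of the architecture reconciliation loop: desired state is a set
# of allowed/forbidden edges; current state is the set of observed call edges.
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    source: str
    target: str

def reconcile(allowed: set[Edge], forbidden: set[Edge], observed: set[Edge]) -> list[str]:
    """Compare observed call edges against declared intent and emit drift findings."""
    findings = []
    for edge in observed:
        if edge in forbidden:
            findings.append(f"FORBIDDEN edge in use: {edge.source} -> {edge.target}")
        elif edge not in allowed:
            findings.append(f"UNDECLARED edge observed: {edge.source} -> {edge.target}")
    return findings

allowed = {Edge("risk-command-service", "policy-service")}
forbidden = {Edge("risk-query-service", "risk-command-service")}
observed = {Edge("risk-command-service", "policy-service"),
            Edge("risk-query-service", "risk-command-service")}
print(reconcile(allowed, forbidden, observed))
```

In practice the allowed and forbidden sets would be compiled from the Architecture-as-Code model, the observed set would come from runtime telemetry, and the loop would run continuously and emit DriftFinding resources rather than strings.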


Comparison with Existing Control Planes

An Architecture Control Plane should be positioned as an orchestrating layer above existing domain-specific control planes, not as a replacement for them.

Kubernetes infrastructure control plane
  • Primary managed object: Cluster resources such as Pods, Services, Deployments
  • Source of desired state: Resource manifests and custom resources
  • Main observation surface: API server state, watches, controllers
  • Main actuators: Scheduling, reconciliation, admission, operators
  • Architectural blind spot: Knows infrastructure and workload topology, not business boundaries or allowed system relationships

GitOps / CI-CD delivery plane
  • Primary managed object: Application release state
  • Source of desired state: Git repository state
  • Main observation surface: Live-vs-target diff, health status
  • Main actuators: Deploy, sync, rollback
  • Architectural blind spot: Detects manifest drift, not architecture drift across services and data flows

Policy-as-code control plane
  • Primary managed object: Policy decisions and admission invariants
  • Source of desired state: Rego, CEL, YAML policy definitions
  • Main observation surface: Admission requests, API calls, request context, audits
  • Main actuators: Allow/deny, mutate, warn, audit
  • Architectural blind spot: Enforces rules, but typically without a first-class architecture graph as the source of truth

Service-mesh traffic control plane
  • Primary managed object: Proxy behaviour and traffic policy
  • Source of desired state: Mesh config and traffic policy resources
  • Main observation surface: In-mesh traffic telemetry, access logs, traces
  • Main actuators: Route, retry, circuit-break, authz extension, mTLS
  • Architectural blind spot: Sees traffic brilliantly, but only traffic; does not model system intent on its own

Architecture Control Plane
  • Primary managed object: Service boundaries, allowed relationships, data and control constraints, exceptions, findings
  • Source of desired state: Architecture-as-Code schema plus contracts, ownership, and policy bindings
  • Main observation surface: Runtime graph, code and contract graph, deployment graph, policy outcomes
  • Main actuators: Validation, enforcement compilation, findings, exceptions, remediation workflows
  • Architectural blind spot: Its challenge is model completeness and operational adoption, not lack of technical leverage (this report’s synthesis)

The comparison makes the essential point: infrastructure, delivery, policy, and traffic control planes govern how a system runs; an Architecture Control Plane governs whether it is still the system you intended to run. That is why it should sit above them and compile intent into them.


Core Capabilities, APIs, and Data Model

A credible Architecture Control Plane needs six capabilities.

  1. It needs a model registry for Architecture-as-Code definitions.

  2. It needs a graph normaliser that turns disparate sources (contracts, deployment manifests, service selectors, owners, exceptions) into one consistent graph.

  3. It needs a compiler that translates architecture intent into downstream control artefacts such as CI checks, admission policies, mesh authz rules, or OPA bundle content.

  4. It needs signal ingestion so that traces, logs, policy decisions, and runtime state can be attached to the model.

  5. It needs a drift engine that compares intended and observed state.

  6. It needs exception and remediation workflows that are time-bounded, auditable, and tied to named owners.

Those ingredients are exactly what make a control plane a control plane rather than a static registry.


A practical resource inventory for an Architecture Control Plane looks as follows.

ArchitectureDefinition
  • Purpose: Canonical system intent
  • Minimum schema requirements: metadata, spec.components, spec.allowedEdges, spec.forbiddenEdges, spec.dataClasses, spec.owners, status.conditions
  • Why it exists: Declares what the architecture should be

Component
  • Purpose: Individual workload or subsystem
  • Minimum schema requirements: stable ID, runtime selectors, owner, criticality, lifecycle tier
  • Why it exists: Binds design intent to actual workloads

ApiContract
  • Purpose: HTTP contract reference
  • Minimum schema requirements: OpenAPI ref, provider, consumer, version
  • Why it exists: Anchors synchronous interfaces in machine-readable form

EventContract
  • Purpose: Event/message contract reference
  • Minimum schema requirements: AsyncAPI ref, channel/topic, producer, consumers
  • Why it exists: Anchors asynchronous interfaces in machine-readable form

PolicyBinding
  • Purpose: Compilation target
  • Minimum schema requirements: target type, enforcement mode, scope, generated artefact refs
  • Why it exists: Connects architecture rules to CI, admission, mesh, or OPA

Observation
  • Purpose: Runtime evidence
  • Minimum schema requirements: source, timestamp, normalised edge, resource attributes, confidence
  • Why it exists: Stores observed reality in a consistent graph

DriftFinding
  • Purpose: Detected divergence
  • Minimum schema requirements: severity, violating rule, evidence refs, owner, SLA, status
  • Why it exists: Makes drift actionable rather than anecdotal

ArchitectureException
  • Purpose: Approved deviation
  • Minimum schema requirements: rationale, approver, expiry, scope, compensating controls
  • Why it exists: Prevents “temporary” exceptions from becoming invisible debt

RemediationAction
  • Purpose: Suggested or approved fix
  • Minimum schema requirements: playbook type, approval state, target systems, rollback plan
  • Why it exists: Enables safe automation

The schema requirement is not only “be declarative.” It is “be joinable.” If your architecture model cannot be joined to telemetry, catalogue entities, contracts, admission events, and deployment objects, it cannot be reconciled in a reliable control loop.


apiVersion: architecture.bank.io/v1alpha1 
kind: ArchitectureDefinition
metadata:
  name: credit-risk-platform
  labels:
    domain: lending
    criticality: critical
spec:
  owners:
    primary: group:risk-platform
    riskControl: group:enterprise-architecture
  dataClasses:
    - confidential
    - pii
  components:
    - name: risk-command-service
      runtime:
        namespace: credit-risk
        selector:
          app: risk-command-service
      role: command
      allowedCalls:
        - policy-service
        - decision-write-store
      emitsEvents:
        - decision-created
    - name: risk-query-service
      runtime:
        namespace: credit-risk
        selector:
          app: risk-query-service
      role: query
      allowedCalls:
        - decision-read-store
      forbiddenCalls:
        - risk-command-service
  contracts:
    apis:  
      - ref: openapi://credit-risk/policy-service.yaml
    events:
      - ref: asyncapi://credit-risk/decision-created.yaml
status:
  observedGeneration: 7
  lastCompiledAt: "2026-04-15T19:32:11Z"
  driftSummary:
    openFindings: 2
    expiredExceptions: 1
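A PR-time lint over a definition like the one above can start very small. The dict below mirrors a subset of the YAML (in practice it would be loaded with a YAML parser); the lint function and its two checks are illustrative assumptions, not a published schema validator.

```python
# Minimal PR-check sketch over a parsed ArchitectureDefinition: no call may be
# both allowed and forbidden for the same component, and no component may
# declare a call to itself.
definition = {
    "spec": {
        "components": [
            {"name": "risk-command-service",
             "allowedCalls": ["policy-service", "decision-write-store"]},
            {"name": "risk-query-service",
             "allowedCalls": ["decision-read-store"],
             "forbiddenCalls": ["risk-command-service"]},
        ]
    }
}

def lint(definition: dict) -> list[str]:
    errors = []
    for comp in definition["spec"]["components"]:
        allowed = set(comp.get("allowedCalls", []))
        forbidden = set(comp.get("forbiddenCalls", []))
        for target in sorted(allowed & forbidden):
            errors.append(f"{comp['name']}: {target} is both allowed and forbidden")
        if comp["name"] in allowed:
            errors.append(f"{comp['name']}: component may not declare a call to itself")
    return errors

print(lint(definition))  # [] -- the example definition is internally consistent
```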

Signals, Integration Points, and Control Patterns

The signal model is the heart of the system. Architecture usually fails because it is declared once and observed poorly thereafter. An Architecture Control Plane works only if it can ingest both design signals and runtime signals.

Distributed tracing
  • Typical artefacts: spans, parent-child relationships, resource attributes, traceparent/trace context
  • Value to the control plane: Best source for actual synchronous call graph and request flows
  • Key limitation: Needs instrumentation and propagation coverage

Service-mesh telemetry and access logs
  • Typical artefacts: source workload, destination workload, response code, latency, access record
  • Value to the control plane: High-confidence service edge detection and traffic evidence, often without app changes
  • Key limitation: Only sees traffic that passes through the mesh or proxy

Application and infrastructure logs
  • Typical artefacts: structured logs, log bodies, resource context, TraceId/SpanId
  • Value to the control plane: Useful for policy evidence, PII leak detection, and post-incident investigation
  • Key limitation: Unstructured logs create noise; correlation quality matters

Infrastructure state
  • Typical artefacts: Kubernetes resources, Argo sync state, namespace labels, selectors
  • Value to the control plane: Shows how the system is deployed and whether declared targets are out of sync
  • Key limitation: Does not prove how distributed systems actually interact at runtime

HTTP contracts
  • Typical artefacts: OpenAPI docs
  • Value to the control plane: Machine-readable declaration of synchronous interfaces and versions
  • Key limitation: Tells you what is intended, not what is called in practice

Event contracts
  • Typical artefacts: AsyncAPI docs
  • Value to the control plane: Machine-readable declaration of event channels and operations
  • Key limitation: Consumer reality may drift from documentation

Catalogue and ownership metadata
  • Typical artefacts: Backstage entities and relations
  • Value to the control plane: Connects services to owners, systems, and API relationships
  • Key limitation: Usually weaker on infrastructure truth and runtime truth

CMDB and service mapping
  • Typical artefacts: configuration items and relationships
  • Value to the control plane: Strong enterprise configuration context, useful for risk, change and impact analysis
  • Key limitation: Often lags modern platform reality unless discovery is strong

Policy outcomes
  • Typical artefacts: OPA decision logs, admission audit results
  • Value to the control plane: Verifiable evidence of what was allowed, denied, or warned
  • Key limitation: Tells you about policy decisions, not the full runtime graph

This signal mix leads naturally to four distinct control patterns.

Validation in PR and CI
  • When it runs: Before merge or build promotion
  • Typical targets: Architecture lints, contract checks, generated policies
  • Best for: Cheap feedback, high developer leverage
  • Main risk: False positives if the model is incomplete
  • Evidence: OPA is explicitly designed for CI/CD use, and GitOps tools expect declarative sources of truth

Admission-time validation or mutation
  • When it runs: At deploy time
  • Typical targets: ValidatingAdmissionPolicy, Gatekeeper, Kyverno
  • Best for: Preventing bad state from entering the cluster
  • Main risk: Can block releases if policies are too brittle or dependencies are unavailable

Proxy-time enforcement
  • When it runs: At request time
  • Typical targets: Istio CUSTOM authz + Envoy ext_authz + OPA-Envoy
  • Best for: Enforcing service-to-service rules and context-aware authz
  • Main risk: Added control-path complexity; poor placement can introduce latency/availability issues

Asynchronous runtime drift analysis
  • When it runs: After observation, continuously
  • Typical targets: Graph engine, alerts, tickets, dashboards
  • Best for: Detecting architecture drift that cannot be pre-blocked safely
  • Main risk: Findings can become noisy if signal normalisation is weak
The right design is not “enforcement everywhere.” The right design is progressive control. Start by validating in PRs and builds. Add admission control for low-regret invariants such as required labels, owners, or network classes. Add proxy-time enforcement only for high-value edges such as forbidden calls between zones, data domains, or CQRS boundaries. Keep the rest in asynchronous drift detection until signal quality and exception processes are mature.
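The "compile intent into controls" step for a high-value forbidden edge can be sketched as simple policy generation. The Rego rule shape and the input.source/input.destination attribute paths below are assumptions made for illustration; a real ext_authz integration would target the policy engine's actual input document and may use the newer rego.v1 syntax.

```python
# Sketch: compiling a forbiddenCalls rule from the architecture model into an
# OPA-style deny rule. The Rego shape below is illustrative, not a drop-in policy.
RULE_TEMPLATE = """\
deny[msg] {{
  input.source.workload == "{source}"
  input.destination.workload == "{target}"
  msg := "architecture: {source} must not call {target}"
}}
"""

def compile_forbidden(component: dict) -> str:
    rules = [RULE_TEMPLATE.format(source=component["name"], target=t)
             for t in component.get("forbiddenCalls", [])]
    return "package architecture.authz\n\n" + "".join(rules)

policy = compile_forbidden({"name": "risk-query-service",
                            "forbiddenCalls": ["risk-command-service"]})
print(policy)
```

Generating policies from the model, rather than hand-writing them, is what keeps the downstream enforcement planes consistent with the declared architecture.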


AI, Security, Compliance, and Operating Model

AI is not the control plane; it is an amplifier on top of the control plane’s machine-readable evidence. Once the estate is represented as structured resources, contracts, telemetry, and policy outputs, there are three sensible AI roles.

Analysis
  • Inputs: topology graph, contracts, traces, decision logs, findings
  • Safe autonomy level: High
  • Good outcomes: summarise drift, cluster related findings, identify likely root cause
  • Guardrails: human-readable evidence and traceability back to raw inputs

Suggestion
  • Inputs: findings, ownership, playbooks, recent changes
  • Safe autonomy level: Medium
  • Good outcomes: propose remediation steps, exception scopes, rollout plans, missing metadata
  • Guardrails: no direct production changes without approval

Automated remediation
  • Inputs: narrow, pre-approved playbooks
  • Safe autonomy level: Low
  • Good outcomes: open PRs, create tickets, add expired-exception alerts, revert known-bad policy changes
  • Guardrails: approval gates, rollback, blast-radius control

This is not speculative hand-waving. OpenTelemetry gives you normalised telemetry and resource attributes; OPA gives you decision logs with full decision context and bundle metadata; Backstage gives you structured ownership and API relationships; and the runtime infrastructure gives you resource versions and watchable state.


The operating model should therefore be explicit.

Architecture model authoring for a service
  • Recommended owner: Product / domain engineering team
  • Independence expectation: Same team that owns the service should maintain the model

Platform ownership of the control plane
  • Recommended owner: Platform engineering
  • Independence expectation: Centralised operations and reliability ownership

Policy standard definition
  • Recommended owner: Enterprise architecture and security
  • Independence expectation: Independent from day-to-day feature delivery

Runtime evidence and dashboards
  • Recommended owner: SRE / observability platform
  • Independence expectation: Shared operational service

Exception approval
  • Recommended owner: Architecture governance plus risk/security depending on severity
  • Independence expectation: Independent approval for high-risk deviations

Assurance and testing
  • Recommended owner: Internal audit / control functions
  • Independence expectation: Independent review and objective assurance

This follows both Backstage’s ownership semantics and the EBA’s emphasis on independence and assurance. Ownership must be singular and discoverable; assurance must be independent enough to be credible.


Roadmap, Economics, KPIs, and Actionable Next Steps

The implementation path should be incremental. Control planes fail when teams try to standardise the whole estate before the model has earned trust.


A practical roadmap looks like this.

Preparation
  • Timeline: 3–6 weeks
  • Milestones: choose one critical domain; define stable component IDs; standardise owner metadata; align service names with telemetry
  • Recommended KPIs: model coverage for pilot domain > 70%; owner coverage > 95%

MVP
  • Timeline: 8–12 weeks
  • Milestones: resource API; ArchitectureDefinition schema; PR checks; trace ingestion; first drift dashboard
  • Recommended KPIs: mean time to detect drift < 1 day; false-positive rate < 20%; PR feedback time < 10 minutes

Production pilot
  • Timeline: 3–6 months
  • Milestones: exception API; approval workflow; admission validation; generated policy bindings; evidence export for risk/audit
  • Recommended KPIs: exception expiry adherence > 95%; blocking policy precision > 90%; architecture review lead time reduced by 30–50%

Scale-out
  • Timeline: 6–12 months
  • Milestones: multi-cluster coverage; service-mesh integration where needed; CMDB sync; AI suggestions; remediation playbooks
  • Recommended KPIs: coverage of tier-1 services > 80%; drift MTTD < 1 hour for protected domains; unauthorised edge recurrence down by > 50%

These KPIs are recommended control objectives for programme design, not industry-standard thresholds.

The effort and cost profile depends much more on estate maturity and signal quality than on raw service count.

Scenario: Low
  • Estate assumptions: 20–50 services, one strategic cluster, GitOps already present, good telemetry baseline
  • Team shape: 2–3 engineers plus part-time architect
  • Indicative elapsed time: 10–14 weeks
  • Relative engineering effort: Low to medium
  • Relative platform/tool spend: Low

Scenario: Medium
  • Estate assumptions: 100–200 services, multiple namespaces/clusters, mixed telemetry quality, one catalogue or CMDB already present
  • Team shape: 4–6 engineers, one platform lead, one architect, one security/policy lead
  • Indicative elapsed time: 4–6 months
  • Relative engineering effort: Medium
  • Relative platform/tool spend: Medium

Scenario: High
  • Estate assumptions: 300+ services, multiple regions/platforms, strong segregation requirements, partial legacy estate, multiple policy engines
  • Team shape: 8–12 engineers, SRE, architect, security engineering, change/risk integration
  • Indicative elapsed time: 9–12 months
  • Relative engineering effort: High
  • Relative platform/tool spend: Medium to high

The biggest hidden cost is model hygiene. If owners are missing, service names are inconsistent, contracts are absent, or traces do not propagate correct context, the control plane spends its life reconciling broken metadata rather than detecting architecture drift.


Actionable next steps for a bank are therefore very concrete:

  1. Pick one domain where architecture drift is already painful and visible, such as credit risk, payments, or customer servicing.

  2. Standardise service identity: service.name, service.namespace, owner ID, criticality, data classification.

  3. Define the first ArchitectureDefinition schema and keep it intentionally small.

  4. Wire the schema into PR checks before adding any hard runtime enforcement.

  5. Build one drift detector from traces only; do not start with every signal source.

  6. Add an exception resource with owner, expiry, reason, and compensating controls before you add any deny rules.

  7. Only after the first two sprints of clean findings, compile selected rules into admission or mesh enforcement.
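Step 5, the trace-only drift detector, can start as nothing more than parent-child span pairs whose services differ. The span dicts and field names below are illustrative; real spans would come from an OpenTelemetry backend, with the service name taken from resource attributes.

```python
# Sketch: deriving observed service-to-service edges from traces alone. An edge
# exists wherever a child span's service differs from its parent's service.
def edges_from_spans(spans: list[dict]) -> set[tuple[str, str]]:
    by_id = {s["span_id"]: s for s in spans}
    edges = set()
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        if parent and parent["service"] != span["service"]:
            edges.add((parent["service"], span["service"]))
    return edges

spans = [
    {"span_id": "a1", "parent_id": None, "service": "risk-query-service"},
    {"span_id": "b2", "parent_id": "a1", "service": "risk-command-service"},
]
print(edges_from_spans(spans))  # {('risk-query-service', 'risk-command-service')}
```

Feeding these edges into the reconciliation loop is what turns a forbidden CQRS call, like the one in this example, from an anecdote into a DriftFinding.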


Conclusion

An Architecture Control Plane is not a metaphor borrowed for effect. It is a rigorous next step in the same lineage as infrastructure control planes, GitOps reconcilers, service-mesh control planes, and policy control planes. In banking, where architecture needs to be not only elegant but continuously evidenced, it is a more realistic operating model than architecture boards trying to govern distributed systems with static diagrams.

 
 
 
