A World Model for SRE Agents Across Observability Tools

An SRE agent starts every investigation missing the same thing: a model of how the system is wired. Which services call which, what each one reads from, where workloads are co-resident, how much redundancy a service actually has. People hold fragments of this in their heads and in stale wiki diagrams. The accurate version is latent in telemetry, and the telemetry is split across backends that do not share an identity scheme.

We build that model as a standing artifact and refresh it on a schedule. It is descriptive. It records entities and the relationships between them, with no judgment about what is healthy. Diagnosis is a separate job that reads the model.

Reconciling identity across backends

A typical environment has metrics in one system, logs in another, traces in a third, and the platform's own config as a fourth. Each is queried differently and, more importantly, names the same entity differently. One service can appear as "payment" in a metrics label, "payment-svc" in a span's resource attributes, and "payment-7c9f" at the pod level in logs. Before any of these can be joined, each name has to be resolved to a canonical entity.

This is the step that quietly decides whether the model is correct. Merge two identities that are distinct and you collapse two services into one node. Fail to merge two names for the same service and you fan one service into several, each holding a fraction of its real edges. We resolve conservatively, carry provenance for every alias, keep the mapping reviewable, and leave names separate when the evidence is weak rather than guessing. Identity resolution runs before edge construction, and its output is auditable on its own.
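A minimal sketch of conservative resolution follows, using the three payment names from above. The Alias and Entity types, the evidence scores, and the MERGE_THRESHOLD cutoff are all hypothetical, chosen for illustration. The "payments-batch" name is also invented, to show a weak match being left separate.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Alias:
    name: str      # name exactly as the backend reports it
    source: str    # backend that reported it (provenance)
    evidence: str  # human-readable reason for the proposed mapping
    score: float   # hypothetical evidence strength in [0, 1]

@dataclass
class Entity:
    canonical: str
    aliases: list[Alias] = field(default_factory=list)

MERGE_THRESHOLD = 0.8  # assumed cutoff; below it a name stays separate

def resolve(candidates: list[tuple[str, Alias]]) -> dict[str, Entity]:
    """Merge an alias into its proposed canonical entity only when the
    evidence is strong; otherwise keep it as its own entity."""
    entities: dict[str, Entity] = {}
    for proposed, alias in candidates:
        target = proposed if alias.score >= MERGE_THRESHOLD else alias.name
        entities.setdefault(target, Entity(target)).aliases.append(alias)
    return entities

candidates = [
    ("payment", Alias("payment", "metrics", "label value", 1.0)),
    ("payment", Alias("payment-svc", "traces", "resource attribute matches deployment", 0.9)),
    ("payment", Alias("payment-7c9f", "logs", "pod owned by the payment deployment", 0.95)),
    # Invented weak match: name similarity alone is not enough to merge.
    ("payment", Alias("payments-batch", "logs", "name prefix similarity only", 0.4)),
]
model = resolve(candidates)
for entity in model.values():
    print(entity.canonical, [(a.name, a.source) for a in entity.aliases])
```

Because every alias keeps its source and evidence, the mapping can be reviewed on its own before any edge is built from it.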

Edges carry how they were derived

Once entities are canonical, edges come from whatever signals are present, and each edge records its derivation and a confidence alongside it. That metadata is the point. A downstream reader weights an observed edge differently from an inferred one.

Source	How an edge is derived	Confidence
Traces	parent and child spans, observed at runtime	high
Logs	shared request id across services, client target in message text	high to medium
Config	declared dependency, placement, replica count	high that the edge exists, silent on whether it is exercised
Metrics	temporal lead and lag between series	low, corroborating only

With traces present, the directed call graph falls out directly. Without them, request ids in logs reconstruct most of it, config supplies declared edges, and metric timing is used only to corroborate a candidate another source already proposed, never to assert an edge on its own. An edge that config declares but no signal ever exercises is kept and marked dormant, which is itself worth knowing.
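The rules in this paragraph can be sketched as below. This is an illustration under assumptions, not the production logic. The tuple format for observations is hypothetical, and so are the service names other than checkout and payment. Logs are pinned to the conservative end of their "high to medium" range. Only traces and logs count as exercising an edge.

```python
from dataclasses import dataclass, field

# Confidence each source contributes when it asserts an edge. Metrics is
# deliberately absent: it may corroborate an edge but never create one.
ASSERTING = {"traces": "high", "logs": "medium", "config": "high"}
RANK = {"medium": 1, "high": 2}
RUNTIME = {"traces", "logs"}  # assumption: sources that show an edge in use

@dataclass
class Edge:
    src: str
    dst: str
    derivations: set = field(default_factory=set)
    confidence: str = "medium"
    dormant: bool = False

def build_edges(observations):
    """observations: iterable of (source, src, dst) tuples."""
    edges = {}
    corroborated = set()
    for source, src, dst in observations:
        if source == "metrics":
            corroborated.add((src, dst))  # remembered, never asserted
            continue
        edges.setdefault((src, dst), Edge(src, dst)).derivations.add(source)
    for key, edge in edges.items():
        # Strongest asserting source wins.
        edge.confidence = max(
            (ASSERTING[s] for s in edge.derivations), key=RANK.__getitem__
        )
        # Declared in config but never seen at runtime: keep it, mark it.
        edge.dormant = "config" in edge.derivations and not (edge.derivations & RUNTIME)
        if key in corroborated:
            edge.derivations.add("metrics")
    return edges

edges = build_edges([
    ("traces", "checkout", "payment"),
    ("metrics", "checkout", "payment"),   # corroborates the traced edge
    ("config", "payment", "ledger-db"),   # declared, never exercised
    ("logs", "checkout", "inventory"),
    ("metrics", "cart", "search"),        # metric timing alone: no edge
])
for edge in edges.values():
    print(edge)
```

The cart to search pair never becomes an edge because only metrics proposed it, while the payment to ledger-db edge survives with dormant set.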

Direction and weight

Each edge is directed and weighted, and the direction encodes more than call flow. If checkout calls payment, failure in payment propagates toward checkout while load from checkout propagates toward payment. The two travel opposite ways along one edge. Co-residency, two workloads on the same node, is an undirected fate relationship that no call graph shows. Weight is the share of traffic the edge carries, so a reader walking the graph for impact gets a ranked answer instead of a reachability set where everything eventually touches everything.
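To make the ranked answer concrete, here is one possible sketch of failure impact. It walks call edges backwards from a failed service and scores each upstream caller by the strongest product of weights along any path. The scoring rule is an assumption, as is reading weight as the share of the caller's traffic on that edge. The frontend, search, and inventory services are hypothetical.

```python
import heapq

# (caller, callee) -> share of the caller's traffic on this edge (assumed meaning)
CALLS = {
    ("frontend", "checkout"): 0.5,
    ("frontend", "search"): 0.5,
    ("checkout", "payment"): 0.75,
    ("checkout", "inventory"): 0.25,
}

def failure_impact(calls, failed):
    """Rank services affected by `failed`. Failure travels from callee to
    caller, so the walk follows call edges in reverse."""
    callers = {}
    for (caller, callee), weight in calls.items():
        callers.setdefault(callee, []).append((caller, weight))

    best = {failed: 1.0}
    heap = [(-1.0, failed)]  # max-heap via negated scores
    while heap:
        neg_score, node = heapq.heappop(heap)
        score = -neg_score
        if score < best.get(node, 0.0):
            continue  # stale heap entry
        for caller, weight in callers.get(node, []):
            candidate = score * weight
            if candidate > best.get(caller, 0.0):
                best[caller] = candidate
                heapq.heappush(heap, (-candidate, caller))

    del best[failed]
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

ranking = failure_impact(CALLS, "payment")
print(ranking)
```

Load impact would use the same walk with each (caller, callee) pair swapped, since load travels from caller to callee. Co-residency is not modeled here and would need a separate undirected edge set.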

A scheduled snapshot makes structural change observable. Diff the snapshots from two days and the model surfaces:

a dependency that appeared or disappeared
a service whose replica count dropped
a call that now crosses an availability zone
a declared edge that went dormant or came back
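A diff that reports the four kinds of change listed above could look like the following sketch. The snapshot layout (a "replicas" map plus an "edges" map with "cross_az" and "dormant" flags) is a hypothetical format chosen for the example, and the ledger-db and inventory services are invented.

```python
def diff_snapshots(old, new):
    """Return (change, subject) tuples describing structural differences."""
    changes = []
    old_edges, new_edges = old["edges"], new["edges"]

    for key in sorted(new_edges.keys() - old_edges.keys()):
        changes.append(("dependency appeared", key))
    for key in sorted(old_edges.keys() - new_edges.keys()):
        changes.append(("dependency disappeared", key))

    for key in sorted(old_edges.keys() & new_edges.keys()):
        before, after = old_edges[key], new_edges[key]
        if not before["cross_az"] and after["cross_az"]:
            changes.append(("call now crosses an availability zone", key))
        if before["dormant"] != after["dormant"]:
            label = "declared edge went dormant" if after["dormant"] else "declared edge came back"
            changes.append((label, key))

    for service in sorted(old["replicas"].keys() & new["replicas"].keys()):
        if new["replicas"][service] < old["replicas"][service]:
            changes.append(("replica count dropped", service))
    return changes

monday = {
    "replicas": {"checkout": 2, "payment": 3},
    "edges": {
        ("checkout", "payment"): {"cross_az": False, "dormant": False},
        ("payment", "ledger-db"): {"cross_az": False, "dormant": False},
    },
}
tuesday = {
    "replicas": {"checkout": 2, "payment": 1},
    "edges": {
        ("checkout", "payment"): {"cross_az": True, "dormant": False},
        ("payment", "ledger-db"): {"cross_az": False, "dormant": True},
        ("checkout", "inventory"): {"cross_az": False, "dormant": False},
    },
}
changes = diff_snapshots(monday, tuesday)
for change in changes:
    print(change)
```

The diff stays descriptive, in line with the model itself: it reports what changed and leaves the judgment about whether a change matters to whatever reads it.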

Why it runs in your VPC

Constructing the model means federating reads across every telemetry backend you run, each with its own credentials, against the rawest data you hold. The model and its snapshots are built and stored where that data already lives, inside your network, with read-only scoped credentials per source. Air-gapped and regulated environments run the same way, since nothing about the construction depends on egress.

The result is a current, derivation-aware map of the running system that several agents can read at once and reason over together, without any of them rebuilding topology mid-incident.