Agentic AI Workflows for SRE / DevOps

Several recent posts describe AI SRE solutions being deployed in a "fire and forget" manner, even for complex setups. One post that tested this idea comes from the ClickHouse team. They concluded that "Models are not ready" and that the models needed guidance throughout the process. But is this expecting too much during the initial deployment phase? Many posts fall into the trap of assuming that starting with path-finding is the best deployment strategy. We believe that less is more for MCP tooling, and that this tooling can provide the essential groundwork for Agentic AI workflows for SRE.

Data and Systems Discovery

The most important step is creating a map of what exists and how everything interacts. For us, this begins with discovering tables, namespaces, and components, and how services call each other. We use specialized "template queries" to guide the model accurately. Note that sampling can significantly limit how effective data discovery is.
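To make the idea of "template queries" concrete, here is a minimal sketch in Python. It is not our actual implementation: the template names, the `spans` table, and its columns (`service`, `span_id`, `parent_span_id`) are assumptions made up for illustration. The point is only that the model picks a named template and fills in a few validated parameters instead of writing free-form SQL.

```python
import re
from string import Template

# Hypothetical template catalogue. The spans table and its column names
# are illustrative assumptions, not a real schema.
TEMPLATES = {
    # Standard SQL information_schema lookup for tables in one schema.
    "list_tables": Template(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = '$schema'"
    ),
    # Service-to-service call edges derived from parent/child spans.
    "service_edges": Template(
        "SELECT parent.service AS caller, child.service AS callee, "
        "count(*) AS calls "
        "FROM $spans_table AS child "
        "JOIN $spans_table AS parent "
        "ON child.parent_span_id = parent.span_id "
        "WHERE parent.service <> child.service "
        "GROUP BY parent.service, child.service"
    ),
}

# Only plain identifiers are accepted as parameter values, so the model
# cannot smuggle arbitrary SQL into a template.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def render(name: str, **params: str) -> str:
    """Render a named template after validating every parameter value."""
    if name not in TEMPLATES:
        raise KeyError(f"unknown template: {name}")
    for key, value in params.items():
        if not _IDENTIFIER.match(value):
            raise ValueError(f"invalid value for {key!r}: {value!r}")
    # substitute() raises KeyError if a required placeholder is missing.
    return TEMPLATES[name].substitute(**params)


if __name__ == "__main__":
    print(render("list_tables", schema="default"))
    print(render("service_edges", spans_table="spans"))
```

Keeping the catalogue small follows the same "less is more" reasoning: a handful of well-understood templates is easier for a model to use correctly than an open-ended query tool.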

“Path-finding” Guidance

Deep-dive investigations should be conducted one at a time, in isolation. A common failure mode in many AI SRE solutions is getting stuck in the wrong context and overlooking important components and interactions. We improve this process by supplying path-finding expertise to the agent.
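One way to picture "one at a time, in isolation" is a loop that builds a fresh context for every investigation. The sketch below is an assumption-laden illustration, not our product code: `Investigation`, `run_isolated`, and the `investigate` callback are hypothetical names, and the callback stands in for whatever agent call performs the deep dive.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Investigation:
    """Fresh, per-component working context (hypothetical structure)."""
    component: str
    hints: list[str]  # path-finding guidance supplied to the agent
    notes: list[str] = field(default_factory=list)  # scratch space


def run_isolated(
    components: list[str],
    hints_for: dict[str, list[str]],
    investigate: Callable[[Investigation], str],
) -> dict[str, str]:
    """Run investigations sequentially, each with its own context."""
    findings: dict[str, str] = {}
    for component in components:
        # A new context per component: notes from an earlier deep dive
        # never leak into the next one.
        context = Investigation(
            component=component,
            hints=list(hints_for.get(component, [])),
        )
        findings[component] = investigate(context)
    return findings


if __name__ == "__main__":
    def stub_agent(ctx: Investigation) -> str:
        # Stand-in for a real agent call.
        ctx.notes.append(f"checked {ctx.component}")
        return f"{ctx.component}: {len(ctx.notes)} note(s), hints={ctx.hints}"

    result = run_isolated(
        ["checkout", "payments"],
        {"payments": ["start from upstream callers"]},
        stub_agent,
    )
    for line in result.values():
        print(line)
```

Only the returned finding survives each run; the scratch notes are discarded, which is what keeps a later investigation from inheriting the wrong context.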

Incident Documentation Format

Finally, summaries, findings, and trends must be documented in a consistent, structured format that future agent runs can read back.
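As a rough illustration of what such a format could look like, the sketch below stores a record as JSON and reads it back. The field names (`incident_id`, `summary`, `findings`, `trends`) and the choice of JSON are assumptions for the example, not a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class IncidentRecord:
    """Hypothetical incident record; field names are illustrative."""
    incident_id: str
    summary: str
    findings: list[str] = field(default_factory=list)
    trends: list[str] = field(default_factory=list)


def to_document(record: IncidentRecord) -> str:
    """Serialize a record to stable, human-readable JSON."""
    return json.dumps(asdict(record), sort_keys=True, indent=2)


def from_document(text: str) -> IncidentRecord:
    """Load a record written by to_document for a later agent run."""
    return IncidentRecord(**json.loads(text))


if __name__ == "__main__":
    record = IncidentRecord(
        incident_id="example-001",
        summary="Elevated latency on one service.",
        findings=["Latency rose after a configuration change."],
        trends=["Similar pattern seen in an earlier incident."],
    )
    document = to_document(record)
    print(document)
    assert from_document(document) == record
```

A fixed set of fields matters more than the serialization itself: a later run can load earlier records and compare findings and trends without re-parsing free text.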

An open, robust, and well-connected AI SRE

Our core belief, supported by our design partners, is that an AI SRE should first be liberated from rigid tools. This begins with using standard data formats like OpenTelemetry (OTel) and building strong tool layers with MCP.