MTTR (Mean Time to Resolution) is a crucial metric in observability, and one of the most dynamic. It’s also, if we were observing the observer, the one associated with the most suffering. There is an abundance of tooling that helps gather data and detect problems; current tooling, however, seems to focus largely on engagement within a platform rather than swiftness through it. To varying degrees, MTTR is aggravated by a few key areas:
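As a working definition (a minimal sketch; the incident records and timestamps below are illustrative assumptions, not a standard schema), MTTR is simply the average detection-to-resolution duration across incidents:

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, resolved_at) pairs.
incidents = [
    (datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 42)),
    (datetime(2024, 5, 3, 22, 15), datetime(2024, 5, 4, 1, 5)),
    (datetime(2024, 5, 9, 6, 30), datetime(2024, 5, 9, 6, 55)),
]

def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean time to resolution: average of (resolved - detected) across incidents."""
    total = sum(((resolved - detected) for detected, resolved in incidents), timedelta())
    return total / len(incidents)

print(mttr(incidents))  # 1:19:00 for the sample incidents above
```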
Time to Data/Repeated Time to Data
In frantic situations, the last thing observability users need is slow queries and an overall slow “time to data”. Here we define “Data” as the crucial information that assists an investigation. The slowness can take various forms:
Slow Queries from the Observability Backend: Metric, log, and trace queries that end in timeouts, errors, or overload from concurrency. This causes a repeating pattern of further querying, frustration, and loss of context.
Obstacles to “Data”: Sometimes, due to product opinions, “Data” is intentionally obscured to steer the user and force engagement. This is also immensely frustrating.
Malformed “Data”: Often, how the data is actually displayed deviates greatly from what was expected, also leading to enormous context loss.
Investigations usually compound these patterns, and individually small delays add up quickly.
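One way to make this compounding visible is to measure it directly. A minimal sketch, assuming a hypothetical run_query callable standing in for whichever backend client is in use: it wraps each query with a deadline and records latency and retry count, so slow and repeated “time to data” shows up as a number rather than a feeling.

```python
import time

QUERY_DEADLINE_SECONDS = 10.0  # assumed budget for "time to data"

def timed_query(run_query, expr: str, max_attempts: int = 3):
    """Run a backend query, recording latency and retries (repeated time to data)."""
    attempts = 0
    started = time.monotonic()
    while attempts < max_attempts:
        attempts += 1
        try:
            result = run_query(expr, timeout=QUERY_DEADLINE_SECONDS)
            print(f"time_to_data={time.monotonic() - started:.2f}s attempts={attempts} expr={expr!r}")
            return result
        except TimeoutError:
            # Each timeout tends to trigger another query and more context loss.
            continue
    print(f"time_to_data={time.monotonic() - started:.2f}s attempts={attempts} expr={expr!r} (gave up)")
    return None
```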
Data Structure Disjointedness
Inconsistency in data structures is another avenue of enormous frustration. Consider the range of log formats a single investigation may have to reconcile (a normalization sketch follows this list):
Deeply Nested “JSONic” Logs
System Components Logs (Apache, NGINX, etc)
Free-form Application Logs
Stack Traces
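A rough sketch of what reconciling these shapes can look like; the field names and the access-log pattern are simplified assumptions for illustration, not a complete parser:

```python
import json
import re

# Simplified pattern for an NGINX/Apache-style access log line (illustrative only).
ACCESS_LOG = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3})'
)

def normalize(line: str) -> dict:
    """Coerce one log line into a common {source, message, fields} record."""
    line = line.strip()
    # Deeply nested "JSONic" logs: keep the parsed body as-is under "fields".
    if line.startswith("{"):
        try:
            body = json.loads(line)
            return {"source": "json", "message": body.get("message", ""), "fields": body}
        except json.JSONDecodeError:
            pass
    # System component logs (NGINX, Apache, ...).
    m = ACCESS_LOG.match(line)
    if m:
        return {"source": "access", "message": m.group("request"), "fields": m.groupdict()}
    # Free-form application logs and stack-trace lines fall through unparsed.
    return {"source": "freeform", "message": line, "fields": {}}

print(normalize('{"level": "error", "message": "payment failed", "ctx": {"order": 42}}'))
print(normalize('203.0.113.7 - - [01/May/2024:10:00:00 +0000] "GET /api/pay HTTP/1.1" 502'))
print(normalize("Traceback (most recent call last):"))
```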
Furthermore, if observability initiatives embrace tracing, spans detached because of missing trace IDs (dropped propagation headers) can leave gaps in observability across microservice environments.
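The detachment usually traces back to a propagation header being dropped on some hop. As a minimal sketch of what each hop must forward (using the W3C Trace Context traceparent format; the helper name and header set here are hypothetical):

```python
import secrets

def traceparent_header(trace_id: str | None = None) -> tuple[str, str]:
    """Build a W3C Trace Context header, reusing an existing trace ID if one is present."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = secrets.token_hex(8)                # in a real tracer, the current span's ID
    return "traceparent", f"00-{trace_id}-{span_id}-01"

# When service A calls service B, this header is what keeps B's spans attached to A's
# trace; any hop that strips or forgets it detaches everything downstream.
name, value = traceparent_header()
outbound_headers = {name: value, "content-type": "application/json"}
print(outbound_headers)
```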
All of this results in “Data” that is less “joinable” and does not point to the causative factors of an incident.
Cognitive Load of Noise and Context Shifting
For the humans involved, cognitive and emotional strain piles up during active incidents and system failures, yet tooling rarely accounts for this in its design. Noise, in the form of extraneous or simply excessive “Data”, does not fit the narrowing perception budget of the observability practitioner. This is further aggravated when UIs built on entirely different design philosophies must be used side by side during incident analysis.
Minimal Engagement/Communication Loss
Finally, gaps in communication, or observability simply not being used, are a significant problem. Very often, visualization is configured without considering that human operators will not be watching constantly. Unfortunately, alerting configuration tends to be far less intuitive than its graphing counterpart, and unreliable or ignored communication channels eventually culminate in a full-blown panic event.
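One mitigation is to derive alerts from the same expressions that drive the graphs, so coverage does not depend on someone watching a dashboard. A hypothetical sketch, where query_backend and notify stand in for whatever client and channel exist:

```python
# Same expression the dashboard panel graphs (assumed, PromQL-style for illustration).
ERROR_RATE_EXPR = "rate(http_requests_errors_total[5m])"

def evaluate_alert(query_backend, notify, threshold: float = 0.05) -> None:
    """Evaluate the dashboard's own query as an alert rule and page if it breaches."""
    value = query_backend(ERROR_RATE_EXPR)
    if value > threshold:
        # The channel itself must be reliable and acknowledged, or this becomes more noise.
        notify(f"error rate {value:.2%} exceeds {threshold:.2%} ({ERROR_RATE_EXPR})")
```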
SRE-Experience Metric
Looking at these frequent obstacles to incident resolution, tooling must start to consider metrics for emotional fatigue and general organizational chaos. Only then can tooling accurately comprehend what observability users endure during these events.
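What such a metric would be built from is an open question. Purely as a hypothetical sketch, the signals already discussed above (repeated queries, tool switches, alert noise) could be rolled into a per-incident score; the field names and weights below are arbitrary assumptions, not a proposed standard:

```python
from dataclasses import dataclass

@dataclass
class IncidentExperience:
    repeated_queries: int  # re-issued queries after timeouts/errors (time to data)
    tool_switches: int     # jumps between differently designed UIs (context shifting)
    noisy_alerts: int      # alerts fired that did not relate to the incident

def sre_experience_score(e: IncidentExperience) -> float:
    """Hypothetical composite: higher means a rougher incident for the humans involved."""
    return 1.0 * e.repeated_queries + 2.0 * e.tool_switches + 0.5 * e.noisy_alerts

print(sre_experience_score(IncidentExperience(repeated_queries=12, tool_switches=7, noisy_alerts=30)))
# -> 41.0
```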