<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[OMLET: Open Metrics Logs Events Traces]]></title><description><![CDATA[OMLET: Open Metrics Logs Events Traces]]></description><link>https://blog.omlet.co</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1715226938326/tvtCAZhGe.jpg</url><title>OMLET: Open Metrics Logs Events Traces</title><link>https://blog.omlet.co</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 04:26:32 GMT</lastBuildDate><atom:link href="https://blog.omlet.co/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Minimal vulnerability and easy OpenClaw with Omlet OS]]></title><description><![CDATA[OpenClaw's role as a constantly running agent that performs various encoded tasks has been exciting. Its focus on prioritizing communication channels and its strong emphasis on sessions are particularly noteworthy. However, security concerns have inc...]]></description><link>https://blog.omlet.co/minimal-vulnerability-and-easy-openclaw-with-omlet-os</link><guid isPermaLink="true">https://blog.omlet.co/minimal-vulnerability-and-easy-openclaw-with-omlet-os</guid><category><![CDATA[openclaw]]></category><category><![CDATA[AI]]></category><category><![CDATA[ai agents]]></category><dc:creator><![CDATA[Praneet Sharma]]></dc:creator><pubDate>Tue, 17 Feb 2026 04:17:46 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771299601033/613132e3-0570-4e8a-9495-ea3d3cce2355.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>OpenClaw's role as a constantly running agent that performs various encoded tasks has been exciting. Its focus on prioritizing communication channels and its strong emphasis on sessions are particularly noteworthy. 
However, security concerns have increased along with the hype.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771299306873/97e5a516-eeba-46b1-9f35-bbc94e0c88e8.png" alt class="image--center mx-auto" /></p>
<p>Omlet Agent is similar to OpenClaw but takes a different approach:</p>
<ol>
<li><p>Can we provide "code-like" automation in a cleaner interface that organizes sessions like traditional directories?</p>
</li>
<li><p>Can we offer advanced views, cron-like tasks, and dynamic tool usage already built into coding CLIs?</p>
</li>
<li><p>Can we achieve a potentially zero CVE deployment in a single container?</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771300139534/68e0d5c2-8fcf-4375-b0e8-93ccb0e5bc6a.png" alt class="image--center mx-auto" /></p>
<p><strong>Directory View</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771300210706/4d48b60d-7f5e-4500-80bd-975fd64f2735.png" alt class="image--center mx-auto" /></p>
<p><strong>CLI View</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771300311986/b6c3a436-6d12-4749-932d-0bbf29367045.png" alt class="image--center mx-auto" /></p>
<p><strong>Task System</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771300326526/1ebca480-19e4-49a1-9f59-aa8c0e0f3461.png" alt class="image--center mx-auto" /></p>
<p><strong>Styled HTML Views</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771300382120/ee7f3319-262a-441e-8144-e351a0c0022f.png" alt class="image--center mx-auto" /></p>
<p><strong>Shareable Links</strong></p>
<p>Most importantly, an autonomous agent must be secure. Strong authentication measures are essential, and we use Auth0 for this purpose. Additionally, can we achieve the ideal scenario of having zero CVEs? Fortunately, by stripping many components from the standard <code>node:20-slim</code> image, using a multi-stage build, and heavily pruning packages, we significantly reduced CVEs (see the Trivy scan below)!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771300989269/d339cffc-c805-4708-9b1c-1b5cba161711.png" alt class="image--center mx-auto" /></p>
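<p>The general shape of such a build, as a rough sketch rather than our actual Dockerfile (image tags, paths, and the distroless runtime base are illustrative assumptions), looks like this: a full toolchain in the build stage, and only the built artifacts plus pruned production dependencies copied into a minimal runtime image.</p>
<pre><code class="lang-dockerfile"># Hypothetical sketch: build stage carries the full toolchain
FROM node:20-slim AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build &amp;&amp; npm prune --omit=dev

# Runtime stage: only the artifacts, no package manager or shell
FROM gcr.io/distroless/nodejs20-debian12
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
CMD ["dist/server.js"]
</code></pre>
<p>Because the runtime stage never installs packages, most OS-level CVE findings simply have nothing to attach to.</p>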
]]></content:encoded></item><item><title><![CDATA[AI Observability Agents as "Platform Experts"]]></title><description><![CDATA[As AI agents flood the enterprise software landscape, companies must integrate their existing data hubs. In the data analytics world, this can include platforms like Snowflake. In CRM / front-office land, platforms like Salesforce. In Observability, ...]]></description><link>https://blog.omlet.co/ai-observability-agents-as-platform-experts</link><guid isPermaLink="true">https://blog.omlet.co/ai-observability-agents-as-platform-experts</guid><category><![CDATA[SRE]]></category><category><![CDATA[observability]]></category><category><![CDATA[Datadog]]></category><dc:creator><![CDATA[Praneet Sharma]]></dc:creator><pubDate>Mon, 01 Dec 2025 04:39:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1764302536646/84f24f40-a396-4fde-88d2-75750374dc6b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As AI agents flood the enterprise software landscape, companies must integrate their existing data hubs. In the data analytics world, this can include platforms like Snowflake. In CRM / front-office land, platforms like Salesforce. In Observability, names like Datadog and Splunk. As these platforms integrate their own AI agents, there is a parallel movement that relies on API integrations (and now MCP). Philosophically, we can think of the latter agents as having to “onboard” in their own way like new employees would. This path led us to build <strong>our very own “Datadog expert”</strong> that can work alongside the internal AI agent (“Bits”) within the platform.</p>
<h2 id="heading-it-starts-with-knowledge-gathering">It starts with knowledge gathering</h2>
<p>The first thing a user typically does is learn about the platform's dynamics, data structures, and goals by looking for relevant documentation. This includes query languages, UI mechanics, APIs, and more. For our expert, we had our agent thoroughly explore platform documentation and indexed knowledge. Our agent then organizes this information into clear markdown and creates visual diagrams for future reference (as shown below).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764302685276/dab90383-693b-4090-a6ff-9400a386fa7e.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764302692142/b24a1fe1-42ce-4db0-bd2e-d33b269b2270.png" alt class="image--center mx-auto" /></p>
<p>With a grounded knowledge base, our agent can effectively explore and play around with platform capabilities.</p>
<h2 id="heading-it-progresses-to-deterministic-tools">It progresses to deterministic tools</h2>
<p>Once our agent understands "how to access" a platform, using credentials or OAuth just like a user would, it can move on to creating precise tools. While MCP is a good general platform for building and distributing tools, we can also rely on OS first principles and let the agent execute code directly.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764305257737/faee6754-2036-4a24-bdb2-3f3ec8b608cf.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764305267043/f85e87d6-ae1a-4273-bca8-61dd7dccfd96.png" alt class="image--center mx-auto" /></p>
<p>In the example above, we demonstrate one tool. Notice the importance of precision to help the agent recover from errors and its own mistakes. We can also create highly specific tools that follow a certain user structure (such as querying only this namespace or using these filters). We find particular value in these structured implementations, as they provide a strong framework around dynamic query languages.</p>
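<p>As a minimal sketch of what such a structured tool can look like (the endpoint shape matches Datadog's public <code>/api/v1/query</code> metrics API, but the allowed-metric set and function are hypothetical), the tool accepts only a few whitelisted parameters and expands them into a full query, so the agent never composes raw query strings itself:</p>
<pre><code class="lang-python"># Hypothetical "structured" tool: the agent picks from a fixed menu of
# metrics and a service name; the query shape is owned by the tool.
import time

ALLOWED_METRICS = {"trace.http.request.duration", "system.cpu.user"}

def build_metric_query(metric: str, service: str, minutes: int = 15) -> dict:
    """Return request parameters for Datadog's /api/v1/query endpoint."""
    if metric not in ALLOWED_METRICS:
        raise ValueError(f"metric {metric!r} is not in the allowed set")
    now = int(time.time())
    return {
        "from": now - minutes * 60,
        "to": now,
        # the filter scope is fixed by the tool, not chosen by the model
        "query": f"avg:{metric}{{service:{service}}}",
    }

params = build_metric_query("system.cpu.user", "checkout")
# params can then be sent with requests.get(..., headers={"DD-API-KEY": ...})
</code></pre>
<p>The validation step is what lets the agent recover cleanly: a bad input produces a precise error message instead of a confusing API failure.</p>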
<h2 id="heading-it-culminates-in-sophisticated-processes">It culminates in sophisticated processes</h2>
<p>Finally, what if we link all these precise tools together while allowing for dynamic exploration and model analysis? This is where we can develop systematic processes to achieve a goal within the platform. This could involve a detailed investigation or a comprehensive overview of trends.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764307410626/36934c89-a187-406c-bede-6aecd739cefd.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764307416958/920943e1-ee79-4deb-bece-4a015736290a.png" alt class="image--center mx-auto" /></p>
<p>Our main goal, using targeted tools and advanced workflows, is to let observability data move beyond the limits of generic presentation and use. That's why, within our Agent OS platform, these components can be tailored to fit internal domain knowledge <strong>while also working with other Agents.</strong> Imagine one agent gathering information and skillfully querying, while another performs analysis and assesses multiple hypotheses.</p>
]]></content:encoded></item><item><title><![CDATA[Agentic AI Workflows for SRE / DevOps]]></title><description><![CDATA[There have been several posts about AI SRE solutions being deployed in a "fire and forget" manner, even for complex setups. One post that tested this idea is from the Clickhouse team. They concluded that "Models are not ready" and needed guidance thr...]]></description><link>https://blog.omlet.co/agentic-ai-workflows-for-sre-devops</link><guid isPermaLink="true">https://blog.omlet.co/agentic-ai-workflows-for-sre-devops</guid><category><![CDATA[SRE]]></category><category><![CDATA[Devops]]></category><category><![CDATA[agentic AI]]></category><dc:creator><![CDATA[Praneet Sharma]]></dc:creator><pubDate>Mon, 25 Aug 2025 13:43:57 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755537097962/bafbd2f8-ea0c-4039-8d40-44078b4d6513.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There have been several posts about AI SRE solutions being deployed in a "fire and forget" manner, even for complex setups. One post that tested this idea is from the <a target="_blank" href="https://clickhouse.com/blog/llm-observability-challenge">Clickhouse</a> team. They concluded that "<em>Models are not ready</em>" and needed guidance throughout the process. But is this expecting too much during the initial deployment phase? Many posts fall into the trap of thinking that starting with path-finding is the best deployment strategy. We believe that <a target="_blank" href="https://blog.omlet.co/omlet-mcp">less is more</a> for MCP tooling and that this tooling can provide the essential groundwork for Agentic AI workflows for SRE.</p>
<h2 id="heading-data-and-systems-discovery">Data and Systems Discovery</h2>
<p>The most important step is creating a map of what exists and how everything interacts. For us, this begins with discovering tables, namespaces, components, and how services call each other. We use specialized "template queries" to guide the model accurately. <strong>Sampling can significantly limit the ability to effectively perform data discovery.</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755556502603/cacd6eb6-1786-4775-83ee-31713977de21.png" alt class="image--center mx-auto" /></p>
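<p>As a hedged illustration of the idea (the table and column names assume an OTel-style ClickHouse logs table; the template itself is illustrative, not our production query), a template query exposes exactly one free parameter to the model while fixing the query shape:</p>
<pre><code class="lang-python"># Hypothetical "template query" for service discovery: the model
# supplies only the time window; everything else is pinned down.
DISCOVER_SERVICES = """
SELECT ServiceName, count() AS events
FROM otel_logs
WHERE Timestamp &gt;= now() - INTERVAL {hours} HOUR
GROUP BY ServiceName
ORDER BY events DESC
LIMIT 50
"""

def render_discovery_query(hours: int = 24) -> str:
    # validate the single free parameter before substitution
    if hours not in range(1, 169):
        raise ValueError("hours must be between 1 and 168")
    return DISCOVER_SERVICES.format(hours=hours)
</code></pre>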
<h2 id="heading-path-finding-guidance">“Path-finding” Guidance</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755538998312/81cae3bf-f2fe-4616-9280-144657bcfe6a.png" alt class="image--center mx-auto" /></p>
<p>Deep dive investigations should be conducted one at a time, in isolation. A common mistake many AI SRE solutions make is getting stuck in the wrong context and overlooking important components and interactions. We enhance this process by offering path-finding expertise to the agent.</p>
<h2 id="heading-incident-documentation-format">Incident Documentation Format</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755556517406/c254fa0c-e18b-4b4e-b3eb-4b3c7e1e1a65.png" alt class="image--center mx-auto" /></p>
<p>Finally, summaries, findings, and trends must be documented in a proper format for future agent runs.</p>
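<p>A fixed format is what makes the documents machine-consumable on the next run. As a minimal sketch (the section names are our own illustration, not a standard), a renderer with pinned sections might look like:</p>
<pre><code class="lang-python"># Hypothetical fixed incident-report format: stable section headings
# let a future agent run parse findings back out reliably.
def render_incident_report(title, findings, trends, next_steps):
    lines = [f"# Incident: {title}", "", "## Findings"]
    lines += [f"- {f}" for f in findings]
    lines += ["", "## Trends"]
    lines += [f"- {t}" for t in trends]
    lines += ["", "## Next steps"]
    lines += [f"- {s}" for s in next_steps]
    return "\n".join(lines)
</code></pre>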
<h2 id="heading-an-open-robust-and-well-connected-ai-sre">An open, robust, and well-connected AI SRE</h2>
<p>Our core belief, supported by our design partners, is that an AI SRE should first be liberated from rigid tools. This begins by using standard data formats like OTel and creating strong tool layers with MCP.</p>
]]></content:encoded></item><item><title><![CDATA[Omlet MCP]]></title><description><![CDATA[A while ago, we explored different LLM use cases through “Skillet”. During our experiments, we quickly realized the advantages of using tools, precise integration with observability data sources, and sparse tokenization. The ecosystem has been evolvi...]]></description><link>https://blog.omlet.co/omlet-mcp</link><guid isPermaLink="true">https://blog.omlet.co/omlet-mcp</guid><category><![CDATA[mcp]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[observability]]></category><category><![CDATA[OpenTelemetry]]></category><dc:creator><![CDATA[Praneet Sharma]]></dc:creator><pubDate>Fri, 18 Jul 2025 04:08:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1752777956406/83a9ed7e-1333-4d0d-a92b-9d1e33e3cb64.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A while ago, we explored different LLM use cases through <a target="_blank" href="https://blog.omlet.co/skillet-our-ai-agent-sre">“Skillet”</a>. During our experiments, we quickly realized the advantages of using tools, precise integration with observability data sources, and sparse tokenization. The ecosystem has been evolving rapidly, and we've seen MCP emerge as a protocol for flexible tool use. Fortunately, we were able to apply our existing knowledge to quickly move into the Agentic AI Era.</p>
<h2 id="heading-exploring-maximas-in-the-ai-space">Exploring Maxima in the AI Space</h2>
<p><img src="https://lh7-rt.googleusercontent.com/slidesz/AGV_vUfDvBG-yPboe5AeMMhCidxl-PYisy4Mc3DOsbDsi3AjykjwW-qq1cdVjlU4V_gTFviY4MGiXTdgGBzVtbujStH847j-w9RAwOYWfz3IzPB3tq4AWkxomdLKhBErFd-91VxoUYAuOQ=s2048?key=4XvP4gyldMEyPBNMzRzg6A" alt /></p>
<p>Sometimes, it's beneficial to wait and see how different parts of the ecosystem, especially in AI, develop. The image above shows a pattern that recurs across the stages of the current AI phase, and partly across past AI phases. At different times, the industry has claimed that true automation is finally here and that AI will solve all the problems that trouble enterprises. However, this has often fallen short due to:</p>
<ol>
<li><p><strong>Integration challenges</strong></p>
</li>
<li><p><strong>Inconsistent usage</strong></p>
</li>
<li><p><strong>Inaccuracy and scalability issues from a hardware perspective</strong></p>
</li>
</ol>
<p>I believe we are seeing some signs of a "global" peak in the usefulness of AI for data analysis.</p>
<h2 id="heading-less-is-more-for-mcp-tools">Less is more for MCP tools</h2>
<p>We quickly realized that humans are significantly affected by context switching and having "too many tools." From our past experience with LLMs and observability data, we also knew that using different payloads could confuse the LLM. With this in mind, we focused on the essentials for our MCP server:</p>
<ol>
<li><p><strong>Read data from OTel formatted tables</strong></p>
</li>
<li><p><strong>Read data and cluster</strong></p>
</li>
</ol>
<p>This "confinement" allows a reasoning model with tool-use abilities to focus on the basics without getting distracted.</p>
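<p>To make the second tool concrete, here is a hedged sketch of what "read data and cluster" can mean in its simplest form (the masking rules and function names are hypothetical; a production version would cluster more carefully): group log bodies by a normalized template so the model sees a handful of patterns instead of thousands of raw lines.</p>
<pre><code class="lang-python"># Hypothetical sketch of the "read data and cluster" tool.
import re
from collections import Counter

def template_of(body: str) -> str:
    body = re.sub(r"\b[0-9a-f]{16,}\b", "{id}", body)  # mask long hex ids (e.g. trace ids)
    body = re.sub(r"\d+", "{n}", body)                  # mask numbers
    return body

def cluster_logs(bodies):
    # most common templates first: a compact, token-sparse summary
    return Counter(template_of(b) for b in bodies).most_common()

clusters = cluster_logs([
    "user 42 logged in",
    "user 7 logged in",
    "timeout after 30s",
])
# clusters[0] == ("user {n} logged in", 2)
</code></pre>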
<h2 id="heading-whats-next">What’s next?</h2>
<p>As we mentioned, things are moving quickly, but we can use this growing knowledge to adapt fast. The guiding principle should always be: Are we helping engineering, architecture, and ops teams solve issues quickly and consistently? Are we assisting teams in resolving issues when they're unavailable or need breaks? Are we doing this throughout the organization?</p>
]]></content:encoded></item><item><title><![CDATA[Data Lakehouse for "Big" Observability]]></title><description><![CDATA[Usually, a single-tenant or isolated multi-tenant approach is sufficient to meet the observability needs of startups and enterprises, respectively. Aside from running distributed queries across tenants, is there a graceful approach to storing data fr...]]></description><link>https://blog.omlet.co/data-lakehouse-for-big-observability</link><guid isPermaLink="true">https://blog.omlet.co/data-lakehouse-for-big-observability</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><category><![CDATA[Data-lake]]></category><dc:creator><![CDATA[Praneet Sharma]]></dc:creator><pubDate>Tue, 27 May 2025 13:59:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748119914165/e1924c0e-3ae8-462d-85dd-16968c969dc8.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Usually, a single-tenant or isolated multi-tenant approach is sufficient to meet the observability needs of startups and enterprises, respectively. Aside from running distributed queries across tenants, is there a graceful approach to storing data from all tenants for very long periods of time? Can this be stored in a cost-effective, nicely partitioned, and compressed way?</p>
<h2 id="heading-object-storage">Object Storage</h2>
<p>Object storage represents a "bottomless" and horizontally concurrent way to store batches of data, such as <code>otlp_json</code>. The <code>awss3exporter</code> within the OTel Collector Contrib provides a great way to batch write to S3.</p>
<p>An example config:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">receivers:</span>
  <span class="hljs-attr">otlp:</span>
    <span class="hljs-attr">protocols:</span>
      <span class="hljs-attr">http:</span>
        <span class="hljs-attr">endpoint:</span> <span class="hljs-string">'0.0.0.0:4318'</span>

<span class="hljs-attr">processors:</span>
  <span class="hljs-attr">batch:</span>

<span class="hljs-attr">exporters:</span>
  <span class="hljs-attr">awss3:</span>
    <span class="hljs-attr">s3uploader:</span>
      <span class="hljs-attr">region:</span> <span class="hljs-string">"us-west-2"</span>
      <span class="hljs-attr">s3_bucket:</span> <span class="hljs-string">"omlet-logs"</span>
      <span class="hljs-comment"># root “directory” inside the bucket; you can change "logs" to whatever prefix you like</span>
      <span class="hljs-attr">s3_prefix:</span> <span class="hljs-string">"logs"</span>
      <span class="hljs-comment"># partition by year/month/day/hour/minute</span>
      <span class="hljs-attr">s3_partition_format:</span> <span class="hljs-string">"year=%Y/month=%m/day=%d/hour=%H/minute=%M"</span>
      <span class="hljs-comment"># gzip compression is optional</span>
      <span class="hljs-attr">compression:</span> <span class="hljs-string">gzip</span>
    <span class="hljs-attr">sending_queue:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">num_consumers:</span> <span class="hljs-number">5</span>
      <span class="hljs-attr">queue_size:</span> <span class="hljs-number">50</span>
    <span class="hljs-attr">timeout:</span> <span class="hljs-string">10s</span>
  <span class="hljs-attr">debug:</span>

<span class="hljs-attr">service:</span>
  <span class="hljs-attr">pipelines:</span>
    <span class="hljs-attr">logs:</span>
      <span class="hljs-attr">receivers:</span> [<span class="hljs-string">otlp</span>]
      <span class="hljs-attr">processors:</span> [<span class="hljs-string">batch</span>]
      <span class="hljs-attr">exporters:</span> [<span class="hljs-string">debug</span>, <span class="hljs-string">awss3</span>]
</code></pre>
<p>The above configuration takes OTLP data (in this example, only logs) and batches it to an S3 bucket. While straightforward, this setup unlocks <strong>major capabilities.</strong></p>
<h2 id="heading-lakehouse-db">Lakehouse DB</h2>
<p>OLAP databases like DuckDB and Clickhouse (among others) can easily connect to this storage system. We use CHDB (embedded Clickhouse) to access this object storage.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> gzip
<span class="hljs-keyword">import</span> logging
<span class="hljs-keyword">import</span> boto3
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime, timedelta
<span class="hljs-keyword">import</span> chdb
<span class="hljs-keyword">from</span> itertools <span class="hljs-keyword">import</span> islice
<span class="hljs-keyword">from</span> concurrent.futures <span class="hljs-keyword">import</span> ThreadPoolExecutor, as_completed

<span class="hljs-comment"># ───────────────────────────────</span>
<span class="hljs-comment"># Configure logging</span>
<span class="hljs-comment"># ───────────────────────────────</span>
logging.basicConfig(
    level=logging.INFO,
    format=<span class="hljs-string">"%(asctime)s %(levelname)s %(message)s"</span>,
    datefmt=<span class="hljs-string">"%Y-%m-%d %H:%M:%S"</span>,
)
log = logging.getLogger(<span class="hljs-string">"otel_ingest"</span>)

<span class="hljs-comment"># ───────────────────────────────</span>
<span class="hljs-comment"># AWS creds</span>
<span class="hljs-comment"># ───────────────────────────────</span>
<span class="hljs-keyword">assert</span> os.environ.get(<span class="hljs-string">"AWS_ACCESS_KEY_ID"</span>), <span class="hljs-string">"Missing AWS_ACCESS_KEY_ID"</span>
<span class="hljs-keyword">assert</span> os.environ.get(<span class="hljs-string">"AWS_SECRET_ACCESS_KEY"</span>), <span class="hljs-string">"Missing AWS_SECRET_ACCESS_KEY"</span>
log.info(<span class="hljs-string">"AWS credentials loaded"</span>)

<span class="hljs-comment"># ───────────────────────────────</span>
<span class="hljs-comment"># List S3 files for last day</span>
<span class="hljs-comment"># ───────────────────────────────</span>
s3 = boto3.client(<span class="hljs-string">"s3"</span>, region_name=<span class="hljs-string">"us-west-2"</span>)
now = datetime.utcnow()
start = now - timedelta(days=<span class="hljs-number">1</span>)
paginator = s3.get_paginator(<span class="hljs-string">"list_objects_v2"</span>)
pages = paginator.paginate(Bucket=<span class="hljs-string">"omlet-logs"</span>, Prefix=<span class="hljs-string">"logs/"</span>)

all_keys = [
    obj[<span class="hljs-string">"Key"</span>]
    <span class="hljs-keyword">for</span> page <span class="hljs-keyword">in</span> pages
    <span class="hljs-keyword">for</span> obj <span class="hljs-keyword">in</span> page.get(<span class="hljs-string">"Contents"</span>, [])
    <span class="hljs-keyword">if</span> start &lt;= obj[<span class="hljs-string">"LastModified"</span>].replace(tzinfo=<span class="hljs-literal">None</span>) &lt;= now <span class="hljs-keyword">and</span> obj[<span class="hljs-string">"Key"</span>].endswith(<span class="hljs-string">".json.gz"</span>)
]
log.info(<span class="hljs-string">f"Total files found: <span class="hljs-subst">{len(all_keys)}</span>"</span>)
<span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> all_keys:
    exit(<span class="hljs-number">0</span>)

<span class="hljs-comment"># ───────────────────────────────</span>
<span class="hljs-comment"># Create target table</span>
<span class="hljs-comment"># ───────────────────────────────</span>
conn = chdb.connect()
cur = conn.cursor()
cur.execute(<span class="hljs-string">"""
CREATE TABLE IF NOT EXISTS otel_logs (
    Timestamp           DateTime64(9),
    TraceId             String,
    SpanId              String,
    TraceFlags          UInt8,
    SeverityText        String,
    SeverityNumber      UInt8,
    ServiceName         String,
    Body                String,
    ResourceAttributes  Map(String, String),
    ScopeAttributes     Map(String, String),
    LogAttributes       Map(String, String)
) ENGINE = Memory
"""</span>)
log.info(<span class="hljs-string">"Table otel_logs ready"</span>)

<span class="hljs-comment"># ───────────────────────────────</span>
<span class="hljs-comment"># Helper to process a single file → list of dicts</span>
<span class="hljs-comment"># ───────────────────────────────</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">parse_file</span>(<span class="hljs-params">key</span>):</span>
    log.info(<span class="hljs-string">f"Starting download of <span class="hljs-subst">{key}</span>"</span>)
    blob = s3.get_object(Bucket=<span class="hljs-string">"omlet-logs"</span>, Key=key)[<span class="hljs-string">"Body"</span>].read()
    log.info(<span class="hljs-string">f"Downloaded <span class="hljs-subst">{key}</span> (<span class="hljs-subst">{len(blob)}</span> bytes)"</span>)
    payload = json.loads(gzip.decompress(blob))
    entries = payload.get(<span class="hljs-string">"resourceLogs"</span>, [])
    log.info(<span class="hljs-string">f"Parsed JSON for <span class="hljs-subst">{key}</span>, found <span class="hljs-subst">{len(entries)}</span> resourceLogs entries"</span>)

    out = []
    <span class="hljs-keyword">for</span> r <span class="hljs-keyword">in</span> entries:
        res_attrs = {a[<span class="hljs-string">"key"</span>]: a[<span class="hljs-string">"value"</span>][<span class="hljs-string">"stringValue"</span>] <span class="hljs-keyword">for</span> a <span class="hljs-keyword">in</span> r[<span class="hljs-string">"resource"</span>][<span class="hljs-string">"attributes"</span>]}
        <span class="hljs-keyword">for</span> sl <span class="hljs-keyword">in</span> r[<span class="hljs-string">"scopeLogs"</span>]:
            scope_attrs = {a[<span class="hljs-string">"key"</span>]: a[<span class="hljs-string">"value"</span>][<span class="hljs-string">"stringValue"</span>] <span class="hljs-keyword">for</span> a <span class="hljs-keyword">in</span> sl[<span class="hljs-string">"scope"</span>].get(<span class="hljs-string">"attributes"</span>, [])}
            service = res_attrs.get(<span class="hljs-string">"service.name"</span>, <span class="hljs-string">""</span>)  <span class="hljs-comment"># service.name is a resource attribute per OTel semconv</span>
            <span class="hljs-keyword">for</span> logrec <span class="hljs-keyword">in</span> sl[<span class="hljs-string">"logRecords"</span>]:
                ts = datetime.utcfromtimestamp(int(logrec[<span class="hljs-string">"timeUnixNano"</span>]) / <span class="hljs-number">1e9</span>).isoformat(sep=<span class="hljs-string">' '</span>)
                rec = {
                    <span class="hljs-string">"Timestamp"</span>: ts,
                    <span class="hljs-string">"TraceId"</span>: logrec.get(<span class="hljs-string">"traceId"</span>, <span class="hljs-string">""</span>),
                    <span class="hljs-string">"SpanId"</span>: logrec.get(<span class="hljs-string">"spanId"</span>, <span class="hljs-string">""</span>),
                    <span class="hljs-string">"TraceFlags"</span>: logrec.get(<span class="hljs-string">"traceFlags"</span>, <span class="hljs-number">0</span>),
                    <span class="hljs-string">"SeverityText"</span>: logrec.get(<span class="hljs-string">"severityText"</span>, <span class="hljs-string">""</span>),
                    <span class="hljs-string">"SeverityNumber"</span>: logrec.get(<span class="hljs-string">"severityNumber"</span>, <span class="hljs-number">0</span>),
                    <span class="hljs-string">"ServiceName"</span>: service,
                    <span class="hljs-string">"Body"</span>: logrec[<span class="hljs-string">"body"</span>][<span class="hljs-string">"stringValue"</span>],
                    <span class="hljs-string">"ResourceAttributes"</span>: res_attrs,
                    <span class="hljs-string">"ScopeAttributes"</span>: scope_attrs,
                    <span class="hljs-string">"LogAttributes"</span>: {a[<span class="hljs-string">"key"</span>]: a[<span class="hljs-string">"value"</span>][<span class="hljs-string">"stringValue"</span>] <span class="hljs-keyword">for</span> a <span class="hljs-keyword">in</span> logrec.get(<span class="hljs-string">"attributes"</span>, [])}
                }
                out.append(rec)
    log.info(<span class="hljs-string">f"Processed <span class="hljs-subst">{key}</span>, constructed <span class="hljs-subst">{len(out)}</span> records"</span>)
    <span class="hljs-keyword">return</span> out

<span class="hljs-comment"># ───────────────────────────────</span>
<span class="hljs-comment"># Chunking utilities</span>
<span class="hljs-comment"># ───────────────────────────────</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">chunked</span>(<span class="hljs-params">iterable, size</span>):</span>
    it = iter(iterable)
    <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
        batch = list(islice(it, size))
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> batch:
            <span class="hljs-keyword">break</span>
        <span class="hljs-keyword">yield</span> batch

batch_size = <span class="hljs-number">100</span>
row_batch_size = <span class="hljs-number">100000</span>  <span class="hljs-comment"># JSONEachRow batch size</span>
total = <span class="hljs-number">0</span>

<span class="hljs-comment"># ───────────────────────────────</span>
<span class="hljs-comment"># Main loop: batches of files</span>
<span class="hljs-comment"># ───────────────────────────────</span>
<span class="hljs-keyword">for</span> batch_keys <span class="hljs-keyword">in</span> chunked(all_keys, batch_size):
    log.info(<span class="hljs-string">f"Processing batch of <span class="hljs-subst">{len(batch_keys)}</span> files"</span>)
    <span class="hljs-comment"># parallel parse → list of dicts</span>
    rows_accum = []
    <span class="hljs-keyword">with</span> ThreadPoolExecutor(max_workers=<span class="hljs-number">8</span>) <span class="hljs-keyword">as</span> ex:
        futures = {ex.submit(parse_file, key): key <span class="hljs-keyword">for</span> key <span class="hljs-keyword">in</span> batch_keys}
        <span class="hljs-keyword">for</span> fut <span class="hljs-keyword">in</span> as_completed(futures):
            rows_accum.extend(fut.result())

    <span class="hljs-comment"># bulk insert in JSONEachRow chunks</span>
    <span class="hljs-keyword">for</span> chunk <span class="hljs-keyword">in</span> chunked(rows_accum, row_batch_size):
        payload = <span class="hljs-string">"\n"</span>.join(json.dumps(r, default=str) <span class="hljs-keyword">for</span> r <span class="hljs-keyword">in</span> chunk)
        sql = <span class="hljs-string">"INSERT INTO otel_logs FORMAT JSONEachRow\n"</span> + payload
        cur.execute(sql)
        log.info(<span class="hljs-string">f"Inserted JSONEachRow batch of <span class="hljs-subst">{len(chunk)}</span> rows"</span>)
        total += len(chunk)

log.info(<span class="hljs-string">f"Total rows inserted: <span class="hljs-subst">{total}</span>"</span>)

<span class="hljs-comment"># Example select query</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
df = pd.read_sql(<span class="hljs-string">"SELECT SeverityText, count() AS cnt FROM otel_logs GROUP BY SeverityText"</span>, conn)
log.info(<span class="hljs-string">"Counts by severity:\n"</span> + df.to_string(index=<span class="hljs-literal">False</span>))

conn.close()
log.info(<span class="hljs-string">"Done."</span>)
</code></pre>
<p>In the code above, we load a series of compressed (gzip) <code>otlp_json</code> files from a specific time window (the past day), create an in-memory table, insert rows in batches, and query for counts by severity level.</p>
<h2 id="heading-going-wide">Going “Wide”</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748130920062/c929cd99-eafb-489e-a371-9a3661689fa1.png" alt class="image--center mx-auto" /></p>
<p>The code above shows a simple way to query using one worker. However, the advantage of object storage is that it allows multiple workers to access partitions in the bucket. For example:</p>
<ol>
<li><p>Assign each worker one minute of log/trace data from the past hour, then aggregate the results for large-scale analysis.</p>
</li>
<li><p>Partition by "Service" and "Time" to query high-resolution data at scale during live incidents.</p>
</li>
</ol>
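<p>As a sketch of the first pattern, the minute-level fan-out can be expressed by generating one object-store prefix per minute and mapping workers over them. The bucket layout (<code>dt=…/minute=…</code>) and the <code>aggregate</code> stub below are illustrative assumptions, not Omlet’s actual schema:</p>

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta, timezone

def minute_prefixes(base: str, now: datetime, minutes: int = 60) -> list[str]:
    """Build one object-store prefix per minute for the trailing window."""
    start = now.replace(second=0, microsecond=0)
    return [
        (start - timedelta(minutes=i)).strftime(f"{base}/dt=%Y-%m-%d/minute=%H%M/")
        for i in range(minutes)
    ]

def aggregate(prefix: str) -> int:
    # Placeholder worker: a real implementation would list and parse
    # the objects under `prefix` via the object-store SDK.
    return 0

# Fixed timestamp so the example is deterministic.
now = datetime(2026, 2, 17, 4, 0, tzinfo=timezone.utc)
prefixes = minute_prefixes("logs/service=checkout", now)

# Fan one worker out per minute partition.
with ThreadPoolExecutor(max_workers=8) as ex:
    totals = list(ex.map(aggregate, prefixes))
```

Each worker touches a disjoint slice of the bucket, so the aggregation parallelizes without coordination.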
<p>Our vision at Omlet is to use OTel as a tool to create the best Observability Lakehouse experience.</p>
]]></content:encoded></item><item><title><![CDATA[Omlet House: Scaling Open Source Observability]]></title><description><![CDATA[Open-Source Observability stacks are always appealing. Alongside a core vendor, enterprises can realize the vision of a true “observability data lake” and leverage cutting-edge AI features, while also managing compliance constraints. Low-cost, highly...]]></description><link>https://blog.omlet.co/omlet-house-scaling-open-source-observability</link><guid isPermaLink="true">https://blog.omlet.co/omlet-house-scaling-open-source-observability</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><dc:creator><![CDATA[Praneet Sharma]]></dc:creator><pubDate>Mon, 05 May 2025 19:59:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746202306143/47d29526-2566-4cdf-aaef-e7f34ea37fec.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Open-Source Observability stacks are always appealing. Alongside a core vendor, enterprises can realize the vision of a true <strong>“observability data lake”</strong> and leverage cutting-edge AI features, while also managing compliance constraints. Low-cost, highly customizable, and transparent, but not without challenges.</p>
<h2 id="heading-configuration-and-administration">Configuration and Administration</h2>
<p>Config and Admin are usually the first pain points engineers confront with OSS Observability solutions. Think endless YAML, user management (auth), service integration, boilerplate queries and dashboards, cluster configuration, permissions, etc.</p>
<h2 id="heading-networking">Networking</h2>
<p>Beyond initial configuration, networking components need to be addressed. How will users access the UIs? What DNS configurations do you need to think about? What about certificate management? Although these seem like simple, common-sense things, they really add up.</p>
<h2 id="heading-scale">Scale</h2>
<p>Different services and teams, at different times, may send unpredictable amounts of volume. How do you sustain peak load and not under-provision? How do you enforce resource quotas and minimize noisy neighbors? Can resources spin up (and down) quickly?</p>
<h2 id="heading-multi-tenancy-and-orchestration">Multi-Tenancy and Orchestration</h2>
<p>In the enterprise, various teams/groups will want their own tenants. How do you properly isolate for ownership and compliance? Can a central team manage and govern sub-units effectively?</p>
<h2 id="heading-omlet-house">Omlet House</h2>
<p>The above factors, especially regarding enterprise multi-tenancy and orchestration, really pushed us to consider the ideal stack of open source observability components. There are various combinations one could think of, but the above challenges still remain. With our <strong>Omlet House</strong> service that enterprises can deploy in their own Kubernetes cluster, we wanted the “stack” spin-up to be fluid and ready to go.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746474130326/d32a2b11-965a-4808-875c-99719e7f629a.gif" alt class="image--center mx-auto" /></p>
<p>If you are on AWS, GCP, or Azure, we’d love to talk to you about trying it out! Send us a note at info@omlet.co!</p>
]]></content:encoded></item><item><title><![CDATA[The "Bitter Lesson" and "Fog of War" in Observability?]]></title><description><![CDATA[I’ve been looking at Observability reference architectures for a long time. The big initiator for me was always: “what tools do you all use?” Over time, and still something the industry needs to learn at large: “how do you use all these tools togethe...]]></description><link>https://blog.omlet.co/the-bitter-lesson-and-fog-of-war-in-observability</link><guid isPermaLink="true">https://blog.omlet.co/the-bitter-lesson-and-fog-of-war-in-observability</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><dc:creator><![CDATA[Praneet Sharma]]></dc:creator><pubDate>Fri, 04 Apr 2025 03:50:03 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1743738726574/7ea829f5-df44-47ed-81bc-3bc1bc7a739f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I’ve been looking at Observability reference architectures for a long time. The big initiator for me was always: <em>“what tools do you all use?”</em> Over time, and still something the industry needs to learn at large: <em>“how do you use all these tools together?”</em> So often, this culminates in the ultimate statement of: <em>“well we get pinged/sometimes alerted, and then look at logs here, look at logs there, look at logs…”</em></p>
<h2 id="heading-the-bitter-lesson-as-it-concerns-observability">The “Bitter Lesson” as it concerns Observability</h2>
<p>The “Bitter Lesson” is a significant observation on machine learning improvement trends:</p>
<blockquote>
<p>In the long run, general methods that leverage computation (like deep learning) outperform human-designed, specialized algorithms—even in domains where we think we have deep expertise.</p>
</blockquote>
<p>Of course, the “bitterness” arises in individuals who have spent considerable time on carefully crafted processes, due to the sheer attachment they develop over time. This happens quite often in the sciences (physicists becoming emotionally attached to theories). Observability practitioners suffer the same fate, in my opinion. Not because of the state of the environment, however, but because the commercial mantra has largely propagated the idea of a uniform, “perfect” set of telemetry. Maybe it's self-created, forcing us to endlessly worry about what is or isn’t ingested or structured. From a field perspective, one painstakingly tries to promote an ideal collection of telemetry, only to find that people are fine with just logs.</p>
<h2 id="heading-the-fog-of-war-in-observability">The “Fog of War” in Observability</h2>
<p>When I first deployed tracing, I felt the power of what I thought of as “automatic and structured logging”. I saw spans as these efficient structures where I could stash metadata for whatever analytical purpose. What I should have realized at this point: <strong>this is what happens when you stick to a consistent pattern in your observability strategy.</strong> I was fortunate to have done this early. The temptation to collect extraneous datapoints via other ingestion frameworks wasn’t there. I was fine “extending my events”.</p>
<p>From a game-theory perspective, this is analogous to illuminating the “fog of war” in RTS (real-time strategy) games.</p>
<p><img src="https://www.topbots.com/wp-content/uploads/2017/04/starcraft_fogofwar_700px.jpg" alt="Starcraft Fog Of War - TOPBOTS" class="image--center mx-auto" /></p>
<p>As is common in these games, players must regularly send out “scouts” or unit groups to explore and pathfind. I was doing <strong>this</strong> in my own way for my services. Consequently, we must accept that individuals/groups at enterprises (small and large) do this in <strong>their</strong> own way. We must ultimately be considerate of this.</p>
<p>Regardless of how this fog is being removed, we are at an interesting moment. In our RTS scenario, a constant need to explore can be mentally taxing and cost us units during the game. In observability, there is a compounding cognitive toll on querying and reading. Would there be a way to eliminate the “fog” from reappearing all together? Explore everything in parallel?</p>
<p><strong>*I want to highlight that a diverse group of units is a good strategy to combat opposing players in RTS, but that’s more akin to looking at this from a security perspective.</strong></p>
]]></content:encoded></item><item><title><![CDATA[Skillet: Our AI Agent SRE]]></title><description><![CDATA[Today, we’re excited to give a brief preview of Skillet, our upcoming AI Agent SRE product feature, and share our thoughts and some early findings from its development. We’ve been developing Skillet with collaboration from key design partners, and we...]]></description><link>https://blog.omlet.co/skillet-our-ai-agent-sre</link><guid isPermaLink="true">https://blog.omlet.co/skillet-our-ai-agent-sre</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[AI]]></category><category><![CDATA[SRE]]></category><dc:creator><![CDATA[Praneet Sharma]]></dc:creator><pubDate>Mon, 17 Mar 2025 19:38:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1741659161808/9d1758c1-37da-4b66-84e3-d41685e1b3ca.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Today, we’re excited to give a brief preview of Skillet, our upcoming AI Agent SRE product feature, and share our thoughts and some early findings from its development. We’ve been developing Skillet with collaboration from key design partners, and we’d love to explore your ideas too.</strong></p>
<p><strong>Get in touch at info@omlet.co,</strong> <a target="_blank" href="https://www.omlet.co/get-started"><strong>our website contact form</strong></a><strong>, or leave a comment!</strong></p>
<h2 id="heading-approach">Approach</h2>
<p>One of the amazing benefits of the OpenTelemetry semantic conventions is the solid data platform they unlock. With this great data foundation, we can creatively approach integrating AI and LLMs. As discussed in one of the <a target="_blank" href="https://blog.omlet.co/the-case-against-top-down-ai-in-observability">previous posts</a>, however, it is extremely tempting to look at Observability AI Agents from a “greedy” perspective. In that post, we mentioned that reliance on only “top-down” AI systems can miss crucial long-tail problems and disregard important information. Very quickly, this approach devolves into a rules-based one. At Omlet, we wanted to take a different, more straightforward approach to incorporating a powerful, lightweight AI agent into the observability landscape.</p>
<h2 id="heading-observability-is-an-always-on-endeavor">Observability is an “Always-on” endeavor</h2>
<p>Observability practices, and the automation aligned with them, do not have the comfort of being active only <em>sometimes</em>. To maximize effectiveness, observability must be <strong>always active</strong> and all-encompassing. Practitioners are often forced into a “collect everything” mentality, which can result in massive increases in cost and noise. Fortunately, there’s a better way.</p>
<h2 id="heading-bottom-up-ai-in-the-pipeline-at-the-edge">Bottom-up: AI in the pipeline, at the “Edge”</h2>
<p>A core principle of Observability is the concept of “entities”. This could be reflected, oftentimes, as a “Service”, “Host”, “Pod”, etc. These cohorts unpack grouping methodologies that can be further supplemented by “Version”, “Release”, “Experiment”, etc. We can use these groupings as opportunities to examine issues in isolated groups. This covers a few giant pain points:</p>
<ol>
<li><p>Making sense of large volumes of data in a methodical way</p>
</li>
<li><p>Allowing “top-down” methodologies (AI or Human) to focus on larger impact of interacting parts, not why or how the parts are breaking. Humans spend too much time focusing on individual entities currently, resulting in high cognitive fatigue.</p>
</li>
</ol>
<h2 id="heading-efficient-always-on-ai">Efficient “Always-on” AI</h2>
<p>You’re probably wondering: wouldn’t event data structures (logs / spans), and their associated volume of tokens, lead to a cost/noise explosion within LLMs? We attacked this problem in three parts, specifically for LLM architectures:</p>
<ol>
<li><p><strong>Better Entity Based Grouping</strong></p>
</li>
<li><p><strong>Clustering and Masking</strong></p>
</li>
<li><p><strong>Selective extraction on Span Data Structures</strong></p>
</li>
</ol>
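<p>To make the second technique concrete, here is a minimal, hypothetical sketch of clustering and masking: variable fragments (numbers, UUIDs) are masked so that repetitive log lines collapse into a handful of templates, drastically cutting the tokens an LLM has to read. The regexes and sample lines are illustrative only, not Omlet’s actual implementation:</p>

```python
import re
from collections import Counter

# Masking rules, applied in order: UUIDs first, then bare numbers.
MASKS = [
    (re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"), "<uuid>"),
    (re.compile(r"\b\d+(\.\d+)?\b"), "<num>"),
]

def mask(line: str) -> str:
    """Replace variable fragments with placeholder tokens."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line

def cluster(lines: list[str]) -> Counter:
    """Collapse raw log lines into masked templates with occurrence counts."""
    return Counter(mask(l) for l in lines)

logs = [
    "user 42 checkout took 11.3 ms",
    "user 97 checkout took 8.9 ms",
    "cache miss for key 7f3a",
]
templates = cluster(logs)
# The two checkout lines collapse into a single template.
```

Only the distinct templates (plus counts) need to reach the model, so token volume grows with log <em>variety</em>, not log <em>volume</em>.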
<p>What allowed these techniques in the first place? <strong>A plug-and-play data platform built on OTel.</strong> OTel semantic conventions are very well structured, which makes the data engineering and formatting task a breeze. We also had to consider carefully: should we build yet another pane of glass for this type of view, <strong>or</strong> could we benefit from the interoperability of OTel and bring the intelligence to the existing platform views? We chose the latter.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742240141928/7e8dfc50-fb72-4d61-9aab-18b7e8a61053.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-dynamic-incident-analysis-and-log-quality">Dynamic Incident Analysis and Log Quality</h2>
<p>Our early experiments showed exciting results across specific OpenAI (<code>gpt-4o-mini</code>) calls. As we expanded the capabilities, we proceeded to add other models, like Claude, Llama, and Mistral, and now 100s of others via “bring-your-own-LLM”. Here are some examples:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742005917811/930bb754-3096-4ae0-b5fa-ac75fd244552.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742006148124/b1e550dc-e76c-445f-b3c0-9a9ef3783275.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-whats-next">What’s Next?</h2>
<p>Unsurprisingly, observability use-cases aren’t limited to log or span incident analysis. Over the past months, we have been delighted to explore a diverse range of powerful use-cases presented by design partners working with us to develop Skillet:</p>
<ol>
<li><p><strong>Log Line Reduction Recommendation</strong></p>
</li>
<li><p><strong>Log Quality Analysis and Scoring</strong></p>
</li>
<li><p><strong>Complex Sensitive Data Analysis</strong></p>
</li>
<li><p><strong>Performance Optimization Opportunities</strong></p>
</li>
<li><p><strong>SQL Query Optimization</strong></p>
</li>
<li><p><strong>Key Points in Metrics Data</strong></p>
</li>
</ol>
<p>As an always-on AI agent, Skillet represents a massive leap forward in Omlet’s capacity to simplify and amplify your observability workflows. This is a feature still in active development, and we’re eager to continue exploring customer and partner-led ideas on how we can hone in on other use cases with this data (hint: security). Contact us at info@omlet.co and let’s get cooking!</p>
]]></content:encoded></item><item><title><![CDATA[Datadog Agents to OTel Gateway]]></title><description><![CDATA[Although I began using true observability data through APM with New Relic, it wasn't until around 2017 that I started heavily relying on the Datadog Agent for more in-depth system analysis. The agent is a robust, and very popular, way to collect syst...]]></description><link>https://blog.omlet.co/datadog-agents-to-otel-gateway</link><guid isPermaLink="true">https://blog.omlet.co/datadog-agents-to-otel-gateway</guid><category><![CDATA[Otel]]></category><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><dc:creator><![CDATA[Praneet Sharma]]></dc:creator><pubDate>Tue, 18 Feb 2025 16:00:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1739673812493/2a048112-677d-4fb9-9347-44b25d68079a.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Although I began using true observability data through APM with New Relic, it wasn't until around 2017 that I started heavily relying on the Datadog Agent for more in-depth system analysis. The agent is a robust, and very popular, way to collect system / component telemetry, logs, and traces (along with other forms of APM like database query stats and profiling). It also has strong metadata linking capabilities from cloud providers. The OTel collector is quickly catching up in terms of data collection capabilities and may even surpass it in processing power. <strong>Can we combine them?</strong></p>
<h2 id="heading-the-power-of-open-source">The power of Open Source</h2>
<p>We're experiencing a kind of software "enlightenment" era with various Open Source movements like OTel, Apache Iceberg, Deepseek, and more. For us at Omlet, we believe that the <strong>“telemetry highway” will be OTel</strong>. The “on” and “off” ramps could be diverse, but the “highway”: <strong>OTel</strong>.</p>
<p>We want the Datadog Agent to transmit data smoothly, and we are excited to introduce the "Datadog-to-OTel" Service as a container that you can host yourself.</p>
<ul>
<li><p>Container: <code>psharma1989/datadog-to-otel</code></p>
</li>
<li><p>Associated Docs: <a target="_blank" href="https://omlet-1.gitbook.io/omlet-docs/datadog-agent-s-to-otel#omlet-datadog-to-otel-service">https://omlet-1.gitbook.io/omlet-docs/datadog-agent-s-to-otel#omlet-datadog-to-otel-service</a></p>
<ul>
<li>We use the Omlet Gateway as an example within the docs</li>
</ul>
</li>
</ul>
<p>This opens up various use cases that benefit from dual-shipping or proxying alongside your main Datadog platform usage.</p>
<h2 id="heading-use-cases">Use Cases</h2>
<h3 id="heading-high-compliance-scenarios">High-compliance scenarios</h3>
<p>Sensitive observability data requires teams to buffer or store data on-premises or in-VPC. Redaction or filtration rules can be applied after the fact.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739730598567/359ef1ab-e0ff-4b48-a37d-64c62dfe0795.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-edge-alerting-and-aiops">“Edge” Alerting and AIOPs</h3>
<p>OTel is quickly becoming the universal standard for Observability understanding. The Datadog Agent outputs a wealth of structured telemetry that can be augmented with AI Agents at the “Edge”. These agents could output back to the Datadog platform for alerting and orchestration.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739730922635/5df87401-3ae6-488c-a23d-f4525e9cd871.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-observability-pipelining-and-observability-data-lake">Observability Pipelining and Observability Data Lake</h3>
<p>You might want full data resolution with a specific retention period. You may also want to control the amount of noisy data that impacts MTTR and significantly increases costs. Long-term data analysis over weeks, months, or quarters can be done on complete observability data stored in object storage. Key observability data signals can be enhanced to reduce noise in the Datadog platform.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739731230223/4ea6c58a-c830-4798-ab32-4f1410d4c11e.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-what-next">What next?</h3>
<p>We’d love to hear from the community about what else you'd like to see. Visit us at <a target="_blank" href="https://www.omlet.co/">https://www.omlet.co/</a> or email us at info@omlet.co to share your thoughts!</p>
]]></content:encoded></item><item><title><![CDATA[“Good-enough-ness” in AI]]></title><description><![CDATA[Writing code has, for better or worse, taken a variable amount of my time pre-2023. Post-2023, you suddenly have the fatigue, cognitive load, and bouts of irritation start to evaporate.. It’s not a complete seamlessness by any means, but as you exit ...]]></description><link>https://blog.omlet.co/good-enough-ness-in-ai</link><guid isPermaLink="true">https://blog.omlet.co/good-enough-ness-in-ai</guid><category><![CDATA[AI]]></category><category><![CDATA[coding]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Praneet Sharma]]></dc:creator><pubDate>Fri, 31 Jan 2025 01:47:23 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1738173330017/e82c6d52-6434-4852-b193-ae3e585fa9df.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Writing code has, for better or worse, taken a variable amount of my time pre-2023. Post-2023, the fatigue, cognitive load, and bouts of irritation suddenly start to evaporate. It’s not complete seamlessness by any means, but as you exit a work session with remarkable new tools, you must philosophically consider <strong>what has just happened, and why?</strong> It’s also important to ask, especially for a certain class of work and those curious about it: <strong>why for programming?</strong></p>
<blockquote>
<p>The questions above really help illustrate the gulf that is developing between the various users of LLM tools. For some, it’s sparsely used and a toy; for programmers, it’s like discovering a mental combustion engine.</p>
</blockquote>
<h2 id="heading-knowledge-and-verification-bubbles">Knowledge and Verification Bubbles</h2>
<p>It certainly helps to examine different knowledge spheres in terms of scope and subjectivity. This is where “programming” as a sphere has various advantages:</p>
<ol>
<li><p><strong>Brevity</strong>: Programming Languages have the distinct advantage of having very few terms (keywords) and those keywords fixed in their function.</p>
</li>
<li><p><strong>Reasonably Verifiable:</strong> Execution of the program results in a healthy population of “correct” solutions. Compare this to more subjective (and equally, or more important to humans) knowledge, like literature.</p>
</li>
<li><p><strong>Modularity in problem and solution:</strong> The above has indirectly resulted in QA destinations (like Stack Overflow) that provide excellent training signals.</p>
</li>
<li><p><strong>Logical Structure:</strong> Program structure is fairly logical and dependencies (even in complex situations) can be reasonably discerned</p>
</li>
</ol>
<h2 id="heading-compressing-the-feedback-loop">Compressing the feedback loop</h2>
<p>Here we arrive at what I consider, from personal experience, the <strong>true philosophical moment.</strong> Recall the programming experience, let’s say in the 2010s, for small tasks within a larger one (“build an app, dashboard, components, API, etc.”):</p>
<ol>
<li><p><strong>Decide on paradigm:</strong> Programming Language opinions, frameworks, documentation maturity, etc</p>
</li>
<li><p><strong>Execute feedback loop:</strong></p>
<ol>
<li><p><strong>Write code</strong></p>
</li>
<li><p><strong>Run</strong>: Verify correctness</p>
</li>
<li><p><strong>Debug</strong>: Read stack-traces, logs, etc</p>
</li>
<li><p><strong>Search</strong>: Documentation, Google, etc</p>
</li>
<li><p><strong>Repeat</strong></p>
</li>
</ol>
</li>
<li><p><strong>Refactor</strong>: clean-up with architectural best practices</p>
</li>
</ol>
<p>This, of course, is a simplified example of the general feedback loop. The steps can vary, but the overall loop is fairly consistent. This is where boundlessness, fuzziness, and variability weigh so heavily: at each step, the cost of finding, integrating, and verifying information can be high. Furthermore, the <strong>cost of a wrong answer (or “path”)</strong> can be extremely high, as it cascades into prior and subsequent efforts. To help visualize:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738171588406/306c2b3f-17ee-4a68-8ce0-14d57d4e028c.png" alt class="image--center mx-auto" /></p>
<p>You can imagine the vertical axis as possible start paths, and the horizontal axis as distance in time and effort to the next task. The issue mainly lies in choosing and evaluating among paths, any of which could ultimately prove more time-consuming or erroneous. What’s changed:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738171598686/c25aee74-84d5-4351-8d65-9826af31105a.png" alt class="image--center mx-auto" /></p>
<p>Suddenly, because of LLMs in programming, there is more of a “straight-line” effect between choices and tasks. <strong>It cannot be overstated how impactful this is.</strong> It compresses the exhausting feedback loop greatly. Although this may seem fully apparent and taken for granted, we must appreciate the significance of this for software engineering.</p>
<h2 id="heading-pathfinding-from-learning-to-search">Pathfinding: From Learning to Search</h2>
<p>We can think of this common feedback loop as a pathfinding problem. What’s interesting is that these “ideal” paths for all sorts of practical programming functions, due to human and synthetic data, are being quickly discovered. Higher level tasks such as “building an app, but with these variations” is starting to transition from a time-consuming learning problem, to a cached search problem. The implications of this are enormous, but why did this happen?</p>
<h2 id="heading-good-enough-ness">“Good-enough-ness”</h2>
<p>People unfamiliar or uninterested in software engineering tasks sometimes consider this type of work to be esoteric and/or arduous. Software engineers can also have a habit of touting the willpower and intellect needed to do what they do. <strong>But in reality, the work isn’t that complicated.</strong> Furthermore, things are getting quite <strong>“good enough”.</strong></p>
<p>Consider examples like SaaS applications with user management, dashboards, notifications, etc. At a lower level, let’s also consider their interface components and CRUD APIs. It might seem simplistic, but this represents the vast majority of generic SaaS applications that have verticalized across different industries. Even the smallest models can attack this problem set component by component, with newer architectures attacking it from high-level specification down to components. Sure, there are variations and opinions that humans need to inject, but the integration time has become fairly constant. The delta between a large team building a sophisticated app and a small one with new tools is shrinking significantly. SaaS, unfortunately, also relies on the belief that customers are too dumb and lazy to do it themselves. <strong>This is changing fast.</strong></p>
<p>This is a <strong>massive</strong> industry, and if you add lower-level cloud architecture tools, it’s even more massive. In the end, only access to the physical substrate (compute, storage) really matters. This presents a difficulty for incumbents: customers now look for extreme differentiation, or simply proclaim, <strong><em>“good-enough.”</em></strong></p>
]]></content:encoded></item><item><title><![CDATA[The Case Against "Top-Down" AI in Observability]]></title><description><![CDATA[LLMs are having an ostensibly large effect on SaaS verticals since the proliferation of agentic AI frameworks and the fruition of proper tool use. There are, finally, approaches and startups attacking the problem in the observability space. However, ...]]></description><link>https://blog.omlet.co/the-case-against-top-down-ai-in-observability</link><guid isPermaLink="true">https://blog.omlet.co/the-case-against-top-down-ai-in-observability</guid><category><![CDATA[observability]]></category><category><![CDATA[AI]]></category><category><![CDATA[llm]]></category><category><![CDATA[agentic AI]]></category><dc:creator><![CDATA[Praneet Sharma]]></dc:creator><pubDate>Sat, 25 Jan 2025 20:22:03 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1737776422833/a914af7f-5a50-4af6-9193-8abf2f2356eb.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>LLMs are having an ostensibly large effect on SaaS verticals since the proliferation of agentic AI frameworks and the fruition of proper tool use. There are, finally, approaches and startups attacking the problem in the observability space. However, I would argue that this paradigm is unfortunately stuck in a mindset that is:</p>
<ul>
<li><p>Obsessed, almost reliant, on network and entity complexity</p>
</li>
<li><p>Addicted to promoting big-data digestion</p>
</li>
<li><p>Suffering from, or completely disregarding, context gaps; or worse, assuming you have all the right data.</p>
</li>
</ul>
<p>Engineers, and observability practitioners, ideally like to manage and forecast for complexity. It’s for this reason we are also seeing services and tools that are combating code complexity:</p>
<ul>
<li><p>Optimization for runtime execution via profiling</p>
</li>
<li><p>Code cleanup for brevity/developer experience</p>
</li>
<li><p>Dependency and configuration cleanup</p>
</li>
</ul>
<p>I always like to take a gentle, systems theory approach to these things. We can see a “shift-left” moment happening in programming to make the system and its functioning in the environment more predictable and less “reactive” to changes and drift. Let’s consider a variable <strong><em>C</em></strong> that represents complexity and <strong><em>N</em></strong> that can be considered independently but influences <strong><em>C</em></strong>. The purpose of this exercise is to examine how these variables can increase in magnitude, greatly influence compute time factors, and subsequently increase the probability of false negatives and positives.</p>
<p>Factors influencing <strong><em>C:</em></strong></p>
<ol>
<li><p><strong>Quantity</strong> of entities (micro-services, hosts, etc). Increases the “cohorts” potentially needed to be analyzed</p>
</li>
<li><p><strong>Variety</strong> of <strong>request types</strong>. Could be simple HTTP requests, or long-running jobs.</p>
</li>
<li><p><strong>Variety</strong> of <strong>request attributes</strong>. Metadata decoration.</p>
</li>
<li><p>The <strong>interaction</strong> between entities.</p>
</li>
</ol>
<p>Factors influencing <strong><em>N:</em></strong></p>
<ol>
<li><p><strong>Volume</strong> of requests</p>
</li>
<li><p><strong>Repetition</strong> rate of requests</p>
</li>
</ol>
<p>What one can imagine is a large network visualization with interacting nodes modeled by weighted edges. This has become a <strong>hallmark</strong> of Observability marketing propaganda…</p>
<p>I believe that it's necessary to accurately model the system by projecting it as this data structure. However, much of the Observability industry has clung to it as the "finished" product, even when it becomes unintelligible ("network hairball"). Others have tried to add some layer of meaning but feel: "if you don't show it in all its glory, you are missing out." I believe the days of using UI/UX to parse this are over, mostly due to the phase shift we are seeing because of LLMs.</p>
<p>Here lies another problem, however. As <strong><em>C</em></strong> <em>and</em> <strong><em>N</em></strong> increase, so will the resources required to distill. The approach I’m seeing is very top-down, similar to the human SRE approach, which suffers from the same fixation on the holy network visualization. What these systems are doing on this graph structure is running a selective/greedy or full <strong>depth-first search</strong> for the sake of it. Furthermore, there lies the critical <strong>instantiation problem</strong>. When should this search trigger? A typical metrics alert? An event?</p>
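<p>A toy sketch, purely illustrative, of how a selective depth-first search over the entity graph can silently drop the long tail: edges below a traffic-weight threshold are pruned, so low-volume services are never examined at all. The graph, weights, and threshold are invented for this example:</p>

```python
def greedy_dfs(graph, start, weight_floor=0.5):
    """Follow only edges above a weight threshold -- the 'greedy' top-down search."""
    visited, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, []):
            if weight >= weight_floor:  # low-traffic edges are pruned
                stack.append(neighbor)
    return visited

# Edge weights stand in for relative request volume between services.
graph = {
    "gateway": [("checkout", 0.9), ("legacy-batch", 0.1)],
    "checkout": [("db", 0.8)],
    "legacy-batch": [("db", 0.7)],
}
reached = greedy_dfs(graph, "gateway")
# "legacy-batch" is never examined, even though it also touches the db.
```

The quiet service behind the low-weight edge is exactly the “unknown unknown” a greedy traversal disregards.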
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1737835688153/f39255e3-d18d-493e-89fd-65941baca904.png" alt class="image--center mx-auto" /></p>
<p>The above image helps illustrate how boundless this agentic search can become. Worse yet, there is a high chance of missing context (unknown unknowns). The problem with observability isn’t only disjointed data; it's not having the data, as illustrated below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1737835863702/7b932ea5-700c-49ff-8bf2-bd9ce496b55c.png" alt class="image--center mx-auto" /></p>
<p>In this case, a top-down agentic system might get “greedy” and miss edge cases, or catastrophically disregard them. It’s also worth noting that these types of agentic systems are <strong>highly incentivized</strong> to crave this complexity (“we correlate millions of signals, and charge per run”). This runs directly counter to what observability practitioners want to budget for: efficiency.</p>
<p>This post highlights why we must critically examine the new breed of top-down agents, even if they spur intellectual curiosity. In subsequent posts, we will examine alternative approaches.</p>
]]></content:encoded></item><item><title><![CDATA[Build an Observability Data Lake with OTeL]]></title><description><![CDATA[For the past few years, the data lake ecosystem has grown extensively. Various storage backends, streaming platforms, processing frameworks, etc. I’ve always wondered, while working with data lakes, lake houses, and the traditional data warehouse, fi...]]></description><link>https://blog.omlet.co/build-an-observability-data-lake-with-otel</link><guid isPermaLink="true">https://blog.omlet.co/build-an-observability-data-lake-with-otel</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[Data-lake]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Praneet Sharma]]></dc:creator><pubDate>Wed, 30 Oct 2024 01:21:04 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1730246191970/bc6228ad-ade1-4a78-af1b-1bbd13d2aae0.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>For the past few years, the data lake ecosystem has grown extensively. Various storage backends, streaming platforms, processing frameworks, etc. I’ve always wondered, while working with data lakes, lake houses, and the traditional data warehouse, filled with customer activity scenarios: <strong>where is the observability data?</strong></p>
<p>There are various reasons for this:</p>
<ol>
<li><p>No apparent use case for broad spectrum data, long-term storage, or both</p>
</li>
<li><p>Inability or hesitation to take different data structures and normalize (via ETL, etc)</p>
</li>
<li><p>Lack of <strong>standardization</strong></p>
</li>
</ol>
<p>There may, finally, be a way to build a so-called Observability Data Lake, with OpenTelemetry schemas as the avenue. The data models specified by OTeL are extremely conducive to sustainable, broad, long-term storage and unlock exciting capabilities:</p>
<ol>
<li><p>Organization-wide visibility into incidents, incident impacts, resource usage, and cost forecasts</p>
</li>
<li><p>Using observability data for user analytics and making it “joinable” with other analytics (advertising performance, partner data, etc)</p>
</li>
<li><p>Large datasets for machine learning</p>
</li>
</ol>
<p>We seek to enable this through OTeL standard pipelining into object storage as a primary destination. This strikes a balance between the real-time and nuanced views that current observability vendors critically provide, and the full-resolution observability datasets that can be used for strategic purposes.</p>
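<p>As a sketch of what such a pipeline could look like (using the <code>awss3</code> exporter from the collector-contrib distribution; the bucket name is hypothetical and exact keys vary by collector version), OTLP data can be batched and landed directly in object storage:</p>
<pre><code class="lang-yaml">receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
exporters:
  awss3:
    s3uploader:
      region: us-east-1
      s3_bucket: my-observability-lake   # hypothetical bucket
      s3_prefix: otel
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [awss3]
</code></pre>
<p>From there, batch jobs can compact the raw payloads into columnar formats for the analytics and ML use cases above.</p>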
]]></content:encoded></item><item><title><![CDATA[C++ Instrumentation with OTeL]]></title><description><![CDATA[Oftentimes, when practitioners think of generating spans/traces, we think of languages like Python, JavaScript, Java, etc. Even Go has become indexed in the minds of the observability community as a de facto language. However, C++ is still a main...]]></description><link>https://blog.omlet.co/c-instrumentation-with-otel</link><guid isPermaLink="true">https://blog.omlet.co/c-instrumentation-with-otel</guid><category><![CDATA[Otel]]></category><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><category><![CDATA[C++]]></category><dc:creator><![CDATA[Praneet Sharma]]></dc:creator><pubDate>Tue, 15 Oct 2024 02:02:34 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1728952526468/8cee8e9a-2a44-4719-9462-ca54552076d9.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Oftentimes, when practitioners think of generating spans/traces, we think of languages like Python, JavaScript, Java, etc. Even Go has become indexed in the minds of the observability community as a de facto language. However, C++ is still a mainstay in many systems and, according to the <a target="_blank" href="https://www.tiobe.com/tiobe-index/">TIOBE Index</a>, is #2 (ranked behind Python).</p>
<p>Luckily, OTeL has a robust <a target="_blank" href="https://github.com/open-telemetry/opentelemetry-cpp">C++ SDK</a>. An opinionated way to instrument is to be minimalistic on library components and rely on a host collector to access the wealth of processors and exporters. With that in mind, one can simply use the OTLP GRPC exporter to send to a local collector, creating a decoupled architecture with minimal potential impact.</p>
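<p>For reference, a minimal host-collector configuration for this pattern might look like the following (a sketch; the <code>debug</code> exporter is a stand-in for whatever backend exporter you actually use):</p>
<pre><code class="lang-yaml">receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
exporters:
  debug:
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
</code></pre>
<p>The application only ever knows about the local OTLP endpoint; processors and destination exporters can change on the collector side without touching the C++ binary.</p>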
<p>A particularly valuable use case for the C++ OTeL SDK is judiciously tracing intensive computing jobs. We can treat spans as “structured logs” in this case and methodically add them in sections. Here is an example <em>“epidemic simulation”</em>:</p>
<pre><code class="lang-cpp"><span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;iostream&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;vector&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;random&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;memory&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;thread&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;chrono&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;algorithm&gt;  // for std::count</span></span>

<span class="hljs-comment">// OpenTelemetry headers</span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">"opentelemetry/exporters/otlp/otlp_grpc_exporter_factory.h"</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">"opentelemetry/sdk/trace/simple_processor_factory.h"</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">"opentelemetry/sdk/trace/tracer_provider_factory.h"</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">"opentelemetry/trace/provider.h"</span></span>

<span class="hljs-keyword">namespace</span> trace     = opentelemetry::trace;
<span class="hljs-keyword">namespace</span> trace_sdk = opentelemetry::sdk::trace;
<span class="hljs-keyword">namespace</span> otlp      = opentelemetry::exporter::otlp;

<span class="hljs-keyword">namespace</span>
{
opentelemetry::exporter::otlp::OtlpGrpcExporterOptions opts;
<span class="hljs-built_in">std</span>::<span class="hljs-built_in">shared_ptr</span>&lt;opentelemetry::sdk::trace::TracerProvider&gt; provider;

<span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">InitTracer</span><span class="hljs-params">()</span>
</span>{
    <span class="hljs-comment">// Create OTLP exporter instance</span>
    <span class="hljs-keyword">auto</span> exporter  = otlp::OtlpGrpcExporterFactory::Create(opts);
    <span class="hljs-keyword">auto</span> processor = trace_sdk::SimpleSpanProcessorFactory::Create(<span class="hljs-built_in">std</span>::move(exporter));
    provider       = trace_sdk::TracerProviderFactory::Create(<span class="hljs-built_in">std</span>::move(processor));

    <span class="hljs-comment">// Set the global trace provider</span>
    <span class="hljs-built_in">std</span>::<span class="hljs-built_in">shared_ptr</span>&lt;opentelemetry::trace::TracerProvider&gt; api_provider = provider;
    trace::Provider::SetTracerProvider(api_provider);
}

<span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">CleanupTracer</span><span class="hljs-params">()</span>
</span>{
    <span class="hljs-comment">// We call ForceFlush to prevent canceling running exports, but it's optional.</span>
    <span class="hljs-keyword">if</span> (provider)
    {
        provider-&gt;ForceFlush();
    }

    provider.reset();
    <span class="hljs-built_in">std</span>::<span class="hljs-built_in">shared_ptr</span>&lt;opentelemetry::trace::TracerProvider&gt; none;
    trace::Provider::SetTracerProvider(none);
}
}  <span class="hljs-comment">// namespace</span>

<span class="hljs-function"><span class="hljs-keyword">int</span> <span class="hljs-title">main</span><span class="hljs-params">()</span>
</span>{
    opts.endpoint = <span class="hljs-string">"localhost:4317"</span>;  <span class="hljs-comment">// Default OTLP gRPC endpoint</span>
    opts.use_ssl_credentials = <span class="hljs-literal">false</span>;  <span class="hljs-comment">// Disable SSL/TLS for local communication</span>

    <span class="hljs-comment">// Initialize the tracer</span>
    InitTracer();

    <span class="hljs-keyword">auto</span> tracer = trace::Provider::GetTracerProvider()-&gt;GetTracer(<span class="hljs-string">"epidemic_simulation"</span>);

    <span class="hljs-comment">// Start the main simulation span</span>
    <span class="hljs-keyword">auto</span> simulation_span = tracer-&gt;StartSpan(<span class="hljs-string">"Simulation"</span>);
    <span class="hljs-keyword">auto</span> scoped_simulation = trace::Scope(simulation_span);

    <span class="hljs-comment">// Simulation parameters</span>
    <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> population_size = <span class="hljs-number">1000</span>;
    <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> initial_infected = <span class="hljs-number">10</span>;
    <span class="hljs-keyword">const</span> <span class="hljs-keyword">double</span> infection_rate = <span class="hljs-number">0.05</span>;
    <span class="hljs-keyword">const</span> <span class="hljs-keyword">double</span> recovery_rate = <span class="hljs-number">0.01</span>;
    <span class="hljs-keyword">const</span> <span class="hljs-keyword">double</span> mortality_rate = <span class="hljs-number">0.005</span>;

    <span class="hljs-comment">// States: 0 = susceptible, 1 = infected, 2 = recovered, 3 = dead</span>
    <span class="hljs-function"><span class="hljs-built_in">std</span>::<span class="hljs-built_in">vector</span>&lt;<span class="hljs-keyword">int</span>&gt; <span class="hljs-title">population</span><span class="hljs-params">(population_size, <span class="hljs-number">0</span>)</span></span>;
    <span class="hljs-built_in">std</span>::fill(population.begin(), population.begin() + initial_infected, <span class="hljs-number">1</span>);

    <span class="hljs-built_in">std</span>::default_random_engine generator;
    <span class="hljs-function"><span class="hljs-built_in">std</span>::uniform_real_distribution&lt;<span class="hljs-keyword">double</span>&gt; <span class="hljs-title">infection_dist</span><span class="hljs-params">(<span class="hljs-number">0.0</span>, <span class="hljs-number">1.0</span>)</span></span>;
    <span class="hljs-function"><span class="hljs-built_in">std</span>::uniform_real_distribution&lt;<span class="hljs-keyword">double</span>&gt; <span class="hljs-title">recovery_dist</span><span class="hljs-params">(<span class="hljs-number">0.0</span>, <span class="hljs-number">1.0</span>)</span></span>;
    <span class="hljs-function"><span class="hljs-built_in">std</span>::uniform_real_distribution&lt;<span class="hljs-keyword">double</span>&gt; <span class="hljs-title">mortality_dist</span><span class="hljs-params">(<span class="hljs-number">0.0</span>, <span class="hljs-number">1.0</span>)</span></span>;

    <span class="hljs-keyword">int</span> day = <span class="hljs-number">1</span>;
    <span class="hljs-keyword">while</span> (<span class="hljs-literal">true</span>)  <span class="hljs-comment">// Infinite loop, change condition to stop based on your needs</span>
    {
        <span class="hljs-keyword">auto</span> day_span = tracer-&gt;StartSpan(<span class="hljs-string">"Day "</span> + <span class="hljs-built_in">std</span>::to_string(day));
        <span class="hljs-keyword">auto</span> scoped_day = trace::Scope(day_span);

        <span class="hljs-keyword">int</span> new_infections = <span class="hljs-number">0</span>;
        <span class="hljs-keyword">int</span> recoveries = <span class="hljs-number">0</span>;
        <span class="hljs-keyword">int</span> deaths = <span class="hljs-number">0</span>;

        <span class="hljs-comment">// Copy of the population to avoid modifying while iterating</span>
        <span class="hljs-built_in">std</span>::<span class="hljs-built_in">vector</span>&lt;<span class="hljs-keyword">int</span>&gt; new_population = population;

        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; population_size; ++i)
        {
            <span class="hljs-keyword">if</span> (population[i] == <span class="hljs-number">1</span>)  <span class="hljs-comment">// Infected</span>
            {
                <span class="hljs-comment">// Chance to recover or die</span>
                <span class="hljs-keyword">if</span> (recovery_dist(generator) &lt; recovery_rate)
                {
                    new_population[i] = <span class="hljs-number">2</span>;  <span class="hljs-comment">// Recovered</span>
                    recoveries++;
                }
                <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (mortality_dist(generator) &lt; mortality_rate)
                {
                    new_population[i] = <span class="hljs-number">3</span>;  <span class="hljs-comment">// Dead</span>
                    deaths++;
                }
                <span class="hljs-keyword">else</span>
                {
                    <span class="hljs-comment">// Try to infect others</span>
                    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> j = <span class="hljs-number">0</span>; j &lt; population_size; ++j)
                    {
                        <span class="hljs-keyword">if</span> (population[j] == <span class="hljs-number">0</span> &amp;&amp; infection_dist(generator) &lt; infection_rate)
                        {
                            new_population[j] = <span class="hljs-number">1</span>;  <span class="hljs-comment">// Newly infected</span>
                            new_infections++;
                        }
                    }
                }
            }
        }

        population = new_population;

        <span class="hljs-keyword">int</span> total_infected = <span class="hljs-built_in">std</span>::count(population.begin(), population.end(), <span class="hljs-number">1</span>);
        <span class="hljs-keyword">int</span> total_recovered = <span class="hljs-built_in">std</span>::count(population.begin(), population.end(), <span class="hljs-number">2</span>);
        <span class="hljs-keyword">int</span> total_dead = <span class="hljs-built_in">std</span>::count(population.begin(), population.end(), <span class="hljs-number">3</span>);

        <span class="hljs-comment">// Log the day's results</span>
        <span class="hljs-built_in">std</span>::<span class="hljs-built_in">cout</span> &lt;&lt; <span class="hljs-string">"Day "</span> &lt;&lt; day &lt;&lt; <span class="hljs-string">": "</span> &lt;&lt; new_infections &lt;&lt; <span class="hljs-string">" new infections, "</span> 
                  &lt;&lt; recoveries &lt;&lt; <span class="hljs-string">" recoveries, "</span> &lt;&lt; deaths &lt;&lt; <span class="hljs-string">" deaths.\n"</span>;

        <span class="hljs-comment">// Add events and attributes to the span</span>
        day_span-&gt;AddEvent(<span class="hljs-string">"Day summary"</span>, {
            {<span class="hljs-string">"new_infections"</span>, new_infections},
            {<span class="hljs-string">"recoveries"</span>, recoveries},
            {<span class="hljs-string">"deaths"</span>, deaths},
        });
        day_span-&gt;SetAttribute(<span class="hljs-string">"day_number"</span>, day);
        day_span-&gt;SetAttribute(<span class="hljs-string">"total_infected"</span>, total_infected);
        day_span-&gt;SetAttribute(<span class="hljs-string">"total_recovered"</span>, total_recovered);
        day_span-&gt;SetAttribute(<span class="hljs-string">"total_dead"</span>, total_dead);

        day_span-&gt;End();
        day++;

        <span class="hljs-comment">// Sleep to slow down the simulation</span>
        <span class="hljs-built_in">std</span>::this_thread::sleep_for(<span class="hljs-built_in">std</span>::chrono::seconds(<span class="hljs-number">1</span>));  <span class="hljs-comment">// Sleep for 1 second between days</span>
    }

    simulation_span-&gt;End();

    <span class="hljs-comment">// Clean up and flush tracer</span>
    CleanupTracer();

    <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;
}
</code></pre>
<p>Key points to note:</p>
<ol>
<li><p><code>InitTracer</code> : where we create our tracer instance and pass options (such as OTLP endpoint)</p>
</li>
<li><p><code>tracer-&gt;StartSpan</code> declarations to create spans</p>
</li>
<li><p><code>AddEvent</code> and <code>SetAttribute</code> to add crucial metadata to the spans. This can be valuable for storing counts and facets for later processing in storage.</p>
</li>
<li><p>Ending spans and cleaning up the tracer to flush remaining telemetry</p>
</li>
</ol>
<p>Leveraging spans as a rich data structure provides a versatile way to instrument C++ applications and get a detailed inspection of behavior and performance.</p>
]]></content:encoded></item><item><title><![CDATA[Obstacles to MTTR]]></title><description><![CDATA[MTTR (Mean Time to Resolution) is a crucial benchmark/metric in Observability, and one that is the most dynamic. It’s also, if we were observing the observer, the one that is related to the most suffering. There is an abundance of tooling that facili...]]></description><link>https://blog.omlet.co/obstacles-to-mttr</link><guid isPermaLink="true">https://blog.omlet.co/obstacles-to-mttr</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><dc:creator><![CDATA[Praneet Sharma]]></dc:creator><pubDate>Fri, 20 Sep 2024 14:53:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1726757331666/c88828ab-9aa0-4f96-b1d7-3951842b2523.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>MTTR (Mean Time to Resolution) is a crucial benchmark/metric in Observability, and one that is the most dynamic. It’s also, if we were observing the observer, the one that is related to the most suffering. There is an abundance of tooling that facilitates ways to gather data and detect. Current tooling, however, seems to be focusing largely on <strong><em>engagement</em></strong> within a platform, rather than <strong><em>swiftness</em></strong> through the platform. Although to varying degrees, this is largely aggravated by a few key areas:</p>
<h2 id="heading-time-to-datarepeated-time-to-data">Time to Data/Repeated Time to Data</h2>
<p>In frantic situations, the last thing observability users need is slow queries and overall “slow time to data”. Here we define “Data” as crucial information that assists in investigation. This can take various forms:</p>
<ol>
<li><p><strong>Slow Queries from Observability Backend:</strong> Metrics/logs/trace/etc queries are resulting in timeouts, errors, overload from concurrency, etc. This can cause a repeating pattern of further querying, frustration, and loss of context.</p>
</li>
<li><p><strong>Obstacles to “Data”:</strong> Sometimes, due to product opinions, “Data” can be occluded intentionally to steer the user and force engagement. This is also immensely frustrating.</p>
</li>
<li><p><strong>Malformed “Data”:</strong> Often, the expectation of how the data should be displayed deviates greatly, also leading to enormous context loss.</p>
</li>
</ol>
<p>Investigations usually show compounding patterns, and all of these can add up incrementally.</p>
<h2 id="heading-data-structure-disjointedness">Data Structure Disjointedness</h2>
<p>Inconsistency in data structures is another avenue of enormous frustration. Consider log formatting:</p>
<ol>
<li><p><strong>Deeply Nested “JSONic” Logs</strong></p>
</li>
<li><p><strong>System Components Logs (Apache, NGINX, etc)</strong></p>
</li>
<li><p><strong>Free-form Application Logs</strong></p>
</li>
<li><p><strong>Stack Traces</strong></p>
</li>
</ol>
<p>Furthermore, if observability initiatives embrace tracing, detachment of spans due to missing trace ids (missing headers) can lead to gaps in observability for micro-service environments.</p>
<p>All of this results in “Data” that is less “joinable” and not pointing to causative factors for an incident.</p>
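<p>The trace ids in question typically ride on the W3C Trace Context <code>traceparent</code> header. Here is a small stdlib-only sketch of what parsing it involves, and how a missing or malformed header detaches a span from its trace:</p>

```python
import re

# W3C Trace Context header: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header):
    """Return trace linkage info, or None when the header is missing/malformed."""
    match = TRACEPARENT.match(header or "")
    if match is None:
        return None  # the span will start a new, disconnected trace
    _version, trace_id, parent_id, flags = match.groups()
    return {"trace_id": trace_id, "parent_id": parent_id,
            "sampled": bool(int(flags, 16) & 0x01)}

print(parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"))
print(parse_traceparent(None))  # dropped header: the linkage is gone
```

<p>When any hop fails to propagate this header (a proxy stripping it, an uninstrumented service), the downstream spans become exactly the detached orphans described above.</p>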
<h2 id="heading-cognitive-load-of-noise-and-context-shifting">Cognitive Load of Noise and Context Shifting</h2>
<p>As humans, cognitive and emotional strain starts to pile up during active incidents and system failure. Tooling rarely considers this during design. Noise, whether extraneous “Data” or simply too much of it, translates poorly to the unfortunately narrowing perception budget of the observability practitioner. This is further aggravated by UIs that are of an entirely different design philosophy and must be paired together during incident analysis.</p>
<h2 id="heading-minimal-engagementcommunication-loss">Minimal Engagement/Communication Loss</h2>
<p>Finally, gaps in communication, or simply observability not being utilized at all, are a significant problem. Very often, visualization is configured without considering that human operators won’t constantly be watching. Unfortunately, alerting configuration is far less intuitive to design than its graphing counterpart. Furthermore, unreliable or ignored communication channels culminate in an ultimate panic event.</p>
<h2 id="heading-sre-experience-metric">SRE-experience metric</h2>
<p>When examining these frequent obstacles to incident resolution, tooling must start to consider metrics for emotional fatigue and general organizational chaos. Only then can tooling accurately comprehend what observability users have to endure during these types of events.</p>
]]></content:encoded></item><item><title><![CDATA[Span Collection with OTeL]]></title><description><![CDATA[Span (and Trace) collection in OTeL is standardized through tracing SDKs/Libraries that can directly send via the OTLP protocol to a collector. Traces can strongly supplement generic logging with their structure and linked information that can be repre...]]></description><link>https://blog.omlet.co/span-collection-with-otel</link><guid isPermaLink="true">https://blog.omlet.co/span-collection-with-otel</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><dc:creator><![CDATA[Praneet Sharma]]></dc:creator><pubDate>Mon, 09 Sep 2024 13:52:54 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1725848259153/eb9464ca-5f2f-46eb-a816-3172d18e3401.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Span (and Trace) collection in OTeL is standardized through tracing SDKs/Libraries that can directly send via the <code>OTLP</code> protocol to a collector. Traces can strongly supplement generic logging with their structure and linked information, which can be represented as a directed acyclic graph. Not only can this help diagnose issues caused by dependencies, it also gives crucial information on component timing. The hallmark of tracing libraries is <strong>auto-instrumentation.</strong></p>
<h2 id="heading-trace-data-model">Trace Data Model</h2>
<p>Just like Logs and Metrics, the OTeL Spec defines a Data Model (for Spans). Let's look at a few key fields:</p>
<ul>
<li><p><strong>Time</strong></p>
<ul>
<li><code>start_time</code> and <code>end_time</code>: together these determine the span's <code>duration</code></li>
</ul>
</li>
<li><p><strong>Context</strong></p>
<ul>
<li><code>trace_id</code> and <code>span_id</code>: unique identifiers that help link spans together as part of a trace</li>
</ul>
</li>
<li><p><strong>Kind</strong></p>
<ul>
<li>Can be one of <code>Client</code>, <code>Server</code>, <code>Internal</code>, <code>Producer</code>, or <code>Consumer</code>: provides clues to how spans should be linked, representing outgoing calls, incoming calls, or calls that are nested internally.</li>
</ul>
</li>
<li><p><strong>Status</strong></p>
<ul>
<li><code>status_code</code> and <code>status_message</code>: indicates whether the span represents an error or completed without issue ("ok").</li>
</ul>
</li>
<li><p><strong>Attribution</strong></p>
<ul>
<li><code>resource</code> and <code>attributes</code>: just like other telemetry, resource attribution can represent metadata about the runtime environment (host, pod, etc). Attributes can help add metadata context to spans, like User ID.</li>
</ul>
</li>
<li><p><strong>Links</strong></p>
<ul>
<li>"Links" can help bridge different traces. Especially useful for asynchronous and distributed operations.</li>
</ul>
</li>
<li><p><strong>Events</strong></p>
<ul>
<li>Additional data that can represent a singular event of significance.</li>
</ul>
</li>
</ul>
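<p>To make the <strong>Context</strong> and <strong>Time</strong> fields concrete, here is a small stdlib-only sketch (the span records and field names are hypothetical exports, not the OTeL SDK's objects) that stitches spans into a tree via their ids and derives durations:</p>

```python
from collections import defaultdict

# Hypothetical exported span records sharing one trace_id.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent": None, "name": "GET /checkout",
     "start_time": 0.000, "end_time": 0.250},
    {"trace_id": "t1", "span_id": "b", "parent": "a", "name": "auth.check",
     "start_time": 0.010, "end_time": 0.040},
    {"trace_id": "t1", "span_id": "c", "parent": "a", "name": "db.query",
     "start_time": 0.050, "end_time": 0.230},
]

def build_tree(spans):
    """Index spans by parent id and compute each span's duration."""
    children = defaultdict(list)
    root = None
    for s in spans:
        s["duration"] = s["end_time"] - s["start_time"]
        if s["parent"] is None:
            root = s          # the span with no parent is the trace root
        else:
            children[s["parent"]].append(s)
    return root, children

root, children = build_tree(spans)
print(root["name"], [c["name"] for c in children[root["span_id"]]])
```

<p>This linkage is what backends use to render waterfall views: the shared <code>trace_id</code> scopes the tree, and parent span ids give it shape.</p>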
<h2 id="heading-instrumentation">Instrumentation</h2>
<h3 id="heading-zero-code-instrumentation">Zero-Code Instrumentation</h3>
<p>Zero-Code instrumentation libraries provide an effective way for practitioners to gather trace telemetry without code changes. Here is an example with Node.js:</p>
<pre><code class="lang-shell">env OTEL_TRACES_EXPORTER=otlp OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=your-endpoint \
node --require @opentelemetry/auto-instrumentations-node/register app.js
</code></pre>
<p>Note how <code>OTEL_EXPORTER_OTLP_TRACES_ENDPOINT</code> points to your collector. There are various environment variables one can set, examples as outlined <a target="_blank" href="https://opentelemetry.io/docs/specs/otel/protocol/exporter/">here</a>.</p>
<p>Additionally, one can verify supported libraries (Node.js <a target="_blank" href="https://github.com/open-telemetry/opentelemetry-js-contrib/blob/main/metapackages/auto-instrumentations-node/README.md#supported-instrumentations">example</a>).</p>
<h3 id="heading-output">Output</h3>
<p>Once this is configured, you can start receiving traces, as seen below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725851076164/655038bf-26f4-4e6b-95d7-0ced3c27d65d.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Log File Collection with OTeL]]></title><description><![CDATA[The filelogreceiver is a powerful component within the OTeL Collector. It is a full-fledged receiver that can tail files, and also process them on the fly. Before diving into the receiver configuration specifics and functionality, it is crucial to un...]]></description><link>https://blog.omlet.co/log-file-collection-with-otel</link><guid isPermaLink="true">https://blog.omlet.co/log-file-collection-with-otel</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><category><![CDATA[logging]]></category><dc:creator><![CDATA[Praneet Sharma]]></dc:creator><pubDate>Tue, 27 Aug 2024 19:57:57 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1724769676184/5faa0b57-e914-4578-a265-bb8e8c1c6a0b.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The <code>filelogreceiver</code> is a powerful component within the OTeL Collector. It is a full-fledged receiver that can tail files, and also process them on the fly. Before diving into the receiver configuration specifics and functionality, it is crucial to understand the logs data model.</p>
<h2 id="heading-logs-data-model">Logs Data Model</h2>
<p>There are various fields within the data model specification. Let's cover them categorically:</p>
<ul>
<li><p><strong>Time</strong></p>
<ul>
<li><p><code>Timestamp</code>: The timestamp from the source (for example, when a log was generated by an SDK. <strong>Preferred</strong>)</p>
</li>
<li><p><code>ObservedTimestamp</code>: The timestamp from the collection system (for example, when the collector receives the log message)</p>
</li>
</ul>
</li>
<li><p><strong>Trace</strong></p>
<ul>
<li><code>TraceId</code>, <code>SpanId</code>, <code>TraceFlags</code>: For correlation to traces and spans. TraceFlags provides recommendations such as 'sampling' level, etc</li>
</ul>
</li>
<li><p><strong>Severity</strong></p>
<ul>
<li><code>SeverityText</code>, <code>SeverityNumber</code>: "log level". Optional fields but OTeL provides numerical representations for granular log levels (TRACE, DEBUG, etc)</li>
</ul>
</li>
<li><p><strong>Body</strong></p>
<ul>
<li><code>Body</code>: the log message body.</li>
</ul>
</li>
<li><p><strong>Attribution</strong></p>
<ul>
<li><p><code>Resource</code>: Key / Value pairs that define and scope where messages originate from (for example: pods, instances, services, etc)</p>
</li>
<li><p><code>Attributes</code>: Key / Value pairs that give additional information about the message (for example: http method, status, etc)</p>
</li>
<li><p><code>InstrumentationScope</code>: Provides information on emitting libraries (for example: OTeL SDK library and version)</p>
</li>
</ul>
</li>
</ul>
<h2 id="heading-file-log-receiver">File Log Receiver</h2>
<p>This receiver has a multitude of configuration options. Again, let's explore categorically and highlight important ones:</p>
<ul>
<li><p><strong>File Match Patterns</strong></p>
<ul>
<li><code>include</code>, <code>exclude</code>: glob patterns that define files to be read, and specific ones to be excluded (for example: <code>/var/log/folder/*.json</code> )</li>
</ul>
</li>
<li><p><strong>Processing</strong></p>
<ul>
<li><p><code>multiline</code>: Specifies a match pattern that marks where a new record begins. Incredibly useful for error/stack-trace style logs.</p>
</li>
<li><p><code>storage</code>: Critical to offset tracking (file storage). Points to an "extension" in the OTeL spec.</p>
</li>
<li><p><code>operators</code>: There are various "operators" that can take one of the fields (like the message body) and parse it (for example, parse json, time, uri, etc). These can be chained together via the <code>output</code> option. You can use <code>if</code> conditionals and also embed processors</p>
</li>
</ul>
</li>
<li><p><strong>Performance</strong></p>
<ul>
<li><p><code>poll_interval</code>: Duration between File System polls</p>
</li>
<li><p><code>max_log_size</code>: Max size of log entry to read to protect from large memory usage</p>
</li>
</ul>
</li>
<li><p><strong>Metadata Decoration</strong></p>
<ul>
<li><p><code>include_file_*</code>: Various config parameters (boolean) like <code>include_file_name</code>, <code>include_file_path</code>, etc that can be added as useful attributes</p>
</li>
<li><p><code>attributes</code> / <code>resource</code>: Directly add key / value pairs for <strong>all</strong> log messages from a file referenced in a receiver.</p>
</li>
</ul>
</li>
</ul>
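<p>As a sketch of the processing options combined (the attribute names and layout here are illustrative; consult the filelog receiver docs for your collector version), a JSON log line can be parsed and its timestamp and severity promoted in one pass:</p>
<pre><code class="lang-yaml">receivers:
  filelog:
    include: [/var/log/app/*.json]
    operators:
      - type: json_parser        # parse the JSON body into attributes
        parse_from: body
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%dT%H:%M:%SZ'
        severity:
          parse_from: attributes.level
</code></pre>
<p>Each operator feeds the next, so further parsers (uri, regex, etc) can be chained after the JSON step.</p>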
<h2 id="heading-offset-tracking-for-reliability">Offset Tracking for Reliability</h2>
<p>Offset tracking is crucial to ensure log collection is reliable and accurate if the collector(s) restart. Using the <code>file_storage</code> extension, one can specify a directory for storage.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">receivers:</span>
  <span class="hljs-attr">filelog:</span>
    <span class="hljs-attr">include:</span> [<span class="hljs-string">/var/log/service/sample.log</span>]
    <span class="hljs-attr">storage:</span> <span class="hljs-string">file_storage/filelogreceiver</span>
<span class="hljs-attr">extensions:</span>
  <span class="hljs-attr">file_storage/filelogreceiver:</span>
    <span class="hljs-attr">directory:</span> <span class="hljs-string">/var/lib/otelcol/mydir</span>
</code></pre>
]]></content:encoded></item><item><title><![CDATA[Process Monitoring with OTeL]]></title><description><![CDATA[In the last few posts, we have seen how we can utilize the OTeL collector to collect system and K8s metrics, along with metadata. This base configuration state can provide simple health metrics on diverse environment types. From here, the amount of u...]]></description><link>https://blog.omlet.co/process-monitoring-with-otel</link><guid isPermaLink="true">https://blog.omlet.co/process-monitoring-with-otel</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><dc:creator><![CDATA[Praneet Sharma]]></dc:creator><pubDate>Tue, 06 Aug 2024 14:03:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1722900727881/d9e0d872-b555-4f9a-ac26-61fd3b3f561e.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the last few posts, we have seen how we can utilize the OTeL collector to collect system and K8s metrics, along with metadata. This base configuration state can provide simple health metrics on diverse environment types. From here, the amount of unique metrics practitioners can gather becomes vast. There are various receivers, the prometheus ecosystem, StatsD, custom OTLP, etc. An overlooked category of metrics is often of <strong>processes</strong> running on a host.</p>
<h2 id="heading-configuration">Configuration</h2>
<pre><code class="lang-yaml"><span class="hljs-attr">receivers:</span>
  <span class="hljs-attr">hostmetrics:</span>
    <span class="hljs-attr">collection_interval:</span> <span class="hljs-string">10s</span>
    <span class="hljs-attr">scrapers:</span>
      <span class="hljs-attr">process:</span>
          <span class="hljs-attr">metrics:</span>
            <span class="hljs-attr">process.cpu.utilization:</span>
              <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
            <span class="hljs-attr">process.disk.operations:</span>
              <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
            <span class="hljs-attr">process.memory.utilization:</span>
              <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
            <span class="hljs-attr">process.threads:</span>
              <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>As part of the hostmetrics receiver, the process scraper can easily be added. This example configuration provides a sufficient starting point, with gauge-like metrics such as CPU and memory utilization enabled.</p>
<h2 id="heading-resource-attributes">Resource Attributes</h2>
<p>Valuable resource attributes are populated alongside the metrics. Of these, <code>process.pid</code>, <code>process.owner</code>, <code>process.command</code> and <code>process.command_line</code> are particularly interesting. Process identifiers and owners can help highlight resource contention within a host, and changes in command-line arguments can explain otherwise anomalous shifts in a process's metrics.</p>
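<p>If collecting metrics for every process on a host is too noisy, the scraper can be narrowed with its <code>include</code> / <code>exclude</code> filters. A sketch, where the process names are illustrative:</p>
<pre><code class="lang-yaml">receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      process:
        # only scrape the processes we care about
        include:
          names: [ nginx, postgres ]
          match_type: strict
        # suppress errors for processes whose name cannot be read
        mute_process_name_error: true
</code></pre>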
<h2 id="heading-sensitive-data-scrubbing">Sensitive Data Scrubbing</h2>
<p>One thing to note is the potential to leak sensitive command-line arguments as part of data collection. It is imperative to add the <code>transform</code> processor to redact such values:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">transform:</span>
  <span class="hljs-attr">metric_statements:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">context:</span> <span class="hljs-string">metric</span>
      <span class="hljs-attr">statements:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">replace_pattern(resource.attributes["process.command_line"],</span> <span class="hljs-string">"password\\=[^\\s]*(\\s?)"</span><span class="hljs-string">,</span> <span class="hljs-string">"password=***"</span><span class="hljs-string">)</span>
</code></pre>
<p>The example above redacts a password argument in the <code>command_line</code> resource attribute via OTTL (OpenTelemetry Transformation Language).</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Process monitoring is an excellent way to isolate long-running processes that are exhausting resources or ephemeral processes that cause sudden anomalies. OTeL provides a clean and concise way to do just this.</p>
]]></content:encoded></item><item><title><![CDATA[Ideal OTeL Configurations for Bare-Metal and Kubernetes (Part 2)]]></title><description><![CDATA[In Part 1, we explored the hostmetricsreceiver as an excellent way to get system metrics. Moving onto a K8s environment, we must consider the importance of additional metadata and the importance of metrics at a component level. Before diving into the...]]></description><link>https://blog.omlet.co/ideal-otel-configurations-for-bare-metal-and-kubernetes-part-2</link><guid isPermaLink="true">https://blog.omlet.co/ideal-otel-configurations-for-bare-metal-and-kubernetes-part-2</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[observability]]></category><dc:creator><![CDATA[Praneet Sharma]]></dc:creator><pubDate>Tue, 23 Jul 2024 02:59:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1721696422877/ea70d394-024f-42bc-8cc3-3f9c696b3625.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In Part 1, we explored the <a target="_blank" href="https://blog.omlet.co/ideal-otel-configurations-for-bare-metal-and-kubernetes-part-1"><strong>hostmetricsreceiver</strong></a> as an excellent way to get system metrics. Moving onto a K8s environment, we must consider the importance of additional metadata and the importance of metrics at a component level. Before diving into the conventional ways of configuring (verbose configurations), one can utilize ready-made presets via Helm. Here is an example values.yaml:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">opentelemetry-collector:</span>
  <span class="hljs-attr">presets:</span>
    <span class="hljs-attr">logsCollection:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">kubernetesAttributes:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">kubeletMetrics:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">hostMetrics:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">clusterMetrics:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">kubernetesEvents:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>The above concisely covers <em>most</em> of what observability practitioners need. We can dive into each preset individually.</p>
<h2 id="heading-logs-collection">Logs Collection</h2>
<p>The <code>logsCollection</code> preset is essentially the <code>filelogreceiver</code>. By default it will collect <code>/var/log/pods/*/*/*.log</code>.</p>
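<p>Under the hood, the preset renders roughly the following receiver configuration (a sketch; the exact operators and defaults vary by chart version):</p>
<pre><code class="lang-yaml">receivers:
  filelog:
    include: [ /var/log/pods/*/*/*.log ]
    # avoid ingesting the collector's own logs
    exclude: [ /var/log/pods/*/otel-collector/*.log ]
    start_at: end
    include_file_path: true
</code></pre>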
<h2 id="heading-kubernetes-attributes">Kubernetes Attributes</h2>
<p>The Kubernetes Attributes Processor is arguably the most important component here. It adds Kubernetes context to telemetry for effective correlation and filtering. Additional options should also be examined for further metadata extraction:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">opentelemetry-collector:</span>
  <span class="hljs-attr">presets:</span>
    <span class="hljs-attr">kubernetesAttributes:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
      <span class="hljs-comment"># When enabled the processor will extract all labels for an associated pod and add them as resource attributes.</span>
      <span class="hljs-comment"># The label's exact name will be the key.</span>
      <span class="hljs-attr">extractAllPodLabels:</span> <span class="hljs-literal">false</span>
      <span class="hljs-comment"># When enabled the processor will extract all annotations for an associated pod and add them as resource attributes.</span>
      <span class="hljs-comment"># The annotation's exact name will be the key.</span>
      <span class="hljs-attr">extractAllPodAnnotations:</span> <span class="hljs-literal">false</span>
</code></pre>
<p>This is recommended for most deployments to maximize metadata decoration. In our minimal example we can expect these attributes by default:</p>
<ul>
<li><p><code>k8s.namespace.name</code></p>
</li>
<li><p><code>k8s.pod.name</code></p>
</li>
<li><p><code>k8s.pod.uid</code></p>
</li>
<li><p><code>k8s.pod.start_time</code></p>
</li>
<li><p><code>k8s.deployment.name</code></p>
</li>
<li><p><code>k8s.node.name</code></p>
</li>
</ul>
<h2 id="heading-kublet-metrics">Kubelet Metrics</h2>
<p>This preset adds the <code>kubeletstatsreceiver</code> to the configuration and collects node, pod and container metrics. The metric definitions, such as <code>container.cpu.time</code> and <code>k8s.pod.cpu.utilization</code>, can be found <a target="_blank" href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kubeletstatsreceiver/documentation.md">here</a>.</p>
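<p>When installing via Helm, the receiver's defaults can be overridden in the chart's <code>config</code> section; for example, the scrape interval and metric groups can be set explicitly (values here are illustrative):</p>
<pre><code class="lang-yaml">config:
  receivers:
    kubeletstats:
      collection_interval: 20s
      # limit collection to these component levels (volume is also available)
      metric_groups: [ node, pod, container ]
</code></pre>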
<h2 id="heading-host-metrics">Host Metrics</h2>
<p>Similar to Part 1, we can get details about the host (or node) by adding a <code>hostmetricsreceiver</code>.</p>
<h2 id="heading-cluster-metrics">Cluster Metrics</h2>
<p>Cluster-level metrics can be collected by enabling this (adding a <code>k8sclusterreceiver</code>). As with Kubelet Metrics, the metric definitions are available <a target="_blank" href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/k8sclusterreceiver/documentation.md">here</a>.</p>
<h2 id="heading-kubernetes-events">Kubernetes Events</h2>
<p>This adds the <code>k8sobjectsreceiver</code>, which pulls or watches (via the K8s API) for K8s events. These are emitted as individual log records in the collector pipeline.</p>
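<p>A sketch of the receiver this preset enables, watching <code>events</code> objects via the Kubernetes API:</p>
<pre><code class="lang-yaml">receivers:
  k8sobjects:
    objects:
      - name: events
        # "watch" streams events as they occur; "pull" polls on an interval
        mode: watch
</code></pre>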
<h2 id="heading-config-section">Config Section</h2>
<p>Beyond the presets, within the <code>values.yaml</code>, one can provide specific configurations in the <code>config</code> section:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">opentelemetry-collector:</span>
  <span class="hljs-attr">presets:</span>
    <span class="hljs-attr">kubernetesAttributes:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">kubeletMetrics:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">hostMetrics:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">clusterMetrics:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">kubernetesEvents:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">config:</span>
    <span class="hljs-attr">receivers:</span>
      <span class="hljs-attr">kubeletstats:</span>
        <span class="hljs-attr">insecure_skip_verify:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">extensions:</span>
      <span class="hljs-attr">bearertokenauth:</span>
        <span class="hljs-attr">token:</span> <span class="hljs-string">$API_KEY</span>
    <span class="hljs-attr">exporters:</span>
      <span class="hljs-attr">otlphttp/example:</span>
        <span class="hljs-attr">endpoint:</span> <span class="hljs-string">"https://intake.test_domain.com"</span>
        <span class="hljs-attr">auth:</span>
          <span class="hljs-attr">authenticator:</span> <span class="hljs-string">bearertokenauth</span>

    <span class="hljs-attr">service:</span>
      <span class="hljs-attr">pipelines:</span>
        <span class="hljs-attr">traces:</span>
          <span class="hljs-attr">exporters:</span> [<span class="hljs-string">spanmetrics</span>, <span class="hljs-string">otlphttp/example</span>]
        <span class="hljs-attr">metrics:</span>
          <span class="hljs-attr">exporters:</span> [<span class="hljs-string">otlphttp/example</span>]
        <span class="hljs-attr">logs:</span>
          <span class="hljs-attr">exporters:</span> [<span class="hljs-string">otlphttp/example</span>]
      <span class="hljs-attr">extensions:</span> [<span class="hljs-string">health_check</span>, <span class="hljs-string">bearertokenauth</span>]
      <span class="hljs-attr">telemetry:</span>
        <span class="hljs-attr">metrics:</span>
          <span class="hljs-attr">level:</span> <span class="hljs-string">"none"</span>
</code></pre>
<p>In this configuration example, we are defining exporter configurations with authentication. Additionally, we are modifying the <code>kubeletstatsreceiver</code> from the preset, and adding a config option (<code>insecure_skip_verify: true</code>).</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In Part 1 and 2, we introduce ideal configuration options that add health metrics and attributes that can benefit observability practitioners as soon as they deploy an OTeL collector. In future blog posts, we will continue to explore configuration profiles.</p>
]]></content:encoded></item><item><title><![CDATA[Ideal OTeL Configurations for Bare-Metal and Kubernetes (Part 1)]]></title><description><![CDATA[The intention of this post is to be an opinionated suggestion on collector configurations to gather comprehensive metrics for two fundamental environment types. Fortunately, the OTeL community has made it fairly straight-forward. In "Part 1", we will...]]></description><link>https://blog.omlet.co/ideal-otel-configurations-for-bare-metal-and-kubernetes-part-1</link><guid isPermaLink="true">https://blog.omlet.co/ideal-otel-configurations-for-bare-metal-and-kubernetes-part-1</guid><category><![CDATA[opentelemetry collector]]></category><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><dc:creator><![CDATA[Praneet Sharma]]></dc:creator><pubDate>Thu, 11 Jul 2024 17:38:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1720281740945/e331fef1-f88e-480d-a5c8-b71a41043813.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The intention of this post is to be an opinionated suggestion on collector configurations to gather comprehensive metrics for two fundamental environment types. Fortunately, the OTeL community has made it fairly straight-forward. In "Part 1", we will explore simple host monitoring.</p>
<h2 id="heading-host-metrics-and-attributes">Host Metrics and Attributes</h2>
<p>The <a target="_blank" href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/hostmetricsreceiver/README.md">hostmetricsreceiver</a> is a mainstay of the OTeL collector repository and provides system health metrics. There are various components within the configuration patterns that can be enabled selectively.</p>
<h3 id="heading-collection-interval">Collection Interval</h3>
<p>Receivers can scrape at various intervals:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">hostmetrics:</span>
        <span class="hljs-attr">collection_interval:</span> <span class="hljs-string">10s</span>
</code></pre>
<p>Depending on data retention windows and granularity needed, one should expect to set this from <strong>5s-30s</strong> (in increments of <strong>5s</strong>), although any frequency can be used.</p>
<h3 id="heading-cpu-and-memory">CPU and Memory</h3>
<p>CPU and memory metrics can be enabled; however, it is important to know which ones are on by "default". Let's walk through a config:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">hostmetrics:</span>
        <span class="hljs-attr">collection_interval:</span> <span class="hljs-string">10s</span>
        <span class="hljs-attr">scrapers:</span>
          <span class="hljs-attr">cpu:</span>
            <span class="hljs-attr">metrics:</span>
              <span class="hljs-attr">system.cpu.utilization:</span>
                <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
          <span class="hljs-attr">memory:</span>
            <span class="hljs-attr">metrics:</span>
              <span class="hljs-attr">system.memory.utilization:</span>
                <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>In the above code, just enabling <code>cpu</code> populates <code>system.cpu.time</code> as a cumulative <code>sum</code>. Enabling <code>system.cpu.utilization</code> is recommended, as it provides percentage-based metrics as a <code>gauge</code>, which operators may be more familiar with.</p>
<p>It's important to consider that states such as <code>idle</code>, <code>wait</code>, and <code>user</code> come through as attributes, not unique metric names. So instead of a <code>system.cpu.utilization.user</code> metric, depending on the backend, one must filter by an attribute / tag.</p>
<p>Similarly, <code>system.memory.utilization</code> offers <code>gauge</code>-like metrics that may be more appetizing. Attributes such as <code>used</code> and <code>free</code> are likewise available rather than unique metric names.</p>
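<p>Because states are attributes, they can also be filtered at the datapoint level in the collector itself. As a sketch, dropping the (often uninteresting) <code>idle</code> state with the <code>filter</code> processor and an OTTL condition:</p>
<pre><code class="lang-yaml">processors:
  filter/drop_idle:
    metrics:
      datapoint:
        # drop CPU utilization datapoints whose state attribute is "idle"
        - 'metric.name == "system.cpu.utilization" and attributes["state"] == "idle"'
</code></pre>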
<h3 id="heading-disk-and-filesystem">Disk and Filesystem</h3>
<p>Disk performance and overall filesystem usage can be added as seen in the below config:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">hostmetrics:</span>
        <span class="hljs-attr">collection_interval:</span> <span class="hljs-string">10s</span>
        <span class="hljs-attr">scrapers:</span>
          <span class="hljs-attr">disk:</span>
          <span class="hljs-attr">filesystem:</span>
            <span class="hljs-attr">metrics:</span>
              <span class="hljs-attr">system.filesystem.utilization:</span>
                <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>Important metrics, especially <code>system.disk.io</code>, are populated via the <code>disk</code> scraper. Attributes provide <code>read</code> and <code>write</code> info by <code>device</code>. Filesystem usage (space used vs free) is also added, with <code>system.filesystem.utilization</code> providing <code>gauge</code> like metrics. Again, attributes exist for <code>free</code> and <code>used</code>.</p>
<h3 id="heading-network">Network</h3>
<p>To get network I/O, connections, and errors/dropped packets:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">hostmetrics:</span>
        <span class="hljs-attr">collection_interval:</span> <span class="hljs-string">10s</span>
        <span class="hljs-attr">scrapers:</span>
          <span class="hljs-attr">network:</span>
</code></pre>
<p>These metrics can be quite valuable, as they can pinpoint network issues by <code>device</code>, <code>protocol</code> and <code>receive</code> / <code>transmit</code>.</p>
<h3 id="heading-resource-attributes">Resource Attributes</h3>
<p>Resource attribute decoration is crucial for querying metrics by information such as "host", "region", "provider", etc.:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">processors:</span>
  <span class="hljs-attr">resourcedetection:</span>
    <span class="hljs-attr">detectors:</span> [<span class="hljs-string">gcp</span>, <span class="hljs-string">ecs</span>, <span class="hljs-string">ec2</span>, <span class="hljs-string">azure</span>, <span class="hljs-string">system</span>]
    <span class="hljs-attr">override:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>Keep in mind: order matters and the <strong>first detector to insert wins.</strong></p>
<h2 id="heading-part-2">Part 2</h2>
<p>In Part 2, we will explore relevant metrics and metadata decoration within a Kubernetes environment.</p>
]]></content:encoded></item><item><title><![CDATA[OMLET: (O)pen (M)etrics (L)ogs (E)vents (T)races]]></title><description><![CDATA[I've been in the Observability world for over 12 years now, witnessing the comforts (and complexities) that Cloud architecture provided. Recently, I had a random thought about the space:

If I were to start a high school course around the topic, what...]]></description><link>https://blog.omlet.co/omlet-open-metrics-logs-events-traces</link><guid isPermaLink="true">https://blog.omlet.co/omlet-open-metrics-logs-events-traces</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><dc:creator><![CDATA[Praneet Sharma]]></dc:creator><pubDate>Fri, 21 Jun 2024 14:27:03 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1718849340094/69ceafd8-5c08-4cbf-8ecc-93a848437536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I've been in the Observability world for over 12 years now, witnessing the comforts (and complexities) that Cloud architecture provided. Recently, I had a random thought about the space:</p>
<blockquote>
<p>If I were to start a high school course around the topic, what would it look like?</p>
</blockquote>
<p>To me, it fundamentally speaks to the importance of understanding proper "feedback from a system" to accurately answer questions like:</p>
<ul>
<li><p>Why is it deviating from a baseline?</p>
</li>
<li><p>If I were to make X or Y change, what data points would be forecasted to change alongside that?</p>
</li>
<li><p>What is the system's "homeostasis"?</p>
</li>
<li><p>How can I <strong>easily</strong> gather system output data that is best reflective of internal states?</p>
</li>
</ul>
<p>If you strip away all the vernacular that we use in the industry, it becomes simply about being able to "ask" something about the system. A stable and open standard is crucial to that succeeding. Furthermore, fostering an environment that facilitates an enticing entry for future generations is even more significant.</p>
<h2 id="heading-opentelemetry-otel">OpenTelemetry (OTeL)</h2>
<p>I think everyone in the Observability space has explored OTeL from the perspective of a practitioner, vendor, architect, engineer, IT/Ops person, or analyst; the list goes on. Opinions have ranged from liberator to misguided initiative, "not ready yet", indifference, or even outright adversarial.</p>
<p>With all that said, if we consider how a student (even a beginner) might comfortably approach this subject, we can all agree that humans <strong>hate context shifting</strong>. So whatever the state of the standard or what it looks like, the fact that the space is relentlessly heading towards it is <strong>good</strong>.</p>
<p>Every individual within the space can attest to the cognitive burden induced by obscure data structures, unlinked data, and arbitrary config patterns. In fact, <strong>vendors</strong> stand to benefit the most. Why is this? <strong>Sales cycles, onboarding and continuous engagement.</strong></p>
<p>Ask any account person or customer in Observability: the main toil is re-education and re-absorption of concepts that should be quite fluid. Very often, engineering and product teams focus on patterns that allow them to iterate quickly and unfortunately emerge as hyper-opinionated. This comes at a direct cost to the mental bandwidth of the account teams responsible for promoting to and supporting users. Not only are they consistently playing catch-up with the core product/eng teams, they also bear the brunt of complaints from customers.</p>
<h2 id="heading-omlet">OMLET</h2>
<p>OMLET was started because we wanted to be part of the OTeL journey, and especially because promotion and evangelism of an open standard in this space are so needed. We plan to focus on the pipeline problem that keeps observability users from kickstarting their OTeL journey, or from keeping it alive. Beyond the tech, standardization is very much an organizational and cultural need.</p>
<p><em>"You can't make an omelette without breaking eggs"</em></p>
]]></content:encoded></item></channel></rss>