This post offers opinionated suggestions for collector configurations that gather comprehensive metrics for two fundamental environment types. Fortunately, the OTel community has made this fairly straightforward. In "Part 1", we will explore simple host monitoring.
Host Metrics and Attributes
The hostmetricsreceiver is a mainstay of the OTel collector repository and provides system health metrics. Its scrapers, and individual metrics within each scraper, can be enabled selectively.
Collection Interval
Receivers can scrape at various intervals:
hostmetrics:
  collection_interval: 10s
Depending on the data retention window and the granularity needed, expect to set this between 5s and 30s (in increments of 5s), although any interval can be used.
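Since the interval is set per receiver instance, one way to mix granularities is to declare more than one hostmetrics receiver. The fragment below is a minimal sketch of that idea: the instance names hostmetrics/fast and hostmetrics/slow are hypothetical, and the split of scrapers between them is only an assumption about which signals need finer resolution.

```yaml
receivers:
  # Hypothetical instance: fast-moving signals at a short interval.
  hostmetrics/fast:
    collection_interval: 10s
    scrapers:
      cpu:
      memory:
  # Hypothetical instance: slow-moving signals at a longer interval.
  hostmetrics/slow:
    collection_interval: 30s
    scrapers:
      filesystem:
```

Both instances would then need to be listed under the receivers of a metrics pipeline to take effect.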
CPU and Memory
CPU and memory metrics can be enabled; however, it is important to know which ones are on by "default". Let's walk through a config:
hostmetrics:
  collection_interval: 10s
  scrapers:
    cpu:
      metrics:
        system.cpu.utilization:
          enabled: true
    memory:
      metrics:
        system.memory.utilization:
          enabled: true
In the config above, simply enabling the cpu scraper populates system.cpu.time, which reports CPU time as a sum. Enabling system.cpu.utilization is recommended as well, since it provides percentage-style metrics as a gauge, a form operators may be more familiar with.
It's important to note that states such as idle, wait, and user are exposed as attributes, not as unique metric names. So instead of querying a metric named system.cpu.utilization.user, one must filter system.cpu.utilization by an attribute / tag; the exact syntax depends on the backend.
Similarly, system.memory.utilization offers gauge-like metrics that may be easier to work with. Here too, states such as used and free are exposed as attributes rather than as unique metric names.
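Because states are attribute values rather than metric names, anything that selects a single state has to match on an attribute. As one collector-side illustration, the sketch below uses the contrib filter processor to drop the idle data points of system.cpu.utilization before export. The instance name filter/drop_idle is hypothetical, the attribute key is assumed to be state, and whether dropping idle is desirable depends entirely on the backend; treat this as an illustration of attribute matching, not a recommendation.

```yaml
processors:
  # Hypothetical instance name; any "filter/<name>" would do.
  filter/drop_idle:
    error_mode: ignore
    metrics:
      datapoint:
        # Data points matching this condition are dropped.
        - 'metric.name == "system.cpu.utilization" and attributes["state"] == "idle"'
```

The processor only takes effect once it is listed under the processors of a metrics pipeline.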
Disk and Filesystem
Disk performance and overall filesystem usage can be added as shown in the config below:
hostmetrics:
  collection_interval: 10s
  scrapers:
    disk:
    filesystem:
      metrics:
        system.filesystem.utilization:
          enabled: true
Important metrics, especially system.disk.io, are populated by the disk scraper, with attributes that break read and write activity down by device. The filesystem scraper adds filesystem usage (space used vs. free), and system.filesystem.utilization provides gauge-like metrics. Again, free and used are exposed as attributes rather than as separate metric names.
Network
To get network I/O, connections, and errors/dropped packets:
hostmetrics:
  collection_interval: 10s
  scrapers:
    network:
These metrics can be quite valuable, as they can pinpoint network issues by device, protocol, and direction (receive / transmit).
Resource Attributes
Decorating metrics with resource attributes is crucial for querying them by information such as "host", "region", "provider", etc.:
processors:
  resourcedetection:
    detectors: [gcp, ecs, ec2, azure, system]
    override: true
Keep in mind that detector order matters: the first detector to insert a given attribute wins.
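The fragments in this post omit the top-level keys and the pipeline wiring that a complete collector config needs. The sketch below shows one way they could fit together. It assumes a collector build that includes the hostmetrics receiver, the resourcedetection processor, and the debug exporter; the debug exporter is only a stand-in for whatever exporter the backend requires.

```yaml
receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
      memory:
        metrics:
          system.memory.utilization:
            enabled: true
      disk:
      filesystem:
        metrics:
          system.filesystem.utilization:
            enabled: true
      network:

processors:
  resourcedetection:
    detectors: [gcp, ecs, ec2, azure, system]
    override: true

exporters:
  # Assumption: placeholder exporter that prints to the collector's own output.
  # Replace with the exporter for your backend.
  debug:

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      processors: [resourcedetection]
      exporters: [debug]
```

Trimming the detectors list to the providers actually in use keeps the detector order easier to reason about.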
Part 2
In Part 2, we will explore relevant metrics and metadata decoration within a Kubernetes environment.