Proactive Log Management Across Production Services
Context
Multiple production services — Elasticsearch, RabbitMQ, Nginx — were emitting logs at a rate that would, left unmanaged, eventually exhaust disk capacity and trigger reactive incidents.
Problem
Logs were valuable for incident response but expensive in disk usage. A one-size-fits-all strategy would not work: each service had different rotation, retention, and access patterns.
My role
Operational owner of the initiative: chose the appropriate tool per service and shipped the changes as routine maintenance rather than as after-the-fact recovery.
Technical actions
- [01] Designed logrotate-based retention and compression policies for Elasticsearch and Nginx logs.
- [02] Configured size- and time-based rotation rules to match each service's behaviour.
- [03] Added cron-based size checks for services where logrotate alone was insufficient.
- [04] Validated post-rotation file ownership and permissions so services kept writing without manual intervention.
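A policy along the lines of [01], [02], and [04] can be sketched as a logrotate drop-in. This is an illustrative fragment, not the production config: the paths, the 14-day retention, the 100M size trigger, and the www-data/adm ownership are all assumptions standing in for the real per-service values.

```
# Hypothetical /etc/logrotate.d/nginx — size- and time-based rotation
/var/log/nginx/*.log {
    daily               # rotate at least daily...
    size 100M           # ...or sooner if the file passes 100 MB
    rotate 14           # keep 14 generations
    compress
    delaycompress       # leave the newest rotation uncompressed
    missingok
    notifempty
    # [04]: recreate the file with the ownership the service expects,
    # so it keeps writing without manual intervention
    create 0640 www-data adm
    sharedscripts
    postrotate
        # ask nginx to reopen its log files
        [ -f /var/run/nginx.pid ] && kill -USR1 "$(cat /var/run/nginx.pid)"
    endscript
}
```

The `create` directive is what makes the post-rotation permission check in [04] mostly a verification step rather than a manual fix.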
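For [03], a cron-driven size check covers cases logrotate handles poorly, such as a single ever-open log file. A minimal sketch, assuming a copy-then-truncate approach so the service keeps its open file handle; the RabbitMQ path, the 500 MB threshold, and the script name in the cron line are hypothetical.

```shell
#!/bin/sh
# Rotate a log in place once it exceeds a size threshold.
# Copy-then-truncate keeps the writing service's file handle valid.

rotate_if_large() {
    log=$1        # log file to check
    limit_kb=$2   # rotate when disk usage exceeds this many KB
    [ -f "$log" ] || return 0
    size_kb=$(du -k "$log" | cut -f1)
    if [ "$size_kb" -gt "$limit_kb" ]; then
        stamp=$(date +%Y%m%d%H%M%S)
        cp "$log" "$log.$stamp"   # copy the current contents aside...
        : > "$log"                # ...then truncate the live file in place
        gzip "$log.$stamp"        # compress the rotated copy
    fi
}

# Hypothetical cron entry invoking this script:
#   */10 * * * * root /usr/local/bin/log-size-check.sh
rotate_if_large /var/log/rabbitmq/rabbit.log $((500 * 1024))
```

Because truncation happens on the original inode, no service restart or signal is needed, which is the main reason to prefer this over a plain `mv` for always-open logs.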
Operational impact
Disk-usage risk reduced across the affected fleet. The initiative removed a class of avoidable late-night incidents and made log volume a planned cost instead of a surprise.
What this demonstrates
- SRE-style preventative work, not just reactive firefighting.
- Per-service log strategy instead of a copy-pasted policy.
- Comfort working at the Linux/systemd layer where production actually lives.
Why this matters
The cheapest production incident is the one prevented two weeks earlier by a logrotate rule. Most engineers wait for the page; some go and write the rule.