P2 case_file

Elasticsearch Capacity Planning for High-Traffic Event Readiness

Sizing a three-node Elasticsearch cluster for a high-traffic event window, with cluster-health gates between steps and a documented rollback path.

Status

Completed

Timeframe

Pre-event change window

Environment

Production · Azure

Cloud Database Observability Production

Context

Three production Elasticsearch nodes in Azure needed to be sized for an upcoming high-traffic event. The cluster served search and analytics paths whose traffic the event would amplify.

Problem

Existing capacity left thin headroom under the projected load. The resize had to be completed before the event window, with controlled risk, validated cluster health, and a clear rollback path.

My role

Capacity planner and operator: defined the sizing target, executed the resize, and ran the cluster-health checks.

Technical actions

[01] Increased vCPU and RAM on all three Elasticsearch nodes.
[02] Adjusted JVM heap to roughly 50% of available RAM, capped at Elasticsearch's recommended heap ceiling (the compressed-oops threshold, just under 32 GB).
[03] Validated cluster state, shard allocation and recovery via the cluster health API between steps.
[04] Documented the rollback path in case post-resize behaviour deviated from baseline.
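
The heap rule in step [02] can be sketched as a small calculation. This is an illustrative sketch, not the tooling used in the case: the function names are hypothetical, and the 26 GiB default ceiling is an assumption chosen as a conservative value for the compressed-oops threshold, which varies by JVM and should be verified on the actual nodes.

```python
# Hypothetical helper: derive a JVM heap size from node RAM.
# Assumption: 26 GiB is a conservative ceiling for the compressed-oops
# threshold; the real threshold depends on the JVM and must be checked per node.
HEAP_CEILING_GIB = 26


def recommended_heap_gib(ram_gib: float, ceiling_gib: int = HEAP_CEILING_GIB) -> int:
    """Return a heap size in whole GiB: half of RAM, capped at the ceiling."""
    if ram_gib <= 0:
        raise ValueError("ram_gib must be positive")
    # Round down to whole GiB, but never go below 1 GiB.
    return max(1, int(min(ram_gib / 2, ceiling_gib)))


def jvm_options(heap_gib: int) -> str:
    """Render matching -Xms/-Xmx lines so minimum and maximum heap are equal."""
    return f"-Xms{heap_gib}g\n-Xmx{heap_gib}g"


if __name__ == "__main__":
    # Example node sizes are illustrative, not the sizes used in this case.
    for ram in (32, 64, 128):
        print(f"{ram} GiB RAM -> {recommended_heap_gib(ram)} GiB heap")
```

Keeping the remaining RAM outside the heap leaves room for the filesystem cache, which Elasticsearch relies on for search performance.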

Operational impact

Cluster prepared and validated for the event window with a documented rollback path. Sizing decisions captured for future event-readiness work.

Evidence

[✓] vCPU and RAM increased on all three nodes.
[✓] JVM heap aligned to ~50% of available RAM, within Elasticsearch's recommended ceiling.
[✓] Cluster state, shard allocation and recovery validated via the cluster health API between steps.
[✓] Rollback path documented before the event window opened.
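
A health gate of the kind described above could look like the following sketch. It is an assumption about how such a gate might be written, not a record of the commands actually run: the function names and the base URL are hypothetical, and authentication and TLS are omitted. The fields it reads (`status`, `number_of_nodes`, `relocating_shards`, `initializing_shards`, `unassigned_shards`) are part of the `GET /_cluster/health` response.

```python
import json
import urllib.request

EXPECTED_NODES = 3  # from the case: three production nodes


def fetch_health(base_url: str) -> dict:
    """GET /_cluster/health and parse the JSON body.

    Assumption: base_url is reachable without auth, e.g. "http://localhost:9200".
    """
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=10) as resp:
        return json.load(resp)


def gate_failures(health: dict, expected_nodes: int = EXPECTED_NODES) -> list[str]:
    """Return the reasons the gate fails; an empty list means proceed."""
    failures = []
    if health.get("status") != "green":
        failures.append(f"status is {health.get('status')!r}, expected 'green'")
    if health.get("number_of_nodes") != expected_nodes:
        failures.append(
            f"number_of_nodes is {health.get('number_of_nodes')!r}, "
            f"expected {expected_nodes}"
        )
    # Any shard still moving, initializing, or unassigned means recovery
    # has not finished; a missing field is treated as a failure too.
    for field in ("relocating_shards", "initializing_shards", "unassigned_shards"):
        if health.get(field) != 0:
            failures.append(f"{field} is {health.get(field)!r}, expected 0")
    return failures
```

In use, the next resize step would start only when `gate_failures(fetch_health(base_url))` returns an empty list; any non-empty result is a reason to pause or follow the rollback path.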

What this demonstrates

Capacity planning tied to a real workload event, not abstract benchmarks.
Working knowledge of Elasticsearch operational constraints (heap, shards, cluster state).
Treating resizes as production changes with health gates and rollback.

Why this matters

Capacity planning sounds tidy in slides. In production it is a sequence of small, hard-to-reverse changes you would rather not make at peak. This case is the kind of work that earns the slide.