Using OpenSearch in Real Projects: A Practical Example and Lessons Learned

I didn’t start using OpenSearch because I wanted a search engine.
I started using it because our engineering team was drowning in logs.
We were managing multiple microservices, each writing its own log file.
During incidents, everyone would open five terminals and manually grep through gigabytes of text. It worked… until it didn’t.
That is where OpenSearch entered the picture.

Why We Picked OpenSearch

The decision wasn’t driven by anything fancy. We needed:

  • A way to search logs quickly

  • A dashboard to visualize spikes and failures

  • A tool that didn’t require a license nightmare

  • Something easy to deploy on regular infrastructure

OpenSearch checked all those boxes, and more importantly, we could start small.

The Practical Example: API Error Monitoring System

Let me walk through the exact setup we built — simple but powerful.

Step 1: Sending Logs to OpenSearch

Each microservice (Node.js, Java, and Python) wrote logs in JSON format.
Instead of shipping raw text, we forwarded logs to OpenSearch using Filebeat.

A typical log entry looked like this:

{
  "timestamp": "2025-01-15T14:21:10Z",
  "service": "payment-api",
  "level": "error",
  "message": "Transaction failed due to invalid token",
  "userId": "U3021",
  "latencyMs": 842
}
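
Under the hood, Filebeat batches entries like this into bulk index requests. Conceptually, each log line lands in OpenSearch roughly as in this sketch (the daily index name is a placeholder for whatever naming scheme you pick):

POST _bulk
{ "index": { "_index": "logs-2025.01.15" } }
{ "timestamp": "2025-01-15T14:21:10Z", "service": "payment-api", "level": "error", "message": "Transaction failed due to invalid token", "userId": "U3021", "latencyMs": 842 }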

The moment logs started flowing in, OpenSearch automatically created an index.
But this is where we made the first improvement: we defined our own index mapping.

Why? Because the default mapping treated latencyMs as text.
Try running aggregations on text — it won’t end well.

Our custom mapping:

{
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" },
      "service": { "type": "keyword" },
      "level": { "type": "keyword" },
      "message": { "type": "text" },
      "userId": { "type": "keyword" },
      "latencyMs": { "type": "integer" }
    }
  }
}

Just this small step improved reliability dramatically.
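
If you are applying a mapping like this yourself, one convenient option (a sketch, assuming a logs-* naming pattern, which is only an illustration) is to wrap it in an index template so every new daily index inherits it instead of relying on dynamic guessing:

PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "mappings": {
      "properties": {
        "timestamp": { "type": "date" },
        "service": { "type": "keyword" },
        "level": { "type": "keyword" },
        "message": { "type": "text" },
        "userId": { "type": "keyword" },
        "latencyMs": { "type": "integer" }
      }
    }
  }
}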

Step 2: Querying Errors Using DSL

Once the logs landed in OpenSearch, we started exploring the Query DSL.
One of the first real queries we wrote was:

“Show me all payment-api errors in the last 15 minutes.”

{
  "query": {
    "bool": {
      "must": [
        { "term": { "service": "payment-api" }},
        { "term": { "level": "error" }}
      ],
      "filter": {
        "range": {
          "timestamp": {
            "gte": "now-15m"
          }
        }
      }
    }
  }
}

The speed surprised everyone.
What took minutes with manual grepping became a sub-second search.
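
Because message is mapped as text while service and level are keywords, full-text and exact-match conditions combine naturally. The query below is a sketch of that pattern rather than one from our runbook; the index pattern and search phrase are just examples:

GET logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "message": "invalid token" }}
      ],
      "filter": [
        { "term": { "service": "payment-api" }},
        { "range": { "timestamp": { "gte": "now-15m" }}}
      ]
    }
  }
}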

Step 3: Error Spike Alerts

Once we saw how quickly queries worked, we started thinking:
Can OpenSearch tell us when something unusual happens?

Yes — through alerting.

We set up a rule:

  • Every minute, count errors in the payment-api service

  • If error count rises above 20 per minute

  • Send a Slack alert to the DevOps team

The query and aggregation behind the rule:

{
  "size": 0,
  "query": {
    "term": { "service": "payment-api" }
  },
  "aggs": {
    "errors_per_minute": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "1m"
      },
      "aggs": {
        "error_count": {
          "filter": { "term": { "level": "error" }}
        }
      }
    }
  }
}

This small automation saved us from multiple late-night outages.
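
The rule itself lived in the Alerting plugin. The sketch below shows roughly how such a monitor is defined through the API; the monitor name, threshold, and Slack channel ID are placeholders, and the exact trigger format varies slightly between OpenSearch versions:

POST _plugins/_alerting/monitors
{
  "type": "monitor",
  "name": "payment-api error spike",
  "monitor_type": "query_level_monitor",
  "enabled": true,
  "schedule": { "period": { "interval": 1, "unit": "MINUTES" } },
  "inputs": [{
    "search": {
      "indices": ["logs-*"],
      "query": {
        "size": 0,
        "query": {
          "bool": {
            "filter": [
              { "term": { "service": "payment-api" }},
              { "term": { "level": "error" }},
              { "range": { "timestamp": { "gte": "now-1m" }}}
            ]
          }
        }
      }
    }
  }],
  "triggers": [{
    "name": "more-than-20-errors-per-minute",
    "severity": "2",
    "condition": {
      "script": {
        "source": "ctx.results[0].hits.total.value > 20",
        "lang": "painless"
      }
    },
    "actions": [{
      "name": "notify-devops",
      "destination_id": "<slack-channel-config-id>",
      "message_template": { "source": "payment-api error count exceeded 20 in the last minute" }
    }]
  }]
}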

Step 4: Creating a Dashboard

Once the data came alive, dashboards became addictive.

Our main dashboard included:

  • Error rate over time

  • Top error messages

  • Slowest endpoints

  • Error distribution by service

  • User-ID based error frequency (helped detect misuse or bugs)

One chart that the business team loved showed API latency trends.
It revealed something unexpected:
The payment API slowed down only during weekends — which aligned exactly with peak traffic.

We would never have spotted this pattern with plain log files.
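
The chart itself was built in OpenSearch Dashboards, but the aggregation behind it boils down to roughly this (a sketch; the index pattern and the hourly bucket size are assumptions):

GET logs-*/_search
{
  "size": 0,
  "query": {
    "term": { "service": "payment-api" }
  },
  "aggs": {
    "latency_over_time": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "1h"
      },
      "aggs": {
        "avg_latency": { "avg": { "field": "latencyMs" } }
      }
    }
  }
}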

What We Learned the Hard Way

1. Bad mappings lead to painful debugging

Our first few days were spent wondering why aggregations didn’t work.
The root cause: OpenSearch treated numbers as strings.

2. Too many small indices slow down queries

We initially created a new index every hour.
It felt organized, but it wrecked performance.
Switching to one daily index fixed it.
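
A quick way to see how fragmented an index layout has become is the _cat API (the logs-* pattern is our naming scheme, used here as a placeholder):

GET _cat/indices/logs-*?v&h=index,docs.count,store.size&s=index

Every index carries at least one shard, and every shard has its own memory and file-handle overhead, so hundreds of tiny hourly indices add up quickly.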

3. Dashboards can become operational tools

Developers started watching the dashboard after every production deployment.
This helped catch issues faster than waiting for users to report them.

4. Snapshots are not optional

We once lost data due to a corrupted node and no snapshots.
After that, automated S3 snapshots became mandatory.
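
If you are setting this up, the rough shape looks like the sketch below. The bucket, region, repository name, and schedule are placeholders; S3 repositories also need the repository-s3 plugin and AWS credentials configured on the cluster, and the Snapshot Management API shown here ships with newer OpenSearch releases (on older ones, a cron job calling the snapshot API does the same job):

PUT _snapshot/s3-backups
{
  "type": "s3",
  "settings": {
    "bucket": "my-opensearch-snapshots",
    "region": "us-east-1"
  }
}

POST _plugins/_sm/policies/daily-logs-snapshot
{
  "description": "Nightly snapshot of log indices",
  "creation": {
    "schedule": { "cron": { "expression": "0 2 * * *", "timezone": "UTC" } }
  },
  "snapshot_config": {
    "repository": "s3-backups",
    "indices": "logs-*"
  }
}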

Where OpenSearch Fits Best in Daily Engineering Work

Through this practical use case, a few patterns emerged:

  • It excels in log analytics

  • It is perfect for building search features quickly

  • It works well for real-time trend visualization

  • Alerts reduce manual monitoring drastically

But it’s not meant to replace transactional databases or heavy BI systems.
Its strength lies in fast search and analysis.

Final Thoughts

Our first OpenSearch setup was extremely simple, but it brought an immediate jump in visibility and debugging speed.
Over a few months, it went from a basic log viewer to a crucial monitoring system for all microservices.

For anyone starting out, I’d recommend:

  • Begin small

  • Define mappings

  • Build dashboards slowly

  • Add alerting once queries stabilize

  • Automate snapshots from day one

OpenSearch grows with you, and the more data you feed it, the more valuable it becomes.