Table of Contents
Introduction
Why Structured Logging Is Non-Negotiable in Modern Systems
Real-World Scenario: Diagnosing a Drone Fleet Communication Outage During Emergency Response
Core Architecture of a Lightweight Log Aggregator
Complete, Error-Free Python Implementation
Running and Validating the Aggregator in Real Time
Conclusion
Introduction
In today’s world of distributed systems—whether managing autonomous vehicles, cloud-native microservices, or IoT sensor networks—debugging without structured logs is like navigating a storm without radar. Unstructured text logs drown engineers in noise. But when every log entry is a consistent, machine-readable JSON object, you gain surgical precision in diagnosing failures.
In this guide, you’ll build a real-time, terminal-based log aggregator that ingests, parses, filters, and color-codes structured JSON logs—designed for speed, clarity, and zero external dependencies.
Why Structured Logging Is Non-Negotiable in Modern Systems
Structured logging means every log event is a JSON object with standardized fields such as:
timestamp
level (e.g., info, error)
service or component
device_id or trace_id
message
Domain-specific context (e.g., battery_level, gps_coords)
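For example, a producer can emit such an entry with a single json.dumps call. The following is a minimal, illustrative sketch: the log_event helper and the field values are hypothetical, but the field names follow the list above.

```python
import json
import time

def log_event(level: str, service: str, device_id: str, message: str, **context):
    """Emit one structured log event as a single JSON line on stdout."""
    entry = {
        "timestamp": time.time(),  # epoch seconds; an ISO-8601 string works just as well
        "level": level,
        "service": service,
        "device_id": device_id,
        "message": message,
        **context,  # domain-specific context, e.g. battery_level, gps_coords
    }
    print(json.dumps(entry))

# Illustrative call: a low-battery heartbeat from one drone
log_event("error", "drone-telemetry", "drone-042", "Heartbeat missed",
          battery_level=12, signal_strength=-98)
```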
This structure enables:
Instant correlation across devices or services using shared IDs
Automated alerting on specific error patterns
Efficient filtering during live incidents
Seamless integration with observability pipelines
Without it, you’re reduced to manual log grepping—a luxury you can’t afford when seconds count.
Real-World Scenario: Diagnosing a Drone Fleet Communication Outage During Emergency Response
It’s 3 a.m. A wildfire is spreading in a remote mountain region. Your company’s emergency response drone fleet—200 autonomous units—has been deployed to map the fire perimeter and relay thermal data to ground teams.
Suddenly, field coordinators report that 40% of drones have gone silent. No telemetry. No status updates.
With unstructured logs:
You’d comb through terabytes of mixed-format logs from ground stations, cloud APIs, and drone telemetry streams—hoping to spot a pattern.
With structured JSON logs and your aggregator:
You run:
```bash
cat drone_logs.jsonl | python log_agg.py --filter 'status=offline' --filter 'region=mountain_north'
```
Within seconds, you see:
All affected drones share firmware version v2.1.8
The last heartbeat shows battery_level=12% and signal_strength=-98dBm
A recent ota_update event failed with error_code=CONN_TIMEOUT
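In the JSONL stream, one of those matching heartbeat entries might look roughly like this; every value here is purely illustrative:

```json
{"timestamp": "2025-10-11T03:02:17Z", "level": "error", "service": "drone-telemetry", "device_id": "drone-117", "status": "offline", "region": "mountain_north", "firmware": "v2.1.8", "battery_level": 12, "signal_strength": -98, "last_event": "ota_update", "error_code": "CONN_TIMEOUT", "message": "Heartbeat lost after OTA update"}
```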
Root cause: A faulty over-the-air update pushed to drones in low-connectivity zones caused a crash loop.
You roll back the update, restore communication, and save the mission—all before sunrise.
This isn’t fiction. It’s the reality of operating safety-critical autonomous systems at scale.
Core Architecture of a Lightweight Log Aggregator
Our aggregator is built for simplicity and real-time use:
Input: Reads newline-delimited JSON (JSONL) from stdin—ideal for piping from kubectl logs, journalctl, or log files.
Parser: Safely decodes each line as JSON; skips malformed entries without crashing.
Filter Engine: Supports key=value filters (case-insensitive) to isolate relevant events.
Output: Pretty-prints matching logs with color-coded levels, aligned fields, and custom context.
It’s designed to be Unix-friendly—compose it with grep or awk, or redirect its output to a file.
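For instance, a live Kubernetes triage session might look like the pipeline below. This is only a sketch: the deployment name and filter values are placeholders, and it assumes kubectl is available and the script is saved as log_agg.py.

```bash
# Stream a deployment's logs, keep only error-level events for one device,
# and save a copy of the filtered output for later analysis.
kubectl logs -f deploy/telemetry-gateway \
  | python log_agg.py --filter 'level=error' --filter 'device_id=AB1234' \
  | tee errors.log
```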
Complete, Error-Free Python Implementation
```python
import sys
import json
import argparse
from datetime import datetime
from typing import Any, Dict, Iterable, List, Optional
# --- Log Aggregator Script ---

class Colors:
    """ANSI color codes for terminal output."""
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKCYAN = '\033[96m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'

COLOR_MAP = {
    'ERROR': Colors.FAIL + Colors.BOLD,
    'WARNING': Colors.WARNING,
    'INFO': Colors.OKCYAN,
    'DEBUG': Colors.OKBLUE,
    'TRACE': Colors.HEADER,
}

def parse_log_line(line: str) -> Optional[Dict[str, Any]]:
    """Attempts to parse a JSON string into a dictionary."""
    try:
        # Use json.loads to parse the log line
        return json.loads(line.strip())
    except json.JSONDecodeError:
        # Silently ignore non-JSON lines for a robust tool
        return None

def parse_filters(filter_args: List[str]) -> Dict[str, str]:
    """Parses a list of 'key=value' strings into a filter dictionary."""
    filters = {}
    for arg in filter_args:
        if '=' in arg:
            key, value = arg.split('=', 1)
            filters[key.strip()] = value.strip()
    return filters

def matches_filters(log_entry: Dict[str, Any], filters: Dict[str, str]) -> bool:
    """Checks if a log entry matches all specified filters (case-insensitive)."""
    if not filters:
        return True
    for key, expected in filters.items():
        actual = log_entry.get(key)
        # If the key is not present, it doesn't match
        if actual is None:
            return False
        # Convert both to string and lowercase for case-insensitive comparison
        if str(actual).lower() != expected.lower():
            return False
    return True

def format_timestamp(ts: Any) -> str:
    """Formats a timestamp (ISO string or epoch) into HH:MM:SS.ms."""
    if not ts:
        return 'N/A'
    try:
        if isinstance(ts, (int, float)):
            dt = datetime.fromtimestamp(ts)
        else:
            # Handle ISO 8601, including 'Z', which datetime.fromisoformat doesn't like directly
            clean_ts = str(ts).replace('Z', '+00:00')
            dt = datetime.fromisoformat(clean_ts)
        # Format and truncate microseconds to milliseconds
        return dt.strftime('%H:%M:%S.%f')[:-3]
    except Exception:
        return 'Invalid'

def pretty_print(entry: Dict[str, Any]):
    """Formats and prints a log entry with color-coding and structured fields."""
    ts = format_timestamp(entry.get('timestamp'))
    level = str(entry.get('level', 'INFO')).upper()
    color = COLOR_MAP.get(level, Colors.WARNING)
    service = str(entry.get('service', 'unknown')).ljust(12)
    device_id = str(entry.get('device_id', '---'))[:10].ljust(10)
    msg = entry.get('message', 'No message')
    line = (
        f"[{Colors.BOLD}{ts}{Colors.ENDC}] "
        f"[{color}{level:<5}{Colors.ENDC}] "
        f"{Colors.OKGREEN}{service}{Colors.ENDC} "
        f"({Colors.OKBLUE}ID:{device_id}{Colors.ENDC}): {msg}"
    )
    # Append extra fields that aren't part of the main output
    reserved = {'timestamp', 'level', 'service', 'device_id', 'message'}
    extras = [f"{k}={repr(v)}" for k, v in entry.items() if k not in reserved]
    if extras:
        line += f" {Colors.WARNING}| {', '.join(extras)}{Colors.ENDC}"
    print(line)
def process_logs(input_lines: Iterable[str], filters: Dict[str, str]):
    """Main log processing loop; accepts any iterable of lines (list, file, stdin)."""
    for line in input_lines:
        entry = parse_log_line(line)
        if entry is not None and matches_filters(entry, filters):
            pretty_print(entry)
def main(input_lines: Optional[Iterable[str]] = None):
    """Parses command-line arguments and orchestrates log processing."""
    parser = argparse.ArgumentParser(
        description="Structured log aggregator for JSONL input.",
        epilog="Example: cat logs.jsonl | python log_agg.py --filter 'level=error'"
    )
    # action='append' lets --filter be repeated; an entry must satisfy every filter
    parser.add_argument('-f', '--filter', action='append', default=[],
                        help="Filter logs by key=value. Can be specified multiple times.")
    args = parser.parse_args()
    filter_dict = parse_filters(args.filter)
    # Stream from the supplied lines (handy for tests and notebooks) or from stdin
    source = input_lines if input_lines is not None else sys.stdin
    process_logs(source, filter_dict)

if __name__ == '__main__':
    main()
```
Running and Validating the Aggregator in Real Time
Save the script as log_agg.py and pipe any JSONL source into it, exactly as in the drone scenario above. If you are working in a notebook, where piping to sys.stdin is awkward, you can exercise the same functions directly by simulating the input:

```python
# 1. Define sample input data (JSON Lines format)
LOG_DATA = """
{"timestamp": "2025-10-11T08:45:00.123Z", "level": "INFO", "service": "auth-api", "device_id": "AB1234", "message": "User logged in successfully", "user_id": 101}
{"timestamp": 1678886400.456, "level": "error", "service": "data-proc", "device_id": "CD5678", "message": "Database connection failed", "attempt": 3}
{"timestamp": "2025-10-11T08:45:01.789Z", "level": "DEBUG", "service": "auth-api", "device_id": "AB1234", "message": "Checking session token"}
{"timestamp": "2025-10-11T08:45:02.001Z", "level": "WARNING", "service": "file-store", "device_id": "EF9012", "message": "Low disk space alert", "disk_pct": 95.5}
"""

# 2. Filter for logs where 'service' is 'auth-api' AND 'level' is 'INFO'
filter_args = ['service=auth-api', 'level=info']
filter_dict = parse_filters(filter_args)

print(f"--- Log Aggregator Output (Filtered by: {filter_dict}) ---")
# 3. Process the log data, splitting the multi-line string into individual lines
process_logs(LOG_DATA.strip().split('\n'), filter_dict)

print("\n--- Unfiltered Output ---")
# 4. An empty filter dictionary matches every entry
process_logs(LOG_DATA.strip().split('\n'), {})
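```

Outside a notebook, the same check is a one-line shell pipeline; the file name below is just an example, assuming you have saved the four sample lines to sample.jsonl:

```bash
# Filter the sample file for INFO-level events from the auth-api service
python log_agg.py --filter 'service=auth-api' --filter 'level=info' < sample.jsonl
```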
Conclusion
Structured logging isn’t just a best practice—it’s a lifeline in high-stakes, real-time systems. By enforcing JSON formatting and embedding contextual fields like device_id, region, and firmware, you transform logs from passive records into active diagnostic tools. The aggregator above gives you immediate observability with zero infrastructure. No Elasticsearch. No Kibana. Just pure, piped clarity. Start logging in JSON today. Your future self—debugging at 3 a.m. during a crisis—will thank you.