
Design a Circuit Breaker for Microservices (Like Hystrix) Using Python

Table of Contents

  • Introduction

  • What Is a Circuit Breaker and Why Do Microservices Need It?

  • Real-World Scenario: Preventing Cascading Failures in a Ride-Sharing App

  • Core States of a Circuit Breaker

  • Complete, Error-Free Python Implementation

  • Testing the Circuit Breaker in Action

  • Best Practices and Production Tips

  • Conclusion

Introduction

In a system of interconnected microservices, one slow or failing service can bring down the entire application. This is called a cascading failure, and it is a failure mode that large platforms such as Uber, Netflix, and Amazon have to design against.

The circuit breaker pattern, popularized by libraries like Netflix’s Hystrix, acts like an electrical fuse: when a service fails too often, the circuit “trips” and stops sending requests—giving the failing service time to recover while your app stays responsive.

In this article, you’ll build a lightweight circuit breaker in pure Python, with no external dependencies, and see how it could protect a ride-sharing platform during a simulated outage.

What Is a Circuit Breaker and Why Do Microservices Need It?

A circuit breaker wraps calls to external services (like payment, maps, or user profiles) and monitors for failures. It has three states:

  • Closed: Requests flow normally. Failures are counted.

  • Open: After too many failures, the circuit opens—all requests fail fast without hitting the broken service.

  • Half-Open: After a timeout, the circuit allows a limited number of test requests through (a single request in the implementation below). If they succeed, it closes again; if not, it reopens.

This prevents threads from blocking, reduces load on failing systems, and keeps your user experience smooth—even during partial outages.
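The three states and the transitions between them can be written down as a small lookup table. The sketch below is illustrative only: the event names (`threshold_reached`, `timeout_elapsed`, `probe_succeeded`, `probe_failed`) are labels assumed for this example and do not come from any library.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# (current state, event) -> next state. Event names are hypothetical labels.
TRANSITIONS = {
    (State.CLOSED, "threshold_reached"): State.OPEN,
    (State.OPEN, "timeout_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "probe_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "probe_failed"): State.OPEN,
}


def next_state(state: State, event: str) -> State:
    """Return the next state, or the same state if the event does not apply."""
    return TRANSITIONS.get((state, event), state)
```

Any pair that is missing from the table leaves the state unchanged, which is why a failed call in CLOSED does nothing until the threshold is reached.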

Real-World Scenario: Preventing Cascading Failures in a Ride-Sharing App

Imagine it’s Friday night. Rides are surging. Your app calls three services per ride request:

  1. User Service (to verify rider)

  2. Pricing Service (to calculate fare)

  3. Dispatch Service (to assign a driver)

Suddenly, the Pricing Service slows down due to a database lock: responses take 10 seconds instead of 100 ms. Without a circuit breaker:

  • Each ride request blocks for 10+ seconds

  • Threads pile up

  • The User and Dispatch services get overwhelmed by queued requests

  • The entire app becomes unresponsive

With a circuit breaker:

  • After 5 consecutive failed calls (a slow call counts as a failure once a client-side timeout turns it into an error), the Pricing circuit opens

  • New requests fail instantly with a fallback (“Using base fare”)

  • User and Dispatch services stay healthy

  • Riders still get cars—just with estimated pricing


Your app degrades gracefully instead of collapsing.
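The fallback path in this scenario might look like the sketch below. It assumes the breaker signals an open circuit with an exception; `CircuitBreakerOpen` here is a local stand-in, and `BASE_FARE`, `quote_fare`, and `pricing_unavailable` are hypothetical names used only for illustration.

```python
from typing import Callable

BASE_FARE = 8.50  # hypothetical default fare, for illustration only


class CircuitBreakerOpen(Exception):
    """Stand-in for the exception a breaker raises while it is open."""


def quote_fare(get_price: Callable[[str], float], ride_id: str) -> dict:
    """Return a live price, or the base fare when the pricing circuit is open."""
    try:
        return {"fare": get_price(ride_id), "estimated": False}
    except CircuitBreakerOpen:
        # Fail fast: no request reaches the Pricing Service on this path.
        return {"fare": BASE_FARE, "estimated": True}


def pricing_unavailable(ride_id: str) -> float:
    # Simulates a breaker-protected call made while the circuit is open.
    raise CircuitBreakerOpen("pricing circuit is open")
```

Flagging the result as estimated lets the rest of the app decide how to present the fare, for example by showing it as approximate.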

Core States of a Circuit Breaker

Our implementation tracks:

  • Failure threshold: Number of consecutive failures that trips the circuit (e.g., 5)

  • Timeout: How long to wait before testing recovery (e.g., 15 seconds)

  • Failure count and last failure time

  • Current state: CLOSED, OPEN, or HALF_OPEN

When open, it raises a CircuitBreakerOpen exception—allowing you to return a cached or default response immediately.
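The failure count described here is a plain counter. If a time-windowed threshold is wanted instead (failures within the last N seconds), one possible approach is to keep a deque of failure timestamps. This is a sketch under that assumption; `FailureWindow` is a hypothetical helper, and the caller is expected to pass in the current time, for example from `time.monotonic()`.

```python
from collections import deque


class FailureWindow:
    """Counts failures that occurred within the last `window_seconds`."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, now: float) -> None:
        """Store the time of one failure."""
        self._timestamps.append(now)

    def count(self, now: float) -> int:
        """Return how many recorded failures are still inside the window."""
        # Drop timestamps that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps)
```

A breaker built on this would trip when `count(now)` reaches the threshold, so old failures stop counting once they age out rather than only after a success.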

Complete, Error-Free Python Implementation

import time
from enum import Enum
from functools import wraps
from typing import Callable, Any
import random

# --- Circuit Breaker Classes ---

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreakerOpen(Exception):
    """Raised when the circuit breaker is open"""
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 15.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._failure_count = 0
        self._last_failure_time = None
        self._state = State.CLOSED
        print(f"CircuitBreaker initialized: Threshold={failure_threshold}, Timeout={recovery_timeout}s")

    def _call_if_closed(self, func: Callable, *args, **kwargs) -> Any:
        # While OPEN, either fail fast or, once the timeout has elapsed,
        # move to HALF_OPEN and let this call through as a probe.
        if self._state == State.OPEN:
            # time.monotonic() measures elapsed time and is not affected
            # by system clock adjustments.
            elapsed = time.monotonic() - self._last_failure_time
            if elapsed > self.recovery_timeout:
                self._state = State.HALF_OPEN
                print(f"[{time.strftime('%H:%M:%S')}] State transition: OPEN -> HALF_OPEN. Allowing test call...")
            else:
                raise CircuitBreakerOpen(f"Circuit is OPEN. Remaining wait: {self.recovery_timeout - elapsed:.1f}s")

        initial_state = self._state

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure(initial_state)
            raise  # Re-raise with the original traceback

        # Success: a successful probe closes the circuit, and any success
        # clears the failure count.
        if initial_state == State.HALF_OPEN:
            print(f"[{time.strftime('%H:%M:%S')}] State transition: HALF_OPEN -> CLOSED. Test call successful.")
        self._reset()
        return result

    def _record_failure(self, initial_state: State):
        self._failure_count += 1
        self._last_failure_time = time.monotonic()
        
        if initial_state == State.HALF_OPEN:
            # If a failure occurs in HALF_OPEN, transition immediately back to OPEN
            self._state = State.OPEN
            print(f"[{time.strftime('%H:%M:%S')}] State transition: HALF_OPEN -> OPEN. Test call failed.")
        elif self._failure_count >= self.failure_threshold:
            # Transition to OPEN from CLOSED
            self._state = State.OPEN
            print(f"[{time.strftime('%H:%M:%S')}] State transition: CLOSED -> OPEN. Failure threshold reached ({self._failure_count}).")
        else:
            # Still in CLOSED, but counting failures
            print(f"[{time.strftime('%H:%M:%S')}] Failure recorded. Count: {self._failure_count}/{self.failure_threshold}. State: {self._state.value}")

    def _reset(self):
        # Clear the failure history and return to CLOSED. The
        # HALF_OPEN -> CLOSED message is printed by _call_if_closed.
        self._failure_count = 0
        self._last_failure_time = None
        self._state = State.CLOSED

    def call(self, func: Callable, *args, **kwargs) -> Any:
        return self._call_if_closed(func, *args, **kwargs)

    def __call__(self, func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            return self._call_if_closed(func, *args, **kwargs)
        return wrapper

# --- Demo Setup ---

class ExternalServiceError(Exception):
    """Simulated error for external service failure"""
    pass

class ExternalService:
    def __init__(self, fail_rate: float = 0.5):
        self.fail_rate = fail_rate
        self.total_calls = 0

    def process_request(self, data):
        self.total_calls += 1
        print(f"[{time.strftime('%H:%M:%S')}] Service call #{self.total_calls}: Processing request for '{data}'...")
        
        # Simulate failure
        if random.random() < self.fail_rate:
            raise ExternalServiceError("Simulated Service Unavailability")
        
        return f"SUCCESS: Processed '{data}'"

# --- Main Demonstration ---

def main():
    # Configure the circuit breaker (opens after 3 consecutive failures, 10s wait)
    breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=10.0)
    
    # Configure the failing service (70% failure rate, high enough to open the circuit quickly)
    service = ExternalService(fail_rate=0.7)

    # Use the circuit breaker as a decorator
    @breaker
    def protected_service_call(data):
        return service.process_request(data)

    print("\n--- Phase 1: Failures & Opening the Circuit ---")
    
    # Up to 6 attempts. Three consecutive failures open the circuit, which is
    # likely but not guaranteed at a 70% failure rate.
    for i in range(6):
        try:
            print(f"\nAttempt {i+1}:")
            result = protected_service_call(f"data_{i}")
            print(f"RESULT: {result}")
        except CircuitBreakerOpen as e:
            print(f"RESULT: {e} - Circuit is OPEN, skipping call.")
        except ExternalServiceError as e:
            print(f"RESULT: Service call failed ({e}).")
        time.sleep(1) # Short delay between attempts

    print("\n--- Phase 2: Circuit is OPEN, waiting for timeout ---")
    
    # Attempts during the OPEN state
    wait_start = time.time()
    while time.time() - wait_start < breaker.recovery_timeout + 3: # Wait a bit longer than timeout
        try:
            print(f"\nAttempt in OPEN state:")
            protected_service_call("test_open")
            # Reaching this line means the timeout elapsed and this call ran
            # as the HALF_OPEN probe and succeeded, closing the circuit.
        except CircuitBreakerOpen as e:
            print(f"RESULT: {e} - Circuit is OPEN.")
        except ExternalServiceError as e:
            print(f"RESULT: Service call failed ({e}).")
        
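        # Note: this check does not fire in practice. A call that enters
        # HALF_OPEN always leaves it (to CLOSED or OPEN) before returning,
        # so the loop normally runs for its full duration.
        if breaker._state == State.HALF_OPEN: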
            print("Circuit is HALF-OPEN now! Proceeding to test...")
            break # Exit the waiting loop when HALF_OPEN
            
        time.sleep(2) # Longer delay in OPEN state

    print("\n--- Phase 3: HALF-OPEN Test (Success or Failure) ---")
    
    # Make the service succeed from here on to demonstrate a graceful recovery
    service.fail_rate = 0.0

    # If the recovery timeout has elapsed, this call runs as the single
    # HALF_OPEN probe; otherwise it is rejected with CircuitBreakerOpen.
    try:
        print(f"\nAttempt in HALF-OPEN state (Test Call):")
        result = protected_service_call("test_half_open_success")
        print(f"RESULT: {result}")
    except CircuitBreakerOpen as e:
        print(f"RESULT: {e} - ERROR: Should not be OPEN here.")
    except ExternalServiceError as e:
        print(f"RESULT: Service call failed ({e}). Back to OPEN!")
    
    print(f"\nFinal State: {breaker._state.value}")

if __name__ == "__main__":
    main()

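Output

Because failures are simulated with random.random(), every run differs. In the sample run below, the first probe after the timeout (service call #4, made during Phase 2) failed, so the circuit reopened and the 10-second wait started over. Phase 3 began before that second wait had elapsed, which is why its call was rejected and the final state is open.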

CircuitBreaker initialized: Threshold=3, Timeout=10.0s

--- Phase 1: Failures & Opening the Circuit ---

Attempt 1:
[19:52:07] Service call #1: Processing request for 'data_0'...
[19:52:07] Failure recorded. Count: 1/3. State: closed
RESULT: Service call failed (Simulated Service Unavailability).

Attempt 2:
[19:52:08] Service call #2: Processing request for 'data_1'...
[19:52:08] Failure recorded. Count: 2/3. State: closed
RESULT: Service call failed (Simulated Service Unavailability).

Attempt 3:
[19:52:09] Service call #3: Processing request for 'data_2'...
[19:52:09] State transition: CLOSED -> OPEN. Failure threshold reached (3).
RESULT: Service call failed (Simulated Service Unavailability).

Attempt 4:
RESULT: Circuit is OPEN. Remaining wait: 9.0s - Circuit is OPEN, skipping call.

Attempt 5:
RESULT: Circuit is OPEN. Remaining wait: 8.0s - Circuit is OPEN, skipping call.

Attempt 6:
RESULT: Circuit is OPEN. Remaining wait: 7.0s - Circuit is OPEN, skipping call.

--- Phase 2: Circuit is OPEN, waiting for timeout ---

Attempt in OPEN state:
RESULT: Circuit is OPEN. Remaining wait: 6.0s - Circuit is OPEN.

Attempt in OPEN state:
RESULT: Circuit is OPEN. Remaining wait: 4.0s - Circuit is OPEN.

Attempt in OPEN state:
RESULT: Circuit is OPEN. Remaining wait: 2.0s - Circuit is OPEN.

Attempt in OPEN state:
[19:52:19] State transition: OPEN -> HALF_OPEN. Allowing test call...
[19:52:19] Service call #4: Processing request for 'test_open'...
[19:52:19] State transition: HALF_OPEN -> OPEN. Test call failed.
RESULT: Service call failed (Simulated Service Unavailability).

Attempt in OPEN state:
RESULT: Circuit is OPEN. Remaining wait: 8.0s - Circuit is OPEN.

Attempt in OPEN state:
RESULT: Circuit is OPEN. Remaining wait: 6.0s - Circuit is OPEN.

Attempt in OPEN state:
RESULT: Circuit is OPEN. Remaining wait: 4.0s - Circuit is OPEN.

--- Phase 3: HALF-OPEN Test (Success or Failure) ---

Attempt in HALF-OPEN state (Test Call):
RESULT: Circuit is OPEN. Remaining wait: 2.0s - ERROR: Should not be OPEN here.

Final State: open
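Because the demo depends on random failures and real sleeps, its outcome changes from run to run. A deterministic check is possible if the time source is injectable. The sketch below assumes a reduced breaker, `TinyBreaker`, written only for this test; it is not the class above, and `FakeClock`, `failing`, and `run_scenario` are likewise hypothetical names.

```python
from typing import Callable


class BreakerOpen(Exception):
    """Raised by TinyBreaker while it is failing fast."""


class TinyBreaker:
    """Reduced breaker for testing; `clock` is an injectable time source."""

    def __init__(self, threshold: int, timeout: float, clock: Callable[[], float]):
        self.threshold, self.timeout, self.clock = threshold, timeout, clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func: Callable):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.timeout:
                raise BreakerOpen("failing fast")
            # Timeout elapsed: fall through and let one probe call run.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.threshold:
                self.opened_at = self.clock()  # open, or reopen after a failed probe
            raise
        self.failures, self.opened_at = 0, None  # success closes the circuit
        return result


class FakeClock:
    """Clock that only moves when the test advances it."""

    def __init__(self):
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


def failing():
    raise RuntimeError("service down")


def run_scenario() -> list:
    clock = FakeClock()
    breaker = TinyBreaker(threshold=2, timeout=10.0, clock=clock)
    events = []
    for _ in range(2):  # two failures trip the breaker
        try:
            breaker.call(failing)
        except RuntimeError:
            events.append("failed")
    try:
        breaker.call(lambda: "ok")  # still inside the timeout
    except BreakerOpen:
        events.append("rejected")
    clock.now += 10.0  # advance past the timeout without sleeping
    events.append(breaker.call(lambda: "ok"))  # probe succeeds and closes it
    events.append("closed" if breaker.opened_at is None else "open")
    return events
```

With this setup, `run_scenario()` walks through trip, fail-fast, and recovery in a fixed order, so it can run in a unit test suite with no waiting.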

Best Practices and Production Tips

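  • Set realistic thresholds: For non-critical services, start with something like 5 failures and tune from there. Note that the implementation above counts consecutive failures, not failures within a time window.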

  • Use meaningful fallbacks: Return cached data, default values, or queue requests for later.

  • Log state changes: Alert your team when circuits open—this is a key health signal.

  • Don’t share breakers across unrelated services: Each external dependency should have its own circuit.

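  • Combine with retries: Retry transient errors with exponential backoff, but do not retry when the breaker reports that the circuit is open, so retries cannot pile more load onto a service that is already failing.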

  • Monitor metrics: Track open/closed transitions, fallback rates, and recovery success.
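The retry advice above can be sketched as a small helper. This is one possible arrangement, not a prescribed one: `retry_with_backoff` is a hypothetical function, `CircuitBreakerOpen` is a local stand-in for the breaker's exception, and the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import time
from typing import Callable


class CircuitBreakerOpen(Exception):
    """Stand-in for the breaker's fail-fast exception."""


def retry_with_backoff(func: Callable, attempts: int = 3, base_delay: float = 0.1,
                       sleep: Callable[[float], None] = time.sleep):
    """Retry `func`, doubling the delay each time; never retry an open circuit."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitBreakerOpen:
            raise  # the dependency is known to be unhealthy; retrying adds load
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
```

If `func` is the breaker-protected call, each failed attempt also counts toward the breaker's threshold, so a burst of retries can trip the circuit sooner. Whether that is desirable depends on the service.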

Conclusion

A circuit breaker isn’t just a safety net; it’s a core resilience primitive for modern distributed systems. By failing fast and enabling graceful degradation, it turns potential outages into minor hiccups. The implementation above is lightweight and easy to adapt, but it is not thread-safe as written: the failure count and state are updated without a lock, so add locking before sharing one breaker between threads. For production systems, consider libraries like pybreaker or integrate with service meshes (Istio, Linkerd), but understanding the pattern is the first step.
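If one breaker is shared between threads, its counter and state updates need to be guarded by a lock. The sketch below shows only the locking idea on a hypothetical `LockedFailureCounter` class; it is not part of the implementation above, and a full breaker would guard its state transitions the same way.

```python
import threading


class LockedFailureCounter:
    """Failure counter whose read-modify-write steps are guarded by a lock."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self._count = 0
        self._lock = threading.Lock()

    def record_failure(self) -> bool:
        """Increment the count; return True once the threshold is reached."""
        with self._lock:
            self._count += 1
            return self._count >= self.threshold

    def reset(self) -> None:
        """Clear the count, for example after a successful call."""
        with self._lock:
            self._count = 0

    @property
    def count(self) -> int:
        with self._lock:
            return self._count
```

Returning the threshold decision from inside the lock keeps the increment and the comparison as a single step, so two threads cannot both miss the moment the threshold is crossed.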