Table of Contents
Introduction
What Is a Circuit Breaker and Why Do Microservices Need It?
Real-World Scenario: Preventing Cascading Failures in a Ride-Sharing App
Core States of a Circuit Breaker
Complete, Error-Free Python Implementation
Testing the Circuit Breaker in Action
Best Practices and Production Tips
Conclusion
Introduction
In a world of interconnected microservices, one slow or failing service can bring down your entire system. This is called a cascading failure—and it’s how major outages happen at companies like Uber, Netflix, and Amazon.
The circuit breaker pattern, popularized by libraries like Netflix’s Hystrix, acts like an electrical fuse: when a service fails too often, the circuit “trips” and stops sending requests—giving the failing service time to recover while your app stays responsive.
In this article, you’ll build a lightweight circuit breaker in pure Python, with no external dependencies, and see how it could keep a ride-sharing platform responsive when one of its services degrades.
What Is a Circuit Breaker and Why Do Microservices Need It?
A circuit breaker wraps calls to external services (like payment, maps, or user profiles) and monitors for failures. It has three states:
Closed: Requests flow normally. Failures are counted.
Open: After too many failures, the circuit opens—all requests fail fast without hitting the broken service.
Half-Open: After a timeout, the circuit allows a few test requests through. If they succeed, it closes again; if not, it reopens.
This prevents threads from blocking, reduces load on failing systems, and keeps your user experience smooth—even during partial outages.
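The three states and the moves between them can be written down as a small transition table. The sketch below is illustrative only: the event names ("threshold_reached", "timeout_elapsed", "probe_succeeded", "probe_failed") are hypothetical labels chosen for this example, not part of any library.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


def next_state(state: State, event: str) -> State:
    """Return the state that follows `state` when `event` occurs.

    The event names are hypothetical labels used only in this sketch.
    """
    transitions = {
        (State.CLOSED, "threshold_reached"): State.OPEN,
        (State.OPEN, "timeout_elapsed"): State.HALF_OPEN,
        (State.HALF_OPEN, "probe_succeeded"): State.CLOSED,
        (State.HALF_OPEN, "probe_failed"): State.OPEN,
    }
    # Any (state, event) pair not listed leaves the state unchanged.
    return transitions.get((state, event), state)
```

Keeping the transitions in one table makes it easy to see that OPEN can only be left through HALF_OPEN, never directly back to CLOSED.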
Real-World Scenario: Preventing Cascading Failures in a Ride-Sharing App
Imagine it’s Friday night. Rides are surging. Your app calls three services per ride request:
User Service (to verify rider)
Pricing Service (to calculate fare)
Dispatch Service (to assign a driver)
Suddenly, the Pricing Service slows down due to a database lock—responses take 10 seconds instead of 100ms. Without a circuit breaker:
Each ride request blocks for 10+ seconds
Threads pile up
The User and Dispatch services get overwhelmed by queued requests
The entire app becomes unresponsive
With a circuit breaker:
After 5 failures in 10 seconds, the Pricing circuit opens
New requests fail instantly with a fallback (“Using base fare”)
User and Dispatch services stay healthy
Riders still get cars—just with estimated pricing
Your app degrades gracefully instead of collapsing.
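The "Using base fare" fallback could look roughly like the sketch below. The names quote_fare, BASE_FARE, and pricing_circuit_open are assumptions made for this example, and fetch_fare stands in for whatever breaker-protected call reaches the Pricing Service.

```python
class CircuitBreakerOpen(Exception):
    """Raised when the breaker refuses a call because the circuit is open."""


BASE_FARE = 8.50  # hypothetical default fare, in the app's currency


def quote_fare(fetch_fare, ride_id: str) -> dict:
    """Return a live fare if possible, otherwise an estimated base fare.

    `fetch_fare` stands in for a breaker-protected call to the Pricing Service.
    """
    try:
        return {"ride_id": ride_id, "fare": fetch_fare(ride_id), "estimated": False}
    except CircuitBreakerOpen:
        # Fail fast: skip the Pricing Service and fall back to the base fare.
        return {"ride_id": ride_id, "fare": BASE_FARE, "estimated": True}


def pricing_circuit_open(ride_id: str) -> float:
    # Simulates what a protected call does while the Pricing circuit is open.
    raise CircuitBreakerOpen("Pricing circuit is OPEN")


print(quote_fare(pricing_circuit_open, "ride-42"))   # estimated base fare
print(quote_fare(lambda ride_id: 12.75, "ride-43"))  # live fare
```

Flagging the result as estimated lets the rest of the app decide how to present the price, rather than hiding the degradation.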
Core States of a Circuit Breaker
Our implementation tracks:
Failure threshold: Max failures before tripping (e.g., 5)
Timeout: How long to wait before testing recovery (e.g., 15 seconds)
Failure count and last failure time
Current state: CLOSED, OPEN, or HALF_OPEN
When open, it raises a CircuitBreakerOpen exception, allowing you to return a cached or default response immediately.
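A plain failure count trips after a number of failures regardless of how far apart they are, whereas the scenario above talks about "5 failures in 10 seconds". One way to express a time-windowed threshold is sketched below. FailureWindow and its parameters are hypothetical names for this example, and the clock is injectable only so that the behavior can be checked without sleeping.

```python
import time
from collections import deque


class FailureWindow:
    """Counts failures that occurred within the last `window_seconds`."""

    def __init__(self, threshold: int = 5, window_seconds: float = 10.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._clock = clock        # injectable so tests can control time
        self._failures = deque()   # timestamps of recent failures, oldest first

    def record_failure(self) -> bool:
        """Record a failure; return True if the threshold is now reached."""
        now = self._clock()
        self._failures.append(now)
        self._evict(now)
        return len(self._failures) >= self.threshold

    def _evict(self, now: float) -> None:
        # Drop failures that have aged out of the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
```

A breaker could call record_failure() on each error and open the circuit when it returns True, so that isolated errors spread over a long period do not trip it.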
Complete, Error-Free Python Implementation
import time
from enum import Enum
from functools import wraps
from typing import Callable, Any
import random
# --- Circuit Breaker Classes ---
class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreakerOpen(Exception):
    """Raised when the circuit breaker is open"""
    pass


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 15.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._failure_count = 0
        self._last_failure_time = None
        self._state = State.CLOSED
        print(f"CircuitBreaker initialized: Threshold={failure_threshold}, Timeout={recovery_timeout}s")

    def _call_if_closed(self, func: Callable, *args, **kwargs) -> Any:
        # Check for OPEN state and potential transition to HALF_OPEN
        if self._state == State.OPEN:
            if time.time() - (self._last_failure_time or 0) > self.recovery_timeout:
                self._state = State.HALF_OPEN
                print(f"[{time.strftime('%H:%M:%S')}] State transition: OPEN -> HALF_OPEN. Allowing test call...")
            else:
                raise CircuitBreakerOpen(f"Circuit is OPEN. Remaining wait: {self.recovery_timeout - (time.time() - self._last_failure_time):.1f}s")
        initial_state = self._state
        try:
            result = func(*args, **kwargs)
            # Success!
            if initial_state == State.HALF_OPEN:
                print(f"[{time.strftime('%H:%M:%S')}] State transition: HALF_OPEN -> CLOSED. Test call successful.")
            # Any success clears the failure count, so the threshold counts consecutive failures
            self._reset()
            return result
        except Exception:
            self._record_failure(initial_state)
            raise

    def _record_failure(self, initial_state: State):
        self._failure_count += 1
        self._last_failure_time = time.time()
        if initial_state == State.HALF_OPEN:
            # If a failure occurs in HALF_OPEN, transition immediately back to OPEN
            self._state = State.OPEN
            print(f"[{time.strftime('%H:%M:%S')}] State transition: HALF_OPEN -> OPEN. Test call failed.")
        elif self._failure_count >= self.failure_threshold:
            # Transition to OPEN from CLOSED
            self._state = State.OPEN
            print(f"[{time.strftime('%H:%M:%S')}] State transition: CLOSED -> OPEN. Failure threshold reached ({self._failure_count}).")
        else:
            # Still in CLOSED, but counting failures
            print(f"[{time.strftime('%H:%M:%S')}] Failure recorded. Count: {self._failure_count}/{self.failure_threshold}. State: {self._state.value}")

    def _reset(self):
        self._failure_count = 0
        self._last_failure_time = None
        if self._state != State.CLOSED:
            # Leaving HALF_OPEN after a successful test call; the transition
            # message itself is printed in _call_if_closed
            self._state = State.CLOSED

    def call(self, func: Callable, *args, **kwargs) -> Any:
        return self._call_if_closed(func, *args, **kwargs)

    def __call__(self, func: Callable) -> Callable:
        # Allows the breaker instance to be used as a decorator
        @wraps(func)
        def wrapper(*args, **kwargs):
            return self._call_if_closed(func, *args, **kwargs)
        return wrapper


# --- Demo Setup ---
class ExternalServiceError(Exception):
    """Simulated error for external service failure"""
    pass


class ExternalService:
    def __init__(self, fail_rate: float = 0.5):
        self.fail_rate = fail_rate
        self.total_calls = 0

    def process_request(self, data):
        self.total_calls += 1
        print(f"[{time.strftime('%H:%M:%S')}] Service call #{self.total_calls}: Processing request for '{data}'...")
        # Simulate failure
        if random.random() < self.fail_rate:
            raise ExternalServiceError("Simulated Service Unavailability")
        return f"SUCCESS: Processed '{data}'"


# --- Main Demonstration ---
def main():
    # Configure the circuit breaker (3 failures max, 10s wait)
    breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=10.0)
    # Configure the failing service (70% failure rate, high enough to open the circuit quickly)
    service = ExternalService(fail_rate=0.7)

    # Use the circuit breaker as a decorator
    @breaker
    def protected_service_call(data):
        return service.process_request(data)

    print("\n--- Phase 1: Failures & Opening the Circuit ---")
    # 6 attempts to force the circuit to OPEN
    for i in range(6):
        try:
            print(f"\nAttempt {i+1}:")
            result = protected_service_call(f"data_{i}")
            print(f"RESULT: {result}")
        except CircuitBreakerOpen as e:
            print(f"RESULT: {e} - Circuit is OPEN, skipping call.")
        except ExternalServiceError as e:
            print(f"RESULT: Service call failed ({e}).")
        time.sleep(1)  # Short delay between attempts

    print("\n--- Phase 2: Circuit is OPEN, waiting for timeout ---")
    # Attempts during the OPEN state
    wait_start = time.time()
    while time.time() - wait_start < breaker.recovery_timeout + 3:  # Wait a bit longer than timeout
        try:
            print("\nAttempt in OPEN state:")
            protected_service_call("test_open")
            # If successful here, something is wrong
        except CircuitBreakerOpen as e:
            print(f"RESULT: {e} - Circuit is OPEN.")
        except ExternalServiceError as e:
            print(f"RESULT: Service call failed ({e}).")
        # Note: the breaker is HALF_OPEN only while a test call is in flight,
        # so this check does not fire once the call above has returned.
        if breaker._state == State.HALF_OPEN:
            print("Circuit is HALF-OPEN now! Proceeding to test...")
            break  # Exit the waiting loop when HALF_OPEN
        time.sleep(2)  # Longer delay in OPEN state

    print("\n--- Phase 3: HALF-OPEN Test (Success or Failure) ---")
    # Force the service to succeed this time for a graceful recovery demonstration
    service.fail_rate = 0.0  # Temporarily disable failure
    # The HALF_OPEN state allows one request.
    try:
        print("\nAttempt in HALF-OPEN state (Test Call):")
        result = protected_service_call("test_half_open_success")
        print(f"RESULT: {result}")
    except CircuitBreakerOpen as e:
        print(f"RESULT: {e} - ERROR: Should not be OPEN here.")
    except ExternalServiceError as e:
        print(f"RESULT: Service call failed ({e}). Back to OPEN!")
    print(f"\nFinal State: {breaker._state.value}")


if __name__ == "__main__":
    main()
Output
CircuitBreaker initialized: Threshold=3, Timeout=10.0s
--- Phase 1: Failures & Opening the Circuit ---
Attempt 1:
[19:52:07] Service call #1: Processing request for 'data_0'...
[19:52:07] Failure recorded. Count: 1/3. State: closed
RESULT: Service call failed (Simulated Service Unavailability).
Attempt 2:
[19:52:08] Service call #2: Processing request for 'data_1'...
[19:52:08] Failure recorded. Count: 2/3. State: closed
RESULT: Service call failed (Simulated Service Unavailability).
Attempt 3:
[19:52:09] Service call #3: Processing request for 'data_2'...
[19:52:09] State transition: CLOSED -> OPEN. Failure threshold reached (3).
RESULT: Service call failed (Simulated Service Unavailability).
Attempt 4:
RESULT: Circuit is OPEN. Remaining wait: 9.0s - Circuit is OPEN, skipping call.
Attempt 5:
RESULT: Circuit is OPEN. Remaining wait: 8.0s - Circuit is OPEN, skipping call.
Attempt 6:
RESULT: Circuit is OPEN. Remaining wait: 7.0s - Circuit is OPEN, skipping call.
--- Phase 2: Circuit is OPEN, waiting for timeout ---
Attempt in OPEN state:
RESULT: Circuit is OPEN. Remaining wait: 6.0s - Circuit is OPEN.
Attempt in OPEN state:
RESULT: Circuit is OPEN. Remaining wait: 4.0s - Circuit is OPEN.
Attempt in OPEN state:
RESULT: Circuit is OPEN. Remaining wait: 2.0s - Circuit is OPEN.
Attempt in OPEN state:
[19:52:19] State transition: OPEN -> HALF_OPEN. Allowing test call...
[19:52:19] Service call #4: Processing request for 'test_open'...
[19:52:19] State transition: HALF_OPEN -> OPEN. Test call failed.
RESULT: Service call failed (Simulated Service Unavailability).
Attempt in OPEN state:
RESULT: Circuit is OPEN. Remaining wait: 8.0s - Circuit is OPEN.
Attempt in OPEN state:
RESULT: Circuit is OPEN. Remaining wait: 6.0s - Circuit is OPEN.
Attempt in OPEN state:
RESULT: Circuit is OPEN. Remaining wait: 4.0s - Circuit is OPEN.
--- Phase 3: HALF-OPEN Test (Success or Failure) ---
Attempt in HALF-OPEN state (Test Call):
RESULT: Circuit is OPEN. Remaining wait: 2.0s - ERROR: Should not be OPEN here.
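Final State: open
Note that this run does not end in recovery. The breaker is HALF_OPEN only while a test call is in flight, so the HALF_OPEN check in the Phase 2 loop never fires. The test call at 19:52:19 went through while the service was still failing 70% of the time; it failed, which reopened the circuit and restarted the 10-second timer. Phase 3 began only 8 seconds later, so its call was rejected. To observe HALF_OPEN -> CLOSED, set service.fail_rate to 0.0 before the recovery timeout elapses, or wait a full recovery_timeout after the last failure before making the Phase 3 call.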
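The demo above depends on wall-clock sleeps and a random failure rate, so every run looks a little different. For a repeatable check of the open, half-open, and closed sequence, one option is to inject a fake clock. The sketch below uses a condensed stand-in breaker (MiniBreaker, a hypothetical name), not the CircuitBreaker class above, and all other names are likewise assumptions for this example.

```python
class CircuitOpen(Exception):
    """Raised when a call is rejected because the circuit is open."""


class MiniBreaker:
    """Condensed breaker with an injectable clock (a sketch for testing)."""

    def __init__(self, failure_threshold, recovery_timeout, clock):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None
        self.state = "closed"

    def call(self, func):
        if self.state == "open":
            if self.clock() - self.opened_at > self.recovery_timeout:
                self.state = "half_open"  # let one probe through
            else:
                raise CircuitOpen("circuit is open")
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        # Success: clear the count and close the circuit.
        self.failures = 0
        self.state = "closed"
        return result


now = [0.0]        # fake clock value, advanced by hand
healthy = [False]  # toggles the simulated service
breaker = MiniBreaker(3, 10.0, clock=lambda: now[0])


def service():
    if not healthy[0]:
        raise RuntimeError("service down")
    return "ok"


def attempt():
    try:
        return breaker.call(service)
    except CircuitOpen:
        return "rejected"
    except RuntimeError:
        return "failed"


results = [attempt() for _ in range(4)]  # three failures trip it; the fourth is rejected
now[0] = 11.0                            # recovery timeout has elapsed
healthy[0] = True                        # service recovered before the probe
results.append(attempt())                # half-open probe succeeds
print(results, breaker.state)
# prints: ['failed', 'failed', 'failed', 'rejected', 'ok'] closed
```

Because time only moves when the test says so, the same sequence runs instantly and produces the same result every time.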
Best Practices and Production Tips
Set realistic thresholds: Start with 5 failures in 10 seconds for non-critical services.
Use meaningful fallbacks: Return cached data, default values, or queue requests for later.
Log state changes: Alert your team when circuits open—this is a key health signal.
Don’t share breakers across unrelated services: Each external dependency should have its own circuit.
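Combine with retries: Retry transient errors with exponential backoff, but keep the number of attempts small so retries alone do not trip the breaker, and do not retry once the circuit is open.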
Monitor metrics: Track open/closed transitions, fallback rates, and recovery success.
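As one illustration of the retry advice in the list above, the sketch below retries a call with exponentially growing delays and gives up immediately when the circuit is open. The helper call_with_retries and its parameters are hypothetical, and the CircuitBreakerOpen class is redefined here only to keep the example self-contained. Adding random jitter to the delay is a common refinement that is omitted for brevity.

```python
import time


class CircuitBreakerOpen(Exception):
    """Stand-in for the breaker's open-circuit exception."""


def call_with_retries(func, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Call `func`, retrying failures with exponentially growing delays.

    An open circuit is never retried: the breaker has already judged the
    dependency unhealthy, so the error propagates to the fallback path.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except CircuitBreakerOpen:
            raise
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            # Delays grow as base_delay * 1, 2, 4, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

In this arrangement func would be the breaker-protected call, so every failed attempt still counts toward the breaker's threshold.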
Conclusion
A circuit breaker isn’t just a safety net; it’s a core resilience primitive for modern distributed systems. By failing fast and enabling graceful degradation, it turns potential outages into minor hiccups. The implementation above is lightweight and easy to drop into a single-threaded Python service, but it is not thread-safe: its failure count and state are updated without a lock. For production systems, consider libraries like pybreaker or integrate with service meshes (Istio, Linkerd), but understanding the pattern is the first step.