Objectives of Chaos Engineering
- Validate system behavior under stress
- Identify single points of failure
- Improve fault tolerance and recovery mechanisms
- Build confidence in production systems
Core Principles of Chaos Engineering
- Start Small: Begin with low-risk experiments in staging environments.
- Define a Steady State: Know what 'normal' looks like (e.g., response time, throughput).
- Introduce Realistic Failures: Simulate outages, latency, or resource limits.
- Observe and Measure: Monitor system metrics and logs during the experiment.
- Automate and Repeat: Integrate chaos tests into CI/CD pipelines.
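As a rough illustration of the "Define a Steady State" and "Observe and Measure" principles, the sketch below checks observed metrics against steady-state thresholds. The class name, function names, threshold values, and sample latencies are all assumptions made for this example; in practice the numbers would come from a monitoring system.

```python
import math
from dataclasses import dataclass


@dataclass
class SteadyState:
    """Thresholds that define 'normal'; the values used below are illustrative."""
    max_p95_latency_ms: float
    max_error_rate: float


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def within_steady_state(latencies_ms, error_count, request_count, steady):
    """Return True when observed metrics stay inside the steady-state thresholds."""
    if request_count == 0 or not latencies_ms:
        return False  # without traffic the hypothesis cannot be verified
    p95 = percentile(latencies_ms, 95)
    error_rate = error_count / request_count
    return p95 <= steady.max_p95_latency_ms and error_rate <= steady.max_error_rate


steady = SteadyState(max_p95_latency_ms=300, max_error_rate=0.01)
baseline = [120, 130, 110, 150, 140, 135, 125, 145, 155, 160]
during_chaos = [120, 130, 900, 150, 1100, 135, 125, 1200, 155, 160]

print(within_steady_state(baseline, 0, 10, steady))      # True
print(within_steady_state(during_chaos, 2, 10, steady))  # False
```

Running the same check before, during, and after an experiment makes any deviation from the steady state explicit rather than a matter of judgment.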
Popular Tools for Chaos Engineering
- Gremlin: Enterprise-grade chaos engineering platform with UI and APIs.
- Chaos Monkey: Netflix’s open-source tool that randomly terminates instances.
- LitmusChaos: Kubernetes-native chaos engineering framework.
- Steadybit: SaaS platform for automated chaos experiments.
- PowerfulSeal: Targets Kubernetes clusters for chaos testing.
Common Chaos Engineering Experiments
- Instance Termination: Randomly terminate EC2 instances or Kubernetes pods.
- Network Latency Injection: Introduce latency between services.
- Dependency Failure: Block access to third-party APIs or internal services.
- CPU/Memory Stress: Saturate CPU or memory to see how the service degrades and recovers.
- Disk Failure Simulation: Simulate a full disk or I/O errors.
- DNS Failure: Block DNS resolution for specific services.
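To make the latency-injection and dependency-failure experiments concrete, here is a minimal in-process sketch in Python. It is not how any of the tools listed above work internally; the decorator, the exception class, and the fetch_price function are hypothetical names invented for this example.

```python
import random
import time
from functools import wraps


class InjectedFailure(Exception):
    """Raised in place of a real dependency error."""


def inject_faults(latency_s=0.0, failure_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a function so each call is delayed and may fail.

    latency_s    -- delay added before every call
    failure_rate -- probability (0.0 to 1.0) that a call raises InjectedFailure
    rng, sleep   -- injectable so tests can be deterministic and fast
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)
            if rng.random() < failure_rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def call_with_fallback(func, fallback):
    """Resilience pattern under test: return a fallback when the call fails."""
    try:
        return func()
    except InjectedFailure:
        return fallback


@inject_faults(failure_rate=1.0)
def fetch_price():
    # Stand-in for a call to a third-party API (hypothetical).
    return 42


print(call_with_fallback(fetch_price, fallback=0))  # prints 0
```

With failure_rate set to 1.0 every call fails, so the example shows the fallback path being exercised; lower rates approximate a flaky dependency.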
Best Practices for Chaos Engineering
- Always run chaos experiments in a controlled environment first.
- Use feature flags to toggle chaos experiments.
- Ensure observability with tools like Prometheus, Grafana, or Datadog.
- Collaborate across Dev, QA, and SRE teams.
- Document learnings and update incident response playbooks.
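The feature-flag practice can be sketched as a small guard that chaos code consults before doing anything. The environment variable names (CHAOS_ENABLED, CHAOS_ENVIRONMENTS, APP_ENV) are assumptions made for this example, not a convention of any particular tool.

```python
import os


def chaos_allowed(environ=None):
    """Decide whether chaos experiments may run in this process.

    Reads three hypothetical variables:
      CHAOS_ENABLED       -- "true" to opt in; anything else means off
      CHAOS_ENVIRONMENTS  -- comma-separated allowlist, e.g. "staging,dev"
      APP_ENV             -- the environment this process runs in
    """
    env = os.environ if environ is None else environ
    enabled = env.get("CHAOS_ENABLED", "false").strip().lower() == "true"
    allowed = {
        name.strip()
        for name in env.get("CHAOS_ENVIRONMENTS", "").split(",")
        if name.strip()
    }
    return enabled and env.get("APP_ENV", "") in allowed


staging = {"CHAOS_ENABLED": "true", "CHAOS_ENVIRONMENTS": "staging,dev", "APP_ENV": "staging"}
production = {"CHAOS_ENABLED": "true", "CHAOS_ENVIRONMENTS": "staging,dev", "APP_ENV": "production"}

print(chaos_allowed(staging))     # True
print(chaos_allowed(production))  # False
```

Defaulting to "off" and requiring an explicit allowlist keeps an experiment from running in an environment nobody intended.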
Real-World Use Cases
- Netflix uses Chaos Monkey to ensure its streaming service remains available during instance failures.
- Amazon tests its microservices architecture for resilience under high load and service disruptions.
- LinkedIn and Google use chaos engineering to validate failover strategies and improve system robustness.
Chaos Engineering Code Examples
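1. Chaos Monkey for Spring Boot (Java-based microservices)
Chaos Monkey for Spring Boot is a separate open-source project from codecentric, inspired by Netflix's Chaos Monkey. Instead of terminating instances, it injects latency, exceptions, and other assaults into a running Spring Boot application.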
Code Example
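// build.gradle: add the Chaos Monkey for Spring Boot dependency
implementation 'de.codecentric:chaos-monkey-spring-boot:2.5.0'

# application.properties: the library only loads when the chaos-monkey profile is active
spring.profiles.active=chaos-monkey
chaos.monkey.enabled=true
# Watch REST controllers so that assaults apply to their public methods
chaos.monkey.watcher.rest-controller=true
# Attack roughly one in every 5 requests
chaos.monkey.assaults.level=5
# Add 1000 to 3000 ms of latency to each attacked request
chaos.monkey.assaults.latency-active=true
chaos.monkey.assaults.latency-range-start=1000
chaos.monkey.assaults.latency-range-end=3000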
Expected Outcome: Adds 1 to 3 seconds of latency to a share of requests, showing whether timeouts, retries, and fallbacks keep the service responsive.
2. Gremlin (CLI-based attack)
Gremlin allows you to run chaos experiments via CLI or API.
Code Example
# Consume 2 CPU cores on the local host for 60 seconds
gremlin attack cpu --length 60 --cores 2
Expected Outcome: Tests how the system handles high CPU usage and whether it recovers gracefully.
3. Chaos Toolkit (Python/JSON)
Chaos Toolkit is an open-source Python framework whose experiments are declared in JSON or YAML.
Code Example
{
 "version": "1.0.0",
 "title": "HTTP Latency Injection",
 "description": "Inject latency into HTTP service",
 "steady-state-hypothesis": {
   "title": "Service is healthy",
   "probes": [{
     "type": "probe",
     "name": "check-service",
     "tolerance": 200,
     "provider": {
       "type": "http",
       "url": "http://my-service/health",
       "method": "GET"
     }
   }]
 },
 "method": [{
   "type": "action",
   "name": "inject-latency",
   "provider": {
     "type": "process",
     "path": "toxiproxy-cli",
     "arguments": ["toxic", "add", "--toxicName", "chaos-latency", "--type", "latency", "--attribute", "latency=1000", "my-proxy"]
   }
 }],
 "rollbacks": [{
   "type": "action",
   "name": "remove-latency",
   "provider": {
     "type": "process",
     "path": "toxiproxy-cli",
     "arguments": ["toxic", "remove", "--toxicName", "chaos-latency", "my-proxy"]
   }
 }]
}
Expected Outcome: Injects 1000 ms of latency through a Toxiproxy proxy, then re-checks the health endpoint to verify that the service stays healthy under delay.
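The flow such an experiment follows (check the steady state, inject the fault, check again, clean up) can be sketched in a few lines of plain Python. This is a simplified illustration, not Chaos Toolkit's implementation; the function name, the status strings, and the in-memory service are assumptions made for the example.

```python
def run_experiment(steady_state_probe, method, rollbacks):
    """Run one experiment and return a short, illustrative status string.

    steady_state_probe -- callable returning True when the system is healthy
    method             -- list of callables that inject the failure
    rollbacks          -- list of callables that undo the injection
    """
    if not steady_state_probe():
        return "aborted"  # system was unhealthy before any fault was injected
    try:
        for action in method:
            action()
        return "passed" if steady_state_probe() else "deviated"
    finally:
        for undo in rollbacks:
            undo()  # always clean up, even when an action raises


service = {"latency_ms": 100}  # hypothetical in-memory stand-in for a real service


def healthy():
    return service["latency_ms"] < 500


def add_latency():
    service["latency_ms"] += 1000


def remove_latency():
    service["latency_ms"] -= 1000


result = run_experiment(healthy, [add_latency], [remove_latency])
print(result, service["latency_ms"])  # prints: deviated 100
```

The "deviated" result is the interesting one: the system was healthy before the fault and unhealthy after it, which is exactly the weakness the experiment is meant to surface.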
4. LitmusChaos (Kubernetes-native)
LitmusChaos uses Kubernetes custom resource definitions (CRDs) to define chaos experiments. A ChaosEngine resource binds a target application to one or more experiments.
Code Example
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
 name: pod-delete-engine
 namespace: default
spec:
 engineState: "active"
 appinfo:
   appns: default
   applabel: "app=nginx"
   appkind: deployment
 chaosServiceAccount: litmus-admin
 experiments:
   - name: pod-delete
     spec:
       components:
         env:
           - name: TOTAL_CHAOS_DURATION
             value: "30"
Expected Outcome: Deletes pods of the nginx deployment over a 30-second window to verify that Kubernetes recreates them and the service recovers.
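One way to put a number on auto-recovery is to poll a health check after the fault and record how long the system takes to become healthy again. The sketch below assumes a hypothetical is_healthy callable (for example, one that checks whether all replicas are ready); the clock and sleep parameters are injectable so the demo runs instantly against a simulated clock.

```python
import time


def wait_for_recovery(is_healthy, timeout_s=60.0, interval_s=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll is_healthy until it returns True.

    Returns the elapsed seconds until recovery, or None on timeout.
    """
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


# Demo with a simulated clock: the fake service recovers on the third check.
now = [0.0]


def fake_clock():
    return now[0]


def fake_sleep(seconds):
    now[0] += seconds


checks = iter([False, False, True])
recovery = wait_for_recovery(lambda: next(checks), timeout_s=10.0, interval_s=1.0,
                             clock=fake_clock, sleep=fake_sleep)
print(recovery)  # prints 2.0
```

Tracking this value across repeated runs shows whether recovery time is stable or drifting as the system changes.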