
Breaking Things on Purpose: Code-Driven Chaos Engineering

Objectives of Chaos Engineering

  • Validate system behavior under stress
  • Identify single points of failure
  • Improve fault tolerance and recovery mechanisms
  • Build confidence in production systems

Core Principles of Chaos Engineering

  1. Start Small: Begin with low-risk experiments in staging environments.
  2. Define a Steady State: Know what 'normal' looks like (e.g., response time, throughput).
  3. Introduce Realistic Failures: Simulate outages, latency, or resource limits.
  4. Observe and Measure: Monitor system metrics and logs during the experiment.
  5. Automate and Repeat: Integrate chaos tests into CI/CD pipelines.
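
The five principles above can be sketched as a small experiment loop. This is a minimal illustration, not the API of any tool: run_experiment, ExperimentResult, and the toy "system" dictionary are hypothetical names invented for this example. It checks the steady state, injects a fault, observes, always rolls back, and checks again.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool
    steady_during: bool
    steady_after: bool

    @property
    def passed(self) -> bool:
        # The hypothesis holds only if the system stayed healthy throughout.
        return self.steady_before and self.steady_during and self.steady_after


def run_experiment(
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Check the steady state, inject a failure, observe, then roll back."""
    if not steady_state():
        # Never start an experiment on a system that is already unhealthy.
        return ExperimentResult(False, False, False)
    inject()
    try:
        during = steady_state()
    finally:
        rollback()  # always undo the fault, even if the probe raises
    after = steady_state()
    return ExperimentResult(True, during, after)


if __name__ == "__main__":
    # Toy system: healthy while latency stays under 500 ms (assumed threshold).
    system = {"latency_ms": 120}
    result = run_experiment(
        steady_state=lambda: system["latency_ms"] < 500,
        inject=lambda: system.update(latency_ms=system["latency_ms"] + 1000),
        rollback=lambda: system.update(latency_ms=120),
    )
    print(result, "passed:", result.passed)
```

Because the function takes plain callables, the same loop could wrap a real health probe and a real fault injector and then run as a CI/CD job.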

Popular Tools for Chaos Engineering

  • Gremlin: Enterprise-grade chaos engineering platform with UI and APIs.
  • Chaos Monkey: Netflix’s open-source tool that randomly terminates instances.
  • LitmusChaos: Kubernetes-native chaos engineering framework.
  • Steadybit: SaaS platform for automated chaos experiments.
  • PowerfulSeal: Targets Kubernetes clusters for chaos testing.

Common Chaos Engineering Experiments

  1. Instance Termination: Randomly terminate EC2 instances or Kubernetes pods.
  2. Network Latency Injection: Introduce latency between services.
  3. Dependency Failure: Block access to third-party APIs or internal services.
  4. CPU/Memory Stress: Consume high CPU or memory resources.
  5. Disk Failure Simulation: Simulate disk full or I/O errors.
  6. DNS Failure: Block DNS resolution for specific services.

Best Practices for Chaos Engineering

  • Always run chaos experiments in a controlled environment first.
  • Use feature flags to toggle chaos experiments.
  • Ensure observability with tools like Prometheus, Grafana, or Datadog.
  • Collaborate across Dev, QA, and SRE teams.
  • Document learnings and update incident response playbooks.
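
The feature-flag practice can be as simple as an environment variable that acts as a kill switch. The following is a rough sketch only: the CHAOS_ENABLED variable name and both helper functions are assumptions for this example, and a real setup might read the flag from a feature-flag service instead.

```python
import os


def chaos_enabled(flag="CHAOS_ENABLED", environ=None):
    """Return True only when the flag is explicitly switched on."""
    env = os.environ if environ is None else environ
    return env.get(flag, "").strip().lower() in {"1", "true", "yes", "on"}


def maybe_inject(inject, environ=None):
    """Run inject() only when chaos is enabled; report whether it ran."""
    if not chaos_enabled(environ=environ):
        return False
    inject()
    return True


if __name__ == "__main__":
    ran = maybe_inject(lambda: print("injecting fault"))
    print("chaos ran:", ran)
```

Defaulting to "off" when the variable is missing or malformed keeps an experiment from starting by accident.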

Real-World Use Cases

  • Netflix uses Chaos Monkey to ensure its streaming service remains available during instance failures.
  • Amazon tests its microservices architecture for resilience under high load and service disruptions.
  • LinkedIn and Google use chaos engineering to validate failover strategies and improve system robustness.

Chaos Engineering Code Examples

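1. Chaos Monkey for Spring Boot (Java-based microservices)

Chaos Monkey for Spring Boot is a separate project from Netflix's Chaos Monkey. Rather than terminating instances, it runs inside a Spring Boot application and attacks its beans with faults such as added latency, exceptions, or an application kill.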

Code Example

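// Chaos Monkey for Spring Boot
// Add dependency in build.gradle
implementation 'de.codecentric:chaos-monkey-spring-boot:2.5.0'

# Enable in application.properties
# The library only loads when the chaos-monkey profile is active
spring.profiles.active=chaos-monkey
chaos.monkey.enabled=true
# Watch @RestController beans so their calls can be attacked
chaos.monkey.watcher.rest-controller=true
# Attack every 5th request with 1000 to 3000 ms of added latency
chaos.monkey.assaults.level=5
chaos.monkey.assaults.latency-active=true
chaos.monkey.assaults.latency-range-start=1000
chaos.monkey.assaults.latency-range-end=3000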

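Expected Outcome: Adds 1 to 3 seconds of latency to roughly every fifth watched request, which tests how callers cope with a slow service (timeouts, retries, fallbacks). This configuration does not terminate instances.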

2. Gremlin (CLI-based attack)

Gremlin allows you to run chaos experiments via CLI or API.

Code Example

# Simulate CPU stress on a host
gremlin attack cpu --length 60 --cores 2

Expected Outcome: Tests how the system handles high CPU usage and whether it recovers gracefully.

3. Chaos Toolkit (Python/JSON)

Chaos Toolkit is an open-source framework for writing chaos experiments in JSON/YAML.

Code Example

{
  "version": "1.0.0",
  "title": "HTTP Latency Injection",
  "description": "Inject latency into HTTP service",
  "steady-state-hypothesis": {
    "title": "Service is healthy",
    "probes": [{
      "type": "probe",
      "name": "check-service",
      "tolerance": 200,
      "provider": {
        "type": "http",
        "url": "http://my-service/health",
        "method": "GET"
      }
    }]
  },
  "method": [{
    "type": "action",
    "name": "inject-latency",
    "provider": {
      "type": "process",
      "path": "toxiproxy-cli",
      "arguments": ["toxic", "add", "--type", "latency", "--attribute", "latency=1000", "my-proxy"]
    }
  }]
}

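Expected Outcome: Adds 1000 ms of latency through Toxiproxy. Chaos Toolkit checks the health endpoint before and after the action to verify that the service stays healthy under delay. The latency only affects traffic that is routed through the proxy.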

4. LitmusChaos (Kubernetes-native)

LitmusChaos uses CRDs to define chaos experiments for Kubernetes environments.

Code Example

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-engine
spec:
  engineState: "active"
  appinfo:
    appns: default
    applabel: "app=nginx"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"

Expected Outcome: Deletes pods in a Kubernetes deployment to test auto-recovery and resilience.
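
To turn "auto-recovery" into a measurable result, a script can poll the number of ready replicas after the pods are deleted and record how long recovery takes. The sketch below is illustrative only: wait_for_recovery is a hypothetical helper, and the ready_replicas callable is an assumption standing in for whatever reads the deployment status (for example, a wrapper around kubectl or a Kubernetes client).

```python
import time
from typing import Callable, Optional


def wait_for_recovery(
    ready_replicas: Callable[[], int],
    expected: int,
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds until `expected` replicas are ready, or None on timeout."""
    start = clock()
    while True:
        elapsed = clock() - start
        if ready_replicas() >= expected:
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated deployment: one more replica becomes ready on each poll.
    observed = iter([1, 2, 3])
    seconds = wait_for_recovery(lambda: next(observed), expected=3, interval_s=0.01)
    print("recovered" if seconds is not None else "timed out")
```

Comparing the returned time against a recovery objective gives the experiment a clear pass or fail condition.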