Breaking Things on Purpose: Code-Driven Chaos Engineering

Pratik Chavan
Jul 28

Objectives of Chaos Engineering

Validate system behavior under stress
Identify single points of failure
Improve fault tolerance and recovery mechanisms
Build confidence in production systems

Core Principles of Chaos Engineering

Start Small: Begin with low-risk experiments in staging environments.
Define a Steady State: Know what 'normal' looks like (e.g., response time, throughput).
Introduce Realistic Failures: Simulate outages, latency, or resource limits.
Observe and Measure: Monitor system metrics and logs during the experiment.
Automate and Repeat: Integrate chaos tests into CI/CD pipelines.
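The principles above can be sketched as a small experiment loop in Python. This is a minimal illustration, not any tool's actual API: `run_experiment`, `ExperimentResult`, and the in-memory `service` dictionary are hypothetical names, and the 500 ms threshold is an assumed steady-state definition.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool
    steady_during: bool
    steady_after: bool

    @property
    def passed(self) -> bool:
        # The hypothesis holds only if the system stayed within tolerance
        # at every checkpoint.
        return self.steady_before and self.steady_during and self.steady_after


def run_experiment(
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Check steady state, inject a fault, observe, then always roll back."""
    before = steady_state()
    if not before:
        # Do not inject faults into a system that is already unhealthy.
        return ExperimentResult(False, False, False)
    during = False
    try:
        inject()
        during = steady_state()
    finally:
        # Roll back even if the injection or the probe raised.
        rollback()
    after = steady_state()
    return ExperimentResult(before, during, after)


# Hypothetical stand-in for a real service: steady state means latency < 500 ms.
service = {"latency_ms": 120}

result = run_experiment(
    steady_state=lambda: service["latency_ms"] < 500,
    inject=lambda: service.update(latency_ms=service["latency_ms"] + 1000),
    rollback=lambda: service.update(latency_ms=120),
)
print(result, "passed:", result.passed)
```

In this simulated run the steady state holds before and after the fault but not during it, so the experiment reports a failed hypothesis, which is the kind of finding a chaos test is meant to surface.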

Popular Tools for Chaos Engineering

Gremlin: Enterprise-grade chaos engineering platform with UI and APIs.
Chaos Monkey: Netflix’s open-source tool that randomly terminates instances.
LitmusChaos: Kubernetes-native chaos engineering framework.
Steadybit: SaaS platform for automated chaos experiments.
PowerfulSeal: Targets Kubernetes clusters for chaos testing.

Common Chaos Engineering Experiments

Instance Termination: Randomly terminate EC2 instances or Kubernetes pods.
Network Latency Injection: Introduce latency between services.
Dependency Failure: Block access to third-party APIs or internal services.
CPU/Memory Stress: Consume high CPU or memory resources.
Disk Failure Simulation: Simulate disk full or I/O errors.
DNS Failure: Block DNS resolution for specific services.
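Two of the experiments above, latency injection and dependency failure, can be approximated at the application level with a decorator. The sketch below is an assumption-laden illustration: `chaos`, `InjectedFailure`, `fetch_price`, and `fetch_inventory` are hypothetical names, and real tools usually inject these faults at the network or infrastructure layer instead.

```python
import random
import time
from functools import wraps


class InjectedFailure(Exception):
    """Raised in place of a real dependency error."""


def chaos(latency_s=0.0, failure_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a function so each call is delayed and may fail at random."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # simulated network latency
            if rng.random() < failure_rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@chaos(latency_s=0.05)
def fetch_price(sku):
    # Hypothetical downstream call that only becomes slower.
    return {"sku": sku, "price": 9.99}


@chaos(failure_rate=1.0)
def fetch_inventory(sku):
    # Hypothetical downstream call that always fails while chaos is on.
    return {"sku": sku, "in_stock": 3}


print(fetch_price("abc"))
try:
    fetch_inventory("abc")
except InjectedFailure as exc:
    print("caller must handle:", exc)
```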

Best Practices for Chaos Engineering

Always run chaos experiments in a controlled environment first.
Use feature flags to toggle chaos experiments.
Ensure observability with tools like Prometheus, Grafana, or Datadog.
Collaborate across Dev, QA, and SRE teams.
Document learnings and update incident response playbooks.
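The feature-flag practice above can be as simple as an environment-variable kill switch. In this sketch the variable name `CHAOS_ENABLED` and the helpers `chaos_enabled` and `maybe_inject` are assumptions made for illustration; a real setup might read the flag from a feature-flag service instead.

```python
import os


def chaos_enabled(env=None) -> bool:
    """Return True only when the flag is explicitly switched on."""
    env = os.environ if env is None else env
    value = env.get("CHAOS_ENABLED", "false")
    return value.strip().lower() in {"1", "true", "yes", "on"}


def maybe_inject(fault, env=None) -> bool:
    """Run the fault only if chaos is enabled; report whether it ran."""
    if not chaos_enabled(env):
        return False
    fault()
    return True


# Chaos stays off by default, so a missing flag never injects a fault.
print(maybe_inject(lambda: print("fault injected"), env={}))
print(maybe_inject(lambda: print("fault injected"), env={"CHAOS_ENABLED": "true"}))
```

Defaulting to off means an unset or misspelled flag fails safe, which matters when the same code path ships to production.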

Real-World Use Cases

Netflix uses Chaos Monkey to ensure its streaming service remains available during instance failures.
Amazon tests its microservices architecture for resilience under high load and service disruptions.
LinkedIn and Google use chaos engineering to validate failover strategies and improve system robustness.

Chaos Engineering Code Examples

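1. Chaos Monkey for Spring Boot (Java-based microservices)

Netflix's Chaos Monkey terminates whole instances. Chaos Monkey for Spring Boot, a separate open-source project from codecentric, works inside the application: it can add latency, throw exceptions, or kill the app in the Spring beans it watches.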

Code Example

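// Chaos Monkey for Spring Boot
// Add dependency in build.gradle
implementation 'de.codecentric:chaos-monkey-spring-boot:2.5.0'

# Enable in application.properties
# The library only loads when the chaos-monkey Spring profile is active
spring.profiles.active=chaos-monkey
chaos.monkey.enabled=true
# Watch REST controllers (assumption: pick the bean types you want attacked)
chaos.monkey.watcher.rest-controller=true
# Attack roughly every 5th request
chaos.monkey.assaults.level=5
# Add 1000-3000 ms of latency to each attacked request
chaos.monkey.assaults.latency-active=true
chaos.monkey.assaults.latency-range-start=1000
chaos.monkey.assaults.latency-range-end=3000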

Expected Outcome: Adds 1 to 3 seconds of latency to roughly every fifth watched request, so you can check how timeouts, retries, and fallbacks in calling services behave.
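What a latency experiment like this typically exercises on the calling side is a timeout with a fallback. The Python sketch below is only an illustration of that pattern and is not part of Chaos Monkey for Spring Boot; `call_with_timeout` and `slow_service` are hypothetical names, and the 0.3-second delay stands in for the injected latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(func, timeout_s, fallback):
    """Return func() if it finishes within timeout_s, otherwise the fallback."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block the caller on the slow call; the worker thread
        # finishes in the background.
        pool.shutdown(wait=False)


def slow_service():
    # Hypothetical dependency slowed down by injected latency.
    time.sleep(0.3)
    return "live response"


print(call_with_timeout(slow_service, timeout_s=0.05, fallback="cached response"))
print(call_with_timeout(lambda: "live response", timeout_s=0.05, fallback="cached response"))
```

If the caller has no timeout at all, injected latency propagates upstream unchanged, which is exactly the weakness this kind of experiment is designed to reveal.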

2. Gremlin (CLI-based attack)

Gremlin allows you to run chaos experiments via CLI or API.

Code Example

# Simulate CPU stress on the local host: 2 cores for 60 seconds
gremlin attack cpu --length 60 --cores 2

Expected Outcome: Tests how the system handles high CPU usage and whether it recovers gracefully.

3. Chaos Toolkit (Python/JSON)

Chaos Toolkit is an open-source framework for writing chaos experiments in JSON/YAML.

Code Example

{
  "version": "1.0.0",
  "title": "HTTP Latency Injection",
  "description": "Inject latency into HTTP service",
  "steady-state-hypothesis": {
    "title": "Service is healthy",
    "probes": [{
      "type": "probe",
      "name": "check-service",
      "tolerance": 200,
      "provider": {
        "type": "http",
        "url": "http://my-service/health",
        "method": "GET"
      }
    }]
  },
  "method": [{
    "type": "action",
    "name": "inject-latency",
    "provider": {
      "type": "process",
      "path": "toxiproxy-cli",
      "arguments": ["toxic", "add", "--type", "latency", "--attribute", "latency=1000", "my-proxy"]
    }
  }]
}

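Expected Outcome: Chaos Toolkit checks the health endpoint, adds 1,000 ms of latency through Toxiproxy, then checks the endpoint again to confirm the service still responds as expected under delay.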

4. LitmusChaos (Kubernetes-native)

LitmusChaos uses Kubernetes custom resource definitions (CRDs) such as ChaosEngine to define chaos experiments and bind them to a target application.

Code Example

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-engine
spec:
  # Set to "stop" to halt a running experiment
  engineState: "active"
  appinfo:
    appns: default
    applabel: "app=nginx"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"

Expected Outcome: Deletes pods labeled app=nginx over a 30-second chaos window to test whether the deployment recreates them and the service recovers.
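To turn "auto-recovery" into a number, a pod-delete experiment is often paired with a recovery-time measurement. The Python sketch below polls a readiness check until it passes or a timeout expires. It is a simulation under stated assumptions: `wait_for_recovery` and `make_fake_readiness` are hypothetical names, and in a real cluster `is_ready` would query the deployment's ready replicas rather than a counter.

```python
import time


def wait_for_recovery(is_ready, timeout_s=60.0, interval_s=1.0):
    """Poll is_ready() and return seconds until it passes, or None on timeout."""
    start = time.monotonic()
    while True:
        if is_ready():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            return None
        time.sleep(interval_s)


def make_fake_readiness(ready_after_polls):
    """Simulated readiness probe that turns healthy after a few polls."""
    state = {"polls": 0}

    def is_ready():
        state["polls"] += 1
        return state["polls"] >= ready_after_polls

    return is_ready, state


is_ready, state = make_fake_readiness(ready_after_polls=3)
recovery_s = wait_for_recovery(is_ready, timeout_s=1.0, interval_s=0.01)
print(f"recovered after {state['polls']} polls")
```

Comparing the measured recovery time against an agreed target gives the experiment a clear pass or fail condition instead of a visual check.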