AI-Powered Chaos Engineering: Automating Resilience on AWS
Learn how to use Generative AI and AWS Fault Injection Service (FIS) to automate resilience testing, disaster recovery validation, and chaos experiments.
INNOVATION DECK
AI-Powered Resilience Testing & DR Validation
Using Generative AI + AWS Fault Injection Service to automate chaos engineering at scale
"Everything fails all the time." - Dr. Werner Vogels, Amazon CTO
THE PROBLEM
Resilience Testing Is Too Slow & Manual
Traditional resilience testing and DR validation rely on manual processes, tribal knowledge, and infrequent testing cycles, leaving teams blind to failure scenarios until production incidents occur.
Weeks to design a single chaos experiment
Engineers spend most time writing boilerplate, not validating
Unknown dependencies & single points of failure
Inventory is scattered, assumptions are unverified
RCAs gather dust
Past incidents are not converted into repeatable automated tests
THE SOLUTION
Generative AI + AWS FIS: Automated Chaos at Scale
An agentic AI system that discovers, designs, and executes resilience experiments automatically
AI agents scan your environment via AWS Systems Manager Inventory, mapping services, dependencies, and configuration automatically
Bedrock-powered agents generate failure hypotheses and produce validated SSM Automation documents, with no manual scripting
AWS FIS orchestrates safe, controlled chaos experiments with rollback, preconditions, and isolation boundaries built in
~90% reduction in experiment setup time
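The discovery step above can be sketched in a few lines. This is a minimal illustration, not the deck's actual agent code: it assumes a boto3-style SSM client is passed in (for example `boto3.client("ssm")`), and it assumes the `AWS:Application` and `AWS:Service` inventory types expose `Name` and `Status` attributes as shown. The helper names `iter_inventory_entries` and `summarize_instance` are hypothetical.

```python
from typing import Any, Dict, Iterator, List


def iter_inventory_entries(ssm_client: Any, instance_id: str,
                           type_name: str) -> Iterator[Dict[str, str]]:
    """Yield every inventory entry of one type for one instance, following pagination."""
    kwargs = {"InstanceId": instance_id, "TypeName": type_name}
    while True:
        page = ssm_client.list_inventory_entries(**kwargs)
        yield from page.get("Entries", [])
        token = page.get("NextToken")
        if not token:
            return
        kwargs["NextToken"] = token


def summarize_instance(ssm_client: Any, instance_id: str) -> Dict[str, List[str]]:
    """Collect the facts a hypothesis-generating agent would be prompted with."""
    applications = sorted({
        entry["Name"]
        for entry in iter_inventory_entries(ssm_client, instance_id, "AWS:Application")
        if "Name" in entry
    })
    # Assumption: AWS:Service entries carry a "Status" attribute such as "Running".
    running_services = sorted(
        entry["Name"]
        for entry in iter_inventory_entries(ssm_client, instance_id, "AWS:Service")
        if entry.get("Status") == "Running"
    )
    return {"applications": applications, "running_services": running_services}
```

Passing the client in as an argument keeps the sketch testable without AWS credentials; the resulting summary is the kind of structured context that could be handed to a Bedrock-backed agent.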
TECHNOLOGY STACK
Key AWS Services & Components
AWS Fault Injection Service (FIS)
Orchestrates resilience experiments with native chaos actions: CPU/memory stress, network latency, power interruption. Provides reusable experiment templates and safe execution guardrails.
AWS Systems Manager
Inventory module discovers installed software, services, network config, DB connections, and online EC2 instances. SSM Automation Documents implement custom OS/app-level impairments.
Amazon Bedrock (Generative AI)
Powers intelligent agents for inventory analysis, failure hypothesis generation, and automated SSM document creation. Reduces boilerplate from days to minutes.
AWS Strands Framework
Agent SDK for building and running multi-agent systems. Handles model selection, AWS API tool exposure, prompts, and callbacks with minimal code required.
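To make the "reusable experiment templates and safe execution guardrails" concrete, here is a hedged sketch of a template body shaped for the FIS `CreateExperimentTemplate` API. It assumes the AWS-provided `AWSFIS-Run-CPU-Stress` SSM document and its `DurationSeconds` parameter; the function name, tag values, and ARNs are placeholders, and parameter names should be verified against the current document version before use.

```python
import json
from typing import Any, Dict


def build_cpu_stress_template(role_arn: str, alarm_arn: str, tag_key: str,
                              tag_value: str, region: str = "us-east-1",
                              duration_minutes: int = 2) -> Dict[str, Any]:
    """Return a dict suitable for fis.create_experiment_template(**template)."""
    document_arn = f"arn:aws:ssm:{region}::document/AWSFIS-Run-CPU-Stress"
    return {
        "description": "CPU stress on one tagged instance",
        "roleArn": role_arn,
        "targets": {
            "app-instances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                # COUNT(1) keeps the blast radius to a single instance.
                "selectionMode": "COUNT(1)",
            }
        },
        "actions": {
            "stress-cpu": {
                "actionId": "aws:ssm:send-command",
                "parameters": {
                    "documentArn": document_arn,
                    # FIS expects the SSM document parameters as a JSON string.
                    "documentParameters": json.dumps(
                        {"DurationSeconds": str(duration_minutes * 60)}
                    ),
                    "duration": f"PT{duration_minutes}M",
                },
                "targets": {"Instances": "app-instances"},
            }
        },
        # Guardrail: FIS stops the experiment if this CloudWatch alarm fires.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
        "tags": {"Purpose": "resilience-test"},
    }
```

The stop condition and the `COUNT(1)` selection mode are the two settings that bound the experiment; everything else describes what to break and for how long.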
ARCHITECTURE
Multi-Agent Chaos Engineering System
Specialized AI agents collaborate across the full resilience testing lifecycle
1
Hypothesis Generator
Analyzes inventory, generates failure scenarios
2
Prioritization Agent
Ranks experiments by risk & impact
3
Experiment Designer
Generates SSM Automation docs (PowerShell)
4
Experiment Executor
Triggers FIS templates via API/CLI
5
Monitor & Iterate
Observes results, feeds learning loop
Powered by Amazon Bedrock + AWS Strands SDK
Integrated with CloudWatch + Incident Tooling
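The deck does not say how the Prioritization Agent scores experiments. As one plausible sketch, a simple likelihood-times-impact score with reversible experiments preferred on ties would look like the following; the 1-to-5 scales, the `Hypothesis` fields, and the tie-break rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hypothesis:
    name: str
    likelihood: int   # 1 (rare) to 5 (frequent); hypothetical scale
    impact: int       # 1 (minor) to 5 (outage); hypothetical scale
    reversible: bool  # True if the impairment can be rolled back cleanly

    @property
    def risk_score(self) -> int:
        return self.likelihood * self.impact


def prioritize(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    """Highest risk first; on ties prefer reversible experiments, then sort by name."""
    return sorted(
        hypotheses,
        key=lambda h: (-h.risk_score, not h.reversible, h.name),
    )
```

In practice an LLM-backed agent might produce the likelihood and impact estimates, while a deterministic ranking like this keeps the final ordering auditable.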
DEMO WALKTHROUGH
From Inventory to Experiment in Minutes
1
DISCOVER
Inventory agent connects to online EC2 instance. Discovers IIS, ODBC drivers, DB dependencies, autoscaling config, and health-check endpoints via SSM Inventory.
2
HYPOTHESIZE
Agent generates failure hypotheses: block DB port via security group, impair IIS app pool, inject network latency, all ranked by impact.
3
GENERATE
Document generator agent produces SSM Automation document with preconditions, safety checks, PowerShell impairment steps, and cleanup/rollback.
4
VALIDATE
Team reviews generated documents. Validates them in a production-like (lower-risk) environment before the full experiment.
5
EXECUTE
FIS experiment template ties native FIS actions + SSM document. Runs via console, API, or CLI. Results feed monitoring.
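The EXECUTE step, run through the API, might look like the sketch below. It assumes a boto3-style FIS client (for example `boto3.client("fis")`) with `start_experiment`, `get_experiment`, and `stop_experiment`; the function name, polling interval, and timeout policy are illustrative choices rather than anything the deck prescribes.

```python
import time
import uuid
from typing import Any, Callable, Dict

# Experiment states after which polling can stop.
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def run_experiment(fis_client: Any, template_id: str,
                   poll_seconds: float = 15.0,
                   timeout_seconds: float = 1800.0,
                   sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Start an experiment from a template and poll until it reaches a terminal state."""
    started = fis_client.start_experiment(
        experimentTemplateId=template_id,
        clientToken=str(uuid.uuid4()),  # idempotency token for the start call
    )
    experiment_id = started["experiment"]["id"]
    waited = 0.0
    while True:
        experiment = fis_client.get_experiment(id=experiment_id)["experiment"]
        if experiment["state"]["status"] in TERMINAL_STATES:
            return experiment
        if waited >= timeout_seconds:
            # Safety net: ask FIS to stop the experiment rather than leave it running.
            fis_client.stop_experiment(id=experiment_id)
            raise TimeoutError(f"experiment {experiment_id} exceeded {timeout_seconds}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

The returned experiment record is what a monitoring agent would consume next, alongside CloudWatch metrics gathered during the run.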
SAFETY FIRST
Agent Guardrails & Experiment Safety
AI agents operate under the equivalent of detailed job descriptions, with explicit permissions, constraints, and rollback built in
AGENTS ARE ALLOWED TO
Discover inventory and dependencies
Generate SSM documents for approved targets
Execute experiments within isolation boundaries
Roll back and restore state after each test
AGENTS MUST NEVER
Impair Systems Manager agent (management connectivity)
Modify critical production infra beyond permitted scope
Run experiments without passing precondition checks
Skip cleanup or state restoration steps
PRECONDITION CHECKS BEFORE EVERY EXPERIMENT
✓ Instance is online
✓ Target service is running
✓ Sufficient disk space
✓ Required modules installed
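The allow/deny rules and precondition checks above can be enforced in code before any impairment runs. The following is a minimal sketch: the check names mirror the list above, but the `check_guardrails` function, the exception type, and the idea of passing checks as callables are assumptions for illustration. The protected names are the usual SSM Agent service names on Windows and Linux.

```python
from typing import Callable, Dict, List

# Services an experiment must never impair: losing SSM Agent cuts management connectivity.
PROTECTED_SERVICES = {"amazonssmagent", "amazon-ssm-agent"}


class GuardrailViolation(Exception):
    """Raised when an experiment must not proceed."""


def check_guardrails(target_service: str,
                     preconditions: Dict[str, Callable[[], bool]]) -> List[str]:
    """Return the names of passed checks, or raise before any impairment runs."""
    if target_service.lower() in PROTECTED_SERVICES:
        raise GuardrailViolation(f"{target_service} is protected and cannot be impaired")
    failed = [name for name, check in preconditions.items() if not check()]
    if failed:
        raise GuardrailViolation("preconditions failed: " + ", ".join(failed))
    return list(preconditions)
```

A caller would supply real probes, for example a callable that queries SSM for the instance's ping status, and run the impairment only if this function returns without raising.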
FRAMEWORK
The Resilience Lifecycle
AI accelerates the design, testing, and learning phases; humans stay in control
SET OBJECTIVES
Define resilience goals, SLOs, and acceptable failure boundaries
DESIGN & IMPLEMENT
Architect for fault tolerance; AI assists with dependency mapping
EVALUATE & TEST
Run chaos experiments via FIS; AI generates scenarios automatically
OPERATE
Monitor live systems; AI-powered DevOps agents reduce MTTR
LEARN & RESPOND
Convert incidents/RCAs into automated tests; close the feedback loop
Resilience Lifecycle Framework, released Oct 2023
BUSINESS IMPACT
Faster, Smarter, Safer Resilience
~90%
Reduction in experiment setup time
Days → Hours
From hypothesis to running experiment
↓ MTTR
Faster incident response via AI-assisted monitoring and automated DR playbooks
Unknown Dependencies Surfaced
Inventory agents reveal hidden single points of failure before they cause outages
RCAs Become Tests
Past incidents are automatically converted into reproducible chaos experiments that validate mitigations
Human-AI Collaboration
AI drafts and accelerates; engineers validate and refine, shifting focus from writing plumbing to delivering resilience
BEST PRACTICES
Building Safe & Effective Experiments
SSM AUTOMATION DOCUMENTS
Make documents modular and idempotent: safe to run multiple times without side effects
Always include precondition validation before impairment steps
Build restoration/rollback flows into every document
Use native FIS actions where available; author SSM docs only for OS/app-level gaps
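The practices above (precondition first, idempotent impairment, rollback always reachable) can be seen in the shape of a generated document. This sketch builds an SSM Automation document (schema version 0.3) as a Python dict; the IIS app pool scenario, step names, and PowerShell commands are illustrative assumptions, not the output of the deck's generator agent, and should be reviewed before being registered with `create_document`.

```python
import json
from typing import Any, Dict, List


def _powershell_step(name: str, commands: List[str], on_failure: str) -> Dict[str, Any]:
    """One aws:runCommand step that runs PowerShell on the target instance."""
    return {
        "name": name,
        "action": "aws:runCommand",
        "onFailure": on_failure,
        "inputs": {
            "DocumentName": "AWS-RunPowerShellScript",
            "InstanceIds": ["{{ InstanceId }}"],
            "Parameters": {"commands": commands},
        },
    }


def build_app_pool_document(app_pool: str = "DefaultAppPool",
                            outage_seconds: int = 60) -> Dict[str, Any]:
    restore = "step:RestoreAppPool"  # any failure after impairment jumps to cleanup
    load_module = "Import-Module WebAdministration -ErrorAction Stop"
    precheck = _powershell_step("CheckPreconditions", [
        load_module,
        "if ((Get-Service -Name W3SVC).Status -ne 'Running') { throw 'IIS is not running' }",
    ], on_failure="Abort")
    # Idempotent: only stops the pool if it is not already stopped.
    impair = _powershell_step("StopAppPool", [
        load_module,
        f"if ((Get-WebAppPoolState -Name '{app_pool}').Value -ne 'Stopped') "
        f"{{ Stop-WebAppPool -Name '{app_pool}' }}",
    ], on_failure=restore)
    hold = {"name": "HoldImpairment", "action": "aws:sleep", "onFailure": restore,
            "inputs": {"Duration": f"PT{outage_seconds}S"}}
    # Idempotent: only starts the pool if it is not already started.
    cleanup = _powershell_step("RestoreAppPool", [
        load_module,
        f"if ((Get-WebAppPoolState -Name '{app_pool}').Value -ne 'Started') "
        f"{{ Start-WebAppPool -Name '{app_pool}' }}",
    ], on_failure="Abort")
    cleanup["isEnd"] = True
    return {
        "schemaVersion": "0.3",
        "description": f"Temporarily stop IIS app pool {app_pool}, then restore it",
        "parameters": {"InstanceId": {"type": "String",
                                      "description": "Target EC2 instance ID"}},
        "mainSteps": [precheck, impair, hold, cleanup],
    }
```

Serializing the result with `json.dumps(build_app_pool_document())` gives the document content; routing every post-impairment failure to `RestoreAppPool` is what keeps rollback from being skipped.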
CHAOS EXPERIMENTS
Validate automation in lower-risk, production-like environments first
Keep experiments focused and controlled: purpose-driven chaos, not random destruction
Define isolation boundaries and strictly limit blast radius
Log everything: preconditions, execution steps, cleanup, and outcomes
"Practice chaos engineering with purpose rather than random destruction."
GET STARTED
Next Steps & Resources
YOUR NEXT STEPS
Review the AWS Resilience Lifecycle Framework, your north star for resilient application design (Oct 2023)
Explore the FIS native actions library and best-practices blog, and use them as context for your agent prompts
Pilot the multi-agent chaos engineering code, then validate and adapt it for your environment
Start with one production-like workload: run the inventory → hypothesize → test cycle
KEY RESOURCES
AWS Resilience Lifecycle Framework
FIS Actions & Template Library
SSM Inventory & Automation Docs Guide
AWS Strands Agent Framework (SDK)
AWS Resiliency Analyst & Fault Isolation Boundaries
Test at your own risk: validate and adapt before running in production
Build Resilience.
Before It Builds You.
Generative AI + AWS FIS gives your team the power to discover, design, and validate resilience experiments at unprecedented speed, so you're ready when it matters most.
Questions & Discussion
aws.amazon.com/fis
AWS Resilience Hub
- chaos-engineering
- aws-fis
- generative-ai
- amazon-bedrock
- resilience-testing
- devops
- disaster-recovery
- automation