# AI-Powered Chaos Engineering: Automating Resilience on AWS
> Learn how to use Generative AI and AWS Fault Injection Service (FIS) to automate resilience testing, disaster recovery validation, and chaos experiments.

Tags: chaos-engineering, aws-fis, generative-ai, amazon-bedrock, resilience-testing, devops, disaster-recovery, automation
## AI-Powered Resilience Testing & DR Validation
* Automating chaos engineering at scale using GenAI + AWS Fault Injection Service (FIS).
* Quotation from Dr. Werner Vogels: "Everything fails all the time."

## The Problem: Manual Resilience Testing
* Traditional testing is slow, relies on tribal knowledge, and has infrequent cycles.
* Challenges include weeks of design for a single experiment and unknown single points of failure.

## The Solution: Agentic AI + AWS FIS
* Use AI agents to scan environments via AWS Systems Manager Inventory.
* Amazon Bedrock-powered agents generate failure hypotheses and SSM Automation documents.
* Results in a ~90% reduction in experiment setup time.

## Technology Stack
* **AWS Fault Injection Service (FIS):** Orchestrates actions like CPU/memory stress and network latency.
* **AWS Systems Manager:** Discovers configuration and implements app-level impairments.
* **Amazon Bedrock:** Powers agents for hypothesis generation and document creation.
* **AWS Strands Framework:** SDK for multi-agent system orchestration.

## Multi-Agent System Architecture
1. **Hypothesis Generator:** Analyzes inventory.
2. **Prioritization Agent:** Ranks by risk/impact.
3. **Experiment Designer:** Generates PS1/SSM docs.
4. **Experiment Executor:** Triggers FIS templates.
5. **Monitor & Iterate:** Feeds the learning loop.

## Demo Walkthrough & Safety
* Process flows from discovery to execution in minutes.
* **Guardrails:** Agents must never impair management connectivity or run tests without precondition checks (e.g., verifying instance is online).

## Business Impact
* Reduction of setup time from days to hours.
* Lowered MTTR through automated DR playbooks.
* Conversion of past Root Cause Analyses (RCAs) into repeatable automated tests.

## Best Practices & Getting Started
* Make SSM documents modular and idempotent.
* Define strict isolation boundaries to limit blast radius.
* Key Resources: AWS Resilience Lifecycle Framework (Oct 2023) and AWS Resilience Hub.
---
This presentation was created with [Bobr AI](https://bobr.ai) — an AI presentation generator.