AI-Powered Chaos Engineering: Automating Resilience on AWS

Learn how to use Generative AI and AWS Fault Injection Service (FIS) to automate resilience testing, disaster recovery validation, and chaos experiments.

#chaos-engineering #aws-fis #generative-ai #amazon-bedrock #resilience-testing #devops #disaster-recovery #automation

01

INNOVATION DECK

AI-Powered Resilience Testing & DR Validation

Using Generative AI + AWS Fault Injection Service to automate chaos engineering at scale

"Everything fails all the time." — Dr. Werner Vogels, Amazon CTO

Made by

02

THE PROBLEM

Resilience Testing Is Too Slow & Manual

Traditional resilience testing and DR validation rely on manual processes, tribal knowledge, and infrequent testing cycles, leaving teams blind to failure scenarios until production incidents occur.

🕐

Weeks to design a single chaos experiment

Engineers spend most of their time writing boilerplate rather than validating resilience

🔍

Unknown dependencies & single points of failure

Inventory is scattered, assumptions are unverified

📋

RCAs gather dust

Past incidents are not converted into repeatable automated tests

03

THE SOLUTION

Generative AI + AWS FIS: Automated Chaos at Scale

An agentic AI system that discovers, designs, and executes resilience experiments automatically

⚡

DISCOVER

AI agents scan your environment via AWS Systems Manager Inventory, mapping services, dependencies, and configuration automatically

🧠

DESIGN

Bedrock-powered agents generate failure hypotheses and produce validated SSM Automation documents — no manual scripting

🚀

EXECUTE

AWS FIS orchestrates safe, controlled chaos experiments with rollback, preconditions, and isolation boundaries built in

~90% reduction in experiment setup time

04

TECHNOLOGY STACK

Key AWS Services & Components

AWS Fault Injection Service (FIS)

Orchestrates resilience experiments with native chaos actions: CPU/memory stress, network latency, power interruption. Provides reusable experiment templates and safe execution guardrails.

AWS Systems Manager

Inventory module discovers installed software, services, network config, DB connections, and online EC2 instances. SSM Automation Documents implement custom OS/app-level impairments.
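
As a rough illustration of how an agent might condense raw inventory into dependency hints, the sketch below assumes entries shaped like the items Systems Manager Inventory reports for the AWS:Application and AWS:Service types (each with at least a Name field). The keyword map, the function name, and the sample data are all hypothetical and invented for illustration.

```python
from typing import Dict, List

# Hypothetical keyword map: substring of an inventory item name -> dependency category.
DEPENDENCY_HINTS = {
    "iis": "web-server",
    "w3svc": "web-server",
    "odbc": "database-client",
    "sql": "database-client",
}


def find_dependency_hints(entries_by_type: Dict[str, List[dict]]) -> Dict[str, List[str]]:
    """Group inventory item names by the dependency category they suggest."""
    hints: Dict[str, List[str]] = {}
    for entries in entries_by_type.values():
        for entry in entries:
            name = entry.get("Name", "")
            for keyword, category in DEPENDENCY_HINTS.items():
                if keyword in name.lower():
                    hints.setdefault(category, []).append(name)
                    break  # one category per item is enough for a first pass
    return hints


# Invented sample data, not real inventory output.
sample = {
    "AWS:Application": [
        {"Name": "Microsoft ODBC Driver 17 for SQL Server", "Version": "17.10"},
        {"Name": "7-Zip", "Version": "23.01"},
    ],
    "AWS:Service": [
        {"Name": "W3SVC", "Status": "Running"},
    ],
}
print(find_dependency_hints(sample))
```

A keyword match like this is only a starting point; in the deck's design the Bedrock-backed agent would do the actual interpretation.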

Amazon Bedrock (Generative AI)

Powers intelligent agents for inventory analysis, failure hypothesis generation, and automated SSM document creation. Reduces boilerplate from days to minutes.

Strands Agents SDK

Open-source agent SDK from AWS for building and running multi-agent systems. Handles model selection, exposure of AWS APIs as tools, prompts, and callbacks with minimal code.

05

ARCHITECTURE

Multi-Agent Chaos Engineering System

Specialized AI agents collaborate across the full resilience testing lifecycle

1

🔍

Hypothesis Generator

Analyzes inventory, generates failure scenarios

2

📊

Prioritization Agent

Ranks experiments by risk & impact

3

🛠️

Experiment Designer

Generates SSM Automation docs (PowerShell)

4

▶️

Experiment Executor

Triggers FIS templates via API/CLI

5

📈

Monitor & Iterate

Observes results, feeds learning loop

Powered by Amazon Bedrock + Strands Agents SDK

Integrated with CloudWatch + Incident Tooling
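
The deck does not say how the Prioritization Agent scores experiments. One simple possibility, sketched here purely as an assumption, is a likelihood-times-impact score. The 1 to 5 scales, the class and function names, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hypothesis:
    name: str
    likelihood: int  # 1 (rare) to 5 (frequent); scale is an assumption
    impact: int      # 1 (minor) to 5 (severe); scale is an assumption

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


def prioritize(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    """Order hypotheses from highest to lowest score; ties are broken by name."""
    return sorted(hypotheses, key=lambda h: (-h.score, h.name))


# Invented values for the three hypotheses named in the demo walkthrough.
candidates = [
    Hypothesis("inject network latency", likelihood=4, impact=2),
    Hypothesis("block DB port via security group", likelihood=3, impact=5),
    Hypothesis("impair IIS app pool", likelihood=4, impact=3),
]
for h in prioritize(candidates):
    print(h.score, h.name)
```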

06

DEMO WALKTHROUGH

From Inventory to Experiment in Minutes

1

DISCOVER: The inventory agent connects to an online EC2 instance and discovers IIS, ODBC drivers, DB dependencies, autoscaling config, and health-check endpoints via SSM Inventory.

2

HYPOTHESIZE: The agent generates failure hypotheses (block the DB port via a security group, impair the IIS app pool, inject network latency) and ranks them by impact.

3

GENERATE: The document generator agent produces an SSM Automation document with preconditions, safety checks, PowerShell impairment steps, and cleanup/rollback.

4

VALIDATE: The team reviews the generated documents and validates them in a production-like (lower-risk) environment before the full experiment.

5

EXECUTE: An FIS experiment template combines native FIS actions with the generated SSM document. It runs via console, API, or CLI, and the results feed monitoring.
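
To make the EXECUTE step more concrete, here is a minimal sketch of an experiment template that pairs a generated SSM Automation document with one native FIS action. The field names follow the FIS experiment template format as I understand it; the ARNs, the chaos-ready tag, the InstanceId document parameter, and the function name are hypothetical placeholders.

```python
import json


def build_experiment_template(role_arn, alarm_arn, document_arn, instance_id):
    """Assemble an FIS experiment template as a plain dict (nothing is sent to AWS)."""
    return {
        "description": "Impair IIS via SSM Automation, then stop one tagged instance",
        "roleArn": role_arn,
        # Halt the experiment automatically if this CloudWatch alarm fires.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
        "targets": {
            "one-web-instance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"chaos-ready": "true"},  # hypothetical opt-in tag
                "selectionMode": "COUNT(1)",  # limit blast radius to one instance
            }
        },
        "actions": {
            "impair-app": {
                "actionId": "aws:ssm:start-automation-execution",
                "parameters": {
                    "documentArn": document_arn,
                    # Assumes the generated document takes an InstanceId parameter.
                    "documentParameters": json.dumps({"InstanceId": instance_id}),
                    "maxDuration": "PT10M",
                },
            },
            "stop-instance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "one-web-instance"},
                "startAfter": ["impair-app"],  # run only after the SSM step finishes
            },
        },
    }


template = build_experiment_template(
    role_arn="arn:aws:iam::111122223333:role/example-fis-role",
    alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:example-alarm",
    document_arn="arn:aws:ssm:us-east-1:111122223333:document/example-impair-iis",
    instance_id="i-0123456789abcdef0",
)
print(json.dumps(template, indent=2))
```

A dict like this could then be passed to the FIS CreateExperimentTemplate API (for example through boto3's FIS client) once a human has reviewed it.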

document-agent.yaml

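description: "Impair IIS / DB Latency"
schemaVersion: "0.3"
parameters:
  InstanceId:
    type: String
    description: "Target instance (parameter name assumed for illustration)"
mainSteps:
  - name: PreconditionsCheck
    action: aws:runCommand
    inputs:
      DocumentName: AWS-RunPowerShellScript
      InstanceIds:
        - "{{ InstanceId }}"
      Parameters:
        commands:
          - |-
            # Check that the IIS app pool is running before any impairment
            Import-Module WebAdministration
            $status = (Get-WebAppPoolState -Name "DefaultAppPool").Value
            if ($status -ne "Started") {
              Write-Error "Precondition failed"
              exit 1
            }
            Write-Host "Precondition met."
  # Impairment steps are not shown on the slide
  - name: CleanupAndRollback
    action: aws:runCommand
    inputs:
      DocumentName: AWS-RunPowerShellScript
      InstanceIds:
        - "{{ InstanceId }}"
      Parameters:
        commands:
          - |-
            # Start the app pool again to restore service
            Import-Module WebAdministration
            Start-WebAppPool -Name "DefaultAppPool"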

07

SAFETY FIRST

Agent Guardrails & Experiment Safety

Agent prompts read like detailed job descriptions, with explicit permissions, constraints, and rollback built in

✅ AGENTS ARE ALLOWED TO

Discover inventory and dependencies

Generate SSM documents for approved targets

Execute experiments within isolation boundaries

Roll back and restore state after each test

🚫 AGENTS MUST NEVER

Impair the Systems Manager Agent (SSM Agent), which provides management connectivity

Modify critical production infra beyond permitted scope

Run experiments without passing precondition checks

Skip cleanup or state restoration steps

PRECONDITION CHECKS BEFORE EVERY EXPERIMENT

✓ Instance is online

✓ Target service is running

✓ Sufficient disk space

✓ Required modules installed
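
Some of these guardrails can be checked mechanically before a human ever reviews a generated document. The sketch below is one possible static check, assuming the document has already been parsed into a dict and that step names follow a hypothetical convention (first step contains "precondition", last step contains "cleanup" or "rollback"). The function name and sample documents are invented.

```python
from typing import List

# Service names of the SSM Agent on Windows and Linux; impairing it cuts management access.
FORBIDDEN_TOKENS = ("amazonssmagent", "amazon-ssm-agent")


def check_document(doc: dict) -> List[str]:
    """Return a list of guardrail violations; an empty list means the document passes."""
    violations: List[str] = []
    steps = doc.get("mainSteps", [])
    if not steps:
        return ["document has no steps"]
    first, last = steps[0].get("name", ""), steps[-1].get("name", "")
    if "precondition" not in first.lower():
        violations.append("first step must be a precondition check")
    if not any(word in last.lower() for word in ("cleanup", "rollback")):
        violations.append("last step must be cleanup or rollback")
    for step in steps:
        text = str(step.get("inputs", "")).lower()
        if any(token in text for token in FORBIDDEN_TOKENS):
            violations.append(f"step {step.get('name')} references the SSM Agent")
    return violations


good = {"mainSteps": [
    {"name": "PreconditionsCheck", "inputs": {"commands": ["Get-WebAppPoolState DefaultAppPool"]}},
    {"name": "StopAppPool", "inputs": {"commands": ["Stop-WebAppPool -Name DefaultAppPool"]}},
    {"name": "CleanupAndRollback", "inputs": {"commands": ["Start-WebAppPool -Name DefaultAppPool"]}},
]}
bad = {"mainSteps": [
    {"name": "StopAgent", "inputs": {"commands": ["Stop-Service AmazonSSMAgent"]}},
]}
print(check_document(good))
print(check_document(bad))
```

A check like this complements, and does not replace, the runtime precondition steps inside the document itself.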

08

FRAMEWORK

The Resilience Lifecycle

AI accelerates the design, testing, and learning phases — humans stay in control

🎯

SET OBJECTIVES

Define resilience goals, SLOs, and acceptable failure boundaries

🏗️

DESIGN & IMPLEMENT (AI-Assisted)

Architect for fault tolerance; AI assists with dependency mapping

🧪

EVALUATE & TEST (AI-Assisted)

Run chaos experiments via FIS; AI generates scenarios automatically

⚙️

OPERATE (AI-Assisted)

Monitor live systems; AI-powered DevOps agents reduce MTTR

📚

LEARN & RESPOND (AI-Assisted)

Convert incidents/RCAs into automated tests; close the feedback loop

Resilience Lifecycle Framework — released Oct 2023

09

BUSINESS IMPACT

Faster, Smarter, Safer Resilience

~90%

Reduction in experiment setup time

Days → Hours

From hypothesis to running experiment

↓ MTTR

Faster incident response via AI-assisted monitoring and automated DR playbooks

Unknown Dependencies Surfaced

Inventory agents reveal hidden single points of failure before they cause outages

RCAs Become Tests

Past incidents are automatically converted into reproducible chaos experiments that validate mitigations

Human-AI Collaboration

AI drafts and accelerates; engineers validate and refine — shifting focus from writing plumbing to delivering resilience

10

BEST PRACTICES

Building Safe & Effective Experiments

SSM AUTOMATION DOCUMENTS

Make documents modular and idempotent — safe to run multiple times without side effects

Always include precondition validation before impairment steps

Build restoration/rollback flows into every document

Use native FIS actions where available; author SSM docs only for OS/app-level gaps

CHAOS EXPERIMENTS

Validate automation in lower-risk, production-like environments first

Keep experiments focused and controlled — purpose-driven chaos, not random destruction

Define isolation boundaries and strictly limit blast radius

Log everything: preconditions, execution steps, cleanup, and outcomes
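
The precondition, rollback, and logging practices above can be combined in one small pattern. The following is a minimal sketch using a Python context manager with an in-memory stand-in for the target system; the function name and the callables are hypothetical, and a real implementation would call SSM or FIS rather than mutate a dict.

```python
import logging
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("chaos")


@contextmanager
def impairment(name, precondition, impair, restore):
    """Run an impairment only if the precondition holds, and always restore afterwards."""
    log.info("%s: checking preconditions", name)
    if not precondition():
        raise RuntimeError(f"{name}: precondition failed, nothing was changed")
    log.info("%s: applying impairment", name)
    try:
        impair()
        yield
    finally:
        # restore() should be idempotent so it is safe even if impair() failed midway.
        log.info("%s: restoring state", name)
        restore()


# In-memory stand-in for an IIS app pool, for illustration only.
state = {"app_pool": "Started"}

with impairment(
    "stop-app-pool",
    precondition=lambda: state["app_pool"] == "Started",
    impair=lambda: state.update(app_pool="Stopped"),
    restore=lambda: state.update(app_pool="Started"),
):
    observed = state["app_pool"]  # observations happen while the fault is active

print(observed, "->", state["app_pool"])
```

Placing the restore call in a finally block means cleanup runs even when the observation code raises, which is the property the rollback guidance is after.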

"Practice chaos engineering with purpose rather than random destruction."

11

GET STARTED

Next Steps & Resources

YOUR NEXT STEPS

1

Review the AWS Resilience Lifecycle Framework — your north star for resilient application design (Oct 2023)

2

Explore the FIS native actions library and best-practices blog, and use them as context for your agent prompts

3

Pilot the multi-agent chaos engineering code — validate and adapt for your environment

4

Start with one production-like workload and run a full inventory → hypothesize → test cycle

KEY RESOURCES

📖

AWS Resilience Lifecycle Framework

⚡

FIS Actions & Template Library

🔧

SSM Inventory & Automation Docs Guide

🤖

Strands Agents SDK

🛡️

AWS Resilience Analysis Framework & Fault Isolation Boundaries

Test at your own risk — validate and adapt before running in production

12

Build Resilience.

Before It Builds You.

Generative AI + AWS FIS gives your team the power to discover, design, and validate resilience experiments at unprecedented speed — so you're ready when it matters most.

Questions & Discussion

🌐 aws.amazon.com/fis

📚 AWS Resilience Hub
