Made by Bobr AI

AI-Powered Chaos Engineering: Automating Resilience on AWS

Learn how to use Generative AI and AWS Fault Injection Service (FIS) to automate resilience testing, disaster recovery validation, and chaos experiments.

#chaos-engineering #aws-fis #generative-ai #amazon-bedrock #resilience-testing #devops #disaster-recovery #automation
INNOVATION DECK

AI-Powered Resilience Testing & DR Validation

Using Generative AI + AWS Fault Injection Service to automate chaos engineering at scale

"Everything fails all the time." - Dr. Werner Vogels, Amazon CTO

THE PROBLEM

Resilience Testing Is Too Slow & Manual

Traditional resilience testing and DR validation rely on manual processes, tribal knowledge, and infrequent testing cycles, leaving teams blind to failure scenarios until production incidents occur.

Weeks to design a single chaos experiment

Engineers spend most time writing boilerplate, not validating

Unknown dependencies & single points of failure

Inventory is scattered, assumptions are unverified

RCAs gather dust

Past incidents are not converted into repeatable automated tests

THE SOLUTION

Generative AI + AWS FIS: Automated Chaos at Scale

An agentic AI system that discovers, designs, and executes resilience experiments automatically

DISCOVER

AI agents scan your environment via AWS Systems Manager Inventory, mapping services, dependencies, and configuration automatically

DESIGN

Bedrock-powered agents generate failure hypotheses and produce validated SSM Automation documents, with no manual scripting

EXECUTE

AWS FIS orchestrates safe, controlled chaos experiments with rollback, preconditions, and isolation boundaries built in

~90% reduction in experiment setup time
TECHNOLOGY STACK

Key AWS Services & Components

AWS Fault Injection Service (FIS)

Orchestrates resilience experiments with native chaos actions: CPU/memory stress, network latency, power interruption. Provides reusable experiment templates and safe execution guardrails.
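
As a rough illustration of what an experiment template with a guardrail might contain, the sketch below builds a dictionary shaped like a FIS CreateExperimentTemplate request, using only the Python standard library. The role ARN, alarm ARN, account ID, tag, and the helper name `build_cpu_stress_template` are placeholders invented for this example, not values from the deck; the action ID and the `AWSFIS-Run-CPU-Stress` document name should be checked against the current FIS actions reference before use.

```python
import json


def build_cpu_stress_template(role_arn: str, alarm_arn: str, region: str) -> dict:
    """Return a dict shaped like a FIS CreateExperimentTemplate request.

    All ARNs and tag values are caller-supplied placeholders (assumptions).
    """
    return {
        "description": "CPU stress on one tagged EC2 instance",
        "roleArn": role_arn,
        # Guardrail: FIS stops the experiment if this CloudWatch alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "targets": {
            "web-instances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"chaos-ready": "true"},  # hypothetical tag
                "selectionMode": "COUNT(1)",  # limit blast radius to one instance
            }
        },
        "actions": {
            "cpu-stress": {
                "actionId": "aws:ssm:send-command",
                "parameters": {
                    "documentArn": f"arn:aws:ssm:{region}::document/AWSFIS-Run-CPU-Stress",
                    "documentParameters": json.dumps({"DurationSeconds": "120"}),
                    "duration": "PT3M",
                },
                "targets": {"Instances": "web-instances"},
            }
        },
    }


template = build_cpu_stress_template(
    role_arn="arn:aws:iam::111122223333:role/fis-experiment-role",
    alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:web-5xx",
    region="us-east-1",
)
print(json.dumps(template, indent=2))
```

With boto3 installed and credentials configured, a dictionary like this could be passed to `boto3.client("fis").create_experiment_template(**template)`; that call is left out here so the sketch runs without AWS access.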

AWS Systems Manager

Inventory module discovers installed software, services, network config, DB connections, and online EC2 instances. SSM Automation Documents implement custom OS/app-level impairments.
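
To make the discovery step more concrete, here is a minimal sketch of how inventory rows might be grouped into dependency categories before an agent reasons about them. The sample rows, attribute names, version strings, and keyword rules are all invented for illustration; real rows would come from SSM Inventory (for example via the `list_inventory_entries` API) and may carry different fields.

```python
from collections import defaultdict

# Sample rows shaped like SSM Inventory entries (string-to-string maps).
# The attribute names and values here are illustrative assumptions.
SAMPLE_ENTRIES = {
    "AWS:Application": [
        {"Name": "Microsoft ODBC Driver 17 for SQL Server", "Version": "17.10.1.1"},
        {"Name": "IIS URL Rewrite Module 2", "Version": "7.2.1993"},
    ],
    "AWS:Service": [
        {"Name": "W3SVC", "DisplayName": "World Wide Web Publishing Service", "Status": "Running"},
        {"Name": "AmazonSSMAgent", "DisplayName": "Amazon SSM Agent", "Status": "Running"},
    ],
}

# Hypothetical keyword rules mapping inventory names to dependency categories.
RULES = {
    "database-client": ("odbc", "sql"),
    "web-server": ("iis", "w3svc"),
    "management-plane": ("ssmagent",),
}


def classify(entries_by_type):
    """Group inventory item names under the first category whose keyword matches."""
    found = defaultdict(list)
    for rows in entries_by_type.values():
        for row in rows:
            name = row.get("Name", "")
            lowered = name.lower()
            for category, keywords in RULES.items():
                if any(k in lowered for k in keywords):
                    found[category].append(name)
                    break
    return dict(found)


dependencies = classify(SAMPLE_ENTRIES)
for category, names in dependencies.items():
    print(f"{category}: {', '.join(names)}")
```

Flagging the management-plane category separately is a deliberate choice: it gives later stages an explicit list of components that experiments should leave alone.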

Amazon Bedrock (Generative AI)

Powers intelligent agents for inventory analysis, failure hypothesis generation, and automated SSM document creation. Reduces boilerplate from days to minutes.

AWS Strands Framework

Agent SDK for building and running multi-agent systems. Handles model selection, AWS API tool exposure, prompts, and callbacks, with minimal code required.

ARCHITECTURE

Multi-Agent Chaos Engineering System

Specialized AI agents collaborate across the full resilience testing lifecycle

1. Hypothesis Generator: Analyzes inventory, generates failure scenarios
2. Prioritization Agent: Ranks experiments by risk & impact
3. Experiment Designer: Generates SSM Automation docs (PowerShell)
4. Experiment Executor: Triggers FIS templates via API/CLI
5. Monitor & Iterate: Observes results, feeds learning loop
Powered by Amazon Bedrock + AWS Strands SDK
Integrated with CloudWatch + Incident Tooling
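
The deck does not say how the Prioritization Agent scores experiments. One simple possibility, sketched below, is a risk score (likelihood times impact) discounted by how many instances the experiment touches. The 1-to-5 scales, the formula, and the example scores are assumptions made for this illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    name: str
    likelihood: int    # 1 (rare) to 5 (frequent), assumed scale
    impact: int        # 1 (minor) to 5 (outage), assumed scale
    blast_radius: int  # number of instances touched by the experiment


def priority(h: Hypothesis) -> float:
    """Risk score (likelihood x impact), discounted by experiment blast radius."""
    return h.likelihood * h.impact / max(h.blast_radius, 1)


def rank(hypotheses):
    """Return hypotheses ordered from highest to lowest priority."""
    return sorted(hypotheses, key=priority, reverse=True)


candidates = [
    Hypothesis("Inject network latency", likelihood=4, impact=2, blast_radius=2),
    Hypothesis("Block DB port via security group", likelihood=3, impact=5, blast_radius=1),
    Hypothesis("Stop IIS app pool", likelihood=4, impact=3, blast_radius=1),
]
ranked = rank(candidates)
for h in ranked:
    print(f"{priority(h):5.1f}  {h.name}")
```

Dividing by blast radius favors experiments that test a serious failure mode while touching as few instances as possible, which matches the deck's emphasis on controlled experiments.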
DEMO WALKTHROUGH

From Inventory to Experiment in Minutes

1. DISCOVER: Inventory agent connects to an online EC2 instance. Discovers IIS, ODBC drivers, DB dependencies, autoscaling config, and health-check endpoints via SSM Inventory.
2. HYPOTHESIZE: Agent generates failure hypotheses (block DB port via security group, impair IIS app pool, inject network latency), ranked by impact.
3. GENERATE: Document generator agent produces an SSM Automation document with preconditions, safety checks, PowerShell impairment steps, and cleanup/rollback.
4. VALIDATE: Team reviews generated documents. Validates in a production-like (lower-risk) environment before the full experiment.
5. EXECUTE: FIS experiment template ties native FIS actions + SSM document. Runs via console, API, or CLI. Results feed monitoring.
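document-agent.yaml
description: "Impair IIS / DB Latency"
schemaVersion: "0.3"
parameters:
  InstanceId:  # assumed parameter name; not shown on the original slide
    type: String
mainSteps:
  - name: PreconditionsCheck
    action: aws:runCommand
    inputs:
      DocumentName: AWS-RunPowerShellScript
      InstanceIds:
        - "{{ InstanceId }}"
      Parameters:
        commands:
          - |-
            # Check if the IIS app pool is running
            $status = (Get-WebAppPoolState -Name "DefaultAppPool").Value
            if ($status -ne "Started") {
              Write-Error "Precondition failed"
              exit 1
            }
            Write-Host "Precondition met."
  # Impairment steps are omitted on the slide and would go here.
  - name: CleanupAndRollback
    action: aws:runCommand
    inputs:
      DocumentName: AWS-RunPowerShellScript
      InstanceIds:
        - "{{ InstanceId }}"
      Parameters:
        commands:
          - |-
            # Restart the app pool
            Start-WebAppPool -Name "DefaultAppPool"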
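
Before a human reviews a generated document, a cheap structural check can catch obvious omissions. The sketch below lints a parsed document (represented as a plain Python dict, since YAML parsing is outside the standard library) for the conventions the walkthrough describes: a precondition step first, a cleanup or rollback step last, and a `DocumentName` on every `aws:runCommand` step. The function name and the step-naming convention are assumptions for this example, not part of SSM.

```python
def lint_automation_doc(doc: dict) -> list[str]:
    """Return a list of problems found in a parsed SSM Automation document."""
    problems = []
    steps = doc.get("mainSteps", [])
    if doc.get("schemaVersion") != "0.3":
        problems.append("schemaVersion should be '0.3' for Automation documents")
    if not steps:
        return problems + ["document has no mainSteps"]
    names = [s.get("name", "") for s in steps]
    # Assumed naming convention: preconditions first, cleanup/rollback last.
    if "precondition" not in names[0].lower():
        problems.append("first step is not a precondition check")
    if not any(w in names[-1].lower() for w in ("cleanup", "rollback")):
        problems.append("last step is not a cleanup/rollback step")
    for step in steps:
        inputs = step.get("inputs", {})
        if step.get("action") == "aws:runCommand" and "DocumentName" not in inputs:
            problems.append(f"step {step.get('name')!r} is missing inputs.DocumentName")
    return problems


# A deliberately incomplete document: the cleanup step has no DocumentName.
draft = {
    "schemaVersion": "0.3",
    "mainSteps": [
        {"name": "PreconditionsCheck", "action": "aws:runCommand",
         "inputs": {"DocumentName": "AWS-RunPowerShellScript"}},
        {"name": "CleanupAndRollback", "action": "aws:runCommand", "inputs": {}},
    ],
}
issues = lint_automation_doc(draft)
print(issues)
```

A check like this does not replace the human review in step 4; it only filters out drafts that are structurally incomplete before anyone spends time on them.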
SAFETY FIRST

Agent Guardrails & Experiment Safety

Each AI agent is defined like a detailed job description, with explicit permissions, constraints, and rollback built in

✅ AGENTS ARE ALLOWED TO
Discover inventory and dependencies
Generate SSM documents for approved targets
Execute experiments within isolation boundaries
Roll back and restore state after each test
🚫 AGENTS MUST NEVER
Impair Systems Manager agent (management connectivity)
Modify critical production infra beyond permitted scope
Run experiments without passing precondition checks
Skip cleanup or state restoration steps
PRECONDITION CHECKS BEFORE EVERY EXPERIMENT
✓ Instance is online
✓ Target service is running
✓ Sufficient disk space
✓ Required modules installed
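
The allow and deny rules above could be enforced in code before any experiment is handed to FIS. The following is a minimal sketch under stated assumptions: the `ExperimentPlan` structure, the denylist fragments, and the instance IDs are hypothetical and exist only to show the shape of such a gate.

```python
from dataclasses import dataclass, field

# Hypothetical denylist: commands that would cut management connectivity.
FORBIDDEN_FRAGMENTS = ("amazonssmagent", "amazon-ssm-agent")


@dataclass
class ExperimentPlan:
    target_instance: str
    commands: list
    has_cleanup_step: bool
    preconditions: dict = field(default_factory=dict)  # check name -> passed?


def guardrail_violations(plan, approved_targets):
    """Return reasons the plan must not run; an empty list means it may proceed."""
    reasons = []
    if plan.target_instance not in approved_targets:
        reasons.append("target is outside the approved scope")
    for cmd in plan.commands:
        if any(fragment in cmd.lower() for fragment in FORBIDDEN_FRAGMENTS):
            reasons.append(f"command touches the SSM agent: {cmd}")
    failed = [name for name, ok in plan.preconditions.items() if not ok]
    if failed or not plan.preconditions:
        reasons.append(f"precondition checks not passed: {failed or 'none recorded'}")
    if not plan.has_cleanup_step:
        reasons.append("plan has no cleanup/rollback step")
    return reasons


approved = {"i-0123456789abcdef0"}  # placeholder instance ID
good_plan = ExperimentPlan(
    target_instance="i-0123456789abcdef0",
    commands=['Stop-WebAppPool -Name "DefaultAppPool"'],
    has_cleanup_step=True,
    preconditions={"instance online": True, "service running": True,
                   "disk space": True, "modules installed": True},
)
bad_plan = ExperimentPlan(
    target_instance="i-0123456789abcdef0",
    commands=["Stop-Service AmazonSSMAgent"],
    has_cleanup_step=False,
    preconditions={"instance online": True, "disk space": False},
)
print(guardrail_violations(good_plan, approved))
print(guardrail_violations(bad_plan, approved))
```

Returning a list of reasons rather than a boolean keeps the refusal auditable, which fits the deck's advice to log preconditions and outcomes.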
FRAMEWORK

The Resilience Lifecycle

AI accelerates the design, testing, and learning phases, while humans stay in control

SET OBJECTIVES

Define resilience goals, SLOs, and acceptable failure boundaries

DESIGN & IMPLEMENT (AI-Assisted)

Architect for fault tolerance; AI assists with dependency mapping

EVALUATE & TEST (AI-Assisted)

Run chaos experiments via FIS; AI generates scenarios automatically

OPERATE (AI-Assisted)

Monitor live systems; AI-powered DevOps agents reduce MTTR

LEARN & RESPOND (AI-Assisted)

Convert incidents/RCAs into automated tests; close the feedback loop

Resilience Lifecycle Framework, released Oct 2023
BUSINESS IMPACT

Faster, Smarter, Safer Resilience

~90% reduction in experiment setup time
Days → Hours: from hypothesis to running experiment
↓ MTTR: faster incident response via AI-assisted monitoring and automated DR playbooks

Unknown Dependencies Surfaced

Inventory agents reveal hidden single points of failure before they cause outages

RCAs Become Tests

Past incidents are automatically converted into reproducible chaos experiments that validate mitigations

Human-AI Collaboration

AI drafts and accelerates; engineers validate and refine, shifting focus from writing plumbing to delivering resilience

BEST PRACTICES

Building Safe & Effective Experiments

SSM AUTOMATION DOCUMENTS

Make documents modular and idempotent: safe to run multiple times without side effects

Always include precondition validation before impairment steps

Build restoration/rollback flows into every document

Use native FIS actions where available; author SSM docs only for OS/app-level gaps

CHAOS EXPERIMENTS

Validate automation in lower-risk, production-like environments first

Keep experiments focused and controlled: purpose-driven chaos, not random destruction

Define isolation boundaries and strictly limit blast radius

Log everything: preconditions, execution steps, cleanup, and outcomes
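
The practices in the two lists above (precondition first, guaranteed rollback, logging at every phase) can be combined into one small pattern. The sketch below uses a context manager with in-memory state standing in for a real service; the `impairment` helper and the `state` dictionary are hypothetical, and the rollback callable is assumed to be idempotent so it is safe to run even when the impairment only partly applied.

```python
import logging
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("chaos")


@contextmanager
def impairment(name, precondition, apply, rollback):
    """Run an impairment only if the precondition holds; always roll back."""
    log.info("precondition check: %s", name)
    if not precondition():
        log.error("precondition failed, skipping: %s", name)
        raise RuntimeError(f"precondition failed for {name}")
    try:
        log.info("applying impairment: %s", name)
        apply()
        yield
    finally:
        # Runs even if apply() or the observation phase raises.
        rollback()
        log.info("rolled back: %s", name)


# In-memory stand-in for a real service (assumption for this example).
state = {"app_pool": "Started"}

with impairment(
    "stop app pool",
    precondition=lambda: state["app_pool"] == "Started",
    apply=lambda: state.update(app_pool="Stopped"),
    rollback=lambda: state.update(app_pool="Started"),
):
    observed = state["app_pool"]  # observe the system while impaired

print(observed, "->", state["app_pool"])
```

Placing `apply()` inside the `try` block means rollback also runs after a partial failure, which is why the rollback step needs to be idempotent.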

"Practice chaos engineering with purpose rather than random destruction."

GET STARTED

Next Steps & Resources

YOUR NEXT STEPS

1. Review the AWS Resilience Lifecycle Framework, your north star for resilient application design (Oct 2023)
2. Explore the FIS native actions library and best-practices blog, and use them as context for your agent prompts
3. Pilot the multi-agent chaos engineering code; validate and adapt it for your environment
4. Start with one production-like workload and run the inventory → hypothesize → test cycle

KEY RESOURCES

AWS Resilience Lifecycle Framework
FIS Actions & Template Library
SSM Inventory & Automation Docs Guide
AWS Strands Agent Framework (SDK)
AWS Resiliency Analyst & Fault Isolation Boundaries
Test at your own risk: validate and adapt before running in production

Build Resilience.

Before It Builds You.

Generative AI + AWS FIS gives your team the power to discover, design, and validate resilience experiments at unprecedented speed, so you're ready when it matters most.

Questions & Discussion
🌐 aws.amazon.com/fis
📚 AWS Resilience Hub
