AI’s surprising role in chaos engineering

Software quality isn’t just about clean code. Bartek Pisulak, Director of Cloud Quality Engineering at Pegasystems, says it’s also about architecture, documentation, and processes working in harmony. But today’s systems are so complex that even small failures can ripple into outages across your entire stack. Understanding those interactions—and testing them—has become nearly impossible without help. That’s where AI comes in.

In this episode of the PurePerformance podcast, Bartek joins hosts Andi Grabner and Brian Wilson to explore how artificial intelligence is transforming chaos engineering, making it faster, smarter, and more effective at uncovering weaknesses before they become disasters.

How AI can enhance the five stages of chaos experiments

Pisulak walked through the five stages of a chaos experiment and explained how AI fits into each one:

  1. Defining steady state. AI learns your system’s “normal” by tapping into observability systems to detect patterns. Traditionally, this was done by setting static thresholds for individual metrics. Pisulak suggests instead using AI to analyze historical data, learn normal system behavior, and predict failures before they happen. Causal AI systems could do this before generative AI hit the scene, and it remains a key starting point.
  2. Generating hypotheses. Instead of hunting for weak spots manually, Pisulak recommends feeding your system’s architecture diagrams, logs, and documentation to AI-powered tools such as Chaos Recommendation for Kraken or ChaosEater, which can surface edge cases, single points of failure, and unusual dependencies.
  3. Running experiments. This is where AI coding tools come in. They make it much quicker and easier to write the scripts that run your experiments.
  4. Verifying results. After experiments run, AI can sort through the avalanche of data and analyze your dependencies and software bill of materials (SBOM) to help prioritize fixes and flag which findings are most likely to be false positives.
  5. Improving the system. AI can propose code changes or configuration improvements. In advanced use cases, it helps create a feedback loop, automatically fixing issues, re-running previously failed tests, and driving continuous improvement.
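To make stages 1 and 4 more concrete, here is a minimal Python sketch of learning a steady-state band from historical data and then checking whether a metric stayed inside it during an experiment. It is not Pisulak's method or any tool's API: the names `SteadyState`, `learn_steady_state`, and `hypothesis_holds` are hypothetical, and a simple mean plus or minus k standard deviations stands in for the far richer models an AI-driven observability platform would use.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class SteadyState:
    """Band of 'normal' metric values learned from historical samples (hypothetical type)."""
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


def learn_steady_state(history: list[float], k: float = 3.0) -> SteadyState:
    """Derive a normal band as mean +/- k sample standard deviations.

    Assumption: a simple statistical stand-in for a learned baseline.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    mu = mean(history)
    sigma = stdev(history)
    return SteadyState(low=mu - k * sigma, high=mu + k * sigma)


def hypothesis_holds(
    baseline: SteadyState,
    during_experiment: list[float],
    tolerance: float = 0.05,
) -> bool:
    """Return True if at most `tolerance` of the samples left the steady-state band."""
    if not during_experiment:
        raise ValueError("no samples recorded during the experiment")
    outside = sum(1 for v in during_experiment if not baseline.contains(v))
    return outside / len(during_experiment) <= tolerance


if __name__ == "__main__":
    # Illustrative latency samples in milliseconds (made-up data).
    history = [100, 102, 98, 101, 99, 100, 103, 97]
    baseline = learn_steady_state(history)
    print(baseline)
    print(hypothesis_holds(baseline, [101, 99, 104, 100]))
    print(hypothesis_holds(baseline, [101, 250, 300, 100]))
```

The point of the design is that the band is derived from observed behavior rather than hand-picked, so the same check can be re-run after a fix to confirm that a previously failing experiment now passes.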

In short, the value of AI in chaos engineering lies in making the entire process more intelligent and robust, freeing engineers to innovate and build stronger systems.

Our perspective: Organizations are under pressure to move from AI proofs of concept to demonstrating real value. Using AI in resilience testing, as described by Pisulak, could be an easy win for operations teams. As Dynatrace VP and Chief Technology Strategist Alois Reitbauer recently said on theCUBE, AI has real potential to help operations teams move out of firefighting mode and start making the bigger-picture improvements they currently don’t have time to focus on.

Check out more episodes of PurePerformance to dive into the world of software performance and innovation.