WEBINAR

This is Fine: The SRE's Guide to Chaos & Observability

Today’s distributed, cloud-based environments are incredibly complex. Not only does each component depend on many others, but modern systems are also highly dynamic—changing frequently as teams push new code or make updates to infrastructure.

Taming this complexity to ensure reliability requires end-to-end observability to understand how components depend on each other. Additionally, proactive Chaos Engineering combined with AI-driven observability lets you uncover “unknown unknowns” that impact how your system will respond to different failure scenarios.

On-demand


About this webinar

Join Gremlin and Dynatrace as we discuss techniques for maintaining and improving reliability in complex cloud environments. We will cover how to establish end-to-end observability across your environments and how to map their complex relationships. We will then provide a framework for safely and thoughtfully conducting Chaos Engineering experiments with Gremlin.

Finally, we will share how teams can incorporate continuous chaos experimentation into build and deploy pipelines using the concept of “quality gates” in Dynatrace to help you establish and adhere to reliability SLOs.
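To make the quality-gate idea concrete, here is a minimal sketch of how a pipeline step might compare metrics gathered during a chaos experiment against reliability SLOs and decide whether a build may proceed. It does not use the Gremlin or Dynatrace APIs. The `Objective` class, the `evaluate_gate` function, the metric names, and the thresholds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    """One reliability objective checked by the gate (hypothetical structure)."""
    metric: str             # name of the measured metric
    threshold: float        # limit the measured value is compared against
    higher_is_better: bool  # True: value must be >= threshold; False: <= threshold


def evaluate_gate(objectives, measurements):
    """Return (passed, failures) for metrics measured during a chaos experiment.

    A missing measurement counts as a failure, so the gate fails closed.
    """
    failures = []
    for obj in objectives:
        value = measurements.get(obj.metric)
        if value is None:
            failures.append(f"{obj.metric}: no data")
            continue
        if obj.higher_is_better:
            ok = value >= obj.threshold
        else:
            ok = value <= obj.threshold
        if not ok:
            failures.append(f"{obj.metric}: {value} violates threshold {obj.threshold}")
    return (not failures, failures)


# Illustrative SLOs only; these numbers are assumptions, not from the webinar.
SLOS = [
    Objective("availability_pct", 99.5, higher_is_better=True),
    Objective("p95_latency_ms", 500.0, higher_is_better=False),
]

# Example: metrics collected while a fault was being injected (made-up values).
measured = {"availability_pct": 99.7, "p95_latency_ms": 620.0}
passed, failures = evaluate_gate(SLOS, measured)
print("gate passed" if passed else "gate failed: " + "; ".join(failures))
```

In a real pipeline, the measurements would come from your observability tool and a failed gate would stop the deployment. Treating missing data as a failure is a deliberate choice here, so that a broken metrics feed cannot silently pass a build.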

Agenda

Learn the history, principles and practice of Chaos Engineering
Discover how to improve your team's on-call skills
See how observability and chaos work together to improve the reliability of distributed systems
Learn how to use Gremlin and Dynatrace to help your engineering team improve continuously

About the speakers

Ana M Medina

Sr. Chaos Engineer

Gremlin

Ana Margarita is a Senior Chaos Engineer at Gremlin and helps companies avoid outages by running proactive chaos engineering experiments. Before Gremlin, she worked at companies of various sizes, including Google, Uber, SFEFCU, and Miami-based startups. Ana is an internationally recognized speaker and has presented at AWS re:Invent, KubeCon, DockerCon, DevOpsDays, AllDayDevOps, Write/Speak/Code, and many others. Catch her tweeting at @Ana_M_Medina about traveling, diversity in tech, and mental health.

Andreas Grabner

DevOps Activist

Dynatrace

Andreas Grabner (@grabnerandi) has 20+ years of experience as a software developer, tester, and architect and is an advocate for high-performing, cloud-scale applications. He is a regular contributor to the DevOps community, a frequent speaker at technology conferences, and regularly publishes articles on blog.dynatrace.com. In his spare time, you can most likely find him on one of the salsa dance floors of the world!

Check out other webinars from Gremlin

Improving Resilience for GenAI Workloads on AWS

How to keep track of what’s running in your Gremlin team

How to test Istio and other service meshes

How to demonstrate your reliability progress

How to build a Test Suite that fits your requirements

Building Resilience from Architecture to Production with AWS & Gremlin

Integrating Gremlin with your observability tools

How Visa Cross-Border Solutions Reduces Outages by Testing System Resilience in Their SDLC

How to test serverless applications using Failure Flags

How to Build Resilience Throughout Your SDLC: Lessons from a Top 10 Bank

Confident Cloud Migrations: How a Top 5 Bank Ensured Reliability With AWS and Gremlin

Building Resilience in the Cloud With the AWS Well Architected Framework and Gremlin

Get better reliability on AWS with our new release

5 essential resilience tests for a successful cloud migration

How to run fault injection tests on AWS managed services

How to test zone redundancy using Gremlin

How to run Chaos Engineering experiments in your CI/CD pipeline

How to test your systems for scalability and redundancy with Fault Injection

How to find Kubernetes reliability risks with Gremlin

How to find and test critical dependencies with Gremlin

Kubernetes Reliability Risks

Enterprise Chaos Engineering Certification Prep Session

More Reliability, Less Firefighting

Automate Reliability in Your CI/CD Pipeline

Secure yourself against expiring TLS certificates

Building a Culture of Reliability

Preparing for Traffic Spikes with Chaos Engineering

Introduction to Chaos Engineering

Validate Your Disaster Recovery Strategy: Ensuring Your Plan Works

The Road to Reliability

Running Your First 5 Chaos Experiments on Kubernetes

Serverless resilience: How to Build a Reliable Serverless Platform

RELIABILITY: The Next Big Development Trend

Recreating 3 Common Outages with Gremlin Scenarios

Reducing Trauma in Production with SLOs and Chaos Engineering

Beyond Chaos: Reliability in the Age of Cloud Native

Improving Network Resiliency & Performance with Network Attacks

Beyond Chaos Engineering: Using Reliability Scores to Drive Real Results

Planning and Architecting for Reliability - Part 1

Planning and Architecting for Reliability - Part 2

Organizational Reliability: What, Why, and How?

Improving system stability with Gremlin Resource Attacks

Is Your Microservice Actually a Distributed Monolith?

Navigating the Reliability Minefield

Incident Repro & Playbook Validation with Chaos Engineering

Gremlin Chaos Engineering Professional Certificate Prep Session

How Twilio Built a Culture of Reliability

Improving the Reliability of Financial Services with Chaos Engineering

Improving system resiliency with Gremlin State attacks

Getting Started With Chaos Engineering

Partner Update

Gremlin Chaos Engineering Practitioner Certificate Prep Session

Full-service Ownership: Owning Your Service from Code to Production

GameDays: Preparing Systems for the Real World

Five Hidden Barriers to Chaos Engineering Success

Chaos Engineering: When the Network Breaks

Continuous Validation of the AWS Well-Architected Framework with Chaos Engineering

Chaos Engineering: Test Your Systems NOT Your People

How to Baseline and Improve Reliability with Automated Scoring

Introduction to Chaos Engineering with Microsoft Azure

Achieving SLO Success with Golden Signals and Reliability Testing

Proactively improve reliability

Explore our tutorials to learn about the technologies and processes that help you manage reliability to a higher standard

Chaos Engineering: the history, principles, and practice

How To Establish a High Severity Incident Management Program

4 Chaos Experiments to Start With

Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.
