Thank you for joining us online!

We’ve all had to evolve over the last year, and we think virtual conferences should evolve as well. With Failover Conf 2: Fail Smarter, we wanted to create a more engaging, collaborative conference.

With panel discussions, lightning talks, fireside chats, dance parties, pet slideshows, and more, this wasn’t like any other virtual conference. This year’s talks discussed how remote teams have evolved their cultures of reliability, how companies have evolved their incident response plans, and how Chaos Engineering has helped teams evolve from traditional testing. We learned how teams have adapted over the past year and engaged with others in the reliability community.

Panel Discussion: The Evolution of Teams & Culture

Divya Balasubramanian, Senior Product Manager @ PagerDuty, Karishma Irani, Product Management Lead @ LaunchDarkly, Lena Reinhard, VP Product Engineering @ CircleCI, & Loretta Stokes, Director of Software Engineering Manager @ Eventbrite

The most successful organizations are the ones that embrace change and use it to become stronger and more resilient. In this panel discussion, we talked with engineering leaders about how they adapted to the challenges of 2020, what successes (and failures) they've seen, and where the future of reliable engineering is headed.

View talk

Panel Discussion: The Evolution of Observability & Monitoring

Ashley Miller, Senior Director, Engineering @ Datadog, Daniel Khan, Director of Technology Strategy & Head of Open Source @ Dynatrace, Emily Nakashima, VP of Engineering @ Honeycomb, & Stijn Polfliet, Director Developer Enablement @ New Relic

Observability and monitoring are critical to detecting and troubleshooting problems to build more reliable applications. As our systems become increasingly complex, our tools for getting this crucial visibility and the way we respond need to evolve too. We sat down with SRE leaders to discuss the processes they use to get the most insight into their applications, how they've increased the speed of detection and response, and what organizations need to do to stay on top of growing complexity.

View talk

Leaving the Nest: Guidelines, guardrails, and human error

Laura Santamaria, Developer Advocate @ LogDNA

When we talk about reliable systems, we talk a lot about human error. Human error in an incident or a bug report is often treated with a bit of a facepalm reaction. The term masks a lot of scenarios from accidents to exhaustion to everything in between. However, human error helps us understand where our processes failed and how we can prevent the same error from happening again. In short, we need to think in terms of a framework of guidelines and guardrails. In this short talk, Laura discusses how guidelines like runbooks and guardrails like automation can help us address the fact that everyone will, at some point, make mistakes.

View talk

Implementing DevSecOps in the DoD

Nicolas Chaillan, Chief Software Officer @ United States Air Force

Delivering software quickly and securely is important for every organization, but it's even more important at the US Department of Defense (DoD), where reliability directly impacts national security. Nicolas Chaillan (Chief Software Officer, US Air Force) discusses the DoD Enterprise DevSecOps Initiative, an initiative he leads along with the DoD’s Chief Information Officer that brings automated software tools, services, and standards to DoD programs. He also covers Platform One, the Air Force's DoD-wide DevSecOps Enterprise Level Service that provides managed IT services capabilities, on-boarding, support, and baked-in zero trust security. This insight from operating at the most rigorous level will help you level up your own organization.

View talk

Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.
