Thank you for joining us online!
Thank you for joining us online in April 2020 to learn from the best and brightest in reliability and resilience engineering.
Being a resilient engineer means building systems that are hardened against the expected failures and resilient enough to withstand the unexpected ones.
This year we expected to have the opportunity to gather in person and share our knowledge and experiences building production systems with one another. Then the unexpected happened, forcing many events to cancel or postpone.
But we were resilient. When one opportunity fell through, we "fail(ed)over" to another.
Reliability Matters More Than Ever
Chaos and uncertainty are all around us. Tammy kicks off Failover Conf by sharing why reliability and resiliency matter now more than ever - and how you can achieve them.
The Future of DevOps is Resilience Engineering
For more than a decade, many of us have been working to bring DevOps to organizations around the world. We’ve made amazing progress, but there’s so much more to do. Now that continuous integration & deployment are widespread and developers are taking more ownership of production, what’s next?
Amy will talk about what Resilience Engineering is, how it relates to DevOps, and how she thinks it gives us the science and research we need to take our organizations to the next level of robustness while remaining agile and growing our ability to care for the people around us.
Pitfalls in Measuring SLOs
We built support for SLOs (Service Level Objectives) against our event store so we could monitor our own complex distributed system. In the process, we ran into a number of important issues that we hadn’t anticipated, even after carefully reading the SRE Workbook.
This talk is the story of the missing pieces, unexpected pitfalls, and how we solved those problems. We’d like to share what we learned and how we iterated on our SLO adventure.
Working together as an SLO advocate and a design researcher, we collected user feedback through iterative deployments to learn what challenges users were running into. This conversation will cover how we iterated on our design based on user feedback; how we deployed, learned, and redeployed; and how we gathered information from our users and from the alerts our system fired.
In this talk, we will discuss how we brought the theory of SLOs into practice, and what we learned along the way that we hadn’t expected. We’ll cover implementing the SLO feature and burn alerts, and our experiences working with the SRE team who started using those alerts. Our hope is that when you buy or build your SLO tools, you’ll know what to look for and how to get started, that implementers will begin on more solid ground, and that together we can advance the state of SLO support for all teams that wish to adopt them.
The major design points will be broken into a discussion of what we actually built; a number of unexpected technical features; and ways that we had to educate users beyond the standard SLO guidelines. The talk is largely conceptual: no live code will be shown, although some innocent servers may well die in the process of being visualized.
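To make the burn-alert idea above concrete, here is a minimal sketch - not the speakers' implementation, and with assumed names and thresholds - of how an error budget and a burn rate can be computed from event counts for an availability-style SLO:

```python
# Minimal sketch of an SLO error-budget and burn-rate calculation.
# Illustrative only; names and thresholds are assumptions, not the
# implementation described in the talk.

def error_budget_remaining(total_events: int, bad_events: int, target: float) -> float:
    """Fraction of the error budget left for an availability-style SLO.

    target: the SLO target, e.g. 0.999 for "99.9% of events succeed".
    """
    allowed_bad = total_events * (1.0 - target)  # budget, measured in events
    if allowed_bad == 0:
        return 1.0 if bad_events == 0 else 0.0
    return max(0.0, 1.0 - bad_events / allowed_bad)


def burn_rate(window_bad: int, window_total: int, target: float) -> float:
    """How fast the budget is burning in a recent window.

    1.0 means burning exactly at the rate the SLO allows; values well
    above 1.0 are what burn alerts typically fire on.
    """
    if window_total == 0:
        return 0.0
    observed_error_rate = window_bad / window_total
    allowed_error_rate = 1.0 - target
    return observed_error_rate / allowed_error_rate


if __name__ == "__main__":
    # Example: 99.9% SLO, 1,000,000 events this month, 400 failures so far.
    print(error_budget_remaining(1_000_000, 400, 0.999))  # 0.6 -> 60% budget left
    # Last hour: 10,000 events, 50 failures -> burn rate of 5x.
    print(burn_rate(50, 10_000, 0.999))
```

A burn alert typically fires when a short-window burn rate stays well above 1.0, meaning the budget would be exhausted long before the SLO window ends.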
Fight, Flight, or Freeze - Releasing Organizational Trauma
When humans are faced with a traumatic experience, our brains kick in with survival mechanisms. These mechanisms are the familiar fight or flight response, but can also include the freeze response - which occurs when we are terrified or feel that there is no chance of escape.
In this talk I will explain the background of fight, flight, and freeze, and how it applies to organizations. Based on my own experiences with post-traumatic stress (PTS), I will give examples and suggestions on how to identify your own organizational trauma and how to help heal it.
Sufferers of post-traumatic stress continue to feel these fight, flight, and freeze responses long after the trauma has passed because our brains are unable to differentiate between the memory of trauma and an event actually occurring. When activated or triggered, the brain reverts to these behaviors, which are then expressed in the person’s body (through posture, dissociation, muscle tension, etc.).
The same can occur in organizations - once an organization has experienced a trauma (a large outage, say), the “memory” of that trauma leads to a dysregulated state whenever it is activated (by similar indicators, such as system alerts, customer issues, and more). The organization will insist on revisiting the same fight, flight, or freeze response that the embedded trauma originally caused, which, as with a triggered post-traumatic stress sufferer, is a false equivalence.
By removing the inaccurate traumatic associations of previous outages and organizational pain - through game days and other techniques - we can reduce the “scar tissue” of our organization and move forward in a balanced manner.
Swim Don’t Sink: Why Training Matters to a Site Reliability Engineering Practice
Do you offer training to the engineers in your organization or do you throw them off the deep end to “sink or swim”? Providing training and education is universally important to set team members up for success in your organization and is critical for establishing a thriving Site Reliability Engineering (SRE) or DevOps practice and culture in the first place.
The specific training needs of each engineer vary depending on several factors, including:
- The maturity of your organization in adopting DevOps / SRE principles, practices, and culture
- The knowledge those individuals have about your organization and infrastructure
- The experience of the individuals being trained, both in terms of technical skill and familiarity with the SRE / DevOps model
This talk will explore the business case for training, the trade-offs between cost and effectiveness, and best practices for training design and deployment depending on where your organization lies on the spectrum of size and maturity.
Learn why training is not about unleashing a fire hose of information upon unsuspecting engineers but about giving those engineers the confidence to run production systems at scale.
Performing Chaos in a Serverless World
Chaos Engineering is the practice of hypothesis testing through planned experiments to gain a better understanding of a system’s behavior. The principles of Chaos Engineering have been around for years, and we have now reached the point where it has gone from being a buzzword practiced by a few large organizations in very specific fields to being put into use by companies of all sizes and industries.
Planning and performing chaos experiments on traditional infrastructure with virtual machines and container-based microservices has been battle-tested by many large organizations, but serverless functions and managed services present different failure modes and a different level of abstraction. In this talk we focus on how to apply the principles of Chaos Engineering to serverless, both for serverless functions and for managed services: how hypotheses can be formed to fit serverless, what the experiments can achieve, and how to practically perform them. While tools for Chaos Engineering, both commercial and open source, are getting more mature, most of them still focus primarily on virtual machines and containers.
We’ll look at what tools are out there to help with chaos experiments for serverless and managed services, but also how you can build your own. Join us as we move from talking about the principles to performing real chaos in a serverless world!
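As a rough illustration of what such an experiment can look like, here is a minimal, hand-rolled sketch of fault injection for a serverless function. It is not any specific tool's API; the environment variable names and probabilities are assumptions for illustration only:

```python
# Hand-rolled fault injection for a Lambda-style handler.
# Not a specific chaos tool's API; variable names are hypothetical.

import os
import random
import time
from functools import wraps


def chaos(handler):
    """Wrap a handler and inject latency or errors based on configuration."""
    @wraps(handler)
    def wrapper(event, context):
        rate = float(os.environ.get("CHAOS_INJECTION_RATE", "0"))
        if random.random() < rate:
            mode = os.environ.get("CHAOS_MODE", "latency")
            if mode == "latency":
                # Hypothesis example: "p99 latency stays within the SLO even
                # if a downstream dependency adds 500 ms."
                time.sleep(float(os.environ.get("CHAOS_DELAY_SECONDS", "0.5")))
            elif mode == "error":
                raise RuntimeError("chaos experiment: injected failure")
        return handler(event, context)
    return wrapper


@chaos
def handler(event, context):
    # Normal business logic would live here.
    return {"statusCode": 200, "body": "ok"}
```

Keeping the injection rate and mode in configuration makes it easy to limit the blast radius and switch the experiment off immediately if the hypothesis is disproved.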
Human-in-the-Loop DevOps
Within DevOps, automation has become a North Star. We want to automate the toil away, but the goal of "no toil" is unattainable. Many runbooks can only be partially automated because they still require human intervention and insights. Human-in-the-Loop DevOps is the idea that we can benefit from automating toil while still embracing the human interaction in specific tasks.
In this talk, we'll discuss the spectrum of automation in DevOps: common patterns of tasks that can be automated away, like CI/CD and monitoring, and ones that can only be partially automated with Human-in-the-Loop DevOps, like incident response. We'll share examples of interfaces that pull humans into the loop at critical junctures and allow humans to add maximal value while automating the tedium. Lastly, we'll discuss how Human-in-the-Loop DevOps can improve both the on-call experience and overall efficiency.
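As a sketch of the idea (the function names below are hypothetical placeholders, not a real incident-response tool), a partially automated runbook step might gather diagnostics automatically but leave the risky decision to a human:

```python
# Human-in-the-loop runbook step: the toil (gathering context) is
# automated, but the risky action waits for a human decision.
# All names here are hypothetical placeholders.

def gather_diagnostics(service: str) -> dict:
    # In a real system this might pull recent deploys, error rates,
    # and saturation metrics from your monitoring stack.
    return {"service": service, "error_rate": 0.12, "last_deploy": "2h ago"}


def ask_human(summary: dict, proposed_action: str) -> bool:
    # In practice this would post to chat or paging tooling and wait
    # for an approval; here we simply prompt on stdin.
    print(f"Diagnostics: {summary}")
    answer = input(f"Run '{proposed_action}'? [y/N] ")
    return answer.strip().lower() == "y"


def rollback(service: str) -> None:
    print(f"Rolling back {service}...")  # placeholder for the automated action


def incident_step(service: str) -> None:
    summary = gather_diagnostics(service)          # automated toil
    if ask_human(summary, f"rollback {service}"):  # human judgment
        rollback(service)                          # automated execution
    else:
        print("Human chose a different path; logging decision for review.")


if __name__ == "__main__":
    incident_step("checkout-api")
```

The point is not the tooling: the judgment call stays with a person while the surrounding toil is automated.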
Slowdown is the New Outage
While outage-driven news headlines can cause stock prices to plummet in the short term, performance-driven reputation loss is a slow burn that leads to longer-term customer loss. This session compares slowdowns vs. outages and the resulting need for insight, not just observability.
By understanding these differences, you'll be ready to drive agile applications, gain funding for lowering technical debt, and focus on customer retention.
Improving a Distributed System Post-Incident
In this session, we will dive into a case study of how a team can recover & improve a distributed system after a major incident. Distributed systems are more prone to failure than other systems due to their incredible complexity and scale, and incidents are a fact of life with these systems.
This year, my team faced a week-long incident in our IP address management system which impacted our customers. From this incident, we had to reevaluate our system's performance & overhaul several key areas of our codebase, as well as improve our monitoring, testing processes, database interactions, and reliability. Viewers will learn about these improvements and how they can apply them to their own systems to achieve greater reliability and performance.
Additionally, viewers will learn how to effectively leverage monitoring practices to uncover inefficiencies in their system, tips for creating a testing process to properly stress your system before deploying to production, and how to rally a team together during a high-pressure incident.
The Halo of Resilience Engineering
Recent world-impacting events have caused us all to rethink the way we go about our daily work; in this talk, we'll look at how some of the pillars of Resilience Engineering might help you and your team deal with the changes we're all being forced to confront.
How to Fail with Serverless
Everything fails all the time. Knowing how to deal with these failures in serverless applications becomes essential to building resilient, highly-available systems.
In traditional monolithic applications, catching errors and handling retries is relatively straightforward. But as our systems become more distributed, we now have multiple (often asynchronous) components processing events from several sources, all with vastly different retry behaviors and failure mechanisms. Utilizing old patterns can cause errors to get swallowed, creating brittle, unreliable systems that are difficult to debug and hard to maintain.
In this talk, we’ll explore the built-in tools and processes that AWS has in place to appropriately deal with failures in distributed serverless applications. We’ll discuss retry behaviors and strategies for dealing with errors in:
- Asynchronous Lambda function invocations (DLQs, retries, and throttling)
- Event source mappings (Kinesis, SQS, and DynamoDB streams)
- Step Functions (task failures, transient issues, and fallback states)
- Lambda invocations from AWS services (synchronous and asynchronous)
- Calls to AWS services (using the AWS SDK and other protocols)
- Third-party API calls (utilizing circuit breakers and other fallback methods)
While this talk focuses on the AWS ecosystem, many of these strategies are adaptable to other cloud providers as well.
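As one concrete example of the last two bullets, here is a minimal sketch - illustrative only, and not the AWS SDK's built-in retry configuration - of client-side retries with exponential backoff and jitter, plus a simple circuit breaker for third-party calls:

```python
# Two client-side failure-handling patterns: retries with exponential
# backoff + full jitter, and a simple circuit breaker. Illustrative
# only; thresholds and names are assumptions.

import random
import time


def call_with_retries(fn, max_attempts=5, base_delay=0.2, max_delay=5.0):
    """Retry a flaky call with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # give up: let the caller (or a DLQ) handle it
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))  # full jitter


class CircuitBreaker:
    """Stop hammering a failing third-party API; fail fast until it cools down."""

    def __init__(self, failure_threshold=3, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow a trial request
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result
```

When retries are exhausted, handing the event to a dead-letter queue or on-failure destination keeps it from being silently swallowed.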
Y2K and Other Disappointing Disasters: Risk Reduction and Harm Mitigation
Every disaster is a concatenation of smaller failures. How can we design software and processes to accept that we live in an imperfect world? Explore the concepts of resiliency, harm reduction, over-engineering, and planning for failure with real examples.
Risk Reduction is trying to make sure bad things happen as rarely as possible. It's anti-lock brakes and vaccinations and irons that turn off by themselves and all sorts of things that we think of as safety modifications in our life. We are trying to build lives where bad things happen less often.
Harm Mitigation is what we do so that when bad things do happen, they are less catastrophic. Building fire sprinklers and seatbelts and needle exchanges are all about making the consequences of something bad less terrible.
This talk is focused on understanding where we can prevent problems and where we can just make them less bad, and what kinds of tools we can use to make every disaster a disappointing fizzle.
Audiences will leave with a clearer understanding of risk and harm, and a set of tools that can be used to minimize future problems.
I'm going to talk about why we need to understand both avoiding problems and making them less catastrophic, and what kinds of tools are appropriate to each.
I think that developers need to be thinking about failure states more than we currently do. We talk about avoiding them, or testing them away, but we don't talk about how to make even failure a better experience.
Built-in Application Resiliency
When starting a new application build, keeping an eye on resiliency from the beginning prevents headaches down the line. There are many ways to tackle this, especially across different language environments and system ecosystems, but many of them are shared across them all. This talk dives into those approaches and distills a high-level takeaway list to use as a reference later; viewers will learn how to develop software that is more fault-tolerant and able to withstand the impact of failures.
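For example, one widely shared pattern - sketched below with hypothetical names, as an illustration rather than a prescribed implementation - is to bound every external call with a timeout and degrade gracefully to cached or default data instead of failing the whole request:

```python
# Sketch of a language-agnostic resiliency pattern: bound an external
# call with a timeout and fall back to stale data rather than failing
# the whole request. Names here are hypothetical.

import time


class RecommendationsClient:
    def __init__(self, fetch_remote, timeout_seconds=0.5):
        self.fetch_remote = fetch_remote        # e.g. an HTTP call in real code
        self.timeout_seconds = timeout_seconds
        self.last_good = []                     # stale-but-usable fallback

    def get(self, user_id):
        try:
            # In a real client the timeout would be passed to the HTTP
            # library; here we just measure and reject slow responses.
            start = time.monotonic()
            result = self.fetch_remote(user_id)
            if time.monotonic() - start > self.timeout_seconds:
                raise TimeoutError("dependency too slow")
            self.last_good = result
            return result
        except Exception:
            # Degrade: serve stale data (or an empty list) rather than an error.
            return self.last_good
```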
Avoid downtime. Use Gremlin to turn failure into resilience.
Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.