What is Site Reliability Engineering?

A PRIMER FOR ENGINEERING LEADERS

Site Reliability Engineering (SRE) is the outcome of combining IT operations responsibilities with software development. With SRE there is an inherent expectation that engineers take responsibility for meeting the service-level objectives (SLOs) set for the services they manage and the service-level agreements (SLAs) promised in contracts.

SLOs set targets for reliability. The amount of unreliability an SLO still permits is frequently referred to as the error budget. Data is gathered while the system is operating and compiled as service-level indicators (SLIs) to help guide SREs in deciding which parts of the system to prioritize for enhancement. To help with this, engineers automate anything they can rather than repeat tasks. By eliminating toil, this automation frees up engineer time that can be spent on making the site more and more reliable.
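As a rough illustration of how an SLO, an SLI, and an error budget relate, here is a minimal Python sketch. It treats the error budget as the share of requests that may fail while still meeting the SLO. The function names, the 99.9% target, and the request counts are hypothetical and chosen only for illustration.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the measured fraction of successful requests (the SLI)."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


def error_budget_remaining(slo: float, good_requests: int, total_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # nothing has been spent yet
    allowed_failures = (1.0 - slo) * total_requests
    actual_failures = total_requests - good_requests
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% SLO and one million requests.
    slo = 0.999
    total, good = 1_000_000, 999_400
    print(f"SLI: {availability_sli(good, total):.4%}")
    print(f"Error budget remaining: {error_budget_remaining(slo, good, total):.0%}")
```

A team might slow feature releases when the remaining budget approaches zero and speed up when plenty is left, though the exact policy is a decision for each organization.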

The focus on reliability is the main factor separating DevOps and Site Reliability Engineering, not automation. In an SRE team engineers accept the responsibility that the person who builds software also ships it and owns it in production. In that sense, SRE is sometimes referred to as the next stage of DevOps.

Applying this automation expectation to operations tasks leaves software developers and system administrators who are becoming site reliability engineers with the task of learning how to deal with complex issues that may be outside of their previous experience. They are now expected to handle issues like latency, performance, high availability (HA), complex distributed production systems, system monitoring, emergency response and disaster recovery, and even change management with only the amount of human interaction that is absolutely necessary. This leads to greater and greater efficiency.

An SRE team is software development, systems administration, and IT operations all merged together. Any given member may be strongest as a sysadmin, a dev, or a DBA, but no member does only one thing. They all work together toward a common goal without the traditional walls obstructing communication and cadence.

What are the Foundations and Benefits of Site Reliability Engineering?

The basic principles of Site Reliability Engineering are easy to explain, but take some work to apply. We start with the foundational understanding that today's computer systems and platforms have an inherent complexity that is unprecedented.

Our Systems Are Complicated

The number of moving parts and discrete, functional units within our architectures is vast and impossible for any one person to completely understand at any given moment. Plus, our systems are constantly changing. New capacity is being added, along with failover systems, load balancing, and canary deployments. Old containers that are no longer needed are constantly being removed.

Where we used to design a system on paper, drawing out system diagrams and system architectures, today our systems have changed before the ink is dry on any attempt to do so. Such diagrams can at best be considered approximations.

Automation Helps Us Manage Complexity

Much of the administration of large systems today involves mundane, repetitive tasks. Has our load increased because of an influx of online shoppers hitting our site after marketing sent out a special sale email? Bring up more capacity. Has the load gone down because tax season is over and everyone who filed on time has already received their results? Remove unneeded redundant application servers.
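A capacity decision like this is a natural candidate for automation. The following Python sketch shows one simple way to size a pool of application servers from current load. The function name, the per-server capacity figure, the 25% headroom, and the minimum of two servers are all assumptions for illustration, not recommendations.

```python
import math


def desired_server_count(requests_per_second: float,
                         capacity_per_server: float,
                         min_servers: int = 2,
                         headroom: float = 0.25) -> int:
    """Return how many app servers to run for the current load.

    headroom adds spare capacity on top of measured load, and
    min_servers keeps some redundancy even when traffic is low.
    """
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    needed = math.ceil(requests_per_second * (1.0 + headroom) / capacity_per_server)
    return max(min_servers, needed)


if __name__ == "__main__":
    # Hypothetical loads: a sale-day spike and a quiet off-season period.
    for load in (1000.0, 50.0):
        print(load, "req/s ->", desired_server_count(load, capacity_per_server=100.0))
```

The same function covers both directions: it asks for more servers when load rises and lets the pool shrink to the minimum when load falls.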

No engineer wants to do the same thing over and over and over. Repetition quickly becomes drudgery and leads to job hunting in search of a new and interesting challenge. We love solving problems and coming up with useful solutions. We love it even more when it relieves us of tasks we don't enjoy and gives us more time for fun and beneficial work.

We love automation even more when it is reliable and makes our systems more resilient to high impact events. We all appreciate it when we don't have to put out a metaphorical fire that resulted from a greater number of concurrent users than is typical. When our systems monitor, alert, and even react before this becomes a problem, everyone benefits.

Who Benefits From Site Reliability Engineers?

Any company large enough to need more than a small handful of people to manage their systems will benefit from Site Reliability Engineering. Any of us with systems large enough or vital enough to require 99% availability or greater will benefit. If uptime matters, well-implemented SRE will help you improve it.
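To make an availability target concrete, it helps to convert it into the downtime it permits. The Python sketch below does that arithmetic. The function name and the choice of a 365-day period are assumptions for illustration.

```python
from datetime import timedelta


def allowed_downtime(availability: float, period: timedelta) -> timedelta:
    """Return the downtime permitted within period at the given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return period * (1.0 - availability)


if __name__ == "__main__":
    year = timedelta(days=365)
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%} availability allows {allowed_downtime(target, year)} per year")
```

Under these assumptions, a 99% target over a 365-day year leaves roughly 3.65 days of downtime, while 99.9% leaves a little under nine hours.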

The most benefit comes to companies with large numbers of users, whether internal to the company or external customers. Companies that process large amounts of data, or whose workloads fluctuate from resource heavy to light, also benefit.

In these instances, many companies are moving much of their processing and computing power to cloud-based services. Some have moved everything to the cloud while others have a reason to use a hybrid architecture that keeps sensitive data like personally identifiable information (PII) or company financials in house while moving other processing to the cloud. Still others have an internally owned data center using virtualization or an internal cloud.

What Do Site Reliability Engineers Do on a Daily Basis?

Site Reliability Engineers start by looking at the system, then taking the easiest and most mundane tasks and automating them. This frees up more time for coding new features and preparing for potential problems.

Site Reliability Engineers tend to come from either a software development background or a systems or operations background. All of us take time doing each of these tasks: writing code and managing the system. This is why we are well-suited to both know what would be useful to automate and also to write the code that does the automation.

Disaster Mitigation

While the simplest of low-hanging fruit tasks are dealt with, we also look at disaster mitigation and preparation schemes and write runbooks, plans for how to deal with bad things when bad things happen.

Both senior and junior SREs work together to try to automate as many of the discovered mitigation tasks as they can: spinning up extra database servers when response times are slow, rerouting traffic around overloaded app servers when CPU usage or network throughput is getting a little too close to capacity, and so on.
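An automated mitigation often starts as a simple rule that maps observed metrics to actions. Here is a minimal Python sketch of that idea. The Metrics fields, the threshold values, and the action names are hypothetical; a real system would call its own provisioning and load-balancing tools instead of returning strings.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds; real values depend on the system being run.
DB_SLOW_MS = 250.0
CPU_HIGH_PERCENT = 85.0


@dataclass
class Metrics:
    db_response_ms: float   # recent database response time
    app_cpu_percent: float  # CPU usage on an app server


def plan_mitigations(m: Metrics) -> List[str]:
    """Return the mitigation actions suggested by the current metrics."""
    actions = []
    if m.db_response_ms > DB_SLOW_MS:
        actions.append("add_database_server")
    if m.app_cpu_percent > CPU_HIGH_PERCENT:
        actions.append("reroute_traffic")
    return actions


if __name__ == "__main__":
    print(plan_mitigations(Metrics(db_response_ms=300.0, app_cpu_percent=90.0)))
```

Keeping the decision separate from the code that carries it out makes the rule easy to test before it is trusted to act on production.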

Configure and Use Monitoring for Observability

All this requires good monitoring to achieve observability into the system and whether components are functioning as anticipated. Guess who is in charge of that? Yep, SREs. Monitoring requires a knowledge of the system and what data would be meaningful and useful. It also requires good tooling and taking the time to learn how to use it well.

We can monitor "everything," but the result of doing that is a firehose blast of information that is overwhelming and quickly becomes ignored. Instead, we take our time to thoughtfully consider which metrics tell us what we need to know about the system, preferably well before user-impacting problems occur and far, far before downtime happens.

We can't catch everything, but doing this scientifically helps us catch as much as we know we need to catch while protecting us from an overload of noise and distraction.

Site and Software Maintenance

To be secure and up to date, a system must be maintained. It is not acceptable to have outdated software versions or old configurations when you are aiming for quality, stability, and safety.

SREs spend some of their time making sure software is properly updated in a timely manner. They may automate things like version checks, expiration dates for things like security certificates, and dependency needs.
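As one example of this kind of check, the Python sketch below works out how many days remain before a certificate expires. It relies on ssl.cert_time_to_seconds from the Python standard library, which parses the notAfter string format found in a certificate. The function names and the 30-day warning window are assumptions for illustration.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the days between now and a certificate's notAfter time.

    not_after uses the certificate date format, for example
    "Jan  5 09:34:43 2018 GMT". now must be timezone-aware.
    """
    expires_at = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires_at - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """Return True when the certificate expires within warn_days."""
    return days_until_expiry(not_after, now) <= warn_days


if __name__ == "__main__":
    # Hypothetical certificate date, checked against the current time.
    print(needs_renewal("Jan  5 09:34:43 2018 GMT", datetime.now(timezone.utc)))
```

A scheduled job could run a check like this against each certificate's notAfter value and open a ticket or alert when renewal is due.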

Incident Management and Incident Repair

It’s right there in the name, "site reliability." One of the primary tasks for the SRE is keeping a site up and running, and when it stops running because something failed, getting it back up and running as quickly as possible.

An incident happens. An alert is sent out. Pager duty. The SRE on call stops whatever she is doing and starts working to find out what the problem is. Everyone in the on-call team gathers, perhaps in person or maybe using a video or audio conference call. Information is gathered and shared. Incident playbooks and runbooks are pulled out and used to prioritize what to look at, and what to look for, to try to get things working again. If they fail to help, which sometimes happens, ideas and potential fixes are discussed. Responsibilities are spread across team members. Everyone works together to do whatever research and tasks are required to get the system back up, running, and available.

There are several metrics that are used to measure the speed and efficiency of incident response, such as:

  • Mean time to detect (MTTD), which measures the average time needed to discover a problem
  • Mean time to resolve (MTTR), which measures how long it takes to fix a failed system
  • Mean time to failure (MTTF), which is the average amount of time a defective system can continue running before it fails; this is similar to uptime and helps teams plan for future replacement of system components before they stop working
  • Mean time between failures (MTBF), which measures the average time a system or component is working properly
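
As a rough sketch of how some of these metrics could be computed from incident records, consider the Python below. The Incident record, its field names, and the choice to measure time to resolve from the start of the failure are assumptions made for illustration; teams define these timestamps differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    started: datetime   # when the failure began
    detected: datetime  # when an alert or a person noticed it
    resolved: datetime  # when service was restored


def mttd(incidents: List[Incident]) -> timedelta:
    """Mean time to detect: average of (detected - started)."""
    return timedelta(seconds=mean(
        [(i.detected - i.started).total_seconds() for i in incidents]))


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time to resolve, measured here from failure start to resolution."""
    return timedelta(seconds=mean(
        [(i.resolved - i.started).total_seconds() for i in incidents]))


def mtbf(incidents: List[Incident]) -> timedelta:
    """Mean time between failures: average working time between one
    incident's resolution and the next incident's start."""
    if len(incidents) < 2:
        raise ValueError("need at least two incidents")
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [(nxt.started - prev.resolved).total_seconds()
            for prev, nxt in zip(ordered, ordered[1:])]
    return timedelta(seconds=mean(gaps))
```

Tracking these values over time, rather than looking at any single incident, is what shows whether detection and response are actually improving.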

Prevent Data Loss

The most well-known and seemingly obvious job of an SRE is to maintain system availability. Perhaps less obvious, but even more important is preserving the integrity of our data. Durability. The prevention of data loss.

Data is the most important thing we have in our systems. Every component exists to do something with or for data. Input (receive) data, store data, process data, transform data, use data, output data, the list goes on.

Some data is proprietary. Some is sensitive. Many types of data can only be handled according to strict regulatory standards.

Without good data, our systems have no value. If our data is not properly protected and it is stolen, we could be fined, sued, or even go bankrupt, and our customers who have entrusted us become vulnerable in ways we must work to prevent.

If our data store becomes corrupted or a database crashes, we had better hope we have good backup systems and redundancy built in. SREs are responsible for that, too, and this is another place where good tooling and knowing how to use it matters.

Prevent Recurrence of Past Problems

SREs look at problem events from the past and try to prevent them from recurring. This is where events like blame-free retrospectives are incredibly valuable. Talking through a problem issue by issue, noting what happened and how, without making anyone the scapegoat will elicit useful ideas and participation in keeping the problem from happening again.

The fun part is when we get to automate the mitigation and test it with a little Chaos Engineering to prove to ourselves that we have actually prevented future disasters.

Incident Analysis

This term can be used in two ways. We can analyze an incident while it is in progress, looking for how to repair and recover. That was covered under Incident Repair. Here we are thinking of the other sort of analysis, the one that happens after a problem is fixed and everything is working again.

Some places call the information gathering and presentation process a post-mortem. We prefer to call it a retrospective. Regardless of the name, SRE culture insists that it be done in a blameless manner. We aren’t looking for a scapegoat; we are looking to learn. This isn’t about who made an error (even if the incident is caused by human error), because often the human error happens because someone did something the tools or software shouldn’t have permitted them to do anyway.

Every detail available about an incident is gathered together, assembled in a logical (frequently chronological) order, and presented by the team. We share in order to learn.

Sometimes the retrospective is shared across the company or at least across affected business units. Sometimes, it is even shared publicly. In these cases, information about how the problem was fixed and also how it will be mitigated against or prevented from recurring is also included.

Learn and Share Skills

A huge part of Site Reliability Engineering is perspective. SRE teams commit to working for mutual benefit and to sharing information in a way that lets as many people as possible benefit. Getting the job is not the end of learning. Understanding a system and managing it well as part of a mature and successful SRE team is not the final stage. This philosophy of mutual respect, sharing, teamwork, and focus on building a better future system together rather than worrying about "who broke it" last time is key.

Senior SREs take time to teach junior SREs using specially written onboarding runbooks that script out main tasks they need to learn, by mentoring and actively communicating best practices and institutional knowledge. Doing Site Reliability Engineering well absolutely requires the development of a community within each team and across teams. A failure here will ultimately lead to failure of the practice.

Engineers with a talent for presentation often find opportunities to share their knowledge at conferences and other events, typically attended by other SREs and DevOps practitioners and those who want to learn more about the practices. This provides great opportunities to learn new skills, become aware of new technologies, and cross-pollinate and hybridize practices that work across companies. One thing most love to share is outage stories. There is a power in good storytelling for SREs to convey meaningful information in an engaging way.

In addition, there are many opportunities to find other practitioners on the web. Many write blog posts for the company or on their personal site to share what they are learning. We find our fellow SREs on Twitter, Meetup groups, Reddit, some LinkedIn community groups, and specialized topical Slack channels (transparency moment...that last link goes to the Gremlin-sponsored Chaos Engineering Slack, which has participants from across the industry well beyond Gremlin).

How Did Site Reliability Engineering Begin?

The history starts back in the early 2000s and predates the better-marketed term, DevOps. The title "Site Reliability Engineer" was invented by Ben Treynor, a Vice President of Engineering at Google who is ultimately responsible for thousands of software engineers. His LinkedIn profile says,

If Google ever stops working, it’s my fault.

Other companies, like Amazon and Netflix, had similar activities begin at similar times. All of them were looking for ways to make their already large-scale deployments more reliable, efficient, and scalable. However, it was a team from Google who literally wrote the book on SRE based on the company’s practices.

In addition to combining software engineering and operations roles, the big change was a perspective shift. The move from reactive firefighting when problems arise to proactive hardening of infrastructure is a big deal. It requires more focused attention up front, but ultimately saves SRE worker time and company money by preventing potential failures.

It is this sort of perspective shift that led to the creation of Chaos Engineering as a practice embraced by SREs and DevOps practitioners. Anyone wanting to prove that the mitigation schemes and preventative actions implemented actually work suddenly had tools at hand to do so.

How Much Do Site Reliability Engineers Get Paid?

We cover this question in greater depth in SRE vs DevOps - Can they Coexist or do they Compete?. The very short answer is that salaries depend a lot on factors like location, engineer experience, and company.

As of early 2020, the annual salary range for Site Reliability Engineers across the United States is from about $75,000 at the low end up to around $450,000 or more in some extreme outlying cases. The median salary in the USA is about $236,000. Learn more in How Much Money Do Site Reliability Engineers Make in Salary & Stock.

What Skills Do I Need to Become a Site Reliability Engineer?

We cover this question in greater depth in The Roles and Responsibilities of SREs.

A top-notch Site Reliability Engineer candidate will have a natural thought process for prioritization. That is, they are able to sift through information and discern what is important and what is not. They will also have excellent interpersonal communication skills.

They will also have a skill set including some level of familiarity with:

  • Git and hosts like GitHub and/or GitLab
  • Vim (because this editor is widely available on pretty much any server you are likely to encounter)
  • Linux fundamentals like package management, user account management, and directory and file permissions
  • Basic server software management such as for Apache httpd or Nginx
  • SSH
  • Shell scripting, such as with Bash
  • Programming with languages like Python and perhaps Go and even Rust
  • Automation
  • Networking
  • Monitoring, logging, and observability
  • Testing, including the ability to write both unit tests and tests for use in CI
  • Databases, both relational like MySQL and Postgres as well as at least passing familiarity with newer NoSQL/NewSQL options like Cassandra, MongoDB, and Neo4j

Those just entering SRE in a junior level position are not expected to know all of these at the start, but are expected to learn what is needed to work successfully on the system(s) they are hired to keep up and running efficiently. See our sample SRE job description and interview questions article for more.

What Tools do SREs Use?

Site Reliability Engineering can at times be chaotic. It is always and ever changing. Managing the work within this environment takes planning. Standardizing a tool set across a team is always a good idea.

To begin, SREs have to be able to track work and progress in order to be successful. To that end, one of the first tools used by an SRE organization is a good issue tracker like JIRA or Pivotal Tracker.

SREs write many of their tools alongside the software they manage. Placing that code in a version control repository, such as one managed with Git, is vital. Having everyone use the same IDE, libraries, and build process, such as a CI/CD tool like Jenkins or Spinnaker, makes working together much more efficient and smooth.

The process of deploying the service(s) owned by an SRE team into the wider cloud application architecture is important. Many teams use containers for this, built with a tool such as Docker and often orchestrated with Kubernetes.

Teams typically automate everything they can, including configurations. Tools like Ansible and Terraform are useful for this.

How Does Site Reliability Engineering Fit Into a Wider Engineering Organization?

In a younger organization that has a large cloud deployment, SRE roles may be the norm. In older organizations that still have corporate-owned data centers and dedicated development versus operations teams, SRE may be the new, unique role. Evolution only seems fast in reverse.

During a transition period (which can last for many years in some cases), a company may have a dev team still doing product development using waterfall methods and throwing code over the wall to ops who are charged with making that code work in production. Of course, that is only after multiple reviews and approvals by change review boards, deployments in testing and stage environments, and so on.

Those same companies may, at the same time, have new teams running as pilot projects with permission to use agile development methods, DevOps and/or Site Reliability Engineering practices, and automated CI/CD pipelines. These are running alongside the more traditional teams and may sometimes be seen as competitors.

The best way to prove the value of Site Reliability Engineering as a perspective is to do it with excellence. Learn the perspectives. Enact the core values. Push stable and needed features at a faster cadence than ever seen before and have the same people who write the code be responsible for keeping it working well in production.

Teams may be composed of traditional developer/engineer roles, operations experts, monitoring gurus, and so on, with just a handful of team members having the title "Site Reliability Engineer." That is not unusual. In places like this, the entire team owns the code and the operations aspects, and the SRE is typically the one who knows the most about managing the code: building, deploying, configuring, and so on. A database engineer is likely still the one focusing on data reliability, but the SRE helps with a wider perspective while benefiting from her knowledge.

In the end, SRE is all about collaboration and cooperation, bringing people with disparate skills together with a common purpose and having them share knowledge and responsibilities efficiently for the positive benefit of enhanced system efficiency and uptime.

How Can We Create Our First Site Reliability Engineering Team?

The best way to be successful is to first know what we are trying to accomplish. Too often we see projects or paradigm shifts begin with too little planning. We want flexibility in our implementation as we learn and need to adapt to changing circumstances, but we still need to go in with a plan.

We start by defining our need. What do we want an SRE team to do? To manage? A good place to begin is by planning that this new team will take over responsibility for one relatively small service or system.

Why? Because at this point the wider organization is learning the culture and the process at the same time and you want to set up your first team for success as they have twice the amount of learning to do. Begin preparing the overall organization for what is coming. They will need time to adapt to the idea of a team running like this. Some may push back. Listen to them, gently teach and guide, but do not get sidetracked.

Next, think about the tools this team will require to be successful. Don’t rush out and buy them yet, as you may hire a set of experienced engineers who know of better ways and better tools, but at least do your research on typical costs so that you can set a preliminary budget expectation.

How many people will be included in this team? Do you need 24/7/365 or is this a team that can work standard hours together with a person or two on call as needed? We recommend the latter for your first team: learn how to implement SRE on something that is low risk, then move upward.

Define the culture you want to cultivate in this SRE team. Write it down so that you can describe it in the job listings you will create next.

To continue your headcount planning, write job descriptions for each of the ideal candidates you want to hire for this team. Remember, not everyone will come from an identical background, and that is what you want--multiple perspectives! Be clear about the responsibilities and expectations for team members.

Decide whether team members will all be titled "Site Reliability Engineer" or whether there will be a mix of SREs and traditional roles and titles. Remember that individual team members may have specialties, but they will all be expected to contribute across the team to all needs and roles, so you need flexible people who are excited to learn together as they share their own expertise.

Now, set your salary budget, your team budget, and your tools budget for the first year. Estimate higher than you think you need.

We recommend hiring a good set of experienced SREs who love their jobs first. Expect that this initial set of 3-4 leaders will begin things by learning what exists today, how it works, and thinking about how they want to take over. Consider a mix of internal and external hires to help make this process smooth institutionally while also providing some new insights and perspectives.

These first hires will need a little time to get to know each other's personalities, strengths and talents, and perhaps any gaps they see in the team as a whole. Trust their instincts when they tell you what they will need as a team to be successful. Ask them about who to hire next and include them in the team growth process.

Let experience and learned knowledge about the system they will run guide how things progress from here.

How Do We Measure the Success of Our SRE Team?

Here are some thoughts. If we start by compiling a list of downtime and failures that caused problems, even without downtime, and get a sense of how expensive that was for the company, then we can use that as a baseline to measure against.

If we implement good SRE practices and give our engineers the tools and the headcount our teams need to be successful, we should see a decrease in downtime, both in the number of incidents and in their severity, along with faster response times and smaller customer and business impacts.
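Comparing a later period against the baseline can be as simple as a percent change per metric. The Python sketch below shows the arithmetic. The function names, metric names, and quarterly figures are hypothetical and exist only to illustrate the calculation.

```python
def percent_change(baseline: float, current: float) -> float:
    """Return the change from baseline to current as a percentage."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline * 100.0


def compare_to_baseline(baseline: dict, current: dict) -> dict:
    """Return the percent change for every metric present in both periods."""
    return {name: percent_change(baseline[name], current[name])
            for name in baseline if name in current}


if __name__ == "__main__":
    # Hypothetical quarterly figures, for illustration only.
    baseline = {"incidents": 12, "downtime_minutes": 480}
    current = {"incidents": 9, "downtime_minutes": 300}
    for name, change in compare_to_baseline(baseline, current).items():
        print(f"{name}: {change:+.1f}%")
```

A negative result here means the metric went down, which for incidents and downtime is the direction we want.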

So much of Site Reliability Engineering is about measuring the right things and using that data to inform future work prioritization. When we get the initial measurements and metrics wrong, we change as soon as we notice. It is better to move to new instrumentation, different metrics, and evolved policies than to stick with what we have just because we have an existing baseline we want to measure against.

We know when our uptime improves. We know when our site is more or less responsive. We know when customers, both internal and external, are happy with the service being provided to them. Measure what you think is most important and focus on improving that. When you are happy with the success of that metric, focus on a different one. The main thing is to pick something to improve and improve it. Focused improvements, even small ones, add up.

Site Reliability Engineering is about far more than speeding up disaster recovery. It is about preventing downtime, improving system responsiveness to stressors, and making our systems as efficient and reliable as humanly possible.
