Preventing Incidents and Inspiring Innovation

How Upwork uses Gremlin to verify resilience to past failures, find issues to prevent new incidents, and uncover unknowns so service owners can deliver a better, more reliable experience.

The Challenge:

How can you decrease incidents in complex ecosystems?

Like many large companies, Upwork runs a complex ecosystem of containerized microservices. Unfortunately, the ever-changing nature of these systems makes it difficult to fully understand how all the moving pieces interact and when services will fail. No matter how well individual components are tested, standard QA tests can’t always show how services will react once they are part of the larger ecosystem and under production workloads.

Upwork’s Reliability Engineering Team needed a holistic approach that could proactively find these gaps, then identify, raise, and resolve any issues found in them before those issues caused incidents. They needed to simulate real-world scenarios safely, with tooling whose results they could trust.

That’s when they turned to Chaos Engineering and Gremlin.

"Upwork needed a discipline that could holistically support its complex environment. Instead of taking a reactive approach, problem mitigation could be proactively taken by identifying, raising, and resolving issues before they become serious incidents."

Angel Boscan

SITE RELIABILITY ENGINEER, UPWORK

The Solution:

Standardize testing with Gremlin and collaborate with service owners

The Reliability Engineering Team had executive buy-in from the beginning, but they still needed to prove the value and effectiveness of their practice before scaling up efforts. They started with a small proof of concept where they ran tests in staging on a problematic service. As confidence grew, they started hosting regular “GameHours” with service owners to surface new issues and verify resilience to past failures.

Much of that confidence is due to the metrics developed by the Reliability Engineering team. Using these metrics, the team was able to demonstrate how they safely prevented incidents and outages, improving the uptime and reliability of Upwork’s ecosystem.

Currently, these tests are performed in a staging environment, but over the next year, the Reliability Engineering team plans to test in production environments using test suites tailored to each individual service and regularly report reliability scores to service owners and the organization.

"Gremlin's Chaos Engineering tools can safely and securely inject failure into systems to find weaknesses before they cause customer-facing issues. This approach is useful to experiment with specific failure patterns across infrastructure."

Angel Boscan

SITE RELIABILITY ENGINEER, UPWORK

How Upwork uses Gremlin to prevent future incidents and increase reliability

1. Verify fixes and resilience to past failures

A crucial part of any post-mortem is introducing tests to detect and prevent the same issue from recurring and causing another outage. But what do you do when the incident was caused by something your current tooling can’t check?

Upwork has a culture of using best-of-breed tools and isn’t afraid to evaluate a new tool if it can genuinely increase the company’s capabilities. That’s where its use of Gremlin started: verifying fixes that couldn’t be tested with other tools.

As one service owner said, “This is something I couldn’t have checked without the tooling—it’s something I couldn’t have known.”

Using Gremlin and Chaos Engineering best practices, the Reliability Engineering team built specific experiments that replicate the real-world scenario behind each incident, such as a spike in traffic or dropped network access to a dependency. This allowed them to verify that the fix worked and that it didn’t introduce other problems.
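The shape of such an experiment can be sketched in a few lines of Python. This is only an illustration of the pattern (establish a baseline, inject the failure, check the fix, recover) and does not use Gremlin or reflect Upwork's code. The `ProfileService` class, the cache fallback standing in for the "fix", and every other name here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class DependencyUnavailable(Exception):
    """Raised when the simulated downstream dependency cannot be reached."""


@dataclass
class ProfileService:
    # Hypothetical service: fetch_remote stands in for a call to a dependency.
    fetch_remote: Callable[[str], str]
    cache: dict

    def get_profile(self, user_id: str) -> Optional[str]:
        try:
            value = self.fetch_remote(user_id)
            self.cache[user_id] = value
            return value
        except DependencyUnavailable:
            # The fix being verified: serve the last known value, if any.
            return self.cache.get(user_id)


def healthy_dependency(user_id: str) -> str:
    return f"profile-for-{user_id}"


def dropped_dependency(user_id: str) -> str:
    # Simulates dropped network access to the dependency.
    raise DependencyUnavailable(user_id)


def run_experiment() -> dict:
    service = ProfileService(fetch_remote=healthy_dependency, cache={})
    baseline = service.get_profile("u1")        # steady state, warms the cache

    service.fetch_remote = dropped_dependency   # inject the failure
    during_failure = service.get_profile("u1")  # served from the cache
    cold_miss = service.get_profile("u2")       # never cached, so no fallback

    service.fetch_remote = healthy_dependency   # end the experiment
    after_recovery = service.get_profile("u2")

    return {
        "baseline": baseline,
        "during_failure": during_failure,
        "cold_miss": cold_miss,
        "after_recovery": after_recovery,
    }


if __name__ == "__main__":
    print(run_experiment())
```

The `cold_miss` case reflects the second half of the check: the fix holds for cached users, but the experiment also surfaces a gap (users with no cached value) that a test of the fix alone would not show.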

2. Prevent incidents before they cause outages

A common practice for Gremlin users is to hold GameDays, where multiple service owners come together to spend a day running tests and then addressing any issues or failures that are uncovered. In fact, the Gremlin platform has a capability specifically to help run GameDays with your teams.

Upwork used a more focused version of this approach that they call GameHours. Every week, they brought together the owners of a specific service and of any directly related services, then spent a couple of hours simulating real-world failure scenarios in a controlled environment. These scenarios ranged from sudden spikes in user traffic to failures in specific components.

Each GameHour focuses on a different service. This systematic approach allowed the Reliability Engineering team to work directly with service owners to identify issues, get them addressed, and then verify the fixes, all before the code even reached production.

By adopting GameHours and regular testing, the engineering team has been able to actively prevent major incidents from occurring in the first place.

3. Improve the stability and capability of Upwork

Upwork’s Chaos Engineering efforts began with a mandate from engineering leadership to proactively address the causes of incidents. While that executive mandate has greatly contributed to the success of the program, it also meant that the Reliability Engineering team was expected to prove results to the business.

The SREs aligned around three key metrics: the number of bugs found, the number of tests run, and the safety of those tests. Over their first year, they found a large number of critical issues, ran tests on services across the organization, and never once negatively impacted production, thus proving the effectiveness and safety of their efforts.
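As a rough illustration of how those three metrics could be tallied from a log of experiments, here is a minimal Python sketch. The record fields, the `summarize` helper, and the sample service names are all assumptions made for this example. They are not Upwork's data model or a Gremlin API, and the sample numbers are invented purely to show the calculation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ExperimentRecord:
    # Hypothetical record of one experiment run.
    service: str
    bugs_found: int
    impacted_production: bool


def summarize(records: Iterable[ExperimentRecord]) -> dict:
    """Roll experiment records up into the three program metrics."""
    records = list(records)
    return {
        "tests_run": len(records),
        "bugs_found": sum(r.bugs_found for r in records),
        "services_covered": len({r.service for r in records}),
        # Safety metric: count of runs that negatively affected production.
        "production_impacts": sum(r.impacted_production for r in records),
    }


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    sample = [
        ExperimentRecord("search", bugs_found=2, impacted_production=False),
        ExperimentRecord("search", bugs_found=0, impacted_production=False),
        ExperimentRecord("payments", bugs_found=1, impacted_production=False),
    ]
    print(summarize(sample))
```

Tracking production impacts as its own count keeps the safety claim checkable: a program that "never once negatively impacted production" corresponds to that field staying at zero.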

But the tests had an impact beyond improved reliability and decreased incidents. When service owners saw how their services reacted to real-world scenarios, it gave them ideas for improving the architecture, even when the test passed. These additional insights allowed engineering to actively improve the core Upwork platform and products, providing even more value from the Reliability Engineering team’s Chaos Engineering program.

What’s next?

Now that service owners and the Reliability Engineering team are comfortable with Chaos Engineering, the next step is to scale up and expand efforts to provide even more value and reliability by testing in production and enabling teams to run their own testing.

The team plans to improve coverage by increasing the variety and volume of tests. Using known failure patterns and best practice test suites built into Gremlin’s Reliability Management platform, the SREs will roll out more testing capabilities to more teams across the organization.

At the same time, they’re going to optimize their existing efforts so they can move faster through refined internal processes, tailored test suites for different services, and Reliability Scores based on test suites within Gremlin. They’re also going to take advantage of Gremlin’s Reliability Management platform and pre-built testing scenarios to give service owners the self-service capability to run their own verification tests without having to slow down.

By using Gremlin and Chaos Engineering, the Upwork Reliability Engineering team was able to decrease the number of incidents, actively prevent outages, and help service owners provide better experiences for Upwork customers. And over the next few years, they hope to expand that effort to have a greater impact on the company and make the entire ecosystem more reliable—and to improve the customer experience with Upwork.

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.
