How Sephora improves performance and availability

Gremlin helps the world’s leading prestige beauty retail brand migrate smoothly from a monolith to Kubernetes and get through Black Friday and Cyber Monday without any major issues.

Seamless migration

Testing in the Performance environment uncovered P0 and P1 issues so they could be resolved before deployment.

Standardized reliability

Root causes and real-world failures were used to create reliability standards that are tested in every sprint.

Executive Summary

Sephora was in the middle of a multi-year migration from a legacy monolithic system to Kubernetes-based microservices. The new systems promised improvements in flexibility, scalability, and reliability, but they also meant adopting an entirely new architecture. Using Gremlin, the Performance Engineering team was able to standardize reliability testing across teams to identify and address issues before they hit production. The result? Performance and availability improvements that contributed to a tremendously successful holiday season on the new microservices platform without a single major issue or outage.

“We developed our team's expertise to handle issues that other teams couldn't test. At the same time, we championed Chaos Engineering, demonstrating its value in rapidly evaluating various scenarios.”

Lead Performance Engineer, Sephora

The Challenge:

How do you migrate seamlessly and maximize the benefits of the migration?

Like many companies, Sephora was in the midst of a multi-year migration from legacy systems to microservices. These migrations offer enormous potential for improvements in performance, scalability, and availability, but they also substantially increase the complexity of systems and architectures.

The Performance Engineering team was tasked with making the migration as seamless as possible while also making sure there wouldn’t be failures when traffic was switched over to the new systems.

The length of the migration further complicated the task, since many microservices would still have legacy systems as dependencies during the transition.

“This is the first time we went with the new microservices platform for holiday sales. It was a tremendous success for us without any major issues. We made a significant impact on that contribution from our team, and Gremlin helps us to raise our bar.”

Lead Performance Engineer, Sephora

The Solution:

Build reliability standards based on real failure conditions

The Performance Engineering team turned to Fault Injection testing and Gremlin. Testing in a Performance environment that closely mirrors Production, they replicated real past issues to verify resilience, then ran new tests to uncover further issues. Whenever they uncovered failures, they tracked down the root causes, then used these to implement new testing and reliability standards as part of sprints across the product teams.

By using Gremlin, the Performance Engineering team was able to uncover and prevent failures that would have led to P0, P1, and P2 outages in production, and to institute testing that detects and prevents their recurrence.

All this effort contributed to an incredible result: When Sephora switched over to the new microservices system for the first time during the 2024 holiday season, they didn’t suffer a single major issue despite the massive increases in traffic.

How Gremlin helped Sephora deliver a flawless migration and customer experience

1. Standardize reliability

Most engineers have an idea of where failures could occur in their systems, and the Performance Engineering team at Sephora was no different. But instead of hoping failure wouldn’t happen, they set out to prove resilience to failure. By using Gremlin, they were able to replicate failure conditions, verify that the system reacted as expected, and address issues when it didn’t.

One key focus of their tests was circuit breakers that prevent cascading failure in the systems. These are notoriously hard for engineers to test, but by using Gremlin, the Performance Engineering team was able to make sure the circuit breakers were set at the right level and functioning correctly.
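To make the circuit-breaker testing concrete, here is a minimal Python sketch of the idea. It is not Sephora's implementation and it does not use Gremlin's API: the `CircuitBreaker` class, its thresholds, and the `run_experiment` helper are all hypothetical, and the failure is injected in-process rather than by a fault-injection tool. The sketch shows the two properties such a test would typically check: the breaker opens after the configured number of failures, and it closes again once the dependency recovers.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests need no real waiting
        self.failures = 0
        self.opened_at = None       # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency calls are short-circuited")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call re-opens the breaker immediately.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def run_experiment():
    """Inject failures into a fake dependency and record how the breaker reacts."""
    now = [0.0]                     # simulated clock, in seconds
    healthy = [False]               # flipped to True to end the injected failure
    calls = {"count": 0}            # how often the dependency was really hit
    breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0,
                             clock=lambda: now[0])

    def dependency():
        calls["count"] += 1
        if not healthy[0]:
            raise ConnectionError("injected failure")
        return "ok"

    outcomes = []
    for _ in range(5):
        try:
            outcomes.append(breaker.call(dependency))
        except CircuitOpenError:
            outcomes.append("rejected")
        except ConnectionError:
            outcomes.append("failed")

    # End the experiment: the dependency recovers and the reset timeout passes.
    healthy[0] = True
    now[0] = 31.0
    outcomes.append(breaker.call(dependency))
    return outcomes, calls["count"]
```

With a threshold of three, the first three calls fail against the dependency, the next two are rejected without reaching it, and the final call succeeds after recovery. A threshold set too high would show up here as extra calls reaching a dependency that is already failing.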

Once the tests were dialed to the right level, they were then rolled out as part of sprint performance testing. As a result, each sprint had a set of standardized resilience tests run in a Performance environment, allowing the team to uncover P0 and P1 issues and address them before code was promoted to the production environment.

And the Performance Engineering team is just getting started. Their goal is to integrate resilience testing into the CI/CD pipeline for all services, not just critical ones—and to start testing in the Production environment to make sure issues are found before they impact customers.

2. Testing through the migration

The migration from legacy monolithic systems to a new microservices architecture brought a lot of benefits to Sephora, including more scalability, flexibility, and reliability. However, as with any modernization effort, it also brought increased complexity. It was the Performance Engineering team’s job to make sure the new microservices system was set up correctly to achieve the performance and availability goals, a task that came with unique challenges.

For example, Kubernetes is incredibly scalable, but how do you make sure that the cluster and applications are all set up to scale correctly? Or that traffic is redirected correctly if a node or dependency is unavailable? By using Gremlin, they were able to validate scalability settings and ensure that systems failed over correctly by introducing latency or making services unavailable.
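The failover question above can be illustrated with a small simulation. This is a sketch under stated assumptions, not Sephora's setup or Gremlin's API: the `Backend` and `route` names, the node names, and the 200 ms timeout are hypothetical, and latency is modeled as a number rather than a real network delay. It shows the shape of the check: inject latency or unavailability into one backend, then confirm that traffic lands on the other.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical stand-in for a node or dependency."""
    name: str
    base_latency_ms: float
    injected_latency_ms: float = 0.0   # extra delay added by an experiment
    available: bool = True             # set to False to simulate an outage

    def respond(self, timeout_ms):
        """Return the simulated response time, or raise if the call fails."""
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        latency = self.base_latency_ms + self.injected_latency_ms
        if latency > timeout_ms:
            raise TimeoutError(f"{self.name} exceeded {timeout_ms} ms")
        return latency


def route(backends, timeout_ms=200.0):
    """Try backends in order; return (name, latency) of the first that answers."""
    errors = []
    for backend in backends:
        try:
            return backend.name, backend.respond(timeout_ms)
        except (ConnectionError, TimeoutError) as exc:
            errors.append(str(exc))
    raise RuntimeError("all backends failed: " + "; ".join(errors))


def run_failover_checks():
    """Run a baseline, a latency experiment, and an unavailability experiment."""
    primary = Backend("node-a", base_latency_ms=20.0)
    secondary = Backend("node-b", base_latency_ms=35.0)
    backends = [primary, secondary]

    results = [route(backends)]            # healthy baseline

    primary.injected_latency_ms = 500.0    # latency experiment
    results.append(route(backends))

    primary.injected_latency_ms = 0.0
    primary.available = False              # unavailability experiment
    results.append(route(backends))
    return results
```

In this model the baseline request is served by `node-a`, and both experiments end with the request served by `node-b`. If the timeout were longer than the injected latency, the request would stay on the slow backend, which is the kind of misconfiguration such a test is meant to surface.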

Another challenge was connecting the new microservices to legacy databases. Migrations of this magnitude take years of work, which means the system will be expected to be performant while in a hybrid state. The Performance Engineering team was able to simulate failures in communication between legacy systems and microservices, then address issues so the two were able to work seamlessly in production.
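The source does not say which fixes were applied to the hybrid setup, so the following Python sketch is only an illustration of how a communication failure between a microservice and a legacy system might be simulated and checked. Everything in it is hypothetical: `FlakyLink` injects a fixed number of connection failures, `call_with_retry` is one common mitigation (bounded retries), and `legacy_lookup` stands in for a legacy database call.

```python
class FlakyLink:
    """Wraps a callable and fails its first `failures` calls."""

    def __init__(self, func, failures):
        self.func = func
        self.remaining_failures = failures
        self.attempts = 0                  # total calls made through the link

    def __call__(self, *args, **kwargs):
        self.attempts += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("injected link failure")
        return self.func(*args, **kwargs)


def call_with_retry(func, max_attempts=3):
    """Call func, retrying on ConnectionError up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return func()
        except ConnectionError as exc:
            last_error = exc
    raise last_error


def legacy_lookup():
    """Placeholder for a query against a legacy database."""
    return {"sku": "demo-123", "in_stock": True}
```

Wrapping `legacy_lookup` in `FlakyLink(legacy_lookup, failures=2)` and calling it through `call_with_retry` succeeds on the third attempt, while five injected failures exhaust the three attempts and surface the `ConnectionError` to the caller. Bounding the attempts matters in a hybrid system, since unbounded retries against a struggling legacy dependency can make an outage worse.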

3. Delivering higher availability and performance

Throughout the first year of their journey with Gremlin, the Performance Engineering team was able to improve the availability and performance of Sephora’s systems, sometimes with the same action. In one instance, latency issues were flagged between microservices. By using Gremlin, the Performance Engineering team was able to show that the latency was caused by a failure that had the potential to become a large outage if left unaddressed. The issue was fixed, the fix was verified by further testing, and the code was then promoted, increasing both availability and performance in production systems.

It wasn’t long before service owners noted the effectiveness of these efforts. After a few short months, developers started creating Jira tickets asking Performance Engineering for tests. While the first year was spent focusing on P0 services, the roadmap is to roll out testing to additional services in the future—and to enable teams to run their own resilience tests.

But of course, the biggest proof comes when the tested and verified systems run in production with real customer loads, and that came during the holiday season. This marked the first holiday season running the new microservices systems instead of the monolithic legacy application. Due to the combined efforts of the Sephora engineering team, including the testing performed by the Performance Engineering team with Gremlin, they went through the entire holiday season—including Black Friday and Cyber Monday—without a single major issue interrupting service.

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.