How Ritchie Bros Creates a Culture of Reliability
Gremlin helps the world's largest auctioneer of commercial assets and vehicles create a seamless customer experience: modernizing with confidence, building an innovative engineering culture, and keeping applications available.
Executive Summary
Ritchie Bros is the world’s largest auctioneer of commercial assets and vehicles, so its online bidding system needs to be performant and reliable while tackling complex data and payment problems. By using Gremlin and Chaos Engineering, the company can deliver a seamless customer experience, improve availability, and create a culture of reliability that fuels ongoing innovation.
“How do you know you won? Part of it is you have to know it’s still working the next day. It’s not that it’s working 10 minutes after you released it; it still works a month, 10 months later. It just works and it self-heals, and the only way you know that is to go create the chaos.”
The Challenge:
How do you modernize in the face of technology inertia?
The commercial and industrial equipment market is filled with unique technological hurdles. With a wide variety of databases and identification systems, performant, accurate data validation can be challenging. At the same time, the customer base is spread out over low signal coverage areas and is less familiar with the inner workings of software and applications, making it easy to lose customer trust over a single bad experience.
To overcome these hurdles, the technology team at Ritchie Bros was tasked with modernizing systems, consolidating databases, increasing performance, and improving system availability—a monumental task at the best of times, let alone under these challenging conditions.
“Given how chaotic the systems are, this was why it was so important to us to test this chaos instead of testing it in front of the customer.”
The Solution:
Build a culture of taming chaos
Ranbir Chawla, SVP of Engineering, turned to Chaos Engineering. After looking at the disparate systems and databases, he realized that he needed long-term solutions for reliability and performance. So, he focused on building a culture of reliability by creating and resolving chaos before it led to outages.
They started with small teams, performing tests and verifying the resilience of new code. After the first few teams successfully launched stable, resilient applications, engineers took notice. This started a cultural shift where engineers looked for what could go wrong, anticipated failures, and used resilience testing to validate their fixes.
By using Gremlin, they’re creating a culture of reliability where engineers ship better code and modernize applications without causing failures—which means they can rest easy at night without getting paged.
How Ritchie Bros builds a model engineering culture and a better customer experience
1. Modernize with confidence
Unlike consumer automobiles with a centralized VIN system, commercial and industrial equipment has a wide variety of ID and classification systems that vary from manufacturer to manufacturer. Since each piece of equipment is a substantial investment, data validation is essential. At the same time, Ritchie Bros has grown through acquisition, requiring the integration of a variety of databases, auctioning systems, and more, many with duplicate functions built on outdated platforms.
Modernization isn’t just nice to have—it’s essential to building the performant, available system customers demand.
Chaos Engineering and resilience testing let Chawla’s teams thoroughly test their code by simulating real-world conditions before deployment, so they could launch confidently and without causing incidents.
“You can’t measure what you can’t see,” said Chawla, “and you need chaos testing so we know we really hit the goal of being ready to release.”
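As a rough illustration of what simulating real-world conditions before deployment can look like, here is a minimal sketch of a fault-injection experiment. It does not use Gremlin's API and is not Ritchie Bros' code: every name in it (FlakyDependency, call_with_retry, lookup_equipment, run_experiment) is hypothetical. It assumes a downstream call that sometimes fails and measures how well a simple retry policy absorbs those failures.

```python
import random


class FlakyDependency:
    """Hypothetical wrapper that injects failures into a callable at a fixed rate."""

    def __init__(self, func, failure_rate, seed=0):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so the experiment is repeatable

    def __call__(self, *args, **kwargs):
        # Fail before reaching the real function, as a dropped connection would.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return self.func(*args, **kwargs)


def call_with_retry(func, attempts, *args, **kwargs):
    """Retry on ConnectionError up to `attempts` times, re-raising the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func(*args, **kwargs)
        except ConnectionError as exc:
            last_error = exc
    raise last_error


def lookup_equipment(serial):
    """Stand-in for a real downstream service (assumed for illustration)."""
    return {"serial": serial, "valid": True}


def run_experiment(failure_rate, attempts, calls=200):
    """Return the fraction of calls that succeed under injected faults."""
    flaky = FlakyDependency(lookup_equipment, failure_rate)
    successes = 0
    for i in range(calls):
        try:
            call_with_retry(flaky, attempts, f"SN-{i}")
            successes += 1
        except ConnectionError:
            pass  # the retry budget was exhausted for this call
    return successes / calls
```

Comparing run_experiment(0.3, attempts=1) with run_experiment(0.3, attempts=5) shows whether the retry policy actually absorbs the injected faults, which is the kind of evidence a team might want before declaring code ready to release.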
2. Create a seamless customer experience
Company growth means there’s now a steady stream of high-value auctions going through the Ritchie Bros system, and the high volume of traffic means teams can no longer put the system in maintenance mode for updates. These increased demands require not only a modernization of core applications but also a modernization of the engineering platform itself, one that allows updates to be shipped during auctions without disrupting systems.
As an example, one team was tasked with incorporating Kubernetes and microservices to connect the website to the bidding interface for a faster, more performant, and more stable customer experience. Solving this required using a technology the team hadn’t used before. By using Gremlin, the team was able to fully test their code in pre-production to make sure it could handle the production environment.
Their testing uncovered potential P0 and P1 issues, but by the time the code left the team for staging, they were confident in its performance. Sure enough, the application launched seamlessly while auctions were running, without any disruption to production.
3. Give your engineers confidence and work-life balance
Chawla considers it one of his core duties as SVP of Engineering to create an engineering platform engineers want to use and a culture they want to be a part of. Gremlin has proven essential for creating both.
“I haven’t seen anything that competes,” Chawla says of Gremlin. “Gremlin has a focus and you nailed the focus. I’m looking for tools that do what they do well and better than everybody else.”
As a best-of-breed tool, Gremlin gives the engineers using it confidence in the capabilities of their software. Instead of worrying about getting paged after they go home at night, they’re able to focus on innovation and building performant, reliable code, because by the time the code ships, they know it’s stable.
Chawla’s goal is to create an environment where engineers can turn off their phones or go on vacation knowing they won’t get called. And with Gremlin currently being used regularly by over half of his team of 300 engineers, he’s making that culture of innovation and respect a reality.
Avoid downtime. Use Gremlin to turn failure into resilience.
Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.