4 Keys to a Resilient System

Have you ever noticed that some software is like the Energizer Bunny—it keeps going no matter what’s wrong—whereas other systems seem to fall over whenever someone sneezes? What’s the difference? What makes some systems so much more robust and resilient than others?

Resilient systems that withstand almost anything come from the mindset that things will go wrong, rather than hoping they won’t. By accepting that failure is inevitable, these systems are built to handle failures from the ground up, rather than as a bolt-on feature. Four key characteristics set robust, resilient systems apart.

Redundant

In the current world of virtual machines, cloud computing, and containers, it’s becoming easier and easier to build whole armies of applications. Each of these applications might have at least two, but often many more, instances running. Not only does this allow you to spread the load across several machines (virtual or otherwise), but if something happens to one instance (a fatal crash, a dead VM, a hardware failure, etc.), there is a redundant copy running that can take over and carry on where the lost application left off.

Building redundant systems requires some forethought. Some questions you might ask yourself as you’re building such a system:

  • How will you determine if a copy of the application has failed?
  • How will you route traffic?
  • How will you coordinate work that should only happen once across the cluster (e.g., a scheduled job)?
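To make the first two questions concrete, here is a toy health-check router in Python. The names (`Instance`, `route_request`) are illustrative, not from any real load balancer, and a real health check would be an HTTP ping or heartbeat rather than a flag:

```python
class Instance:
    """One running copy of the application."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def health_check(self):
        # In a real system this would be an HTTP ping, heartbeat, or
        # liveness probe; here it's just a flag for illustration.
        return self.healthy


def route_request(instances):
    """Send the request to the first instance that passes its health check."""
    for inst in instances:
        if inst.health_check():
            return inst.name
    raise RuntimeError("no healthy instances available")


instances = [Instance("app-1"), Instance("app-2")]
print(route_request(instances))   # app-1 handles traffic

instances[0].healthy = False      # app-1 fails...
print(route_request(instances))   # ...and app-2 takes over
```

Real routers are more sophisticated (load balancing, probing on a schedule, draining connections), but the core idea is the same: detect the failed copy and route around it.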

Replicated

Whereas a redundant system has multiple instances of an application running, a replicated system has multiple copies of the data available. That is, when the main source of data fails, you have a backup copy available to use. This fall-back might happen automatically; in other cases, it might require manual intervention.

I’m not talking about regularly scheduled backups here. If you lose your primary database in the middle of the day, you don’t really want to have to restore from a backup made at 2am. Instead, use a system that keeps your backup copies up to date in near real-time. For example, when you publish a message to Kafka or insert an entry into a distributed cache, that data is copied to another system immediately. There may be a small window for data loss, depending on the system and how you use it, but it’s measured in milliseconds or seconds, not hours or days. Products such as Oracle GoldenGate can keep a backup copy of your database in sync in near real-time.

Some of the better systems not only replicate your data, they keep it replicated. For example, if you lose a node in an Elasticsearch cluster, it will copy that node’s data from a replica onto another active node. This ensures that you always have a full set of copies in the system.
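The "keep it replicated" behavior can be sketched with a toy in-memory store. Every write goes to a fixed number of nodes, and when a node is lost, its data is re-copied onto surviving nodes to restore the replica count. All the names here are illustrative; real systems like Kafka and Elasticsearch do this with far more care around consistency and consensus:

```python
class ReplicatedStore:
    """Toy key-value store that keeps each key on `replicas` nodes."""

    def __init__(self, node_names, replicas=2):
        self.nodes = {name: {} for name in node_names}
        self.replicas = replicas

    def put(self, key, value):
        # Write the value to the first `replicas` nodes.
        for name in list(self.nodes)[:self.replicas]:
            self.nodes[name][key] = value

    def get(self, key):
        # Any node holding the key can serve the read.
        for data in self.nodes.values():
            if key in data:
                return data[key]
        raise KeyError(key)

    def lose_node(self, name):
        lost = self.nodes.pop(name)
        # Re-replicate: copy the lost node's data onto surviving nodes
        # that don't already hold it, restoring the replica count.
        for key, value in lost.items():
            holders = [d for d in self.nodes.values() if key in d]
            spares = [d for d in self.nodes.values() if key not in d]
            for d in spares[:self.replicas - len(holders)]:
                d[key] = value


store = ReplicatedStore(["node-a", "node-b", "node-c"])
store.put("order:42", {"total": 99})
store.lose_node("node-a")          # one copy is gone...
print(store.get("order:42"))       # ...but the data is still available
```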

Degrades Gracefully

Even if your application is built with redundancy and replicated data, some failures will still get through. In those cases, resilient applications degrade gracefully, attempting to let the user continue without interruption.

Here are a few ways that resilient applications can degrade gracefully:

  • Serve cached data. This is the last known good value. Depending on the application, this may be sufficient.
  • Reduce functionality. Perhaps the core of the application still works, even if some of the extra widgets and features aren’t available.
  • Buffer writes. Maybe your payment processor is down, but you could write the transaction to a message queue for later processing.

Whatever technique you use, the idea is to allow the user to keep working with the system, even if not all of the functionality is available.

Recovers Automatically

Having a system that is redundant, replicated, and able to degrade gracefully is nice, but do you really want to be the one who has to restart applications once things have come back online? I’ve worked on systems that gave up attempting to reconnect to an external system. I’ve seen systems where the connection pool didn’t rebuild its connections when the database came back online. A resilient system recovers automatically once things have been restored to a good state.

Often, recovery involves some kind of retry loop. This might mean spinning up a background thread that pings an external system to see if it’s back online. It might mean simply retrying the previous request, hoping the outage is temporary. In either case, there should be some kind of delay between attempts, and often the delay should increase after each one. There’s no point pounding a sick system with requests that won’t succeed.
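A minimal sketch of such a retry loop with an increasing (exponential) backoff might look like this. The `sleep` parameter is injectable so the delay schedule can be observed in tests without actually waiting; the function name and defaults are illustrative:

```python
import time

def retry_with_backoff(operation, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call `operation` until it succeeds, doubling the delay between tries."""
    delay = base_delay
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise          # give up after the final attempt
            sleep(delay)       # wait before trying again...
            delay *= 2         # ...and back off a little more each time
```

Production retry loops often also add random jitter to the delay, so that many clients recovering from the same outage don’t all retry in lockstep.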

Conclusion

Building robust, resilient systems can be tricky. Keeping these four characteristics in mind makes it a bit easier to manage. It may take extra time and effort to handle these situations well, and you may not even know what all the failure conditions are until your application fails the first time. In the end, it’s worth the work to build your systems to be resilient. Your customers and users will thank you for it.

Discussion Question: What do you think are the keys of building a resilient system? What techniques do you use?