High availability is all the rage these days. I’ve written about how it’s a matter of when, not if, things will go wrong. We want to build resilient systems, but what is the cost of high availability? How much effort should go into recovering from errors and failures?
I was on vacation with my family in Florida recently. While there, we visited the Kennedy Space Center, where they had exhibits on everything from the lunar landings and Space Shuttle missions to travel to Mars. After our visit, I got to thinking about the cost of failure and the cost to mitigate that failure.
How far should we go when faced with the possibility of system failures? In the case of space travel, people’s lives and millions of dollars are on the line, so spending the time and money to build resilient, redundant systems isn’t unreasonable. But that’s not always the case.
Cost vs. Cost
There is a cost to building the systems needed to recover from and respond to failures. High availability systems have three main costs to consider:
- Development. You need to work out the approach, write the software and test it.
- Capital and Operational. To run redundant systems, you’ll need additional servers, virtual or physical. Those cost money to buy or rent, and they take time and effort to maintain.
- Maintenance. Self-healing systems tend to be more complex. That complexity translates into additional maintenance costs.
On the other hand, if you don’t respond gracefully to a failure, there’s a cost to that as well:
- Monetary. Perhaps orders are unable to flow through the system.
- Reputation. Maybe orders are flowing in, but the experience is so terrible, users don’t return and tell their friends not to use your product or service.
- Support Fatigue. Maybe the failure means your support staff needs to be watching and bouncing servers or applications all the time.
- Physical Harm. In the case of the space missions, failure could mean serious injury or death.
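One rough way to compare the two lists above is to put numbers on them. The sketch below is only an illustration: the class names, the `worth_building` helper, and every figure in it are made up for the example, and it only captures the monetary side. Reputation and physical harm don't reduce neatly to dollars.

```python
from dataclasses import dataclass


@dataclass
class FailureScenario:
    """Hypothetical inputs; every figure is illustrative, not measured."""
    outages_per_year: float   # expected number of incidents per year
    hours_per_outage: float   # average time to recover without high availability
    cost_per_hour: float      # lost revenue plus support time per hour of outage

    def expected_annual_cost(self) -> float:
        return self.outages_per_year * self.hours_per_outage * self.cost_per_hour


@dataclass
class Mitigation:
    """Hypothetical cost of building and running the resilient version."""
    development_cost: float     # one-time cost to design, write, and test
    annual_running_cost: float  # extra servers plus added maintenance per year
    outage_reduction: float     # fraction of the outage cost avoided, 0.0 to 1.0

    def annual_cost(self, years: int) -> float:
        # Spread the one-time development cost evenly over the planning horizon.
        return self.development_cost / years + self.annual_running_cost


def worth_building(scenario: FailureScenario, mitigation: Mitigation, years: int = 3) -> bool:
    """True when the avoided failure cost exceeds the cost of the mitigation."""
    savings = scenario.expected_annual_cost() * mitigation.outage_reduction
    return savings > mitigation.annual_cost(years)


# Made-up numbers: a checkout outage is expensive, an avatar outage is not.
checkout = FailureScenario(outages_per_year=4, hours_per_outage=2, cost_per_hour=10_000)
avatar = FailureScenario(outages_per_year=4, hours_per_outage=2, cost_per_hour=50)
ha = Mitigation(development_cost=60_000, annual_running_cost=20_000, outage_reduction=0.9)

print(worth_building(checkout, ha))  # True
print(worth_building(avatar, ha))    # False
```

The same mitigation pays for itself on one feature and not on the other, which is the whole point: the answer depends on what failure costs you, not on how elegant the solution is.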
How Much is Too Much?
Balancing the cost of failure with the cost of building highly available systems is the tricky bit. If you’re building a system for a space mission, manned or not, it has to be right. There’s no way to swap out a server or replace a component once it’s been launched. Software failures and the inability to recover can mean the difference between landing successfully and adding another crater to the surface. Spending the time and money to build in redundant and resilient systems makes a lot of sense.
It also means that the systems you build have to be quite robust and able to run automatically or be invoked remotely. This is very different from, say, the systems in your automobile.
In a car, some systems, such as the airbag, must work every time or people get injured or killed. But the response to such a failure can be much different. If a problem is detected with the airbag, an obnoxious warning light or message on the dashboard might be sufficient. Perhaps you could prevent the car from starting. One allows for a degraded experience (less safety) whereas the other favors safety over convenience (can’t drive the car).
What Should You Build?
So what does all of this mean to those of us who aren’t NASA engineers or don’t work on automobile safety systems?
Weigh the cost of system failures against the cost of designing systems to handle them. If you’re building an e-commerce site and consumers can’t buy products, that probably deserves a higher level of effort. You’ll want to look at queuing systems, redundancy, and other techniques to allow orders to flow through. If users can’t change their avatar image, that’s probably not worth the same level of effort. A simple message indicating there’s a problem and a suggestion to try again later is all that’s required.
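Here is a minimal sketch of what those two levels of effort might look like side by side. Everything in it is hypothetical: the `OrderServiceDown` exception, the function names, and the in-memory queue, which stands in for whatever durable queue a real system would use.

```python
import queue


class OrderServiceDown(Exception):
    """Hypothetical error raised when the downstream order processor is unavailable."""


# In-memory stand-in for a durable queue; a real system would need persistence.
pending_orders: "queue.Queue[dict]" = queue.Queue()


def place_order(order: dict, process) -> str:
    """Critical path: hold on to the order even if processing is down."""
    try:
        process(order)
        return "confirmed"
    except OrderServiceDown:
        pending_orders.put(order)  # keep the order for a later retry
        return "accepted, processing delayed"


def drain_pending(process) -> int:
    """Retry queued orders; stop at the first failure and put that order back."""
    done = 0
    while not pending_orders.empty():
        order = pending_orders.get()
        try:
            process(order)
        except OrderServiceDown:
            pending_orders.put(order)
            break
        done += 1
    return done


def change_avatar(image: bytes, store) -> str:
    """Non-critical path: tell the user and move on."""
    try:
        store(image)
        return "avatar updated"
    except Exception:
        return "Something went wrong. Please try again later."
```

The order path costs more to write, test, and run, and that is a deliberate choice. The avatar path gets a try/except and a polite message, and that is a deliberate choice too.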
Some systems don’t have a good failure response. The heat shield on the bottom of the Space Shuttle, made up of thousands of tiles, was the only system that didn’t have redundancy. The best you can do is test the heck out of it and build to the highest quality you can.
Be Deliberate
Always be thinking about when, not if, something will go wrong and consider the cost of handling the error. The response may ultimately be “do nothing,” but make that a conscious decision. It’s easy to get caught up in the pure technical solution to a problem, but if you can take a step back and look at it from a business perspective, I think you’ll find that you’ll over-engineer the right things, not all things.
If you’re a rocket scientist, ignore everything I just said. Go right ahead and over-engineer everything.
Question: At what point is the cost of building high availability systems too much?