“What should I do if the web service/database/etc. is down?”
Have you ever heard a variation of this question? Perhaps you’ve even asked it? This isn’t a terrible question to ask. It shows that you are considering the possibility that a problem could occur.
However, asking the question this way subtly assumes the condition may not occur.
The better question to ask is, “What should I do when the system is down?”
It’s a subtle difference, but asking the question this way implies that the error will occur. If a problem will definitely occur at some point, then you have to do something to handle it. This is especially true when talking to external systems.
Sh…er…Stuff Happens
Checking for, and handling, error conditions is something I’ve been hearing about since my C and C++ classes in college. There, I was admonished to always check the return codes from the various system functions my programs called. At the time, it seemed like a lot of work. After all, what could go wrong?
Lots of things, as it turns out.
Ever had an NFS mount go out to lunch on you? Suddenly, reading data from a file doesn’t seem as safe as it used to.
Of course, the network is always reliable too, right? After all, we have redundant switches, with cross-connects and immediate failover. That is, until an innocent typo during routine maintenance shuts the whole thing down.
Oh, and that highly available, clustered, redundant system you’re using? It turns out it has a bug that, when one of the members of the cluster starts thrashing in a garbage collection loop, brings the rest of the cluster down with it.
These aren’t hypothetical scenarios. I’ve seen each and every one in my 20 years as a programmer.
Things will go wrong. It’s only a matter of when.
Handling Errors
Given that problems are going to happen, how do you build software that deals with them? The answer depends a great deal on your application. How critical is the operation? That is, what bad things happen when it fails? How you answer that last question may determine how much effort you put into handling the error.
Log It
The simplest solution is to log the error to the application log file. I would consider this the minimum solution. There are very few times when you should catch an error and simply ignore it.
What should you write to the log file? Anything that would be helpful in understanding what went wrong and what code was involved. Often this involves a stack trace, but also include as much data about the specific transaction as possible.
Of course, you can go overboard with this. If this error will happen on a fairly regular basis, a full stack trace may be redundant. You also don’t need to log errors at every level of your application. If an exception is thrown from the database layer, simply log the error at the highest level of the application that handles the error, not every layer in between.
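To make that concrete, here is a minimal Python sketch, assuming a made-up `process_order`/`fetch_customer` call chain, of catching an error at the top level and logging it once with the transaction details:

```python
import logging

logger = logging.getLogger(__name__)

def fetch_customer(customer_id):
    # Stand-in for the database layer; imagine a real query here.
    raise ConnectionError("database unreachable")

def process_order(order_id, customer_id):
    # Top-level handler: log once, here, with the transaction context,
    # instead of logging at every layer the exception bubbles up through.
    try:
        return fetch_customer(customer_id)
    except ConnectionError:
        # logger.exception records the full stack trace along with the message.
        logger.exception("Failed to process order %s for customer %s",
                         order_id, customer_id)
        raise
```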
Retry
Sometimes, when software goes wrong, the failure is only temporary. Perhaps your request to the web service timed out, or you got a 503 HTTP response code. Maybe the resource you wanted was locked. In these cases, it may be reasonable to retry the operation, perhaps more than once. There are a few gotchas, however (see the sketch after the list):
- Limit the number of retries. At some point, you have to decide that the remote service isn’t responding and you should give up.
- Delay between retry attempts. If you called a web service that was overloaded, making your call again immediately isn’t going to help anything. Wait a second or two (or longer) before trying again.
- Consider the upstream callers. If every operation in the chain of calls goes through a retry loop with a delay between attempts, the overall request by the user may become unreasonably long.
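Here is one way that could look, as a minimal Python sketch; the three-attempt limit, two-second delay, and the `fetch_status` call in the usage comment are all made up for the example:

```python
import time

MAX_ATTEMPTS = 3      # limit the number of retries before giving up
DELAY_SECONDS = 2.0   # pause between attempts so an overloaded service can recover

def call_with_retry(operation):
    """Run operation(), retrying a limited number of times on transient failures."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            if attempt == MAX_ATTEMPTS:
                raise  # give up: the remote service isn't responding
            time.sleep(DELAY_SECONDS)

# Usage (hypothetical): call_with_retry(lambda: fetch_status("https://example.com/api"))
```

Keep the upstream callers in mind when picking the limits; retries with delays multiply quickly across a chain of calls.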
Fail Fast
If your application keeps making calls to an external system (database, web service, etc.) and getting errors, it doesn’t make any sense to keep calling into that system. In fact, your additional calls may be making things worse!
The design pattern to use in this scenario is called the “Circuit Breaker” pattern. After some number of failed attempts, the circuit “opens”, preventing further requests from flowing through the system. This allows the application to fail right away, rather than wait for the remote system to return the inevitable error…again.
This is such a useful pattern that the folks at Netflix came up with Hystrix as a way to encapsulate it.
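If you want to see the shape of the idea without a library, here is a rough, hand-rolled Python sketch (the five-failure threshold and 30-second cool-down are arbitrary choices for the example; Hystrix adds timeouts, metrics, fallbacks, and much more):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after repeated failures, retry after a cool-down."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                # Fail fast: don't even bother the remote system yet.
                raise RuntimeError("circuit open")
            self.opened_at = None  # cool-down elapsed; allow a trial call
        try:
            result = operation()
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.time()  # trip the breaker
            raise
        self.failure_count = 0  # success closes the circuit again
        return result
```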
Degrade Gracefully
Whatever method you use for handling errors, your application should try to display something reasonable to the user. There’s nothing more frustrating than having a 500 error page pop up in the middle of trying to get something done.
- Hide the offending content. If the error was generated while trying to generate a non-essential widget on the page, just hide the widget.
- Have a reasonable fallback. Maybe you cached the previous response from the external system. The Netflix recommendation page, at one point, would display the original movie over and over if the recommendation engine failed to respond. (A sketch of this approach follows the list.)
- Provide an alternate path. Perhaps you can’t submit the request into your ordering system at the moment, but maybe you can send an email to your customer service address with the details so the order can be hand-entered later?
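As a rough Python sketch of the fallback idea, assuming a hypothetical `get_recommendations_from_service` call standing in for the real recommendation engine:

```python
def get_recommendations_from_service(user_id):
    # Stand-in for the real remote call; imagine an HTTP request here.
    raise ConnectionError("recommendation service unavailable")

_last_good_response = {}  # most recent successful response per user

def get_recommendations(user_id):
    try:
        response = get_recommendations_from_service(user_id)
        _last_good_response[user_id] = response  # remember it for next time
        return response
    except ConnectionError:
        # Degrade gracefully: stale recommendations beat a 500 error page.
        return _last_good_response.get(user_id, [])
```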
So the next time you find yourself wondering whether something could go wrong, remind yourself that the answer is always “yes”. Then decide what to do when it does, and you’ll build much more resilient, robust applications. Future you will thank you for it.
Discussion Question: What measures do you take when things go wrong in your applications?