Fail Early, Fail Fast, Fail Often

Software fails.  After all, it’s written by humans and humans are fallible. We make mistakes. We forget. We don’t foresee all the possible scenarios. Given these truths, how do we build robust, resilient systems? Knowing that we will eventually break something, we should build our systems to alert us to these faults as soon as possible. We should build our systems to fail early, fail fast, fail often.

I know a thing or two about breaking things. I’ve been called out of the department picnic because the database connection pool changes I released the night before had a major problem. I’ve had to scramble to fix code and get it released because the load was heavier than expected, or because an edge case turned out not to be so near the edge after all.

After 20 years of development, I still get it wrong on a fairly regular basis.

In almost every case I can think of, much of the pain and many of the late nights could have been avoided if the problem had shown up earlier in the development process. It is well known that the later a defect is found, the more it costs to fix. These costs come in many forms:

  • Loss of money. If your website doesn’t work, customers will go elsewhere.
  • Loss of time. Developers spend time switching contexts and debugging. QA may have to wait until the bug is fixed.
  • Loss of reputation. If your customers can’t get what they need or paid for, they’ll tell their friends what a lousy service you’re running.
  • Loss of life and limb. In extreme cases, people could get hurt or worse. Think of the computer in your car or an airplane or the space shuttle.

What, exactly, do I mean when I say that systems should fail early, fast, and often?

Fail Early

If it’s more expensive to find defects late in the development process, then it makes sense to invest in tools, build systems, and processes that flag problems as early as possible.

Some strategies you might use are as follows:

  • Rely on compile-time checks. With these checks, the IDE will usually alert you to issues before you even compile!
  • Incorporate build-time analysis. This includes things like unit tests and static code analysis, which alert you to problems before you’ve even deployed your code or started up the application.
  • Run automated tests. After your application starts, run as many automated tests as you can, even on your development machine. The sooner these tests flag an error, the sooner you can get to the task of fixing the problem.
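
As a rough illustration of the build-time checks described above, here is a minimal sketch of a unit test written with Python's standard unittest module. The function apply_discount and its rules are hypothetical, invented only for this example. The point is that a defect in it would surface when the test suite runs, long before anything is deployed. The type hints are optional, but a static checker could use them to flag some mistakes even earlier.

```python
import unittest


def apply_discount(price: float, percent: float) -> float:
    """Return price reduced by percent (hypothetical example function)."""
    if not 0 <= percent <= 100:
        # Reject nonsense input right away instead of returning a bad price.
        raise ValueError(f"percent must be between 0 and 100, got {percent}")
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTest(unittest.TestCase):
    def test_typical_discount(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_rejects_out_of_range_percent(self):
        with self.assertRaises(ValueError):
            apply_discount(200.0, 150)
```

Running something like python -m unittest as part of every build means a broken discount calculation stops the build rather than reaching a customer.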

Fail Fast

Even if you have systems in place to find defects early in the development process, they don’t help much if your tests take three days to execute. Failing fast means that we design our systems to report back as quickly as possible. For example:

  • Prevent application startup. When your application encounters an error during startup, typically you should just stop right there, report the error and shut down, rather than continue on in an unstable state. I see the opposite all the time in web applications. There is a startup error (e.g., a Spring wiring issue) and the application fails to deploy, but Tomcat (or your favorite container) stays up and running, leaving you to wonder why it won’t serve requests. You have to look at the log files to figure out what’s wrong.
  • Validate input. There is no point in performing an expensive operation if it is reasonably known that the input is bad and the whole operation will fail.  In one instance, an application made many HTTP calls, only to get back a 400 response code (bad input) from the remote service.  The client library could have easily checked the value before sending it to the server, avoiding the expense of the remote service call.
  • Verify external systems. If your application relies on an external system (database, remote API, etc.) to do its work, it might make sense to make sure that system is available. If the required system isn’t available, the application could refuse to start or pause processing until the system becomes available again.
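
The first two ideas in the list above might be sketched as follows. This is only an illustration under stated assumptions: the setting names (DATABASE_URL, API_BASE_URL), the order id format, and every function name here are hypothetical, and send_request stands in for whatever client actually performs the remote call.

```python
import os
import re
import sys

REQUIRED_SETTINGS = ("DATABASE_URL", "API_BASE_URL")  # hypothetical names
ORDER_ID_PATTERN = re.compile(r"[A-Z]{2}-\d{6}")  # hypothetical id format


class StartupError(RuntimeError):
    """Raised when the application cannot start safely."""


def load_settings(env):
    """Return the required settings, or raise StartupError naming the missing ones."""
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise StartupError("missing required settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_SETTINGS}


def fetch_order(order_id, send_request):
    """Check order_id locally before paying for the remote call."""
    if not ORDER_ID_PATTERN.fullmatch(order_id):
        raise ValueError(f"invalid order id: {order_id!r}")
    return send_request(order_id)


def main():
    """Hypothetical entry point: stop immediately if startup checks fail."""
    try:
        settings = load_settings(os.environ)
    except StartupError as exc:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(f"startup failed: {exc}")
    print("starting with settings for:", ", ".join(settings))
```

With this shape, a missing setting ends the process with a clear message instead of leaving a half-started application running, and a malformed order id is rejected before any network traffic is generated.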

A word of caution is in order. When I suggest that your applications should fail fast, I’m referring to unexpected errors.  There are many problems you should expect (IO errors, remote system failures, invalid input).  Your application should handle these errors and respond accordingly. But the sooner your application can let you know about a problem, the better.
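
One way to picture the difference between expected and unexpected errors is a small retry helper. The sketch below is an assumption-laden example, not a recommendation of specific values: call_with_retry, the choice of which exceptions count as expected, and the retry count and delay are all made up for illustration.

```python
import time


def call_with_retry(operation, attempts=3, delay_seconds=0.5, sleep=time.sleep):
    """Run operation, retrying only the errors we expect from remote systems.

    ConnectionError and TimeoutError are treated as expected and retried.
    Any other exception is unexpected and propagates on the first occurrence.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts:
                raise  # retry budget spent: report the failure loudly
            sleep(delay_seconds)
```

A flaky network call gets a few chances to recover, while a programming error such as a KeyError surfaces on the first call rather than being hidden behind retries.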

Fail Often

The more often you run tests, the more likely it is that you’ll find failures. The more often you find failures, the better you get at learning how to avoid them in the first place. Failure is sometimes unpleasant, but learn to embrace each one as an opportunity to learn something and improve in the future.

I’ve missed more than 9000 shots in my career. I’ve lost almost 300 games. 26 times, I’ve been trusted to take the game winning shot and missed. I’ve failed over and over and over again in my life. And that is why I succeed. — Michael Jordan

Every failure is an opportunity to identify something that should be improved in your development process.

  • Automation improvements. If there are manual steps in your process, that’s an opportunity for defects to slip in or errors to be made.  Look for ways to make the process repeatable and automate it.
  • Additional tests. If you find a defect late in your process, it means your automated test suite isn’t complete. Fill in the gap with the missing test so it doesn’t happen again.
  • Process improvements. Some things don’t lend themselves well to automation, but they do follow a pattern. Look for improvements to the process to catch these things early. This might include things like manual testing, code reviews, better-defined best practices and coding standards, etc.

Each time something breaks, it can be painful to recover. But if you learn, improve, and keep iterating, your system eventually becomes better, stronger, and more resilient.

Conclusion

As software developers, we strive to build systems that are correct and robust.  But even the best of us miss things and have defects in our code.  Embrace these failures as a learning experience. Build tools, systems, and practices that allow these failures to be found as quickly and as early in the development process as possible.  Adopt this mindset and you might just find that you succeed early, succeed fast, and succeed often.

Question: What systems, tools and practices do you put in place to find defects as quickly as possible?