Software will fail. Your application won’t be perfect when it goes out the door. You can’t perfect something that hasn’t been “finished” and released into production. In spite of all your testing, both automated and manual, there will be problems. Users will exercise features in unexpected ways. Problems with the network or hardware will introduce cases that are hard to test. Developers make mistakes. If you accept this reality, then learn how to learn from failure.
“When you hit a wrong note, it’s the next note that makes it good or bad.” - Miles Davis
There are really only three steps to handling any problem: Identify, Fix, and Learn.
Identify the Problem
The first step is to identify the problem. You can’t fix a problem you don’t understand. Before you do any work, perform a Root Cause Analysis (RCA). For some organizations, there are regulations requiring this. Others require an RCA as company policy. Even if there’s no formal policy, it’s just good practice. You can be as formal or as relaxed about it as you like (or as your organization requires), but seek first to understand.
Sometimes the problem is easy to identify; other times it requires more digging. In either case, log files are your friend. If you’ve been good about logging errors and other interesting information, you can usually find the problem by sifting through the various log files your application generates.
Some of the things you might find in the logs are:
- **Stack Traces.** Errors in the application are often accompanied by a stack trace in the application log file. These are extremely useful in pinpointing the exact bit of code causing the problem (a minimal logging sketch follows this list).
- **Usage Patterns.** Individual log messages can be interesting, but patterns in the aggregate can be quite telling. Access logs or other informational messages might point to a sudden spike in traffic to your website generally or to a particular URL.
- **Timing.** The amount of time taken for a particular set of operations or requests can be a clue. Timestamps, typically included with log messages, can also pinpoint exactly when the issue began. Maybe the problem coincided with a maintenance outage on another system?
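As a minimal sketch of what that looks like in code, assuming java.util.logging and a made-up InventoryService class (neither comes from this essay): passing the exception object to the logger is what gets the full stack trace, along with a timestamped message, into the log file.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class InventoryService {

    private static final Logger LOG = Logger.getLogger(InventoryService.class.getName());

    public int reserveStock(String sku, int quantity) {
        try {
            return doReserve(sku, quantity);
        } catch (RuntimeException e) {
            // Passing the exception as the last argument writes the full
            // stack trace, along with a timestamped message, to the log.
            LOG.log(Level.SEVERE, "Failed to reserve " + quantity + " of SKU " + sku, e);
            throw e;
        }
    }

    private int doReserve(String sku, int quantity) {
        // Stand-in for the real reservation logic.
        if (quantity <= 0) {
            throw new IllegalArgumentException("quantity must be positive: " + quantity);
        }
        return quantity;
    }
}
```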
Log aggregation tools, either in-house or cloud-provided, can be extremely useful in looking for patterns across multiple log files and types of log files. When you combine these with any metrics you are gathering and graphing, it becomes much easier to pinpoint when a problem started and what may have caused it.
Not finding an obvious cause? You may need to make an educated guess and add log messages around the suspect area. Sometimes, identifying the root cause of a problem involves adding code that will help identify the issue when it shows up again.
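For instance, if a particular lookup is the prime suspect, a few temporary log statements capturing its inputs and timing can leave exactly that kind of trail. This is only a sketch; the CustomerLookup class and its method names are invented for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class CustomerLookup {

    private static final Logger LOG = Logger.getLogger(CustomerLookup.class.getName());

    public String findCustomerName(String customerId) {
        // Temporary diagnostics: capture the input and timing around the
        // suspect call so the next occurrence leaves a trail in the logs.
        LOG.log(Level.FINE, "findCustomerName called with id={0}", customerId);
        long start = System.nanoTime();
        try {
            String name = lookupInDatabase(customerId);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            LOG.log(Level.FINE, "findCustomerName({0}) took {1} ms",
                    new Object[] { customerId, elapsedMs });
            return name;
        } catch (RuntimeException e) {
            LOG.log(Level.WARNING, "findCustomerName failed for id=" + customerId, e);
            throw e;
        }
    }

    private String lookupInDatabase(String customerId) {
        // Stand-in for the real, suspect lookup.
        return "customer-" + customerId;
    }
}
```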
Fix the Problem
Once you know what’s wrong, the task of fixing the problem is usually either really easy, maybe even a one-line change, or really hard, meaning there was a more systemic or architectural issue. By the time you get to this point, it’s natural to rush in and just fix things. Maybe you were up all night fighting a production outage and you just want to put the whole thing behind you.
Resist this temptation.
You went through all the effort of tracking down the issue, so do yourself a favor and make sure you write some tests that expose the problem. You don’t want someone coming in later and accidentally undoing all your hard work!
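Here is one way such a regression test might look, assuming JUnit and an entirely hypothetical bug: the OrderCalculator class and the zero-quantity scenario are made up for illustration, not taken from a real incident.

```java
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class OrderCalculatorRegressionTest {

    // Minimal stand-in for the class under test, included only so this
    // example compiles; in a real project it lives in production code.
    static class OrderCalculator {
        double discountFor(int quantity, double unitPrice) {
            if (quantity == 0) {
                return 0.0; // the fix: guard the zero-quantity case
            }
            return quantity * unitPrice * 0.05;
        }
    }

    // Regression test pinning down the incident: a zero-quantity line item
    // used to blow up inside the discount calculation.
    @Test
    public void zeroQuantityLineItemGetsNoDiscount() {
        OrderCalculator calculator = new OrderCalculator();
        assertEquals(0.0, calculator.discountFor(0, 19.99), 0.0001);
    }
}
```

With a test like this in place, anyone who later reintroduces the bug finds out from the build, not from another outage.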
When things are on fire and systems are down, it is natural to want to get things fixed as quickly as possible. But don’t subvert good engineering principles in the process or you may compound the issue by introducing a new bug.
Take a deep breath and think. Is there a feature toggle you can flip or another mechanism to route around the code in question, giving you some breathing room to fix the issue properly? If not, work quickly, but don’t forget your best practices in the process.
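A toggle doesn’t have to be elaborate. Even a configuration-driven flag that routes around the broken path can buy you that breathing room. The sketch below assumes a hypothetical recommendations feature controlled by a system property; both the feature and the property name are illustrative.

```java
import java.util.Collections;
import java.util.List;

public class RecommendationService {

    // Hypothetical toggle: start the application with
    // -Drecommendations.enabled=false (or flip the equivalent setting in
    // your configuration system) to route around the broken code path.
    private static boolean recommendationsEnabled() {
        return Boolean.parseBoolean(
                System.getProperty("recommendations.enabled", "true"));
    }

    public List<String> recommendationsFor(String userId) {
        if (!recommendationsEnabled()) {
            // Safe, boring fallback while the real fix is written and tested.
            return Collections.emptyList();
        }
        return computeRecommendations(userId); // the suspect code path
    }

    private List<String> computeRecommendations(String userId) {
        // Stand-in for the logic currently misbehaving in production.
        return List.of("item-for-" + userId);
    }
}
```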
Learn the Lessons
Whether the system had a minor blip or a major outage, make sure to pause and really absorb the lessons learned. For me, when an issue is resolved, I’m usually tempted to move along to the next thing. But take a moment, even if it’s just five minutes, to jot down what you learned and what you could do differently to avoid the same type of issue in the future. Some general categories of things I’ve learned in the past include:
- Testing improvements. Perhaps you could resolve to start doing more test-driven development (TDD) or to write more automated tests in general.
- Process improvements. Is there some additional step that would help catch these errors? Perhaps more formal code reviews, or pair programming? Would a tool such as PMD, FindBugs, or CheckStyle help flag suspicious-looking code?
- Design or code improvements. Is there a particular design pattern or coding idiom that should be applied more consistently? Maybe there’s a better way to deal with Exceptions in your code? Is there a decoupling strategy that would have made this issue easier to recover from? Are there ways to make code more resilient to failures in the future? (See the sketch after this list.)
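As one illustration of that last point, a small retry-with-fallback wrapper around a flaky downstream call can turn a hard failure into a degraded but still working response. The helper below is only a sketch; the method name, attempt count, and fallback behavior are assumptions, not a pattern prescribed by this essay.

```java
import java.util.function.Supplier;

public final class Resilience {

    private Resilience() {
    }

    // Retry a flaky operation a few times, then fall back to a safe default
    // instead of letting the failure cascade to the caller.
    public static <T> T withRetry(Supplier<T> operation, int maxAttempts, T fallback) {
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                lastFailure = e;
            }
        }
        // Every attempt failed; degrade gracefully rather than propagate.
        System.err.println("Operation failed after " + maxAttempts
                + " attempts, using fallback: " + lastFailure);
        return fallback;
    }

    public static void main(String[] args) {
        // Illustrative usage: a call that always fails ends up on the fallback.
        String greeting = withRetry(
                () -> { throw new IllegalStateException("downstream unavailable"); },
                3,
                "Hello, guest");
        System.out.println(greeting);
    }
}
```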
Conclusion
If you don’t take a moment to learn from failure, the hard work you put into solving the problem is wasted effort. Take the time to understand the issue, fix it right the first time, and learn from it. Failure is inevitable. Embrace the problems that arise as opportunities for learning. Do this, and you’ll become a better software developer.