Edge Cases Will Kill You

Distributed systems can fail in many different ways: messages not received, messages delivered multiple times, messages arriving out of order, and concurrency issues arising when different parts of the system receive (or fail to receive) related messages in the right order, or at the right time. Network partitions force you to decide between consistency and availability. Good architecture and technology stacks can shield us from a lot of these problems; however, if you start picking away at the edges you can invariably find problems that can still arise in complex distributed systems. This goes double if your architecture isn't rock solid, isn't followed as closely as it should be, or your separation of concerns isn't so separate.
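To make those failure modes concrete, here is a minimal sketch of a consumer that tolerates duplicate and out-of-order delivery by tracking message IDs and per-entity sequence numbers. The names (Message, IdempotentConsumer) are invented for the illustration, and a real implementation would need this state in durable, shared storage rather than in memory.

```python
# Sketch of an idempotent, order-tolerant message consumer.
# Illustrative only: names and structure are assumptions, not a real library.
from dataclasses import dataclass


@dataclass
class Message:
    message_id: str   # unique identifier for the logical message
    entity_id: str    # e.g. the order or account the message concerns
    sequence: int     # per-entity sequence number assigned by the producer
    payload: dict


class IdempotentConsumer:
    def __init__(self):
        # In production this state must be durable and shared across
        # consumer instances, or a crash reintroduces the very
        # duplicate/ordering problems we are guarding against.
        self.seen_ids: set[str] = set()
        self.last_sequence: dict[str, int] = {}

    def process(self, msg: Message) -> None:
        if msg.message_id in self.seen_ids:
            return  # duplicate delivery: drop it
        if msg.sequence <= self.last_sequence.get(msg.entity_id, -1):
            return  # stale or out-of-order message: already superseded
        self.apply(msg)
        self.seen_ids.add(msg.message_id)
        self.last_sequence[msg.entity_id] = msg.sequence

    def apply(self, msg: Message) -> None:
        # Placeholder for the actual business logic.
        print(f"applying {msg.message_id} for {msg.entity_id}")
```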

I’ve seen a culture arise in teams where consideration of these edge cases, and attempts to systemise solutions for when they arose, came to dominate the effort by orders of magnitude.

If you let them, edge cases will kill you.

The behaviour starts out innocently enough: people raise concerns about possible failure scenarios, and because no-one wants to be seen to be dropping the ball on quality, those concerns get a generous hearing. Usually little consideration or investigation goes into how likely these scenarios actually are. Sometimes designs are altered to explicitly account for them. The next time features are discussed, someone else will have thought up a new variation on one of these corner cases, or another part of the system that would be vulnerable to the same issue. Team members seek to out-do each other by thinking up ever more complex and terrible ways the system could fail. The road to hell is paved with good intentions.

The Cost of Complexity

The costs you pay for this kind of complexity are hard to quantify, but they are very real, even once the initial implementation is complete, tested and deployed. First there is the cognitive load required to grok the code when changes need to be made. Second there is the regression testing effort: since the team has accepted that the system will “handle” these edge cases, someone is probably going to want to verify that, and complex, multi-machine, distributed-system-quorum-failure-network-partition-solar-flare-hit-the-hard-drive-raid-card-failed edge case scenarios are not easy to test repeatably. Lastly there is the precedent this sets: that these are the kinds of scenarios that are “worth” testing, even though people often have only a sketchy idea of how frequently they arise, what it costs to systemise a solution, or what it costs not to. Complexity, once it has taken root, is very hard to get rid of.

Take a Risk-Based Approach

Don’t get me wrong: thinking about and discussing these issues is fine. But as soon as you start systemising solutions for them, you run the risk of wasting time and money solving a problem that seems terrible but might happen so infrequently as to be a non-issue. The proper way to approach these kinds of scenarios is to use a risk-based approach. What is the likelihood of this thing happening? What is the impact if it does? Once those parameters are known from existing data, or estimated, along with an estimate of the effort to systemise and test the fix, a product owner (or whoever prioritises product features) can make an informed decision. If the decision is to handle the edge case, then leveraging existing platform features, or adding new platform features that let multiple classes of similar issues be handled in a uniform way, is often a good way to approach the problem.
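As a rough illustration of that trade-off, the comparison can be as simple as expected annual cost of living with the edge case versus the cost of fixing it. All of the numbers and names below are invented for the example.

```python
# Back-of-the-envelope risk comparison: is this edge case worth systemising?
# All figures are illustrative estimates, not real data.

def expected_annual_cost(incidents_per_year: float, cost_per_incident: float) -> float:
    """Expected yearly cost of living with the edge case (manual clean-up, support, etc.)."""
    return incidents_per_year * cost_per_incident


edge_case_cost = expected_annual_cost(incidents_per_year=2, cost_per_incident=500)
fix_cost = 40 * 120  # e.g. 40 hours of dev + test effort at a blended hourly rate

if fix_cost > edge_case_cost:
    print(f"Living with it (~${edge_case_cost:.0f}/yr) beats fixing it (${fix_cost})")
else:
    print(f"Fixing it (${fix_cost}) pays for itself within a year")
```

Even a crude calculation like this forces the likelihood and impact estimates into the open, where a product owner can weigh them against everything else on the backlog.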

You Need A Fall-Back Solution Anyway

The irony is that even if the team does systemise the fix, in most cases you still need a fallback solution (probably a highly manual one) anyway, because things will go wrong that you didn’t anticipate.

Remember, You Can Always Fix It Later

A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system. –John Gall.

Software developers are not great at estimating. Your team’s estimates of the frequency or impact of a particular issue might turn out to be too optimistic, and the team may end up spending a significant amount of time manually fixing issues that could be handled automatically. At that point there is nothing to stop you adding this now-warranted complexity to the system to save yourself the effort.

Photo Credit: “H&E Stain of Oligodendroglioma” by Nephron