Fault tolerant systems

Recently I listened to a talk about Fault Tolerant Systems by Uwe Friedrichsen at the Berlin Expert Days. I took some notes, and that is what today's post is about.

First of all, Friedrichsen distinguished several types of failures:

Crash Failure: Pretty obviously, a failure that causes a system or a part of it to crash.

Omission Failure: This kind of failure leads to a system that is unreachable. Although the effect on the user may be the same as with a Crash Failure, the system is still live.

Timing Failure: A good example of a timing failure is a timeout when a client requests some resource, e.g. the webserver does not respond within a given timeframe. A reason may be that the system is under load and cannot handle it. Clients reload the page, which increases the load on your system, and multiple Timing Failures may eventually lead to a system crash.

Response Failure: A failure in which the system returns an unexpected or wrong response to a given request.

Byzantine Failure: A failure which happens arbitrarily and is hard to reproduce, or occurs only under specific circumstances. A race condition caused by concurrent access is an example.

Measure, don’t guess!

After this categorization he proceeded to two important metrics when it comes to failures: Mean Time To Failure (MTTF) and Mean Time To Recovery (MTTR). MTTF is the average time that elapses until a clean, normally behaving system starts to produce failures. If this number is very high, your system can be classified as stable and you are doing a good job. On the other hand, if you know your MTTF, it means that failures are common in your system, and it therefore cannot be stable. Even more important is the MTTR: how long does it take you to repair a broken system and bring it back to a stable state? Of course this depends on multiple factors like overall code quality, etc.
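
As a side note (this was not part of the talk, but a standard way to relate the two metrics): availability can be expressed as MTTF / (MTTF + MTTR). A system that fails on average every 1000 hours and takes one hour to recover is available 1000 / 1001 ≈ 99.9% of the time, which is why lowering the MTTR pays off so directly.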

Keep it simple – it will fail

Friedrichsen then advises us to follow three principles when creating a system that should be tolerant of failures:

  • KISS
  • Design for failure
  • Design incrementally

KISS is an abbreviation for Keep It Simple, Stupid! It basically means that you should design every part of your system as simply as possible. Yes, you are a skilled coder who can combine three different design patterns to solve the problem and keep it extensible. However, this has (at least) two disadvantages: mostly you ain’t gonna need the extensibility (YAGNI – google it), and when it comes to failures someone has to understand why you implemented it the way you did (you documented it – right?). The MTTR would have been shorter if the person who has to solve the problem only had to look into one class and not into different packages, modules or even frameworks (small sidenote: I don’t advise you to write a system as one class with 10K lines of code)!

The Design for Failure principle goes hand in hand with KISS: you can be certain that your system will fail. I guarantee it. When you accept that and design your system accordingly, it will be easier to fix problems when they arise. It also means that you think about failures (in Java: Exceptions) and how to handle them when they are thrown.

I think the last point in the list – Design Incrementally – means that you have to break the task into smaller parts, then implement and test those. When you do it that way, you can test subsystems early instead of opening a Pandora’s box after several months of coding the whole application.

A critical part is also detecting errors. The larger your application and the more sensitive your data, the more important it becomes to detect errors early. It starts with throwing and logging Exceptions in your code (and not silently swallowing them), and continues with automatic (email) notifications in case of failures and live monitoring of the system by humans. The more time and work you spend here, the less time it will take to recognize and eliminate a failure.
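
A minimal sketch in Java of that first step – logging instead of swallowing. The service and repository names are mine, not from the talk:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical example: a lookup that may fail at runtime.
public class CustomerService {

    private static final Logger LOG = Logger.getLogger(CustomerService.class.getName());

    // Stand-in for a real data-access layer.
    interface CustomerRepository {
        String findById(long id);
    }

    private final CustomerRepository repository;

    public CustomerService(CustomerRepository repository) {
        this.repository = repository;
    }

    public String loadCustomer(long id) {
        try {
            return repository.findById(id);
        } catch (RuntimeException e) {
            // Don't swallow the exception: log it with the full stack trace and rethrow,
            // so that automatic notifications and monitoring further up can pick it up.
            LOG.log(Level.SEVERE, "Loading customer " + id + " failed", e);
            throw e;
        }
    }
}
```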

Handle it

Finally, Friedrichsen showed some patterns for dealing with failures:

Redundancy: This is mostly on the hardware level. Having your database replicated to another server in real time makes it possible to recover very fast in case of a defect. Rolling back a huge MySQL database, on the other hand, can take several hours during which your system will be unavailable. In times of always-on availability this is mostly unacceptable. Another example comes from aviation: some critical parts of an aircraft are designed redundantly. For example, a calculation is done simultaneously by different systems (running different software) to ensure correctness and availability.

Escalation: Back to the code level: when a subsystem (e.g. the database layer) detects an error, like losing the connection to the database, it should escalate that to the business layer as well, so it can react to the new situation, e.g. stop accepting requests and raise monitoring events.
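
A sketch of what that escalation could look like in Java (the exception and class names are my own; the talk did not include code):

```java
// Hypothetical exception the data layer uses to signal that the database is gone.
class DatabaseUnavailableException extends RuntimeException {
    DatabaseUnavailableException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Business layer reacting to the escalated failure instead of hiding it.
class OrderService {

    private volatile boolean acceptingRequests = true;

    void placeOrder(String order) {
        if (!acceptingRequests) {
            throw new IllegalStateException("Service temporarily unavailable");
        }
        try {
            persist(order);
        } catch (DatabaseUnavailableException e) {
            // Escalate: stop accepting new requests and raise a monitoring event.
            acceptingRequests = false;
            raiseMonitoringEvent(e);
            throw e;
        }
    }

    private void persist(String order) {
        // Stand-in for the database layer; it would throw
        // DatabaseUnavailableException when the connection is lost.
    }

    private void raiseMonitoringEvent(Exception e) {
        // Stand-in for alerting/metrics, e.g. an email notification.
    }
}
```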

Error Handler: You can forward exceptions to an Error Handler that reacts to the error, e.g. by just logging it, retrying the operation, performing a rollback, rolling forward, or resetting data.
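
Such a handler could be as simple as the following sketch (my own, nothing Friedrichsen showed verbatim), which logs and retries once:

```java
// A small error-handler abstraction: callers forward exceptions here
// instead of deciding ad hoc what to do with them.
interface ErrorHandler {
    void handle(Exception error, Runnable retry);
}

class LoggingRetryErrorHandler implements ErrorHandler {

    private static final java.util.logging.Logger LOG =
            java.util.logging.Logger.getLogger(LoggingRetryErrorHandler.class.getName());

    @Override
    public void handle(Exception error, Runnable retry) {
        LOG.log(java.util.logging.Level.WARNING, "Operation failed, retrying once", error);
        try {
            retry.run(); // one retry; a real handler might instead roll back or reset data
        } catch (RuntimeException second) {
            LOG.log(java.util.logging.Level.SEVERE, "Retry failed as well", second);
            throw second;
        }
    }
}
```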

Shed load: The main idea behind this is that a long response time is worse than rejecting a request. Rejecting is fast and directly visible, whereas a timeout can take a while and feels unresponsive. One can implement this via a gatekeeper that every request has to pass. The keeper decides whether to forward or reject the request, based on suitable metrics.
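
One very simple gatekeeper is a counting semaphore in front of the request handler. This is my sketch with an arbitrary limit of 100 concurrent requests:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Gatekeeper that sheds load: requests beyond the limit are rejected immediately
// instead of queueing up and eventually timing out.
class LoadSheddingGate {

    private final Semaphore permits = new Semaphore(100); // assumed capacity

    <T> T execute(Supplier<T> request) {
        if (!permits.tryAcquire()) {
            // Fast, explicit rejection is preferable to a slow timeout.
            throw new IllegalStateException("Server busy, request rejected");
        }
        try {
            return request.get();
        } finally {
            permits.release();
        }
    }
}
```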

Marked data: When you know that there was a failure and the data you’re currently working with is corrupt, you can mark it with a dirty flag and handle it later via routine maintenance (e.g. a script or a cronjob that fixes it).
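
The dirty flag can be as plain as an extra field (or column) that a maintenance job queries later – again my own sketch:

```java
// Hypothetical record carrying a dirty flag; a nightly maintenance job
// (e.g. a cronjob) selects all dirty entries and repairs them.
class AccountRecord {

    private final long id;
    private boolean dirty;

    AccountRecord(long id) {
        this.id = id;
    }

    void markDirty() {
        // Called when a failure corrupted this record; processing continues,
        // and the cleanup is deferred to routine maintenance.
        this.dirty = true;
    }

    boolean isDirty() {
        return dirty;
    }

    long getId() {
        return id;
    }
}
```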

Small patches: When you’ve fixed a bug, release it as a small, atomic patch. That way you minimize the risk of adding new bugs to the system.

Cost/Benefit: Sometimes it is very expensive to fix a bug that does not really harm the system. When you weigh the cost of fixing the failure against the damage it causes, it sometimes makes sense to live with the failure without removing it. In my opinion this is a very dangerous approach because of the Broken Windows Theory: once your system is broken, you tend to work less carefully because – hey, the system is already corrupt – why should I give my best?

Timeouts: Instead of waiting indefinitely for a resource, you should always consider using a timeout where possible. In the Java world, when working with locks, you can use tryLock(timeout, unit) instead of just lock(), which blocks for an undefined timeframe if the lock is held by another thread.
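
Concretely, with java.util.concurrent.locks this looks roughly like the following (my example, not from the slides; the 500 ms limit is arbitrary):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

class TimeoutLockExample {

    private final Lock lock = new ReentrantLock();

    void updateSharedState() throws InterruptedException {
        // Wait at most 500 ms for the lock instead of blocking indefinitely with lock().
        if (lock.tryLock(500, TimeUnit.MILLISECONDS)) {
            try {
                // ... critical section ...
            } finally {
                lock.unlock();
            }
        } else {
            // Treat it as a timing failure and escalate instead of hanging the thread.
            throw new IllegalStateException("Could not acquire lock within 500 ms");
        }
    }
}
```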

Fail Fast: Some operations are expensive. In a setup where a client requests an expensive operation that afterwards needs a resource which may be unavailable, it is a good idea to check the availability of that resource before computing the expensive part. This can be implemented by a guard that checks the availability and reports it to the system that processes the client requests. In case of unavailability, reject the request or handle it in another way.
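
Such a guard could look like this (a sketch under my own assumptions; the availability check and report names are made up):

```java
// Fail fast: check the availability of a required resource before starting
// the expensive computation, and reject the request right away if it is down.
class ReportService {

    // Stand-in for a guard that checks, e.g., a downstream service.
    interface ResourceGuard {
        boolean isResourceAvailable();
    }

    private final ResourceGuard guard;

    ReportService(ResourceGuard guard) {
        this.guard = guard;
    }

    String generateExpensiveReport(String query) {
        if (!guard.isResourceAvailable()) {
            // Rejecting here saves the cost of a computation whose result
            // could not be delivered anyway.
            throw new IllegalStateException("Required resource unavailable, request rejected");
        }
        return computeReport(query);
    }

    private String computeReport(String query) {
        return "report for " + query; // stand-in for the expensive operation
    }
}
```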