Understanding fault tolerance, along with the strategies of resilience and eventual consistency, is essential when building microservices. This article is an update and expansion of an article written in April of this year. It’s the first part of a series of articles explaining how the MicroProfile Fault Tolerance specification is used in microservices.
The Rise of Resilience and Eventual Consistency
For decades, the prevailing wisdom for handling transactions in distributed systems has been to use SQL relational databases, binary communication protocols, and two-phase commit transactions.
All-or-nothing, reliable transactions were paramount: data had to be stored safely above all else, even at the expense of user experience and cost. The objects in those transactions could also be very complex, frequently spanning multiple tables and even different databases. Typically, if a transaction failed, the user would receive an error requiring their action, usually to resubmit the request or contact support. To keep response times low, vertical scaling with costly “big iron” was common.
With the rise of microservices, the use of HTTP for communication, and NoSQL databases for persistence, we see the rise of eventual consistency. The focus on reliability, or perfect operation at all times, had to shift to resilience: the ability of an application to recover from certain types of failure and yet remain functional.
Resilience
Once the mindset shifts to resilience, requests are made smaller in scope, RESTful oriented, and eventually consistent. This allows requests to return faster, but it also means that the data might take some time to propagate throughout a cluster. The cluster is now made up of smaller, cheaper machines or cloud instances. If errors happen, resilient systems attempt to find a solution before quitting and throwing the problem into the user’s lap. This is one side of resilience.
The other side is the need to perform large-scale automated error handling and recovery. Consider an app that uses dozens of interdependent microservices: when one of them goes belly up, latency increases dramatically (every call to it runs into a timeout), and that delay cascades until it takes the whole thing down. That’s a serious problem.
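To make the latency problem concrete, here is a minimal sketch of capping how long a caller waits for a slow dependency. It uses only the JDK; the class name `TimeoutSketch`, the method `callWithTimeout`, and the idea of returning a caller-supplied fallback value are assumptions made for illustration, not part of any specification.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Illustrative only: names are hypothetical, and a real service would
// share a thread pool instead of creating one per call.
final class TimeoutSketch {

    /** Waits at most maxWait for the call; returns fallbackValue if the deadline passes. */
    static <T> T callWithTimeout(Supplier<T> call, Duration maxWait, T fallbackValue) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Callable<T> task = call::get;
            Future<T> future = executor.submit(task);
            try {
                return future.get(maxWait.toMillis(), TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt the slow call instead of waiting for it
                return fallbackValue;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return fallbackValue;
        } catch (ExecutionException e) {
            // The call itself failed: surface its cause as an unchecked exception.
            throw new IllegalStateException("call failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller might write `TimeoutSketch.callWithTimeout(() -> remoteLookup(), Duration.ofMillis(200), "unknown")`, where `remoteLookup` stands in for any slow remote call. The point is that the wait is bounded, so one slow service cannot hold up every caller indefinitely.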
Making Systems More Resilient
In order to make systems more resilient, a few design patterns were devised and are now gathered under the Fault Tolerance (FT) umbrella:
- Bulkhead – isolate failures in part of the system.
- Circuit breaker – offer a way to fail fast.
- Retry – define criteria on when to retry.
- Fallback – provide an alternative solution for a failed execution.
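As a rough illustration of two of these patterns working together, the sketch below hand-rolls a tiny circuit breaker that fails fast and hands back a fallback result. It is plain JDK code, not the MicroProfile API; the class name `CircuitBreakerSketch`, its parameters, and the simplified reopening rule are all assumptions chosen to keep the example short.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative only: a simplified breaker with hypothetical names.
final class CircuitBreakerSketch {
    private final int failureThreshold; // consecutive failures before the circuit opens
    private final long openNanos;       // how long the circuit stays open
    private int consecutiveFailures;
    private boolean open;
    private long openedAtNanos;

    CircuitBreakerSketch(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openNanos = TimeUnit.MILLISECONDS.toNanos(openMillis);
    }

    /** Runs the action unless the circuit is open; uses the fallback on rejection or failure. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (rejectsCalls()) {
            return fallback.get(); // fail fast: the dependency is not even contacted
        }
        try {
            T result = action.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback.get();
        }
    }

    private synchronized boolean rejectsCalls() {
        if (!open) {
            return false;
        }
        if (System.nanoTime() - openedAtNanos < openNanos) {
            return true;
        }
        // Delay elapsed: close again, but let a single new failure reopen the circuit.
        open = false;
        consecutiveFailures = failureThreshold - 1;
        return false;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
    }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAtNanos = System.nanoTime();
        }
    }
}
```

With a threshold of 2, the third and later calls to a failing service return the fallback immediately, without touching the service. That is what keeps a single broken dependency from dragging its callers into a chain of timeouts.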
On a broader resilience scale we can also find, among others, the following patterns:
- Health endpoint monitoring – implement functional checks in an application that external tools can access.
- Leader election – elect a coordinating leader for other instances.
- Compensating transaction – undo the work performed by a series of steps.
Why MicroProfile Fault Tolerance?
Some early libraries addressing these resilience and eventual consistency issues were Netflix’s Hystrix and the Failsafe library. They exposed different APIs but shared a number of features that microservices need. When the MicroProfile project started, it soon became obvious that a single approach to resilience and eventual consistency was required, and the MicroProfile FT specification was created.
The MicroProfile Fault Tolerance specification is part of Eclipse MicroProfile, the open source community specification for enterprise Java microservices.
The MicroProfile community specification is hosted by the Eclipse Enterprise for Java (EE4J) open source initiative. EE4J is based on the Java Platform, Enterprise Edition (Java EE) standards, and uses Java EE 8 as the baseline.
Since the introduction of the MicroProfile Fault Tolerance specification, several projects have adopted it and are implementing the new standard, including:
- Geronimo Safeguard, the library included in TomEE 7.1;
- WildFly Swarm Fault Tolerance library;
- Payara Micro;
- Open Liberty (IBM).
The standard
The MicroProfile Fault Tolerance specification has been evolving since 2017. At first, this is how the entire MicroProfile ecosystem looked:
MicroProfile Fault Tolerance version 1.0 includes the following aspects:
- Timeout
- Bulkhead
- Circuit breaker
- Retry
- Fallback
- Asynchronous
Note: While the Asynchronous aspect looks like the old EJB @Asynchronous annotation, it isn’t the same thing. It does allow a fast return, with the execution happening on a different thread, but it doesn’t need EJBs.
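The idea of "fast return with the execution on a different thread" can be sketched without any container at all. The example below uses only the JDK and is not the specification's own mechanism; `AsyncSketch`, `lookupPrice`, and the worker thread name are hypothetical names chosen for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: plain JDK threads, no EJB container involved.
final class AsyncSketch {

    // Daemon threads so the pool never keeps the JVM alive on its own.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(2, runnable -> {
        Thread thread = new Thread(runnable, "async-sketch-worker");
        thread.setDaemon(true);
        return thread;
    });

    /** Returns immediately; the work itself runs on a pool thread. */
    static CompletableFuture<String> lookupPrice(String item) {
        return CompletableFuture.supplyAsync(
                () -> item + " priced on " + Thread.currentThread().getName(), POOL);
    }
}
```

The caller gets the `CompletableFuture` back right away and can block on it later with `join()` or chain further steps onto it. `CompletableFuture` implements both `Future` and `CompletionStage`, which is why it serves for either style.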
MicroProfile Fault Tolerance version 1.1, released in May of 2018, was included in version 1.4 of the MicroProfile spec and contained a few improvements:
- Support for MicroProfile Metrics.
- Support for exponential backoff retry. Each subsequent retry might take longer to execute, hence relieving pressure from systems already in trouble.
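To show what growing retry delays look like in practice, here is a minimal hand-rolled sketch. It is not the specification's API: the class `BackoffRetrySketch`, the doubling rule, and the delay cap are assumptions made for the example.

```java
import java.util.function.Supplier;

// Illustrative only: hypothetical names and a simple doubling schedule.
final class BackoffRetrySketch {

    /** Delay before retry number retryNumber (1-based): base, 2x base, 4x base, ... capped at maxMillis. */
    static long delayMillis(int retryNumber, long baseMillis, long maxMillis) {
        long delay = baseMillis;
        for (int i = 1; i < retryNumber && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    /** Tries the action once, then up to maxRetries more times, waiting longer before each retry. */
    static <T> T retry(Supplier<T> action, int maxRetries, long baseMillis, long maxMillis) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must not be negative");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (attempt > 0) {
                pause(delayMillis(attempt, baseMillis, maxMillis));
            }
            try {
                return action.get();
            } catch (RuntimeException e) {
                lastFailure = e; // remember the failure and try again
            }
        }
        throw lastFailure; // every attempt failed
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }
}
```

With a base of 100 ms and a cap of 1000 ms, the waits before successive retries are 100, 200, 400, 800, and then 1000 ms. Spacing the attempts out like this gives a struggling service room to recover instead of hitting it again at full rate.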
For the upcoming Fault Tolerance 1.2, the biggest changes will be around @Asynchronous execution handling and the ability to use Future and CompletionStage.
In the next blog post, we will take a closer look at MicroProfile Fault Tolerance and learn how to use it with TomEE. Stay tuned!