Planet-scale systems are subject to inconsistencies. Distributed systems are distributed in terms of services, regions, and time, and different parts of the system may have different perceptions of reality. One service might believe that an order is paid, and another that it is not. These inconsistencies are not errors per se; they are expected as a result of concurrency, partial failure, and asynchronous communication. There is no need to eliminate the inconsistencies, but to identify and resolve them in a safe and efficient manner and ensure that the system is accessible to its customers.
One such example would be a payment service recording a successful payment, and the state of the order being updated prematurely; the state of the order would then be verified during the reconciliation process. Depending on the result, the system may retry the missing step, move the order back to a pending state, or alert an operator.
Distributed systems have services that are independent and own their own data. This can lead to inconsistencies in a number of ways. The system can be left in an inconsistent state in the case of a failure in a multi-phase operation. Messages can be reordered, or even lost or duplicated. Data can be concurrently updated by services. Even if compensating actions are used for failures, those actions can also fail and result in inconsistencies. These examples show that inconsistency is a natural consequence of the design of distributed systems to be scalable and fault-tolerant.
Building the Foundation with Distributed Patterns
To cope with inconsistency, today’s systems use a number of well-known patterns that make up a pattern language. One of them is to split complex operations into smaller steps, with compensating actions when something goes wrong. These actions may not perfectly undo the original operation, but they can move the system back toward a valid state rather than leaving the data in a “half-baked” state.
Another is to maintain a history of changes as events, which allows the system to reconstruct its state from a particular point in time and to find inconsistencies. Reliable messaging patterns are crucial: durable publication, retries, idempotent processing, deduplication, and dead-letter handling reduce message loss, duplicate side effects, and inconsistent state.
Detecting Inconsistency Through Continuous Observation
The integrated framework should be able to detect inconsistencies as well. This is where consistency checking can be used. Polling different services and ensuring consistency among them can help in detecting inconsistencies. For instance, if one service indicates that a transaction has been completed and another indicates that the transaction is still in progress for longer than expected, then there may be an inconsistency that needs investigation. This turns inconsistency detection from a passive process into an active one, as inconsistencies can be detected before they escalate into a larger problem or become visible to customers.
As the detection of inconsistencies becomes an expected part of system operation rather than an exception, it can begin to identify patterns. If inconsistencies repeatedly appear between specific operations, teams can use that pattern to anticipate where problems are likely to occur. It could, for example, monitor the fact that payments are typically slower compared to order confirmations. This contributes to its intelligence as it shifts to predictive monitoring as opposed to reactive monitoring. It not only identifies problems but also anticipates them.
It also takes advantage of utilizing the state of the services when inconsistencies occur. This provides clues as to what happened and how. This increased awareness makes detection more than just validation. It takes on a forensic nature and can be applied to help improve the system, avoid problems, and increase the resilience of distributed services.
Correcting Inconsistencies with Intelligent Reconciliation
The discrepancy may not be substantial enough to be identified. Some solution has to be found to rectify the problem. An integrated approach would entail having a reconciliation engine in place with the ability to identify and rectify the issue. It may involve repeating a step, including another step that was not initially added, or even turning back a few steps. It could also send an alert to a human operator if it is not able to fix itself. The primary point that should be mentioned here is that correction cannot be ad hoc. It has to be based on very clear rules regarding which service can be trusted, which steps can be redone, which steps should be reversed or compensated, and when a human operator is needed.
The second element of a holistic approach is to recognize the reality that consistency does not occur instantly. Mechanisms should be developed that anticipate and permit inconsistencies and resolve them. This prevents rigorous synchronization, to the benefit of flexibility and resilience. Some inconsistencies in the user interface can be resolved quickly, while inconsistencies in the background may take longer to resolve. The fact that observability provides an overview of the health of a system and data flow is also imperative.
Toward a Unified Approach
One such approach, however, involves not only detection and correction but also a combination of the two. We create, solve, and study problems with good communication and data patterns. Later on, rather than attempting to rectify the inconsistency, the system takes it into account when it comes to experimentation and learning. This enables distributed systems to be highly scalable and to trust the data.
Finally, a unified approach to data inconsistency is one of managing imperfection, not perfection. The detection, correction, and accommodation of data inconsistency can make distributed systems scalable and resilient to uncertainty and change.