Encyclopedia of Database Systems

Living Edition
| Editors: Ling Liu, M. Tamer Özsu

Optimistic Replication and Resolution

  • Marc Shapiro

DOI: https://doi.org/10.1007/978-1-4899-7993-3_258-3

Synonyms

Asynchronous Replication; Lazy replication; Optimistic replication; Reconciliation-based data replication

The term “optimistic replication” is prevalent in the distributed systems and distributed algorithms literature. The database literature prefers “lazy replication.”

Definition

Data replication places physical copies of a shared logical item onto different sites. Optimistic replication (OR) [17] allows a program at some site to read or update the local replica at any time. An update is tentative because it may conflict with a remote update. Such conflicts are resolved after the fact, in the background. Replicas may diverge occasionally but are expected to converge eventually (see “Eventual Consistency”).

OR avoids the need for distributed coordination prior to using an item. It allows a site to execute even when remote sites have crashed, when network connectivity is poor or expensive, or while disconnected from the network.

The defining characteristic of OR is that any communication between sites occurs in the background, after local commitment, i.e., off the critical path of the application.

OR enables parallelism, and updates occur and propagate quickly. The OR approach is well adapted to distributed databases over slow or failure-prone networks, and OR is essential for accessing remote data with high availability. Prominent examples include geo-replication (see “Multi Datacenter Consistency”) and mobile computing scenarios. Indeed, the CAP Theorem states that, in a network that is prone to disconnection, it is not possible to ensure both strong consistency and availability. When availability is paramount, for instance in e-commerce applications, this leads to the choice of the weak consistency levels (such as “Eventual Consistency”) supported by OR.

Disconnected operation, the capability to compute while disconnected from a data source, e.g., in mobile computing, requires OR. In computer-supported cooperative work, OR enables a user to temporarily insulate himself from other users. In cloud computing, OR enables the system to remain available for reads and writes even when some server is slow or partitioned away.

Historical Background

(The vocabulary used in this history is defined in Section “Foundations.”)

The first historical instance of OR is Johnson and Thomas’s Last-Writer-Wins replicated database (1976).

Usenet News (1979) supports a large-scale ever-growing database of (read-only) items, posted by users all over the world. A Usenet site connects infrequently (e.g., daily) with its peers. New items are flooded to other sites and are delivered in arbitrary order. Users occasionally observe ordering anomalies, but this is not considered a problem. However, system administrators must deal manually with conflicts over administrative operations.

In 1984, Wuu and Bernstein’s replicated mutable key-value-pair database uses an operation log, transmitted by an anti-entropy protocol: site A sends to site B only the tail of A’s log that B has not yet seen [23]. Concurrent operations either commute or have a natural semantic order; non-concurrent operations execute in happens-before order.

The Lotus Notes system (1988) supports cooperative work between mobile enterprise users. It replicates a database of discrete items in a peer-to-peer manner. Notes is state-based and uses a Last-Writer-Wins policy. A deleted item is replaced by a tombstone.

Several file systems, designed in the early 1990s to support disconnected work, e.g., Coda [9], are state based and use version vectors for conflict detection. Conflicts over some specific object types (e.g., directories or mailboxes) cause automatic resolver programs to run. The others must be resolved manually.

The Computer-Supported Cooperative Work (CSCW) community invented (1989) a form of OR called Operational Transformation (OT). Conflicting operations are transformed, by modifying their arguments, in order to execute in arbitrary but causal order [20].

Golding (1992) [5] studies a replicated database of mutable key-value pairs. This system purges an operation from the log when it can prove that it was delivered to all sites. Consistency is ensured by defining a total order of operations.

Bayou (1994–1997) is a seminal general-purpose database for mobile users [13]. Bayou is operation based and uses an anti-entropy protocol. Each site executes transactions in arbitrary order; transactions remain tentative. The eventual serialization order is the order of execution at a designated primary site. Other sites roll back their tentative state, and re-execute committed transactions in commit order.

In 1996, Gray et al. argued that OR databases cannot scale [7], because conflict reconciliation is expensive, conflict probability rises as the third power of the number of nodes, and the wait probability further increases quadratically with disconnection time.

In 1999, Breitbart et al. [2] describe a partially replicated database that uses a form of OR. Each item has a designated primary site and may be replicated at any number of secondary sites. A read may occur at a secondary site but a write must occur on the primary. It follows that write transactions update a single site. If transactions are serializable at each site, and update propagation is restricted to avoid ordering anomalies, then transactions are serializable despite lazy propagation.

Cloud computing has sparked a new interest in OR. In order to avoid synchronization, which is bad for performance and for fault tolerance, AP (Available under Partition) databases are designed in an OR style, supporting only weakly consistent key-value storage, such as Last-Writer-Wins (Cassandra) or Multi-Value Register (Dynamo).

Geo-replication (see “Multi Datacenter Consistency”) places database replicas at several data centers around the globe, for improved responsiveness and fault tolerance. Although a replica may be strongly consistent internally, geo-replication typically uses OR between data centers to ensure availability. Examples include Walter [19], Eiger [11], or Riak.

Around 2010, several researchers proposed the concept of a Replicated Data Type (RDT) [3, 15, 16, 18]. An RDT is similar to an ordinary data type; for instance, read-write register, set, map, graph, etc., may constitute RDT types. Abstractly, an RDT is similar to the corresponding ordinary abstract data type; for instance, the interface to a register RDT might have read and assign methods, whereas a set RDT would have methods for testing whether an element is a member of the set and for adding and removing elements to/from the set. Internally, an RDT is replicated, to provide reliability, availability, and responsiveness. Encapsulation hides the details of replication and conflict resolution.

Foundations

Figure 1 depicts a logical item x, concretely replicated at three different sites. In OR, any site may submit or initiate a transaction reading or writing the local replica. If the transaction succeeds locally, the system propagates it to other sites and replays the transaction on the remote sites, in a lazy manner, in the background. Local execution is tentative and may be rolled back in case of conflict with a concurrent remote transaction. (The happens-before and concurrency relations are defined formally by Lamport [10]. Transaction A happens-before B if B was initiated on some site after A executed or was replayed at that site. Two transactions are concurrent if neither happens-before the other.)
Fig. 1

Three sites with replicas of logical item x. Site 1 initiates transaction f; Site 2 initiates g. The system propagates both transactions and replays them on remote sites. Site 3 executes in the order g;f, whereas Site 1 replays f before g. Eventually, Site 2 will also execute f
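In practice, the happens-before and concurrency relations are tracked with version vectors (vector clocks). The following minimal Python sketch, not part of the original entry, shows how two transactions’ clocks determine whether one happens-before the other or whether they are concurrent, as in Fig. 1:

```python
# Minimal vector-clock comparison: decides whether one transaction
# happens-before another, or whether the two are concurrent.

def happens_before(vc_a, vc_b):
    """True if the event with clock vc_a happens-before the event with vc_b."""
    keys = set(vc_a) | set(vc_b)
    le = all(vc_a.get(k, 0) <= vc_b.get(k, 0) for k in keys)
    lt = any(vc_a.get(k, 0) < vc_b.get(k, 0) for k in keys)
    return le and lt

def concurrent(vc_a, vc_b):
    """True if neither event happens-before the other."""
    return not happens_before(vc_a, vc_b) and not happens_before(vc_b, vc_a)

# f initiated at Site 1 and g at Site 2, each unaware of the other.
f = {"site1": 1}
g = {"site2": 1}
assert concurrent(f, g)

# h initiated at Site 2 after g, having already observed f.
h = {"site1": 1, "site2": 2}
assert happens_before(f, h) and happens_before(g, h)
```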

OR is opposed to pessimistic (or eager) replication, where a local transaction terminates only when it commits globally. Pessimistic replication logically establishes a total order for committed transactions, at the latest when each transaction terminates. In contrast, OR generally relaxes the ordering requirements and/or converges to a common order a posteriori. The effects of a tentative transaction can be observed; thus OR protocols may violate the isolation property and allow cascading aborts and retries to occur.

Transmitting and Replaying Updates

In OR, updates are propagated lazily, in the background, after the transaction has terminated locally. Transmission usually uses peer-to-peer epidemic or anti-entropy techniques (see entry on “Peer-to-Peer Content Distribution”).

There are two main approaches to update transmission and delivery. In the state-based approach, a sender transmits the updated value (the after-value) of the object; a receiver merges the received value into its local state. In the operation-based approach, the sender transmits the program of the update transaction itself; a receiver replays the code of the transaction on its local replica.

The state-based approach is often perceived as being the simpler of the two. In the common case of last-writer-wins, state-based merge often reduces to overwriting the local replica with the received value; this is guaranteed to be deterministic. In the more general case, the merge procedure must be carefully designed to ensure convergence. The state-based approach tolerates unreliable and out-of-order delivery. However, if the replicated object is large, then state-based transmission is expensive, and replay is subject to false conflicts.
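To illustrate why a well-designed state-based merge converges, consider a grow-only set whose merge is set union: union is commutative, associative, and idempotent, so duplicated or reordered deliveries are harmless. A minimal sketch (hypothetical, not drawn from any specific system):

```python
# State-based replication of a grow-only set. Merge is set union,
# which is commutative, associative, and idempotent, so delivery
# order and duplication do not affect the converged state.

class GSet:
    def __init__(self):
        self.elems = set()

    def add(self, e):            # local update
        self.elems.add(e)

    def merge(self, received):   # incorporate a received replica state
        self.elems |= received.elems

a, b = GSet(), GSet()
a.add("x")
b.add("y")
a.merge(b)                       # anti-entropy exchange, both directions
b.merge(a)
assert a.elems == b.elems == {"x", "y"}
```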

Conversely, the cost of transmitting an operation is often very small, similar to a remote procedure call.

Furthermore, logical operations are more likely to commute than writes; thus operation-based replay typically causes fewer aborts. However, the operation-based approach assumes a communication layer that ensures reliable, exactly-once delivery in happens-before order.
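Where the transport cannot itself guarantee exactly-once delivery, an operation-based implementation typically filters duplicates by unique operation identifiers. A sketch of this pattern (the id scheme is an illustrative assumption, standing in for a real causal-broadcast layer), for a counter whose increments commute so that delivery order is also irrelevant:

```python
# Operation-based replication of a counter. Increments commute, so
# replay order is irrelevant; deduplicating by operation id gives
# effectively-once application even if the transport redelivers.

class OpCounter:
    def __init__(self):
        self.value = 0
        self.applied = set()   # ids of operations already replayed

    def deliver(self, op_id, delta):
        if op_id in self.applied:
            return             # duplicate delivery: ignore
        self.applied.add(op_id)
        self.value += delta

r1, r2 = OpCounter(), OpCounter()
ops = [("site1:1", +10), ("site2:1", -3)]
for op in ops:                 # one order at r1 ...
    r1.deliver(*op)
for op in reversed(ops):       # ... the opposite order at r2
    r2.deliver(*op)
r2.deliver("site1:1", +10)     # redelivered duplicate is ignored
assert r1.value == r2.value == 7
```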

Conflicts

Each transaction taken individually is assumed correct (the C of the ACID properties), i.e., it maintains semantic invariants, for example, that a bank account remains positive, or that a person is not scheduled in two different meetings at the same time.

As is clear from Fig. 1, concurrent transactions may be delivered to different sites in different orders (see section “Scheduling Transactions: Content and Ordering”). However, correctness requires that local schedules be equivalent. In this respect, one may classify pairs of concurrent transactions as commuting, non-commuting, and antagonistic. Transactions conflict if they are mutually non-commuting or mutually antagonistic.

The relative execution order of commuting transactions is immaterial; they require no remote synchronization. Formally, two transactions T1 and T2 commute if the execution order T1;T2 returns the same results to the user and leaves the database in the same state as the order T2;T1. For instance, depositing €10 into a bank account commutes with depositing €20 into the same account and also commutes with withdrawing €100 from an independent account.

If running concurrent transactions together would violate an invariant, they are said to be antagonistic. Safety requires aborting one or the other (or both). For instance, if T1 schedules me in a meeting from 10:00 to 12:00, and T2 schedules a different meeting from 11:00 to 13:00, they are antagonistic, since no combination of both T1 and T2 can be correct.

If two transactions are non-commuting and neither is aborted, then their relative execution order must be the same at all sites. Consider for instance T1 = “transfer balance to savings” and T2 = “deposit €100.” Both orders T1;T2 and T2;T1 make sense, but the result is clearly different. There must be a system-wide consensus on the order T1;T2 or T2;T1.
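The point is easy to verify concretely. A toy rendering of this example (the account structure and function names are illustrative only):

```python
# Two non-commuting transactions on a bank account: the final state
# depends on the execution order, so all sites must agree on one order.

def transfer_to_savings(acct):
    acct["savings"] += acct["checking"]
    acct["checking"] = 0

def deposit_100(acct):
    acct["checking"] += 100

a = {"checking": 50, "savings": 0}
transfer_to_savings(a)
deposit_100(a)                               # order T1;T2
b = {"checking": 50, "savings": 0}
deposit_100(b)
transfer_to_savings(b)                       # order T2;T1
assert a == {"checking": 100, "savings": 50}
assert b == {"checking": 0,   "savings": 150}   # different final states
```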

Conflict Resolution and Reconciliation

Conflict resolution rewrites or aborts transactions to remove conflicts. Conflict resolution can be either manual or automatic. Manual conflict resolution simply allows conflicting transactions to proceed, thereby creating conflicting versions; it is up to the user to create a new, merged version.

Reconciliation detects and repairs conflicts and combines non-conflicting updates. Thus transactions are tentative, i.e., a tentatively successful transaction may have to roll back for reconciliation purposes. OR resolves conflicts a posteriori (whereas pessimistic approaches avoid them a priori).

In many systems, data invariants are either unknown or not communicated to the system. In this case, the system designer conservatively assumes that, if concurrent transactions access the same item, and one (or both) writes the item, then they are antagonistic. Then one of them, or both, must abort.

A few systems, such as Bayou [22], IceCube [14], or CISE [6], support an application-specific check of invariants. Bailis et al. [1] show that application-level enforcement of invariants is error prone.

Last Writer Wins

When transactions consist only of reads and assignments, a common approach is to ensure a global precedence order.

For instance, many replicated file systems follow the “Last Writer Wins” (LWW) approach. Files have timestamps that increase with successive versions. When the file system encounters two concurrent versions of the same file, it overwrites the version with the smaller timestamp with the “younger” one (higher timestamp). The write with the smaller timestamp is lost; this approach violates the Durability property of ACID.
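A minimal LWW register might look as follows. Breaking timestamp ties by site identifier is an added assumption, needed so that concurrent writes with equal timestamps resolve deterministically:

```python
# Last-Writer-Wins register. Each write carries a (timestamp, site_id)
# pair; merge keeps the version with the highest pair. The losing
# write is silently discarded, which is why LWW violates durability.

class LWWRegister:
    def __init__(self, site_id):
        self.site_id = site_id
        self.value = None
        self.stamp = (0, "")             # (timestamp, writer site id)

    def write(self, value, timestamp):
        self.value, self.stamp = value, (timestamp, self.site_id)

    def merge(self, other):
        if other.stamp > self.stamp:     # lexicographic: time, then site id
            self.value, self.stamp = other.value, other.stamp

r1, r2 = LWWRegister("site1"), LWWRegister("site2")
r1.write("old", timestamp=1)
r2.write("new", timestamp=2)
r1.merge(r2)
r2.merge(r1)
assert r1.value == r2.value == "new"     # the earlier write is lost
```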

Semantic Resolvers

A resolver is an application-specific conflict resolution program that automatically merges two conflicting versions of an item into a new one. For example, the Amazon online book store resolves problems with a user’s “shopping cart” by taking the union of any concurrent instances. This maximizes availability despite network outages, crashes, and the user opening multiple sessions.

A resolver should ensure that the conflicting transactions are made to commute. In a state-based approach, a resolver generally parses the item’s state into small, independent sub-items. Then it applies an LWW policy to updated and tombstoned sub-items and a union policy to newly created sub-items.
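A resolver in the spirit of the shopping-cart example might simply union the concurrent carts. A hypothetical sketch (note that a plain union can resurrect an item one session had deleted, a known weakness of this policy):

```python
# Shopping-cart resolver: merge two conflicting cart versions by
# taking the union of their items, keeping the larger quantity for
# items present in both. Union makes concurrent additions commute.

def resolve_carts(cart_a, cart_b):
    merged = dict(cart_a)
    for item, qty in cart_b.items():
        merged[item] = max(merged.get(item, 0), qty)
    return merged

session1 = {"book": 1, "pen": 2}
session2 = {"book": 1, "lamp": 1}
assert resolve_carts(session1, session2) == {"book": 1, "pen": 2, "lamp": 1}
```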

The most elaborate example exists in Bayou. A Bayou transaction has three components: the dependency check, the write, and the merge procedure. The dependency check is a database query that checks for conflicts at replay time. The write (an SQL update) executes only if the dependency check succeeds. If it fails, the merge procedure (an arbitrary but deterministic program) provides a chance to fix the conflict. However, it is very difficult to write merge procedures in the general case.
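In outline, replaying a Bayou-style transaction could be modeled as below. This is a simplified sketch: real Bayou expresses the dependency check and the write in SQL against the replica, whereas here they are plain Python callables, and the meeting-room scenario is invented for illustration:

```python
# Bayou-style transaction replay: the dependency check runs at replay
# time; if it succeeds the write executes, otherwise a deterministic
# merge procedure tries to repair the conflict.

def replay(db, dep_check, write, merge_proc):
    if dep_check(db):
        write(db)
    else:
        merge_proc(db)

# Hypothetical booking: reserve room 5 at 10:00, or fall back to
# room 6 if the slot was taken by a concurrent transaction.
db = {("room5", "10:00"): "alice"}
replay(
    db,
    dep_check=lambda db: ("room5", "10:00") not in db,
    write=lambda db: db.update({("room5", "10:00"): "bob"}),
    merge_proc=lambda db: db.update({("room6", "10:00"): "bob"}),
)
assert db[("room6", "10:00")] == "bob"   # merge procedure repaired the conflict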

Operational Transformation

In Operational Transformation (OT), conflicting operations are transformed [20]. Consider two users editing the shared text “abc.” User 1 initiates insert(“X”,2), resulting in “aXbc,” and User 2 initiates delete(3), resulting in “ab.” When User 2 replays the insert, the result is “aXb,” as expected. However, for User 1 to observe the same result, the delete must be transformed to delete(4).

In essence, the operations were specified in a non-commuting way, but transformation makes them commute. OT assumes that transformation is always possible. The OT literature focuses on a simple, linear, shared edit buffer data type, for which numerous transformation algorithms have been proposed.
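For this linear-buffer data type, the two inclusion transformations of the example above can be written down directly. A sketch assuming 1-indexed positions and single-character inserts, as in the example:

```python
# Inclusion transforms for a linear text buffer (1-indexed positions).
# Each function rewrites one operation so that it can be applied
# after a concurrent operation has already executed.

def transform_delete_after_insert(del_pos, ins_pos):
    # The insert shifted later characters right by one.
    return del_pos + 1 if del_pos >= ins_pos else del_pos

def transform_insert_after_delete(ins_pos, del_pos):
    # The delete shifted later characters left by one.
    return ins_pos - 1 if ins_pos > del_pos else ins_pos

def apply_insert(text, ch, pos):
    return text[:pos - 1] + ch + text[pos - 1:]

def apply_delete(text, pos):
    return text[:pos - 1] + text[pos:]

# User 1: insert("X", 2) on "abc" -> "aXbc"; then replay delete(3),
# transformed to delete(4), removing "c" as User 2 intended.
t1 = apply_insert("abc", "X", 2)
t1 = apply_delete(t1, transform_delete_after_insert(3, 2))
# User 2: delete(3) on "abc" -> "ab"; then replay insert("X", 2).
t2 = apply_delete("abc", 3)
t2 = apply_insert(t2, transform_insert_after_delete(2, 3))
assert t1 == t2 == "aXb"       # both interleavings converge to the same text
```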

OT requires two correctness conditions, often called TP1 and TP2. TP1 requires that, for any two concurrent operations A and B, running “A followed by {B transformed in the context of A}” yield the same result as “B followed by {A transformed in the context of B}.” TP1 is relatively easy to satisfy and is sufficient if replay is somehow serialized.

TP2 requires that transformation functions themselves commute. TP2 is necessary if replay is in arbitrary order, e.g., in a peer-to-peer system. The vast majority of published non-serialized OT algorithms have been shown to violate TP2 [12].

Conflict-Free Replicated Data Types (CRDTs)

The common memory-cell data model is not well suited to an OR system, since concurrent assignments do not commute. OR will benefit from a data model where concurrent updates can be merged, ensuring that replicas converge without requiring synchronization or consensus. For instance, concurrent increment and decrement operations to a shared counter can be naturally merged, because they commute.

Conflict-free Replicated Data Types (CRDTs) generalize this approach [18]. A CRDT is an abstract data type that extends some sequential type and encapsulates algorithms ensuring that concurrent updates are merged deterministically and are guaranteed to converge. Thanks to this property, replicas of a CRDT can be updated in parallel without synchronization. CRDT types include registers, counters, sets, maps, graphs, and sequences.

When used in a sequential way, a CRDT type behaves just like its sequential counterpart. Furthermore, if two updates commute in the sequential specification, then executing the same two updates concurrently will converge to the same state. For instance, the result of concurrently adding elements x and y to some CRDT set is the same as adding them in either order. This means that a CRDT type is a plug-in replacement for the corresponding sequential data type.

The key challenge in CRDT design is providing a sensible concurrency semantics for updates that do not commute in the sequential specification. Thus, the concurrent specification of adding and removing the same element e to/from a set might be “add wins,” i.e., e appears in the set; but it could equally be “remove wins” or “highest timestamp wins,” depending on application requirements.
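A sketch of an “add wins” set in the style of the observed-remove set of [18]: each add creates a unique tag, and a remove deletes only the tags it has observed, so an add concurrent with a remove survives. Tombstones are kept forever here for simplicity; production designs garbage-collect them:

```python
# Add-wins ("observed-remove") set sketch. Each add is tagged with a
# unique id; a remove marks only the tags observed locally, so an add
# concurrent with a remove of the same element survives the merge.

import uuid

class ORSet:
    def __init__(self):
        self.added = set()     # (element, unique_tag) pairs ever added
        self.removed = set()   # tags whose removal has been observed

    def add(self, e):
        self.added.add((e, uuid.uuid4().hex))

    def remove(self, e):
        self.removed |= {t for (x, t) in self.added if x == e}

    def merge(self, other):
        self.added |= other.added
        self.removed |= other.removed

    def contains(self, e):
        return any(x == e and t not in self.removed
                   for (x, t) in self.added)

a = ORSet()
a.add("e")
b = ORSet()
b.merge(a)        # both replicas now observe "e"
a.remove("e")     # concurrent with ...
b.add("e")        # ... a fresh add of "e" at the other replica
a.merge(b)
b.merge(a)
assert a.contains("e") and b.contains("e")   # the add wins
```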

Scheduling Transactions: Content and Ordering

In order to capture any causal dependencies, transactions execute in happens-before order, i.e., under causal consistency. As explained in Section “Conflicts,” antagonistic transactions cause aborts, and non-commuting transactions must be mutually ordered. This so-called serialization requires a consensus, violating the availability requirement.

Whereas pessimistic approaches serialize a priori, most OR systems execute transactions tentatively in arbitrary order and serialize a posteriori. Some executions are rolled back; cascading aborts may occur.

A prime example is the Bayou system [22]. Each site executes transactions in the order received. Eventually, the transactions reach a distinguished primary site. If the dependency check of a transaction fails at the primary, the transaction aborts everywhere. Transactions that succeed commit and are serialized in the execution order of the primary.

The IceCube system showed that it is possible to improve the user experience by scheduling operations intelligently [14]. IceCube is a middleware that relieves the application programmer from many of the complexities of reconciliation. Multiple applications may coexist on top of IceCube. Applications expose semantic annotations, indicating which operation pairs commute or not, are antagonistic, dependent, or have an inherent semantic order. The user may create atomic groups of operations from different applications. The IceCube scheduler performs an optimization procedure over a batch of operations, minimizing the number of aborted operations. The user commits any of the alternative schedules proposed by the system.

Freshness of Replicas

Applications may benefit from freshness or quality-of-service guarantees, e.g., that no replica diverges by more than a known amount from the ideal, strongly consistent state. Such guarantees come at the expense of decreased availability.

The Bayou system proposes qualitative “session guarantees” on the relative ordering of operations [21]. For instance, Read-Your-Writes (RYW) guarantees that a read observes the effect of a write by the same user, even if initiated at a different site. RYW ensures that, immediately after changing his password, a user can log in with the new password. Other similar guarantees are Monotonic-Reads, Writes-Follow-Reads, and Monotonic-Writes. The conjunction of these guarantees is equivalent to causal consistency.

Systems such as TACT control replica divergence quantitatively [8]. TACT provides a time-based guarantee, allowing an item to remain stale for only a bounded amount of time. TACT implements this by pushing an update operation to remote replicas before the time limit elapses. TACT also provides “order bounding,” i.e., limiting the number of uncommitted operations: when a site reaches a user-defined bound on the number of uncommitted operations, it stops accepting new ones.

Finally, TACT can bound the difference between numeric values. For this, each replica is allocated a quota. Each site estimates the progress of other sites, using vector clock techniques. The site stops initiating operations once its cumulative modifications, or the estimated remote updates to the item, reach the quota. At that point, the site pushes its updates and pulls remote operations. For example, a bank account might be replicated at ten sites. To guarantee that the balance observed is within €50 of the truth, each site’s quota is €50/10 = €5. Whenever the difference estimated by a site reaches €5, it synchronizes with the others.
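The numeric-bounding mechanism can be sketched as follows, abstracting away TACT’s vector-clock estimation of remote progress; the callback name and the buffering policy are illustrative assumptions:

```python
# TACT-style numeric bounding: a site buffers local updates until its
# unpropagated modifications reach its quota, then pushes them. With
# n sites of quota q, any replica is within n*q of the true value.

class BoundedReplica:
    def __init__(self, quota, push):
        self.quota = quota       # e.g., EUR 5 per site: EUR 50 bound / 10 sites
        self.push = push         # callback that propagates buffered updates
        self.balance = 0
        self.pending = 0         # magnitude of local updates not yet propagated

    def apply(self, delta):
        self.balance += delta
        self.pending += abs(delta)
        if self.pending >= self.quota:
            self.push(self.balance)   # synchronize with the other sites
            self.pending = 0

sent = []
site = BoundedReplica(quota=5, push=sent.append)
site.apply(+2)
site.apply(+2)                   # cumulative 4, still within quota
site.apply(+2)                   # cumulative 6 >= 5: push triggered
assert sent == [6]
```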

Optimistic Replication Versus Optimistic Concurrency Control

The word “optimistic” has different, but related, meanings when used in the context of replication and of concurrency control.

Optimistic replication (OR) means that updates propagate lazily. There is no a priori total order of transactions. There is no point in time where different sites are guaranteed to have the same (or equivalent) state. Cascading aborts are possible.

Optimistic concurrency control (OCC) means that conflicting transactions are allowed to proceed concurrently. However, in most OCC implementations, a transaction validates before terminating. A transaction is serialized with respect to concurrent transactions, at the latest when it terminates, and cascading aborts do not occur.

Key Applications

Usenet News pioneered the OR concept, allowing users to share read-only items over a slow but cheap network of dial-up modems over telephone lines.

Mobile users want to be able to work as usual, even when disconnected from the network. Thus, mobile computing is a key driver for OR applications. Systems designed for disconnected work that use OR include the Coda file system [9], the Bayou shared database [22], or the Lotus Notes collaborative suite.

Another important application area is Computer-Supported Collaborative Work. In this domain, users must be able to update shared artefacts in complex ways without interfering with one another. OR allows a user to insulate himself temporarily from other users. A key example is the Concurrent Versions System (CVS), which enables collaborative authoring of computer programs [4]. Bayou and Lotus Notes, just cited, are also designed for collaborative work.

OR is used for high performance and high availability in large-scale web sites. A well-known example is Amazon’s “shopping cart,” discussed earlier, which is designed to be highly available even if the same user connects to several instances of the Amazon store. For this reason, many NoSQL databases embrace the Available-under-Partition (AP) option of the CAP Theorem, which is OR.

Recommended Reading

  1. Bailis P, Fekete A, Franklin MJ, Ghodsi A, Hellerstein JM, Stoica I. Feral concurrency control: an empirical investigation of modern application integrity. In: SIGMOD, Melbourne. ACM; 2015. p. 1327–42. http://doi.acm.org/10.1145/2723372.2737784
  2. Breitbart Y, Komondoor R, Rastogi R, Seshadri S. Update propagation protocols for replicated databases. In: Proceedings of ACM SIGMOD International Conference on Management of Data, Philadelphia; 1999. p. 97–108.
  3. Burckhardt S, Leijen D. Semantics of concurrent revisions. In: ESOP. LNCS, vol. 6602, Saarbrücken; 2011. p. 116–35. http://dx.doi.org/10.1007/978-3-642-19718-5_7
  4. Cederqvist P, et al. Version management with CVS. Bristol: Network Theory; 2006.
  5. Golding RA. Weak-consistency group communication and membership. PhD thesis, University of California, Santa Cruz; 1992. Technical Report no. UCSC-CRL-92-52. Available at: ftp://ftp.cse.ucsc.edu/pub/tr/ucsc-crl-92-52.ps.Z
  6. Gotsman A, Yang H, Ferreira C, Najafzadeh M, Shapiro M. Cause I’m strong enough: reasoning about consistency choices in distributed systems. In: POPL, St. Petersburg; 2016. p. 371–84. http://dx.doi.org/10.1145/2837614.2837625
  7. Gray J, Helland P, O’Neil P, Shasha D. The dangers of replication and a solution. In: Proceedings of ACM SIGMOD International Conference on Management of Data, Montreal; 1996. p. 173–82.
  8. Yu H, Vahdat A. Combining generality and practicality in a conit-based continuous consistency model for wide-area replication. In: Proceedings of 21st International Conference on Distributed Computing Systems, Arizona; 2001.
  9. Kistler JJ, Satyanarayanan M. Disconnected operation in the Coda file system. ACM Trans Comput Syst. 1992;10(1):3–25.
  10. Lamport L. Time, clocks, and the ordering of events in a distributed system. Commun ACM. 1978;21(7):558–65.
  11. Lloyd W, Freedman MJ, Kaminsky M, Andersen DG. Don’t settle for eventual: scalable causal consistency for wide-area storage with COPS. In: SOSP, Cascais. ACM; 2011. p. 401–16. http://doi.acm.org/10.1145/2043556.2043593
  12. Oster G, Urso P, Molli P, Imine A. Proving correctness of transformation functions in collaborative editing systems. Rapport de recherche RR-5795, LORIA – INRIA Lorraine; 2005. Available at: http://hal.inria.fr/inria-00071213/
  13. Petersen K, Spreitzer MJ, Terry DB, Theimer MM, Demers AJ. Flexible update propagation for weakly consistent replication. In: Proceedings of 16th ACM Symposium on Operating System Principles, St. Malo; 1997. p. 288–301.
  14. Preguiça N, Shapiro M, Matheson C. Semantics-based reconciliation for collaborative and mobile environments. In: Proceedings of International Conference on Cooperative Information Systems, Catania; 2003. p. 38–55.
  15. Preguiça N, Marquès JM, Shapiro M, Leţia M. A commutative replicated data type for cooperative editing. In: ICDCS, Montréal; 2009. p. 395–403. http://doi.ieeecomputersociety.org/10.1109/ICDCS.2009.20
  16. Roh H-G, Jeon M, Kim J-S, Lee J. Replicated abstract data types: building blocks for collaborative applications. J Parallel Distrib Comput. 2011;71(3):354–68. http://dx.doi.org/10.1016/j.jpdc.2010.12.006
  17. Saito Y, Shapiro M. Optimistic replication. ACM Comput Surv. 2005;37(1):42–81.
  18. Shapiro M, Preguiça N, Baquero C, Zawirski M. Conflict-free replicated data types. In: International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS). LNCS, vol. 6976, Grenoble: Springer; 2011. p. 386–400.
  19. Sovran Y, Power R, Aguilera MK, Li J. Transactional storage for geo-replicated systems. In: SOSP, Cascais. ACM; 2011. p. 385–400. http://doi.acm.org/10.1145/2043556.2043592
  20. Sun C, Ellis C. Operational transformation in real-time group editors: issues, algorithms, and achievements. In: Proceedings of International Conference on Computer-Supported Cooperative Work, Seattle; 1998. p. 59.
  21. Terry DB, Demers AJ, Petersen K, Spreitzer MJ, Theimer MM, Welch BB. Session guarantees for weakly consistent replicated data. In: Proceedings of International Conference on Parallel and Distributed Information Systems, Austin; 1994. p. 140–9.
  22. Terry DB, Theimer MM, Petersen K, Demers AJ, Spreitzer MJ, Hauser CH. Managing update conflicts in Bayou, a weakly connected replicated storage system. In: Proceedings of 15th ACM Symposium on Operating System Principles, Copper Mountain; 1995. p. 172–82.
  23. Wuu GTJ, Bernstein AJ. Efficient solutions to the replicated log and dictionary problems. In: Proceedings of ACM SIGACT-SIGOPS 3rd Symposium on the Principles of Distributed Computing, Vancouver; 1984. p. 233–42.

Copyright information

© Springer Science+Business Media LLC 2017

Authors and Affiliations

  1. UPMC-LIP6 and INRIA Paris, Paris, France

Section editors and affiliations

  • Bettina Kemme, School of Computer Science, McGill University, Montreal, Canada