Abstract. Distributed applications and web services, such as online stores or social networks, are expected to be scalable, available, responsive, and fault-tolerant. To meet these steep requirements in the face of high round-trip latencies, network partitions, server failures, and load spikes, applications use eventually consistent datastores that allow them to weaken the consistency of some data. However, making this transition is highly error-prone because relaxed consistency models are notoriously difficult to understand and test.
In this work, we propose a new programming model for distributed data that makes consistency properties explicit and uses a type system to enforce consistency safety. With the Inconsistent, Performance-bound, Approximate (IPA) storage system, programmers specify performance targets and correctness requirements as constraints on persistent data structures and handle uncertainty about the result of datastore reads using new consistency types. We implement a prototype of this model in Scala on top of an existing datastore, Cassandra, and use it to make performance/correctness tradeoffs in two applications: a ticket sales service and a Twitter clone. Our evaluation shows that IPA prevents consistency-based programming errors and adapts consistency automatically in response to changing network conditions, performing comparably to weak consistency and 2-10$\times$ faster than strong consistency.
To provide good user experiences, modern datacenter applications and web services must balance the competing requirements of application correctness and responsiveness. For example, a web store that double-charges for purchases or keeps users waiting too long can lose traffic and revenue; studies attribute measurable losses to each additional millisecond of latency [26, 36]. Worse, programmers must maintain this balance in an unpredictable environment where a black and blue dress [42] or Justin Bieber [38] can change application performance in the blink of an eye.
Recognizing the trade-off between consistency and performance, many existing storage systems support configurable consistency levels that allow programmers to set the consistency of individual operations [4, 11, 34, 58]. These allow programmers to weaken consistency guarantees only for data that is not critical to application correctness, retaining strong consistency for vital data. Some systems further allow adaptable consistency levels at runtime, where guarantees are only weakened when necessary to meet availability or performance requirements (e.g., during a spike in traffic or datacenter failure) [59, 61]. Unfortunately, using these systems correctly is challenging. Programmers can inadvertently update strongly consistent data in the storage system using values read from weakly consistent operations, propagating inconsistency and corrupting stored data. Over time, this undisciplined use of data from weakly consistent operations lowers the consistency of the storage system to its weakest level.
In this paper, we propose a more disciplined approach to inconsistency in the Inconsistent, Performance-bound, Approximate (IPA) storage system. IPA introduces the following concepts:
Consistency Safety, a new property that ensures that values from weakly consistent operations cannot flow into stronger consistency operations without explicit endorsement from the programmer. IPA is the first storage system to provide consistency safety.
Consistency Types, a new type system in which type safety implies consistency safety. Consistency types define the consistency and correctness of the returned value from every storage operation, allowing programmers to reason about their use of different consistency levels. IPA's type checker enforces the disciplined use of IPA consistency types statically at compile time.
Error-bounded Consistency. IPA is a data structure store, like Redis [54] or Riak [11], allowing it to provide a new type of weak consistency that places numeric error bounds on the returned values. Within these bounds, IPA automatically adapts to return the strongest IPA consistency type possible under the current system load.
We implement an IPA prototype based on Scala and Cassandra and show that IPA allows the programmer to trade off performance and consistency, safe in the knowledge that the type system has checked the program for consistency safety. We demonstrate experimentally that these mechanisms allow applications to dynamically adapt correctness and performance to changing conditions with three applications: a simple counter, a Twitter clone based on Retwis [55], and a ticket sales service modeled after FusionTicket [1].
Unpredictable Internet traffic and unexpected failures force modern datacenter applications to trade off consistency for performance. In this section, we demonstrate the pitfalls of doing so in an undisciplined way. As an example, we describe a movie ticketing service, similar to AMC or Fandango. Because ticketing services process financial transactions, they must ensure correctness, which they can do by storing data in a strongly consistent storage system. Unfortunately, providing strong consistency for every storage operation can cause the storage system and application to collapse under high load, as several ticketing services did in October 2015, when tickets became available for the new Star Wars movie [21].
To allow the application to scale more gracefully and handle traffic spikes, the programmer may choose to weaken the consistency of some operations. As shown in Figure 1, the ticket application displays each showing of the movie along with the number of tickets remaining. For better performance, the programmer may want to weaken the consistency of the read operation that fetches the remaining ticket count to give users an estimate, instead of the most up-to-date value. Under normal load, even with weak consistency, this count would often still be correct because propagation is typically fast compared to updates. However, eventual consistency makes no guarantees, so under heavier traffic spikes, the values could be significantly incorrect and the application has no way of knowing by how much.
While this solves the programmer's performance problem, it introduces a data consistency problem. Suppose that, like Uber's surge pricing, the ticket sales application wants to raise the price of the last 100 tickets for each showing to $15. If the application uses a strongly consistent read to fetch the remaining ticket count, then it can use that value to compute the price of the ticket on the last screen in Figure 1. However, if the programmer reuses getTicketCount, which now performs a weak read, to calculate the price, then this count could be arbitrarily wrong. The application could then over- or under-charge some users depending on the consistency of the returned value. Worse, the theater expects to make $1500 for those tickets with the new pricing model, which may not happen with the new weaker read operation. Thus, programmers need to be careful in their use of values returned from storage operations with weak consistency. Simply weakening the consistency of an operation may lead to unexpected consequences for the programmer (e.g., the theater not selling as many tickets at the higher surge price as expected).
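To make the hazard concrete, here is a minimal sketch of how this bug arises; the store stub, method names, and prices are illustrative stand-ins, not code from an actual service:

// Illustrative sketch only: the "store" stub and method names are hypothetical.
object SurgePricingSketch {
  sealed trait Consistency
  case object Strong extends Consistency
  case object Weak   extends Consistency

  // stand-in for a datastore read; a Weak read may return a stale count
  def readCount(showing: String, level: Consistency): Int = ???

  // getTicketCount was weakened for performance...
  def getTicketCount(showing: String): Int = readCount(showing, Weak)

  // ...and later reused by pricing code that implicitly assumes an accurate count.
  def ticketPrice(showing: String): Int = {
    val remaining = getTicketCount(showing) // stale value flows in unnoticed
    if (remaining <= 100) 15 else 10        // surge price for the last 100 tickets
  }
}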
In this work, we propose a programming model that can prevent using inconsistent values where they were not intended, as well as introduce mechanisms that allow the storage system to dynamically adapt consistency within predetermined performance and correctness bounds.
We propose a programming model for distributed data that uses types to control the consistency–performance trade-off. The Inconsistent, Performance-bound, Approximate (IPA) type system helps developers trade consistency for performance in a disciplined manner. This section presents the IPA programming model, including the available consistency policies and the semantics of operations performed under the policies. §4 will explain how the type system's guarantees are enforced.
Table 1. Example ADT operations and the consistency type of the result under each consistency policy.

ADT / Method | Consistency(Strong) | Consistency(Weak) | LatencyBound(_) | ErrorTolerance(_)
---|---|---|---|---
Counter.read() | Consistent[Int] | Inconsistent[Int] | Rushed[Int] | Interval[Int]
Set.size() | Consistent[Int] | Inconsistent[Int] | Rushed[Int] | Interval[Int]
Set.contains(x) | Consistent[Bool] | Inconsistent[Bool] | Rushed[Bool] | N/A
List[T].range(x,y) | Consistent[List[T]] | Inconsistent[List[T]] | Rushed[List[T]] | N/A
UUIDPool.take() | Consistent[UUID] | Inconsistent[UUID] | Rushed[UUID] | N/A
UUIDPool.remain() | Consistent[Int] | Inconsistent[Int] | Rushed[Int] | Interval[Int]
The IPA programming model consists of three parts: (1) abstract data types (ADTs) (e.g., Counter, Set[T]) on distributed storage; (2) consistency policies that programmers place on ADTs or on their individual methods; and (3) consistency types that wrap the results of ADT operations and track their consistency.
Programmers annotate ADTs with consistency policies to choose their desired level of consistency. The consistency policy on the ADT operation determines the consistency type of the result. Table 1 shows some examples; the next few sections will introduce each of the policies and types in detail. Together, these three components provide two key benefits for developers. First, the IPA type system enforces consistency safety, tracking the consistency level of each result and preventing inconsistent values from flowing into computations that expect consistent values. Second, the programming interface enables performance/correctness trade-offs, because consistency policies on ADTs allow the runtime to select a consistency level for each individual operation that maximizes performance in a constantly changing environment. As a result, applications can adapt to changing conditions with the assurance that the programmer has expressed how they should handle varying consistency.
The base of the IPA type system is a set of abstract data types (ADTs) for distributed data structures. ADTs present a clear abstract model through a set of operations that query and update state, allowing users and systems alike to reason about their logical, algebraic properties rather than the low-level operations used to implement them. Though the simplest key-value stores only support primitive types like strings for values, many popular datastores have built-in support for more complex data structures such as sets, lists, and maps. However, the interface to these datatypes differs: from explicit sets of operations for each type in Redis, Riak, and Hyperdex [11, 25, 31, 54] to the pseudo-relational model of Cassandra [32]. IPA's extensible library of ADTs allows it to decouple the semantics of the type system from any particular datastore, though our reference implementation is on top of Cassandra, similar to [57].
Besides abstracting over storage systems, ADTs are an ideal place from which to reason about consistency and system-level optimizations. The consistency of a read depends on the write that produced the value. Annotating ADTs with consistency policies ensures the necessary guarantees for all operations are enforced, which we will expand on in the next section.
Custom ADTs can express application-level correctness constraints. IPA's Counter ADT allows reading the current value as well as increment and decrement operations. In our ticket sales example, we must ensure that the ticket count does not go below zero. Rather than forcing all operations on the datatype to be linearizable, this application-level invariant can be expressed with a more specialized ADT, such as a BoundedCounter, giving the implementation more latitude for enforcing it. IPA's library is extensible, allowing custom ADTs to build on common features; see §5.
Previous systems [4, 11, 34, 58, 61] require annotating each read and write operation with a desired consistency level. This per-operation approach complicates reasoning about the safety of code using weak consistency, and hinders global optimizations that can be applied if the system knows the consistency level required for future operations. The IPA programming model provides a set of consistency policies that can be placed on ADT instances to specify consistency properties for the lifetime of the object. Consistency policies come in two flavors: static and dynamic.
Static policies are fixed, such as Consistency(Strong), which states that operations must have strongly consistent behavior. Static annotations provide the same direct control as previous approaches but simplify reasoning about correctness by applying globally on the ADT.
Dynamic policies specify a consistency level in terms of application requirements, allowing the system to decide at runtime how to meet the requirement for each executed operation. IPA offers two dynamic consistency policies:
A latency policy LatencyBound(x) specifies a target latency for operations on the ADT (e.g., 20 ms). The runtime can choose the consistency level for each issued operation, optimizing for the strongest level that is likely to satisfy the latency bound.

An accuracy policy ErrorTolerance(x%) specifies the desired accuracy for read operations on the ADT. For example, the size of a Set ADT may only need to be accurate within 5% tolerance. The runtime can optimize the consistency of write operations so that reads are guaranteed to meet this bound.
Dynamic policies allow the runtime to extract more performance from an application by relaxing the consistency of individual operations, safe in the knowledge that the IPA type system will enforce safety by requiring the developer to consider the effects of weak operations.
Static and dynamic policies can apply to an entire ADT instance or to individual methods. For example, one could declare List[Int] with LatencyBound(50 ms), in which case all read operations on the list are subject to the bound. Alternatively, one could declare a Set with relaxed consistency for its size but strong consistency for its contains predicate.
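The following minimal sketch illustrates the flavor of such declarations; the constructor names and the policy-passing style are our own stand-ins, not the exact IPA API:

// Illustrative stand-ins for IPA policy declarations (not the actual IPA API).
import scala.concurrent.duration._

sealed trait Policy
case class LatencyBound(bound: FiniteDuration) extends Policy
case class ErrorTolerance(fraction: Double)    extends Policy
case class Consistency(level: String)          extends Policy

// instance-level policy: every read on the list obeys the 50 ms bound
case class IPAList[T](name: String, policy: Policy)
val events = IPAList[Int]("events", LatencyBound(50.millis))

// method-level policies: a Set whose size may be weakly consistent
// but whose membership test stays strong
case class IPASet[T](name: String, methodPolicies: Map[String, Policy])
val attendees = IPASet[String]("attendees",
  Map("size" -> Consistency("Weak"), "contains" -> Consistency("Strong")))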
The runtime is responsible for managing the interaction between these policies. In the case of a conflict between two bounds, the system can be conservative and choose stronger policies than specified without affecting correctness.
In the ticket sales application, the Counter for each event's tickets could have a relaxed accuracy policy, specified with ErrorTolerance(5%), allowing the system to quickly read the count of tickets remaining. An accuracy policy is appropriate here because it expresses a domain requirement: users want to see accurate ticket counts. As long as the system meets this requirement, it is free to relax consistency and maximize performance without violating correctness. The List ADT used for events has a latency policy that also expresses a domain requirement, namely that pages on the website load in reasonable time.
The key to consistency safety in IPA is the consistency types—enforcing type safety directly enforces consistency safety. Read operations of ADTs annotated with consistency policies return instances of a consistency type. These consistency types track the consistency of the results and enforce a fundamental non-interference property: results from weakly consistent operations cannot flow into computations with stronger consistency without explicit endorsement. This could be enforced dynamically, as in dynamic information flow control systems, but the static guarantees of a type system allow errors to be caught at compile time.
The consistency types encapsulate information about the consistency achieved when reading a value. Formally, the consistency types form a lattice parameterized by a primitive type T, shown in Figure 2. Strong read operations return values of type Consistent[T] (the top element), and so (by implicit cast) behave as any other instance of type T. Intuitively, this equivalence is because the results of strong reads are known to be consistent, which corresponds to the control flow in conventional (non-distributed) applications. Weaker read operations return values of some type lower in the lattice (weak consistency types), reflecting their possible inconsistency. The bottom element Inconsistent[T] specifies an object with the weakest possible (or unknown) consistency. The other consistency types follow a subtyping relation $\prec$ as illustrated in Figure 2.

The only possible operation on Inconsistent[T] is to endorse it. Endorsement is an upcast, invoked by endorse(x), from other types in the lattice to the top element Consistent[T].
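A minimal sketch of what endorsement might look like, assuming simplified lattice classes (the real IPA signatures may differ):

// Simplified sketch of endorsement (assumed class shapes, not the IPA library).
class Inconsistent[T](val value: T)                            // bottom of the lattice
class Consistent[T](value: T) extends Inconsistent[T](value)   // top of the lattice

// endorse: an explicit, visible upcast to the top element
def endorse[T](x: Inconsistent[T]): Consistent[T] = new Consistent(x.value)

// the programmer must explicitly accept responsibility for trusting a weak value
val weakCount: Inconsistent[Int] = new Inconsistent(98)  // e.g. from a weak read
val trusted: Consistent[Int]     = endorse(weakCount)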
The core type system statically enforces safety by preventing weaker values from flowing into stronger computations. Forcing developers to explicitly endorse inconsistent values prevents them from accidentally using inconsistent data where they have not determined it is acceptable, essentially inverting the behavior of current systems, where inconsistent data is always treated as if it were safe to use anywhere. However, blindly endorsing values is not the intended use case; the key productivity benefit of the IPA type system comes from the other consistency types, which correspond to the dynamic consistency policies in §3.3 and allow developers to handle dynamic variations in consistency. We describe these next.
The weak consistency type Rushed[T] is the result of read operations performed on an ADT with consistency policy LatencyBound(x). Rushed[T] is a sum (or union) type, with one variant per consistency level available to the implementation of LatencyBound. Each variant is itself a consistency type (though the variants obviously cannot be Rushed[T] itself). The effect is that values returned by a latency-bound object carry with them their actual consistency level. A result of type Rushed[T] therefore requires the developer to consider the possible consistency levels of the value.

For example, a system with geo-distributed replicas may only be able to satisfy a latency bound of 50 ms with a local quorum read (that is, a quorum of replicas within a single datacenter). In this case, Rushed[T] would be the sum of three types: Consistent[T], LocalQuorum[T], and Inconsistent[T]. A match statement destructures the result of a latency-bound read operation:
set.contains(x) match {
  case Consistent(x) => print(x)
  case LocalQuorum(x) => print(x + ", locally")
  case Inconsistent(x) => print(x + "???")
}
The application may want to react differently to a local quorum as opposed to a strongly or weakly consistent value. Note that, because of the subtyping relation on consistency types, omitted cases can be matched by any type lower in the lattice, including the bottom element Inconsistent(x); other cases therefore need only be added if the application should respond differently to them. This subtyping behavior allows applications to be portable between systems supporting different forms of consistency (of which there are many).
Tagging values with a consistency level is useful because it helps programmers tell which operation reorderings are possible (e.g. strongly consistent operations will be observed to happen in program order). However, accuracy policies provide a different way of dealing with inconsistency by expressing it in terms of value uncertainty. They require knowing the abstract behavior of operations in order to determine the change in abstract state which results from each reordered operation (e.g., reordering increments on a Counter has a known effect on the value of reads).
The weak consistency type Interval[T] is the result of operations performed on an ADT with consistency policy ErrorTolerance(x%). Interval[T] represents an interval of values within which the true (strongly consistent) result lies. The interval reflects uncertainty in the true value created by relaxed consistency, in the same style as work on approximate computing [15].

The key invariant of the Interval type is that the interval must include the result of some linearizable execution. Consider a Set with 100 elements. With linearizability, if we add a new element and then read the size (or if this ordering is otherwise implied), we must get 101 (provided no other updates are occurring). However, if size is annotated with ErrorTolerance(5%), then it could return any interval that includes 101, such as $[95,105]$ or $[100,107]$, so the client cannot tell if the recent add was included in the size. This frees the system to optimize for performance, for example by delaying synchronization. While any partially-ordered domain could be represented as an interval (e.g., a Set with partial knowledge of its members), in this work we consider only numeric types.
In the ticket sales example, the counter ADT's accuracy policy means that reads of the number of tickets return an Interval[Int]. If the entire interval is above zero, then users can be assured that there are sufficient tickets remaining. In fact, because the interval could represent many possible linearizable executions, in the absence of other user actions, a subsequent purchase must succeed. On the other hand, if the interval overlaps with zero, then there is a chance that tickets could already be sold out, so users could be warned. Note that ensuring that tickets are not over-sold is a separate concern requiring a different form of enforcement, which we describe in §5. The relaxed consistency of the interval type allows the system to optimize performance in the common case where there are many tickets available, and to dynamically adapt to contention when the ticket count diminishes.
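A small sketch of how an application might act on such an interval; the Interval fields and helper names here are assumptions for illustration, not the actual IPA interface:

// Illustrative sketch; field names (min, max) are assumptions about the interface.
case class Interval[T](min: T, max: T)

def availabilityMessage(remaining: Interval[Int]): String =
  if (remaining.min > 0)      s"At least ${remaining.min} tickets left"  // a purchase will succeed
  else if (remaining.max > 0) "Tickets may be sold out -- hurry!"        // interval overlaps zero
  else                        "Sold out"

availabilityMessage(Interval(100, 106))  // "At least 100 tickets left"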
The consistency policies introduced in the previous section allow programmers to describe application-level correctness properties. Static consistency policies (e.g. Strong) are enforced by the underlying storage system; the annotated ADT methods simply set the desired consistency level when issuing requests to the store. The dynamic policies each require a new runtime mechanism to enforce them: parallel operations with latency monitoring for latency bounds, and reusable reservations for error tolerance. But first, we briefly review consistency in Dynamo-style replicated systems.
To be sure a strong read sees a particular write, the two must be guaranteed to coordinate with overlapping sets of replicas (quorum intersection). For a write and read pair to be strongly consistent (in the CAP sense [17]), the replicas acknowledging the write ($W$) plus the replicas contacted for the read ($R$) must be greater than the total number of replicas ($W + R > N$). This can be achieved, for example, by writing to a quorum ($(N+1)/2$ replicas) and reading from a quorum (QUORUM in Cassandra), or by writing to all $N$ replicas (ALL) and reading from 1 (ONE) [22].
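As a quick worked example of the quorum-intersection rule with $N = 3$ replicas:

// W + R > N: the write set and read set must overlap in at least one replica.
def intersects(n: Int, w: Int, r: Int): Boolean = w + r > n

intersects(3, 2, 2)  // QUORUM write + QUORUM read -> true (strong)
intersects(3, 3, 1)  // ALL write + ONE read       -> true (strong)
intersects(3, 1, 1)  // ONE write + ONE read       -> false (eventual only)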
Because overall consistency depends on the strength of both reads and writes, it makes little sense to specify consistency policies on individual operations in isolation. Declaring consistency policies on an entire ADT, however, allows the implementer of the ADT to ensure that all combinations of reads and writes achieve the specified consistency.
Static consistency policies are typically enforced by the underlying datastore, but they require the designer of each ADT to carefully choose how to implement them. To support the Consistency(Strong) policy, the designer of each ADT must choose consistency levels for its operations which together enforce strong consistency. For example, if a developer knows that updates to a Counter are more common than reads, they may choose to require the read operation to synchronize with all replicas (ALL), permitting increment and decrement to wait for only a single replica (ONE) without violating strong consistency.
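To illustrate the kind of reasoning an ADT designer performs, here is a sketch; the level names mirror Cassandra's, but the encoding itself is ours, not IPA's:

// Sketch: per-operation consistency levels chosen so every read/update pair intersects.
sealed trait Level { def replicas(n: Int): Int }
case object ONE    extends Level { def replicas(n: Int) = 1 }
case object QUORUM extends Level { def replicas(n: Int) = n / 2 + 1 }
case object ALL    extends Level { def replicas(n: Int) = n }

case class CounterPlan(read: Level, increment: Level) {
  // strong iff every read overlaps every acknowledged update
  def isStrong(n: Int): Boolean = read.replicas(n) + increment.replicas(n) > n
}

val updateHeavy = CounterPlan(read = ALL, increment = ONE)  // cheap updates, expensive reads
val balanced    = CounterPlan(read = QUORUM, increment = QUORUM)
updateHeavy.isStrong(3)  // true
balanced.isStrong(3)     // true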
The time it takes to achieve a particular level of consistency depends on current conditions and can vary over large time scales (minutes or hours), but it can also vary significantly for individual operations. During normal operation, strong consistency may perform acceptably, but at peak traffic times it could cause the application to fall over. Latency bounds specified by the application allow the system to dynamically adjust to maintain comparable performance under varying conditions.
Our implementation of latency-bound types takes a generic approach: it issues read requests at different consistency levels in parallel. It composes the parallel operations and returns a result either when the strongest operation returns, or with the strongest available result at the specified time limit. If no responses are available at the time limit, it waits for the first to return.
This approach makes no assumptions about the implementation of read operations, making it easily adaptable to different storage systems. Some designs may permit more efficient implementations: for example, in a Dynamo-style storage system we could send read requests to all replicas, then compute the most consistent result from all responses received within the latency limit. However, this requires deeper access to the storage system implementation than is traditionally available.
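A minimal sketch of this generic strategy using Scala futures; readAt stands in for an ADT read issued at a given consistency level, and limiting the choice to two levels (ALL vs. ONE) is a simplification:

// Simplified sketch of the parallel-read strategy (readAt is a stand-in).
import scala.concurrent.{Await, Future, ExecutionContext}
import scala.concurrent.duration._
import scala.util.Try
import ExecutionContext.Implicits.global

def rushedRead[T](bound: FiniteDuration)(readAt: String => T): (String, T) = {
  val strong = Future(("strong", readAt("ALL")))   // issued in parallel
  val weak   = Future(("weak",   readAt("ONE")))
  Try(Await.result(strong, bound)).getOrElse {
    // bound expired: take the weak result if it is ready,
    // otherwise wait for whichever read finishes first
    if (weak.isCompleted) Await.result(weak, Duration.Zero)
    else Await.result(Future.firstCompletedOf(Seq(strong, weak)), Duration.Inf)
  }
}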
The main problem with our approach is that it wastes work by issuing parallel requests. Furthermore, if the system is responding more slowly because of a sudden surge in traffic, it is essential that our efforts not place additional burden on the system. In these cases, we should back off and attempt only weaker consistency. To do this, the system monitors current traffic and predicts the latency of different consistency levels.
Each client in the system has its own Monitor (though multi-threaded clients can share one). The monitor records the observed latencies of reads, grouped by operation and consistency level. The monitor uses an exponentially decaying reservoir to compute running percentiles weighted toward recent measurements, ensuring that its predictions continually adjust to current conditions.
Whenever a latency-bound operation is issued, it queries the monitor to determine the strongest consistency likely to be achieved within the time bound, then issues one request at that consistency level and a backup at the weakest level, or only weak if none can meet the bound. In §6.2.1 we show empirically that even simple monitors allow clients to adapt to changing conditions.
We implement error bounds by building on the concepts of escrow and reservations [27, 44, 48, 50]. These techniques have been used in storage systems to enforce hard limits, such as an account balance never going negative, while permitting concurrency. The idea is to set aside a pool of permissions to perform certain update operations (we'll call them reservations or tokens), essentially treating operations as a manageable resource. If we have a counter that should never go below zero, there could be a number of decrement tokens equal to the current value of the counter. When a client wishes to decrement, it must first acquire sufficient tokens before performing the update operation, whereas increments produce new tokens. The insight is that the coordination needed to ensure that there are never too many tokens can be done off the critical path: tokens can be produced lazily if there are enough around already, and most importantly for this work, they can be distributed among replicas. This means that replicas can perform some update operations safely without coordinating with any other replicas.
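A toy sketch of the escrow idea for a counter with a lower bound of zero; the refill path and the distribution of tokens across replicas are elided, and all names are illustrative:

// Toy escrow pool: decrements spend tokens, increments mint them.
// Coordination to redistribute tokens between replicas is elided.
class EscrowPool(initialTokens: Int) {
  private var tokens = initialTokens

  // a decrement may proceed only if a token is available locally
  def tryAcquire(): Boolean =
    if (tokens > 0) { tokens -= 1; true } else false

  // increments create headroom above the lower bound, minting new tokens
  def produce(n: Int = 1): Unit = tokens += n

  def available: Int = tokens
}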
Reservations require mediating requests to the datastore to prevent updates from exceeding the available tokens. Furthermore, each server must locally know how many tokens it has without synchronizing. We are not aware of a commercial datastore that supports custom mediation of requests and replica-local state, so we need a custom middleware layer to handle reservation requests, similar to other systems which have built stronger guarantees on top of existing datastores [8, 10, 57].
Any client requests requiring reservations are routed to one of a number of reservation servers. These servers then forward operations when permitted along to the underlying datastore. All persistent data is kept in the backing store; these reservation servers keep only transient state tracking available reservations. The number of reservation servers can theoretically be decoupled from the number of datastore replicas; our implementation simply colocates a reservation server with each datastore server and uses the datastore's node discovery mechanisms to route requests to reservation servers on the same host.
Reservations have been used previously to enforce hard global invariants in the form of upper or lower bounds on values [10], integrity constraints [9], or logical assertions [37]. However, enforcing error tolerance bounds presents a new design challenge because the bounds are constantly shifting.
Consider a Counter with a 10% error bound, shown in Figure 3. If the current value is 100, then 10 increments can be done before anyone must be told about it. However, we have 3 reservation servers, so these 10 reservations are distributed among them, allowing each to do some increments without synchronizing. If only 10 outstanding increments are allowed, reads are guaranteed to maintain the 10% error bound. In order to perform more increments after a server has exhausted its reservations, it must synchronize with the others, sharing its latest increments and receiving any changes of theirs. This is accomplished by doing a strong write (ALL) to the datastore followed by a read. Once that synchronization has completed, those 3 tokens become available again because the reservation servers all temporarily agree on the value (in this case, at least 102).

Read operations for these types go through reservation servers as well: the server does a weak read from any replica, then determines the interval based on how many reservations there are. For the read in Figure 3, there are 10 reservations total, but Server B knows that it has not used its local reservations, so it knows that there cannot be more than 6 and can return the interval $[100,106]$.
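In code, the interval computation for the example above might look like this sketch (increment-only case; names are illustrative):

// Sketch: the value read weakly, widened by the updates that could be
// buffered at other reservation servers (increment-only for simplicity).
def readInterval(weakValue: Int, totalTokens: Int, myUnusedTokens: Int): (Int, Int) = {
  val possiblyOutstanding = totalTokens - myUnusedTokens
  (weakValue, weakValue + possiblyOutstanding)
}

// Server B holds 4 unused tokens of 10 total, so at most 6 increments
// can be outstanding elsewhere:
readInterval(weakValue = 100, totalTokens = 10, myUnusedTokens = 4)  // (100, 106)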
Error tolerance policies set an upper bound on the amount of error; ideally, the interval returned will be more precise than the maximum error when conditions are favorable, such as when there are few update operations. Rather than assuming the total number of tokens is always the maximum allowable by the error bound, we instead keep an allocation table for each record that tracks the number of tokens allocated to each reservation server. If a reservation server receives an update operation and does not have enough tokens allocated, it updates the allocation table to allocate tokens for itself. The allocation table must preserve the invariant that the total does not exceed the maximum tokens allowed by the current value. For example, for a value of 100, 10 tokens were allowed, but after 1 decrement, only 9 tokens are allowed. Whenever this occurs, the server that changed the bound must give up the “lost” token out of its own allocations. As long as these updates are done atomically (in Cassandra, this is done using linearizable conditional updates), the global invariant holds. Because of this synchronization, reading and writing the allocation table is expensive and slow, so we use long leases (on the order of seconds) within each reservation server to cache their allocations. When a lease is about to expire, the server preemptively refreshes its lease in the background so that writes do not block unnecessarily.
For each type of update operation there may need to be a different pool of reservations. Similarly, there could be different error bounds on different read operations. It is up to the designer of the ADT to ensure that all error bounds are enforced with appropriate reservations. Consider a Set with an error tolerance on its size operation. This requires separate pools for add and remove to prevent the overall size from deviating by more than the bound in either direction, so the interval is $[v-\texttt{remove.delta},v+\texttt{add.delta}]$, where $v$ is the size of the set and delta computes the number of outstanding operations from the pool. In some situations, operations may produce and consume tokens in the same pool (e.g., increment producing tokens for decrement), but this is only allowable if updates propagate in a consistent order among replicas, which may not be the case in some eventually consistent systems.
IPA is implemented mostly as a client-side library to an off-the-shelf distributed storage system, though reservations are handled by a custom middleware layer which mediates accesses to any data with error tolerance policies. Our implementation is built on top of Cassandra, but IPA could work with any replicated storage system that supports fine-grained consistency control, which many commercial and research datastores do, including Riak [11].
IPA's client-side programming interface is written in Scala, using the asynchronous futures-based Phantom [45] library for type-safe access to Cassandra data. Reservation server middleware is also built in Scala using Twitter's Finagle framework [63]. Communication is done between clients and Cassandra via prepared statements, and between clients and reservation servers via Thrift remote-procedure-calls [6]. Due to its type safety features, abstraction capability, and compatibility with Java, Scala has become popular for web service development, including widely-used frameworks such as Akka [35] and Spark [5], and at established companies such as Twitter and LinkedIn [2, 18, 29].
The IPA type system, responsible for consistency safety, is also simply part of our client library, leveraging Scala's sophisticated type system.
The IPA type lattice is implemented as a subclass hierarchy of parametric classes, using Scala's support for higher-kinded types to allow them to be destructured in match statements, and implicit conversions to allow Consistent[T] to be treated as type T. We use traits to implement ADT annotations; for example, when the LatencyBound trait is mixed into an ADT, it wraps each of the methods, redefining them to have the new semantics and to return the correct IPA type.
IPA comes with a library of reference ADT implementations used in our experiments, but it is intended to be extended with custom ADTs to fit more specific use cases.
Our implementation provides a number of primitives for building ADTs, some of which are shown in Figure 4. To support latency bounds, there is a generic LatencyBound trait that provides facilities for executing a specified read operation at multiple consistency levels within a time limit. For implementing error bounds, IPA provides a generic reservation pool which ADTs can use. Figure 4 shows how a Counter with error tolerance bounds is implemented using these pools.
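The following sketch conveys the flavor of such an implementation; ReservationPool, Backing, and all method names are illustrative stand-ins rather than the actual Figure 4 code:

// Illustrative sketch of a Counter with an error bound built on a reservation
// pool; the interfaces here are stand-ins, not IPA's actual primitives.
trait ReservationPool {
  def tryAcquire(n: Int): Boolean      // consume local tokens if available
  def synchronizeAndRefill(): Unit     // flush updates (strong write) and re-divide tokens
  def totalTokens: Int
  def localUnused: Int
}

trait Backing {
  def increment(key: String, n: Int, weak: Boolean): Unit
  def read(key: String, weak: Boolean): Int
}

class ErrorBoundedCounter(name: String, pool: ReservationPool, store: Backing) {
  def incr(n: Int = 1): Unit = {
    if (!pool.tryAcquire(n)) {      // out of local tokens:
      pool.synchronizeAndRefill()   // coordinate, then tokens are available again
      require(pool.tryAcquire(n))
    }
    store.increment(name, n, weak = true)  // the actual update is a cheap weak write
  }

  // returns an interval guaranteed to contain the true value
  def read(): (Int, Int) = {
    val v = store.read(name, weak = true)             // weak read from any replica
    val slack = pool.totalTokens - pool.localUnused   // updates possibly buffered elsewhere
    (v, v + slack)
  }
}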
The library of reference ADTs includes:

Counter: based on Cassandra's counter, supporting increment and decrement, with latency and error bounds.

BoundedCounter: the CRDT from [10] that enforces a hard lower bound even with weak consistency. Our implementation adds the ability to bound the error on the value of the counter and to set latency bounds.

Set: with add, remove, contains, and size, supporting latency bounds, and error bounds on size.

UUIDPool: generates unique identifiers, with a hard limit on the number of IDs that can be taken from it; built on top of BoundedCounter and supporting the same bounds.

List: a thin abstraction around a Cassandra table with a time-based clustering order, supporting latency bounds.
Figure 4 shows Scala code using reservation pools to implement a Counter with error bounds. The actual implementation splits this functionality between the client and the reservation server.
The goal of the IPA programming model and runtime system is to build applications that adapt to changing conditions, performing nearly as well as weak consistency but with stronger consistency and safety guarantees. To that end, we evaluate our prototype implementation under a variety of network conditions using both a real-world testbed (Google Compute Engine [28]) and simulated network conditions. We start with microbenchmarks to understand the performance of each of the runtime mechanisms independently. We then study two applications in more depth, exploring qualitatively how the programming model helps avoid potential programming mistakes in each and then evaluating their performance against strong and weakly consistent implementations.
To control for variability, we perform our experiments with a number of simulated conditions, and then validate our findings against experiments run on globally distributed machines in Google Compute Engine.
We use a local test cluster with nodes linked by standard ethernet and Linux's Network Emulation facility [62] (tc netem
) to introduce packet delay and loss at the operating system level. We use Docker containers [24] to enable fine-grained control of the network conditions between processes on the same physical node.
Table 2 shows the set of conditions we use in our experiments. The uniform 5ms link simulates a well-provisioned datacenter; slow replica models contention or hardware problems that cause one replica to be slower than others, and geo-distributed replicates the latencies between virtual machines in the U.S., Europe, and Asia on Amazon EC2 [3]. These simulated conditions are validated by experiments on Google Compute Engine with virtual machines in four datacenters: the client in us-east, and the storage replicas in us-central, europe-west, and asia-east. We elide the results for Local (same rack in our testbed) except in Figure 11 because the differences between policies are negligible, so strong consistency should be the default there.
Table 2. Network conditions used in our experiments (latencies in ms to each replica).

Network Condition | Replica 1 | Replica 2 | Replica 3
---|---|---|---
Simulated: Uniform / High load | 5 | 5 | 5
Simulated: Slow replica | 10 | 10 | 100
Simulated: Geo-distributed (EC2) | 1 ± 0.3 | 80 ± 10 | 200 ± 50
Actual: Local (same rack) | <1 | <1 | <1
Actual: Google Compute Engine | 30 ± <1 | 100 ± <1 | 160 ± <1
We start by measuring the performance of a simple application that randomly increments and reads from a number of counters with different IPA policies. Random operations (incr(1) and read) are uniformly distributed over 100 counters from a single multithreaded client (allowing up to 4000 concurrent operations).