Paper Insights - Dynamo: Amazon's Highly Available Key-value Store
This groundbreaking paper, presented at SOSP 2007, has become a cornerstone in the field of computer systems, profoundly influencing subsequent research and development. It served as a blueprint for numerous NoSQL databases, including prominent examples like MongoDB, Cassandra, and Azure Cosmos DB.
A deep dive into this work is essential for anyone interested in distributed systems. It explores several innovative concepts that will captivate and enlighten readers.
Recommended Read - Paper Insights - Cassandra - A Decentralized Structured Storage System, where I discuss failure detection and gossip protocols in detail.
Let's visit some fundamental ideas (with a caution that there are several of them!).
Distributed Hash Tables (DHTs)
A DHT is a decentralized system that provides a lookup service akin to a traditional hash table. Key characteristics of DHTs include:
- Autonomy and Decentralization: Nodes operate independently, forming the system without centralized control.
- Fault Tolerance: The system remains reliable even when nodes join, leave, or fail.
- Scalability: It efficiently handles systems with thousands or millions of nodes.
- Anonymity: Participant identities can be hidden.
Keyspace Partitioning
DHTs distribute ownership of the keyspace among participating nodes. Common partitioning methods include:
- Consistent Hashing: This widely used method partitions the keyspace based on a distance function δ(k1, k2) between keys. Each node is assigned an identifier, and the node with identifier ix owns every key k for which δ(ix, k) is minimal (see the sketch after this list).
- Rendezvous Hashing: All clients employ the same hash function h(key, server) and assign a key to the server that yields the highest hash value.
- Locality-Preserving Hashing: This method assigns similar keys to similar nodes, enhancing data locality.
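To make the consistent hashing idea concrete, here is a minimal Python sketch (my own illustration, not tied to any particular system): node names are hashed onto a ring, and a key is owned by the first node clockwise from the key's hash. The `HashRing` name and the choice of MD5 are assumptions for illustration.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent hashing ring: a key is owned by the first
    node whose position on the ring follows the key's hash."""

    def __init__(self, nodes):
        # Position every node on the ring by hashing its name.
        self._ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def owner(self, key: str) -> str:
        # Binary search for the first node clockwise from the key's hash;
        # wrap around to the start of the ring if we run off the end.
        positions = [pos for pos, _ in self._ring]
        idx = bisect.bisect_right(positions, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.owner("user:42"))   # only keys between two ring positions move
                               # when a node joins or leaves
```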
Overlay Network
Each node maintains a routing table to direct requests to the node responsible for the specific key. This is called an overlay network.
A trade-off exists between routing table size and the number of hops required for a request to reach its destination. Larger routing tables generally reduce the number of hops.
Consistent Hashing
Modulo Hash: Maps a key x to node H(x) % N, where H(x) is the hash of the key and N is the number of nodes. Challenge: Changes to N (e.g., node additions/removals) invalidate almost all existing key-to-node assignments, and hence all cached values.
Chord Hash: Places node identifiers and key hashes on the same circular identifier space; a key x is owned by its successor, the first node whose identifier is equal to or follows H(x) on the ring.
Tree Hash: Uses H(x) to traverse a tree and locate the responsible node. Challenge: Node failures require efficient reassignment of the failed node's data.
Kademlia Hash: Stores a key x on the node closest to H(x) by XOR distance. The XOR distance of two identifiers is computed bit by bit: it is zero where the bits are equal and one where they differ (i.e., the bitwise XOR interpreted as a number).
Load Distribution in Consistent Hashing
With a single position per node, consistent hashing can assign very uneven shares of the keyspace: some nodes end up owning much larger arcs of the ring than others.
Better Solution: Redundant Node Placement - place each physical node at multiple positions on the ring (virtual nodes) so that its total share of the keyspace evens out.
CARP (Cache Array Routing Protocol)
Each machine i is assigned a hash function hi, where i ranges from 1 to N. For an input x, calculate a hash value Hi = hi(x) for each machine and assign x to the machine with the largest hash value Hi.
Pros: When a machine fails, the load is automatically redistributed. Since the hash functions are independent of machine failures, the next machine chosen for a given input is effectively random, spreading the failed machine's keys across the survivors.
Cons: Mapping an input to a machine requires calculating N hash values, resulting in linear time complexity O(N). This can significantly impact performance, especially during high-traffic periods. In contrast, distributed hashing protocols like Chord and Kademlia offer logarithmic time complexity, O(log N), for finding the responsible node, making them more efficient for large-scale systems.
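Here is a small sketch of the highest-random-weight idea behind CARP and rendezvous hashing. It assumes a single combined hash of (machine, key) rather than per-machine hash functions hi, which is equivalent for illustration purposes; the machine names are made up.

```python
import hashlib

def carp_owner(machines: list[str], key: str) -> str:
    """Rendezvous (highest-random-weight) selection: score every machine
    for this key and pick the machine with the largest score - O(N) hashes."""
    def score(machine: str) -> int:
        return int(hashlib.md5(f"{machine}:{key}".encode()).hexdigest(), 16)
    return max(machines, key=score)

machines = ["cache-1", "cache-2", "cache-3"]
print(carp_owner(machines, "user:42"))
# If cache-2 fails, only the keys it owned get re-scored onto the survivors:
print(carp_owner([m for m in machines if m != "cache-2"], "user:42"))
```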
Routing Algorithm in Consistent Hashing
In decentralized systems with millions of machines, several critical challenges arise:
- Dynamic Membership: Nodes frequently join and leave the network.
- Decentralized Control: Nodes are owned and operated by independent entities.
Challenges of Early Peer-to-Peer Systems:
- Napster: Relied on a centralized server to maintain a global index and route requests. This single point of failure made the system vulnerable to disruption.
- Gnutella: Employed a decentralized approach where each peer connected to a limited number of neighbors. However, locating resources often involved costly broadcasts to all connected peers, leading to significant overhead.
Consistent Hashing as a Solution:
Consistent hashing provides an effective mechanism for routing requests in decentralized systems.
Routing Algorithm for Chord:
Each node is assigned a unique identifier (e.g., a hash value). Each node maintains a finger table of O(log N) entries: the i-th entry points to the first node whose identifier is at least 2^(i-1) ahead of the node's own identifier on the ring. A lookup repeatedly forwards the request to the finger that gets closest to the key without overshooting it, roughly halving the remaining distance on every hop, so lookups complete in O(log N) hops.
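Below is a rough sketch of finger-table routing under these rules. The 16-bit identifier space, the node identifiers, and the shortcut of computing each hop's finger table on the fly are illustrative simplifications, not Chord's actual pseudocode.

```python
import bisect

M = 16                       # bits in the identifier space (illustrative)
RING = 2 ** M

def finger_table(node_id: int, nodes: list[int]) -> list[int]:
    """i-th finger = first live node at or after node_id + 2**i on the ring."""
    nodes = sorted(nodes)
    fingers = []
    for i in range(M):
        target = (node_id + 2 ** i) % RING
        idx = bisect.bisect_left(nodes, target) % len(nodes)
        fingers.append(nodes[idx])
    return fingers

def lookup(start: int, key_hash: int, nodes: list[int]) -> int:
    """Greedily follow fingers that do not overshoot the key; each hop
    roughly halves the remaining clockwise distance -> O(log N) hops."""
    nodes_sorted = sorted(nodes)
    owner = nodes_sorted[bisect.bisect_left(nodes_sorted, key_hash) % len(nodes_sorted)]
    current = start
    while current != owner:
        dist = (key_hash - current) % RING
        candidates = [f for f in finger_table(current, nodes)
                      if (f - current) % RING <= dist and f != current]
        # If no finger makes progress, the next hop is the owner itself.
        current = max(candidates, key=lambda f: (f - current) % RING) if candidates else owner
    return current

nodes = [1, 2000, 9000, 23000, 41000, 60000]
print(lookup(start=1, key_hash=42000, nodes=nodes))   # -> 60000
```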
Routing Algorithm for Kademlia:
Each node is assigned a unique identifier (e.g., a hash value).
Each node maintains a routing table with entries derived from the binary representation of its own identifier b1b2b3...:
- b1': entries for nodes whose first bit differs from the current node's.
- b1b2': entries for nodes that share the first bit but whose second bit differs.
- b1b2b3': entries for nodes that share the first two bits but whose third bit differs, and so on.
In essence, this algorithm allows nodes to efficiently route requests towards the node responsible for the target hash by iteratively correcting one bit at a time.
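A small sketch of XOR-distance routing follows. It assumes 8-bit identifiers and gives every hop a global view of the node list, which simplifies away the per-node k-buckets; the point is only to show how the XOR metric drives the greedy walk toward the key.

```python
def xor_distance(a: int, b: int) -> int:
    """Kademlia's metric: bitwise XOR interpreted as an integer."""
    return a ^ b

def k_closest(key_id: int, node_ids: list[int], k: int = 3) -> list[int]:
    """The k nodes 'closest' to the key under the XOR metric."""
    return sorted(node_ids, key=lambda n: xor_distance(n, key_id))[:k]

def route(start_id: int, key_id: int, node_ids: list[int]) -> list[int]:
    """Greedy routing: each hop moves to a known node that is strictly
    closer to the key, i.e., fixes the highest differing bit."""
    path = [start_id]
    current = start_id
    while True:
        closer = [n for n in node_ids
                  if xor_distance(n, key_id) < xor_distance(current, key_id)]
        if not closer:
            return path                      # current is the closest node
        current = min(closer, key=lambda n: xor_distance(n, key_id))
        path.append(current)

nodes = [0b00010101, 0b01100110, 0b10101010, 0b11001100, 0b11110000]
print(k_closest(0b10101111, nodes, k=2))     # replicas for this key
print(route(0b00010101, 0b10101111, nodes))  # hops toward the key
```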
Problem:
In a system with a million nodes, Chord typically requires around 20 hops to locate a specific key. However, the failure of a single node along the routing path can cause the entire lookup to fail.
Kademlia addresses this issue through replication. It employs a replication factor k = 20, meaning each key is stored on the 20 nodes closest to it by XOR distance. To support this, each node maintains up to 20 entries (a k-bucket) for each prefix of its own ID, each pointing to one of the k closest nodes for that prefix.
Constant Size Routing Tables (Koorde Hash)
Koorde is a DHT that achieves constant-size routing tables independent of N: it embeds the nodes in a de Bruijn graph, so each node keeps only a constant number of neighbors yet lookups still complete in O(log N) hops by following the de Bruijn edges. More generally, in graph- or tree-structured overlays, efficient routing can be achieved by strategically placing nodes within the structure.
SQL vs NoSQL Databases
SQL (relational) databases offer ACID transactional guarantees:
- Atomicity: Transactions execute as a single, indivisible unit, either succeeding or failing completely.
- Consistency: Ensures that all data modifications maintain the integrity of the database schema (e.g., data type constraints, referential integrity).
- Isolation: Transactions appear to execute serially, even when occurring concurrently, preventing unexpected interactions. This is one of the most important and distinctive features of SQL databases.
- Durability: Committed data persists permanently, even in the event of system failures.
However, SQL databases are limited in scalability: it is difficult to distribute data across multiple nodes while maintaining ACID properties, and vertical scaling is capped by the capacity of individual servers.
NoSQL databases were developed to address the limitations of SQL databases, particularly in terms of scalability and flexibility.
There are different types of NoSQL databases supporting different data models.
- Key-Value Stores: (e.g., Redis, Memcached, DynamoDB) Simple data structures with fast read/write operations.
- Document Stores: (e.g., MongoDB, CouchDB) Store data in flexible, document-like structures.
- Wide-Column Stores: (e.g., Cassandra, HBase) Efficiently handle large datasets with sparse attributes.
- Graph Databases: (e.g., Neo4j) Optimized for representing and querying relationships between data.
NoSQL databases also tend to rely on simpler consistency mechanisms (e.g., Dynamo's vector clocks) rather than heavyweight consensus protocols such as Paxos used by some distributed SQL databases.
Quorum Systems
Quorum systems are used to ensure that multiple nodes in a system agree on a decision or state. Quorum systems are effective for linearizability, which ensures that operations on a single key appear to occur in a specific order. However, they are not suitable for strict serializability, which guarantees that concurrent transactions appear to execute in a serial order.
Transactions in Quorum Systems
Transactions within a quorum system exhibit the following characteristics:
- Atomicity: Transactions either complete successfully or fail entirely.
- Linearizability: Operations on individual keys are executed in a well-defined order.
To enable transactions, a two-phase commit protocol is typically employed:
- Write Phase:
  - The value for the key is written to a specific number of replicas (C).
  - Each replica provides a commit vote, essentially locking the key against other writes.
- Commit Phase:
  - Once enough votes are collected, the transaction is committed.
While atomicity is inherent to transactions, it's not always necessary in a quorum system. Maintaining consistent states across all replicas can be resource-intensive and unnecessary for achieving linearizability.
Voting Protocol for Quorums
The most common way of using quorum systems involves voting. In this scenario, atomicity of operations may not be guaranteed, but with proper voting configuration, the system can appear linearizable to the client, even if individual replicas are in inconsistent states.
Given N servers, W (write quorum), and R (read quorum), the system can achieve linearizability if:
R + W > N
Consider a system with three servers (A, B, C), where R = 2 and W = 2. Say, a write operation succeeds on server A but fails on the others.
- The overall write operation fails. However, because this action is non-atomic, node A is left in an inconsistent state.
- The client can read data from replicas A and B. Upon detecting conflicting values from A and B, it reads from C as a tie-breaker and receives the same version as B; since the write never reached a quorum, the older value held by B and C is the correct one.
- Server A can then repair its state by fetching the correct value from another server.
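Here is a toy sketch of the read path in the example above, assuming unversioned values and majority-based tie-breaking purely for illustration; real systems attach versions or vector clocks to values, which is exactly what the following sections build toward.

```python
# Toy replica state after the failed write in the example above:
# the write reached only A, so B and C still hold the old value.
replicas = {"A": "new", "B": "old", "C": "old"}
N, R, W = 3, 2, 2   # R + W > N, so a read quorum overlaps every write quorum

def quorum_read(read_from: list[str]) -> str:
    """Read from R replicas; on conflict, consult the remaining replicas,
    keep the majority value, and repair the outliers (read repair)."""
    values = {name: replicas[name] for name in read_from[:R]}
    if len(set(values.values())) > 1:                  # conflicting responses
        others = [n for n in replicas if n not in values]
        values.update({n: replicas[n] for n in others})
    winner = max(set(values.values()), key=list(values.values()).count)
    for name, value in values.items():
        if value != winner:
            replicas[name] = winner                    # repair the outlier
    return winner

print(quorum_read(["A", "B"]))   # conflict -> consults C -> returns "old"
print(replicas["A"])             # A has been repaired to "old"
```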
Linearizability cannot be guaranteed when R + W <= N. In such cases, conflicts can arise.
Say a write W1 succeeded at replicas A and B, and a write W2 to the same key succeeded at replicas B and C. Since B observed both writes, it can break ties. But if B is down, a read may return either A's or C's version, and there is a conflict. In this case, W = 2 but R = 1, and hence R + W = 3 <= N.
Vector Clocks
Vector Clocks are a powerful technique for establishing causal order among events in a distributed system. They enhance Lamport's clock by maintaining a separate logical clock for each node within the system.
In an N-node system, a vector clock is an N-dimensional vector, where each element represents the logical clock of a corresponding node.
Event Handling:
- Local Event: Increment the local node's clock (i.e., the i-th element of the vector).
- Send Event: Increment the local node's clock and transmit the current vector clock V[1...N] along with the message.
- Receive Event: Increment the local node's clock, then update every other element by taking the element-wise maximum of the received clock and the local clock: V[j] = max(Vmsg[j], V[j]) ∀ j ≠ i.
Comparing Vector Clocks:
- Equality: Two vector clocks V1 and V2 are equal if and only if all their corresponding elements are equal: V1 = V2 iff V1[i] = V2[i] ∀ i ∈ {1, ..., N}.
- Less Than or Equal To: V1 is less than or equal to V2 if and only if every element of V1 is less than or equal to the corresponding element of V2: V1 ≤ V2 iff V1[i] ≤ V2[i] ∀ i ∈ {1, ..., N}.
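A compact sketch of these rules follows, assuming nodes are identified by string names and using a dictionary instead of a fixed N-length vector (which mirrors the <A: 1, B: 2> notation used in the next section); the class name is my own.

```python
class VectorClock:
    """Per-node logical clocks supporting the three event rules and the
    element-wise comparison described above."""

    def __init__(self, clocks=None):
        self.clocks = dict(clocks or {})          # node name -> counter

    def increment(self, node: str) -> "VectorClock":
        # Local and send events: bump this node's entry.
        self.clocks[node] = self.clocks.get(node, 0) + 1
        return self

    def merge(self, other: "VectorClock", node: str) -> "VectorClock":
        # Receive event: bump own entry, then take the element-wise maximum.
        self.increment(node)
        for name, count in other.clocks.items():
            if name != node:
                self.clocks[name] = max(self.clocks.get(name, 0), count)
        return self

    def __le__(self, other: "VectorClock") -> bool:
        # V1 <= V2 iff every component of V1 is <= the matching one in V2.
        return all(count <= other.clocks.get(name, 0)
                   for name, count in self.clocks.items())

    def concurrent(self, other: "VectorClock") -> bool:
        # Neither dominates the other: the events are causally unrelated.
        return not (self <= other) and not (other <= self)

a = VectorClock().increment("A")                 # <A: 1>
b = VectorClock().increment("B").increment("B")  # <B: 2>
print(a.concurrent(b))                           # True: no causal order
print(a <= VectorClock({"A": 1, "B": 2}))        # True: happened before
```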
Vector Clocks in Quorum Systems
While quorum systems with R + W <= N cannot guarantee linearizability, vector clocks can be employed to effectively resolve conflicts and facilitate eventual reconciliation.
Approach:
- Each node maintains and updates its vector clock according to the rules described earlier.
- When a write operation occurs on a key, the current state of the vector clock is associated with the written value.
For example, a stored entry may look like <key, value, <A: 1, B: 2>>, where <A: 1, B: 2> represents the vector clock at the time of the write.
Conflict Resolution:
- Partial Ordering: Vector clocks enable a partial ordering of values: V1 ≤ V2 iff V1[i] ≤ V2[i] ∀ nodes i. If V1 ≤ V2, the value tagged with V1 is superseded by the value tagged with V2; if neither clock dominates the other, the two values genuinely conflict.
- Conflict Resolution Strategies:
- Choose a value randomly.
- Select the value associated with the most recent physical timestamp.
Neither of these conflict resolution strategies guarantees linearizability.
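A sketch of how a read could prune obsolete versions using the partial order above: versions whose clocks are strictly dominated are dropped, and whatever survives is returned to the client as conflicting siblings. The function names and the dictionary clock representation are assumptions for illustration.

```python
def dominates(c1: dict, c2: dict) -> bool:
    """True if clock c2 is causally ahead of c1: every component of c1 is
    <= the matching component of c2, and at least one is strictly smaller."""
    keys = set(c1) | set(c2)
    return (all(c1.get(k, 0) <= c2.get(k, 0) for k in keys)
            and any(c1.get(k, 0) < c2.get(k, 0) for k in keys))

def reconcile(versions):
    """versions: list of (value, clock). Drop versions whose clock is strictly
    dominated by another's; the survivors are conflicting siblings that need
    application-level (semantic) reconciliation."""
    return [(v, c) for i, (v, c) in enumerate(versions)
            if not any(i != j and dominates(c, c2)
                       for j, (_, c2) in enumerate(versions))]

print(reconcile([
    ("v1", {"A": 1}),            # superseded by v2
    ("v2", {"A": 2, "B": 1}),
    ("v3", {"A": 1, "C": 1}),    # concurrent with v2 -> kept as a sibling
]))
```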
Anti-Entropy using Merkle Trees
Say a set of replicas owns a keyspace. Keeping these replicas in sync, by detecting and repairing divergence, is crucial; the repair process is often referred to as anti-entropy.
Merkle trees offer a robust mechanism for detecting conflicts and efficiently updating outdated values within this keyspace. This is particularly valuable in quorum systems, where replica inconsistencies can frequently arise.
A merkle tree is a binary tree-based data structure where each leaf node holds the hash of some items. Internal nodes, in turn, hold the hash of their respective child nodes. In this case, the leaf nodes hold the hash of values of all the keys. The hierarchical structure enables efficient verification of data integrity and facilitates the identification of discrepancies between replicas.
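A minimal Merkle tree sketch, assuming SHA-256 and a list-of-levels representation. For brevity it compares the roots and then the leaf hashes directly; a full implementation would walk down only the mismatching subtrees, which is where the bandwidth savings come from.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle(kv: dict) -> list[list[bytes]]:
    """Bottom level: hash of each key=value pair (keys sorted so replicas
    agree on leaf order). Each upper level hashes pairs of children."""
    level = [h(f"{k}={kv[k]}".encode()) for k in sorted(kv)]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + (level[i + 1] if i + 1 < len(level) else b""))
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff_leaves(t1, t2) -> list[int]:
    """Indices of leaves whose hashes differ; if the roots match, nothing
    needs to be synchronized."""
    if t1[-1] == t2[-1]:
        return []
    return [i for i, (a, b) in enumerate(zip(t1[0], t2[0])) if a != b]

replica_1 = {"k1": "v1", "k2": "v2", "k3": "v3", "k4": "v4"}
replica_2 = {"k1": "v1", "k2": "v2-stale", "k3": "v3", "k4": "v4"}
t1, t2 = build_merkle(replica_1), build_merkle(replica_2)
print(diff_leaves(t1, t2))   # -> [1]: only k2 must be transferred
```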
Failure Detectors
Failure detection and the gossip protocols that drive it are covered in detail in the Cassandra paper insights linked at the top of this post.
Dynamo
Dynamo, a pioneering key-value NoSQL database, has significantly influenced the landscape of modern NoSQL systems. A key innovation of Dynamo lies in its deliberate trade-off of strict consistency for achieving low-latency and high availability.
Recognizing the critical dependencies of numerous other Amazon services on Dynamo, the system prioritizes stringent tail latency requirements. Specifically, it focuses on achieving high performance at the 99.9th percentile of latency. This emphasis stems from the understanding that, in many scenarios, a slightly stale (possibly inconsistent) response is preferable to no response or a delayed one. The additive nature of latency SLAs across interdependent systems underscores the potential for severe cascading failures if even a small fraction of requests experience prolonged latencies.
The API
Following the design principles of many successful distributed systems, Dynamo boasts a remarkably simple API:
- Get(key) -> (context, list of values): Retrieves the value(s) associated with the given key, along with contextual information.
- Put(key, context, value) -> void: Stores the specified value under the given key, utilizing the provided context for conflict resolution.
The context parameter in these operations is an opaque byte string. This context information is crucial for resolving write conflicts using a vector clock mechanism.
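The snippet below is a hypothetical, in-memory stand-in for a Dynamo-like client (not Amazon's actual client library); it only illustrates how the opaque context flows from Get back into Put so the store knows which versions a write supersedes.

```python
import itertools

class TinyDynamoClient:
    """In-memory stand-in mimicking the API shape: Get returns
    (context, list of values), and Put takes the context back so the store
    knows which versions the new write supersedes."""

    _version_ids = itertools.count(1)

    def __init__(self):
        self._versions = {}                    # key -> list of (version_id, value)

    def get(self, key):
        versions = self._versions.get(key, [])
        context = [vid for vid, _ in versions]  # opaque byte string in real Dynamo
        return context, [value for _, value in versions]

    def put(self, key, context, value):
        # Keep only versions the caller has NOT seen; they remain as conflicts.
        kept = [(vid, v) for vid, v in self._versions.get(key, []) if vid not in context]
        self._versions[key] = kept + [(next(self._version_ids), value)]

client = TinyDynamoClient()
client.put("cart:alice", [], ["book"])
context, values = client.get("cart:alice")
client.put("cart:alice", context, values[0] + ["pen"])   # read-modify-write
print(client.get("cart:alice")[1])                        # [['book', 'pen']]
```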
Consistent Hashing in Dynamo
Dynamo was developed in 2007, during a period of intense interest in decentralized systems. Reflecting this trend, Dynamo employs a peer-to-peer architecture where all nodes are identical and utilize consistent hashing for efficient overlay network formation. This design adheres to the principles of incremental scalability, decentralization, symmetry, and heterogeneity.
Traditional consistent hashing presents challenges, such as key redistribution when a new node joins, potentially triggering expensive merkle tree recomputations across multiple nodes. That is why, Dynamo employs a variant of consistent hashing.
Dynamo's strategy:
- The keyspace is divided into Q equally sized partitions.
- If the system has S nodes, each node is assigned Q/S tokens on average; the number of tokens a node receives is proportional to its physical capacity.
This approach offers several advantages:
- When a node leaves, only the new nodes responsible for its partitions need to recompute merkle trees, as the partitions themselves remain fixed.
- Partitioning (defined by Q) and partition placement are decoupled.
- The routing table maintains a constant number of entries (Q).
Additionally, a fairness (load balancing efficiency) metric is defined as the ratio of the average number of requests served by each node to the maximum number of requests served by the hottest node.
Note, the same physical machines may end up at multiple positions on the ring, also known as virtual nodes.
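A sketch of this fixed-partition scheme, assuming Q equal partitions and a relative capacity weight per node; the round-robin dealing is my own simplification of how tokens might be spread in proportion to capacity.

```python
import itertools

Q = 12                                                 # fixed, equally sized partitions
capacities = {"node-a": 2, "node-b": 1, "node-c": 1}   # relative capacity weights

def assign_partitions(q: int, capacities: dict) -> dict:
    """Deal the Q partitions round-robin over a capacity-weighted node list,
    so each node receives roughly Q * (its share of total capacity) tokens."""
    weighted = [n for n, c in capacities.items() for _ in range(c)]
    owners = {}
    for partition, node in zip(range(q), itertools.cycle(weighted)):
        owners[partition] = node
    return owners

print(assign_partitions(Q, capacities))
# Partition boundaries never move: when a node leaves, only its partitions
# are handed to other nodes, and only those Merkle trees change hands.
```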
Handling Client Operations
Each key k is assigned to a coordinator node. This coordinator node is responsible for replicating the keys that fall within its assigned range.
For redundancy, the coordinator replicates the key to N - 1 successor nodes within the ring. This results in a system where each node is responsible for the region of the ring between itself and its Nth predecessor.
The list of nodes responsible for storing a particular key is termed the preference list. Since each physical node may be responsible for multiple ranges on the ring, the preference list strategically skips positions on the ring to ensure that it only contains distinct physical nodes.
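A sketch of building a preference list from a ring of virtual nodes, assuming a small hard-coded token-to-node map: the walk skips ring positions owned by a physical node that is already on the list, so the result contains N distinct machines.

```python
import bisect
import hashlib

# Ring positions (tokens) -> physical node; one machine may own several
# positions because of virtual nodes.
ring = {100: "node-a", 220: "node-b", 340: "node-a", 460: "node-c", 580: "node-b"}
RING_SIZE = 600

def preference_list(key: str, n: int = 3) -> list[str]:
    """Walk clockwise from the key's position; the first distinct physical
    node is the coordinator, and the next n-1 distinct nodes are replicas."""
    positions = sorted(ring)
    key_pos = int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE
    start = bisect.bisect_right(positions, key_pos)
    prefs = []
    for i in range(len(positions)):
        node = ring[positions[(start + i) % len(positions)]]
        if node not in prefs:             # skip repeated virtual nodes
            prefs.append(node)
        if len(prefs) == n:
            break
    return prefs

print(preference_list("user:42"))   # e.g., ['node-b', 'node-a', 'node-c']
```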
Clients can interact with the system through two primary methods:
- Directly using the assignment map: This approach offers efficient and low-latency access.
- Through a load balancer: This may introduce additional latency.
If a client initially contacts a different node (not the key's coordinator), the request will be redirected to the correct coordinator.
Unlike traditional quorum systems that enforce linearizability by adhering to the R + W > N rule, Dynamo deviates from this approach. This deviation is intentional, as adhering to R + W > N can lead to significant latency increases, especially when N is large.
Instead of strict quorum rules, Dynamo utilizes vector clocks associated with each key's value to handle conflicting writes. To ensure durability, the coordinator synchronously replicates the write to at least one other node before acknowledging the client's request.
For applications demanding high-performance and consistent reads, a configuration of R=1 and W=N may be suitable.
Section 4.4 provides a detailed example of how vector clocks are employed to reconcile conflicting values for a key. While the specific details are well explained in the paper, the core idea is that D2 precedes D3 because Sx(D3) = Sx(D2) and Sy(D3) > Sy(D2). D3 and D4 are considered conflicting versions because Sy(D3) > Sy(D4) and Sz(D4) > Sz(D3). These conflicts are resolved through semantic reconciliation.
The paper also mentions that Dynamo truncates the vector clock to 10 elements by removing the oldest entry. This design decision raises concerns, as it may potentially lead to several unforeseen issues.
Handling Failures
Temporary Failure
Dynamo introduces "hinted handoff", a technique that improves fault tolerance. When a node is unavailable, writes intended for it are redirected to another healthy node further along the ring, together with a hint identifying the intended recipient. These "hinted writes" are stored separately, facilitating efficient transfer once the target node becomes available again.
This approach inherently breaks the traditional quorum algorithm, leading Dynamo to adopt a "sloppy quorum" strategy.
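A toy sketch of hinted handoff, assuming a per-node hint buffer: a replica meant for a down node is parked on another healthy node together with a hint naming the intended owner, and is delivered once that owner is reachable again.

```python
class Node:
    def __init__(self, name):
        self.name, self.up = name, True
        self.data = {}            # key -> value
        self.hints = []           # (intended_owner, key, value), kept separately

def write(key, value, preference_list, all_nodes):
    """Send the write to every node in the preference list; if one is down,
    hand the replica to the next healthy node along with a hint."""
    ring = list(all_nodes.values())
    for owner_name in preference_list:
        owner = all_nodes[owner_name]
        if owner.up:
            owner.data[key] = value
        else:
            fallback = next(n for n in ring if n.up and n.name not in preference_list)
            fallback.hints.append((owner_name, key, value))

def deliver_hints(all_nodes):
    """Once a node is back up, hinted replicas are pushed to it and removed."""
    for node in all_nodes.values():
        for owner_name, key, value in list(node.hints):
            owner = all_nodes[owner_name]
            if owner.up:
                owner.data[key] = value
                node.hints.remove((owner_name, key, value))

nodes = {n: Node(n) for n in ["A", "B", "C", "D"]}
nodes["B"].up = False
write("k1", "v1", ["A", "B", "C"], nodes)     # B is down -> D holds a hint
nodes["B"].up = True
deliver_hints(nodes)
print(nodes["B"].data)                         # {'k1': 'v1'}
```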
Permanent Failure
When a replica fails permanently, writes are still directed to at least N-1 other nodes, ensuring some level of data durability. This is where merkle trees prove invaluable.
Dynamo incorporates an anti-entropy replica synchronization protocol. In this protocol, the merkle tree serves as a data structure where leaf nodes represent the hashes of values of individual key-value pairs. During synchronization, a comparison of merkle trees between replicas reveals discrepancies, enabling the efficient transfer of only the out-of-sync values.
Ring Membership
Only administrators possess the authority to add or remove members from the ring. Upon any membership change, a gossip-based protocol propagates these updates throughout the system, ensuring eventual consistency in the membership view across all nodes. In this protocol, each node randomly selects a peer for communication every second.
Failure detection within the system also leverages a simple gossip-style mechanism. However, this detection is inherently local. Node A may deem node B as failed if it fails to receive responses from B, even though B might still be responsive to other nodes like C. To further reinforce failure detection, a periodic loop is implemented to actively check for the presence of node B.
Evaluation
- Tail latencies: The 99.9th percentile latency (p999) hovers around 200ms, significantly higher than the average latency.
- Load Balancing: The new consistent partitioning strategy demonstrates excellent load balancing efficiency, exceeding 0.9.
- Version Divergence: Two primary scenarios contribute to version divergence: system failures and concurrent write operations.
- 99.94% of requests encounter a single version of the data.
- 0.00057% encounter two versions.
- 0.00047% encounter three versions.
- 0.00009% encounter four versions.
- Client-Driven Coordination: Implementing client-driven coordination, where clients directly contact servers based on their knowledge of the key-to-preference list mapping, results in a 30ms reduction in both read and write latencies at the 99.9th percentile.
Paper Review
The Dynamo paper is a seminal work in the field of distributed systems. It introduces several groundbreaking ideas that have significantly influenced the design and development of numerous NoSQL databases.
Given the depth and breadth of the paper, these insights necessarily provide only a high-level overview. To fully grasp the nuances and intricacies of Dynamo's design, I highly recommend a thorough and repeated reading of the original paper.
PS: Towards Distributed SQL Databases
Even with the rise of NoSQL, the need for strong consistency and data integrity remained paramount for critical applications like financial systems.
Google Spanner emerged as a groundbreaking achievement, becoming the first distributed SQL database to offer strong consistency guarantees. It pioneered the concept of "DRDBMS" – systems that are:
- Scalable: Capable of horizontal scaling to handle massive datasets and high throughput.
- Consistent: Adhering to ACID properties, ensuring data integrity and transactional isolation in a distributed environment.
- Geographically Replicated: Providing high availability and low latency by distributing data across multiple geographic locations.
- SQL-compliant: Supporting the familiar SQL language for data querying and manipulation.
Amazon Aurora, another significant player, attempted to enhance consistency by optimizing consensus protocols on top of existing standalone SQL databases like MySQL. However, this approach often faced limitations, with a single write node frequently becoming a performance bottleneck.
CockroachDB, drawing inspiration from Spanner, offers a scalable and consistently replicated transactional store. However, it differentiates itself by:
- Independence from Atomic Clocks: Unlike Spanner, CockroachDB does not rely on tightly synchronized atomic clocks, making it more adaptable to diverse deployment environments.
- Open Source and Platform Agnostic: CockroachDB is open-source and can be deployed outside of Google Cloud Platform (GCP), providing greater flexibility and choice for organizations.