Paper Insights - Cassandra - A Decentralized Structured Storage System
This research paper, authored by Avinash Lakshman (co-inventor of Amazon Dynamo) and Prashant Malik, originates from Facebook and dates back to 2008.
Paper Overview
Cassandra, in its design, appears to be a synthesis of Amazon's Dynamo (2007) and Google's Bigtable (2006). It draws heavily upon the concepts of both systems. Notably, this paper was published during the rise of these influential databases. However, Cassandra also introduces novel ideas that warrant further investigation.
Recommended Read: Paper Insights - Dynamo: Amazon's Highly Available Key-value Store
Let's begin with some fundamental concepts.
SQL Databases
SQL databases are a category of databases that are inherently consistent: data integrity is always upheld. For instance, in a banking database, the cumulative balance across all accounts must remain unchanged regardless of how many transfer transactions are executed.
To ensure this data consistency (the C in ACID), SQL databases necessitate support for atomic operations (the A in ACID), transaction isolation (the I in ACID), and durable storage (the D in ACID).
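To make this invariant concrete, here is a toy Python sketch. The Bank class and its single lock are illustrative stand-ins; a real SQL database enforces this with transactions and isolation levels, not a global lock:

```python
import threading

class Bank:
    """Toy model of the banking invariant: the total balance across all
    accounts never changes, no matter how many transfers are executed."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self._lock = threading.Lock()  # stands in for transaction isolation

    def transfer(self, src, dst, amount):
        # Both updates happen atomically: either both apply or neither does.
        with self._lock:
            if self.balances[src] < amount:
                return False  # abort; state is left untouched
            self.balances[src] -= amount
            self.balances[dst] += amount
            return True

    def total(self):
        with self._lock:
            return sum(self.balances.values())

bank = Bank({"alice": 100, "bob": 50})
bank.transfer("alice", "bob", 30)
assert bank.total() == 150  # the invariant holds after any transfer
```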
SQL databases can be categorized into two types:
- Real-time, Strongly Consistent: Examples include Spanner, AuroraDB, and non-distributed versions of MySQL. These databases guarantee immediate visibility of transaction results, ensuring strong consistency in real-time.
- Non-real-time, Consistent: Data warehouses like Mesa, Napa, and Snowflake fall under this category. While maintaining consistency, they may exhibit some data staleness as all transactions might not have been immediately applied.
Both Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) applications critically rely on SQL databases; NoSQL databases, which relax these guarantees, are not a suitable foundation for such transactional or analytical workloads.
NoSQL Databases
Databases that do not adhere to the SQL semantics discussed above are generally categorized as NoSQL databases. These databases are sometimes referred to as BASE databases, contrasting them with ACID databases. They often prioritize faster transaction processing by relaxing some of the stringent ACID guarantees.
For example, Bigtable, a prominent NoSQL database, supports atomicity only for modifications within a single row.
NoSQL databases exhibit diverse architectures and can be broadly classified into the following categories:
- Document Databases: In these databases, keys are typically strings, and the corresponding values are structured documents, often in formats like JSON. Many document databases, including MongoDB and Couchbase, draw inspiration from the design principles of Amazon DynamoDB. While MongoDB has undergone significant evolution, its core concepts remain largely similar to those of DynamoDB.
- Wide-Column Databases: These databases allow for a dynamic number of columns in a table. Prominent examples include Cassandra, HBase, and Bigtable, all of which are heavily influenced by the data structures and design principles of Google's Bigtable.
- Key-Value Stores: These databases can be viewed as a more generalized form of document stores, offering a simpler key-value data model.
- Graph Databases: Designed to efficiently store and query highly interconnected data, graph databases like Facebook's Tao and Neo4j excel at representing and analyzing relationships within complex networks.
Cassandra
- Key Format: Employs a <row, column_family> key format, closely resembling Bigtable. Notably, Bigtable, Cassandra, and HBase are all fundamentally implemented as key-value stores.
- Data Structures: Exhibits strong similarities to Bigtable's data structures:
- Utilizes an in-memory data structure to track updates concurrently with writing to a commit log.
- Periodically, this in-memory data is flushed to disk, accompanied by summary information and a Bloom filter to accelerate query performance. Although not explicitly mentioned in the paper, it's highly probable that the on-disk data files are stored in the SSTable format. An SSTable is a serialized representation of key-value pairs, where keys are arranged in sorted order alongside their corresponding values.
- Data files undergo periodic compaction to optimize storage and improve read performance.
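To make this write path concrete, here is a minimal in-memory Python sketch of the memtable/commit-log/SSTable design described above. All names (LSMStore, SSTable, BloomFilter) are illustrative, not Cassandra's actual code, and real SSTables live on disk with index and summary blocks:

```python
from hashlib import md5

class BloomFilter:
    """Tiny Bloom filter; hash positions derived from md5 (illustrative only)."""

    def __init__(self, size=1024, hashes=3):
        self.bits = [False] * size
        self.size, self.hashes = size, hashes

    def _positions(self, key):
        for i in range(self.hashes):
            yield int(md5(f"{i}:{key}".encode()).hexdigest(), 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

class SSTable:
    """Immutable sorted key-value file (kept in memory here), plus a Bloom
    filter so reads can skip tables that definitely lack the key."""

    def __init__(self, items):
        self.data = dict(items)
        self.sorted_keys = sorted(self.data)  # keys stored in sorted order
        self.bloom = BloomFilter()
        for key in self.data:
            self.bloom.add(key)

    def get(self, key):
        if not self.bloom.might_contain(key):  # avoid touching "disk" at all
            return None
        return self.data.get(key)

class LSMStore:
    """Write path: append to commit log, update memtable; flush memtables to
    SSTables; serve reads newest-first; compact SSTables periodically."""

    def __init__(self, memtable_limit=4):
        self.commit_log = []  # stand-in for the durable append-only log
        self.memtable = {}
        self.sstables = []  # oldest first, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.commit_log.append((key, value))  # durability before visibility
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        self.sstables.append(SSTable(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:  # freshest data first
            return self.memtable[key]
        for table in reversed(self.sstables):  # newer tables shadow older ones
            value = table.get(key)
            if value is not None:
                return value
        return None

    def compact(self):
        """Merge all SSTables into one, keeping only the newest value per key."""
        merged = {}
        for table in self.sstables:  # oldest first, so newer values overwrite
            merged.update(table.data)
        self.sstables = [SSTable(merged.items())]
```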
The presence of a distributed file system could have significantly simplified Cassandra's implementation. All replicas of a given partition could have seamlessly relied on a consistent view of the data stored within the distributed file system. However, without a distributed file system, the Cassandra team had to implement a more intricate solution: consistent hashing for data partitioning and replication, coupled with sophisticated membership control mechanisms involving failure detectors and gossip protocols.
Range Assignment
Cassandra employs a quorum mechanism. Each key is replicated on N nodes, where N is the configured replication factor. A write operation is considered successful once the coordinator receives successful write acknowledgments from a quorum (a majority) of the N replicas.
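Below is a minimal Python sketch of how consistent hashing can pick the N replicas for a key and how a coordinator might count acknowledgments against a quorum. The Ring and quorum_write names, and the send callback, are hypothetical stand-ins for Cassandra's internals:

```python
import hashlib
from bisect import bisect_right

def token(name: str) -> int:
    """Place a key or node on the ring; md5 keeps the sketch simple."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash ring with N-way replication."""

    def __init__(self, nodes, replication_factor=3):
        self.n = replication_factor
        self.tokens = sorted((token(node), node) for node in nodes)

    def replicas(self, key):
        """The first N distinct nodes walking clockwise from the key's token."""
        start = bisect_right(self.tokens, (token(key), chr(0x10FFFF)))
        chosen = []
        for i in range(len(self.tokens)):
            node = self.tokens[(start + i) % len(self.tokens)][1]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == self.n:
                break
        return chosen

def quorum_write(ring, key, value, send):
    """Succeed once a majority of the N replicas acknowledge the write."""
    replicas = ring.replicas(key)
    acks = sum(1 for node in replicas if send(node, key, value))
    return acks >= len(replicas) // 2 + 1

# Usage: five nodes, replication factor 3; `send` stands in for the real RPC.
ring = Ring([f"node-{i}" for i in range(5)])
ok = quorum_write(ring, "user:42", "some-value", send=lambda node, k, v: True)
```

Because any two majorities of the same N replicas must overlap, a read quorum always intersects the most recent write quorum, which is what lets quorum systems tolerate individual replica failures.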
Introduction to Failure Detectors
In the realm of computer science, the FLP impossibility result is a foundational concept. It states that in an asynchronous network where even a single node can crash, no deterministic algorithm can guarantee both safety and liveness (termination). Crash faults are always a possibility in large systems, and real-world networks are effectively asynchronous. This makes consensus, state machine replication, and atomic broadcast impossible to solve with guaranteed termination.
Algorithms like Paxos ensure safety but not termination. Achieving termination necessitates some degree of synchrony, whether it's partial or full synchronization within the network. Consequently, all practical consensus algorithms in real-world asynchronous networks rely on assumptions about network time bounds. These bounds are crucial for making progress and ensuring termination.
Failure detectors are essential components within these algorithms. They assist in approximately identifying and isolating faulty nodes, enabling the system to continue making progress despite the presence of failures.
Failure detectors can be classified based on their:
Degree of Completeness:
- Strong Completeness: Every faulty process is eventually suspected by every non-faulty process.
- Weak Completeness: Every faulty process is eventually suspected by at least one non-faulty process.
Degree of Accuracy:
- Strong Accuracy: No non-faulty process is ever falsely suspected.
- Weak Accuracy: At least one non-faulty process is never falsely suspected.
- Eventually Strong Accuracy: After some point in time, no non-faulty process is suspected (false suspicions may occur before that point).
- Eventually Weak Accuracy: After some point in time, at least one non-faulty process is never falsely suspected.
Failure Detector Algorithms
SWIM (Scalable Weakly-consistent Infection-style Process Group Membership Protocol): SWIM is a prominent failure detection algorithm known for its strong completeness and weak accuracy. However, a detailed discussion of SWIM falls outside the scope of this write-up.
Heartbeat-Based Failure Detectors: Each node periodically sends a heartbeat message to the nodes monitoring it. If no heartbeat arrives within a configured timeout, the node is suspected to have failed.
Ping-Ack Failure Detectors: The monitoring node periodically pings the monitored node and expects an acknowledgment within a timeout; a missing ack marks the node as suspected.
Accrual Failure Detectors:
φ is a suspicion metric used to assess the likelihood that a machine has failed; it grows as the probability of receiving a heartbeat from the machine shrinks.
φ(t) = -log₁₀(P(t))

where:
- φ(t) is the suspicion level at time t that the machine has failed (a logarithmic score, not a probability).
- P(t) is the probability of receiving a heartbeat from the machine at time t.
A higher φ value indicates a higher likelihood of machine failure. For example:
- φ = 1 implies a 10% probability of receiving a heartbeat.
- φ = 2 implies a 1% probability of receiving a heartbeat.
- φ = 3 implies a 0.1% probability of receiving a heartbeat.
and so on.
How is P(t) defined? By maintaining a sliding window of recent heartbeat inter-arrival times. The samples in this window are used to estimate a probability distribution f over inter-arrival times. Then:

P(t) = ∫ₜ^∞ f(x) dx
Depending on its requirements, the application can choose a cutoff threshold for φ above which it treats the node as failed.
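Here is a minimal sketch of such an accrual failure detector, assuming the exponential inter-arrival model that (as discussed below) fits Cassandra's gossip-driven heartbeats; the class name and threshold values are illustrative:

```python
import math
import time
from collections import deque

class PhiAccrualDetector:
    """Sketch of an accrual failure detector. Inter-arrival times are modeled
    with an exponential distribution, so P(t) = e^(-t/mean) and
    φ(t) = -log10(P(t))."""

    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)  # sliding window of heartbeat gaps
        self.last_heartbeat = None

    def heartbeat(self, now=None):
        now = time.monotonic() if now is None else now
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now=None):
        now = time.monotonic() if now is None else now
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # Exponential model: P(t) = ∫ₜ^∞ (1/mean)·e^(-x/mean) dx = e^(-t/mean)
        p = math.exp(-elapsed / mean)
        return -math.log10(p) if p > 0 else float("inf")

    def suspect(self, threshold=8.0, now=None):
        """Apply the application-chosen cutoff; φ = 8 means the chance of a
        heartbeat still arriving is about 1 in 10^8."""
        return self.phi(now) > threshold
```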
Gossip Protocols
Gossip protocols constitute a class of broadcast protocols within P2P networks. Other classes include point-to-point (one-to-one) and eager reliable (one-to-all) protocols.
The core principle of gossip protocols involves each node periodically transmitting a message to a randomly selected subset of other nodes. This decentralized approach ensures that, with high probability, a particular message will eventually propagate throughout the entire system. This characteristic finds valuable applications in various domains, including failure detection.
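As a toy illustration, the sketch below runs rounds of push gossip over a versioned key-value state; gossip_round and the version-based merge are illustrative assumptions rather than any particular system's protocol:

```python
import random

def gossip_round(nodes, state, fanout=3):
    """One round of push gossip: every node forwards its versioned state to
    `fanout` random peers, which keep the entry with the higher version."""
    for sender in nodes:
        peers = random.sample([n for n in nodes if n != sender], fanout)
        for peer in peers:
            for key, (value, version) in state[sender].items():
                if key not in state[peer] or state[peer][key][1] < version:
                    state[peer][key] = (value, version)

# A rumor injected at one node reaches all 16 nodes in O(log N) expected rounds.
nodes = [f"node-{i}" for i in range(16)]
state = {node: {} for node in nodes}
state["node-0"]["rumor"] = ("hello", 1)
rounds = 0
while any("rumor" not in state[node] for node in nodes):
    gossip_round(nodes, state)
    rounds += 1
```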
Types of Gossip Protocols
Gossip protocols are typically characterized by two key tunables: the required time for message dissemination and network usage. Based on these parameters, several distinct types of gossip protocols have emerged:
Anti-Entropy: In anti-entropy protocols, nodes transmit their entire message set to other nodes. This approach generally results in high bandwidth consumption. To mitigate this, diffing techniques can be employed to avoid redundant data transmission. The term "anti-entropy" originates from the protocol's objective: to minimize the information divergence between nodes by exchanging complete datasets.
Rumor-Mongering: Rumor-mongering protocols, also known as dissemination protocols, focus on transmitting only updates or changes to the message set. This approach significantly reduces network usage, enabling more frequent data exchange.
Aggregation: Aggregation protocols facilitate the exchange of system-wide aggregates through sampling techniques.
Cassandra and Failure Detection
Within a range, it is crucial for replicas to detect the failure of the coordinator, enabling the election of a new coordinator.
Cassandra utilizes an anti-entropy gossiping mechanism in conjunction with an accrual failure detector. In each exchange, a node transmits its entire collection of heartbeat information (received from other nodes) to its peers. This comprehensive exchange often results in shorter heartbeat intervals. Consequently, an exponential distribution typically provides a more accurate representation of heartbeat arrival times compared to a Gaussian distribution.
After a failure is detected, Cassandra elects a leader amongst its nodes using Zookeeper.