Paper Insights - Cassandra - A Decentralized Structured Storage System

Paper Link

This research paper, authored by Avinash Lakshman (co-inventor of Amazon Dynamo) and Prashant Malik, originates from Facebook and dates back to 2008.

Paper Overview

Cassandra, in its design, appears to be a synthesis of Amazon's Dynamo (2007) and Google's Bigtable (2006). It draws heavily upon the concepts of both systems. Notably, this paper was published during the rise of these influential databases. However, Cassandra also introduces novel ideas that warrant further investigation.

Recommended Read: Paper Insights - Dynamo: Amazon's Highly Available Key-value Store

Let's begin with some fundamental concepts.

SQL Databases

SQL databases are a category of databases that are inherently consistent. This implies that data integrity is always upheld. For instance, in a banking database, the cumulative balance across all accounts must remain unchanged at all times, regardless of the number of transfer transactions.

To ensure this data consistency (the C in ACID), SQL databases necessitate support for atomic operations (the A in ACID), transaction isolation (the I in ACID), and durable storage (the D in ACID).

SQL databases can be categorized into two types:

  • Real-time, Strongly Consistent: Examples include Spanner, AuroraDB, and non-distributed versions of MySQL. These databases guarantee immediate visibility of transaction results, ensuring strong consistency in real-time.
  • Non-real-time, Consistent: Data warehouses like Mesa, Napa, and Snowflake fall under this category. While maintaining consistency, they may exhibit some data staleness as all transactions might not have been immediately applied.

Both Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) applications critically rely on SQL databases; they cannot be effectively built on NoSQL databases, which should not be used for transactional or analytical workloads.

NoSQL Databases

Databases that do not adhere to the SQL semantics discussed above are generally categorized as NoSQL databases. These databases are sometimes referred to as BASE databases, contrasting them with ACID databases. They often prioritize faster transaction processing by relaxing some of the stringent ACID guarantees.

For example, Bigtable, a prominent NoSQL database, supports atomicity only for modifications within a single row.

NoSQL databases exhibit diverse architectures and can be broadly classified into the following categories:

  • Document Databases: In these databases, keys are typically strings, and the corresponding values are structured documents, often in formats like JSON. Many document databases, including MongoDB and CouchBase, draw inspiration from the design principles of Amazon DynamoDB. While MongoDB has undergone significant evolution, its core concepts remain largely similar to those of DynamoDB.
  • Wide-Column Databases: These databases allow for a dynamic number of columns in a table. Prominent examples include Cassandra, HBase, and Bigtable, all of which are heavily influenced by the data structures and design principles of Google's Bigtable.
  • Key-Value Stores: These databases can be viewed as a more generalized form of document stores, offering a simpler key-value data model.
  • Graph Databases: Designed to efficiently store and query highly interconnected data, graph databases like Facebook's Tao and Neo4j excel at representing and analyzing relationships within complex networks.

Cassandra

Key Format: Employs a <row key, column_family : column> addressing scheme (super column families add one more level of nesting), closely resembling Bigtable. Notably, Bigtable, Cassandra, and HBase are all fundamentally implemented as key-value stores.

Data Structures: Exhibits strong similarities to Bigtable's data structures:
  • Utilizes an in-memory data structure to track updates concurrently with writing to a commit log.
  • Periodically, this in-memory data is flushed to disk, accompanied by summary information and a Bloom filter to accelerate query performance. Although not explicitly mentioned in the paper, it's highly probable that the on-disk data files are stored in the SSTable format. An SSTable is a serialized representation of key-value pairs, where keys are arranged in sorted order alongside their corresponding values.
  • Data files undergo periodic compaction to optimize storage and improve read performance.
System Architecture: Shares similarities with Dynamo, leveraging a chord-based consistent hashing scheme for data placement. 
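
To make the write path described under Data Structures concrete, here is a minimal sketch of a log-structured store in that spirit. Everything here (the MiniLSMStore class, the JSON file format, the memtable size limit) is my own illustration rather than Cassandra's actual implementation; a plain key set stands in for the Bloom filter.

```python
import json
import os

class MiniLSMStore:
    """Toy sketch of the Bigtable/Cassandra-style write path:
    commit log + in-memory memtable, flushed to sorted data files."""

    def __init__(self, data_dir, memtable_limit=4):
        self.data_dir = data_dir
        self.memtable = {}                 # in-memory, most recent writes
        self.memtable_limit = memtable_limit
        self.sstables = []                 # list of (path, key set) pairs
        os.makedirs(data_dir, exist_ok=True)
        self.commit_log = open(os.path.join(data_dir, "commit.log"), "a")

    def put(self, key, value):
        # 1. Append to the commit log for durability.
        self.commit_log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.commit_log.flush()
        # 2. Apply the write to the memtable.
        self.memtable[key] = value
        # 3. Flush to disk when the memtable grows too large.
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        path = os.path.join(self.data_dir, f"sstable-{len(self.sstables)}.json")
        with open(path, "w") as f:
            # Keys are written in sorted order, mimicking an SSTable.
            json.dump(dict(sorted(self.memtable.items())), f)
        # A real system stores a Bloom filter; a plain key set stands in here.
        self.sstables.append((path, set(self.memtable)))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newest files first; skip files whose key set rules the key out.
        for path, keys in reversed(self.sstables):
            if key in keys:
                with open(path) as f:
                    return json.load(f).get(key)
        return None
```

In the real system, each flushed file also carries a key index alongside the Bloom filter, and compaction merges these files in the background.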

It's noteworthy that Bigtable leverages Google File System (GFS) for efficient data storage and retrieval. In contrast, Cassandra does not rely on a distributed file system. This design choice may be attributed to the lack of a GFS equivalent within Facebook's infrastructure at the time of Cassandra's development.

The presence of a distributed file system could have significantly simplified Cassandra's implementation. All replicas of a given partition could have seamlessly relied on a consistent view of the data stored within the distributed file system. However, without a distributed file system, the Cassandra team had to implement a more intricate solution: consistent hashing for data partitioning and replication, coupled with sophisticated membership control mechanisms involving failure detectors and gossip protocols.

Range Assignment

As discussed above, Cassandra makes use of consistent hashing to distribute keys. Each range is assigned not to one, but to N nodes, one of which acts as the coordinator for that range. This replication is required to ensure that there is no single point of failure. Note that the same machine can be a node for multiple ranges.


In the diagram above, the key range between the red and purple dots is owned by the node whose position immediately follows that range on the ring, together with the next N-1 nodes clockwise, which hold its replicas.
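
A small sketch of this range assignment follows, assuming the simplest rack-unaware placement: a key is stored on the first node at or after its hash position on the ring, plus the next N-1 distinct nodes clockwise. The HashRing class and hash choice are illustrative, not Cassandra's actual code.

```python
import bisect
import hashlib

def ring_position(value: str) -> int:
    # Hash into a fixed circular space (32 bits here, for readability).
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    def __init__(self, nodes, replication_factor=3):
        self.n = replication_factor
        # Each node owns the ring position its name hashes to.
        self.ring = sorted((ring_position(node), node) for node in nodes)

    def replicas_for(self, key: str):
        """Return the N nodes responsible for `key`: the first node at or
        after the key's position, plus the next N-1 distinct nodes clockwise."""
        pos = ring_position(key)
        start = bisect.bisect_left(self.ring, (pos, ""))
        replicas = []
        for i in range(len(self.ring)):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in replicas:
                replicas.append(node)
            if len(replicas) == self.n:
                break
        return replicas

ring = HashRing(["node-a", "node-b", "node-c", "node-d", "node-e"])
print(ring.replicas_for("user:42"))   # e.g. ['node-c', 'node-d', 'node-e']
```
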

It is interesting to note that, at the time, there was a trend toward building truly decentralized systems in which every node is essentially identical.

Cassandra employs a quorum mechanism. Each range is replicated with a replication factor of N. A write operation is routed to all N replicas, but the coordinator considers it successful once it receives acknowledgments from a quorum of those replicas, so that any read quorum is guaranteed to overlap with the most recent successful write.
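
A hedged sketch of the quorum rule follows. The paper does not spell out this exact API; the replica.write call and the quorum sizes below are assumptions, chosen so that a write quorum W and read quorum R satisfy W + R > N and therefore always intersect.

```python
def quorum_write(replicas, key, value, write_quorum):
    """Send the write to every replica; succeed once `write_quorum` acks arrive."""
    acks = 0
    for replica in replicas:
        try:
            replica.write(key, value)     # assumed replica RPC; may raise on failure
            acks += 1
        except ConnectionError:
            continue                      # a failed replica is repaired later
    return acks >= write_quorum

# With N = 3 replicas, W = 2 and R = 2 satisfy W + R > N,
# so any read quorum overlaps any successful write quorum.
```
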

Introduction to Failure Detectors

In the realm of computer science, the FLP impossibility result is a foundational concept. It states that in an asynchronous network where even a single node may crash, no deterministic algorithm can guarantee both safety and liveness (termination). Crash faults are always a possibility in large systems, and real-world networks are asynchronous. This makes consensus, state machine replication, and atomic broadcast impossible to solve deterministically in such settings.

Algorithms like Paxos ensure safety but not termination. Achieving termination necessitates some degree of synchrony, whether it's partial or full synchronization within the network. Consequently, all practical consensus algorithms in real-world asynchronous networks rely on assumptions about network time bounds. These bounds are crucial for making progress and ensuring termination.

Failure detectors are essential components within these algorithms. They assist in approximately identifying and isolating faulty nodes, enabling the system to continue making progress despite the presence of failures.

Failure detectors can be classified based on their:

  • Degree of Completeness:

    • Strong Completeness: Every faulty process is eventually suspected by every non-faulty process.
    • Weak Completeness: Every faulty process is eventually suspected by at least one non-faulty process.
  • Degree of Accuracy:

    • Strong Accuracy: No non-faulty process is ever falsely suspected.
    • Weak Accuracy: Some non-faulty process is never suspected by any non-faulty process.
    • Eventually Strong Accuracy: After some point in time, no non-faulty process is suspected by any non-faulty process.
    • Eventually Weak Accuracy: After some point in time, some non-faulty process is never suspected by any non-faulty process.

Failure Detector Algorithms

SWIM (Scalable Weakly-consistent Infection-style process group Membership protocol): SWIM is a prominent failure detection and membership algorithm known for its strong completeness and weak accuracy. However, a detailed discussion of SWIM falls outside the scope of this write-up.

Heartbeat-Based Failure Detectors:


These represent the most fundamental class of failure detectors. They operate by monitoring the regular reception of heartbeat messages from other nodes. Heartbeat-based detectors exhibit strong completeness but only achieve eventual weak accuracy. This implies that while they will eventually detect a node failure, they may temporarily suspect healthy nodes as faulty.
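
For illustration, a minimal heartbeat-based detector can be written as a timestamp table plus a fixed timeout. The fixed timeout is exactly what limits accuracy: a slow but healthy peer may be suspected until its next heartbeat arrives. The class and parameter names are my own.

```python
import time

class HeartbeatFailureDetector:
    def __init__(self, timeout_seconds=5.0):
        self.timeout = timeout_seconds
        self.last_heard = {}              # peer -> timestamp of last heartbeat

    def on_heartbeat(self, peer):
        # Called whenever a heartbeat message arrives from `peer`.
        self.last_heard[peer] = time.monotonic()

    def suspected(self, peer):
        # A peer is suspected once its silence exceeds the timeout.
        last = self.last_heard.get(peer)
        if last is None:
            return False                  # never heard from it; no opinion yet
        return time.monotonic() - last > self.timeout
```
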

Ping-Ack Failure Detectors:



Similar to heartbeat-based detectors, ping-ack mechanisms rely on the exchange of ping messages and acknowledgments (acks) to monitor node health. They share the same strong completeness and eventual weak accuracy characteristics as heartbeat-based detectors.
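
A ping-ack detector probes peers actively rather than listening passively. The sketch below assumes a blocking send_ping(peer, timeout) helper, which is a placeholder for whatever RPC or UDP exchange the system actually uses.

```python
import socket

def is_alive(send_ping, peer, timeout_seconds=1.0):
    """Actively probe `peer`: send a ping and wait up to `timeout_seconds`
    for the ack. `send_ping(peer, timeout)` is a hypothetical helper, not a
    real library call."""
    try:
        send_ping(peer, timeout_seconds)  # assumed to raise socket.timeout if no ack
        return True
    except (socket.timeout, ConnectionError):
        return False                      # suspected, possibly wrongly (weak accuracy)
```
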

Accrual Failure Detectors:


Cassandra employs an accrual failure detector, which extends the heartbeat-based approach. Rather than emitting a binary alive/failed verdict, it outputs a continuous suspicion level with a tunable cutoff, allowing administrators to fine-tune the trade-off between detection speed and accuracy (the rate of false suspicions) to meet specific system requirements.

φ is a suspicion level used to assess how likely it is that a machine has failed; it grows as the probability of still receiving a heartbeat shrinks.

φ(t) = -log10 (P(t))

φ(t) is the suspicion level at time t, where t is measured from the last received heartbeat.

P(t) is the probability that a healthy machine would still be silent at time t, i.e., that the current heartbeat inter-arrival gap exceeds t.

A higher φ value indicates a stronger suspicion that the machine has failed. For example:
  • φ = 1 implies a 10% probability of receiving a heartbeat.
  • φ = 2 implies a 1% probability of receiving a heartbeat.
  • φ = 3 implies a 0.1% probability of receiving a heartbeat.

and so on.

How is P(t) defined? It is estimated by maintaining a sliding window of heartbeat inter-arrival times. The samples within this window are used to fit a probability distribution. Then:

P(t) = integral of this distribution from t to ∞.

Depending on its requirements, an application chooses a cutoff for φ: once φ crosses that threshold, the node is declared failed.
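
Putting the pieces together, the φ computation can be sketched directly from these definitions. The class below is illustrative, not Cassandra's production detector; it summarizes the sliding window by its sample mean and plugs it into an exponential model for P(t) (an exponential fit is also what the paper reports, as noted later).

```python
import math
import time
from collections import deque

class PhiAccrualDetector:
    def __init__(self, window_size=100):
        self.intervals = deque(maxlen=window_size)   # sliding window of heartbeat gaps
        self.last_heartbeat = None

    def on_heartbeat(self):
        now = time.monotonic()
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self):
        """phi(t) = -log10(P(t)), where t is the time since the last heartbeat
        and P(t) = exp(-t / mean) under an exponential model of the gaps."""
        if self.last_heartbeat is None or not self.intervals:
            return 0.0
        t = time.monotonic() - self.last_heartbeat
        mean = sum(self.intervals) / len(self.intervals)
        p = math.exp(-t / mean)
        return -math.log10(p) if p > 0 else float("inf")

# The application picks a cutoff, e.g. treat the node as failed once phi() > 8,
# trading detection speed against the risk of falsely suspecting a slow node.
```
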

Gossip Protocols

Gossip protocols constitute a class of broadcast protocols within P2P networks. Other classes include point-to-point (one-to-one) and eager reliable (one-to-all) protocols.

The core principle of gossip protocols involves each node periodically transmitting a message to a randomly selected subset of other nodes. This decentralized approach ensures that, with high probability, a particular message will eventually propagate throughout the entire system. This characteristic finds valuable applications in various domains, including failure detection.

Types of Gossip Protocols

Gossip protocols are typically characterized by two key tunables: the required time for message dissemination and network usage. Based on these parameters, several distinct types of gossip protocols have emerged:

  • Anti-Entropy: In anti-entropy protocols, nodes transmit their entire message set to other nodes. This approach generally results in high bandwidth consumption. To mitigate this, diffing techniques can be employed to avoid redundant data transmission. The term "anti-entropy" originates from the protocol's objective: to minimize the information divergence between nodes by exchanging complete datasets.

  • Rumor-Mongering: Rumor-mongering protocols, also known as dissemination protocols, focus on transmitting only updates or changes to the message set. This approach significantly reduces network usage, enabling more frequent data exchange.

  • Aggregation: Aggregation protocols facilitate the exchange of system-wide aggregates through sampling techniques.

Fundamentally, all gossip protocols share a common approach: they disseminate information by having each node randomly transmit messages to a subset of its peers. This decentralized strategy relies on the probabilistic nature of these random transmissions to ensure that information eventually propagates throughout the entire network.
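
The shared skeleton can be shown in a few lines: each round, every node pushes its state to a small random subset of peers, and receivers merge what arrives. The example below gossips version-numbered key-value entries in an anti-entropy flavour; the merge rule, fanout, and round count are assumptions for illustration.

```python
import random

class GossipNode:
    def __init__(self, name):
        self.name = name
        self.state = {}                   # key -> (version, value)

    def update(self, key, value):
        version = self.state.get(key, (0, None))[0] + 1
        self.state[key] = (version, value)

    def merge(self, incoming):
        # Keep whichever copy of each entry has the higher version.
        for key, (version, value) in incoming.items():
            if version > self.state.get(key, (0, None))[0]:
                self.state[key] = (version, value)

def gossip_round(nodes, fanout=2):
    # Each node pushes its full state to `fanout` randomly chosen peers.
    for node in nodes:
        peers = random.sample([n for n in nodes if n is not node],
                              k=min(fanout, len(nodes) - 1))
        for peer in peers:
            peer.merge(dict(node.state))

nodes = [GossipNode(f"n{i}") for i in range(8)]
nodes[0].update("heartbeat:n0", "generation-1")
for _ in range(4):                        # a few rounds spread the update widely
    gossip_round(nodes)
print(sum("heartbeat:n0" in n.state for n in nodes))  # most (often all) 8 nodes
```
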

Cassandra and Failure Detection

Within a range, it is crucial for replicas to detect the failure of the coordinator, enabling the election of a new coordinator.

Cassandra utilizes an anti-entropy gossiping mechanism in conjunction with an accrual failure detector. In each exchange, a node transmits its entire collection of heartbeat information (received from other nodes) to its peers. This comprehensive exchange often results in shorter heartbeat intervals. Consequently, an exponential distribution typically provides a more accurate representation of heartbeat arrival times compared to a Gaussian distribution.
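
As a back-of-the-envelope aside (this closed form is my own, not from the paper): if heartbeat inter-arrival times follow an exponential distribution with mean m, then P(t) = e^(-t/m), and therefore

φ(t) = -log10(e^(-t/m)) = t / (m · ln 10) ≈ 0.434 · t / m

so the suspicion level grows linearly with the length of the silence t, at a rate set by the observed mean heartbeat interval m.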

Following failure detection, Cassandra elects a leader amongst its nodes using ZooKeeper.

Paper Review

This paper is remarkably concise, covering a wide range of concepts within a relatively small number of pages. While not overly challenging to read and comprehend, I recommend familiarizing oneself with Dynamo and Bigtable beforehand to gain a deeper understanding of the context and related concepts.


