Paper Link
Before we delve into the paper's details, I would like to introduce some topics to readers.
Cloud Data Store
The paper effectively describes the design of a cloud data store. Due to its key-value nature and simple API, it has seen wider adoption than a fully-fledged distributed file system. Popular examples of cloud data stores include
Google Cloud Storage,
Amazon S3, and
Azure Blob Storage.
Design Points:
- Key-Value Store with Eventual Consistency: Functions as a key-value store with eventual consistency. Keys resemble file paths (strings) while values can be byte arrays ranging from a few kilobytes to terabytes.
- Data Immutability: In most cloud stores, data is immutable. Appends are possible but generally not optimal. Unlike a file system where appends result in adding a new block to chunk servers and updating metadata, appends in a cloud store can be less efficient.
- Data Partitioning (Buckets): Keys are partitioned into buckets, analogous to directories in a distributed file system.
Limited Functionality: The key-value store is designed for simplicity. Transaction support, especially across multiple keys, is notably absent. For example, consider a scenario where key A has the value 10 in the store. Transaction support would be necessary to atomically read the value of A and then write its double (20) back to the store. Without transaction support, another write operation could concurrently modify the value of A to 30 before the first operation completes, leading to an incorrect result (20 instead of 60).
Performance Characteristics: Typical latencies range from 5 to 10 ms. Achievable throughput typically falls within the 50-100 MB/s range.
Database, Data Warehouse, and Data Lakes
Databases: While the term database is broad, it commonly refers to transactional databases (OLTP). These databases, typically SQL-based, prioritize transactional integrity and offer varying levels of isolation. NoSQL databases, on the other hand, often sacrifice some isolation guarantees for improved performance. A fundamental principle of databases is that data must always maintain consistency.
Example of Consistency: In a bank's table, the total balance across all accounts must remain constant regardless of the number of transfers. If two transactions - one at time T1 and another at time T2 each add $1 to every account, the expected behavior is:
Before T1: No accounts have $1 added.
After T1: All accounts have $1 added.
After T2: All accounts have $2 added.
Consistency in databases have time semantics.
Data Warehouses: Designed for Online Analytical Processing (OLAP), data warehouses prioritize historical data for analysis. While data in a data warehouse must be consistent, it does not necessarily reflect real-time changes. In the example above, balances might not immediately reflect the latest transactions, but consistency is maintained. All accounts will reflect the same updated balance at a later point in time.
Data Lakes: Data lakes are repositories for storing large volumes of raw data in various formats (structured, semi-structured, unstructured). This raw data can then be processed through ETL pipelines to create data warehouses.
Data Pipelines (ETL): These processes are used to extract, transform, and load data into data warehouses. Data sources for ETL pipelines typically include other databases, data lakes, or other data warehouses.
The Goal: Building a Database on top of a Cloud Data Store
In this paper, the authors aim to construct an ACID table on top of a cloud data store. This approach can be extended to any eventually consistent key-value store given that the key-value store is not transient (like
Memcache).
It is crucial to remember that this creates an ACID table, not an ACID database. Therefore, transactions spanning multiple tables may still violate ACID properties.
Cloud Data Store v/s Distributed File System:
Distributed File Systems have gained widespread popularity since the advent of Google File System (GFS, also known as
Colossus). A distributed file system offers a more comprehensive file system interface ideal for database systems. Colossus and
HDFS are well-known examples of distributed file systems.
A fundamental question arises: Why build upon a cloud data store instead of a distributed file system? To answer this, we must understand the distinction between compute and storage layers in cloud environments.
Cloud Computing Architecture:
Most cloud applications are composed of microservices, each with distinct compute and storage components:
- Storage Layer: Comprises the underlying storage medium (e.g., disks) where data is persisted and accessed for computation.
- Compute Layer: Handles user requests, performs computations, and typically involves a replicated service running continuously.
This architecture is well-suited for low-latency, real-time services, allowing independent scaling of storage and compute resources.
Distributed File Systems has inherent compute overhead. Distributed file systems incorporate a compute layer consisting of chunk servers (handling read/write requests) and metadata servers (managing metadata updates). These services run constantly, consuming compute resources even when idle.
On the other hand, Cloud Data Stores leverage serverless architectures like
Lambda. The compute layer is invoked on-demand, only when data is read or modified. This pay-per-use model offers cost-effectiveness compared to the continuous compute consumption of distributed file systems, databases, or data warehouses.
With the goal in mind, let's see how the authors approached this challenge.
Data Format
Tabular data, inherently two-dimensional with rows and columns, is typically serialized into files – continuous sequences of bytes – for storage. Two primary serialization strategies exist:
- Row-oriented Encoding: Common examples include CSV, PostgreSQL, and Apache Avro.
- Column-oriented Encoding: Popular choices include Parquet, ORC, Arrow, and DuckDB. Columnar formats often achieve superior compression, leading to improved performance.
Another notable format is the SSTable, where keys are stored in sorted order.
Apache Parquet leverages Apache Thrift, a serialization protocol comparable to Google Protocol Buffers, as its underlying framework. Delta lake stores records in Parquet files.
Transactions
table/date1/f1.data
f2.data
...
/date2/f1.data
f2.data
...
To handle data modifications, a log-based approach is employed. When a record is updated, the existing Parquet file is not directly altered. Instead:
- A new Parquet file is created to accommodate the changes.
- The original file is soft-deleted, meaning it's marked as inactive but not immediately removed.
Log records maintain a history of additions and deletions. Each log entry corresponds to a specific Parquet file. To enhance system stability and performance, these log records are periodically snapshotted into checkpoints.
To further optimize performance, data files are partitioned. This partitioning strategy improves the efficiency of the aforementioned operations.
Read-Only Transactions
By default, all reads are performed as snapshot reads. Serializable reads are not natively supported. To achieve serializable read behavior, a workaround is necessary: a dummy write operation must be executed.
Performance
Due to its current implementation, the system can only process one transaction at a time, significantly limiting its throughput. This limitation may be acceptable if the database is intended for use in environments with low transaction volumes.
Querying
As noted earlier, Parquet files store data in a columnar format. Each Parquet file includes summary statistics for each column, enabling efficient query processing by allowing the system to quickly identify and skip files that do not contain the relevant data. This approach is particularly effective when data is sorted by the queried column.
However, if data records are not sorted based on the query column, performing range queries on that column can become significantly more expensive.
Use a Secondary Index?
Building and maintaining a secondary index requires cross-table transactions, which are not supported by Delta Lake.
Solution: Z-Ordering
Z-ordering is a technique used to optimize the physical layout of data records within Parquet files. By intelligently sorting data based on multiple attributes, Z-ordering enhances data locality.
Instead of sorting data along a single dimension, Z-ordering considers multiple attributes simultaneously. It aims to group together data points that are spatially close to each other in the multi-dimensional space.
Consider a 2D space with points represented by (X, Y) coordinates.
A simple Z-order might traverse the space in the following order:
(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (1, 4), (2, 3), (2, 4),
(3, 1), (3, 2), (4, 1), (4, 2), (3, 3), (3, 4), (4, 3), (4, 4)
Z-ordering significantly reduces the search space when filtering on multiple attributes, leading to faster query execution. By grouping related data together, Z-ordering improves data locality, which can lead to better data compression and faster read/write operations.
Extensibility:
Z-ordering can be extended to higher dimensions, making it suitable for complex datasets with multiple attributes.
Alternatives
Alternative approaches to building ACID stores on top of cloud data stores often involve employing a separate transactional store for object metadata, analogous to metadata servers in distributed file systems. However, Delta Lake addresses this challenge by seamlessly integrating metadata management within the object store itself.
Related Systems
The paper introduces several key cloud systems and technologies:
Data Warehousing
- Google Cloud BigQuery: A data warehouse service built on top of Dremel, leveraging Colossus for distributed storage. It offers columnar storage and separates storage from processing for scalability.
- Amazon Redshift: A columnar-oriented data warehouse similar to BigQuery.
- Snowflake: A cloud-agnostic data warehouse with a multi-cluster shared architecture.
- Apache Hive: An open-source data warehouse built on Hadoop, providing a SQL-like query interface.
Columnar Data Formats- Apache Parquet: A widely used columnar storage format offering efficient compression and encoding schemes.
- Apache Iceberg: A data format for analytic tables, enabling SQL-based queries across various engines like Spark, Flink, Hive, and Impala.
NoSQL Databases
- Apache Kudu: A columnar-oriented data store within the Hadoop ecosystem, optimized for OLAP workloads and utilizing a relational schema.
- Apache HBase: A non-relational, distributed database supporting wide-column storage, inspired by Google Bigtable.
Query Service
- Amazon Redshift Spectrum: Enables querying data directly in Amazon S3 without prior loading into Redshift, decoupling storage from processing and optimizing costs.
- AWS Athena: An interactive query service that allows users to analyze data directly in Amazon S3 using standard SQL.
Competitor
- Apache Hudi: An open-source framework for managing data lakes, offering features like transactions, upserts, deletes, indexing, clustering, and compaction, positioning it as a direct competitor to Delta Lake.
Final Thoughts
While the core concept of the paper seems straightforward, the actual implementation likely presented significant engineering challenges. Given the problem statement, I initially anticipated a design where table metadata would be managed by a strongly consistent service.
My primary concern revolves around the use of a single log file. This centralized approach could potentially introduce a single point of contention, limiting concurrency and making optimistic concurrency control difficult or inefficient.
The paper devotes a considerable portion (two pages) to discussing business use cases for Delta Lake, specifically emphasizing the need for millions of partitions. While I acknowledge the potential for such scenarios, I remain somewhat skeptical about the practical frequency of these extreme use cases.
Comments
Post a Comment