Posts

Paper Insights - The Eternal Tussle: Exploring the Role of Centralization in IPFS

Paper Link I'd like to delve into a technology I have significant research experience with: the InterPlanetary File System (IPFS). While the original IPFS white paper - IPFS - Content Addressed, Versioned, P2P File System - was influential, it was never peer-reviewed. I'll therefore focus on a related paper presented at the 2024 Networked Systems Design and Implementation (NSDI) conference, a prestigious venue in distributed systems research. This paper was also discussed in Stanford's CS244b class. For background, I recommend reading my "Paper Insights - Dynamo: Amazon's Highly Available Key-value Store", where I discuss consistent hashing and its techniques. Now, let's explore some fundamental concepts of IPFS. Decentralized Web Traditional websites, much like 1-800 numbers, place the full cost of hosting and computation on the website owner. In contrast, the decentralized web leverages a distributed network of nodes. This means that the computationa...
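The core IPFS idea referenced above is content addressing: data is looked up by a hash of its bytes rather than by location. Here is a minimal sketch of that idea in Python; it uses a plain SHA-256 hex digest and an in-memory dict as stand-ins for IPFS's real multihash-based CIDs and distributed block store, so the names `content_address`, `put`, and `get` are illustrative, not IPFS APIs.

```python
import hashlib

def content_address(data: bytes) -> str:
    """Derive an address from the content itself (a simplified stand-in
    for IPFS's multihash-based CIDs)."""
    return hashlib.sha256(data).hexdigest()

# Toy content-addressed store: the key is fully determined by the value.
store = {}

def put(data: bytes) -> str:
    cid = content_address(data)
    store[cid] = data
    return cid

def get(cid: str) -> bytes:
    data = store[cid]
    # Retrieval is self-verifying: recompute the hash and check it,
    # so any node can serve the data without being trusted.
    assert content_address(data) == cid
    return data

cid = put(b"hello, decentralized web")
assert get(cid) == b"hello, decentralized web"
```

This self-verifying property is what lets untrusted peers in a decentralized web serve content on the owner's behalf.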

Paper Insights - Dynamo: Amazon's Highly Available Key-value Store

Paper Link This groundbreaking paper, presented at SOSP 2007, has become a cornerstone in the field of computer systems, profoundly influencing subsequent research and development. It served as a blueprint for numerous NoSQL databases, including prominent examples like MongoDB, Cassandra, and Azure Cosmos DB. A deep dive into this work is essential for anyone interested in distributed systems. It explores several innovative concepts that will captivate and enlighten readers. Recommended Read - Paper Insights - Cassandra - A Decentralized Structured Storage System, where I discuss failure detection and gossip protocols in detail. Let's visit some fundamental ideas (with a caution that there are several of them!). Distributed Hash Tables (DHTs) A DHT is a decentralized system that provides a lookup service akin to a traditional hash table. Key characteristics of DHTs include: Autonomy and Decentralization: Nodes operate independently, forming the system without ce...
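The consistent hashing underlying Dynamo-style DHT lookups can be sketched briefly: nodes and keys are hashed onto the same ring, and a key belongs to the first node clockwise from it. This is a minimal illustration only (no virtual nodes, replication, or membership changes, all of which Dynamo adds); the class and hash choice are my own, not from the paper.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # Any stable hash works for illustration; Dynamo uses MD5.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hash ring: keys map to the first node
    clockwise from their hash position."""

    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)
        self.positions = [p for p, _ in self.ring]

    def lookup(self, key: str) -> str:
        # bisect finds the first node position past the key; wrap around.
        i = bisect.bisect(self.positions, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.lookup("some-key")  # always the same node for the same key
```

The appeal over `hash(key) % n` is that adding or removing one node only remaps the keys adjacent to it on the ring, not the whole keyspace.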

Paper Insights - The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

Paper Link This influential paper from Google, presented at VLDB 2015, is a landmark in Data Engineering. Authored by Tyler Akidau, a distinguished engineer at Snowflake and former Google employee, it explores groundbreaking concepts. Akidau's work on MillWheel significantly influenced Apache Flink, while his Dataflow Model laid the groundwork for Apache Beam. Notably, Google Cloud Dataflow implements the Apache Beam framework. For a deeper understanding, I recommend a valuable YouTube playlist that effectively explains the core ideas presented in this paper. Motivating Example This paper focuses on streaming systems. For a better understanding of the context, I recommend reviewing my previous post, Paper Insights - Apache Flink™: Stream and Batch Processing in a Single Engine, which explains the key differences between batch and stream processing. In a streaming system, events arrive continuously and are processed in an ongoing manner. The core concep...
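A central Dataflow idea worth previewing is event-time windowing: events are grouped by when they *occurred*, not when they *arrived*, so late or out-of-order data still lands in the right window. Here is a tiny sketch of fixed-width (tumbling) windows under that idea; the function name and integer timestamps are my own simplifications, and real systems add triggers and watermarks to decide when a window's result is emitted.

```python
from collections import defaultdict

def tumbling_window(events, width):
    """Assign (event_time, value) pairs to fixed-width event-time
    windows and sum the values per window."""
    windows = defaultdict(int)
    for event_time, value in events:
        window_start = (event_time // width) * width
        windows[window_start] += value
    return dict(windows)

# Arrival order is scrambled, but results depend only on event time:
events = [(12, 1), (3, 2), (7, 4), (14, 1)]
result = tumbling_window(events, width=10)
# window starting at 0 sums to 6; window starting at 10 sums to 2
```

Because assignment depends only on the event timestamp, reprocessing the same events in any order yields the same windows.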

Paper Insights - Apache Flink™: Stream and Batch Processing in a Single Engine

Paper Link This paper was presented at the IEEE International Conference on Cloud Engineering Workshop in 2015. Workshops, generally considered less prestigious than the main conference itself, provide a forum for more focused discussions and specialized research. The authors of the paper acknowledge that the work presented is not solely their own, demonstrating commendable modesty in recognizing the contributions of others. Batch and Stream Processing Data processing involves performing operations on collections of data, often represented as records. These records can reside in various locations, such as database tables or files. Two primary approaches to data processing are: Batch Processing Processes data in discrete batches or groups. There are certain primitive operations in batch processing: Map: Applies a function to each individual record within a batch, producing an output for each input. Reduce: Aggregates the results of the map operation, often by combining values associa...
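The map and reduce primitives described above can be shown with the classic word-count example: map tokenizes each record, and reduce aggregates counts per word. This is a single-process sketch of the pattern, not how Flink or any framework implements it.

```python
from collections import Counter
from itertools import chain

def word_count(batch):
    """Word count as map (tokenize each record) then reduce
    (aggregate per-word counts across all records)."""
    mapped = (line.split() for line in batch)      # map phase
    counts = Counter(chain.from_iterable(mapped))  # reduce phase
    return dict(counts)

word_count(["to be or not to be"])
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In a distributed engine the map phase runs in parallel across records, and the reduce phase combines partial counts that share the same key.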

Paper Insights - Firecracker: Lightweight Virtualization for Serverless Applications

Paper Link This paper, co-authored by Alexandru Agache and distinguished AWS scientist Marc Brooker, along with other researchers, was presented at the esteemed NSDI '20 conference. Before we explore the paper's ideas in depth, let's establish some foundational context. Serverless Computing Serverless computing is a cloud computing paradigm where the cloud provider dynamically allocates machine resources, especially compute, as needed. The cloud provider manages the underlying servers. AWS offers a comprehensive serverless stack with a diverse range of products, including: API Gateway: For creating and managing APIs. Lambda: For executing serverless compute functions. Simple Storage Service (S3): For serverless object storage. This paper primarily focuses on Lambda, a serverless compute platform. However, a foundational understanding of cloud object stores like S3 is also crucial. For those seeking a deeper dive into cloud object stores, I recommend exploring the De...
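To make the Lambda model concrete: the unit of deployment is a handler function that the provider invokes on demand, with no server managed by the author. Below is a minimal sketch in the standard shape of a Python Lambda handler (an `event` dict and a `context` object); the event fields and response body here are my own illustrative choices, not a specific AWS event format.

```python
import json

def lambda_handler(event, context):
    """Minimal serverless handler: the platform calls this per request;
    the author never provisions or manages the machine it runs on."""
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }

# Locally we can exercise the same function directly (context is unused):
resp = lambda_handler({"name": "serverless"}, None)
```

Because the handler is stateless and short-lived, the provider can spin instances up and down freely, which is exactly the workload Firecracker's lightweight microVMs are designed to isolate cheaply.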