Posts

Paper Insights - The Eternal Tussle: Exploring the Role of Centralization in IPFS

Paper Link I'd like to delve into a technology I have significant research experience with: the InterPlanetary File System (IPFS). While the original IPFS white paper - IPFS - Content Addressed, Versioned, P2P File System - was influential, it was never peer-reviewed. I'll therefore focus on a related paper presented at the 2024 Networked Systems Design and Implementation (NSDI) conference, a prestigious venue in distributed systems research. This paper was also discussed in Stanford's CS244b class. For background, I recommend reading my "Paper Insights - Dynamo: Amazon's Highly Available Key-value Store", where I discuss consistent hashing and its techniques. Now, let's explore some fundamental concepts of IPFS. Decentralized Web Traditional websites, much like 1-800 numbers, place the full cost of hosting and computation on the website owner. In contrast, the decentralized web leverages a distributed network of nodes. This means that the computationa...
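The core IPFS idea referenced above is content addressing: data is looked up by a hash of its bytes rather than by location. Here is a minimal sketch of that idea in Python; it uses a plain SHA-256 hex digest and an in-memory dict as stand-ins for IPFS's real multihash-based CIDs and distributed block store, so the names `content_address`, `put`, and `get` are illustrative, not IPFS APIs.

```python
import hashlib

def content_address(data: bytes) -> str:
    """Derive an address from the content itself (a simplified stand-in
    for IPFS's multihash-based CIDs)."""
    return hashlib.sha256(data).hexdigest()

# Toy content-addressed store: the key is fully determined by the value.
store = {}

def put(data: bytes) -> str:
    cid = content_address(data)
    store[cid] = data
    return cid

def get(cid: str) -> bytes:
    data = store[cid]
    # Retrieval is self-verifying: recompute the hash and check it,
    # so any node can serve the data without being trusted.
    assert content_address(data) == cid
    return data

cid = put(b"hello, decentralized web")
assert get(cid) == b"hello, decentralized web"
```

This self-verifying property is what lets untrusted peers in a decentralized web serve content on the owner's behalf.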

Paper Insights - Dynamo: Amazon's Highly Available Key-value Store

Paper Link This groundbreaking paper, presented at SOSP 2007, has become a cornerstone in the field of computer systems, profoundly influencing subsequent research and development. It served as a blueprint for numerous NoSQL databases, including prominent examples like MongoDB, Cassandra, and Azure Cosmos DB. A deep dive into this work is essential for anyone interested in distributed systems. It explores several innovative concepts that will captivate and enlighten readers. Recommended Read - Paper Insights - Cassandra - A Decentralized Structured Storage System, where I discuss failure detection and gossip protocols in detail. Let's visit some fundamental ideas (with a caution that there are several of them!). Distributed Hash Tables (DHTs) A DHT is a decentralized system that provides a lookup service akin to a traditional hash table. Key characteristics of DHTs include: Autonomy and Decentralization: Nodes operate independently, forming the system without ce...
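The consistent hashing underlying Dynamo-style DHT lookups can be sketched briefly: nodes and keys are hashed onto the same ring, and a key belongs to the first node clockwise from it. This is a minimal illustration only (no virtual nodes, replication, or membership changes, all of which Dynamo adds); the class and hash choice are my own, not from the paper.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # Any stable hash works for illustration; Dynamo uses MD5.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hash ring: keys map to the first node
    clockwise from their hash position."""

    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)
        self.positions = [p for p, _ in self.ring]

    def lookup(self, key: str) -> str:
        # bisect finds the first node position past the key; wrap around.
        i = bisect.bisect(self.positions, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.lookup("some-key")  # always the same node for the same key
```

The appeal over `hash(key) % n` is that adding or removing one node only remaps the keys adjacent to it on the ring, not the whole keyspace.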

Paper Insights - The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

Paper Link This influential paper from Google, presented at VLDB 2015, is a landmark in Data Engineering. Authored by Tyler Akidau, a distinguished engineer at Snowflake and former Google employee, it explores groundbreaking concepts. Akidau's work on MillWheel significantly influenced Apache Flink, while his Dataflow Model laid the groundwork for Apache Beam. Notably, Google Cloud Dataflow implements the Apache Beam framework. For a deeper understanding, I recommend a valuable YouTube playlist that effectively explains the core ideas presented in this paper. Motivating Example This paper focuses on streaming systems. For a better understanding of the context, I recommend reviewing my previous post, Paper Insights - Apache Flink™: Stream and Batch Processing in a Single Engine, which explains the key differences between batch and stream processing. In a streaming system, events arrive continuously and are processed in an ongoing manner. The core concep...
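A central Dataflow idea worth previewing is event-time windowing: events are grouped by when they *occurred*, not when they *arrived*, so late or out-of-order data still lands in the right window. Here is a tiny sketch of fixed-width (tumbling) windows under that idea; the function name and integer timestamps are my own simplifications, and real systems add triggers and watermarks to decide when a window's result is emitted.

```python
from collections import defaultdict

def tumbling_window(events, width):
    """Assign (event_time, value) pairs to fixed-width event-time
    windows and sum the values per window."""
    windows = defaultdict(int)
    for event_time, value in events:
        window_start = (event_time // width) * width
        windows[window_start] += value
    return dict(windows)

# Arrival order is scrambled, but results depend only on event time:
events = [(12, 1), (3, 2), (7, 4), (14, 1)]
result = tumbling_window(events, width=10)
# window starting at 0 sums to 6; window starting at 10 sums to 2
```

Because assignment depends only on the event timestamp, reprocessing the same events in any order yields the same windows.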

Paper Insights - Apache Flink™: Stream and Batch Processing in a Single Engine

Paper Link This paper was presented at the IEEE International Conference on Cloud Engineering Workshop in 2015. Workshops, generally considered less prestigious than the main conference itself, provide a forum for more focused discussions and specialized research. The authors of the paper acknowledge that the work presented is not solely their own, demonstrating commendable modesty in recognizing the contributions of others. Batch and Stream Processing Data processing involves performing operations on collections of data, often represented as records. These records can reside in various locations, such as database tables or files. Two primary approaches to data processing are: Batch Processing Processes data in discrete batches or groups. There are certain primitive operations in batch processing: Map: Applies a function to each individual record within a batch, producing an output for each input. Reduce: Aggregates the results of the map operation, often by combining values associa...
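The map and reduce primitives described above can be shown with the classic word-count example: map tokenizes each record, and reduce aggregates counts per word. This is a single-process sketch of the pattern, not how Flink or any framework implements it.

```python
from collections import Counter
from itertools import chain

def word_count(batch):
    """Word count as map (tokenize each record) then reduce
    (aggregate per-word counts across all records)."""
    mapped = (line.split() for line in batch)      # map phase
    counts = Counter(chain.from_iterable(mapped))  # reduce phase
    return dict(counts)

word_count(["to be or not to be"])
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In a distributed engine the map phase runs in parallel across records, and the reduce phase combines partial counts that share the same key.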

Paper Insights - Firecracker: Lightweight Virtualization for Serverless Applications

Paper Link This paper, co-authored by Alexandru Agache and distinguished AWS scientist Marc Brooker, along with other researchers, was presented at the esteemed NSDI '20 conference. Before we explore the paper's ideas in depth, let's establish some foundational context. Serverless Computing Serverless computing is a cloud computing paradigm where the cloud provider dynamically allocates machine resources, especially compute, as needed. The cloud provider manages the underlying servers. AWS offers a comprehensive serverless stack with a diverse range of products, including: API Gateway: For creating and managing APIs. Lambda: For executing serverless compute functions. Simple Storage Service (S3): For serverless object storage. This paper primarily focuses on Lambda, a serverless compute platform. However, a foundational understanding of cloud object stores like S3 is also crucial. For those seeking a deeper dive into cloud object stores, I recommend exploring the De...
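To make the Lambda model concrete: the unit of deployment is a handler function that the provider invokes on demand, with no server managed by the author. Below is a minimal sketch in the standard shape of a Python Lambda handler (an `event` dict and a `context` object); the event fields and response body here are my own illustrative choices, not a specific AWS event format.

```python
import json

def lambda_handler(event, context):
    """Minimal serverless handler: the platform calls this per request;
    the author never provisions or manages the machine it runs on."""
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }

# Locally we can exercise the same function directly (context is unused):
resp = lambda_handler({"name": "serverless"}, None)
```

Because the handler is stateless and short-lived, the provider can spin instances up and down freely, which is exactly the workload Firecracker's lightweight microVMs are designed to isolate cheaply.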