Big Data Processing - Basics: Part 1
A small series for understanding and optimizing Spark and other data processing engines.
Over the course of the last few years, I’ve had the opportunity to work on optimizing multiple batch jobs across multiple platforms and projects. This series shares a mental model that should come in handy for understanding, writing, and optimizing batch jobs.
Abstracting the basics
This series over-generalizes and significantly understates the complexity of the actual implementations of the underlying systems. It is only meant to point out the more commonly seen issues in typical implementations.
It helps to understand what goes on under the hood in batch systems before we jump into optimizing them. The most common components you’ll come across are:
Coordinator and Workers
Tasks
Shuffling
Processing
Reading and Writing
The diagram above illustrates some of the components we will be discussing.
Coordinator and Worker
Almost all jobs have a coordinator (aka manager, driver, master, etc.) and worker nodes.
The responsibilities of the coordinator are, as the name suggests, to:
Pick up what needs to be executed.
Create a plan that divides the huge job into stages, creating subtasks to be executed for each of these stages.
Manage stage completion and task progress check-pointing.
The responsibilities of the workers are to:
Pick up a task from the coordinator.
Execute the task.
Return the result, or commit it to temporary storage.
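To make this split concrete, here is a minimal, heavily simplified sketch in Python of a coordinator planning tasks and a worker pulling and executing them. All names here (Coordinator, Worker, Task) are hypothetical and only illustrate the division of responsibilities; real engines add scheduling, retries, data locality, and much more.

# Hypothetical sketch of the coordinator/worker split described above.
from dataclasses import dataclass
from queue import Queue, Empty

@dataclass
class Task:
    task_id: int
    query: str
    partition: str   # the slice of data this task is responsible for

class Coordinator:
    def __init__(self, query: str, partitions: list):
        # Plan: split the big job into one task per partition.
        self.pending = Queue()
        for i, partition in enumerate(partitions):
            self.pending.put(Task(task_id=i, query=query, partition=partition))
        self.results = {}

    def next_task(self):
        try:
            return self.pending.get_nowait()
        except Empty:
            return None

    def report(self, task_id: int, result):
        # Track progress as tasks complete.
        self.results[task_id] = result

class Worker:
    def run(self, coordinator: Coordinator):
        # Keep pulling tasks until the coordinator has none left.
        while (task := coordinator.next_task()) is not None:
            result = f"processed {task.partition} for query '{task.query}'"
            coordinator.report(task.task_id, result)

coordinator = Coordinator("count rows", ["part-0", "part-1", "part-2"])
Worker().run(coordinator)
print(coordinator.results)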
Tasks
Most batch systems process 100s of GBs or TBs of data, if not PBs. At this scale, we have already left the realm of individual machines and entered the realm of “horizontal scaling”. The huge job of processing 100s of TBs worth of data is split across partitioned files, and a query over these partitions can then be divided into 100K or more smaller tasks. A typical task only processes a partition, or part of a partition, and returns the result. It helps to look at a task as a simple piece of metadata that looks like:
task {
  task_id,
  query,
  read {
    partition: a
    from: x
    to: y
  }
}
So when a worker receives a task, it processes just its part of the query and returns the result back to the coordinator.
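As an illustration, here is a hypothetical sketch of a worker executing one such task, assuming the partition is a plain file and from/to are byte offsets; the “query” is reduced to a trivial line count.

# Hypothetical sketch: a worker runs one task by reading only the
# byte range [from, to) of its partition file and computing a result.
def run_task(task: dict) -> int:
    read = task["read"]
    with open(read["partition"], "rb") as f:
        f.seek(read["from"])                       # jump to this task's start offset
        chunk = f.read(read["to"] - read["from"])  # read only this task's slice
    # A real engine compiles `query` into rich operators; here we just count lines.
    return chunk.count(b"\n")

task = {
    "task_id": 42,
    "query": "count rows",
    "read": {"partition": "data/part-0001.txt", "from": 0, "to": 64 * 1024 * 1024},
}
# result = run_task(task)   # the worker would send this back to the coordinator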
Shuffling
OK! I lied about the part where a task only reads from x to y. Depending on what the query is requesting, the task might not have the complete data required to produce its results. For instance, a query to write all the values related to a key onto disk in a single file cannot be completed without reading all the petabytes of data.
So how do we handle a query that cannot be served by just processing parts of the data independently? In such cases, the coordinator requests the workers to rearrange the data, aka shuffle it, by key. The data gets rewritten into chunks that make it possible to create the kind of contiguous-range tasks described above.
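In Spark, for example, exactly this happens for wide operations such as groupBy or repartition. A small PySpark sketch, where the paths and the key column are illustrative:

# PySpark sketch: writing all values for each key into key-specific files
# forces a shuffle, because rows for the same key start out spread across
# many input partitions. Paths and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-example").getOrCreate()

events = spark.read.parquet("s3://bucket/events/")   # many input partitions

# repartition by key is a wide transformation: Spark shuffles the data so
# that all rows with the same `key` land in the same partition.
(events.repartition("key")
       .write
       .partitionBy("key")                           # one directory per key on disk
       .parquet("s3://bucket/events_by_key/"))

# events.repartition("key").explain() would show an "Exchange" node,
# which is Spark's name for the shuffle.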
Processing
When a worker receives a task, it performs the set of operations requested by the coordinator. Since a task covers only a subset of the work, you will see workers pick up multiple tasks at the same time and do whatever is needed as part of each of them.
It is always very important to understand what processing is happening as part of a task in order to debug or speed up your job execution, as we will see in later posts of this series.
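In Spark, for instance, df.explain() is the quickest way to see which operators the tasks will actually run and where the stage boundaries are. A short PySpark sketch with a made-up dataset and columns:

# PySpark sketch: inspecting the work that tasks will perform.
# The dataset path and column names are made up for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("inspect-plan").getOrCreate()

orders = spark.read.parquet("s3://bucket/orders/")

daily_totals = (orders
    .filter(F.col("status") == "COMPLETED")   # runs inside each scan task
    .groupBy("order_date")                    # introduces a shuffle stage
    .agg(F.sum("amount").alias("total")))

# Prints the physical plan: the operators that run per task and the
# Exchange (shuffle) boundaries between stages.
daily_totals.explain()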
Reading and Writing
It is important to mention at this stage that almost all of these batch processing systems rely heavily on the properties of Distributed File Systems (DFS), such as HDFS, GFS, and S3, to store and process temporary state.
The fault tolerance and scalability of a DFS significantly improve performance at a per-task level. Read and write performance is also heavily influenced by the file format you choose to store and read data from: text files are much slower than column-oriented formats such as Parquet and Arrow.
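As a concrete sketch of the difference in Spark (paths are illustrative), the same dataset stored as CSV versus Parquet behaves very differently for column-selective reads:

# PySpark sketch: same data, different formats. Paths are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-example").getOrCreate()

# Row-oriented text/CSV: every query scans and parses whole lines.
csv_events = spark.read.option("header", True).csv("s3://bucket/raw/events.csv")

# Column-oriented Parquet: queries read only the columns they need,
# and per-column statistics let whole row groups be skipped.
csv_events.write.mode("overwrite").parquet("s3://bucket/curated/events/")
parquet_events = spark.read.parquet("s3://bucket/curated/events/")

# Selecting a single column touches far less data on the Parquet copy.
parquet_events.select("event_type").distinct().show()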
What’s next?
In the follow-up posts, I will share some of the key bottlenecks we tackled to improve job performance and reduce infrastructure costs by as much as 50% in some cases. Additionally, new levels of complexity and detail will surface as we start digging deeper into the problems and their solutions.