
Tuesday, April 28, 2020

Spark - Databricks

Apache Spark is a sophisticated distributed computation framework for executing code in parallel across many different machines.

Driver: The driver is the machine on which the Spark application runs. It is responsible for three main things:

1) Maintaining information about the Spark application.

2) Responding to the user's program.

3) Analyzing, distributing, and scheduling work across the executors.


Task: Tasks are created by the driver and assigned a partition of data to process.
            A partition is a collection of rows that sits on a physical machine in your cluster. Tasks are then assigned to slots for parallel execution. Once started, each task fetches data from its source.

Slots: Spark parallelizes at two levels. One is splitting the work among executors; the other is the slot. Each executor has a number of slots, and each slot can be assigned a task.

Job: The secret to Spark's performance is parallelism. Each parallelized action is called a job. Each job is broken down into stages, which are sets of ordered steps that together accomplish the job.

Executor: Each executor holds a chunk of data called a partition. Executors are responsible for carrying out the work assigned by the driver program. They are responsible for two things:
1) Executing the code assigned by the driver.
2) Reporting the state of the computation back to the driver.
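
A minimal PySpark sketch of how these pieces fit together (the app name and sizes are illustrative, and a local or cluster SparkSession is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("architecture-demo").getOrCreate()

df = spark.range(0, 1000000)                   # a simple distributed dataset
print(df.rdd.getNumPartitions())               # number of partitions = number of tasks per stage
print(spark.sparkContext.defaultParallelism)   # roughly the total slots (cores) available to the app

df = df.repartition(8)                         # redistribute into 8 partitions, i.e. 8 tasks per stage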


DataFrames: Tables are equivalent to Apache Spark DataFrames. A DataFrame is an immutable, distributed collection of data organized into named columns. In concept, a DataFrame is equivalent to a table in a relational database, except that DataFrames carry important metadata that allows Spark to optimize queries.
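
A small sketch of building a DataFrame by hand (the column names and values are made up; on Databricks you would more often read an existing table with spark.read.table):

from pyspark.sql import Row

df = spark.createDataFrame([
    Row(name="alpha", category="a", price=120),
    Row(name="beta",  category="b", price=80),
])

df.printSchema()   # named columns plus the metadata Spark uses for optimization
df.show()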

Lazy Evaluation: Lazy Evaluation refers to the idea that Spark waits until the last moment to execute a series of operations. Instead of modifying the data immediately when you express some operation, you build up a plan of transformations that you will apply to your source data. That plan is executed when you call an action. 

Lazy evaluation makes it easier to parallelize operations and allows Spark to apply various optimizations. 
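
A minimal sketch of lazy evaluation, reusing the df above with its name and price columns:

# Transformations only build up a plan; no data is read or modified here.
plan = df.filter(df["price"] > 100).select("name", "price")

plan.explain()   # inspect the plan Spark has recorded so far

plan.count()     # the action is what actually triggers execution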

Transformations: 
Transformations are at the core of how you express your business logic in Spark. They are the instructions you use to modify a DataFrame to get the results that you want. We call them lazy because they will not be completed at the time you write and execute the code in a cell; they will only get executed once you have called an action.

There are two types of transformations: Narrow and Wide.

For narrow transformations, the data required to compute the records in a single partition reside in at most one partition of the parent dataset. 
For wide transformations, the data required to compute the records in a single partition may reside in many partitions of the parent dataset. 




Remember, Spark partitions are collections of rows that sit on physical machines in the cluster. Narrow transformations mean that work can be computed and reported back to the executor without changing the way data is partitioned over the system. Wide transformations require that data be redistributed over the system. This is called a shuffle.

Shuffles are triggered when data needs to move between executors. 
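A short sketch contrasting the two, again using the df from above:

narrow = df.filter(df["price"] > 100)      # narrow: each input partition feeds at most one output partition
wide = df.groupBy("category").count()      # wide: rows with the same key must end up in the same partition

wide.explain()   # the physical plan contains an Exchange step, i.e. a shuffle

# The number of partitions produced by a shuffle is governed by this setting (200 by default).
spark.conf.set("spark.sql.shuffle.partitions", "200")
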
Actions: Actions are statements that are computed AND executed when they are encountered in the developer's code. They are not postponed and do not wait for other code constructs. While transformations are lazy, actions are eager.
Some example commands that plan transformations or trigger actions are shown below.
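
A non-exhaustive sample, using the df from above:

# Transformations - lazily planned, nothing executes yet:
df.select("name")
df.filter(df["price"] > 100)
df.distinct()
df.groupBy("category").sum("price")
df.orderBy("price")
df.limit(10)

# Actions - eagerly executed when encountered:
df.show()
df.count()
df.collect()
df.first()
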
Pipelining: Lazy evaluation allows Spark to optimize the entire pipeline of computations as opposed to the individual pieces. This makes it exceptionally fast for certain types of computation because it can perform all relevant computations at once. Technically speaking, Spark pipelines this computation. This means that certain computations can all be performed at once (like a map and a filter), rather than having to do one operation for all pieces of data and then the following operation. Additionally, Apache Spark can keep results in memory, as opposed to other frameworks that immediately write to disk after each task.
Source: A Gentle Introduction to Apache Spark on Databricks
Catalyst Optimizer: The Catalyst Optimizer is at the core of Spark SQL's power and speed. It automatically finds the most efficient plan for applying your transformations and actions. 
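For example, explain(True) prints the plans Catalyst produces for the df used above, from the parsed and analyzed logical plans through the optimized logical plan to the physical plan:

df.filter(df["price"] > 100).select("name").explain(True)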





In applications that reuse the same datasets over and over, one of the most useful optimizations is caching. Caching will place a DataFrame or table into temporary storage across the executors in your cluster and make subsequent reads faster.

The use case for caching is simple: as you work with data in Spark, you will often want to reuse a certain dataset. It is important to be careful how you use caching, because it is an expensive operation itself. If you're only using a dataset once, for example, the cost of pulling and caching it is greater than that of the original data pull. 
Once data is cached, the Catalyst optimizer will only reach back to the location where the data was cached. 
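
A minimal caching sketch, assuming a DataFrame that is reused several times (the path and the price column are made up):

df = spark.read.parquet("/tmp/example-data")   # hypothetical source

df.cache()    # marks df for caching; nothing is stored yet, since cache itself is lazy
df.count()    # the first action materializes the cache across the executors

df.filter(df["price"] > 100).count()   # later actions read from the cached data

df.unpersist()   # release the cached blocks when they are no longer needed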

