Saturday, August 8, 2015

Spark Optimizations and Hacks (Part 1)

Apache Spark is one of the most active projects in the Big Data ecosystem right now, so I thought I would share the knowledge I have gained over the past 6 months of working with Apache Spark.

More on Apache Spark can be found on: http://spark.apache.org/
In this blog post I would like to mention a few optimization techniques we can use to make full use of the available resources such as memory and cores.
1.       Load data from an external dataset rather than parallelizing a collection.
As most users are aware, there are two ways to create an RDD:
a)      Parallelize a collection
b)      Load from an external dataset
The first option is not the preferred one, because the data has to live in the driver program first; if the data is huge, it might crash the driver JVM.
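For example, in spark-shell (the HDFS path below is just an illustration):

// parallelizing a collection: the data must already exist in the driver JVM
val small = sc.parallelize(1 to 1000)

// loading from an external dataset: executors read the data directly, nothing goes through the driver
val big = sc.textFile("hdfs:///data/input.txt")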
2.       A filter transformation gives the same number of child RDD partitions as the parent RDD.
So after a filter transformation we can use coalesce(n) to bring the number of partitions down to n, since many of them may now be nearly empty.
Note: Here n is the target number of partitions.
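A minimal sketch of this (the path, filter predicate, and partition counts are only illustrative):

val logs = sc.textFile("hdfs:///data/logs", 200)   // 200 partitions
val errors = logs.filter(_.contains("ERROR"))      // still 200 partitions, many now nearly empty
val compact = errors.coalesce(10)                  // bring them down to 10 without a full shuffle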
3.       The collect action ships all of an RDD's data to the driver program JVM.
Do not call collect on something like 1 TB of data; all of it would be pulled into the driver and crash it. Instead, save the result to HDFS, or take/sample a small piece of the bigger RDD just to see the results while testing.
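For example (the paths and sample fraction here are illustrative):

val bigRdd = sc.textFile("hdfs:///data/big")             // assume this is very large

// Bad: bigRdd.collect() would pull everything into the driver JVM

// Better options:
bigRdd.saveAsTextFile("hdfs:///data/output")             // write the full result to HDFS
bigRdd.take(10).foreach(println)                         // look at just a few records
bigRdd.sample(withReplacement = false, 0.001).collect()  // inspect a ~0.1% sample while testing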

4.       Use persist or cache
Use these when we are sure that the data is clean and will be required again and again.
Eg : After a map or filter, use cache to keep the resulting RDD in memory.
This reduces the recomputation time for the RDD.
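A small sketch (the cleaning steps here are hypothetical):

import org.apache.spark.storage.StorageLevel

val cleaned = sc.textFile("hdfs:///data/raw")
  .filter(_.nonEmpty)
  .map(_.toLowerCase)

cleaned.persist(StorageLevel.MEMORY_ONLY)    // cache() is shorthand for this storage level
cleaned.count()                              // the first action materializes and caches the RDD
cleaned.take(5)                              // reuses the cached partitions, no recomputation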

5.       Narrow Transformations vs Wide Transformations
Narrow → each child partition depends on a single parent partition, so partitions can be processed in parallel with no shuffle (e.g. map, filter).
Wide → a child partition may depend on data from multiple parent partitions, so a shuffle is required (e.g. groupByKey, reduceByKey).

Always prefer a narrow transformation if possible.
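A quick word-count style illustration (the data is made up):

val words = sc.parallelize(Seq("spark", "spark", "hadoop"))
val pairs = words.map(w => (w, 1))        // narrow: each output partition depends on one parent partition
val counts = pairs.reduceByKey(_ + _)     // wide: needs a shuffle across partitions
counts.collect().foreach(println)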
  
Most transformations work element-wise: map and filter, for example, are applied to each element of each partition.

But there are a few transformations, such as mapPartitions, that work on a per-partition basis.

Eg  : A typical use case is opening a connection to a remote database once per partition, iterating over the elements of that partition, and then closing the connection, instead of opening a connection for each element (see the sketch below).
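A minimal sketch of that use case with mapPartitions; createConnection and lookup are hypothetical placeholders for a real database client:

// hypothetical stand-ins so the sketch compiles; replace with a real client
def createConnection(): AnyRef = new Object
def lookup(conn: AnyRef, id: String): String = id

val enriched = sc.textFile("hdfs:///data/ids").mapPartitions { ids =>
  val conn = createConnection()                      // one connection per partition, not per element
  val out = ids.map(id => lookup(conn, id)).toList   // materialize before the connection goes away
  // a real client would be closed here, e.g. conn.close()
  out.iterator
}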
6.       Oversubscribe
Always oversubscribe tasks to cores: for example, if we have 5 cores on an executor, oversubscribe by at least 1.5 or 2 times, so we can run roughly 8 to 10 concurrent tasks. The extra tasks keep the cores busy while other tasks are waiting on I/O.
Syntax:
spark-shell --master local[10]  // starts with 10 worker threads / task slots

spark-shell --master local[*]   // as many slots as logical cores on the machine
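The same idea can be pushed down to the RDD level by creating more partitions (and therefore tasks) than cores; the 2x factor and path below are illustrative, not a fixed rule:

val cores = Runtime.getRuntime.availableProcessors()
// roughly 2x as many partitions as cores, so slots stay busy while some tasks wait on I/O
val data = sc.textFile("hdfs:///data/input").repartition(cores * 2)
data.count()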

More in Spark Optimizations and Hacks (Part 2).
