reduceByKey and groupByKey are both transformations on key-value (pair) RDDs in Apache Spark.
In this post we will try to answer a few questions, such as:
Which one should you use?
Which one is more efficient?
When should you use reduceByKey, and when groupByKey?
To answer these, we need to look at how each transformation is executed.
The most expensive part of that execution is the shuffle phase,
so let us compare the shuffle step of reduceByKey and groupByKey.
reduceByKey shuffle step:
The values are first combined locally (a map-side combine), so that each partition ends up with at most one value per key.
Only this already-reduced data is then shuffled over the network to the executor responsible for each key, where the final reduce is applied.
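As a rough illustration, here is a minimal word-count sketch using reduceByKey; it assumes a spark-shell session (so sc is already defined) and made-up sample data.

// In spark-shell, where `sc` (the SparkContext) already exists.
val pairs = sc.parallelize(Seq("spark", "rdd", "spark", "shuffle", "spark"))
  .map(word => (word, 1))

// reduceByKey combines the values on each partition before the shuffle,
// so at most one (word, partialCount) pair per key leaves a partition.
val counts = pairs.reduceByKey(_ + _)
counts.collect().foreach(println)   // e.g. (spark,3), (rdd,1), (shuffle,1)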
groupByKey shuffle step:
The values for a key are not merged before the shuffle; the data is shuffled as it is,
so almost as much data as the original data set is sent over the network to each partition.
The values for each key are merged only after the shuffle.
This means a lot of data has to be held on the receiving worker node (the reducer), which can result in out-of-memory errors.
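For comparison, here is the same count written with groupByKey (again a sketch for a spark-shell session with sc defined); every (word, 1) pair crosses the network unchanged and the summing happens only afterwards.

// In spark-shell, where `sc` already exists.
val pairs = sc.parallelize(Seq("spark", "rdd", "spark", "shuffle", "spark"))
  .map(word => (word, 1))

// groupByKey shuffles every (word, 1) pair as-is; the values of each key are
// gathered into an Iterable on the reducer side and summed only there.
val counts = pairs.groupByKey().mapValues(_.sum)
counts.collect().foreach(println)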
From this we can conclude that:
reduceByKey is more efficient when we run it on a large data set.
groupByKey can cause out-of-memory problems (and, once Spark starts spilling, can even run out of disk).
Why do we have groupByKey then?
Because not every problem can be solved by a reduce: we do not always want to aggregate the values for a key; sometimes we need all of them at once, as in the sketch below.
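For example, a hypothetical per-key median (made-up data, spark-shell session assumed) needs every value of a key at the same time, so it cannot be expressed as a pairwise reduce:

// In spark-shell, where `sc` already exists.
val readings = sc.parallelize(Seq(("a", 3.0), ("a", 9.0), ("a", 5.0), ("b", 2.0), ("b", 8.0)))

// A median needs all the values of a key at once, so groupByKey is the natural fit here.
val medians = readings.groupByKey().mapValues { values =>
  val sorted = values.toSeq.sorted
  val n = sorted.size
  if (n % 2 == 1) sorted(n / 2) else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
}
medians.collect().foreach(println)   // e.g. (a,5.0), (b,5.0)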
The other variants, reduceByKey, aggregateByKey, foldByKey and combineByKey, all do some combining before the shuffle phase,
so whenever the result can be expressed as an aggregation they should be preferred over groupByKey.
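As a sketch of one of these variants, here is a hypothetical per-key average using aggregateByKey (made-up data, spark-shell session assumed); it keeps a small (sum, count) accumulator per key, so only those small pairs are shuffled.

// In spark-shell, where `sc` already exists.
val scores = sc.parallelize(Seq(("a", 3.0), ("b", 5.0), ("a", 7.0), ("b", 1.0)))

// seqOp folds each value into the per-partition (sum, count) accumulator before
// the shuffle; combOp then merges the partial accumulators from different partitions.
val sumCounts = scores.aggregateByKey((0.0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),
  (a, b) => (a._1 + b._1, a._2 + b._2)
)

val averages = sumCounts.mapValues { case (sum, count) => sum / count }
averages.collect().foreach(println)   // e.g. (a,5.0), (b,3.0)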
Reference: http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs

