Thursday, June 4, 2020

Spark DataFrame: Speed / Optimization / Shuffle Partitions




import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()

import spark.implicits._

val df2: DataFrame = Seq(1, 2).toDF("CURRENCY")
  .withColumn("c2", lit(8))
  .withColumn("c3", lit(1))

// Number of partitions used when shuffling data for joins or aggregations.
spark.conf.set("spark.sql.shuffle.partitions", 100)
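
A quick way to confirm the setting takes effect is to run a wide transformation (for example, a groupBy) and check the partition count of the result. A minimal sketch using the df2 defined above (assuming adaptive query execution is disabled, as it was by default in Spark at the time):

// Any wide transformation (groupBy, join, etc.) triggers a shuffle,
// and the shuffled output is split into spark.sql.shuffle.partitions partitions.
val grouped = df2.groupBy("CURRENCY").count()
println(grouped.rdd.getNumPartitions)  // prints 100, matching the setting above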


With a small amount of data, reduce the number of shuffle partitions; otherwise you end up with many partitions holding only a few records each, and the job spends a long time on the overhead of scheduling tasks and writing many small partitioned files.

With a large amount of data, having too few partitions results in fewer, longer-running tasks, and you may also run into out-of-memory errors.
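
One way to balance these two cases is to size the partition count from the data volume rather than hard-coding it. The heuristic below is an assumption for illustration (the ~128 MB target and the inputSizeBytes estimate are not from this post, just common ballpark figures):

// Hypothetical sizing heuristic: aim for roughly 128 MB per shuffle partition.
// inputSizeBytes is an assumed estimate of the data volume being shuffled.
val inputSizeBytes = 64L * 1024 * 1024 * 1024    // e.g. 64 GB, illustrative only
val targetPartitionBytes = 128L * 1024 * 1024    // ~128 MB per partition
val numPartitions = math.max(1, (inputSizeBytes / targetPartitionBytes).toInt)
spark.conf.set("spark.sql.shuffle.partitions", numPartitions)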


The default value of spark.sql.shuffle.partitions is 200.
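
You can read the current value back from the session configuration; in a fresh session with no override it returns the default:

// With no explicit override, this prints "200".
println(spark.conf.get("spark.sql.shuffle.partitions"))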
