JOIN types
sort merge join
broadcast join
Trigger condition:
The table is smaller than the value set by spark.sql.autoBroadcastJoinThreshold (default 10 MB).
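A minimal sketch of the threshold and an explicit broadcast hint, written as script-style statements (for example in spark-shell). The DataFrames `orders` and `countries`, their columns, and the app name are hypothetical and only serve as illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("BroadcastJoinSketch") // hypothetical app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data: a fact-like table and a small lookup table.
val orders = Seq((1, "CN", 10.0), (2, "US", 20.0))
  .toDF("order_id", "country_code", "amount")
val countries = Seq(("CN", "China"), ("US", "United States"))
  .toDF("country_code", "country_name")

// The threshold is given in bytes; 10 MB is the default. Setting -1 disables automatic broadcast.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

// broadcast() marks the small side explicitly, independent of the size estimate.
val joined = orders.join(broadcast(countries), Seq("country_code"))

// The physical plan should show a broadcast hash join for this query.
joined.explain()
```

The explicit hint is useful when table statistics are missing or inaccurate, since the automatic decision relies on the estimated table size.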
shuffle hash join
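Trigger conditions:
1. The average partition size does not exceed the value of spark.sql.autoBroadcastJoinThreshold (default 10 MB).
2. The base table cannot be the build side; for example, in a left outer join only the right table can be used.
3. One side must be significantly smaller than the other. The smaller side is used to build the hash table; it is hashed per partition rather than broadcast. "Significantly smaller" means at least 3 times smaller, which is an empirical value.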
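The three conditions above can be sketched as a plain Scala predicate. This is a simplified model of the notes, not Spark's actual planner code; the object name, method names, and the reduced set of join types are all hypothetical.

```scala
// Simplified, hypothetical model of the shuffle hash join conditions in these notes.
object ShuffleHashJoinCheck {
  sealed trait JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType
  case object RightOuter extends JoinType

  // Condition 1: the average partition size stays within the threshold.
  def smallEnoughPerPartition(sizeInBytes: Long, numPartitions: Int, threshold: Long): Boolean =
    sizeInBytes / numPartitions <= threshold

  // Condition 2: the base (preserved) table cannot be the small side.
  // With the right table as the small side, a right outer join is ruled out.
  def canUseRightSide(joinType: JoinType): Boolean = joinType match {
    case Inner | LeftOuter => true
    case RightOuter        => false
  }

  // Condition 3: "significantly smaller" means at least 3 times smaller.
  def muchSmaller(small: Long, large: Long): Boolean = small * 3 <= large

  // All three conditions, taking the right table as the candidate small side.
  def rightSideQualifies(
      joinType: JoinType,
      leftSize: Long,
      rightSize: Long,
      numPartitions: Int,
      threshold: Long): Boolean =
    canUseRightSide(joinType) &&
      smallEnoughPerPartition(rightSize, numPartitions, threshold) &&
      muchSmaller(rightSize, leftSize)
}
```

For instance, with 200 partitions and a 10 MB threshold, a 1 GB right table joined to a 3 GB left table in a left outer join passes all three checks, while the same tables in a right outer join do not.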
Small files problem
For jobs triggered through SQL, lower spark.sql.shuffle.partitions so that fewer output files are written.
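A minimal sketch of lowering the shuffle partition count before an aggregation is written out, as script-style statements. The sample data, the column name `key`, and the temporary output directory are assumptions for illustration only.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SmallFilesSketch") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// The default is 200 shuffle partitions. The same setting can be applied
// in SQL with: SET spark.sql.shuffle.partitions=4
spark.conf.set("spark.sql.shuffle.partitions", 4L)

// Hypothetical input: 1000 rows spread over 10 keys.
val counts = spark.range(0, 1000)
  .selectExpr("id % 10 AS key")
  .groupBy("key")
  .count()

// Hypothetical output location in a fresh temporary directory.
val outputPath = Files.createTempDirectory("small-files-sketch").resolve("out").toString

// The groupBy forces a shuffle, so the write produces at most 4 part files here.
counts.write.mode("overwrite").parquet(outputPath)
```

Setting the value too low reduces parallelism for large shuffles, so the number is usually a trade-off between file count and job speed.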
Spark on Hive
Code sample
Put hive-site.xml in the resources directory.
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// Build a local session with Hive support and dynamic partitioning enabled.
val spark = SparkSession.builder()
  .appName("WordCount")
  .config("spark.master", "local")
  .config("spark.sql.shuffle.partitions", 4)
  .config("spark.sql.adaptive.enabled", true)
  .config("hive.exec.dynamic.partition", true)
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .enableHiveSupport()
  .getOrCreate()
spark.sparkContext.setLogLevel("DEBUG")

// Load the cluster's Hadoop configuration files.
val sc = spark.sparkContext
val path = "/etc/hadoop"
sc.hadoopConfiguration.addResource(new Path(s"${path}/core-site.xml"))
sc.hadoopConfiguration.addResource(new Path(s"${path}/hdfs-site.xml"))
