I am reading Jacek Laskowski's online book about Apache Spark. Regarding partitioning, he states:
"By default, a partition is created for each HDFS partition, which by default is 64 MB."
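I'm not very familiar with HDFS, but I ran into trouble reproducing this statement. I have a file called Reviews.csv, a text file of Amazon food reviews of roughly 330 MB. Given the default 64 MB blocks, I would expect ceiling(330 / 64) = 6 partitions. However, when I load the file into my Spark shell, I get 9 partitions: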
scala> val tokenized_logs = sc.textFile("Reviews.csv")
tokenized_logs: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[1] at textFile at <console>:24
scala> tokenized_logs
res0: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[1] at textFile at <console>:24
scala> tokenized_logs.partitions
res1: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.HadoopPartition@3c1, org.apache.spark.rdd.HadoopPartition@3c2, org.apache.spark.rdd.HadoopPartition@3c3, org.apache.spark.rdd.HadoopPartition@3c4, org.apache.spark.rdd.HadoopPartition@3c5, org.apache.spark.rdd.HadoopPartition@3c6, org.apache.spark.rdd.HadoopPartition@3c7, org.apache.spark.rdd.HadoopPartition@3c8, org.apache.spark.rdd.HadoopPartition@3c9)
scala> tokenized_logs.partitions.size
res2: Int = 9
I did notice that if I create another, smaller version of Reviews.csv called Reviews_Smaller.csv, which is only 135 MB, the number of partitions drops considerably:
scala> val raw_reviews = sc.textFile("Reviews_Smaller.csv")
raw_reviews: org.apache.spark.rdd.RDD[String] = Reviews_Smaller.csv MapPartitionsRDD[11] at textFile at <console>:24
scala> raw_reviews.partitions.size
res7: Int = 4
By my math, though, there should be ceiling(135 / 64) = 3 partitions, not 4.
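For reference, my expectation is nothing more than the file size divided by the block size, rounded up. Here is a minimal sketch of that arithmetic, assuming the 64 MB figure from the book applies and using the approximate file sizes quoted above. The helper name expectedPartitions is my own and is not a Spark API.

```scala
// Hypothetical helper (not part of Spark): the partition count I would expect
// if exactly one partition were created per block of the given size.
def expectedPartitions(fileSizeMb: Double, blockSizeMb: Double = 64.0): Int =
  math.ceil(fileSizeMb / blockSizeMb).toInt

// Approximate sizes taken from the question; 64 MB is the assumed block size.
val expectedFull  = expectedPartitions(330) // Reviews.csv, ~330 MB
val expectedSmall = expectedPartitions(135) // Reviews_Smaller.csv, ~135 MB

println(s"Reviews.csv: $expectedFull, Reviews_Smaller.csv: $expectedSmall")
```

This yields 6 and 3, whereas the shell reports 9 and 4, so the block size (or the rule itself) that Spark uses locally must differ from what I assumed.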
I'm running everything locally on my MacBook Pro. Can anyone explain how the default number of partitions is calculated for HDFS?