df = spark.read.parquet("s3://my_s3_bucket/my_parquet_location/")
distinctDates = df.select("dt").distinct()
distinctDates.explain(True)
执行计划很快就得到了,找到了796个分区:
== Parsed Logical Plan ==
Deduplicate [dt#310]
+- Project [dt#310]
+- Relation[,... 287 more fields] parquet
== Analyzed Logical Plan ==
dt: date
Deduplicate [dt#310]
+- Project [dt#310]
+- Relation[<>,... 287 more fields] parquet
== Optimized Logical Plan ==
Aggregate [dt#310], [dt#310]
+- Project [dt#310]
+- Relation[<>,... 287 more fields] parquet
== Physical Plan ==
*(2) HashAggregate(keys=[dt#310], functions=[], output=[dt#310])
+- Exchange hashpartitioning(dt#310, 200)
+- *(1) HashAggregate(keys=[dt#310], functions=[], output=[dt#310])
+- *(1) FileScan parquet [dt#310] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://my_s3_bucket/my_parquet_location], PartitionCount: 796, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
现在,当我试图计算distinctDates数据帧中的条目数时:
distinctDates.count()
在给出结果“796”之前,执行需要很多分钟。
真奇怪。我不明白Spark对我的询问做了什么。我做错什么了吗?对于分区键的查询,我希望立即得到结果。