How to dump generated Java code to stdout?

apache-spark-sql apache-spark

Midiparse · Tech Community · 6 years ago

Using DataFrames on Apache Spark 2.+, is there a way to get at the underlying RDD and dump the generated Java code to the console?

2 answers | as of 6 years ago

huon John U 6 years ago

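This can be done with QueryExecution.debug.codegen. The value is reachable on a DataFrame/Dataset through .queryExecution (this is a "developer API", i.e. unstable and liable to break, so it is best used only for debugging). This works in Spark 2.4.0 and, judging from the code, it should work from 2.0.0 (or later) onwards: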

scala> val df = spark.range(1000)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.queryExecution.debug.codegen
Found 1 WholeStageCodegen subtrees.
== Subtree 1 / 1 ==
*(1) Range (0, 1000, step=1, splits=12)

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=1
/* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.Iterator[] inputs;
/* 009 */   private boolean range_initRange_0;
/* 010 */   private long range_number_0;
/* 011 */   private TaskContext range_taskContext_0;
/* 012 */   private InputMetrics range_inputMetrics_0;
/* 013 */   private long range_batchEnd_0;
/* 014 */   private long range_numElementsTodo_0;
/* 015 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] range_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1];

...

/* 104 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_nextBatchTodo_0);
/* 105 */       range_inputMetrics_0.incRecordsRead(range_nextBatchTodo_0);
/* 106 */
/* 107 */       range_batchEnd_0 += range_nextBatchTodo_0 * 1L;
/* 108 */     }
/* 109 */   }
/* 110 */
/* 111 */ }
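
If the generated source is needed as a value rather than as console output (for example, to write it to a file), the same text can be obtained as a String. This is a minimal sketch, assuming Spark 2.x or 3.x on the classpath; the local SparkSession, the app name, and the variable name `generated` are illustrative and not part of the answer above. `codegenString` lives in the `org.apache.spark.sql.execution.debug` package object and, like `queryExecution`, is an internal API that may change between releases:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug.codegenString

// Assumption: a throwaway local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("codegen-dump")
  .getOrCreate()

val df = spark.range(1000)

// Build the text that debug.codegen prints, but keep it as a String.
val generated: String = codegenString(df.queryExecution.executedPlan)

// Dump it to stdout explicitly.
println(generated)
```

Holding the text in a variable makes it easy to search, diff between Spark versions, or redirect somewhere other than the console.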

Raphael Roth 6 years ago

Here is one way to print the generated code; there may be others:

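import org.apache.spark.sql.execution.command.ExplainCommand

// Assumes `spark` and `df` are already in scope, as in spark-shell.
// Spark 2.x signature; in Spark 3.0+ ExplainCommand takes an ExplainMode instead of the codegen flag.
val explain = ExplainCommand(df.queryExecution.logical, codegen = true)

// Run the EXPLAIN command and print each returned row (a single string column).
spark.sessionState.executePlan(explain).executedPlan.executeCollect().foreach {
  r => println(r.getString(0))
}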
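
One such alternative, as a hedged sketch: the `org.apache.spark.sql.execution.debug` package object adds a `debugCodegen()` method to any Dataset through an implicit class, which prints the generated code directly. The local SparkSession and app name below are assumptions for a standalone run; in spark-shell only the import and the final call are needed. Like the approaches above, this is an internal debugging API and may change between releases:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._

// Assumption: a throwaway local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("codegen-debug")
  .getOrCreate()

val df = spark.range(1000)

// Prints the whole-stage codegen subtrees and their generated Java source to stdout.
df.debugCodegen()

// Spark 3.0+ only (Dataset.explain(mode: String) does not exist in 2.x):
// df.explain("codegen")
```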
