
You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved

  •  0 votes
  •  Ardalan Shahgholi  ·  asked 4 years ago

    I am using Databricks as a service on Azure. Here is my cluster information:

    [cluster configuration screenshot]

    I ran the following command and everything worked fine:

     %sql
     SELECT *
     FROM db_xxxxx.t_fxxxxxxxxx
     LIMIT 10
    

    Then I updated some rows in that table. When I ran the command above again, I got the following error:

        Error in SQL statement: SparkException: Job aborted due to stage failure: Task 3 in stage 2823.0 failed 4 times, most recent failure: Lost task 3.3 in stage 2823.0 (TID 158824, 10.11.49.6, executor 14): com.databricks.sql.io.FileReadException: Error while reading file abfss:REDACTED_LOCAL_PART@storxfadev0501.dfs.core.windows.net/xsi-ed-faits/t_fait_xxxxxxxxxxx/_delta_log/00000000000000000022.json. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
            at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:286)
            at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:251)
            at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
            at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:205)
            at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:354)
            at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:205)
            at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
            at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
            at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:640)
            at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
            at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
            at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
            at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
            at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(Unknown Source)
            at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
            at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:640)
            at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
            at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
            at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
            at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
            at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
            at org.apache.spark.scheduler.Task.run(Task.scala:112)
            at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
            at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            at java.lang.Thread.run(Thread.java:748)
        Caused by: java.io.FileNotFoundException: HEAD https://storxfadev0501.dfs.core.windows.net/devdledxsi01/xsi-ed-faits/t_fait_photo_impact/_delta_log/00000000000000000022.json?timeout=90
        StatusCode=404
        StatusDescription=The specified path does not exist.
        ErrorCode=
        ErrorMessage=
            at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:912)
            at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:169)
            at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
            at com.databricks.spark.metrics.FileSystemWithMetrics.open(FileSystemWithMetrics.scala:282)
            at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:85)
            at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.<init>(HadoopFileLinesReader.scala:65)
            at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$.readFile(JsonDataSource.scala:134)
            at org.apache.spark.sql.execution.datasources.json.JsonFileFormat$$anonfun$buildReader$2.apply(JsonFileFormat.scala:138)
            at org.apache.spark.sql.execution.datasources.json.JsonFileFormat$$anonfun$buildReader$2.apply(JsonFileFormat.scala:136)
            at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:147)
            at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:134)
            at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:235)
            ... 26 more
        Caused by: HEAD https://storxfadev0501.dfs.core.windows.net/devdledxsi01/xsi-ed-faits/t_fait_photo_impact/_delta_log/00000000000000000022.json?timeout=90
        StatusCode=404
        StatusDescription=The specified path does not exist.
        ErrorCode=
        ErrorMessage=
            at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:134)
            at shaded.databricks.v20180920_b
    
    0 replies  ·  4 years ago
        Answer 1
  •  4 votes
  •   Jorge Tovar  ·  3 years ago

    In short, you can either REFRESH the table (before executing your query) or restart the cluster:

    spark.sql("refresh TABLE schema.table")
    

    It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running the 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If the Delta cache is stale or the underlying files have been removed, you can invalidate the Delta cache manually by restarting the cluster.
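
    The error message also offers a second route: recreating the Dataset/DataFrame involved. Below is a minimal PySpark sketch of both steps, assuming the `spark` session that Databricks provides in every notebook and the question's (redacted) table name:

    spark.sql("REFRESH TABLE db_xxxxx.t_fxxxxxxxxx")  # drop the stale cache entries first
    # Re-reading the table builds a brand-new DataFrame, so Spark re-lists the
    # Delta _delta_log directory instead of reusing the cached file listing.
    df = spark.read.table("db_xxxxx.t_fxxxxxxxxx")
    df.limit(10).show()

    If even a fresh DataFrame still fails because the Delta cache on the workers is stale, restarting the cluster is the remaining option, as noted above.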

        Answer 2
  •  3 votes
  •   CHEEKATLAPRADEEP  ·  4 years ago

    This is expected behavior when you update some rows in a table and immediately query the table again.

    From the error message: It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

    To resolve this issue, refresh all cached entries associated with the table:

    REFRESH TABLE [db_name.]table_name
    

    Refreshes all cached entries associated with the table. If the table was previously cached, it will be cached lazily the next time it is scanned.
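
    To see that lazy re-caching in action, here is a short sketch in PySpark, again using the question's redacted table name as a stand-in; `CACHE TABLE` populates the cache eagerly, and the first scan after `REFRESH TABLE` re-caches on demand:

    spark.sql("CACHE TABLE db_xxxxx.t_fxxxxxxxxx")          # cache is populated eagerly
    # ...another job updates rows in the underlying Delta table here...
    spark.sql("REFRESH TABLE db_xxxxx.t_fxxxxxxxxx")        # invalidates the stale entries
    spark.table("db_xxxxx.t_fxxxxxxxxx").limit(10).show()   # this scan re-caches lazily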