
Unexplainable behavior with PySpark when dropping nulls

JoeVictor · Tech Community · 6 years ago

I have a Spark DataFrame in PySpark that I'm trying to remove null values from.

Earlier, while cleaning things up during parsing, I ran a convert_to_null method on the title column that basically checks whether the column's string literally reads "None" and, if so, converts it to None. That way, Spark converts it to its internal null type.
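For context, here is a minimal sketch of what such a UDF might look like, assuming a plain Python UDF registered with pyspark.sql.functions.udf (the actual implementation isn't shown in the question, and the traceback below refers to it as replace_none_with_null):

    # Hypothetical sketch -- the asker's actual UDF is not shown.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    @F.udf(returnType=StringType())
    def convert_to_null(value):
        # Returning None from a Python UDF produces a Spark null
        # in the resulting column.
        if value == "None":
            return None
        return value

    df = df.withColumn('title', convert_to_null(F.col('title')))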

Now, I'm trying to drop the rows where the title column is null. Here is everything I've tried to remove the nulls (in the snippets below, F is pyspark.sql.functions):

    new_df = df.na.drop(subset=['title'])

    new_df = df[F.col('title').isNotNull()]

    new_df = df[~F.col('title').isNull()]

But I always get an error a few rows into the new_df.show() call afterwards:

    Py4JJavaError: An error occurred while calling o2022.showString.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 87.0 failed 1 times, most recent failure: Lost task 1.0 in stage 87.0 (TID 314, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 230, in main
        process()
      File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 225, in process
        serializer.dump_stream(func(split_index, iterator), outfile)
      File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 324, in dump_stream
        self.serializer.dump_stream(self._batched(iterator), stream)
      File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 139, in dump_stream
        for obj in iterator:
      File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 313, in _batched
        for item in iterator:
      File "<string>", line 1, in <lambda>
      File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 75, in <lambda>
        return lambda *a: f(*a)
      File "/usr/local/spark/python/pyspark/util.py", line 55, in wrapper
        return f(*args, **kwargs)
      File "<ipython-input-16-48bc3ec1b5d9>", line 5, in replace_none_with_null
    TypeError: 'in <string>' requires string as left operand, not NoneType
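For reference, that final TypeError is exactly what Python raises when the left operand of an in check against a string is None:

    >>> None in "None"
    TypeError: 'in <string>' requires string as left operand, not NoneType

which suggests the parsing-time UDF is only evaluated lazily when show() forces computation, and is hitting rows whose value is already None.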

I think I'm going crazy. I can't figure out how to fix this. Any help would be greatly appreciated. Thanks!

0 replies · 6 years ago