
Using rdd.map in PySpark

  •  1
  • Ivan Bilan  · asked 6 years ago

    I need to port code from PySpark 1.3 to 2.3 (on Python 2.7 only), and I have the following map transformation on an RDD:

    import cPickle as pickle
    import base64
    
    path = "my_filename"
    
    my_rdd = "rdd with data" # pyspark.rdd.PipelinedRDD()
    
    # saving RDD to a file but first encoding everything
    my_rdd.map(lambda line: base64.b64encode(pickle.dumps(line))).saveAsTextFile(path)
    
    # another my_rdd.map doing the opposite of the above, fails with the same error
    my_rdd = sc.textFile(path).map(lambda line: pickle.loads(base64.b64decode(line)))
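For reference, the encode/decode pair applied to each line can be exercised without Spark at all. A minimal sketch of the round trip (written for Python 3, where `cPickle` has been merged into `pickle`; the logic is the same as in the lambdas above):

```python
import base64
import pickle

# Encode a record the same way the map() lambda does:
# pickle it, then base64-encode so it is safe to store as a text line.
record = {"id": 1, "name": "example"}
encoded = base64.b64encode(pickle.dumps(record))

# Decode: reverse the two steps, as the second map() does.
decoded = pickle.loads(base64.b64decode(encoded))
print(decoded == record)  # → True
```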
    

    When I run this part, I get the following error:

       raise pickle.PicklingError(msg)
    PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
    

    It looks like this kind of behavior is not allowed inside map.
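That matches the error text: Spark has to pickle the function passed to `map()` and ship it to the executors, so anything the function's closure captures, including an RDD or the SparkContext itself, must be picklable. The SPARK-5063 message names exactly this pattern. A PySpark-free sketch of the underlying failure mode, using a thread lock as a stand-in for an unpicklable driver-side object:

```python
import pickle
import threading

# A thread lock is a classic unpicklable object; here it stands in for
# driver-only state such as a SparkContext or another RDD captured in
# the closure of a mapped function.
unpicklable = threading.Lock()

try:
    pickle.dumps(unpicklable)
    failed = False
except TypeError:  # "cannot pickle '_thread.lock' object"
    failed = True

print(failed)  # → True
```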

    Update:

    Strangely, just doing:

    my_rdd.saveAsTextFile(path)
    

    also fails with the same error.

    1 Answer  |  last active 6 years ago
        1
  •  0
  •   Ivan Bilan    6 years ago

    In the end, the problem turned out to be buried deep inside the function doing the transformation. In this case, rewriting it was easier than debugging it.
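One way to hunt down such a failure, instead of rewriting blind, is to probe the values the mapped function closes over and see which of them fail to pickle. A hypothetical helper (`find_unpicklable` is an assumed name, not a Spark or pickle API):

```python
import pickle
import threading

def find_unpicklable(values):
    """Return the names of the given values that fail to pickle."""
    bad = []
    for name, value in values.items():
        try:
            pickle.dumps(value)
        except Exception:
            bad.append(name)
    return bad

# Probe a dict of variables that the mapped function references;
# the lock stands in for unpicklable driver-side state.
state = {"threshold": 42, "lock": threading.Lock()}
print(find_unpicklable(state))  # → ['lock']
```

Running the helper over each candidate variable narrows the PicklingError down to the offending object without touching the Spark job itself.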