我需要将代码从PySpark 1.3移植到2.3(也仅在Python 2.7上),并且在rdd上有以下映射转换:
import cPickle as pickle
import base64
path = "my_filename"
my_rdd = "rdd with data"
my_rdd.map(lambda line: base64.b64encode(pickle.dumps(line))).saveAsTextFile(path)
my_rdd = sc.textFile(path).map(lambda line: pickle.loads(base64.b64decode(line)))
当运行此部分时,我得到以下错误:
raise pickle.PicklingError(msg)
PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
看起来像这样的行为是不允许的
map
更新:
奇怪的是,只是做:
my_rdd.saveAsTextFile(path)
同样的错误也会失败。