
PySpark does not create a DataFrame inside a function when arguments are passed in the call

azure-databricks bigdata databricks pyspark apache-spark


Ajay S Pal · 2 years ago

I want to create DataFrames using a function. I have a list of countries as Row objects, as shown below:

country_list=df_correct_countries.select('NewCountry').dropDuplicates().collect()
for i in country_list:
    print(i)

###OUTPUT 

Row(NewCountry='Senegal')
Row(NewCountry='Algeria')
Row(NewCountry='Nigeria')
Row(NewCountry='Morocco')
Row(NewCountry='Ethiopia')

I pass each one to the create_df function in a for loop; the function takes two arguments (original_df, country).

Here is my create_df function:

from pyspark.sql.functions import col

def create_df(df, cnt):
    # Keep only the rows whose NewCountry matches the given country name.
    return df.where(col("NewCountry") == str(cnt))

This is how I call the function:


for j in country_list:
    create_df(df_correct_countries, j['NewCountry'])  # NewCountry is the name of the corrected country column, whose values I collected into the list

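Inside the function, cnt holds one country on each call. I want to create a new DataFrame that contains only the rows belonging to the current value of cnt.

However, it does not create the DataFrame.

The function runs without errors, but when I try to display one of the countries, it throws an error.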

display(EquatorialGuinea)

###Error
NameError: name 'EquatorialGuinea' is not defined

However, when I create the DataFrame outside the function for the same country, it works. Like this:

df_correct_countries.where(col("NewCountry")=='EquatorialGuinea')

The above works.

Can someone tell me what is going wrong?

I have already tried everything described above.

1 reply


Nishant Wangneo · 2 years ago

for j in country_list:
    create_df(df_correct_countries, j['NewCountry'])  # NewCountry is the name of the corrected country column, whose values I collected into the list

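I think the problem is that you are not storing the DataFrames that create_df returns in this loop. The function returns the filtered DataFrame, but the call discards it, and calling the function does not create a variable named after the country, which is why display(EquatorialGuinea) raises a NameError. You need to keep each returned DataFrame somewhere (for example in a dictionary) or display it inside the loop. For example: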

country_dfs = {}

for j in country_list:
    country_name = j['NewCountry']
    country_dfs[country_name] = create_df(df_correct_countries, country_name)

display(country_dfs['EquatorialGuinea'])
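The same behaviour can be reproduced without Spark, which may make the cause easier to see. The sketch below is only an illustration: the rows are plain dicts standing in for DataFrame rows, and filter_rows, rows and country_rows are hypothetical names that do not appear in the question. It shows that a return value that is never assigned is simply lost, and that keeping the results in a dictionary keyed by country name makes them retrievable.

```python
# Spark-free stand-in: each "row" is a plain dict (an assumption for illustration).
rows = [
    {"NewCountry": "Senegal", "value": 1},
    {"NewCountry": "Algeria", "value": 2},
    {"NewCountry": "Senegal", "value": 3},
]


def filter_rows(data, country):
    # Hypothetical analogue of create_df: return only the matching rows.
    return [r for r in data if r["NewCountry"] == country]


# Distinct country names, similar in spirit to select(...).dropDuplicates().
countries = sorted({r["NewCountry"] for r in rows})

# Calling the function without assigning its result discards the result.
# No variable called Senegal or Algeria is created by these calls.
for name in countries:
    filter_rows(rows, name)

# Keeping each result under its country name makes it retrievable later.
country_rows = {name: filter_rows(rows, name) for name in countries}

print(country_rows["Senegal"])
```

With real DataFrames the dictionary from the answer plays the same role: country_dfs['EquatorialGuinea'] looks up the filtered DataFrame by key, so no variable per country is needed.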