How to perform a sorted search in a pandas DataFrame?

dataframe pandas python

mllamazares · Tech Community · 7 years ago

I have an input string that looks like this:

ms = 'hello stack overflow friends'

and a pandas DataFrame with the following structure:

      string  priority  value
0         hi         1      2
1  astronaut        10      3
2   overflow         3     -1
3     varfoo         4      1
4      hello         2      0

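I am then trying to perform the following simple algorithm:

1. Sort the pandas DataFrame in ascending order by the df['priority'] column.
2. Check whether the ms string variable contains the word in df['string'].
3. If it does, return the corresponding df['value'].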

So this is my approach:

import pandas as pd

ms = 'hello stack overflow friends'

df = pd.DataFrame({'string': ['hi', 'astronaut', 'overflow', 'varfoo', 'hello'],
                   'priority': [1, 10, 3, 4, 2],
                   'value': [2, 3, -1, 1, 0]})

final_val = None

for _, row in df.sort_values('priority').iterrows():
    # just printing the current row for debug purposes
    print (row['string'], row['priority'], row['value'])

    if ms.find(row['string']) > -1:
        final_val = row['value']
        break

print()
print("The final value for '", ms, "' is ", final_val)

It returns the following:

hi 1 2
hello 2 0

The final value for ' hello stack overflow friends ' is  0

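This code works fine, but the problem is that my df has about 20K rows and I need to perform this kind of search more than 1K times.

This drastically reduces the performance of my process. So, is there a better (or simpler) approach than mine that uses pure pandas and avoids unnecessary loops?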

1 answer

rer · 7 years ago

Write a function that can be applied to the DataFrame instead of using iterrows:

match_set = set(ms.split())
def check_matches(row):
    return row['value'] if row['string'] in match_set else None

df['matched'] = df.apply(check_matches, axis=1)

This gives you:

   priority     string  value  matched
0         1         hi      2      NaN
1        10  astronaut      3      NaN
2         3   overflow     -1     -1.0
3         4     varfoo      1      NaN
4         2      hello      0      0.0

You can then sort by priority and take the first non-NaN value from df.matched to get what you called final_val:

df.sort_values('priority').matched.dropna().iloc[0]
0.0
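The apply call above still runs a Python function once per row. As a rough sketch of a more vectorized variant, the same lookup can be written as a boolean mask plus idxmin. The function name first_match and the whole_words flag are hypothetical, introduced only for illustration, and the code assumes df has a unique index. Note that matching against ms.split() tests whole words, while the question's ms.find(...) tests substrings, so the flag lets you pick either behaviour.

```python
import pandas as pd

df = pd.DataFrame({'string': ['hi', 'astronaut', 'overflow', 'varfoo', 'hello'],
                   'priority': [1, 10, 3, 4, 2],
                   'value': [2, 3, -1, 1, 0]})


def first_match(df, ms, whole_words=True):
    """Return the value of the lowest-priority row whose string occurs in ms.

    Hypothetical helper: whole_words=True mirrors the set-based answer,
    whole_words=False mirrors the substring test (ms.find) from the question.
    """
    if whole_words:
        # Vectorized membership test against the words of ms.
        mask = df['string'].isin(ms.split())
    else:
        # Substring test; this is still a Python-level loop over the column.
        mask = df['string'].map(lambda s: s in ms)

    matches = df[mask]
    if matches.empty:
        return None
    # idxmin returns the index label of the smallest priority (assumes unique labels).
    return matches.loc[matches['priority'].idxmin(), 'value']


print(first_match(df, 'hello stack overflow friends'))
```

With the sample data this selects the 'hello' row, because its priority (2) is lower than that of 'overflow' (3). No sort is needed, since only the minimum priority among the matching rows matters.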

Alternatively, you can sort the df and convert it into a list of tuples:

l = df.sort_values('priority').apply(lambda r: (r['string'], r['value']), axis=1).tolist()

which gives:

l
[('hi', 2), ('hello', 0), ('overflow', -1), ('varfoo', 1), ('astronaut', 3)]

and write a function that stops as soon as it reaches the first match:

def check_matches(l):
    for (k, v) in l:
        if k in match_set:
            return v
check_matches(l)
0
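Since the question mentions running the search more than 1K times against the same 20K-row df, one possible refinement is to do the sorting once and keep a dictionary keyed by string, so that each query only touches its own words. This is a sketch under assumptions: build_lookup and find_value are hypothetical names, matching is whole-word as in the answer above (not the substring test from the question), and when a string appears in several rows only the lowest-priority one is kept.

```python
import pandas as pd

df = pd.DataFrame({'string': ['hi', 'astronaut', 'overflow', 'varfoo', 'hello'],
                   'priority': [1, 10, 3, 4, 2],
                   'value': [2, 3, -1, 1, 0]})


def build_lookup(df):
    """Build {string: (priority, value)} once; hypothetical helper."""
    # Sort first so drop_duplicates keeps the lowest-priority row per string.
    ordered = df.sort_values('priority').drop_duplicates('string')
    return dict(zip(ordered['string'].tolist(),
                    zip(ordered['priority'].tolist(), ordered['value'].tolist())))


def find_value(ms, lookup):
    """Return the value of the lowest-priority word of ms found in lookup."""
    hits = [lookup[word] for word in ms.split() if word in lookup]
    if not hits:
        return None
    # Compare on priority only; ties go to the word that appears first in ms.
    return min(hits, key=lambda pair: pair[0])[1]


lookup = build_lookup(df)  # done once, reused for every query
for query in ['hello stack overflow friends', 'hi there', 'nothing here']:
    print(query, '->', find_value(query, lookup))
```

Each query then costs time proportional to the number of words in ms rather than the number of rows in df, which is the main reason to prefer it when the same df is searched many times.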
