
Convert pandas node and edge lists from node labels to node indexes

edges nodes pandas python

mammykins · 6 years ago

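I have a tidy representation of a graph or network as two separate CSVs: one for nodes and one for weighted edges. I have read them from CSV into pandas DataFrames in Python 3.

Here I create some similar DataFrames using a different method, just to illustrate the problem.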

import pandas as pd

# i have a nodes list
nodes = {'page': ['/', '/a', '/b']}
# the data is actually read in from csv
nodes = pd.DataFrame.from_dict(nodes)

nodes

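This returns the node list, automatically indexed by the default method (whatever that is; I read that it varies between Python versions, but that shouldn't affect the problem).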

    page
0   /
1   /a
2   /b

The edge list is:

# and an edges list which uses node label; source and destination
# need to convert into indexes from nodes
edges = {'source_node': ['/', '/a', '/b', '/a'],
        'destination_node': ['/b', '/b', '/', '/'],
        'weight': [5, 2, 10, 5]}
# the data is actually read in from csv
edges = pd.DataFrame.from_dict(edges)
edges

Which looks like:

  source_node destination_node  weight
0           /               /b       5
1          /a               /b       2
2          /b                /      10
3          /a                /       5

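Here you can see the problem: the source and destination nodes are labels, not the corresponding node indexes from the previous DataFrame. I want an edges DataFrame that holds the index of each labelled node rather than its label. I could do this upstream in the data pipeline, but for convenience I'd like to fix it here. There are about 22k nodes and 45k edges. I don't mind if the solution takes a few minutes to run.

I can get the information I need, but I can't assign it to a new column in the edges DataFrame.

I can get the indexes I want with a loop, but is there a better way to do this in pandas? Can I vectorize the problem as I would in R?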

for i in edges["source_node"]:
    print(nodes[nodes.page == i].index.values.astype(int)[0])

for i in edges["destination_node"]:
    print(nodes[nodes.page == i].index.values.astype(int)[0])

0
1
2
1
2
2
0
0

And how do I put these into my edges DataFrame as two new columns, one called "source" and one called "destination"? What I want is:

  source_node destination_node  weight  source  destination
0           /               /b       5       0            2
1          /a               /b       2       1            2
2          /b                /      10       2            0
3          /a                /       5       1            0

The following raises an error, and looks wrong to begin with:

edges['source'] = for i in edges["source_node"]:
    nodes[nodes.page == i].index.values.astype(int)[0]

edges['destination'] = for i in edges["destination_node"]:
    nodes[nodes.page == i].index.values.astype(int)[0]
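For reference, the attempt above fails because `for` is a statement and cannot appear on the right-hand side of an assignment. The closest working form is probably a list comprehension, sketched below on the same sample frames. This is only an illustration of the syntax, not a recommended solution: it scans `nodes` once per edge, so it will be slow on larger data.

```python
import pandas as pd

nodes = pd.DataFrame({'page': ['/', '/a', '/b']})
edges = pd.DataFrame({'source_node': ['/', '/a', '/b', '/a'],
                      'destination_node': ['/b', '/b', '/', '/'],
                      'weight': [5, 2, 10, 5]})

# A list comprehension is an expression, so its result can be assigned
# directly to a new column (one lookup in nodes per edge row).
edges['source'] = [nodes[nodes.page == label].index[0]
                   for label in edges['source_node']]
edges['destination'] = [nodes[nodes.page == label].index[0]
                        for label in edges['destination_node']]
```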

As I'm new to Python, I'm interested in the "Pythonic" way to solve this, as well as an approach that is simple enough for a novice like me.

1 answer

Scott Boston · 6 years ago

You can use map and set_index:

nodelist = nodes.reset_index().set_index('page').squeeze()

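Or, as @mammykins suggests for a real-world sample, where the same page label can appear more than once, keep only the first occurrence of each label: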

nodelist = nodelist.loc[~nodelist.index.duplicated(keep='first')]


edges['source'] = edges.source_node.map(nodelist)
edges['destination'] = edges.destination_node.map(nodelist)

print(edges)

Output:

  source_node destination_node  weight  source  destination
0           /               /b       5       0            2
1          /a               /b       2       1            2
2          /b                /      10       2            0
3          /a                /       5       1            0
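On real data it may be worth checking two things the sample does not exercise: repeated labels in `nodes`, and edge labels that have no matching node. The sketch below assumes made-up data (the duplicated '/a' row and the '/missing' label are hypothetical) and builds the same label-to-index lookup as above, then lists any edges that failed to map.

```python
import pandas as pd

# Hypothetical data: '/a' appears twice in nodes, '/missing' has no node.
nodes = pd.DataFrame({'page': ['/', '/a', '/b', '/a']})
edges = pd.DataFrame({'source_node': ['/', '/a', '/b'],
                      'destination_node': ['/b', '/missing', '/'],
                      'weight': [5, 2, 10]})

# Label -> index lookup: a Series of node indexes, indexed by page label.
nodelist = nodes.reset_index().set_index('page')['index']
# Keep the first index for any repeated label so the lookup is unique.
nodelist = nodelist[~nodelist.index.duplicated(keep='first')]

edges['source'] = edges['source_node'].map(nodelist)
edges['destination'] = edges['destination_node'].map(nodelist)

# Labels absent from nodes map to NaN; collect those rows for inspection.
unmatched = edges[edges[['source', 'destination']].isna().any(axis=1)]
print(unmatched)
```

A column containing NaN is stored as float, so any unmatched rows would need to be dropped or fixed before casting the new columns to integers.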