
Dictionary to presence/absence DataFrame

  •  0
  • F.Lira  · 7 years ago

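    Given a 2D dictionary (each key maps to a list of values), how can I convert it into a presence/absence DataFrame or matrix in which the values from the lists are the columns and the keys are the row names? The values accumulate in lists, and my goal is to organize them into a matrix.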

    I have been trying, without success:

    values = set()
    
    for genome, info in dict_cluster.items():
        for v in info:
            #t = [genome, ([v for v in info])]
            t = [genome,v]
        print pd.DataFrame(t)
    

    Input:

    A ['arylpolyene', 'hserlactone', 'hserlactone', 'nrps', 'siderophore', 't1pks-nrps', 'terpene', 'thiopeptide', 'transatpks-nrps']
    B ['hserlactone', 'hserlactone-arylpolyene', 'nrps', 'siderophore', 'thiopeptide']
    C ['nrps', 'nrps', 'nrps', 'siderophore', 't1pks-nrps', 't1pks-nrps']
    D ['nrps', 'siderophore', 't1pks-nrps']
    

    Output:

       arylpolyene  siderophore  hserlactone-arylpolyene  transatpks-nrps  terpene  thiopeptide  hserlactone  nrps  t1pks-nrps
    A            1            2                        0                1        1            1            2     1           1
    B            0            1                        1                0        0            1            1     1           0
    C            0            1                        0                0        0            0            0     3           2
    D            0            1                        0                0        0            0            0     1           1
    

    What I get:

                     0
    0  GCF_900068895.1
    1  transatpks-nrps
                     0
    0  GCA_002415165.1
    1      thiopeptide
                     0
    0  GCA_000367685.2
    1       t1pks-nrps
                     0
    0  GCA_002732135.1
    1       t1pks-nrps
    
    3 answers  |  last activity 7 years ago
        1
  •  1
  •   jezrael    7 years ago

    Use Counter in a dictionary comprehension and pass the result to the DataFrame constructor:

    from collections import Counter
    
    import pandas as pd
    
    # d is the input dictionary mapping each key to its list of values
    df = pd.DataFrame({k:Counter(v) for k, v in d.items()}).T.fillna(0).astype(int)
    print (df)
    
       arylpolyene  hserlactone  hserlactone-arylpolyene  nrps  siderophore  \
    A            1            2                        0     1            1   
    B            0            1                        1     1            1   
    C            0            0                        0     3            1   
    D            0            0                        0     1            1   
    
       t1pks-nrps  terpene  thiopeptide  transatpks-nrps  
    A           1        1            1                1  
    B           0        0            1                0  
    C           2        0            0                0  
    D           1        0            0                0  
    

    Edit:

    For indicator (0/1) values, use MultiLabelBinarizer :

    d = {'A': ['arylpolyene', 'hserlactone', 'hserlactone', 'nrps', 'siderophore', 't1pks-nrps', 'terpene', 'thiopeptide', 'transatpks-nrps'],
    'B': ['hserlactone', 'hserlactone-arylpolyene', 'nrps', 'siderophore', 'thiopeptide'],
    'C' :['nrps', 'nrps', 'nrps', 'siderophore', 't1pks-nrps', 't1pks-nrps'],
    'D': ['nrps', 'siderophore', 't1pks-nrps']}
    

    from sklearn.preprocessing import MultiLabelBinarizer
    
    mlb = MultiLabelBinarizer()
    df = pd.DataFrame(mlb.fit_transform(d.values()),columns=mlb.classes_, index=d.keys())
    print (df)
       arylpolyene  hserlactone  hserlactone-arylpolyene  nrps  siderophore  \
    A            1            1                        0     1            1   
    B            0            1                        1     1            1   
    C            0            0                        0     1            1   
    D            0            0                        0     1            1   
    
       t1pks-nrps  terpene  thiopeptide  transatpks-nrps  
    A           1        1            1                1  
    B           0        0            1                0  
    C           1        0            0                0  
    D           1        0            0                0  
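
    If scikit-learn is not available, the same 0/1 table can probably be derived from the count table with pandas alone. The following is only a minimal sketch under that assumption; the names counts and presence are made up for illustration and do not appear in the answer above.

```python
from collections import Counter

import pandas as pd

# Same sample dictionary as in the answer above.
d = {
    'A': ['arylpolyene', 'hserlactone', 'hserlactone', 'nrps', 'siderophore',
          't1pks-nrps', 'terpene', 'thiopeptide', 'transatpks-nrps'],
    'B': ['hserlactone', 'hserlactone-arylpolyene', 'nrps', 'siderophore', 'thiopeptide'],
    'C': ['nrps', 'nrps', 'nrps', 'siderophore', 't1pks-nrps', 't1pks-nrps'],
    'D': ['nrps', 'siderophore', 't1pks-nrps'],
}

# Count matrix: one row per key, one column per distinct value.
counts = pd.DataFrame({k: Counter(v) for k, v in d.items()}).T.fillna(0).astype(int)

# Presence/absence (hypothetical name): any count above zero becomes 1.
presence = (counts > 0).astype(int)
print(presence)
```

    Keeping counts around means both the count matrix and the indicator matrix come from a single pass over the dictionary.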
    
        2
  •  1
  •   phil    7 years ago

    Maybe you are looking for something like this:

    import pandas as pd
    
    val = {'A': ['arylpolyene', 'hserlactone', 'hserlactone', 'nrps', 'siderophore', 't1pks-nrps', 'terpene', 'thiopeptide', 'transatpks-nrps'],
           'B': ['hserlactone', 'hserlactone-arylpolyene', 'nrps', 'siderophore', 'thiopeptide'],
           'C': ['nrps', 'nrps', 'nrps', 'siderophore', 't1pks-nrps', 't1pks-nrps'],
           'D': ['nrps', 'siderophore', 't1pks-nrps']}
    all_val = []
    for k in val:
        for v in val[k]:
            all_val.append((k,v))
    
    df = pd.DataFrame(all_val,columns=['key','val']).set_index('key')
    df_count = df.pivot_table(index='key',columns='val',aggfunc=len)
    

    Output:

    val  arylpolyene  hserlactone  hserlactone-arylpolyene  nrps  siderophore  \
    key                                                                         
    A            1.0          2.0                      NaN   1.0          1.0   
    B            NaN          1.0                      1.0   1.0          1.0   
    C            NaN          NaN                      NaN   3.0          1.0   
    D            NaN          NaN                      NaN   1.0          1.0   
    
    val  t1pks-nrps  terpene  thiopeptide  transatpks-nrps  
    key                                                     
    A           1.0      1.0          1.0              1.0  
    B           NaN      NaN          1.0              NaN  
    C           2.0      NaN          NaN              NaN  
    D           1.0      NaN          NaN              NaN 
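
    The table above contains NaN and float counts. If integer counts with 0 for absent values are wanted, one possible variation is pd.crosstab, which tabulates how often each key/value pair occurs. This is a sketch of a swapped-in technique, not the method used in this answer; pairs and long_df are illustrative names.

```python
import pandas as pd

val = {'A': ['arylpolyene', 'hserlactone', 'hserlactone', 'nrps', 'siderophore',
             't1pks-nrps', 'terpene', 'thiopeptide', 'transatpks-nrps'],
       'B': ['hserlactone', 'hserlactone-arylpolyene', 'nrps', 'siderophore', 'thiopeptide'],
       'C': ['nrps', 'nrps', 'nrps', 'siderophore', 't1pks-nrps', 't1pks-nrps'],
       'D': ['nrps', 'siderophore', 't1pks-nrps']}

# Flatten the dictionary into (key, value) pairs, one row per list element.
pairs = [(k, v) for k, items in val.items() for v in items]
long_df = pd.DataFrame(pairs, columns=['key', 'val'])

# crosstab counts each (key, val) combination; absent combinations become 0.
df_count = pd.crosstab(long_df['key'], long_df['val'])
print(df_count)
```

    Alternatively, appending .fillna(0).astype(int) to a pivot_table result should give the same kind of table.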
    
        3
  •  0
  •   Marios Karatisoglou    7 years ago

    This should do the job (I used Python 3):

    my_dict = {
                'A': ['arylpolyene', 'hserlactone', 'hserlactone', 'nrps', 'siderophore', 't1pks-nrps', 'terpene', 'thiopeptide', 'transatpks-nrps'],
                'B': ['hserlactone', 'hserlactone-arylpolyene', 'nrps', 'siderophore', 'thiopeptide'],
                'C': ['nrps', 'nrps', 'nrps', 'siderophore', 't1pks-nrps', 't1pks-nrps'],
                'D': ['nrps', 'siderophore', 't1pks-nrps']
                }
    
    rows_list=list(my_dict.keys())
    values=list(my_dict.values())
    rows_size=len(rows_list)
    
    columns_list = []
    
    for sublist in values:
        for item in sublist:
            if item not in columns_list:
                columns_list.append(item)
    
    columns_size = len(columns_list)
    
    #initialize adjacent matrix
    print('Initial adjacent matrix')
    adjacent = [ [0]*columns_size for i in range(rows_size) ]
    for row in adjacent:
        print(row)
    
    for key, value in my_dict.items():
        for v in value:
            adjacent[rows_list.index(key)][columns_list.index(v)] += 1
    
    print('-'*50)
    print('Final adjacent matrix')
    for row in adjacent:
        print(row)
    

    In the first loop, for sublist in values: , I build a list of the distinct values, which serve as the columns.

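    With adjacent = [ [0]*columns_size for i in range(rows_size) ] I create a list that has as many elements as the dictionary has keys. Each element is itself a list with as many elements as there are column values.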

    I tried to keep it as simple as possible; if there is anything you can't figure out, let me know :)
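
    The two steps described above (collect the distinct column values, then fill a zero-initialized grid) can also be sketched with dictionary lookups instead of repeated list.index() calls, which rescan the list on every lookup. This is only an illustrative variant; column_index and matrix are made-up names, not part of the answer's code.

```python
my_dict = {
    'A': ['arylpolyene', 'hserlactone', 'hserlactone', 'nrps', 'siderophore',
          't1pks-nrps', 'terpene', 'thiopeptide', 'transatpks-nrps'],
    'B': ['hserlactone', 'hserlactone-arylpolyene', 'nrps', 'siderophore', 'thiopeptide'],
    'C': ['nrps', 'nrps', 'nrps', 'siderophore', 't1pks-nrps', 't1pks-nrps'],
    'D': ['nrps', 'siderophore', 't1pks-nrps'],
}

# Map each distinct value to a column position, in order of first appearance.
column_index = {}
for sublist in my_dict.values():
    for item in sublist:
        column_index.setdefault(item, len(column_index))

# One row of zeros per key, with one slot per distinct value.
matrix = {key: [0] * len(column_index) for key in my_dict}

# Increment the cell for every occurrence of a value under its key.
for key, items in my_dict.items():
    for item in items:
        matrix[key][column_index[item]] += 1

# Print a labelled table: a header row, then one line per key.
print('\t'.join([''] + list(column_index)))
for key, row in matrix.items():
    print('\t'.join([key] + [str(n) for n in row]))
```

    Because the rows are stored under their keys, the labels stay attached to the counts when the table is printed.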