
Python DataFrame: generating columns that represent an indicator vector

  •  3
  • Nihal Saranga  · Tech Community  · 7 years ago

    I have a DataFrame whose `genres` column holds a list of genres per row:

    import pandas as pd

    df = pd.DataFrame({'genres': [['Drama'], ['Music', 'Drama', 'Romance'],
                                   ['Action', 'Adventure', 'Comedy'],
                                   ['Thriller', 'Romance', 'Drama'],
                                   ['Adventure', 'Family']]
                        })
    print(df)
    genres = ['Action', 'Adventure', 'Comedy', 'Drama', 'Family', 'Music', 'Romance', 'Thriller']  # list of all genres
    

    Data:

                            genres
    0                      [Drama]
    1      [Music, Drama, Romance]
    2  [Action, Adventure, Comedy]
    3   [Thriller, Romance, Drama]
    4          [Adventure, Family]
    

    Desired output, with one 0/1 column per genre:

                            genres  Action  Adventure  Comedy  Drama  Family  \
    0                      [Drama]       0          0       0      1       0   
    1      [Music, Drama, Romance]       0          0       0      1       0   
    2  [Action, Adventure, Comedy]       1          1       1      0       0   
    3   [Thriller, Romance, Drama]       0          0       0      1       0   
    4          [Adventure, Family]       0          1       0      0       1   
    
       Music  Romance  Thriller  
    0      0        0         0  
    1      1        1         0  
    2      0        0         0  
    3      0        1         1  
    4      0        0         0  
    
    1 answer  |  as of 7 years ago
        1
  •  6
  •   jezrael    7 years ago

    Use MultiLabelBinarizer:

    from sklearn.preprocessing import MultiLabelBinarizer

    mlb = MultiLabelBinarizer()

    # fit_transform returns a 0/1 matrix; mlb.classes_ holds the sorted label names
    df1 = pd.DataFrame(mlb.fit_transform(df['genres']), columns=mlb.classes_, index=df.index)
    df = df.join(df1)
    print(df)
                            genres  Action  Adventure  Comedy  Drama  Family  \
    0                      [Drama]       0          0       0      1       0   
    1      [Music, Drama, Romance]       0          0       0      1       0   
    2  [Action, Adventure, Comedy]       1          1       1      0       0   
    3   [Thriller, Romance, Drama]       0          0       0      1       0   
    4          [Adventure, Family]       0          1       0      0       1   
    
       Music  Romance  Thriller  
    0      0        0         0  
    1      1        1         0  
    2      0        0         0  
    3      0        1         1  
    4      0        0         0  
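
For comparison, here is a pandas-only sketch that produces the same indicator columns without scikit-learn. It is an alternative to the approach in this answer rather than part of it, and it assumes a pandas version that provides `Series.explode` (0.25 or later). The names `exploded`, `indicators`, and `result` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'genres': [['Drama'], ['Music', 'Drama', 'Romance'],
                              ['Action', 'Adventure', 'Comedy'],
                              ['Thriller', 'Romance', 'Drama'],
                              ['Adventure', 'Family']]})

# explode() yields one row per (original index, genre) pair
exploded = df['genres'].explode()

# get_dummies() one-hot encodes each exploded row; taking the max per
# original index collapses them back into one 0/1 row per DataFrame row
indicators = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()

result = df.join(indicators)
print(result)
```

The `groupby(level=0).max()` step is what merges the several one-hot rows belonging to the same original row; using `sum()` instead would also work as long as no list repeats a genre.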
    

    If you need to restrict the columns to a given list of genres, add reindex:

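    genres = ['Action', 'Adventure', 'Comedy', 'Drama']

    # Run this on the original single-column df: joining onto a df that already
    # has the indicator columns raises an error because the column names overlap
    df1 = pd.DataFrame(mlb.fit_transform(df['genres']), columns=mlb.classes_, index=df.index)
    df = df.join(df1.reindex(columns=genres, fill_value=0))
    print(df)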
                            genres  Action  Adventure  Comedy  Drama
    0                      [Drama]       0          0       0      1
    1      [Music, Drama, Romance]       0          0       0      1
    2  [Action, Adventure, Comedy]       1          1       1      0
    3   [Thriller, Romance, Drama]       0          0       0      1
    4          [Adventure, Family]       0          1       0      0
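
To make the role of `fill_value=0` clearer, the following minimal sketch shows that `reindex(columns=...)` both drops columns missing from the list and creates columns for listed names that are absent from the data. The small frame and the `'Horror'` genre are hypothetical, invented here only to illustrate the behavior.

```python
import pandas as pd

# Hypothetical indicator frame standing in for df1 from the answer
df1 = pd.DataFrame({'Action': [0, 1], 'Drama': [1, 0], 'Music': [1, 0]})

# 'Horror' never occurs in df1; 'Music' is not in the wanted list
wanted = ['Action', 'Drama', 'Horror']

# Columns not in `wanted` are dropped; columns only in `wanted` are
# created and filled with 0 instead of the default NaN
filtered = df1.reindex(columns=wanted, fill_value=0)
print(filtered)
```

Without `fill_value=0`, the new `Horror` column would hold NaN, which would also turn it into a float column.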