
Mapping values in a DataFrame from strings to int [duplicate]

  •  0
  • Aaraeus  · 6 years ago

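    I have a dataframe called df_base which looks like this. As you can see, there is a column called Sex whose values are either male or female. I would like to map these values to 0 and 1 respectively.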

    +---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+
    |   | PassengerId | Survived | Pclass |                       Name                        |  Sex   | Age | SibSp | Parch |      Ticket      |  Fare   | Cabin | Embarked |
    +---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+
    | 0 |           1 |        0 |      3 | Braund, Mr. Owen Harris                           | male   |  22 |     1 |     0 | A/5 21171        |    7.25 | NaN   | S        |
    | 1 |           2 |        1 |      1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female |  38 |     1 |     0 | PC 17599         | 71.2833 | C85   | C        |
    | 2 |           3 |        1 |      3 | Heikkinen, Miss. Laina                            | female |  26 |     0 |     0 | STON/O2. 3101282 |   7.925 | NaN   | S        |
    | 3 |           4 |        1 |      1 | Futrelle, Mrs. Jacques Heath (Lily May Peel)      | female |  35 |     1 |     0 | 113803           |    53.1 | C123  | S        |
    | 4 |           5 |        0 |      3 | Allen, Mr. William Henry                          | male   |  35 |     0 |     0 | 373450           |    8.05 | NaN   | S        |
    +---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+
    

    I've seen a few methods on StackOverflow, but I'd like to know the most efficient way to perform the following mapping:

    +---------+---------+
    | Old Sex | New Sex |
    +---------+---------+
    | male    |       0 |
    | female  |       1 |
    | female  |       1 |
    | female  |       1 |
    | male    |       0 |
    +---------+---------+
    

    I'm using this:

    df_base['Sex'].replace(['male','female'],[0,1],inplace=True)

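    ... but I can't help feeling this is a bit shoddy. Is there a better way? There is also the option of using .loc, but that loops over the rows of the dataframe, so it is less efficient, right?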

    3 answers  |  last activity 6 years ago
        1
  •  4
  •   jezrael    6 years ago

    I think it is better/faster here to use map with a dictionary, as long as only male and female exist in the Sex column:

    df_base['Sex'] = df_base['Sex'].map(dict(zip(['male','female'],[0,1])))
    

    This is equivalent to:

    df_base['Sex'] = df_base['Sex'].map({'male': 0,'female': 1})
    

    Or, if only female and male values exist, cast a boolean mask to integer, so that True/False become 1/0:

    df_base['Sex'] = (df_base['Sex'] == 'female').astype(int)
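
    The two approaches only agree while the column holds nothing but male and female. As a minimal sketch, assuming a made-up Series with one unexpected label and one missing value (this data is hypothetical, not taken from the table in the question), map leaves unmapped entries as NaN, whereas the boolean mask quietly turns everything that is not female into 0:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 'unknown' and NaN are not keys of the mapping.
sex = pd.Series(['male', 'female', 'unknown', np.nan])

# map: values that are missing from the dictionary come back as NaN.
mapped = sex.map({'male': 0, 'female': 1})

# Boolean mask: every value that is not exactly 'female' becomes 0.
masked = (sex == 'female').astype(int)

print(mapped.tolist())
print(masked.tolist())
```

    Which behaviour is preferable depends on whether unexpected values should stand out (map) or be folded into one class (mask).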
    

    Performance:

    import numpy as np
    import pandas as pd
    import perfplot

    np.random.seed(2019)
    
    def ma(df):
        df = df.copy()
        df['Sex_new'] = df['Sex'].map({'male': 0,'female': 1})
        return df
    
    def rep1(df):
        df = df.copy()
        df['Sex'] = df['Sex'].replace(['male','female'],[0,1])
        return df
    
    def nwhere(df):
        df = df.copy()
        df['Sex_new'] = np.where(df['Sex'] == 'male', 0, 1)
        return df
    
    def mask1(df):
        df = df.copy()
        df['Sex_new'] = (df['Sex'] == 'female').astype(int)
        return df
    
    def mask2(df):
        df = df.copy()
        df['Sex_new'] = (df['Sex'].values == 'female').astype(int)
        return df
    
    
    def make_df(n):
        df = pd.DataFrame({'Sex': np.random.choice(['male','female'], size=n)})
    
        return df
    

    perfplot.show(
        setup=make_df,
        kernels=[ma,  rep1, nwhere, mask1, mask2],
        n_range=[2**k for k in range(2, 18)],
        logx=True,
        logy=True,
        equality_check=False,  # kernels write different columns, so outputs are not compared
        xlabel='len(df)')
    

    [Figure: perfplot timings of ma, rep1, nwhere, mask1 and mask2 against len(df), log-log axes]

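    Conclusion: when replacing only two values, replace is the slowest, while numpy.where, map and the boolean mask are similar. Comparing against the underlying NumPy array with .values improves performance further.
    It all also depends on the data, so it is best to test with real data.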
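
    Since the outcome depends on the data, here is a rough sketch of re-running the comparison on your own column using only the standard-library timeit module instead of perfplot. The random frame is a stand-in for real data, and the names candidates, results and timings are illustrative rather than part of the benchmark above; replace can be added to the dictionary in the same way.

```python
import timeit

import numpy as np
import pandas as pd

# Stand-in data; swap in the real column you want to convert.
rng = np.random.default_rng(2019)
df = pd.DataFrame({'Sex': rng.choice(['male', 'female'], size=10_000)})

# Candidate conversions, each taking the column and returning the new values.
candidates = {
    'map': lambda s: s.map({'male': 0, 'female': 1}),
    'where': lambda s: np.where(s == 'male', 0, 1),
    'mask': lambda s: (s == 'female').astype(int),
    'mask_values': lambda s: (s.to_numpy() == 'female').astype(int),
}

# Run every candidate once so the outputs can be compared.
results = {name: np.asarray(func(df['Sex'])) for name, func in candidates.items()}

# Best of 5 repeats of 20 calls each, reported per call in microseconds.
timings = {}
for name, func in candidates.items():
    best = min(timeit.repeat(lambda: func(df['Sex']), number=20, repeat=5))
    timings[name] = best / 20 * 1e6
    print(f'{name:12s} {timings[name]:10.1f} µs per call')
```

    Absolute numbers will vary with the machine, the pandas version and the column's dtype, so only the relative ordering on your own data is meaningful.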

        2
  •  2
  •   Niels Henkens    6 years ago

    My gut feeling would be to suggest .map(), but I compared your solution with map on a dataframe of 1500 random male/female values.

    %timeit df_base['Sex_new'] = df_base['Sex'].map({'male': 0,'female': 1})
    1000 loops, best of 3: 653 µs per loop
    

    %timeit df_base['Sex_new'] = df_base['Sex'].replace(['male','female'],[0,1])
    1000 loops, best of 3: 968 µs per loop
    

    So based on these timings, your "shoddy" solution seems slower than .map() (968 µs vs. 653 µs per loop).

    Edit

    pygo's solution:

    %timeit df_base['Sex_new'] = np.where(df_base['Sex'] == 'male', 0, 1)
    1000 loops, best of 3: 331 µs per loop
    

    The boolean-mask solution with .astype(int):

    %timeit df_base['Sex_new'] = (df_base['Sex'] == 'female').astype(int)
    1000 loops, best of 3: 388 µs per loop
    

    Both are faster than .map() and .replace().

        3
  •  1
  •   Karn Kumar    6 years ago

    As another solution, you can use np.where.

    Just a sample dataframe:

    >>> df
          Sex
    0    male
    1  female
    2  female
    3  female
    4    male
    

    Create a new column new_Sex based on the condition:

    >>> df['new_Sex'] = np.where(df['Sex'] == 'male', 0, 1)
    >>> df
          Sex  new_Sex
    0    male        0
    1  female        1
    2  female        1
    3  female        1
    4    male        0
    

    Or, with the condition inverted:

    >>> df['new_Sex'] = np.where(df['Sex'] != 'male', 1, 0)
    >>> df
          Sex  new_Sex
    0    male        0
    1  female        1
    2  female        1
    3  female        1
    4    male        0
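
    The session above assumes that df and np already exist. A self-contained sketch that rebuilds the five-row frame shown and checks that both conditions produce the same column could look like this (the variable names first and second are illustrative):

```python
import numpy as np
import pandas as pd

# Rebuild the five-row example frame shown in the session above.
df = pd.DataFrame({'Sex': ['male', 'female', 'female', 'female', 'male']})

# Variant 1: 0 where the value is 'male', otherwise 1.
first = np.where(df['Sex'] == 'male', 0, 1)

# Variant 2: the inverted condition with the two outputs swapped.
second = np.where(df['Sex'] != 'male', 1, 0)

df['new_Sex'] = first
print(df)
print((first == second).all())
```

    Note that np.where returns a NumPy array rather than a Series; assigning it to a column works because its length matches the frame.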