
Is there a solution (e.g. with numba or Cython) to `transform`/`apply` with an index over a MultiIndex'd DataFrame?

  •  0
  • A T  · 6 years ago

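    Is there a solution (e.g. with numba or Cython) to transform / apply with an index?

    I know I could use iterrows , itertuples , iteritems or items . But what I want to do should be trivial to vectorize. I have built a simple proxy for my actual use case ( runnable code ):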

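    import numpy as np
    import pandas as pd
    from six import string_types  # assumption: `string_types` comes from `six`

    df = pd.DataFrame(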
        np.random.randn(8, 4),
        index=[np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
               np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])])
    
    namednumber2numbername = {
        'one': ('zero', 'one', 'two', 'three', 'four',
                'five', 'six', 'seven', 'eight', 'nine'),
        'two': ('i',    'ii',  'iii', 'iv',    'v',
                'vi',   'vii', 'viii',  'ix',    'x')
    }
    
    def namednumber2numbername_applicator(series):        
        def to_s(value):
            if pd.isnull(value) or isinstance(value, string_types): return value
            value = np.ushort(value)
            if value > 10: return value
    
            # TODO: Figure out idx of `series.name` at this `value`… instead of `'one'`
    
            return namednumber2numbername['one'][value]
    
        return series.apply(to_s)
    
    df.transform(namednumber2numbername_applicator)
    

    Actual output

                 0      1      2      3
    bar one   zero   zero    one  65535
        two   zero   zero   zero   zero
    baz one   zero   zero   zero   zero
        two   zero    two   zero   zero
    foo one  65535   zero   zero   zero
        two   zero  65535  65534   zero
    qux one   zero    one   zero   zero
        two   zero   zero   zero   zero
    

    Desired output

                 0      1      2     3
    bar one   zero   zero    one  65535
        two      i      i      i      i
    baz one   zero   zero   zero   zero
        two      i    iii      i      i
    foo one  65535   zero   zero   zero
        two      i  65535  65534      i
    qux one   zero    one   zero   zero
        two      i      i      i      i
    

    Possibly related: How to query MultiIndex index columns values in pandas

    Basically, I am looking for the same behavior as JavaScript's Array.prototype.map (which passes along the idx ).

        1
  •  3
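  •   oppressionslayer    6 years ago

    I wrote a very fast version of the transform to get these results. You can call np.ushort inside the list comprehension and it is still fast, but it is much faster outside: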

    import time
    df = pd.DataFrame(
        np.random.randn(8, 4**7),
        index=[np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
               np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])])
    
    start = time.time()
    df.loc[:,] = np.ushort(df)
    df = df.transform(lambda x: [ i if i> 10 else namednumber2numbername[x.name[1]][i] for i in x], axis=1)
    end = time.time()
    print(end - start)
    
    # 1.150895118713379
    
    

    Here is the timing on the original:

    df = pd.DataFrame( np.random.randn(8, 4),
         index=[np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']), 
               np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]) 
    
    start = time.time() 
    df.loc[:,] = np.ushort(df) 
    df = df.transform(lambda x: [ i if i> 10 else namednumber2numbername[x.name[1]][i] for i in x], axis=1) 
    end = time.time() 
    print(end - start)                                                                                                                                                                   
    # 0.005067110061645508
    
    In [453]: df                                                                                                                                                                                   
    Out[453]: 
                 0     1      2     3
    bar one   zero  zero    one  zero
        two      i     i      i     i
    baz one   zero  zero   zero  zero
        two      i     i     ii     i
    foo one  65535  zero  65535  zero
        two      i     i      i     i
    qux one   zero  zero   zero  zero
        two      i     i      i    ii
    
    

    I got it down to one line:

    df.transform(lambda x: [ np.ushort(value) if np.ushort(value) > 10 else namednumber2numbername[pos[1]][np.ushort(value)] for pos, value in x.items()])                              
    
                 0     1      2     3
    bar one   zero  zero   zero  zero
        two      i     i     ii     i
    baz one  65534  zero  65535  zero
        two     ii     i  65535     i
    foo one   zero  zero   zero  zero
        two     ii     i      i    ii
    qux one  65535  zero   zero  zero
        two      i     i      i     i
    

    OK, a version without .items():

    
    def what(x): 
       if type(x[0]) == np.float64: 
          if np.ushort(x[0])>10: 
             return np.ushort(x[0]) 
          else: 
             return(namednumber2numbername[x.index[0][1]][np.ushort(x[0])]) 
    
    df.groupby(level=[0,1]).transform(what)
    
                0     1      2      3
    bar one  zero   one   zero   zero
        two     i    ii  65535      i
    baz one  zero  zero  65535   zero
        two     i     i      i      i
    foo one  zero   one   zero   zero
        two     i     i      i      i
    qux one   two  zero   zero  65534
        two     i     i      i     ii
    

    And a one-liner, with no .items() as you requested! We group by levels 0 and 1, then perform the calculation to determine the values:

    df.groupby(level=[0,1]).transform(lambda x: np.ushort(x[0]) if type(x[0]) == np.float64 and np.ushort(x[0]) >10 else namednumber2numbername[x.index[0][1]][np.ushort(x[0])])
    
                0     1      2      3
    bar one  zero   one   zero   zero
        two     i    ii  65535      i
    baz one  zero  zero  65535   zero
        two     i     i      i      i
    foo one  zero   one   zero   zero
        two     i     i      i      i
    qux one   two  zero   zero  65534
        two     i     i      i     ii
    
    

    To get the other values, I did the following:

    df.transform(lambda x: [ str(x.name[0]) + '_' + str(x.name[1]) + '_' + str( pos)+ '_' +str(value) for pos,value in x.items()])
    
    print('Transformed DataFrame:\n',
          df.transform(what), sep='')
    
    Transformed DataFrame:
                                 α                                                        ...                          ω                                                       ε
    f                            a                          b                          c  ...                          b                           c                           j
    one  α_a_one_79.96465755359696  α_b_one_31.32938096131651   α_c_one_2.61444370203201  ...   ω_b_one_35.7457972161041  ω_c_one_40.224465043054195  ε_j_one_43.527184108357496
    two  α_a_two_42.66244395377804  α_b_two_65.92020941618344  α_c_two_77.26467264185487  ...  ω_b_two_40.91908469505522  ω_c_two_50.395561828234555   ε_j_two_71.67418483119914
    one   α_a_one_47.9769845681328  α_b_one_38.90671671550259  α_c_one_67.13601594352508  ...  ω_b_one_23.23799084164898  ω_c_one_63.551178212994465  ε_j_one_16.975582723809303
    

    Here is one without .items():

    df.transform(lambda x: ['_'.join((x.name[0], x.name[1], x.index[0], str(i) if type(i) == float else '0')) for i in list(x)])
    

    Output

                                 α                                                        ...                          ω                                                       ε
    f                            a                          b                          c  ...                          b                           c                           j
    one  α_a_one_79.96465755359696  α_b_one_31.32938096131651   α_c_one_2.61444370203201  ...   ω_b_one_35.7457972161041  ω_c_one_40.224465043054195  ε_j_one_43.527184108357496
    two  α_a_two_42.66244395377804  α_b_two_65.92020941618344  α_c_two_77.26467264185487  ...  ω_b_two_40.91908469505522  ω_c_two_50.395561828234555   ε_j_two_71.67418483119914
    one   α_a_one_47.9769845681328  α_b_one_38.90671671550259  α_c_one_67.13601594352508  ...  ω_b_one_23.23799084164898  ω_c_one_63.551178212994465  ε_j_one_16.975582723809303
    

    I also did it without grouping:

    df.T.apply(lambda x: x.name[0] + '_'+ x.name[1] + '_' + df.T.eq(x).columns + '_' + x.astype(str) ,  axis=1).T
    
    or even better and most simple:
    
    df.T.apply(lambda x: x.name[0] + '_'+ x.name[1] + '_' + x.index + '_' + x.astype(str) ,  axis=1).T 
    
    or 
    
    df.T.transform(lambda x: x.name[0] + '_'+ x.name[1] + '_' + x.index + '_' + x.astype(str) ,  axis=1).T 
    
    or with no .T:
    
    df.transform(lambda x: x.index[0][0] + '_'+ x.index[0][1] + '_' + x.name + '_' + x.astype(str) ,  axis=1) 
                                 α                                                        ...                          ω                                                       ε
    f                            a                          b                          c  ...                          b                           c                           j
    one  α_a_one_79.96465755359696  α_b_one_31.32938096131651   α_c_one_2.61444370203201  ...   ω_b_one_35.7457972161041  ω_c_one_40.224465043054195  ε_j_one_43.527184108357496
    two  α_a_two_42.66244395377804  α_b_two_65.92020941618344  α_c_two_77.26467264185487  ...  ω_b_two_40.91908469505522  ω_c_two_50.395561828234555   ε_j_two_71.67418483119914
    one   α_a_one_47.9769845681328  α_b_one_38.90671671550259  α_c_one_67.13601594352508  ...  ω_b_one_23.23799084164898  ω_c_one_63.551178212994465  ε_j_one_16.975582723809303
    
        2
  •  3
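  •   Stef    6 years ago

    By default, transform applies the function to each column. You can apply it to each row instead by specifying the axis argument as 1 or 'columns' . You then have access to the row's index and can pass its second name field to your function: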

    def namednumber2numbername_applicator(series):
        def to_s(value, name):
            if pd.isnull(value): return value
            value = np.ushort(value)
            if value > 10: return value

            return namednumber2numbername[name][value]

        # With axis=1, series.name is the row's index tuple; pass its second field.
        return series.apply(to_s, args=(series.name[1],))

    df.transform(namednumber2numbername_applicator, axis=1)
    

    Result:

                 0      1      2      3
    bar one  65535   zero   zero  65535
        two     ii      i    iii  65535
    baz one  65535   zero   zero  65535
        two      i      i  65535      i
    foo one   zero   zero   zero   zero
        two      i  65535      i      i
    qux one   zero   zero   zero  65535
        two      i      i      i      i
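
    The key point above is that, with axis=1, each row arrives as a Series whose name is the row's MultiIndex tuple. The following is a minimal sketch of just that mechanism, not Stef's exact code: the frame `demo`, the shortened table `names`, and the helper `label_row` are hypothetical, and fixed small integers are used instead of random data so the behavior is easy to follow.

```python
import pandas as pd

# Small MultiIndex frame with fixed, illustrative values (assumption).
idx = pd.MultiIndex.from_arrays([["bar", "bar", "baz", "baz"],
                                 ["one", "two", "one", "two"]])
demo = pd.DataFrame([[0, 1], [1, 0], [2, 2], [0, 3]], index=idx)

# Hypothetical lookup table, shortened from the question's dictionary.
names = {"one": ("zero", "one", "two", "three"),
         "two": ("i", "ii", "iii", "iv")}

def label_row(row):
    # With axis=1, row.name is the row's index tuple, e.g. ("bar", "two").
    key = row.name[1]
    # Look each cell value up in the tuple selected by the second index level.
    return row.map(lambda v: names[key][v])

labelled = demo.apply(label_row, axis=1)
print(labelled)
```

    Because the second index level is read once per row, the per-cell function only needs the value itself, which is what the question's TODO was missing.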
    
        3
  •  2
  •   anky    6 years ago

    Here is another way, using reindex and np.where() :

    def myf(dataframe,dictionary):
        cond1=dataframe.isna()
        cond2=np.ushort(dataframe)>10
        m=(pd.DataFrame.from_dict(dictionary,orient='index')
                              .reindex(dataframe.index.get_level_values(1)))
        m.index=pd.MultiIndex.from_arrays((dataframe.index.get_level_values(0),m.index))
        arr=np.where(cond1|cond2,np.ushort(dataframe),
                                     m[m.columns.intersection(dataframe.columns)])
        return pd.DataFrame(arr,dataframe.index,dataframe.columns)
    

    myf(df,namednumber2numbername)
    

                 0      1      2      3
    bar one   zero    one    two  three
        two  65535     ii    iii  65535
    baz one   zero    one  65535  three
        two      i     ii    iii     iv
    foo one   zero  65535    two  three
        two      i     ii    iii     iv
    qux one   zero  65535    two  65535
        two      i     ii    iii     iv
    

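    The steps are:

    • The function builds a dataframe ( m ) from the dictionary and reindexes it to match the original data's second index level.
    • After that, we add an extra level so that m has the same MultiIndex as the original dataframe. (Print m inside the function to see it.)
    • Then we check the conditions: whether the dataframe is null, or whether its np.ushort value is greater than 10.
    • Where a condition matches, return the np.ushort of the dataframe; otherwise return the value from the matching column in m .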

    Let me know if there is any step I have missed or that you would like incorporated, since I think this is a way to avoid row-wise computation.
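
    For comparison, here is a related sketch of avoiding row-wise computation that looks each cell up by its value rather than by its column label. It is not the function above: it swaps reindex for NumPy integer-array indexing, and `demo`, `names`, `table`, and `row_pos` are hypothetical names on fixed illustrative data.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays([["bar", "bar", "baz", "baz"],
                                 ["one", "two", "one", "two"]])
demo = pd.DataFrame([[0, 1], [65535, 0], [2, 2], [0, 3]], index=idx)

# Hypothetical lookup table, shortened from the question's dictionary.
names = {"one": ("zero", "one", "two", "three"),
         "two": ("i", "ii", "iii", "iv")}

# Stack the tuples into a 2-D object array: one row per second-level key.
keys = list(names)
table = np.array([names[k] for k in keys], dtype=object)

# For every row of the frame, the matching row position in `table`.
row_pos = pd.Index(keys).get_indexer(demo.index.get_level_values(1))

values = demo.to_numpy()
in_range = (values >= 0) & (values < table.shape[1])
# Replace out-of-range values by 0 so the index is valid; they are masked out below.
safe = np.where(in_range, values, 0)
looked_up = table[row_pos[:, None], safe]

result = pd.DataFrame(np.where(in_range, looked_up, values),
                      index=demo.index, columns=demo.columns)
print(result)
```

    The lookup happens in one indexing operation over the whole array, so no Python-level function is called per row or per cell.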

        4
  •  1
  •   Meow David Fraser    6 years ago

    An example using Series.map :

    class dict_default_key(dict):
        def __missing__(self, key):
            return key
    
    
    number_names = [
        'zero',
        'one',
        'two',
        'three',
        'four',
        'five',
        'six',
        'seven',
        'eight',
        'nine'
    ]
    roman_numerals = [
        'i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x'
    ]
    name_mapping = {
        'one': dict_default_key(
            {c: v for c, v in enumerate(number_names)}
        ),
        'two': dict_default_key(
            {c: v for c, v in enumerate(roman_numerals)}
        )
    }
    
    def translate(series):
        key = series.name[1]
        row_map = name_mapping[key]
        result = series.map(row_map)
        return result
    
    ushorts = df.apply(np.ushort)
    ushorts.apply(translate, axis=1)
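
    The approach above relies on how Series.map treats a dict subclass that defines __missing__: keys that are absent go through __missing__ instead of becoming NaN. A minimal sketch of that behavior in isolation, where `lookup`, `s`, `mapped`, and `plain` are hypothetical names on illustrative data:

```python
import pandas as pd

class dict_default_key(dict):
    # Called for keys that are not in the dict: fall back to the key itself.
    def __missing__(self, key):
        return key

lookup = dict_default_key({0: "zero", 1: "one", 2: "two"})
s = pd.Series([0, 2, 65535, 1])

# With __missing__ defined, unmapped values pass through unchanged.
mapped = s.map(lookup)

# With a plain dict, unmapped values become NaN instead.
plain = s.map(dict(lookup))

print(mapped.tolist())
print(plain.tolist())
```

    This is why the out-of-range values such as 65535 survive the translation without an explicit `if value > 10` check.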
    
        5
  •  0
  •   Yaakov Bressler    6 years ago

    Here is how I would go about solving this:

    # 1. Rewrite functions to include a parameter for `idx`
    def some_fun_name(value, idx):  
        value = np.ushort(value)
        if value > 10: 
            return value
        else:
            return namednumber2numbername[idx][value]
    
    def apply_some_fun_name(s):  
        idx = list(s.index.get_level_values(1).unique())[0]
        return s.transform(some_fun_name, idx=idx)
    
    # 2. Apply function over the keys of the multi-index, replacing while operating:
    df = df.groupby(level=1).transform(apply_some_fun_name)
    
    # 3. I got the following result while using `np.random.seed(1)`:
                 0      1     2      3
    bar one    one   zero  zero  65535
        two      i  65534    ii      i
    baz one   zero   zero   one  65534
        two      i      i    ii  65535
    foo one   zero   zero  zero   zero
        two  65535     ii     i      i
    qux one   zero   zero  zero   zero
        two      i      i     i      i