
Is there a solution (e.g. with numba or Cython) to `transform`/`apply` with the index, on a MultiIndexed DataFrame?

pandas-apply series dataframe pandas python

A T · 6 years ago

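Is there a solution (e.g. with numba or Cython) to transform / apply with the index?

I know I can use iterrows, itertuples, iteritems, or items. But what I want to do should be trivial to vectorize. I have built a simple proxy for my actual use case (runnable code):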

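import numpy as np
import pandas as pd
from six import string_types  # assumed source of string_types; on Python 3, (str,) also works

df = pd.DataFrame(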
    np.random.randn(8, 4),
    index=[np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])])

namednumber2numbername = {
    'one': ('zero', 'one', 'two', 'three', 'four',
            'five', 'six', 'seven', 'eight', 'nine'),
    'two': ('i',    'ii',  'iii', 'iv',    'v',
            'vi',   'vii', 'viii',  'ix',    'x')
}

def namednumber2numbername_applicator(series):        
    def to_s(value):
        if pd.isnull(value) or isinstance(value, string_types): return value
        value = np.ushort(value)
        if value > 10: return value

        # TODO: Figure out idx of `series.name` at this `value`... instead of `'one'`

        return namednumber2numbername['one'][value]

    return series.apply(to_s)

df.transform(namednumber2numbername_applicator)

Actual output

             0      1      2      3
bar one   zero   zero    one  65535
    two   zero   zero   zero   zero
baz one   zero   zero   zero   zero
    two   zero    two   zero   zero
foo one  65535   zero   zero   zero
    two   zero  65535  65534   zero
qux one   zero    one   zero   zero
    two   zero   zero   zero   zero

Desired output

             0      1      2     3
bar one   zero   zero    one  65535
    two      i      i      i      i
baz one   zero   zero   zero   zero
    two      i    iii      i      i
foo one  65535   zero   zero   zero
    two      i  65535  65534      i
qux one   zero    one   zero   zero
    two      i      i      i      i

Basically, I'm looking for the same behaviour as JavaScript's Array.prototype.map (which passes along the idx).
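For reference, the index-aware mapping being asked for can be written as a deliberately non-vectorized baseline. This is only a sketch: `map_with_index`, `demo`, and `names` are hypothetical names made up for illustration, and the loop uses `iterrows`, so it shows the desired semantics rather than a fast solution.

```python
import pandas as pd


def map_with_index(df, func):
    """Call func(value, row_label, col_label) for every cell (hypothetical helper)."""
    return pd.DataFrame(
        [[func(value, row_label, col_label) for col_label, value in row.items()]
         for row_label, row in df.iterrows()],
        index=df.index,
        columns=df.columns,
    )


# Small integer frame with a two-level row index, mirroring the question's layout.
index = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]])
demo = pd.DataFrame([[0, 1], [1, 2], [2, 0], [0, 0]], index=index)
names = {"one": ("zero", "one", "two"), "two": ("i", "ii", "iii")}

# The second element of the row label selects which tuple to look the value up in.
labelled = map_with_index(demo, lambda value, row_label, col_label: names[row_label[1]][value])
print(labelled)
```

Any faster approach should produce the same frame as this baseline on the same input.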


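oppressionslayer 6 years ago

I wrote a very fast version of the transform to get these results. You can also call np.ushort inside the generator and it is still fast, but doing it outside was much faster: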

import time
df = pd.DataFrame(
    np.random.randn(8, 4**7),
    index=[np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])])

start = time.time()
df.loc[:,] = np.ushort(df)
df = df.transform(lambda x: [ i if i> 10 else namednumber2numbername[x.name[1]][i] for i in x], axis=1)
end = time.time()
print(end - start)

# 1.150895118713379

Here is the timing on the original:

df = pd.DataFrame( np.random.randn(8, 4),
     index=[np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']), 
           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]) 

start = time.time() 
df.loc[:,] = np.ushort(df) 
df = df.transform(lambda x: [ i if i> 10 else namednumber2numbername[x.name[1]][i] for i in x], axis=1) 
end = time.time() 
print(end - start)
# 0.005067110061645508

In [453]: df
Out[453]: 
             0     1      2     3
bar one   zero  zero    one  zero
    two      i     i      i     i
baz one   zero  zero   zero  zero
    two      i     i     ii     i
foo one  65535  zero  65535  zero
    two      i     i      i     i
qux one   zero  zero   zero  zero
    two      i     i      i    ii
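The point about casting outside versus inside the comprehension can be checked with a small timing sketch. The names `bench`, `names`, `cast_outside`, and `cast_inside` are made up for this example, and the rows are built with plain comprehensions rather than `transform` so that only the placement of the cast differs. It uses non-negative random data and the threshold `> 9` (rather than `> 10`) because the lookup tuples have ten entries. Timings will vary by machine, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

names = {
    "one": ("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"),
    "two": ("i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"),
}
index = pd.MultiIndex.from_product([["bar", "baz", "foo", "qux"], ["one", "two"]])
rng = np.random.default_rng(0)
# Non-negative floats only, so the cast to unsigned short is well defined.
bench = pd.DataFrame(rng.uniform(0, 20, size=(8, 256)), index=index)


def cast_outside(frame):
    # One vectorized cast for the whole array, then a per-row lookup.
    values = frame.to_numpy().astype(np.ushort)
    rows = [[v if v > 9 else names[label[1]][v] for v in row]
            for label, row in zip(frame.index, values)]
    return pd.DataFrame(rows, index=frame.index, columns=frame.columns)


def cast_inside(frame):
    # The cast is repeated for every cell inside the comprehension.
    rows = [[np.ushort(v) if np.ushort(v) > 9 else names[label[1]][np.ushort(v)] for v in row]
            for label, row in zip(frame.index, frame.to_numpy())]
    return pd.DataFrame(rows, index=frame.index, columns=frame.columns)


print("outside:", timeit.timeit(lambda: cast_outside(bench), number=3))
print("inside: ", timeit.timeit(lambda: cast_inside(bench), number=3))
```

Both functions should return the same frame; only the time spent differs.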

I turned it into a one-liner:

df.transform(lambda x: [ np.ushort(value) if np.ushort(value) > 10 else namednumber2numbername[pos[1]][np.ushort(value)] for pos, value in x.items()])

             0     1      2     3
bar one   zero  zero   zero  zero
    two      i     i     ii     i
baz one  65534  zero  65535  zero
    two     ii     i  65535     i
foo one   zero  zero   zero  zero
    two     ii     i      i    ii
qux one  65535  zero   zero  zero
    two      i     i      i     i

OK, here is a version without .items():


def what(x): 
   if type(x[0]) == np.float64: 
      if np.ushort(x[0])>10: 
         return np.ushort(x[0]) 
      else: 
         return(namednumber2numbername[x.index[0][1]][np.ushort(x[0])]) 

df.groupby(level=[0,1]).transform(what)

            0     1      2      3
bar one  zero   one   zero   zero
    two     i    ii  65535      i
baz one  zero  zero  65535   zero
    two     i     i      i      i
foo one  zero   one   zero   zero
    two     i     i      i      i
qux one   two  zero   zero  65534
    two     i     i      i     ii

And a one-liner, with no .items(), as you requested! We group by levels 0 and 1, then perform the calculation to determine the values:

df.groupby(level=[0,1]).transform(lambda x: np.ushort(x[0]) if type(x[0]) == np.float64 and np.ushort(x[0]) >10 else namednumber2numbername[x.index[0][1]][np.ushort(x[0])])

            0     1      2      3
bar one  zero   one   zero   zero
    two     i    ii  65535      i
baz one  zero  zero  65535   zero
    two     i     i      i      i
foo one  zero   one   zero   zero
    two     i     i      i      i
qux one   two  zero   zero  65534
    two     i     i      i     ii

To get the other values, I did the following:

df.transform(lambda x: [ str(x.name[0]) + '_' + str(x.name[1]) + '_' + str( pos)+ '_' +str(value) for pos,value in x.items()])

print('Transformed DataFrame:\n',
      df.transform(what), sep='')

Transformed DataFrame:
                             α                                                        ...                          Ï                                                       ε
f                            a                          b                          c  ...                          b                           c                           j
one  α_a_one_79.96465755359696  α_b_one_31.32938096131651   α_c_one_2.61444370203201  ...   Ï_b_one_35.7457972161041  Ï_c_one_40.224465043054195  ε_j_one_43.527184108357496
two  α_a_two_42.66244395377804  α_b_two_65.92020941618344  α_c_two_77.26467264185487  ...  Ï_b_two_40.91908469505522  Ï_c_two_50.395561828234555   ε_j_two_71.67418483119914
one   α_a_one_47.9769845681328  α_b_one_38.90671671550259  α_c_one_67.13601594352508  ...  Ï_b_one_23.23799084164898  Ï_c_one_63.551178212994465  ε_j_one_16.975582723809303

Here is one without .items():

df.transform(lambda x: ['_'.join((x.name[0], x.name[1], x.index[0], str(i) if type(i) == float else 0)) for i in list(x)])

Output

                             α                                                        ...                          Ï                                                       ε
f                            a                          b                          c  ...                          b                           c                           j
one  α_a_one_79.96465755359696  α_b_one_31.32938096131651   α_c_one_2.61444370203201  ...   Ï_b_one_35.7457972161041  Ï_c_one_40.224465043054195  ε_j_one_43.527184108357496
two  α_a_two_42.66244395377804  α_b_two_65.92020941618344  α_c_two_77.26467264185487  ...  Ï_b_two_40.91908469505522  Ï_c_two_50.395561828234555   ε_j_two_71.67418483119914
one   α_a_one_47.9769845681328  α_b_one_38.90671671550259  α_c_one_67.13601594352508  ...  Ï_b_one_23.23799084164898  Ï_c_one_63.551178212994465  ε_j_one_16.975582723809303

I also did it without grouping:

df.T.apply(lambda x: x.name[0] + '_'+ x.name[1] + '_' + df.T.eq(x).columns + '_' + x.astype(str) ,  axis=1).T

or even better and most simple:

df.T.apply(lambda x: x.name[0] + '_'+ x.name[1] + '_' + x.index + '_' + x.astype(str) ,  axis=1).T 

or 

df.T.transform(lambda x: x.name[0] + '_'+ x.name[1] + '_' + x.index + '_' + x.astype(str) ,  axis=1).T 

or with no .T:

df.transform(lambda x: x.index[0][0] + '_'+ x.index[0][1] + '_' + x.name + '_' + x.astype(str) ,  axis=1) 
                             α                                                        ...                          Ï                                                       ε
f                            a                          b                          c  ...                          b                           c                           j
one  α_a_one_79.96465755359696  α_b_one_31.32938096131651   α_c_one_2.61444370203201  ...   Ï_b_one_35.7457972161041  Ï_c_one_40.224465043054195  ε_j_one_43.527184108357496
two  α_a_two_42.66244395377804  α_b_two_65.92020941618344  α_c_two_77.26467264185487  ...  Ï_b_two_40.91908469505522  Ï_c_two_50.395561828234555   ε_j_two_71.67418483119914
one   α_a_one_47.9769845681328  α_b_one_38.90671671550259  α_c_one_67.13601594352508  ...  Ï_b_one_23.23799084164898  Ï_c_one_63.551178212994465  ε_j_one_16.975582723809303

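Stef 6 years ago

By default, transform applies the function to each column. You can apply it to each row instead by passing the axis argument as 1 or 'columns'. You then have access to the row index and can pass its second name field to your function: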

def namednumber2numbername_applicator(series):
    def to_s(value, name):
        if pd.isnull(value): return value
        value = np.ushort(value)
        if value > 10: return value

        return namednumber2numbername[name][value]

    return series.apply(to_s, args=((series.name[1]),))

df.transform(namednumber2numbername_applicator, 1)

Result:

             0      1      2      3
bar one  65535   zero   zero  65535
    two     ii      i    iii  65535
baz one  65535   zero   zero  65535
    two      i      i  65535      i
foo one   zero   zero   zero   zero
    two      i  65535      i      i
qux one   zero   zero   zero  65535
    two      i      i      i      i
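To make the axis behaviour described in this answer concrete, here is a minimal sketch of what `.name` holds under each axis. The frame `demo` is invented for illustration, and `apply` is used instead of `transform` only because returning one scalar per row or column is not a valid transform; as far as I know the axis semantics are the same for both.

```python
import pandas as pd

index = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]])
demo = pd.DataFrame([[0, 1], [1, 2], [2, 0], [0, 0]], index=index, columns=["a", "b"])

# Default axis=0: the function receives one column at a time,
# so .name is the column label and the row labels are not directly visible.
column_names = demo.apply(lambda s: s.name)

# axis=1: the function receives one row at a time,
# so .name is the row's MultiIndex tuple and [1] is its second level.
second_level = demo.apply(lambda s: s.name[1], axis=1)

print(column_names)
print(second_level)
```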

anky 6 years ago

Here is another approach using reindex and np.where():

def myf(dataframe,dictionary):
    cond1=dataframe.isna()
    cond2=np.ushort(dataframe)>10
    m=(pd.DataFrame.from_dict(dictionary,orient='index')
                          .reindex(dataframe.index.get_level_values(1)))
    m.index=pd.MultiIndex.from_arrays((dataframe.index.get_level_values(0),m.index))
    arr=np.where(cond1|cond2,np.ushort(dataframe),
                                 m[m.columns.intersection(dataframe.columns)])
    return pd.DataFrame(arr,dataframe.index,dataframe.columns)

myf(df,namednumber2numbername)

             0      1      2      3
bar one   zero    one    two  three
    two  65535     ii    iii  65535
baz one   zero    one  65535  three
    two      i     ii    iii     iv
foo one   zero  65535    two  three
    two      i     ii    iii     iv
qux one   zero  65535    two  65535
    two      i     ii    iii     iv

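The steps are:

This function uses the dictionary to create a DataFrame (m), reindexed on the second level of the original data's index.

After that, we add an extra level so that m has the same MultiIndex as the original DataFrame. (Print m inside the function to see it.)

Then we check whether the DataFrame value is null or its np.ushort value is greater than 10.

If the condition matches, return the np.ushort of the DataFrame value; otherwise return the value from the matching column of m.

Please let me know if there is any step I have missed or that you would like to incorporate, since I feel this is one way to avoid row-wise computation.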
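A related vectorized sketch, not the function above, indexes a lookup table by each cell's value instead of by column label, using NumPy fancy indexing. `lookup_by_value`, `demo`, and `names` are hypothetical names. One deliberate difference from the thread: values that are NaN, negative, or too large for the table are left unchanged rather than wrapped through np.ushort.

```python
import numpy as np
import pandas as pd

names = {
    "one": ("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"),
    "two": ("i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"),
}
index = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]])
demo = pd.DataFrame(
    [[0.2, 1.7, 300.0], [2.1, 0.0, 9.9], [np.nan, 3.3, 1.0], [4.0, 12.0, 0.5]],
    index=index,
)


def lookup_by_value(frame, mapping):
    keys = list(mapping)
    # 2-D object table: one row per mapping key, one column per code.
    table = np.array([mapping[k] for k in keys], dtype=object)
    # For every DataFrame row, the table row selected by its second index level.
    key_pos = pd.Index(keys).get_indexer(frame.index.get_level_values(1))
    raw = frame.to_numpy(dtype=float)
    # Cells that cannot be looked up are kept as they are.
    keep = np.isnan(raw) | (raw < 0) | (raw >= table.shape[1])
    codes = np.where(keep, 0, raw).astype(np.intp)  # 0 is only a placeholder
    out = table[key_pos[:, None], codes]  # fancy indexing returns a new array
    out[keep] = raw[keep]
    return pd.DataFrame(out, index=frame.index, columns=frame.columns)


result = lookup_by_value(demo, names)
print(result)
```

The table lookup happens in a single indexing operation, so no Python-level loop runs over rows or cells.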

Meow David Fraser 6 years ago

An example using Series.map:

class dict_default_key(dict):
    def __missing__(self, key):
        return key


number_names = [
    'zero',
    'one',
    'two',
    'three',
    'four',
    'five',
    'six',
    'seven',
    'eight',
    'nine'
]
roman_numerals = [
    'i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x'
]
name_mapping = {
    'one': dict_default_key(
        {c: v for c, v in enumerate(number_names)}
    ),
    'two': dict_default_key(
        {c: v for c, v in enumerate(roman_numerals)}
    )
}

def translate(series):
    key = series.name[1]
    row_map = name_mapping[key]
    result = series.map(row_map)
    return result

ushorts = df.apply(np.ushort)
ushorts.apply(translate, axis=1)
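The reason for the `__missing__` subclass can be seen in isolation: `Series.map` with a plain dict turns unmapped values into NaN, while a dict subclass that defines `__missing__` supplies its own default. A minimal sketch, with `DefaultKeyDict`, `roman`, and `values` invented for the example:

```python
import pandas as pd


class DefaultKeyDict(dict):
    """Dict that returns the key itself when the key is absent."""

    def __missing__(self, key):
        return key


mapping = {0: "i", 1: "ii", 2: "iii"}
roman = DefaultKeyDict(mapping)
values = pd.Series([0, 2, 65535, 1])

mapped = values.map(roman)    # 65535 has no entry, so it passes through unchanged
plain = values.map(mapping)   # with a plain dict the same cell becomes NaN

print(mapped.tolist())
print(plain.tolist())
```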

Yaakov Bressler 6 years ago

Here is how I would go about solving this:

# 1. Rewrite functions to include a parameter for `idx`
def some_fun_name(value, idx):  
    value = np.ushort(value)
    if value > 10: 
        return value
    else:
        return namednumber2numbername[idx][value]

def apply_some_fun_name(s):  
    idx = list(s.index.get_level_values(1).unique())[0]
    return s.transform(some_fun_name, idx=idx)

# 2. Apply function over the keys of the multi-index, replacing while operating:
df = df.groupby(level=1).transform(apply_some_fun_name)

# 3. I got the following result while using `np.random.seed(1)`:
             0      1     2      3
bar one    one   zero  zero  65535
    two      i  65534    ii      i
baz one   zero   zero   one  65534
    two      i      i    ii  65535
foo one   zero   zero  zero   zero
    two  65535     ii     i      i
qux one   zero   zero  zero   zero
    two      i      i     i      i
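Grouping on level=1 works because every group then shares a single second-level label. A small sketch that makes this explicit by iterating over the groups; `demo`, `names`, and `pieces` are made-up names, and the data is integer so that no cast is needed:

```python
import pandas as pd

index = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]])
demo = pd.DataFrame([[0, 1], [1, 2], [2, 0], [0, 0]], index=index)
names = {"one": ("zero", "one", "two"), "two": ("i", "ii", "iii")}

pieces = []
for key, group in demo.groupby(level=1):
    # key is the shared second-level label ("one" or "two") for this group.
    table = names[key]
    pieces.append(group.apply(lambda col: col.map(lambda v: table[v])))

# Concatenating puts the groups one after another; reindex restores the original row order.
result = pd.concat(pieces).reindex(demo.index)
print(result)
```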