
How do I append a variable number of blank rows to a DataFrame by group?

padding append dataframe pandas python

JTA1618 · Tech Community · 1 year ago

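I have a DataFrame like the one below, with x person IDs (more than 1,000 people), x transactions per person, and x variables (more than 1,000 variables):

Person_ID	transaction_ID	variable_1	variable_2	variable_3	variable_X
person1	transaction1	123	0	1	abc
person1	transaction2	456	1	0	def
person1	transaction3	123	0	1	abc
personx	transaction1	123	0	1	abc
personx	transaction2	456	0	1	def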

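I want to prepend rows filled with -10 to the start of each Person_ID group so that every Person_ID ends up with 6 rows in total, like this:

Person_ID	transaction_ID	variable_1	variable_2	variable_3	variable_X
person1	-10	-10	-10	-10	-10
person1	-10	-10	-10	-10	-10
person1	-10	-10	-10	-10	-10
person1	transaction1	123	0	1	abc
person1	transaction2	456	1	0	def
person1	transaction3	123	0	1	abc
personx	-10	-10	-10	-10	-10
personx	-10	-10	-10	-10	-10
personx	-10	-10	-10	-10	-10
personx	-10	-10	-10	-10	-10
personx	transaction1	123	0	1	abc
personx	transaction2	456	0	1	def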

Here is the code I tried (updated to use concat), followed by the error it produces.

df2 = pd.DataFrame([[''] * len(newdf.columns)], columns=newdf.columns)
df2

for row in newdf.groupby('person_id')['transaction_id']:
   x=newdf.groupby('person_id')['person_id'].nunique()
   if x.any() < 6:
       newdf=pd.concat([newdf, df2*(6-x)], ignore_index=True)

RuntimeWarning: '<' not supported between instances of 'int' and 'tuple', sort order is undefined for incomparable objects.
  newdf=pd.concat([newdf, df2*(6-x)], ignore_index=True)

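It appends several NaN rows to the bottom of the DataFrame, but it does not insert them between the groups as I need. Thanks in advance; I'm a beginner.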
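For reference, the per-group shortfall that the loop is trying to compute can be read directly from `groupby(...).size()`. The following is only a minimal sketch: the small `newdf` is a hypothetical stand-in for the real data, the column name `Person_ID` and the constant `TARGET_ROWS` are assumptions, and the target of 6 rows comes from the question.

```python
import pandas as pd

# Hypothetical stand-in for the question's data (column names are assumed)
newdf = pd.DataFrame({
    'Person_ID': ['person1', 'person1', 'person1', 'personx', 'personx'],
    'transaction_ID': ['transaction1', 'transaction2', 'transaction3',
                       'transaction1', 'transaction2'],
})

TARGET_ROWS = 6  # desired number of rows per person, taken from the question

# Number of existing rows in each Person_ID group
rows_per_person = newdf.groupby('Person_ID').size()

# Number of padding rows each group still needs (never negative)
missing = (TARGET_ROWS - rows_per_person).clip(lower=0)
print(missing)
```

By contrast, `nunique()` applied to the grouping column itself is 1 for every group, and `x.any()` reduces the whole Series to a single boolean, so the `x.any() < 6` test in the loop above is always true.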

1 reply | as of 1 year ago

Panda Kim 1 year ago

Code

Use groupby + apply

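def func1(df):
    n = 6 - len(df)  # number of padding rows this group still needs
    if n > 0:
        # n rows that carry the group's Person_ID; every other column is filled with -10
        df1 = pd.DataFrame(df['Person_ID'].iloc[0], columns=['Person_ID'], index=range(0, n))
        return pd.concat([df1.reindex(df.columns, axis=1, fill_value=-10), df])
    return df  # groups that already have 6 or more rows are returned unchanged

out = df.groupby('Person_ID', group_keys=False).apply(func1).reset_index(drop=True)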

Output

Example code

import pandas as pd
data1 = {'Person_ID': ['person1', 'person1', 'person1', 'personx', 'personx'], 
         'transaction_ID': ['transaction1', 'transaction2', 'transaction3', 'transaction1', 'transaction2'], 
         'variable_1': [123, 456, 123, 123, 456], 
         'variable_2': [0, 1, 0, 0, 0], 
         'variable_3': [1, 0, 1, 1, 1], 
         'variable_X': ['abc', 'def', 'abc', 'abc', 'def']}
df = pd.DataFrame(data1)
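As an alternative sketch that is not part of the original answer, the same padding can be built without `apply`, which may matter with more than 1,000 people. The helper name `pad_groups` and its parameters are hypothetical, the sample frame is a shortened version of the example data, and the final stable sort orders the groups by `Person_ID`, which is assumed to be acceptable.

```python
import pandas as pd

# Shortened version of the sample data (fewer variable columns)
df = pd.DataFrame({
    'Person_ID': ['person1', 'person1', 'person1', 'personx', 'personx'],
    'transaction_ID': ['transaction1', 'transaction2', 'transaction3',
                       'transaction1', 'transaction2'],
    'variable_1': [123, 456, 123, 123, 456],
    'variable_X': ['abc', 'def', 'abc', 'abc', 'def'],
})


def pad_groups(df, key='Person_ID', target=6, fill=-10):
    """Hypothetical helper: prepend `fill` rows so every group has `target` rows."""
    # Rows missing per group; groups that are already large enough get 0
    sizes = df.groupby(key, sort=False).size()
    missing = (target - sizes).clip(lower=0)

    # One padding row per missing slot: the key is repeated,
    # every other column is created by reindex and filled with `fill`
    pad = pd.DataFrame({key: missing.index.repeat(missing.to_numpy())})
    pad = pad.reindex(columns=df.columns, fill_value=fill)

    # Padding goes first; a stable sort on the key keeps that relative order,
    # so each group's padding rows end up before its real rows
    combined = pd.concat([pad, df], ignore_index=True)
    return combined.sort_values(key, kind='stable', ignore_index=True)


out = pad_groups(df)
print(out)
```

Columns that receive the -10 filler next to string values end up with object dtype, just as in the `groupby` + `apply` version.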

liassantos 1 year ago

You can use pd.concat() instead of .append(), and you can use reindex() to repeat the rows.

Try this example:

import pandas as pd

data = [['Person1', 'transaction1', 803.5, 1],
        ['Person2', 'transaction2', 776.6, 2],
        ['Person3', 'transaction3', 3.9, 0],
        ['Person4', 'transaction1', 8.1, 7],
        ['Person5', 'transaction2', 1.7, 1],
        ['Person6', 'transaction3', 505.6, 2],
        ['Person7', 'transaction1', 1.5, 1]]

df = pd.DataFrame(data, columns=['Person_ID', 'transaction_ID', 'variable_1', 'variable_2'])

dfnew = df.copy()  # real copy; plain assignment (dfnew = df) would only alias df

new_column = df['Person_ID']  # keep the original IDs so they can be inserted back

for column in dfnew.columns:  # fill every cell with -10 (iteritems() was removed in pandas 2.0)
    dfnew[column] = -10

dfnew.insert(0, 'New_Column_Person_ID', new_column) #insert values of the first column

unique_values=df.groupby('Person_ID')['Person_ID'].nunique()

index_unique_values = pd.DataFrame(unique_values.index)

z = pd.concat([dfnew, index_unique_values], ignore_index=True) #concat instead of append method

z = z.reindex(z.index.repeat(3))  # repeat each row 3 times; reindex returns a new DataFrame, so assign it
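To make the row-repetition step easier to follow in isolation, here is a minimal sketch of `reindex` combined with `Index.repeat` on a tiny hypothetical frame. The names `same` and `varied` are illustrative only; passing a list of counts is one way to get a different number of copies per row, which is what a per-person padding count would need.

```python
import pandas as pd

# Tiny hypothetical frame: one filler row per person
z = pd.DataFrame({'Person_ID': ['Person1', 'Person2'],
                  'variable_1': [-10, -10]})

# Same number of copies for every row
same = z.reindex(z.index.repeat(3)).reset_index(drop=True)

# Different number of copies per row: 3 for the first row, 4 for the second
varied = z.reindex(z.index.repeat([3, 4])).reset_index(drop=True)

print(same)
print(varied)
```

This relies on the index of `z` being unique, which holds after `pd.concat(..., ignore_index=True)`.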