
Efficient way to merge multiple large DataFrames

  •  4
  • ababuji  · Tech Community  · 6 years ago

    Suppose I have 4 small DataFrames: df1, df2, df3 and df4.

    import pandas as pd
    from functools import reduce
    import numpy as np
    
    df1 = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
    df2 = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
    df3 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])  
    df4 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])   
    
    
    df1.columns = ['name', 'id', 'price']
    df2.columns = ['name', 'id', 'price']
    df3.columns = ['name', 'id', 'price']    
    df4.columns = ['name', 'id', 'price']   
    
    df1 = df1.rename(columns={'price':'pricepart1'})
    df2 = df2.rename(columns={'price':'pricepart2'})
    df3 = df3.rename(columns={'price':'pricepart3'})
    df4 = df4.rename(columns={'price':'pricepart4'})
    

    The code above creates the 4 DataFrames; what I want is in the code below.

    # Merge dataframes
    df = pd.merge(df1, df2, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
    df = pd.merge(df , df3, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
    df = pd.merge(df , df4, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
    
    # Fill na values with 'missing'
    df = df.fillna('missing')
    

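    So I have achieved this for 4 DataFrames that don't have many rows and columns.

    Basically, I want to extend the above outer-merge solution to many (48) DataFrames of size 62245 x 3.

    So I came up with this solution, built on another StackOverflow answer that uses reduce with a lambda: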

    from functools import reduce
    import pandas as pd
    import numpy as np
    dfList = []
    
    #To create the 48 DataFrames of size 62245 X 3
    for i in range(48):
    
        dfList.append(pd.DataFrame(np.random.randint(0,100,size=(62245, 3)), columns=['name',  'id',  'pricepart' + str(i + 1)]))
    
    
    #The solution I came up with to extend the solution to more than 3 DataFrames
    df_merged = reduce(lambda  left, right: pd.merge(left, right, left_on=['name', 'id'], right_on=['name', 'id'], how='outer'), dfList).fillna('missing')
    

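    This results in a MemoryError.

    I do not know what to do to stop the kernel from dying. I have been stuck on this for two days. Some code for the exact merge operation I have performed that does not cause the MemoryError, or something that gives the same result, would be really appreciated.

    Also, the 3 columns in the main DataFrame (not the 48 reproducible DataFrames in the example) are of type int64, int64 and float64, and I would prefer them to stay that way because of the integers and floats they represent.

    EDIT:

    Instead of running the merge operations iteratively or using the reduce lambda function, I have done it in groups of 2! Also, I have changed the datatype of some columns: some did not need to be float64, so I brought them down to float16. It gets very far but still ends up throwing a MemoryError.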

    intermediatedfList = dfList    
    
    tempdfList = []    
    
    #Until I merge all the 48 frames two at a time, till it becomes size 2
    while(len(intermediatedfList) != 2):
    
        #If there is an even number of DataFrames
        if len(intermediatedfList)%2 == 0:
    
            #Start a fresh list for this pass so earlier results are not merged again
            tempdfList = []
    
            #Go in steps of two
            for i in range(0, len(intermediatedfList), 2):
    
                #Merge DataFrame in index i, i + 1
                df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1], left_on=['name',  'id'], right_on=['name',  'id'], how='outer')
                df1.info(memory_usage='deep')  #info() prints by itself and returns None
    
                #Append it to this list
                tempdfList.append(df1)
    
            #After merging the DataFrames in intermediatedfList two at a time into the auxiliary list tempdfList,
            #set intermediatedfList to tempdfList so the while loop can continue.
            intermediatedfList = tempdfList 
    
        else:
    
            #If there is an odd number of DataFrames, set the first DataFrame aside
    
            tempdfList = [intermediatedfList[0]]
    
            #Go in steps of two starting from 1 instead of 0
            for i in range(1, len(intermediatedfList), 2):
    
                #Merge DataFrame in index i, i + 1
                df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1], left_on=['name',  'id'], right_on=['name',  'id'], how='outer')
                df1.info(memory_usage='deep')  #info() prints by itself and returns None
                tempdfList.append(df1)
    
            #After merging the DataFrames in intermediatedfList two at a time into the auxiliary list tempdfList,
            #set intermediatedfList to tempdfList so the while loop can continue.
            intermediatedfList = tempdfList 
    

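    Is there any way I can optimize my code to avoid the MemoryError? I have even used an AWS instance with 192GB RAM (I now owe them $7 that I could have given to one of y'all). It gets further than what I had managed before, but it still throws a MemoryError after reducing a list of 28 DataFrames to 4.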
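
    One likely contributor to the memory growth in the reproducible example, sketched here under the assumption that the real data also repeats (name, id) pairs: np.random.randint(0, 100, ...) can produce at most 100 x 100 = 10,000 distinct key pairs across 62245 rows, so each pair occurs several times per frame, and a merge emits every left/right combination for a repeated key. The frame and column names below (left, right, p1, p2) are illustrative only.

```python
import pandas as pd

# Two tiny frames in which the same (name, id) key occurs three times each.
left = pd.DataFrame({'name': [1, 1, 1], 'id': [7, 7, 7], 'p1': [10, 20, 30]})
right = pd.DataFrame({'name': [1, 1, 1], 'id': [7, 7, 7], 'p2': [1, 2, 3]})

# Every left row pairs with every right row that shares its key,
# so 3 rows x 3 rows become 9 rows after a single merge.
merged = pd.merge(left, right, on=['name', 'id'], how='outer')
rows_after_merge = len(merged)

# Counting repeated keys before merging shows whether this can happen.
duplicate_keys = int(left.duplicated(['name', 'id']).sum())
print(rows_after_merge, duplicate_keys)
```

    If the duplicate count is non-zero on the real frames, each further merge multiplies the row count again, which may explain why even a large machine runs out of memory.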

    4 answers  |  up to 6 years ago
        1
  •  3
  •   cs95 abhishek58g    6 years ago

    Use pd.concat:

    df_list = [df1, df2, ...]
    for df in df_list:
        df.set_index(['name', 'id'], inplace=True)
    
    df = pd.concat(df_list, axis=1)  # outer join by default; pass join='inner' to keep only shared keys
    df.reset_index(inplace=True)
    

    Alternatively, you can replace concat with join:

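    from functools import reduce
    # df_list must already be indexed by ['name', 'id'];
    # join defaults to how='left', so request an outer join explicitly
    df = reduce(lambda x, y: x.join(y, how='outer'), df_list)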
    

    This may or may not be faster than merge.
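
    The index-aligned concatenation above can be checked against the chained outer merges on the four small frames from the question. This is a minimal sketch that assumes every (name, id) pair is unique within each frame, which axis=1 concatenation with differing indexes requires; the names keys, frames, wide and merged are illustrative.

```python
import pandas as pd

keys = ['name', 'id']
df1 = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]],
                   columns=keys + ['pricepart1'])
df2 = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]], columns=keys + ['pricepart2'])
df3 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]], columns=keys + ['pricepart3'])
df4 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]], columns=keys + ['pricepart4'])
frames = [df1, df2, df3, df4]

# Index-aligned concatenation needs unique keys in every frame.
assert not any(f.duplicated(keys).any() for f in frames)

# Non-mutating variant of the set_index loop: index each frame, align, restore columns.
wide = pd.concat([f.set_index(keys) for f in frames], axis=1).reset_index()

# Reference result built with the chained outer merges from the question.
merged = df1
for f in frames[1:]:
    merged = pd.merge(merged, f, on=keys, how='outer')

# Raises if the two results differ once both are sorted by key.
pd.testing.assert_frame_equal(
    wide.sort_values(keys).reset_index(drop=True),
    merged.sort_values(keys).reset_index(drop=True),
)
print(wide.shape)
```

    Building the indexed frames in a list comprehension avoids mutating the inputs, so the snippet can be re-run without hitting a "key not found" error from set_index.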

        2
  •  1
  •   jpp    6 years ago

    You can use a simple for loop. The only memory optimization I have applied is downcasting to the most suitable int type via pd.to_numeric.

    import pandas as pd
    
    dfs = {}
    dfs[1] = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
    dfs[2] = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
    dfs[3] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])  
    dfs[4] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])   
    
    df = dfs[1].copy()
    
    for i in range(2, max(dfs)+1):
        df = pd.merge(df, dfs[i].rename(columns={2: i+1}),
                      left_on=[0, 1], right_on=[0, 1], how='outer').fillna(-1)
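        # Assign by column label so the downcast dtypes replace the float columns
        price_cols = df.columns[2:]
        df[price_cols] = df[price_cols].apply(pd.to_numeric, downcast='integer')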
    
    print(df)
    
       0  1   2   3   4   5
    0  a  1  10  15  -1  -1
    1  a  2  20  20  -1  -1
    2  b  1   4  -1  -1  -1
    3  c  1   2   2  -1  -1
    4  e  2  10  -1  20  20
    5  d  1  -1  -1  10  10
    6  f  1  -1  -1   1  15
    

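    As a rule, you should not combine strings such as "missing" with numeric types, because this turns the entire series into an object-dtype series. Here we use -1, but you may wish to use NaN with a float dtype instead.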
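
    The dtype effect described above can be seen on a single column. A small sketch with made-up values: mixed stands for what a price column looks like once 'missing' has been filled in, and as_small_int shows the numeric sentinel followed by the downcast.

```python
import numpy as np
import pandas as pd

prices = pd.Series([10.0, 15.0, np.nan, 20.0])   # float64 because of the NaN

# A column that mixes numbers with the string 'missing' is stored as object.
mixed = pd.Series([10, 15, 'missing', 20])

# A numeric sentinel keeps the column numeric, and downcasting then picks
# the smallest integer type that can hold every value.
as_small_int = pd.to_numeric(prices.fillna(-1), downcast='integer')

print(mixed.dtype, as_small_int.dtype)
print(prices.memory_usage(index=False), as_small_int.memory_usage(index=False))
```

    The sentinel only works if -1 can never be a real price; otherwise keeping NaN in a float column is the safer choice.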

        3
  •  1
  •   user85779    6 years ago

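    See "Best way to join two large datasets in Pandas" for example code. Sorry for not copying and pasting, but I do not want to look like I am trying to take credit from the answerer in the linked entry.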

        4
  •  0
  •   theletz    6 years ago

    You don't have to use merge...

    df = pd.concat([df1,df2,df3,df4])
    

    You will get:

    Out[3]: 
       id name  pricepart1  pricepart2  pricepart3  pricepart4
    0   1    a        10.0         NaN         NaN         NaN
    1   2    a        20.0         NaN         NaN         NaN
    2   1    b         4.0         NaN         NaN         NaN
    3   1    c         2.0         NaN         NaN         NaN
    4   2    e        10.0         NaN         NaN         NaN
    0   1    a         NaN        15.0         NaN         NaN
    1   2    a         NaN        20.0         NaN         NaN
    2   1    c         NaN         2.0         NaN         NaN
    0   1    d         NaN         NaN        10.0         NaN
    1   2    e         NaN         NaN        20.0         NaN
    2   1    f         NaN         NaN         1.0         NaN
    0   1    d         NaN         NaN         NaN        10.0
    1   2    e         NaN         NaN         NaN        20.0
    2   1    f         NaN         NaN         NaN        15.0
    

    df.groupby(['name','id']).sum().fillna('missing').reset_index()
    

    If you try it with the 48 DataFrames, you will see that it avoids the MemoryError:

    import numpy as np
    import pandas as pd
    
    dfList = []
    #To create the 48 DataFrames of size 62245 X 3
    for i in range(48):
        dfList.append(pd.DataFrame(np.random.randint(0,100,size=(62245, 3)), columns=['name',  'id',  'pricepart' + str(i + 1)]))
    
    df = pd.concat(dfList)
    df.groupby(['name','id']).sum().fillna('missing').reset_index()
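
    One caveat with the groupby step: sum() skips NaN and returns 0 for a group whose values are all missing, so the fillna that follows has nothing to fill and absent prices appear as 0. A minimal sketch, assuming each (name, id) pair appears at most once per frame, passes min_count=1 so that such groups stay NaN. The two small frames and the names stacked, zeros and kept_nan are illustrative.

```python
import pandas as pd

keys = ['name', 'id']
df1 = pd.DataFrame([['a', 1, 10], ['b', 1, 4]], columns=keys + ['pricepart1'])
df2 = pd.DataFrame([['a', 1, 15], ['c', 1, 2]], columns=keys + ['pricepart2'])

# Stack the frames vertically; each row has a value in only one price column.
stacked = pd.concat([df1, df2])

# Plain sum(): an all-NaN group sums to 0, so a missing price looks like 0.
zeros = stacked.groupby(keys).sum().reset_index()

# min_count=1: a group needs at least one real value, otherwise it stays NaN.
kept_nan = stacked.groupby(keys).sum(min_count=1).reset_index()

print(zeros)
print(kept_nan)
```

    If a pair repeats within one frame, sum() adds the repeated prices together, which is a different result from the row combinations that merge would produce.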