代码之家  ›  专栏  ›  技术社区  ›  sai

unix中使用python进行数据重组

  •  1
  • sai  · 技术社区  · 1 年前

    我正在尝试在CSV文件中拆分数据并重新排列数据。我的数据看起来像这样

    1:100011159-T-G,CDD3-597,G,G
    1:10002775-GA,CDD3-597,G,G
    1:100122796-C-T,CDD3-597,T,T
    1:100152282-CAAA-T,CDD3-597,C,C
    1:100011159-T-G,CDD3-598,G,G
    1:100152282-CAAA-T,CDD3-598,C,C  
    

    我想要一张这样的桌子:

    身份证件 1:100011159-T-G 1:10002775-GA 1:100122796-C-T 1:100152282-CAAA-T
    CDD3-597 游戏打得好 游戏打得好 TT 科科斯群岛
    CDD3-598 游戏打得好 科科斯群岛

    我写了以下代码:

    import pandas as pd
    
    
    input_file = "trail_berry.csv"
    output_file = "trail_output_result.csv"
    
    # Read the CSV file without header
    df = pd.read_csv(input_file, header=None)
    print(df[0].str.split(',', n=2, expand=True))
    # Extract SNP Name, ID, and Alleles from the data
    df[['SNP_Name', 'ID', 'Alleles']] = df[0].str.split(',', n=-1, expand=True)
    
    # Create a new DataFrame with unique SNP_Name values as columns
    result_df = pd.DataFrame(columns=df['SNP_Name'].unique(), dtype=str)
    
    # Populate the new DataFrame with ID and Alleles data
    for _, row in df.iterrows():
        result_df.at[row['ID'], row['SNP_Name']] = row['Alleles']
    
    # Reset the index
    result_df.reset_index(inplace=True)
    result_df.rename(columns={'index': 'ID'}, inplace=True)
    
    # Fill NaN values with an appropriate representation (e.g., 'NULL' or '')
    result_df = result_df.fillna('NULL')
    
    # Save the result to a new CSV file
    result_df.to_csv(output_file, index=False)
    
    # Print a message indicating that the file has been saved
    print("Result has been saved to {}".format(output_file))
    

    但这给了我以下错误:

    Traceback (most recent call last):
      File "berry_trail.py", line 11, in <module>
        df[['SNP_Name', 'ID', 'Alleles']] = df[0].str.split(',', n=-1, expand=True)
      File "/nas/longleaf/home/svennam/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 3367, in __setitem__
        self._setitem_array(key, value)
      File "/nas/longleaf/home/svennam/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 3389, in _setitem_array
        raise ValueError('Columns must be same length as key')
    

    有人能帮忙吗?我很难弄清楚。提前感谢! ValueError: Columns must be same length as key

    1 回复  |  直到 1 年前
        1
  •  2
  •   Andrej Kesely    1 年前

    你想做的叫做 枢轴转动 ,所以使用 DataFrame.pivot() :

    import pandas as pd
    
    df = pd.read_csv("your_file.csv", header=None)
    
    out = (
        df.pivot(index=1, columns=0, values=2)
        .rename_axis(index="ID", columns=None)
        .reset_index()
    )
    print(out)
    

    打印:

             ID 1:100011159-T-G 1:10002775-GA 1:100122796-C-T 1:100152282-CAAA-T
    0  CDD3-597              GG            GG              TT                 CC
    1  CDD3-598              GG           NaN             NaN                 CC
    

    编辑:使用您更新的输入:

    import pandas as pd
    
    df = pd.read_csv("your_file.csv", header=None)
    
    df["tmp"] = df[[2, 3]].agg("".join, axis=1)  # <-- join the two columns together
    
    out = (
        df.pivot(index=1, columns=0, values="tmp")
        .rename_axis(index="ID", columns=None)
        .reset_index()
    )
    print(out)