
How do I convert nested JSON into a DataFrame?

  •  0
  • Matt Miles  · 5 months ago

    A call to response.json() returns JSON in the following format:

    {
        "workbooks": [
            {
                "name": "WORKBOOK_A",
                "embeddedDatasources": [
                    {
                        "upstreamTables": [
                            {"name": "WORKBOOK_A_TABLE_A"}]},
                    {
                        "upstreamTables": [
                            {"name": "WORKBOOK_A_TABLE_B"},
                            {"name": "WORKBOOK_A_TABLE_C"}]},
                    {
                        "upstreamTables": []}]},
            {
                "name": "WORKBOOK_B",
                "embeddedDatasources": [
                    {
                        "upstreamTables": [
                            {"name": "WORKBOOK_B_TABLE_A"},
                            {"name": "WORKBOOK_B_TABLE_B"}]},
                    {
                        "upstreamTables": [
                            {"name": "WORKBOOK_B_TABLE_C"},
                            {"name": "WORKBOOK_B_TABLE_D"}]}]}]}
    

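    I am trying to convert it into a DataFrame like this:

    workbooks   upstreamTables
    WORKBOOK_A  WORKBOOK_A_TABLE_A
    WORKBOOK_A  WORKBOOK_A_TABLE_B
    WORKBOOK_A  WORKBOOK_A_TABLE_C
    WORKBOOK_B  WORKBOOK_B_TABLE_A
    WORKBOOK_B  WORKBOOK_B_TABLE_B
    WORKBOOK_B  WORKBOOK_B_TABLE_C
    WORKBOOK_B  WORKBOOK_B_TABLE_D

    Entries with "upstreamTables": [] should be ignored.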

    Using json_normalize directly:

    df = pd.json_normalize(json_data)

    has not worked so far, and extracting the data into separate DataFrames and joining them back together seems excessive.

    3 answers  |  as of 5 months ago
        1
  •  1
  •   ouroboros1    5 months ago

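    Here is one approach:

    • Pass resp['workbooks'] (where resp is response.json()) to pd.json_normalize with both record_path and meta. Add meta_prefix to avoid ValueError: Conflicting metadata (see this post); without a prefix, the result would need two columns both called name.
    • Use df.rename to rename the columns, then select them in the desired order.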
    import pandas as pd
    
    # resp = {...}
    
    df = (pd.json_normalize(resp['workbooks'], 
                            record_path=['embeddedDatasources', 'upstreamTables'], 
                            meta='name', 
                            meta_prefix='meta_'
                            )
          .rename(columns={'name': 'upstreamTables',
                           'meta_name': 'workbooks'})
          [['workbooks', 'upstreamTables']]
          )
    

    Output:

        workbooks      upstreamTables
    0  WORKBOOK_A  WORKBOOK_A_TABLE_A
    1  WORKBOOK_A  WORKBOOK_A_TABLE_B
    2  WORKBOOK_A  WORKBOOK_A_TABLE_C
    3  WORKBOOK_B  WORKBOOK_B_TABLE_A
    4  WORKBOOK_B  WORKBOOK_B_TABLE_B
    5  WORKBOOK_B  WORKBOOK_B_TABLE_C
    6  WORKBOOK_B  WORKBOOK_B_TABLE_D
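
    To illustrate the naming conflict described above, here is a minimal sketch. It first calls pd.json_normalize without any prefix to show the ValueError, then uses record_prefix (a swapped-in alternative to the meta_prefix used in the answer) to disambiguate the two name columns. The prefix string "table_" and the variable names are assumptions chosen for illustration.

```python
import pandas as pd

# Sample data copied from the question.
resp = {
    "workbooks": [
        {"name": "WORKBOOK_A", "embeddedDatasources": [
            {"upstreamTables": [{"name": "WORKBOOK_A_TABLE_A"}]},
            {"upstreamTables": [{"name": "WORKBOOK_A_TABLE_B"},
                                {"name": "WORKBOOK_A_TABLE_C"}]},
            {"upstreamTables": []}]},
        {"name": "WORKBOOK_B", "embeddedDatasources": [
            {"upstreamTables": [{"name": "WORKBOOK_B_TABLE_A"},
                                {"name": "WORKBOOK_B_TABLE_B"}]},
            {"upstreamTables": [{"name": "WORKBOOK_B_TABLE_C"},
                                {"name": "WORKBOOK_B_TABLE_D"}]}]},
    ]
}

workbooks = resp["workbooks"]
path = ["embeddedDatasources", "upstreamTables"]

# Without a prefix, the record column and the meta column are both "name".
try:
    pd.json_normalize(workbooks, record_path=path, meta="name")
    conflict = False
except ValueError as exc:
    conflict = True
    print(exc)

# record_prefix renames the record columns instead of the meta column.
df = pd.json_normalize(workbooks, record_path=path, meta="name",
                       record_prefix="table_")
df = df.rename(columns={"name": "workbooks",
                        "table_name": "upstreamTables"})
df = df[["workbooks", "upstreamTables"]]
print(df)
```

    Either prefix works; which one to use mostly depends on which column names you want to keep before renaming.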
    
        2
  •  0
  •   steel_wire    5 months ago

    To keep it concise I used list comprehensions; I hope it is still readable.

    import pandas as pd
        
    # `response` is the parsed JSON dict from the question (the result of response.json())
    responseData = {}
    for item in response['workbooks']:
        embeddedDataList = item['embeddedDatasources']
        # Keep only the non-empty upstreamTables lists
        response_elements = [listElem['upstreamTables'] for listElem in embeddedDataList if listElem['upstreamTables']]
        # Flatten them into a single list of table names for this workbook
        tabular_elements = [elem['name'] for elementList in response_elements for elem in elementList]
        responseData[item['name']] = tabular_elements

    workbooks = []
    upstreamTables = []

    # Emit one (workbook, table) pair per table name
    for workbook in responseData:
        for streamEntry in responseData[workbook]:
            workbooks.append(workbook)
            upstreamTables.append(streamEntry)

    tabularResponse = pd.DataFrame()
    tabularResponse['workbooks'] = workbooks
    tabularResponse['upstreamTables'] = upstreamTables
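
    The same idea can probably be collapsed into a single nested comprehension. The sketch below is one possible shape, not the answerer's code: it assumes `response` is the already-parsed dict and uses a trimmed copy of the question's data. The name `rows` is chosen for illustration.

```python
# Trimmed sample from the question; `response` stands for the parsed JSON dict.
response = {
    "workbooks": [
        {"name": "WORKBOOK_A", "embeddedDatasources": [
            {"upstreamTables": [{"name": "WORKBOOK_A_TABLE_A"}]},
            {"upstreamTables": []}]},
        {"name": "WORKBOOK_B", "embeddedDatasources": [
            {"upstreamTables": [{"name": "WORKBOOK_B_TABLE_A"},
                                {"name": "WORKBOOK_B_TABLE_B"}]}]},
    ]
}

# One row per (workbook, table). An empty upstreamTables list simply
# contributes no iterations, so no explicit filter is needed.
rows = [
    (workbook["name"], table["name"])
    for workbook in response["workbooks"]
    for datasource in workbook["embeddedDatasources"]
    for table in datasource["upstreamTables"]
]

for row in rows:
    print(row)
```

    Passing the result to pd.DataFrame(rows, columns=["workbooks", "upstreamTables"]) should then give the two-column frame. Because rows are collected in a list rather than a dict keyed by workbook name, two workbooks that share a name would not overwrite each other.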
    
        3
  •  0
  •   Khaja Hussain    5 months ago

    To convert the JSON response into the desired DataFrame, you can iterate over the JSON structure:

    import pandas as pd    
    
    json_data = {
        "workbooks": [
            {
                "name": "WORKBOOK_A",
                "embeddedDatasources": [
                    {"upstreamTables": [{"name": "WORKBOOK_A_TABLE_A"}]},
                    {"upstreamTables": [{"name": "WORKBOOK_A_TABLE_B"}, {"name": "WORKBOOK_A_TABLE_C"}]},
                    {"upstreamTables": []}
                ]
            },
            {
                "name": "WORKBOOK_B",
                "embeddedDatasources": [
                    {"upstreamTables": [{"name": "WORKBOOK_B_TABLE_A"}, {"name": "WORKBOOK_B_TABLE_B"}]},
                    {"upstreamTables": [{"name": "WORKBOOK_B_TABLE_C"}, {"name": "WORKBOOK_B_TABLE_D"}]}
                ]
            }
        ]
    }
       
    data = []    
    
    for workbook in json_data['workbooks']:
        workbook_name = workbook['name']
        for datasource in workbook['embeddedDatasources']:
            for table in datasource['upstreamTables']:
                # Add workbook name and table name to the data list
                data.append({
                    "workbooks": workbook_name,
                    "upstreamTables": table['name']
                })    
    
    df = pd.DataFrame(data)
        
    print(df)
    

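    This produces the desired table structure. Make sure you always pass valid JSON; you can check it with an online tool such as JSON Reader or any similar tool.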
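
    The validity check can also be done in code rather than with an online tool. The following is a rough sketch under a few assumptions: the input is raw JSON text, the top level is an object, and missing keys should be treated as empty. The helper name extract_rows is hypothetical.

```python
import json


def extract_rows(raw_text):
    """Hypothetical helper: parse JSON text and return workbook/table rows."""
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        # Surface malformed input early with a clearer message.
        raise ValueError(f"Response is not valid JSON: {exc}") from exc

    rows = []
    # .get(..., []) treats a missing key like an empty list (an assumption).
    for workbook in payload.get("workbooks", []):
        for datasource in workbook.get("embeddedDatasources", []):
            for table in datasource.get("upstreamTables", []):
                rows.append({
                    "workbooks": workbook.get("name"),
                    "upstreamTables": table.get("name"),
                })
    return rows


sample = (
    '{"workbooks": [{"name": "WORKBOOK_A", "embeddedDatasources": ['
    '{"upstreamTables": [{"name": "WORKBOOK_A_TABLE_A"}]}, '
    '{"upstreamTables": []}]}]}'
)
rows = extract_rows(sample)
print(rows)
```

    The returned list of dicts has the same shape as the data list in the answer above, so it can be handed to pd.DataFrame in the same way.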