Flatten nested JSON into one row per item

dataframe pandas python-3.x json python

dreddy · Tech Community · 4 years ago

I have a nested JSON that I'm trying to flatten:

[
  {
    "name": "table1",
    "count": 123,
    "columns": {
      "col1": "string",
      "col2": "string"
    },
    "partitions": 2
  },
  {
    "name": "table2",
    "count": 234,
    "columns": {
      "col3": "int",
      "col4": "string",
      "col5": "int"
    },
    "partitions": 4
  }
]

I'm trying to flatten it into the following, with one row per column entry:

name     count   col_name     col_type   partitions
table1    123      col1        string      2
table1    123      col2        string      2
table2    234      col3        int         4
table2    234      col4        string      4
table2    234      col5        int         4

I'm reading the JSON into a pandas DataFrame:

import json
import pandas as pd

with open("file.json") as datafile:
    data = json.load(datafile)
dataframe = pd.DataFrame(data)

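pd.json_normalize doesn't work for me because I don't want to create that many columns. Instead, I'm trying to create more rows. Can someone point me to the best way to do this in Python or pandas?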
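To illustrate the problem, here is a minimal sketch of what pd.json_normalize produces on the sample data above. The data is inlined instead of being read from file.json, and the variable name wide is only illustrative. Each key of "columns" becomes its own column, so the frame grows wider rather than longer:

```python
import pandas as pd

# Sample data from the question, inlined for a self-contained example.
data = [
    {"name": "table1", "count": 123,
     "columns": {"col1": "string", "col2": "string"}, "partitions": 2},
    {"name": "table2", "count": 234,
     "columns": {"col3": "int", "col4": "string", "col5": "int"}, "partitions": 4},
]

# json_normalize flattens the nested dict into dotted column names.
wide = pd.json_normalize(data)
print(wide.shape)             # still one row per table
print(sorted(wide.columns))   # includes 'columns.col1' ... 'columns.col5'
```

Keys that a table does not have (for example columns.col3 for table1) come out as NaN.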

Any help is appreciated. Thanks.

2 replies | as of 4 years ago

user2736738 · 4 years ago

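Here is a straightforward solution: build a list of dictionaries, one per column entry with the corresponding keys, and then create a DataFrame from that list.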

import json
import pandas as pd

with open("abc.json") as datafile:
    data = json.load(datafile)
print(data)

# Build one flat dict per (table, column) pair.
d = [
    {'name': x['name'], 'count': x['count'], 'colname': k,
     'coltype': v, 'partitions': x['partitions']}
    for x in data
    for k, v in x['columns'].items()
]
df = pd.DataFrame(d)
print(df)

Output

[{'name': 'table1', 'count': 123, 'columns': {'col1': 'string', 'col2': 'string'}, 'partitions': 2}, {'name': 'table2', 'count': 234, 'columns': {'col3': 'int', 'col4': 'string', 'col5': 'int'}, 'partitions': 4}]
     name  count colname coltype  partitions
0  table1    123    col1  string           2
1  table1    123    col2  string           2
2  table2    234    col3     int           4
3  table2    234    col4  string           4
4  table2    234    col5     int           4
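The same idea can be sketched as a small reusable function. The name flatten_tables is hypothetical, the data is inlined instead of being read from a file, and the columns are named col_name and col_type to match the layout requested in the question:

```python
import pandas as pd

def flatten_tables(tables):
    """Return one row per (table, column) pair (hypothetical helper)."""
    rows = [
        {
            "name": t["name"],
            "count": t["count"],
            "col_name": col_name,
            "col_type": col_type,
            "partitions": t["partitions"],
        }
        for t in tables
        for col_name, col_type in t["columns"].items()
    ]
    return pd.DataFrame(rows)

# Sample data from the question, inlined for a self-contained example.
data = [
    {"name": "table1", "count": 123,
     "columns": {"col1": "string", "col2": "string"}, "partitions": 2},
    {"name": "table2", "count": 234,
     "columns": {"col3": "int", "col4": "string", "col5": "int"}, "partitions": 4},
]

df = flatten_tables(data)
print(df)
```

Note that a table whose "columns" dict is empty contributes no rows with this approach.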

Corralien · 4 years ago

You can use wide_to_long:

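import pandas as pd

# data is the list loaded from the JSON file
df = pd.json_normalize(data)
# Identifier columns: everything that is not a flattened 'columns.*' field
cols = [c for c in df.columns if not c.startswith('columns')]

out = (pd.wide_to_long(df, stubnames='columns', i=cols, j='col_name',
                       sep='.', suffix=r'col\d+')
         .rename(columns={'columns': 'col_type'})
         .query('col_type.notna()').reset_index())
print(out)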

# Output
     name  count  partitions col_name col_type
0  table1    123           2     col1   string
1  table1    123           2     col2   string
2  table2    234           4     col3      int
3  table2    234           4     col4   string
4  table2    234           4     col5      int
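For comparison, a similar reshape can be sketched with DataFrame.melt instead of wide_to_long. This is a different technique from the one in the answer above, shown under the assumption that every nested key is flattened to a name starting with "columns."; the variable names wide and id_cols are only illustrative:

```python
import pandas as pd

# Sample data from the question, inlined for a self-contained example.
data = [
    {"name": "table1", "count": 123,
     "columns": {"col1": "string", "col2": "string"}, "partitions": 2},
    {"name": "table2", "count": 234,
     "columns": {"col3": "int", "col4": "string", "col5": "int"}, "partitions": 4},
]

wide = pd.json_normalize(data)
# Identifier columns are those that did not come from the nested dict.
id_cols = [c for c in wide.columns if not c.startswith("columns.")]

out = (
    wide.melt(id_vars=id_cols, var_name="col_name", value_name="col_type")
        .dropna(subset=["col_type"])  # drop columns a table does not have
        .assign(col_name=lambda d: d["col_name"].str.replace("columns.", "", regex=False))
        .sort_values(["name", "col_name"])
        .reset_index(drop=True)
)
print(out)
```

melt does not need a suffix pattern, so it also copes with nested keys that do not follow the col<number> naming.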