
Converting an unstructured list of names and data into a nested dictionary

  •  1
  • Brad Solomon  · Tech Community  · 8 years ago

    I have an "unstructured" list that looks like this:

    info = [
        'Joe Schmoe',
        'W / M / 64',
        'Richard Johnson',
        'OFFICER',
        'W / M /48',
        'Adrian Stevens',
        '? / ? / 27'
        ]
    

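    It is "unstructured" in the sense that the list consists of groups that are either:

    • (name, officer status, demographic info) triplets, or
    • (name, demographic info) pairs.

    In the latter case, Officer=False; in the former, Officer=True. The demographic info string represents Race / Gender / Age, with NaN denoted by literal question marks. Here is the result I would like to get to: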

    res = {
        'Joe Schmoe': {
            'race': 'W',
            'gender': 'M',
            'age': 64,
            'officer': False
            },
        'Richard Johnson': {
            'race': 'W',
            'gender': 'M',
            'age': 48,
            'officer': True
            },
        'Adrian Stevens': {
            'race': 'NaN',
            'gender': 'NaN',
            'age': 27,
            'officer': False
            }
        }
    

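    So far I have built two functions to do this. The first one, below, handles the demographic info string. (I am fine with this one; it is included here for reference.)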

    import re
    
    def fix_demographic(info):
        # W / M / ?? --> W / M / NaN
        # ?/M/?  --> NaN / M / NaN
        # Keep as str NaN rather than np.nan for now
        # Raw strings avoid invalid-escape warnings for \s and \? on newer Python versions
        race, gender, age = re.split(r'\s*/\s*', re.sub(r'\?+', 'NaN', info))
        return race, gender, age
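
    As a quick illustration of the two cases described in the comments above, here is a minimal sketch that calls the function on sample strings. The sample inputs are taken from the comments in the function itself; nothing else is assumed.

```python
import re

def fix_demographic(info):
    # Replace each run of question marks with the string 'NaN',
    # then split on slashes, ignoring surrounding whitespace.
    race, gender, age = re.split(r'\s*/\s*', re.sub(r'\?+', 'NaN', info))
    return race, gender, age

first = fix_demographic('W / M / ??')
second = fix_demographic('?/M/?')
print(first)   # ('W', 'M', 'NaN')
print(second)  # ('NaN', 'M', 'NaN')
```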
    

    The second function deconstructs the list and places its values in the appropriate spots of a dictionary result:

    from collections import defaultdict

    demographic = re.compile(r'(\w+|\?+)\s*\/\s*(\w+|\?+)\s*\/\s*(\w+|\?+)')
    
    
    def parse_victim_info(info: list):
        res = defaultdict(dict)
        for i in info:
            if not demographic.fullmatch(i) and i.lower() != 'officer':
                # We have a name
                previous = 'name'
                name = i
            if i.lower() == 'officer':
                res[name]['officer'] = True
                previous = 'officer'
            if demographic.fullmatch(i):
                # We have demographic info; did "OFFICER" come before it?
                if previous == 'name':
                    res[name]['officer'] = False
                race, gender, age = fix_demographic(i)
                res[name]['race'] = race
                res[name]['gender'] = gender
                res[name]['age'] = int(age) if age.isnumeric() else age
                previous = None
        return res
    
    >>> parse_victim_info(info)
    defaultdict(dict,
                {'Adrian Stevens': {'age': 27,
                  'gender': 'NaN',
                  'officer': False,
                  'race': 'NaN'},
                 'Richard Johnson': {'age': 48,
                  'gender': 'M',
                  'officer': True,
                  # ... ...
    

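    The second function feels too verbose and tedious for what it is doing.

    Is there a better way to do this, one that more smartly remembers the classification of the last value seen in the iteration?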

    2 answers  |  as of 8 years ago
        1
  •  4
  •   Brad Solomon    8 years ago

    This sort of thing is a good fit for a generator:

    Code:

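    def find_triplets(data):
        data = iter(data)
        # A `for` loop ends the generator cleanly once the input is exhausted.
        # (A bare next() under `while True` raises RuntimeError on Python 3.7+, per PEP 479.)
        for name in data:
            demo = next(data)
            officer = demo == 'OFFICER'
            if officer:
                demo = next(data)
            yield name, officer, demo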
    

    Test code:

    info = [
        'Joe Schmoe',
        'W / M / 64',
        'Lillian Schmoe',
        'W / F / 60',
        'Richard Johnson',
        'OFFICER',
        'W / M /48',
        'Adrian Stevens',
        '? / ? / 27'
    ]
    
    for x in find_triplets(info):
        print(x)
    

    Results:

    ('Joe Schmoe', False, 'W / M / 64')
    ('Lillian Schmoe', False, 'W / F / 60')
    ('Richard Johnson', True, 'W / M /48')
    ('Adrian Stevens', False, '? / ? / 27')
    

    To convert the triplet tuples into a dict:

    import re
    
    def fix_demographic(info):
        # W / M / ?? --> W / M / NaN
        # ?/M/?  --> NaN / M / NaN
        # Keep as str NaN rather than np.nan for now
        # Raw strings avoid invalid-escape warnings for \s and \? on newer Python versions
        race, gender, age = re.split(r'\s*/\s*', re.sub(r'\?+', 'NaN', info))
        return dict(race=race, gender=gender, age=age)
    
    
    data_dict = {name: dict(officer=officer, **fix_demographic(demo))
                 for name, officer, demo in find_triplets(info)}
    
    print(data_dict)
    

    Results:

    {
        'Joe Schmoe': {'officer': False, 'race': 'W', 'gender': 'M', 'age': '64'}, 
        'Lillian Schmoe': {'officer': False, 'race': 'W', 'gender': 'F', 'age': '60'}, 
        'Richard Johnson': {'officer': True, 'race': 'W', 'gender': 'M', 'age': '48'}, 
        'Adrian Stevens': {'officer': False, 'race': 'NaN', 'gender': 'NaN', 'age': '27'}
    }
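
    Note that the ages above are strings, whereas the question's desired output has integer ages. A minimal sketch of one way to close that gap follows. The helper name parse_demographic is hypothetical (it is not part of the answer), and the triplets are hard-coded here only so the snippet runs on its own.

```python
import re

def parse_demographic(demo):
    # Hypothetical helper: same splitting as fix_demographic,
    # but a purely numeric age is converted to int.
    race, gender, age = re.split(r'\s*/\s*', re.sub(r'\?+', 'NaN', demo))
    return {'race': race, 'gender': gender,
            'age': int(age) if age.isdigit() else age}

# Triplets in the (name, officer, demo) shape that find_triplets yields.
triplets = [
    ('Richard Johnson', True, 'W / M /48'),
    ('Adrian Stevens', False, '? / ? / 27'),
]

data_dict = {name: {'officer': officer, **parse_demographic(demo)}
             for name, officer, demo in triplets}
print(data_dict)
```

    An age given as question marks stays as the string 'NaN', matching the convention used in the question.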
    
        2
  •  0
  •   Brad Solomon    8 years ago

    You can use itertools.groupby in Python 3:

    import itertools
    import re
    info = [
    'Joe Schmoe',
    'W / M / 64',
    'Lillian Schmoe',
    'W / F / 60',
    'Richard Johnson',
    'OFFICER',
    'W / M /48',
    'Adrian Stevens',
    '? / ? / 27'
    ]
    data = [list(b) for a, b in itertools.groupby(info, key=lambda x:x.count('/') > 0 or x == 'OFFICER')]
    
    # Even-indexed groups hold a name; the following group holds its details.
    final_data = {
        data[i][0]: {
            **{a: 'NaN' if b == '?' else (int(b) if b.isdigit() else b)
               for a, b in zip(['race', 'gender', 'age'],
                               filter(None, re.split(r'\s+|/', [h for h in data[i+1] if h.count('/') > 0][0])))},
            **{"officer": "OFFICER" in data[i+1]},
        }
        for i in range(0, len(data), 2)
    }
    

    Output:

    {'Joe Schmoe': {'race': 'W', 'gender': 'M', 'age': 64, 'officer': False}, 'Lillian Schmoe': {'race': 'W', 'gender': 'F', 'age': 60, 'officer': False}, 'Richard Johnson': {'race': 'W', 'gender': 'M', 'age': 48, 'officer': True}, 'Adrian Stevens': {'race': 'NaN', 'gender': 'NaN', 'age': 27, 'officer': False}}
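
    To see why stepping through the groups two at a time works, it can help to print the intermediate groups. The sketch below uses a shortened copy of the list, and the helper name is_detail is hypothetical; it applies the same test as the lambda in the answer. It assumes names never contain a slash.

```python
import itertools

info = [
    'Joe Schmoe',
    'W / M / 64',
    'Richard Johnson',
    'OFFICER',
    'W / M /48',
    'Adrian Stevens',
    '? / ? / 27',
]

def is_detail(item):
    # Hypothetical helper: True for demographic strings and the OFFICER marker.
    return '/' in item or item == 'OFFICER'

# groupby starts a new group whenever the key changes, so name groups
# and detail groups alternate in the result.
groups = [list(group) for _, group in itertools.groupby(info, key=is_detail)]

for group in groups:
    print(group)
```

    The 'OFFICER' marker lands in the same group as the demographic string that follows it, which is what lets the membership test "OFFICER" in data[i+1] set the officer flag.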