
Extracting key-value pairs from text containing brackets (log file)

pandas python-3.x regex python

John Doe · 7 years ago

[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text

I want to end up with key-value pairs like this:

Key      Value
aaa      some text here  
bbbb3    some other text here  
cc       more text

Or a DataFrame like this:

aaa            | bbbb3                |cc
-------------------------------------------------
some text here | some other text here | more text
next line      | .....                | .....

r'\[(.{6})\]\s(.*?)\s\['

6 answers

cs95 abhishek58g 7 years ago

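Use re.findall to extract the regions of interest into columns. You can then strip the whitespace as needed.

Since you mentioned you are open to reading this into a DataFrame, you can leave that work to pandas.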

import re
import pandas as pd

text = '[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text'
matches = re.findall(r'\[(.*?)\](.*?)(?=\[|$)', text)

df = (pd.DataFrame(matches, columns=['Key', 'Value'])
        .apply(lambda x: x.str.strip()))

df
     Key                 Value
0    aaa        some text here
1  bbbb3  some other text here
2     cc             more text

Or (re: the edit),

df = (pd.DataFrame(matches, columns=['Key', 'Value'])
        .apply(lambda x: x.str.strip())
        .set_index('Key')
        .transpose())

Key               aaa                 bbbb3         cc
Value  some text here  some other text here  more text
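The second table in the question has one row per log line. A minimal sketch of that step, assuming each line uses the same bracketed keys; the second log line and the helper name `parse_line` are made up for illustration:

```python
import re

# Same pattern as in the answer above.
PAIR = re.compile(r'\[(.*?)\](.*?)(?=\[|$)')

def parse_line(line):
    """Return one {key: value} dict for a single log line (hypothetical helper)."""
    return {key.strip(): value.strip() for key, value in PAIR.findall(line)}

# Hypothetical two-line log; the second line is invented for this example.
log_lines = [
    '[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text',
    '[aaa   ] second line [bbbb3 ] another value [cc    ] the end',
]

rows = [parse_line(line) for line in log_lines]
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` should then give one row per log line, with the keys as column names.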

The pattern matches the text inside the square brackets, followed by the text outside them up to the next opening bracket.

\[      # Opening square brace 
(.*?)   # First capture group
\]      # Closing brace
(.*?)   # Second capture group
(?=     # Look-ahead 
   \[   # Next brace,
   |    # Or,
   $    # EOL
)
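The commented breakdown above can also be kept inside the code itself. A small sketch using `re.VERBOSE`, which ignores whitespace and `#` comments in the pattern; the names `PAIR` and `pairs` are only illustrative:

```python
import re

# Same pattern as above, written with re.VERBOSE so the comments live in the regex.
PAIR = re.compile(r"""
    \[       # opening square bracket
    (.*?)    # group 1: the key, matched lazily
    \]       # closing square bracket
    (.*?)    # group 2: the value, matched lazily
    (?=      # look-ahead (consumes nothing):
       \[    #   the next opening bracket,
       |     #   or
       $     #   the end of the string
    )
""", re.VERBOSE)

text = '[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text'

# findall returns one (key, value) tuple per match; strip the padding.
pairs = [(key.strip(), value.strip()) for key, value in PAIR.findall(text)]
print(pairs)
# [('aaa', 'some text here'), ('bbbb3', 'some other text here'), ('cc', 'more text')]
```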

Pushpesh Kumar Rajwanshi 7 years ago

Try this regex, which captures your key and value in named groups.

\[\s*(?P<key>\w+)+\s*]\s*(?P<value>[^[]*\s*)

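\[ --> Since [ has a special meaning (it starts a character set), it must be escaped to match a literal [
\s* --> Consumes any whitespace before the key
(?P<key>\w+)+ --> Forms the named group key, capturing one or more word characters [a-zA-Z0-9_]. I used \w to keep it simple because OP's string contains only alphanumeric characters; otherwise use the character set [^]] to capture everything inside the square brackets as the key.
\s* --> Consumes any whitespace after the key
] --> Matches a literal ], which does not need escaping
\s* --> Consumes any leading whitespace that should not be part of the value
(?P<value>[^[]*\s*) --> Forms the named group value, capturing any characters except [; at that point it stops and stores the captured text in the named group value.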

Demo

import re
s = '[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text'

arr = re.findall(r'\[\s*(?P<key>\w+)+\s*]\s*(?P<value>[^[]*\s*)', s)
print(arr)

Output:

[('aaa', 'some text here '), ('bbbb3', 'some other text here '), ('cc', 'more text')]
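`re.findall` returns plain tuples, so the group names are not visible in the result. A sketch of how the names could be used with `finditer`, assuming keys may contain any character except `]` (the `[^]]` variant mentioned above, written here as `[^\]]`); the pattern is a variation for illustration, not the answer's exact regex:

```python
import re

s = '[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text'

# [^\]]+? accepts any key characters except "]" (lazy, so trailing padding
# is left to the following \s*); [^\[]* takes the value up to the next "[".
pattern = re.compile(r'\[\s*(?P<key>[^\]]+?)\s*\]\s*(?P<value>[^\[]*)')

# Access each capture by name and strip the trailing space from the value.
pairs = {m.group('key'): m.group('value').strip() for m in pattern.finditer(s)}
print(pairs)
# {'aaa': 'some text here', 'bbbb3': 'some other text here', 'cc': 'more text'}
```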

benvc 7 years ago

You can do this with re.split() and output the result to a dictionary. For example:

import re

text = '[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text'

# split text on "[" or "]" and slice off the first empty list item
items = re.split(r'[\[\]]', text)[1:]

# loop over consecutive pairs in the list to create a dict
d = {items[i].strip(): items[i+1].strip() for i in range(0, len(items) - 1, 2)}

print(d)
# {'aaa': 'some text here', 'bbbb3': 'some other text here', 'cc': 'more text'}

Patrick Artner 7 years ago

You don't really need a regex here; simple string splitting is enough:

s = "[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text"    

parts = s.split("[")  # parts looks like: ['', 
                      #                    'aaa   ] some text here ',
                      #                    'bbbb3 ] some other text here ', 
                      #                    'cc    ] more text'] 
d = {}
# split parts further
for p in parts:
    if p.strip():
        key,value = p.split("]")            # split each part at ] and strip spaces
        d[key.strip()] = value.strip()      # put into dict

# Output:
form = "{:10} {}"
print( form.format("Key","Value"))

for item in d.items():
    print(form.format(*item))

Output (dict order can differ; Python 3.7+ keeps insertion order):

Key        Value
cc         more text
aaa        some text here
bbbb3      some other text here

For the formatting, see the documentation on the format string syntax.

Almost a one-liner:

d = {hh[0].strip():hh[1].strip() for hh in (k.split("]") for k in s.split("[") if k)}
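Both split-based versions above assume each chunk contains exactly one `]`. A slightly more defensive sketch using `str.partition`, which splits only at the first `]`; the function name `parse_pairs` and the second test line are hypothetical:

```python
def parse_pairs(line):
    """Split a line of '[key] value' chunks into a dict without regex."""
    pairs = {}
    for chunk in line.split("["):
        if not chunk.strip():
            continue  # skip the empty piece before the first "["
        # partition splits at the first "]" only, so a later "]" stays in the value
        key, sep, value = chunk.partition("]")
        if sep:  # ignore chunks that have no closing bracket at all
            pairs[key.strip()] = value.strip()
    return pairs

s = "[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text"
result = parse_pairs(s)
print(result)

# Hypothetical line whose first value contains a "]"
tricky = parse_pairs("[k1 ] value with ] inside [k2 ] plain")
print(tricky)
```

With `p.split("]")` the second line would raise a `ValueError` on unpacking, because the chunk splits into three pieces instead of two.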

Dani Mesejo 7 years ago

You can use finditer:

import re

s = '[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text'

pattern = re.compile(r'\[(\S+?)\s+\]([\s\w]+)')  # raw string avoids invalid-escape warnings
result = [(match.group(1).strip(), match.group(2).strip()) for match in pattern.finditer(s)]
print(result)

Output:

[('aaa', 'some text here'), ('bbbb3', 'some other text here'), ('cc', 'more text')]

Gsk 7 years ago

Using a regex, you can find the key,value pairs, store them in a dictionary, and print them out:

import re

mystr = "[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text"

a = dict(re.findall(r"\[([A-Za-z0-9_\s]+)\]([A-Za-z0-9_\s]+(?=\[|$))", mystr))

for key, value in a.items():
    print(key, value)  # print() is a function in Python 3

# OUTPUT (dict order can differ; Python 3.7+ keeps insertion order):
# aaa     some text here 
# cc      more text
# bbbb3   some other text here

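The regex matches two groups.
The first group is all the letters, digits, and whitespace enclosed in square brackets:

\[([A-Za-z0-9_\s]+)\]

The second is the letters, digits, and whitespace that follow, up to the next opening bracket or the end of the string:

([A-Za-z0-9_\s]+(?=\[|$))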

Note that the second group contains a positive lookahead: (?=\[|$)

findall then returns a list of tuples: [(key1,value1), (key2,value2), (key3,value3),...]
A list of tuples can be converted directly into a dictionary: dict(my_tuple_list).
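The effect of the lookahead can be checked by comparing it with a pattern that consumes the next `[` instead. A sketch; the `consuming` pattern and the helper `to_dict` are not from the answer and exist only for the comparison:

```python
import re

mystr = "[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text"

# With the lookahead: the next "[" is checked but not consumed.
with_lookahead = r"\[([A-Za-z0-9_\s]+)\]([A-Za-z0-9_\s]+(?=\[|$))"
# Comparison pattern (not from the answer): the "[" becomes part of the match.
consuming = r"\[([A-Za-z0-9_\s]+)\]([A-Za-z0-9_\s]+)(?:\[|$)"

def to_dict(pattern, text):
    """Build a dict from findall's (key, value) tuples, stripping the padding."""
    return {k.strip(): v.strip() for k, v in re.findall(pattern, text)}

good = to_dict(with_lookahead, mystr)
bad = to_dict(consuming, mystr)
print(good)  # all three pairs
print(bad)   # 'bbbb3' is missing: its "[" was consumed by the previous match
```

Stripping the keys here also avoids dictionary keys such as `'aaa   '` that still carry the padding from inside the brackets.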