
Regex for a specific character inside unpaired tags

regex python-3.x python

vr8ce · Tech Community · 7 months ago

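I'm trying to find a character inside an unbalanced pair of tags. I can identify it inside a matched set, and I can also identify it when the pair delimiters are single characters, but when the delimiters are multiple characters I can't work out the syntax for finding it inside an unmatched set. I've tried several different lookaround combinations, all without success.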

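The tags are <lsq> and <rsq>; they are meant to come in pairs. A line can have no pairs, one pair, or multiple pairs, and/or one or more unmatched tags, i.e. an <lsq> with no matching <rsq>. (Although an <rsq> without an <lsq> is theoretically possible, I haven't encountered any and am not concerned about them.)

I'm trying to find instances of ’ (right single quote) inside an unmatched pair, i.e. after an <lsq> that has no corresponding <rsq>. That can happen because EOL comes before an <rsq>, or because another <lsq> occurs first.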

Sample data:

<p><lsq>Line one, matched one,<rsq></p>
<p><lsq>Line two, unmatched’ one. <lsq>Line two, matched’ pair one.<rsq></p>
<p>Line three, ’fore no tag.</p>
<p>Line four, ’fore first tag. <lsq>Line four, unmatched one’.</p>
<p>Line five free text before. <lsq>Line five, matched one,<rsq> <lsq>line five, ’matched two.<rsq> Line five’ free text after.</p>
<p><lsq>Line six matched one<rsq>, line six free text! <lsq>Line six matched two hittin’ and sittin’ and goin’ on forever.<rsq></p>
<p><lsq>Line seven unmatched’ one.</p>
<p>Line eight free text. Line <lsq>eight’ unmatched one, <lsq>unmatched’ two.</p>

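The regex should match only the quotes in line two (unmatched’ only), line four (unmatched one’ only), and line seven (unmatched’), plus the first one in line eight (eight’); I can run it multiple times to find the next one. (I've included the neighboring words here to make the match locations clear, but I'm only looking to match the character itself.)

This is in Python, and the regex is part of a search and replace, e.g.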

regex.sub(r"regex", r"<tag>", text_being_processed)

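This assumes the match is only the character itself; if it's easier with capture groups, I can adjust the replacement as needed.

Ignoring EOL for now, I tried to find the text between two <lsq> tags with no intervening <rsq>, but I'm clearly not handling the negative lookahead correctly: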

(?<=<lsq>)(?!<rsq>).*?(?=<lsq>)

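It does find consecutive <lsq> tags, but it also finds the ones that have an <rsq> between them. I tried moving the <rsq> lookaround around, along with a few other things, but none of them was right. And that's before even trying to find a specific character inside the unmatched set. I've searched SO and the web for similar examples but couldn't find any.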

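1 answer | as of 7 months ago

Barmar · 7 months ago

Regular expressions don't handle nested structures well. Use a parser.

Example that finds the text described above: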

# pip install beautifulsoup4
from bs4 import BeautifulSoup

data = '''\
<p><lsq>Line one, matched one,<rsq></p>
<p><lsq>Line two, unmatched’ one. <lsq>Line two, matched’ pair one.<rsq></p>
<p>Line three, ’fore no tag.</p>
<p>Line four, ’fore first tag. <lsq>Line four, unmatched one’.</p>
<p>Line five free text before. <lsq>Line five, matched one,<rsq> <lsq>line five, ’matched two.<rsq> Line five’ free text after.</p>
<p><lsq>Line six matched one<rsq>, line six free text! <lsq>Line six matched two hittin’ and sittin’ and goin’ on forever.<rsq></p>
<p><lsq>Line seven unmatched’ one.</p>
<p>Line eight free text. Line <lsq>eight’ unmatched one, <lsq>unmatched’ two.</p>
'''

soup = BeautifulSoup(data, 'html.parser')
for p in soup.find_all('p'):
    for lsq in p.find_all('lsq'):
        # Without proper end tags, BS thinks line 2's second
        # lsq/rsq pair is nested in the first lsq, so
        # recursive=False is needed to not detect the nested rsq.
        if not lsq.find('rsq', recursive=False):
            print(lsq.next_element)

Output:

Line two, unmatched’ one. 
Line four, unmatched one’.
Line seven unmatched’ one.
eight’ unmatched one, 
unmatched’ two.
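
If a regex is still wanted for the replacement step itself, the following is a minimal sketch using only the standard-library `re` module; it is not part of the answer above. It assumes that each paragraph sits on a single line, that the character being looked for is U+2019 (right single quotation mark), and that replacing it in every unmatched span at once is acceptable. The names `UNMATCHED_SPAN` and `mark_unmatched` are made up for illustration. The pattern in the question tests `(?!<rsq>)` only once, at the position right after `<lsq>`; the sketch repeats that test before every character instead, so a span can never run across another tag.

```python
import re

APOSTROPHE = "\u2019"  # right single quotation mark (assumed target character)

# Text that starts right after an <lsq>, never crosses another <lsq>/<rsq>,
# and is followed by a new <lsq> or the end of the line.
UNMATCHED_SPAN = re.compile(
    r"(?<=<lsq>)"          # start immediately after an opening tag
    r"(?:(?!<[lr]sq>).)*"  # consume characters, re-checking for a tag each time
    r"(?=<lsq>|$)",        # stop at another <lsq> or end of line, never at <rsq>
    re.MULTILINE,
)


def mark_unmatched(text, replacement="<tag>"):
    """Replace the apostrophe only inside spans opened by an unmatched <lsq>."""
    return UNMATCHED_SPAN.sub(
        lambda m: m.group(0).replace(APOSTROPHE, replacement),
        text,
    )


if __name__ == "__main__":
    sample = (
        "<p><lsq>Line two, unmatched\u2019 one. "
        "<lsq>Line two, matched\u2019 pair one.<rsq></p>"
    )
    print(mark_unmatched(sample))
```

Using a function as the replacement argument keeps the pattern itself simple: the regex only has to locate the unmatched span, and the plain `str.replace` call handles any number of apostrophes inside it.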