
Regex to check whether a string contains at least one and at most three words plus multiple hashtags in Python

regex python

Alok · 6 years ago

s1 = 'Makeupby Antonia #makeup #makeupartist #makeupdolls #abhcosmetics'
s2 = 'Makeupby Antonia asia #makeup #makeupartist #makeupdolls'
s3 = 'Makeupby Antonia'
s4 = '#makeup #makeupartist #makeupdolls #abhcosmetics'  
s5 = 'Makeupby Antonia asia america #makeup #makeupartist'

The regex should match s1 and s2, because each has at most 3 ordinary words followed by multiple hashtags.

I can match the ordinary words with \b(?<![#])[\w]+
and the hashtags with
[#]{1}\w+
but when I combine the two expressions, it doesn't work.

3 answers | 6 years ago

Aran-Fey Kevin · 6 years ago

The sane solution

Split the text into words and count how many of them start with a hash sign.

def check(text):
    words = text.split()

    num_hashtags = sum(word.startswith('#') for word in words)
    num_words = len(words) - num_hashtags

    return 1 <= num_words <= 3 and num_hashtags > 1

>>> [check(text) for text in [s1,s2,s3,s4]]
[True, True, False, False]
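The same counting idea can also be sketched with the two patterns from the question, applied separately instead of being combined into one expression. This is only an illustration, not part of the original answer; the name `check_with_findall` and the module-level constants are hypothetical.

```python
import re

# The two patterns from the question, used independently.
WORD = re.compile(r'\b(?<![#])[\w]+')   # a word not directly preceded by '#'
HASHTAG = re.compile(r'[#]{1}\w+')      # '#' followed by word characters

def check_with_findall(text):
    # Hypothetical helper: count each kind of token, then apply the rule
    # "1 to 3 ordinary words and more than one hashtag".
    num_words = len(WORD.findall(text))
    num_hashtags = len(HASHTAG.findall(text))
    return 1 <= num_words <= 3 and num_hashtags > 1

samples = [
    'Makeupby Antonia #makeup #makeupartist #makeupdolls #abhcosmetics',
    'Makeupby Antonia asia #makeup #makeupartist #makeupdolls',
    'Makeupby Antonia',
    '#makeup #makeupartist #makeupdolls #abhcosmetics',
    'Makeupby Antonia asia america #makeup #makeupartist',
]
print([check_with_findall(s) for s in samples])
# [True, True, False, False, False]
```

Unlike the split-based version, this does not care where in the string the words and hashtags appear, only how many of each there are.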

The regex solution

import re

def check(text):
    pattern = r'(?=.*\b(?<!#)\w+\b)(?!(?:.*\b(?<!#)\w+\b){4})(?:.*#){2}'
    return bool(re.match(pattern, text))

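I'm deliberately not explaining the regex, because I don't want you to use it. The confusion you're probably feeling should be a strong sign that this is bad code.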
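If you would like to convince yourself that the two versions behave the same on the strings from the question, a quick cross-check could look like the sketch below. The functions are copies of the two `check` definitions above, renamed to `check_split` and `check_regex` (hypothetical names) so that both can coexist.

```python
import re

def check_split(text):
    # Copy of the split-based version from this answer.
    words = text.split()
    num_hashtags = sum(word.startswith('#') for word in words)
    num_words = len(words) - num_hashtags
    return 1 <= num_words <= 3 and num_hashtags > 1

def check_regex(text):
    # Copy of the regex-based version from this answer.
    pattern = r'(?=.*\b(?<!#)\w+\b)(?!(?:.*\b(?<!#)\w+\b){4})(?:.*#){2}'
    return bool(re.match(pattern, text))

samples = [
    'Makeupby Antonia #makeup #makeupartist #makeupdolls #abhcosmetics',
    'Makeupby Antonia asia #makeup #makeupartist #makeupdolls',
    'Makeupby Antonia',
    '#makeup #makeupartist #makeupdolls #abhcosmetics',
    'Makeupby Antonia asia america #makeup #makeupartist',
]

# Pair up the results of both implementations for each sample.
results = [(check_split(s), check_regex(s)) for s in samples]
print(results)
# [(True, True), (True, True), (False, False), (False, False), (False, False)]
```

Agreement on five samples is not a proof of equivalence; the two versions can still differ on inputs such as a `#` that is not followed by a word.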

Paulo Scardine · 6 years ago

If I understood your question correctly, and if you can assume the words always come before the hashtags, you can use r'^(\w+ ){1,3}#\w+ #\w+'

import re

# Print whether each sample has 1-3 leading words followed by at least two hashtags.
for s in ('Makeupby Antonia #makeup #makeupartist #makeupdolls #abhcosmetics',
          'Makeupby Antonia asia #makeup #makeupartist #makeupdolls',
          'Makeupby Antonia',
          '#makeup #makeupartist #makeupdolls #abhcosmetics',
          'Makeupby Antonia asia america #makeup #makeupartist',):
    print(bool(re.search(r'^(\w+ ){1,3}#\w+ #\w+', s)), s, sep=': ')

This outputs:

True: Makeupby Antonia #makeup #makeupartist #makeupdolls #abhcosmetics
True: Makeupby Antonia asia #makeup #makeupartist #makeupdolls
False: Makeupby Antonia
False: #makeup #makeupartist #makeupdolls #abhcosmetics
False: Makeupby Antonia asia america #makeup #makeupartist
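Note that the pattern above expects exactly one space between tokens. If the input might contain tabs or repeated spaces (an assumption, since the question does not say), a more lenient variant could be sketched like this. `STRICT`, `LENIENT`, and `matches` are hypothetical names used only for the comparison.

```python
import re

# The pattern from this answer: single spaces only.
STRICT = re.compile(r'^(\w+ ){1,3}#\w+ #\w+')
# Assumed variant: any run of whitespace between tokens, optional leading whitespace.
LENIENT = re.compile(r'^\s*(?:\w+\s+){1,3}#\w+\s+#\w+')

def matches(pattern, text):
    # True if the compiled pattern is found in the text.
    return bool(pattern.search(text))

messy = 'Makeupby  Antonia\t#makeup   #makeupartist'
print(matches(STRICT, messy))   # False
print(matches(LENIENT, messy))  # True

# The lenient variant still rejects four leading words.
too_many = 'Makeupby Antonia asia america #makeup #makeupartist'
print(matches(LENIENT, too_many))  # False
```

Both patterns still rely on the words coming before the hashtags, as stated in the answer.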

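Calum You · 6 years ago

There is probably a lot of room for optimization (perhaps with a dependency, or with fewer loops), but here is a non-regex solution, as suggested in the comments: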

s_list = [s1, s2, s3, s4]

def hashtag_words(string_list):
    words = [s.split(" ") for s in string_list]
    hashcounts = [["#" in word for word in wordlist].count(True) for wordlist in words]
    normcounts = [len(wordlist) - hashcount for wordlist, hashcount in zip(words, hashcounts)]
    sel_strings = [s for s, h, n in zip(string_list, hashcounts, normcounts) if h > 1 and 1 <= n <= 3]
    return sel_strings

hashtag_words(s_list)

>['Makeupby Antonia #makeup #makeupartist #makeupdolls #abhcosmetics',
 'Makeupby Antonia asia #makeup #makeupartist #makeupdolls']
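One detail worth knowing about the function above: `"#" in word` counts any word that contains a hash sign anywhere, not only words that start with one. The sketch below contrasts the two tests on a made-up input; the helper names and the sample sentence are hypothetical and not from the question.

```python
def count_hashtags_contains(words):
    # Mirrors the test used in this answer: '#' anywhere in the word.
    return sum("#" in word for word in words)

def count_hashtags_prefix(words):
    # Stricter assumption: a hashtag must start with '#'.
    return sum(word.startswith("#") for word in words)

# Hypothetical input where an ordinary word contains '#'.
words = "C# tips #dotnet #coding".split(" ")

print(count_hashtags_contains(words))  # 3
print(count_hashtags_prefix(words))    # 2
```

For the strings in the question the two tests give the same counts, so the choice only matters if ordinary words may contain `#`.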
