
Regex to check whether a string contains one to three words and multiple hashtags in Python

  •  0
  • Alok  · 6 years ago
    s1 = 'Makeupby Antonia #makeup #makeupartist #makeupdolls #abhcosmetics'
    s2 = 'Makeupby Antonia asia #makeup #makeupartist #makeupdolls'
    s3 = 'Makeupby Antonia'
    s4 = '#makeup #makeupartist #makeupdolls #abhcosmetics'  
    s5 = 'Makeupby Antonia asia america #makeup #makeupartist'
    

    The regex should match only s1 and s2, because they contain between one and three plain words and more than one hashtag.

    I can match the plain words with \b(?<![#])[\w]+

    and the hashtags with [#]{1}\w+
    but when I combine the two expressions, it does not work.
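
    Used separately, the two patterns can simply be counted and compared against the limits, which avoids combining them into a single expression. A minimal sketch of that idea; the helper names `count_tokens` and `is_valid` are made up for illustration:

```python
import re

WORD = re.compile(r'\b(?<![#])[\w]+')   # plain words (not preceded by '#')
HASHTAG = re.compile(r'[#]{1}\w+')      # hashtags

def count_tokens(text):
    """Return (number of plain words, number of hashtags)."""
    return len(WORD.findall(text)), len(HASHTAG.findall(text))

def is_valid(text):
    # Hypothetical helper: 1 to 3 plain words and more than one hashtag.
    num_words, num_hashtags = count_tokens(text)
    return 1 <= num_words <= 3 and num_hashtags > 1

s1 = 'Makeupby Antonia #makeup #makeupartist #makeupdolls #abhcosmetics'
s5 = 'Makeupby Antonia asia america #makeup #makeupartist'

print(count_tokens(s1))  # (2, 4)
print(is_valid(s1))      # True
print(is_valid(s5))      # False: four plain words
```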

    3 answers  |  6 years ago
        1
  •  4
  •   Aran-Fey Kevin    6 years ago

    The sane solution

    Split the text into words and count how many of them start with a hash sign.

    def check(text):
        words = text.split()
    
        num_hashtags = sum(word.startswith('#') for word in words)
        num_words = len(words) - num_hashtags
    
        return 1 <= num_words <= 3 and num_hashtags > 1
    
    >>> [check(text) for text in [s1,s2,s3,s4]]
    [True, True, False, False]
    

    The regex solution

    import re
    
    def check(text):
        pattern = r'(?=.*\b(?<!#)\w+\b)(?!(?:.*\b(?<!#)\w+\b){4})(?:.*#){2}'
        return bool(re.match(pattern, text))
    

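    I deliberately haven't explained the regex because I don't want you to use it. The confusion you are probably feeling right now should be a strong sign that this is bad code.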
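
    For readers who want to see what the pattern checks anyway, the same expression can be spelled out with `re.VERBOSE`. This is a sketch for illustration only, not an endorsement of the approach. Two things here are assumptions and not part of the answer: the name `check_verbose`, and the use of `[#]` in place of a bare `#`, which is needed because an unescaped `#` starts a comment in verbose mode.

```python
import re

PATTERN = re.compile(r"""
    (?=.*\b(?<![#])\w+\b)            # at least one plain word (not preceded by '#')
    (?!(?:.*\b(?<![#])\w+\b){4})     # ... but never four or more plain words
    (?:.*[#]){2}                     # and at least two '#' characters
""", re.VERBOSE)

def check_verbose(text):
    # Hypothetical name; intended to behave like the one-line pattern above.
    return bool(PATTERN.match(text))

samples = [
    'Makeupby Antonia #makeup #makeupartist #makeupdolls #abhcosmetics',
    'Makeupby Antonia asia #makeup #makeupartist #makeupdolls',
    'Makeupby Antonia',
    '#makeup #makeupartist #makeupdolls #abhcosmetics',
    'Makeupby Antonia asia america #makeup #makeupartist',
]
results = [check_verbose(s) for s in samples]
print(results)  # [True, True, False, False, False]
```

    Even with comments, the two lookaheads plus the repeated `.*` are harder to verify at a glance than the split-and-count version.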

        2
  •  1
  •   Paulo Scardine    6 years ago

    If I understand your question correctly, and if you can assume that the words always come before the hashtags, you can use r'^(\w+ ){1,3}#\w+ #\w+'

    import re

    for s in ('Makeupby Antonia #makeup #makeupartist #makeupdolls #abhcosmetics',
              'Makeupby Antonia asia #makeup #makeupartist #makeupdolls',
              'Makeupby Antonia',
              '#makeup #makeupartist #makeupdolls #abhcosmetics',
              'Makeupby Antonia asia america #makeup #makeupartist',):
        print(bool(re.search(r'^(\w+ ){1,3}#\w+ #\w+', s)), s, sep=': ')
    

    This outputs:

    True: Makeupby Antonia #makeup #makeupartist #makeupdolls #abhcosmetics
    True: Makeupby Antonia asia #makeup #makeupartist #makeupdolls
    False: Makeupby Antonia
    False: #makeup #makeupartist #makeupdolls #abhcosmetics
    False: Makeupby Antonia asia america #makeup #makeupartist
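
    The ordering assumption matters: when a hashtag appears before the plain words, this pattern rejects a string that an order-independent count would accept. A small sketch of the difference; `check_prefix`, `check_split`, and the `mixed` input are made up for illustration:

```python
import re

PATTERN = re.compile(r'^(\w+ ){1,3}#\w+ #\w+')

def check_prefix(text):
    # Hypothetical wrapper: requires 1 to 3 words first, then two hashtags.
    return bool(PATTERN.search(text))

def check_split(text):
    # Order-independent count of plain words and hashtags.
    words = text.split()
    num_hashtags = sum(w.startswith('#') for w in words)
    num_plain = len(words) - num_hashtags
    return 1 <= num_plain <= 3 and num_hashtags > 1

mixed = '#makeup Makeupby Antonia #makeupartist'  # made-up input

print(check_prefix(mixed))  # False: the string does not start with a plain word
print(check_split(mixed))   # True: two plain words, two hashtags
```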
    
        3
  •  0
  •   Calum You    6 years ago

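    There is probably a lot of room for optimization (perhaps with a dependency, or with fewer loops), but here is a non-regex solution, as mentioned in the comments: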

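    s_list = [s1, s2, s3, s4]  # s1..s4 as defined in the question

    def hashtag_words(string_list):
        # split() with no argument also copes with repeated spaces
        words = [s.split() for s in string_list]
        # number of hashtags per string (tokens that start with '#')
        hashcounts = [sum(word.startswith("#") for word in wordlist) for wordlist in words]
        # everything that is not a hashtag counts as a plain word
        normcounts = [len(wordlist) - hashcount for wordlist, hashcount in zip(words, hashcounts)]
        # keep strings with more than one hashtag and 1 to 3 plain words
        sel_strings = [s for s, h, n in zip(string_list, hashcounts, normcounts) if h > 1 and n in (1, 2, 3)]
        return sel_strings

    hashtag_words(s_list)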
    
    ['Makeupby Antonia #makeup #makeupartist #makeupdolls #abhcosmetics',
     'Makeupby Antonia asia #makeup #makeupartist #makeupdolls']
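
    As one possible take on the "fewer loops" remark, the same filter can be written as a single pass over the input. This is only a sketch: the name `hashtag_words_single_pass` is made up, and it counts tokens that start with `#` instead of tokens that merely contain one.

```python
def hashtag_words_single_pass(string_list):
    """Keep strings with 1 to 3 plain words and more than one hashtag."""
    selected = []
    for s in string_list:
        tokens = s.split()
        num_hashtags = sum(tok.startswith('#') for tok in tokens)
        num_plain = len(tokens) - num_hashtags
        if num_hashtags > 1 and 1 <= num_plain <= 3:
            selected.append(s)
    return selected

s_list = [
    'Makeupby Antonia #makeup #makeupartist #makeupdolls #abhcosmetics',
    'Makeupby Antonia asia #makeup #makeupartist #makeupdolls',
    'Makeupby Antonia',
    '#makeup #makeupartist #makeupdolls #abhcosmetics',
    'Makeupby Antonia asia america #makeup #makeupartist',
]

selected = hashtag_words_single_pass(s_list)
for s in selected:
    print(s)
```

    Only the first two strings are printed; the fifth is rejected because it has four plain words.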