
Searching for phrases in a document

  •  0
  • pylearner  · Tech Community  · 6 years ago

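    The task is to match keywords in a paragraph. What I did was split the paragraph into words, put them in a list, and then match them against the search terms in another list.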

    Data:

    Automatic Product Title Tagging
    Aim: To automate the process of product title tagging using manually tagged data. 
    
    ROUTE OPTIMIZATION – Spring Clean
    Aim:  Minimizing the overall travel time using optimization techniques. 
    
    CUSTOMER SEGMENTATION:
    Aim:  Develop an engine which segments and provides the score for
          customers based on their behavior and analyze their purchasing pattern. 
    

    s = ['tagged', 'product title',  'tagging', 'analyze']
    
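    skills = []
    # data holds the text shown above
    for word in data.split():
        print(word)
        word = word.lower()  # str.lower() returns a new string, so reassign it
        if word in s:
            skills.append(word)
    skills1 = list(set(skills))

    print(skills1)

    Output:

    ['tagged', 'tagging', 'analyze']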
    

    When I use the split function, every word is split off on its own, so I cannot detect the phrase product title.

    I would appreciate it if someone could help.

    4 answers  |  as of 6 years ago
        1
  •  2
  •   icedwater PedroMorgan    6 years ago

    Iterate over the list s and check whether each element occurs in the string.

    Demo:

    data = """
     Automatic Product Title Tagging  
     Aim: To automate the process of product title tagging using manually tagged data.
     ROUTE OPTIMIZATION – Spring Clean
     Aim:  Minimizing the overall travel time using optimization techniques.
     CUSTOMER SEGMENTATION:
     Aim:  Develop an engine which segments and provides the score for  
           customers based on their behavior and analyze their purchasing
           pattern. 
    """
    s = ['tagged', 'product title',  'tagging', 'analyze']
    data = data.lower()
    
    skills = []
    for i in s:
        if i.lower() in data:
            skills.append(i)
    print(skills)
    

    Or, equivalently, as a list comprehension:

    skills = [i for i in s if i.lower() in data]
    

    Output:

    ['tagged', 'product title', 'tagging', 'analyze']
    
        2
  •  3
  •   Leo K    6 years ago

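    You are searching for phrases, not "keywords". One solution is a regular-expression search. A simple `substring in text` construct will not work well, because when given "product title" it could also match "byproduct titles".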

    This should do it:

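    import re
    # s is the list of search phrases; re.escape guards against regex metacharacters in a phrase
    skills = [k for k in s if re.search(r'\b' + re.escape(k) + r'\b', data, flags=re.IGNORECASE)]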
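
    To make the difference between the two approaches concrete, here is a minimal sketch that runs a plain substring test and a word-boundary regex on the same text. The helper name find_phrases and the sample sentence are made up for illustration; they are not part of the original answer.

```python
import re

def find_phrases(phrases, text):
    """Return the phrases that occur in text as whole words, ignoring case."""
    found = []
    for phrase in phrases:
        # \b anchors the match at word boundaries; re.escape protects
        # any regex metacharacters that a phrase might contain.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(phrase)
    return found

text = "A list of byproduct titles."          # hypothetical sample sentence
print("product title" in text.lower())        # True: the substring test gives a false positive
print(find_phrases(["product title"], text))  # []: the word-boundary test rejects it
```

    The regex version is slower than a substring test on large inputs, so it is a trade-off between precision and speed.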
    
        3
  •  0
  •   guroosh    6 years ago

    2) If you do split the text, you can search for a match across the words at indices i and i+1.
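
    One way to read this suggestion is as a sliding window over the split tokens: join each run of n adjacent words and compare it with the search phrases. The sketch below generalizes the i / i+1 idea to phrases of any length. The function name match_with_ngrams and the punctuation stripping are assumptions of mine, not something the answer spells out.

```python
import string

def match_with_ngrams(phrases, text):
    """Match single words and multi-word phrases against split tokens."""
    # Lowercase each token and strip surrounding ASCII punctuation,
    # so that "data." becomes "data" and "Aim:" becomes "aim".
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    wanted = {p.lower() for p in phrases}
    # The longest phrase decides how wide the window has to get.
    max_len = max((len(p.split()) for p in wanted), default=0)
    found = set()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[i:i + n])  # words i .. i+n-1
            if candidate in wanted:
                found.add(candidate)
    # Report matches in the order of the original phrase list.
    return [p for p in phrases if p.lower() in found]

line = "Aim: To automate the process of product title tagging using manually tagged data."
s = ['tagged', 'product title', 'tagging', 'analyze']
print(match_with_ngrams(s, line))
```

    Because the comparison is done on whole tokens, this variant does not match "byproduct titles" when searching for "product title".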

        4
  •  0
  •   wailinux    6 years ago

    "Aim:" must appear in every line of "data", so I would find the index of that word ("Aim:").

    p = "Automatic Product Title Tagging  Aim: To automate the process of product title tagging using manually tagged data."
    index = p.find("Aim:")  # 33
    print(p[index:])
    # output:
    # Aim: To automate the process of product title tagging using manually tagged data.
    w_length = len("Aim:")  # 4, used to skip past the word "Aim:"
    print(p[index + w_length:])
    # output:
    #  To automate the process of product title tagging using manually tagged data.
    

    Example:

    s = ['tagged', 'product title',  'tagging', 'analyze']
    skills = []
    for line in data.split("\n"):
        index = line.find("Aim:")  # -1 when the line has no "Aim:"
        if index != -1:
            # start after "Aim:" itself and compare word by word
            for word in line[index + len("Aim:"):].split():
                if word.lower() in s:
                    skills.append(word)
                    print(word)
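
    Since the loop above still compares single words, it cannot find "product title". A possible variation, sketched below, keeps the "Aim:" slicing but tests each phrase against the rest of the line as a substring. The function name skills_from_aim_lines and the marker parameter are hypothetical. Note that text on a continuation line without "Aim:" (such as the line containing "analyze" in the sample data) is skipped by this approach.

```python
def skills_from_aim_lines(phrases, text, marker="Aim:"):
    """Collect phrases found after the marker on each line that contains it."""
    found = []
    for line in text.split("\n"):
        index = line.find(marker)
        if index == -1:
            continue  # this line has no marker, so skip it
        # Everything after the marker, lowercased for a case-insensitive test.
        rest = line[index + len(marker):].lower()
        for phrase in phrases:
            if phrase.lower() in rest and phrase not in found:
                found.append(phrase)
    return found

data = """
Automatic Product Title Tagging
Aim: To automate the process of product title tagging using manually tagged data.
ROUTE OPTIMIZATION – Spring Clean
Aim:  Minimizing the overall travel time using optimization techniques.
CUSTOMER SEGMENTATION:
Aim:  Develop an engine which segments and provides the score for
      customers based on their behavior and analyze their purchasing pattern.
"""
s = ['tagged', 'product title', 'tagging', 'analyze']
print(skills_from_aim_lines(s, data))
```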