代码之家 › 专栏 › 技术社区 › unj2

在一长串字符中查找单词。自动标记

string-algorithm nlp computer-science algorithm

unj2 · 技术社区 · 14 年前

你如何在一长串字符中找到正确的单词?

输入:

"The revised report onthesyntactictheoriesofsequentialcontrolandstate"

谷歌的输出:

"The revised report on syntactic theories sequential controlandstate"

(考虑到它们产生输出的时间,这已经足够接近了)

你觉得谷歌是怎么做到的? 你将如何提高精确度?

5 回复 | 直到 14 年前

Claudiu 14 年前

我会尝试这样的递归算法:

尝试在每个位置插入一个空格。如果左边是单词,则在右边重复。
计算所有最终输出中的有效字数/总字数。最好的比例很可能就是你的答案。

例如,将其设置为“ContenceIsGood”将运行:

thesentenceisgood
the sentenceisgood
    sent enceisgood
         enceisgood: OUT1: the sent enceisgood, 2/3
    sentence isgood
             is good
                go od: OUT2: the sentence is go od, 4/5
             is good: OUT3: the sentence is good, 4/4
    sentenceisgood: OUT4: the sentenceisgood, 1/2
these ntenceisgood
      ntenceisgood: OUT5: these ntenceisgood, 1/2

所以你可以选择3作为答案。

piccolbo 14 年前

尝试随机规则语法(相当于隐马尔可夫模型),规则如下:

for every word in a dictionary:
stream -> word_i stream with probability p_w
word_i -> letter_i1 ...letter_in` with probability q_w (this is the spelling of word_i)
stream -> letter stream with prob p (for any letter)
stream -> epsilon with prob 1

概率可以从训练集中得到,但请参见下面的讨论。最有可能的解析是使用维特比算法计算的,它在隐藏状态的数量上具有二次时间复杂性,在本例中是您的词汇表,因此您可能会遇到大型词汇表的速度问题。但是,如果你把所有的p_w=1,q_w=1 p=0.5设置为,这意味着,在一个人工语言模型中,这些都是概率,在这个模型中,所有的词都是同样可能的,所有的非词都是同样不可能的。当然,如果不使用这种简化方法,您可以更好地分段,但是算法的复杂性会降低很多。如果你在 wikipedia entry 对于这个特殊情况,您可以尝试简化它。到k位置的维特比分析概率可以简化为 VP(k) = max_l(VP(k-l) * (1 if text[k-l:k] is a word else .5^l) 你可以用一个词的最大长度来绑定l,并通过哈希搜索来查找l字母是否构成一个词。它的复杂性与词汇的大小无关,并且 O(<text length> <max l>) . 对不起,这不是证据,只是一个草图,但应该让你去。另一种可能的优化方法是,如果创建字典的trie,则可以检查子字符串是否是任何正确单词的前缀。因此,当您查询文本[k-l:k]并得到否定的答案时,您已经知道对于任何d,文本[k-l:k+d]也是如此。要利用这一点,您必须显著地重新排列递归,因此我不确定是否可以充分利用这一点(它可以看到注释)。

Dr. belisarius 14 年前

这是我最近为一个代码高尔夫开发的Mathematica代码。
它是一种最小匹配、非贪婪的递归算法。这意味着“笔比剑强”(没有空格)这句话返回“笔比剑强”。

findAll[s_] :=
  Module[{a = s, b = "", c, sy = "="},
  While[
   StringLength[a] != 0,
   j = "";
   While[(c = findFirst[a]) == {} && StringLength[a] != 0,
    j = j <> StringTake[a, 1];
    sy = "~";
    a = StringDrop[a, 1];
   ];
   b = b <> " " <> j ;
   If[c != {},
    b = b <> " " <> c[[1]];
    a = StringDrop[a, StringLength[c[[1]]]];
   ];
  ];
   Return[{StringTrim[StringReplace[b, "  " -> " "]], sy}];
]

findFirst[s_] :=
  If[s != "" && (c = DictionaryLookup[s]) == {}, 
   findFirst[StringDrop[s, -1]], Return[c]];

样本输入

ss = {"twodreamstop", 
      "onebackstop", 
      "butterfingers", 
      "dependentrelationship", 
      "payperiodmatchcode", 
      "labordistributioncodedesc", 
      "benefitcalcrulecodedesc", 
      "psaddresstype", 
      "ageconrolnoticeperiod",
      "month05", 
      "as_benefits", 
      "fname"}

产量

 twodreamstop              = two dreams top
 onebackstop               = one backstop
 butterfingers             = butterfingers
 dependentrelationship     = dependent relationship
 payperiodmatchcode        = pay period match code
 labordistributioncodedesc ~ labor distribution coded es c
 benefitcalcrulecodedesc   ~ benefit c a lc rule coded es c
 psaddresstype             ~ p sad dress type
 ageconrolnoticeperiod     ~ age con rol notice period
 month05                   ~ month 05
 as_benefits               ~ as _ benefits
 fname                     ~ f name

高温高压

Skarab 14 年前

检查拼写更正算法。下面是一篇关于谷歌算法的文章的链接- http://www.norvig.com/spell-correct.html . 在这里你会发现 a scientific paper on this topic from google .

user109134 14 年前

在进行递归拆分和字典查找之后,为了提高短语中单词对的质量,您可能有兴趣使用单词对的相互信息。

这基本上是通过一个训练集,找出单词对的m.i.值,它告诉你阿尔伯特·辛普森比阿尔伯特·爱因斯坦的可能性要小。

你可以尝试直接在科学中搜索这个主题的学术论文。有关相互信息的基本信息,请参阅 http://en.wikipedia.org/wiki/Mutual_information

去年,我参与了一个搜索引擎项目的短语搜索部分,在这个项目中,我试图通过维基百科数据集进行解析,并对每一个词对进行排序。我有C++的代码,如果你能和它分享,如果你能找到它的使用。它解析wikimedia,并为每一个词对找出相互的信息。