
Matching rows that contain permutations of a word

information-retrieval nlp

4

ÊÉÄ±u · Tech Community · 16 years ago

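Suppose you have a large table with a varchar column.

How would you match the rows whose varchar column contains the word "preferred", given that the data is somewhat noisy and occasionally contains misspellings, for example: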

['$2.10 Cumulative Convertible Preffered Stock, $25 par value',
'5.95% Preferres Stock',
'Class A Preffered',
'Series A Peferred Shares',
'Series A Perferred Shares',
'Series A Prefered Stock',
'Series A Preffered Stock',
'Perfered',
'Preffered  C']

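The misspelled permutations of "preferred" above seem to show a family resemblance, yet they have very little in common. Note that splitting out every word and running levenshtein on every word in every row would be prohibitively expensive.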

UPDATE:

There are a few other similar cases, for example "restricted":

['Resticted Stock Plan',
'resticted securities',
'Ristricted Common Stock',
'Common stock (restrticted, subject to vesting)',
'Common Stock (Retricted)',
'Restircted Stock Award',
'Restriced Common Stock',]

4 answers | last activity 16 years ago

1

Simon Nickerson 16 years ago

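Could you try training on a small sample of the table to find the likely misspellings (using split + levenshtein), and then use the resulting word list on the full table?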
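The sample-then-apply idea above can be sketched roughly as follows. This is only an illustration: the function names `levenshtein` and `harvest_variants`, the threshold of 2, and the tiny sample are assumptions, not anything the answer specifies. Plain Levenshtein counts a transposition as two edits, so swapped-letter variants such as "Perfered" may need a higher threshold or a distance that handles transpositions.

```python
import re


def levenshtein(a, b):
    """Edit distance with insertions, deletions and substitutions (no transpositions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = cur
    return prev[-1]


def harvest_variants(rows, target, max_distance=2):
    """Collect the distinct tokens in a sample that are close to the target word."""
    variants = set()
    for row in rows:
        for token in re.findall(r"[a-z]+", row.lower()):
            if 0 < levenshtein(token, target) <= max_distance:
                variants.add(token)
    return variants


# Hypothetical small sample drawn from the big table.
sample = [
    "Class A Preffered",
    "Series A Prefered Stock",
    "5.95% Preferres Stock",
    "Series A Peferred Shares",
    "Common Stock",
]
variants = harvest_variants(sample, "preferred")
print(sorted(variants))
```

The harvested list can then be used for cheap exact or substring matching against the full table, so the expensive distance computation only runs on the sample.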

2

1

tpdi 16 years ago

Create two more tables, spelling and possible_spelling:

-- you can work out the types

create table spelling ( id, word ) ; 
create table possible_spelling 
( id, spelling_id references spelling(id), spelling ) ;
-- possible spelling also includes the correct spelling
-- all values are lowercase

insert into spelling( word ) values ('preferred');
insert into possible_spelling( spelling_id, spelling ) 
 select 1, '%preferred%' union select 1, '%prefered%' union ....;

select * 
from bigtable a 
join possible_spelling b
on (lower(a.data) like b.spelling )
join spelling c on (b.spelling_id = c.id) 
where c.word = 'preferred';

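Objection: this will be slow, and it requires setup. Answer: it is not that slow, and it should be a one-time job to classify and fix your data. Set it up once, then classify each incoming row once.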
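To try the lookup-table approach end to end, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The column types, the sample rows in `bigtable`, and the three listed spellings are assumptions made for illustration; the answer leaves the types and the full spelling list open.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bigtable (data TEXT);
    CREATE TABLE spelling (id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE possible_spelling (
        id INTEGER PRIMARY KEY,
        spelling_id INTEGER REFERENCES spelling(id),
        spelling TEXT  -- a lowercase LIKE pattern, correct spelling included
    );
""")

# Hypothetical rows standing in for the real table.
conn.executemany(
    "INSERT INTO bigtable (data) VALUES (?)",
    [("Class A Preffered",), ("Series A Prefered Stock",),
     ("Series A Preferred Stock",), ("Common Stock",)],
)
conn.execute("INSERT INTO spelling (id, word) VALUES (1, 'preferred')")
conn.executemany(
    "INSERT INTO possible_spelling (spelling_id, spelling) VALUES (1, ?)",
    [("%preferred%",), ("%prefered%",), ("%preffered%",)],
)

# Same join as in the answer; DISTINCT guards against a row matching two patterns.
rows = conn.execute("""
    SELECT DISTINCT a.data
    FROM bigtable a
    JOIN possible_spelling b ON lower(a.data) LIKE b.spelling
    JOIN spelling c ON b.spelling_id = c.id
    WHERE c.word = 'preferred'
    ORDER BY a.data
""").fetchall()
matches = [r[0] for r in rows]
print(matches)
```

Because the patterns start with a wildcard, an ordinary index on the data column is unlikely to help, which fits the answer's framing of this as a one-time classification pass rather than a per-query lookup.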

3

1

bytebender 16 years ago

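Is this going to be implemented in T-SQL, or in what language?

You could probably hit most of them with a regular expression.

Some variation of the following: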

"p(er|re|e)f{1,2}er{1,2}ed"

"r(e|i)s?t(ri|ir|rti|i)ct?ed"

You will want to make sure the match is not case sensitive...
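As a quick check of the two patterns above, here is a small Python sketch that runs them case-insensitively against a few of the question's sample rows. The helper name `classify` and the choice of Python's `re` module are assumptions; the same patterns should carry over to other regex engines with minor changes.

```python
import re

# The two patterns from the answer, compiled case-insensitively.
PATTERNS = {
    "preferred": re.compile(r"p(er|re|e)f{1,2}er{1,2}ed", re.IGNORECASE),
    "restricted": re.compile(r"r(e|i)s?t(ri|ir|rti|i)ct?ed", re.IGNORECASE),
}


def classify(text):
    """Return the canonical words whose pattern matches somewhere in text."""
    return [word for word, pattern in PATTERNS.items() if pattern.search(text)]


samples = [
    "Series A Peferred Shares",
    "Perfered",
    "Restriced Common Stock",
    "Common stock (restrticted, subject to vesting)",
    "5.95% Preferres Stock",   # not caught: the pattern requires a final "ed"
    "Common Stock",
]
for row in samples:
    print(row, "->", classify(row))
```

The "Preferres" row shows the limit of this approach: each new kind of typo needs the pattern to be widened by hand.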

4

1

unmounted 16 years ago

I would probably do something like this, if you can get away with running Levenshtein just once. Here is an amazing spellchecker implementation by Peter Norvig:

import re, collections

def words(text): return re.findall('[a-z]+', text.lower()) 

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# open() replaces the Python 2-only file() builtin
NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]
    replaces   = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts    = [a + c + b     for a, b in s for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

He provides a training set here: http://norvig.com/big.txt and here is some sample output:

>>> correct('prefferred')
'preferred'
>>> correct('ristricted')
'restricted'
>>> correct('ristrickted')
'restricted'

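In your case, you could copy the original column to a new column, passing it through the spellchecker as you copy. Then put a fulltext index on the correctly spelled column and match your queries against it, but return results from the original column. You only have to do this once, instead of computing distances every time. You could also spellcheck the input, or check the corrected version only as a fallback. Either way, Norvig's example is worth studying.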
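A rough sketch of the corrected-column idea, under some loud assumptions: instead of training on big.txt, it precomputes every string within two Norvig-style edits of a small hand-picked list of canonical words (`CANONICAL`, hypothetical), and it uses a plain Python list with a token filter as a stand-in for the table and its fulltext index. The names `build_corrections` and `normalize` are made up for this illustration.

```python
import re

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in ALPHABET if b]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def build_corrections(canonical_words):
    """Map every string within two edits of a canonical word to that word."""
    corrections = {}
    for word in canonical_words:
        for e1 in edits1(word):
            corrections.setdefault(e1, word)
            for e2 in edits1(e1):
                corrections.setdefault(e2, word)
        corrections[word] = word
    return corrections


def normalize(text, corrections):
    """Lowercase, keep alphabetic tokens, and replace known variants."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(corrections.get(tok, tok) for tok in tokens)


CANONICAL = ["preferred", "restricted"]      # hypothetical domain vocabulary
CORRECTIONS = build_corrections(CANONICAL)   # built once, reused for every row

rows = ["Series A Perferred Shares", "Perfered",
        "Common Stock (Retricted)", "Common Stock"]
# Keep the original next to its normalized copy; search one, return the other.
indexed = [(row, normalize(row, CORRECTIONS)) for row in rows]
hits = [orig for orig, norm in indexed if "preferred" in norm.split()]
print(hits)
```

One caveat with this shortcut: any real word within two edits of a canonical word, such as "referred", would also be rewritten, which is part of what Norvig's frequency model trained on big.txt is meant to handle.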