Removing noise words in SQL Server 2005 full-text search

full-text-search sql-server-2005

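jamiecon · Tech Community · 16 years ago

In a pretty typical scenario, I have a "Search" text box on my web application whose user input is passed directly to a stored procedure. The procedure then uses full-text indexing to search two fields in two tables, which are joined on the appropriate keys.

I am using the CONTAINS predicate to search the fields. Before passing the search string in, I do the following: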

SET @ftQuery = '"' + REPLACE(@query,' ', '*" OR "') + '*"'

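For example, a search for 'the castle' becomes "the*" OR "castle*". This is necessary because I want people to be able to search on 'cas' and get results for 'castle'.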

WHERE CONTAINS(Building.Name, @ftQuery) OR CONTAINS(Road.Name, @ftQuery)
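To see exactly what string the REPLACE expression hands to CONTAINS, the same transformation can be sketched outside SQL. The Python below is only an illustration (the code in this thread is T-SQL, C# and VB.NET), and the name build_prefix_query is hypothetical. Unlike the plain REPLACE, it splits on any run of whitespace, so a double space does not produce an empty "*" term, and it drops double quotes from the input, which would otherwise end a quoted term early.

```python
def build_prefix_query(user_input: str, operator: str = "OR") -> str:
    """Turn space-separated words into quoted prefix terms for CONTAINS."""
    # Drop double quotes so user input cannot close a quoted term early.
    cleaned = user_input.replace('"', " ")
    # split() with no argument collapses runs of whitespace and trims the ends.
    words = cleaned.split()
    return f" {operator} ".join(f'"{word}*"' for word in words)


print(build_prefix_query("the castle"))   # "the*" OR "castle*"
print(build_prefix_query("the  castle"))  # same result despite the double space
```

The resulting string would still be passed to the stored procedure as a parameter, in the same way as @ftQuery above.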

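The problem is that now that I've appended a wildcard to the end of each word, noise words (e.g. 'the') get a wildcard appended too, and therefore no longer appear to be dropped. This means that a search for 'the castle' will return items containing words such as 'theatre'.

My first thought was to change OR to AND, but this just seems to return no matches if a noise word is used in the query.

All I'm trying to achieve is to let the user enter multiple space-separated words, in any order, that represent either the whole or the prefix of the words they're searching for, and to drop noise words such as 'the' from their input (otherwise, when they search for 'the castle', they get a big list of items with the result they need somewhere in the middle of it).

I could go ahead and implement my own noise word removal procedure, but it seems like something that full-text indexing ought to be able to handle.

Grateful for any help!

Jamie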

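5 replies | latest 11 years ago

galuvian 16 years ago

Noise words are stripped out before the index is stored, so it is impossible to write a query that searches on a stop word. If you really want to enable this behavior, you need to edit the stop word list ( http://msdn.microsoft.com/en-us/library/ms142551.aspx ) and then rebuild the index.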

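amsimmon 16 years ago

I had the same problem and, after a thorough investigation, concluded that there is no good solution.

As a compromise, I'm implementing a brute-force solution:

1) Open C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData\noiseENU.txt and copy all of the text in it.

2) Paste it into a code file in your application, replacing each line break with "," (including the quotes) to get a list initializer like this: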

public static List<string> _noiseWords = new List<string>{ "about", "1", "after", "2", "all", "also", "3", "an", "4", "and", "5", "another", "6", "any", "7", "are", "8", "as", "9", "at", "0", "be", "$", "because", "been", "before", "being", "between", "both", "but", "by", "came", "can", "come", "could", "did", "do", "does", "each", "else", "for", "from", "get", "got", "has", "had", "he", "have", "her", "here", "him", "himself", "his", "how", "if", "in", "into", "is", "it", "its", "just", "like", "make", "many", "me", "might", "more", "most", "much", "must", "my", "never", "no", "now", "of", "on", "only", "or", "other", "our", "out", "over", "re", "said", "same", "see", "should", "since", "so", "some", "still", "such", "take", "than", "that", "the", "their", "them", "then", "there", "these", "they", "this", "those", "through", "to", "too", "under", "up", "use", "very", "want", "was", "way", "we", "well", "were", "what", "when", "where", "which", "while", "who", "will", "with", "would", "you", "your", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z" };

3) Before submitting the search string, break it into words and remove any word that appears in the noise word list, like this:

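// Requires: using System; using System.Collections.Generic;
List<string> goodWords = new List<string>();
// Split on spaces, skipping the empty entries that repeated spaces would produce.
string[] words = searchString.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string word in words)
{
   // The noise list is all lowercase, so compare in lowercase.
   if (!_noiseWords.Contains(word.ToLowerInvariant()))
      goodWords.Add(word);
}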

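Not an ideal solution, but it should work as long as the noise word file doesn't change. Multi-language support would use a dictionary of lists keyed by language.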
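The "dictionary of lists by language" idea might look something like the following. This is a Python sketch rather than C#, purely for illustration. The NOISE_WORDS table holds only a handful of sample English entries taken from the list above, and strip_noise_words is a hypothetical name. Keys are language IDs (1033 is US English). Sets are used instead of lists for faster lookup, and words are lowercased before comparison so that 'The' is dropped as well.

```python
# Sample data only: a few English entries copied from the list above.
# A real table would hold the full contents of each language's noise file.
NOISE_WORDS = {
    1033: {"a", "an", "and", "of", "the", "to"},  # 1033 = US English
}


def strip_noise_words(search_string: str, language_id: int = 1033) -> list:
    """Return the words of search_string that are not noise words."""
    # Unknown languages fall back to an empty set, i.e. nothing is removed.
    noise = NOISE_WORDS.get(language_id, set())
    # Compare in lowercase so "The" is treated like "the".
    return [w for w in search_string.split() if w.lower() not in noise]


print(strip_noise_words("The castle of Edinburgh"))  # ['castle', 'Edinburgh']
```

The surviving words can then be turned into prefix terms and joined with AND or OR before being passed to CONTAINS.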

Herb Caudill 15 years ago

Here's a working function. The file noiseENU.txt is copied as-is from \Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData .

    Public Function StripNoiseWords(ByVal s As String) As String
        Dim NoiseWords As String = ReadFile("/Standard/Core/Config/noiseENU.txt").Trim
        Dim NoiseWordsRegex As String = Regex.Replace(NoiseWords, "\s+", "|") ' about|after|all|also etc.
        NoiseWordsRegex = String.Format("\s?\b(?:{0})\b\s?", NoiseWordsRegex)
        Dim Result As String = Regex.Replace(s, NoiseWordsRegex, " ", RegexOptions.IgnoreCase) ' replace each noise word with a space
        Result = Regex.Replace(Result, "\s+", " ") ' eliminate any multiple spaces
        Return Result
    End Function

Manuel Alves 14 years ago

You can also remove the noise words before making the query. List of language IDs: http://msdn.microsoft.com/en-us/library/ms190303.aspx

Dim queryTextWithoutNoise As String = removeNoiseWords(queryText, connectionString, 1033)

Public Function removeNoiseWords(ByVal inputText As String, ByVal cnStr As String, ByVal languageID As Integer) As String

    Dim r As New System.Text.StringBuilder
    Try
        If inputText.Contains(CChar("""")) Then
            r.Append(inputText)
        Else
            Using cn As New SqlConnection(cnStr)

                Const q As String = "SELECT display_term,special_term FROM sys.dm_fts_parser(@q,@l,0,0)"
                cn.Open()
                Dim cmd As New SqlCommand(q, cn)
                With cmd.Parameters
                    .Add(New SqlParameter("@q", """" & inputText & """"))
                    .Add(New SqlParameter("@l", languageID))
                End With
                Dim dr As SqlDataReader = cmd.ExecuteReader
                While dr.Read
                    If Not (dr.Item("special_term").ToString.Contains("Noise")) Then
                        r.Append(dr.Item("display_term").ToString)
                        r.Append(" ")
                    End If
                End While
            End Using
        End If
    Catch ex As Exception
        ' ...        
    End Try
    Return r.ToString

End Function
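The filtering step inside that function can be looked at separately from the database call. The sketch below is Python, for illustration only, and works on rows that have already been fetched as (display_term, special_term) pairs, the two columns selected above. The name keep_non_noise_terms is hypothetical and the sample rows are made up to show the shape of the data. Note that sys.dm_fts_parser was introduced in SQL Server 2008, so this route is not available on SQL Server 2005 itself.

```python
def keep_non_noise_terms(rows):
    """Join the display_term of every row not flagged as a noise word."""
    kept = [
        display_term
        for display_term, special_term in rows
        # Same test as the VB code: skip rows whose special_term mentions "Noise".
        if "Noise" not in special_term
    ]
    return " ".join(kept)


# Made-up rows shaped like the query's result set.
sample_rows = [
    ("the", "Noise Word"),
    ("castle", "Exact Match"),
]
print(keep_non_noise_terms(sample_rows))  # castle
```

Joining the kept terms at the end avoids the trailing space that appending a term plus " " inside the loop leaves behind.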

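jamiecon 16 years ago

Similar to my approach.

Although I wanted to use full-text indexing for things like stemming, speed and multi-word searches, I'm actually only indexing a couple of nvarchar(100) fields across two tables. Each table will easily stay below 50,000 rows.

My solution was to remove all the noise words from the text file and let the indexer build an index that includes every word. It still only has a few thousand entries.

I then do the replacement on spaces in the search string, as described in my original post, to get CONTAINS to handle multiple words and to stem the words individually.

It seems to work well, but I'll keep a close eye on performance.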