代码之家  ›  专栏  ›  技术社区  ›  Mat Nadrofsky

移动设备上的全文搜索?

  •  4
  • Mat Nadrofsky  · 技术社区  · 17 年前

    我们很快将着手开发一种新的移动应用程序。这个特殊的应用程序将用于大量搜索基于文本的字段。对于哪种数据库引擎最适合在移动平台上进行这类搜索,该小组有什么建议吗?

    具体内容包括Windows Mobile 6,我们将使用.Net CF。此外,一些基于文本的字段将在35到500个字符之间。该设备将以两种不同的方式运行:批处理和WiFi。当然,对于WiFi,我们可以将请求提交到一个成熟的DB引擎,然后将结果取回。这个问题围绕“批处理”版本展开,该版本将包含一个数据库,其中包含设备闪存/可移动存储卡上的信息。

    无论如何,我知道SQLCE有一些基本的索引,但是在获得完整的版本之前,你不会进入真正的“全文”风格的索引,当然,这在移动平台上是不可用的。

    数据的外观示例如下:

    “围裙木匠可调节皮革容器口袋腰部五金带” 等等等等。

    我还没有对任何其他具体选项进行评估,因为我想我会利用这个团队的经验,首先为我指出一些具体的途径。

    2 回复  |  直到 17 年前
        1
  •  5
  •   Jason Down    17 年前

    我创建了一个类来保存每个对象的id和文本(在我的例子中,我称它为sku(项目编号)和描述)。这将创建一个较小的对象,该对象使用的内存较少,因为它仅用于搜索。在找到匹配项后,我仍然会从数据库中获取完整的对象。

    public class SmallItem
    {
        private int _sku;
        public int Sku
        {
            get { return _sku; }
            set { _sku = value; }
        }
    
        // Size of max description size + 1 for null terminator.
        private char[] _description = new char[36];
        public char[] Description
        {
            get { return _description; }
            set { _description = value; }
        }
    
        public SmallItem()
        {
        }
    }
    

    以下是自定义包含函数的示例:

    public static int[] Contains(string[] descriptionTerms, int maxResults, List<SmallItem> itemList)
    {
        // Don't allow more than the maximum allowable results constant.            
        int[] matchingSkus = new int[maxResults];
    
        // Indexes and counters.
        int matchNumber = 0;
        int currentWord = 0;
        int totalWords = descriptionTerms.Count() - 1;  // - 1 because it will be used with 0 based array indexes
    
        bool matchedWord;
    
        try
        {   
            /* Character array of character arrays. Each array is a word we want to match.
             * We need the + 1 because totalWords had - 1 (We are setting a size/length here,
             * so it is not 0 based... we used - 1 on totalWords because it is used for 0
             * based index referencing.)
             * */
            char[][] allWordsToMatch = new char[totalWords + 1][];
    
            // Character array to hold the current word to match. 
            char[] wordToMatch = new char[36]; // Max allowable word size + null terminator... I just picked 36 to be consistent with max description size.
    
            // Loop through the original string array or words to match and create the character arrays. 
            for (currentWord = 0; currentWord <= totalWords; currentWord++)
            {
                char[] desc = new char[descriptionTerms[currentWord].Length + 1];
                Array.Copy(descriptionTerms[currentWord].ToUpper().ToCharArray(), desc, descriptionTerms[currentWord].Length);
                allWordsToMatch[currentWord] = desc;
            }
    
            // Offsets for description and filter(word to match) pointers.
            int descriptionOffset = 0, filterOffset = 0;
    
            // Loop through the list of items trying to find matching words.
            foreach (SmallItem i in itemList)
            {
                // If we have reached our maximum allowable matches, we should stop searching and just return the results.
                if (matchNumber == maxResults)
                    break;
    
                // Loop through the "words to match" filter list.
                for (currentWord = 0; currentWord <= totalWords; currentWord++)
                {
                    // Reset our match flag and current word to match.
                    matchedWord = false;
                    wordToMatch = allWordsToMatch[currentWord];
    
                    // Delving into unmanaged code for SCREAMING performance ;)
                    unsafe
                    {
                        // Pointer to the description of the current item on the list (starting at first char).
                        fixed (char* pdesc = &i.Description[0])
                        {
                            // Pointer to the current word we are trying to match (starting at first char).
                            fixed (char* pfilter = &wordToMatch[0])
                            {
                                // Reset the description offset.
                                descriptionOffset = 0;
    
                                // Continue our search on the current word until we hit a null terminator for the char array.
                                while (*(pdesc + descriptionOffset) != '\0')
                                {
                                    // We've matched the first character of the word we're trying to match.
                                    if (*(pdesc + descriptionOffset) == *pfilter)
                                    {
                                        // Reset the filter offset.
                                                filterOffset = 0;
    
                                        /* Keep moving the offsets together while we have consecutive character matches. Once we hit a non-match
                                         * or a null terminator, we need to jump out of this loop.
                                         * */
                                        while (*(pfilter + filterOffset) != '\0' && *(pfilter + filterOffset) == *(pdesc + descriptionOffset))
                                        {
                                            // Increase the offsets together to the next character.
                                            ++filterOffset;
                                            ++descriptionOffset;
                                        }
    
                                        // We hit matches all the way to the null terminator. The entire word was a match.
                                        if (*(pfilter + filterOffset) == '\0')
                                        {
                                            // If our current word matched is the last word on the match list, we have matched all words.
                                            if (currentWord == totalWords)
                                            {
                                                // Add the sku as a match.
                                                matchingSkus[matchNumber] = i.Sku.ToString();
                                                matchNumber++;
    
                                                /* Break out of this item description. We have matched all needed words and can move to
                                                 * the next item.
                                                 * */
                                                break;
                                            }
    
                                            /* We've matched a word, but still have more words left in our list of words to match.
                                             * Set our match flag to true, which will mean we continue continue to search for the
                                             * next word on the list.
                                             * */
                                             matchedWord = true;
                                        }
                                    }
    
                                    // No match on the current character. Move to next one.
                                    descriptionOffset++;
                                }
    
                                /* The current word had no match, so no sense in looking for the rest of the words. Break to the
                                 * next item description.
                                 * */
                                 if (!matchedWord)
                                    break;
                            }
                        }
                    }
                }
            };
    
            // We have our list of matching skus. We'll resize the array and pass it back.
            Array.Resize(ref matchingSkus, matchNumber);
            return matchingSkus;
        }
        catch (Exception ex)
        {
            // Handle the exception
        }
    }
    

    一旦获得匹配SKU的列表,就可以遍历数组并构建一个仅返回匹配SKU的查询命令。

    • 搜索约171000个项目
    • 创建所有匹配项的列表
    • 查询数据库,仅返回匹配项
    • 构建完整的项目(类似于SmallItem类,但有更多字段)
    • 用完整的blow项对象填充datagrid。

    我也尝试过在没有非托管代码的情况下使用String.IndexOf(并尝试了String.Contains…与IndexOf的性能相同)。那条路慢多了。。。大约25秒。

    我还尝试使用StreamReader和包含[Sku编号]|[Description]行的文件。该代码类似于非托管代码示例。整个扫描过程用了大约15秒。速度不算太差,但也不算太好。与我向您展示的方式相比,file and StreamReader方法有一个优势。文件可以提前创建。我向您展示的方式需要内存和应用程序启动时加载列表的初始时间。对于我们的171000件商品,这大约需要2分钟。如果你可以在每次应用启动时等待初始加载(当然可以在单独的线程上完成),那么用这种方法搜索是最快的方法(至少我已经找到了)。

    希望有帮助。

    PS-感谢Dolch对一些非托管代码的帮助。

        2
  •  2
  •   Wayne Bloss    17 年前

    你可以试试Lucene.Net。我不确定它是否适合移动设备,但它被称为“高性能、全功能的文本搜索引擎库”。

    http://incubator.apache.org/lucene.net/ http://lucene.apache.org/java/docs/