
Select rows from a CSV based on a list of IDs

full-text-search csv linq c#

Anto · Tech Community · 10 years ago

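My task is to extract several hundred thousand rows from a CSV file, namely every row that contains one of a set of specified IDs. I have about 300,000 IDs stored in a list of strings and need to extract any row of the CSV that contains any of them. At the moment I use a Linq statement to check whether each line contains any ID in the list: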

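using (StreamReader sr = new StreamReader(csvFile))
{
    string inLine;
    // Read the file line by line until the end.
    while ((inLine = sr.ReadLine()) != null)
    {
        // Keep the line if it contains any of the IDs.
        if (searchStrings.Any(inLine.Contains))
        {
            streamWriter.WriteLine(inLine);
        }
    }
}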

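This works, but it is very slow: the searchStrings list holds 300,000 values and the CSV I need to search has several million rows, so in the worst case every row is compared against every ID.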

Does anyone know how to make the search more efficient so it runs faster? Or is there an alternative way to extract the required rows?

Thanks

2 replies | latest 10 years ago

Kharenis · 10 years ago

I ran into a similar problem before, where I had to iterate over hundreds of thousands of .csv lines and parse each one.

I went with a threaded approach that reads and parses concurrently, in batches. Here is roughly how I did it:

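    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Threading;

    class Program
    {
        // Placeholder paths; point these at your own files.
        private const string csvFile = "input.csv";
        private const string outFile = "output.csv";

        // Matching lines from all worker threads (order is not preserved).
        private static ConcurrentBag<string> items = new ConcurrentBag<string>();
        // Fill this with the IDs to search for before starting.
        private static List<string> searchStrings = new List<string>();

        static void Main(string[] args)
        {
            var workers = new List<Thread>();

            using (StreamReader sr = new StreamReader(csvFile))
            {
                const int buffer_size = 10000;
                string[] buffer = new string[buffer_size];

                int count = 0;
                string line = null;
                while ((line = sr.ReadLine()) != null)
                {
                    buffer[count] = line;
                    count++;
                    if (count == buffer_size)
                    {
                        // Copy the reference so the lambda never sees the reassigned buffer.
                        string[] batch = buffer;
                        var worker = new Thread(() => find(batch, buffer_size));
                        worker.Start();
                        workers.Add(worker);

                        buffer = new string[buffer_size];
                        count = 0;
                    }
                }

                // Search the final, partially filled batch on this thread.
                if (count > 0)
                {
                    find(buffer, count);
                }
            }

            // Make sure all the threads have finished before writing the results.
            foreach (var t in workers)
                t.Join();

            using (StreamWriter streamWriter = new StreamWriter(outFile))
            {
                foreach (var str in items)
                    streamWriter.WriteLine(str);
            }
        }

        private static void find(string[] buffer, int count)
        {
            // Swap in your own search algorithm here; matches go into the ConcurrentBag.
            for (int i = 0; i < count; i++)
            {
                string row = buffer[i];
                if (searchStrings.Any(id => row.Contains(id)))
                    items.Add(row);
            }
        }
    }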

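I threw this together quickly from what I remember doing before, so it may not be entirely correct. Doing it this way should speed things up, at least for very large files.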

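The idea is to keep reading from the HDD at all times. String parsing can be very expensive, so handing batches to multiple cores can make it considerably faster.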
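The same "spread the per-line work across cores" idea can also be sketched with PLINQ instead of hand-managed threads. This is only an illustration, not what the answer above used: `ParallelFilter` and `MatchingLines` are hypothetical names, and the sample data in `Main` is made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParallelFilter
{
    // Hypothetical helper: returns the lines that contain any ID as a substring.
    public static List<string> MatchingLines(IEnumerable<string> lines, IReadOnlyList<string> ids)
    {
        return lines
            .AsParallel()   // partition the lines across the available cores
            .AsOrdered()    // keep the original line order in the result
            .Where(line => ids.Any(id => line.Contains(id)))
            .ToList();
    }

    public static void Main()
    {
        // Made-up sample data standing in for the ID list and the CSV rows.
        var ids = new List<string> { "1001", "1003" };
        var lines = new[] { "1001,alice", "1002,bob", "1003,carol" };

        foreach (var line in MatchingLines(lines, ids))
            Console.WriteLine(line);
    }
}
```

For a real file, `File.ReadLines(path)` could be passed as `lines`, since it streams the file lazily rather than loading it all at once. Note that this only parallelises the substring search; it does not reduce the number of comparisons per line.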

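With this approach I was able to parse about 250k lines in 7 seconds. That included splitting each line into roughly 50 items, parsing key/value pairs, and building in-memory objects from them, which was by far the most time-consuming part.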

CaseyR · 10 years ago

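Just throwing this out there: it is not specifically related to any of the tags on the question, but the *nix "grep -f" functionality would work here. Essentially, you would have one file containing the list of strings you want to match (e.g. StringsToFind.txt) and your csv input file (e.g. input.csv), and the following command would write the matching lines to output.csv: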

grep -f StringsToFind.txt input.csv > output.csv

See the grep man page for more details.
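For readers who want to stay in C#, a rough counterpart of this file-in, file-out pipeline might look like the sketch below. It rests on an assumption the question does not confirm: each ID occupies a whole comma-separated field, so a `HashSet<string>` lookup can replace the substring scan. That is exact-field matching, not grep's pattern matching. `IdFilter` and `FilterByField` are hypothetical names, and the file names are borrowed from the command above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class IdFilter
{
    // Hypothetical helper: keeps a line when any comma-separated field equals an ID.
    public static IEnumerable<string> FilterByField(IEnumerable<string> lines, HashSet<string> ids)
    {
        foreach (var line in lines)
        {
            // Naive split: assumes no quoted fields that contain commas.
            if (line.Split(',').Any(field => ids.Contains(field.Trim())))
                yield return line;
        }
    }

    public static void Main(string[] args)
    {
        // One ID per line; blank lines are skipped so empty fields never match.
        var ids = new HashSet<string>(
            File.ReadLines("StringsToFind.txt").Where(s => s.Trim().Length > 0).Select(s => s.Trim()),
            StringComparer.Ordinal);

        // Both files are streamed, so the CSV is never held in memory in full.
        File.WriteAllLines("output.csv", FilterByField(File.ReadLines("input.csv"), ids));
    }
}
```

Under that assumption each line costs one hash lookup per field instead of up to 300,000 substring checks. If the IDs can appear inside larger fields, this sketch would miss them and a substring search is still needed.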