代码之家 › 专栏 › 技术社区 › Bananas

使用univocity解析两个不同的csv文件并写入新的csv文件

univocity parsing csv java

Bananas · 技术社区 · 7 年前

我总是在java程序中使用univocity解析器来比较csv文件。它工作得很好,速度快得多。

但问题是,这次我试图解析两个具有复杂值的不同大容量csv文件,并在新的csv文件中打印差异,

查看作者的一个示例,在将文件1读入列表并转换为映射后,我尝试使用processFile,但在解析时仍然出现错误。

下面是我的示例输入和预期输出文件。

输入-文件1

"h1","h2","h3","h4","h5"
"00000","US","9503.00.0089","USA","9503.0089"
"","EU","9503.00.7000","EUROPEAN UNION","9503.00.7000"
"#1200","US","5601.22.0010","USA","5601.22.0010"
"0180691","US","9503.00.0073","USA","9503.00.0073"
âDRTY01â,âCAâ,â9603.01.0088â,âCANâ,â9603.01.0088â

输入-文件2

"h1","h2","h3","h6","h7","h8","h9","h10",h11 
"018890","US","","2015","101","1","1","All",ââ
"00000","US","9503.00.0090","1986","101","1","1","All","9503.00.0090"
"0180691","US","9503.00.0073","2019","101","1","1","All","9503.00.0073â
âDRTY01â,âCAâ,â9603.01.0087â,â2002â,â102â,â1â,â2â,âCAâ, â9603.01.0087â

在file1和file2中选择h1,h2公共值,然后比较file1的h3和file2的h3,如果两个文件h3不相等,那么我想打印-h1-h1,h4,h10,h5,h11,h6,h7,h7。到文件3

输出-文件3

âh1â,âh4â,â h10â,âh5â, âh11â,âh6â,âh7â,âh8â,âh9â
"00000","USAâ,âAllâ,â9503.00.0089â,â9503.00.0090â, "1986","101","1","1"   
"DRTY01â,âCANâ,âCAâ,â9603.01.0088â,â9603.01.0087â,â2002â,â102â,â1â,â2â

1 回复 | 直到 7 年前

utkarsh31 7 年前

我有一个你的问题的解决方案,但请做回归测试。所以我假设的是 h1和h2的组合将是唯一的值 . 我正在创建一个HashMap,其中一个映射作为键,csv文件的整行作为值。我们将重写所创建类的hashcode和equals方法,如:

哈希代码将只使用h1和h2来生成代码(因为它们肯定是唯一的)
我们还将使用h3作为比较条件,当两个h3相同时,该条件将返回false。

equals中的逻辑是-如果h1和h2在map1和map2中相同,而h3不同,请给出map1和map2中的行。该逻辑在映射中使用额外的空间,但总体计算逻辑减少到 O(N) . 下面的代码将为您提供所需的地图行。我没有正确执行IO和异常处理,请妥善处理。

public class UnivocityTest
{

    public static void main(String[] args) throws FileNotFoundException
    {
        // Get data from csv file1
        List<String[]> f1 = getData("example.csv");
        // Get data from csv file2


       List<String[]> f2 = getData("example1.csv");

        // Convert data to a Map with HeaderList class and entire row.
        Map<HeaderList, String[]> map1 = convertAndReturn(f1);
        Map<HeaderList, String[]> map2 = convertAndReturn(f2);

        //Currently prints the required rows.
        compareData(map1, map2);
    }

    // Convert csv to List<String[]>
    private static List<String[]> getData(String file) throws FileNotFoundException
    {
        CsvParserSettings parserSettings = new CsvParserSettings();
        parserSettings.setLineSeparatorDetectionEnabled(true);
        RowListProcessor rowProcessor = new RowListProcessor();
        parserSettings.setProcessor(rowProcessor);
        parserSettings.setHeaderExtractionEnabled(true);

        CsvParser parser = new CsvParser(parserSettings);
        parser.parse(getReader(file));
        // String[] headers = rowProcessor.getHeaders();
        List<String[]> rows = rowProcessor.getRows();

        return rows;
    }

    // get reader object
    private static Reader getReader(String string) throws FileNotFoundException
    {
        // TODO Add proper file handling and exception handling
        return new FileReader(new File(string));
    }

    // Return HashMap
    private static Map<HeaderList, String[]> convertAndReturn(List<String[]> f1)
    {
        Map<HeaderList, String[]> map = new java.util.HashMap<>();

        for (String[] each : f1)
        {
            // For each row in csv create a corresponding HeaderList object with h1,h2 and h3 as key
            // and row as value.
            HeaderList header = new HeaderList(each[0], each[1], each[2]);
            map.put(header, each);
        }

        return map;
    }

    private static void compareData(Map<HeaderList, String[]> map1, Map<HeaderList, String[]> map2)
    {
        // Iterates over the map1 keys one by one. For each key we check if there is a matching key
        // in map2. The matching condition will be h1 and h2 should be same while h3 should be
        // different. Once a key like that is found currently I'm printing both the rows, here you
        // can get the rows you want from the map and return them.

        for (HeaderList each : map1.keySet())
        {
            if (map2.containsKey(each))
            {
//TODO Assume you want columns h3,h4 from file1 and h6  h7 from file2.
                //We know map1 represents file1 with columns h3 and h4 at positions 2 and 3 inside the String[]
                //We know map2 represents file1 with columns h6 and h7 at positions 3 and 4 inside the String[]
                String h3FromFile1 = map1.get(each)[2];
                String h4FromFile1 = map1.get(each)[3];
                String h6FromFile2 = map2.get(each)[3];
                String h7FromFile2 = map2.get(each)[4];
                System.out.println("Required Columns: ");
                System.out.println("h3 file1: "+ h3FromFile1);
                System.out.println("h4 file1: "+ h4FromFile1);
                System.out.println("h6 file2: "+ h6FromFile2);
                System.out.println("h7 file2: " + h7FromFile2);
                System.out.println(Arrays.toString(map1.get(each)));
                System.out.println(Arrays.toString(map2.get(each)));
                System.out.println("-------------------------------");
            }
        }
    }

}

包含三列h1、h2、h3的bean类:

class HeaderList
        {

            private String h1;

            private String h2;

            private String h3;

            public HeaderList(String h1, String h2, String h3)
            {
                super();
                this.h1 = h1;
                this.h2 = h2;
                this.h3 = h3;
            }

            /**
             * The hash code method which generate same hashkey for h1 and h2.
             * 
             * @inheritDoc
             */
            @Override
            public int hashCode()
            {
                final int prime = 31;
                int result = 1;
                result = prime * result + ((h1 == null) ? 0 : h1.hashCode());
                result = prime * result + ((h2 == null) ? 0 : h2.hashCode());
                return result;
            }

            /**
             * The equals method assumes each csv file row will be uniquely identified my h1 and h2
             * combined. Please see if h1 and h2 cannot be uniquely identified then it may lead to data
             * loss. For h3 we return true only for same values.
             * 
             * @inheritDoc
             */
            @Override
            public boolean equals(Object obj)
            {
                if (this == obj)
                    return true;
                if (obj == null)
                    return false;
                if (getClass() != obj.getClass())
                    return false;
                HeaderList other = (HeaderList) obj;
                if (h1 == null)
                {
                    if (other.h1 != null)
                        return false;
                }
                else if (!h1.equals(other.h1))
                    return false;
                if (h2 == null)
                {
                    if (other.h2 != null)
                        return false;
                }
                else if (!h2.equals(other.h2))
                    return false;
                if (h3 == null)
                {
                    if (other.h3 == null)
                        return false;
                }
                else if (h3.equals(other.h3))
                    return false;
                return true;
            }

            /**
             * @inheritDoc
             */
            @Override
            public String toString()
            {
                return "HeaderList [h1=" + h1 + ", h2=" + h2 + ", h3=" + h3 + "]";
            }

        }

给定输入csv文件的输出:

Required Columns: 
h3 file1: 9603.01.0088
h4 file1: CAN
h6 file2: 2002
h7 file2: 102
[DRTY01, CA, 9603.01.0088, CAN, 9603.01.0088]
[DRTY01, CA, 9603.01.0087, 2002, 102, 1, 2, CA, 9603.01.0087]
-------------------------------
Required Columns: 
h3 file1: 9503.00.0089
h4 file1: USA
h6 file2: 1986
h7 file2: 101
[00000, US, 9503.00.0089, USA, 9503.0089]
[00000, US, 9503.00.0090, 1986, 101, 1, 1, All, 9503.00.0090]
-------------------------------