Spark Java: how to compare schemas when the columns are not in the same order?

  •  -3
  • shakedzy  · 6 years ago

    Following up on this question, I am now running the following code:

    List<StructField> fields = new ArrayList<>();
    fields.add(DataTypes.createStructField("A",DataTypes.LongType,true));
    fields.add(DataTypes.createStructField("B",DataTypes.DoubleType,true));
    StructType schema1 = DataTypes.createStructType(fields);
    Dataset<Row> df1 = spark.sql("select 1 as A, 2.2 as B");
    Dataset<Row> finalDf1 = spark.createDataFrame(df1.javaRDD(), schema1);
    
    fields = new ArrayList<>();
    fields.add(DataTypes.createStructField("B",DataTypes.DoubleType,true));
    fields.add(DataTypes.createStructField("A",DataTypes.LongType,true));
    StructType schema2 = DataTypes.createStructType(fields);
    Dataset<Row> df2 = spark.sql("select 2.2 as B, 1 as A");
    Dataset<Row> finalDf2 = spark.createDataFrame(df2.javaRDD(), schema2);
    
    finalDf1.printSchema();
    finalDf2.printSchema();
    System.out.println(finalDf1.schema());
    System.out.println(finalDf2.schema());
    System.out.println(finalDf1.schema().equals(finalDf2.schema()));
    

    The output is:

    root
     |-- A: long (nullable = true)
     |-- B: double (nullable = true)
    
    root
     |-- B: double (nullable = true)
     |-- A: long (nullable = true)
    
    StructType(StructField(A,LongType,true), StructField(B,DoubleType,true))
    StructType(StructField(B,DoubleType,true), StructField(A,LongType,true))
    false
    

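    Although the columns are in a different order, the two datasets have exactly the same columns and column types. What comparison is needed here to get true?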

    3 answers  |  latest 6 years ago
        1
  •  0
  •   Akrem    6 years ago

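    If the columns are in a different order, the schemas are not the same, even if they have the same number of columns and the same names. If you want to check whether two schemas have the same column names, get the field names of both schemas as lists and write code to compare them. See the Java example below.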

    public static void main(String[] args)
    {
    
        List<String> firstSchema =Arrays.asList(DataTypes.createStructType(ConfigConstants.firstSchemaFields).fieldNames());
        List<String> secondSchema = Arrays.asList(DataTypes.createStructType(ConfigConstants.secondSchemaFields).fieldNames());
    
    
        if(schemasHaveTheSameColumnNames(firstSchema,secondSchema))
        {
            System.out.println("Yes, schemas have the same column names");
        }else
        {
            System.out.println("No, schemas do not have the same column names");
        }
    }
    
    private static boolean schemasHaveTheSameColumnNames(List<String> firstSchema, List<String> secondSchema)
    {
        if(firstSchema.size() != secondSchema.size())
        {
            return false;
        }else 
        {
            for (String column : secondSchema)
            {
                if(!firstSchema.contains(column))
                    return false;
            }
        }
        return true;
    }
    
        2
  •  1
  •   Ged    6 years ago

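    This assumes that the column order may differ, that the same name means the same semantics, and that both schemas must have the same number of columns.

    An example in Scala, which you should be able to adapt to Java: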

    import spark.implicits._
    val df = sc.parallelize(Seq(
            ("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
            ("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)
            )).toDF("c1", "c2", "Val1", "Val2")
    val names = df.columns
    
    val df2 = sc.parallelize(Seq(
           ("A", "X", 2, 1))).toDF("c1", "c2", "Val1", "Val2")
    val names2 = df2.columns
    
    names.sortWith(_ < _) sameElements names2.sortWith(_ < _)
    

    This returns true or false; try it with different inputs.
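
    The Scala one-liner above could be adapted to Java roughly as follows. This is a minimal sketch, not a definitive implementation: the class name ColumnNameCompare and the method sameColumnNames are hypothetical, and the arrays in main stand in for what df.columns() and df2.columns() would return. It sorts copies of both name arrays and compares them element by element, so the comparison is case-sensitive and looks at names only, not types.

```java
import java.util.Arrays;

// Hypothetical helper: a Java take on "sort both name arrays, then compare".
final class ColumnNameCompare {

    // Returns true when both arrays hold the same names, ignoring order.
    // Copies are sorted so the callers' arrays are left untouched.
    static boolean sameColumnNames(String[] first, String[] second) {
        String[] a = first.clone();
        String[] b = second.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        return Arrays.equals(a, b);
    }

    public static void main(String[] args) {
        // Stand-ins for df.columns() and df2.columns() (assumption for the demo).
        String[] names = {"c1", "c2", "Val1", "Val2"};
        String[] names2 = {"Val2", "c2", "c1", "Val1"};
        System.out.println(sameColumnNames(names, names2)); // prints true
    }
}
```

    Because the sorted arrays are compared position by position, a name that appears twice on one side must also appear twice on the other.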

        3
  •  0
  •   shakedzy    6 years ago

    Following the previous answers, it seems the way to compare the StructFields (column names and types), and not just the names, is as follows:

    Set<StructField> set1 = new HashSet<>(Arrays.asList(schema1.fields()));
    Set<StructField> set2 = new HashSet<>(Arrays.asList(schema2.fields()));
    boolean result = set1.equals(set2);