
Range pattern matching with a Spark udf in Scala

  •  0
  • LucieCBurgess  · 7 years ago

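    I have a Spark DataFrame containing strings that I am matching to numeric scores using a Likert scale. Different question IDs map to different scores. I am trying to pattern match on a range in Scala inside an Apache Spark udf, using the following question as a guide: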

    How can I pattern match on a range in Scala?

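    But I get a compile error when I use a range instead of a simple OR statement, i.e.

    31 | 32 | 33 | 34 works fine

    31 to 35 does not compile. Any idea where I am going wrong with the syntax?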

    Also, in the final case, I would like to map to a String rather than an Int, case _ => "None", but this gives an error: java.lang.UnsupportedOperationException: Schema for type Any is not supported

    Presumably this is a problem specific to Spark, since it is perfectly possible to return Any in native Scala?

    Here is my code:

    def calculateScore = udf((questionId: Int, answerText: String) => (questionId, answerText) match {
    
          case ((31 | 32 | 33 | 34 | 35), "Rarely /<br>Never") => 4 //this is fine
          case ((31 | 32 | 33 | 34 | 35), "Occasionally") => 3
          case ((31 | 32 | 33 | 34 | 35), "Often") => 2
          case ((31 | 32 | 33 | 34 | 35), "Almost always /<br>Always") => 1
          case ((x if 41 until 55 contains x), "None of the time") => 1 //this line won't compile
          case _ => 0 //would like to map to "None"
        })
    

    The udf is then used on a Spark DataFrame, as follows:

    val df3 = df.withColumn("NumericScore", calculateScore(df("QuestionId"), df("AnswerText")))
    
    2 answers  |  active 7 years ago
        1
  •  2
  •   Alper t. Turker    7 years ago

    Guard expressions should be placed after the pattern:

    def calculateScore = udf((questionId: Int, answerText: String) => (questionId, answerText) match {
      case ((31 | 32 | 33 | 34 | 35), "Rarely /<br>Never") => 4 
      case ((31 | 32 | 33 | 34 | 35), "Occasionally") => 3
      case ((31 | 32 | 33 | 34 | 35), "Often") => 2
      case ((31 | 32 | 33 | 34 | 35), "Almost always /<br>Always") => 1
      case (x, "None of the time") if 41 until 55 contains x => 1
      case _ => 0 //would like to map to "None"
    })
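
    The guard placement can also be checked without Spark. Below is a minimal sketch in plain Scala, assuming a hypothetical function named score that mirrors part of the udf body (the object and function names are not from the original post). A range such as 41 until 55 is an ordinary expression, not a pattern, so it can only appear in the guard that follows the pattern:

```scala
object GuardSketch {
  // Hypothetical stand-in for the udf body; `score` is an illustrative name.
  def score(questionId: Int, answerText: String): Int =
    (questionId, answerText) match {
      // Alternatives of literals are valid patterns, so this compiles as-is.
      case (31 | 32 | 33 | 34 | 35, "Occasionally") => 3
      // Bind the id to x inside the pattern, then test the range in the guard.
      // `until` excludes the upper bound, so 55 itself does not match.
      case (x, "None of the time") if (41 until 55).contains(x) => 1
      // Everything else falls through to the default score.
      case _ => 0
    }
}
```

    Note that 41 until 55 covers 41 through 54; 41 to 55 would be the choice if 55 should be included.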
    
        2
  •  2
  •   Ramesh Maharjan    7 years ago

    If you want to map the last case, case _, to the String "None", then all of the cases should return a String.

    The following udf function should work for you:

    def calculateScore  = udf((questionId: Int, answerText: String) => (questionId, answerText) match {
      case ((31 | 32 | 33 | 34 | 35), "Rarely /<br>Never") => "4" //this is fine
      case ((31 | 32 | 33 | 34 | 35), "Occasionally") => "3"
      case ((31 | 32 | 33 | 34 | 35), "Often") => "2"
      case ((31 | 32 | 33 | 34 | 35), "Almost always /<br>Always") => "1"
      case (x, "None of the time") if (x >= 41 && x < 55) => "1" //guard placed after the pattern, so this compiles
      case _ => "None"
    })
    

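    If you want to map the last case, case _, to None, then you need to change the other return types to Option, since None is a subtype of Option.

    The following code should also work for you: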

    def calculateScore  = udf((questionId: Int, answerText: String) => (questionId, answerText) match {
      case ((31 | 32 | 33 | 34 | 35), "Rarely /<br>Never") => Some(4) //this is fine
      case ((31 | 32 | 33 | 34 | 35), "Occasionally") => Some(3)
      case ((31 | 32 | 33 | 34 | 35), "Often") => Some(2)
      case ((31 | 32 | 33 | 34 | 35), "Almost always /<br>Always") => Some(1)
      case (x, "None of the time") if (x >= 41 && x < 55) => Some(1) //guard placed after the pattern, so this compiles
      case _ => None
    })
    

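    As a final point, the error message you received, java.lang.UnsupportedOperationException: Schema for type Any is not supported, clearly states that udf functions with a return type of Any are not supported. The return types of all the match cases should be consistent.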
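
    To illustrate where the Any comes from, here is a small plain-Scala sketch with no Spark involved. The object and function names (ReturnTypeSketch, mixed, consistent) are hypothetical and only serve the illustration. When one branch yields an Int and another a String, the match as a whole is typed as Any in Scala 2; when every branch yields an Option[Int], the result type stays Option[Int]:

```scala
object ReturnTypeSketch {
  // One branch returns an Int, the other a String. They share no common type
  // more specific than Any (in Scala 2), so the result is typed as Any here.
  def mixed(questionId: Int): Any = questionId match {
    case 31 | 32 | 33 | 34 | 35 => 4
    case _ => "None"
  }

  // Every branch is an Option[Int] (Some(...) or None), so the result type
  // is Option[Int] and stays consistent across all cases.
  def consistent(questionId: Int): Option[Int] = questionId match {
    case 31 | 32 | 33 | 34 | 35 => Some(4)
    case _ => None
  }
}
```

    The second shape is the one the Option-based udf above relies on: all cases agree on a single, concrete return type.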
