
How do I apply multiple indexers and encoders without creating countless intermediate DataFrames?

  • Tiffany  ·  7 years ago

    Here is my code:

    val workindexer = new StringIndexer().setInputCol("workclass").setOutputCol("workclassIndex")
    val workencoder = new OneHotEncoder().setInputCol("workclassIndex").setOutputCol("workclassVec")
    
    val educationindexer = new StringIndexer().setInputCol("education").setOutputCol("educationIndex")
    val educationencoder = new OneHotEncoder().setInputCol("educationIndex").setOutputCol("educationVec")
    
    val maritalindexer = new StringIndexer().setInputCol("marital_status").setOutputCol("maritalIndex")
    val maritalencoder = new OneHotEncoder().setInputCol("maritalIndex").setOutputCol("maritalVec")
    
    val occupationindexer = new StringIndexer().setInputCol("occupation").setOutputCol("occupationIndex")
    val occupationencoder = new OneHotEncoder().setInputCol("occupationIndex").setOutputCol("occupationVec")
    
    val relationindexer = new StringIndexer().setInputCol("relationship").setOutputCol("relationshipIndex")
    val relationencoder = new OneHotEncoder().setInputCol("relationshipIndex").setOutputCol("relationshipVec")
    
    val raceindexer = new StringIndexer().setInputCol("race").setOutputCol("raceIndex")
    val raceencoder = new OneHotEncoder().setInputCol("raceIndex").setOutputCol("raceVec")
    
    val sexindexer = new StringIndexer().setInputCol("sex").setOutputCol("sexIndex")
    val sexencoder = new OneHotEncoder().setInputCol("sexIndex").setOutputCol("sexVec")
    
    val nativeindexer = new StringIndexer().setInputCol("native_country").setOutputCol("native_countryIndex")
    val nativeencoder = new OneHotEncoder().setInputCol("native_countryIndex").setOutputCol("native_countryVec")
    
    val labelindexer = new StringIndexer().setInputCol("label").setOutputCol("labelIndex")
    

    2 answers  |  active 7 years ago

  • Alper t. Turker  ·  7 years ago

    I would use RFormula:

    import org.apache.spark.ml.feature.RFormula
    
    // Column names must match the DataFrame exactly; the question uses "native_country".
    val features = Seq("workclass", "education", 
       "marital_status", "occupation", "relationship", 
       "race", "sex", "native_country")
    
    val formula = new RFormula().setFormula(s"label ~ ${features.mkString(" + ")}")
    

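    RFormula indexes and one-hot encodes the string columns, then assembles the results into a single Vector column.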
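To make the RFormula approach concrete, here is a minimal sketch of fitting the formula and transforming a DataFrame in one step. The local SparkSession, the three sample rows, the reduced feature list, and the output column names `features` and `labelIndex` are all assumptions for illustration; the asker's real DataFrame would replace `df`.

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.SparkSession

// Assumption: a local session and a tiny made-up sample, not the asker's data.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rformula-sketch")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  ("Private", "Bachelors", ">50K"),
  ("State-gov", "HS-grad", "<=50K"),
  ("Private", "HS-grad", "<=50K")
).toDF("workclass", "education", "label")

// Only two of the question's columns are used to keep the sample small.
val features = Seq("workclass", "education")

val formula = new RFormula()
  .setFormula(s"label ~ ${features.mkString(" + ")}")
  .setFeaturesCol("features")   // assembled feature vector (assumed name)
  .setLabelCol("labelIndex")    // indexed label, written to a new column (assumed name)

// fit() learns the string-to-index mappings; transform() appends the output columns.
val encoded = formula.fit(df).transform(df)
encoded.select("features", "labelIndex").show(false)
```

Writing the indexed label to a separate column such as `labelIndex` keeps the original string `label` column untouched, which mirrors the naming in the question.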

  • Jacek Laskowski  ·  7 years ago

    Use the feature called ML Pipelines:

    With ML Pipelines you can "chain" (or "pipe") the encoders and indexers without creating countless intermediate DataFrames.

    import org.apache.spark.ml._
    val pipeline = new Pipeline().setStages(Array(workindexer, workencoder...))
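Rather than declaring every indexer and encoder by hand, the stage array can also be generated from the list of column names. The following is a sketch under assumptions, not the answerer's code: the output names are derived as `<column>Index` and `<column>Vec`, so a few differ slightly from the question (for example `marital_statusIndex` instead of `maritalIndex`), and `categoricalCols`, `featureStages`, and `labelIndexer` are illustrative names.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Categorical columns taken from the question.
val categoricalCols = Seq("workclass", "education", "marital_status", "occupation",
  "relationship", "race", "sex", "native_country")

// One StringIndexer followed by one OneHotEncoder per column.
val featureStages: Seq[PipelineStage] = categoricalCols.flatMap { c =>
  Seq[PipelineStage](
    new StringIndexer().setInputCol(c).setOutputCol(s"${c}Index"),
    new OneHotEncoder().setInputCol(s"${c}Index").setOutputCol(s"${c}Vec")
  )
}

// The label is only indexed, not one-hot encoded, as in the question.
val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("labelIndex")

val stages: Array[PipelineStage] = (featureStages :+ labelIndexer).toArray
val pipeline = new Pipeline().setStages(stages)
```

Given the asker's DataFrame (called `df` here as a placeholder), `pipeline.fit(df).transform(df)` would run every stage in order and return a single DataFrame containing all the index and vector columns.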