
PySpark & MLlib: Random Forest feature importances w/ feature names [duplicate]

  •  0
  • user2205916  · Tech community  · 6 years ago

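    I am trying to plot the feature importances of some tree-based models with their column names. I am using PySpark.

    Since I have text categorical variables as well as numeric ones, I had to use a pipeline approach, which goes something like this:

    1. Use StringIndexer to index the string columns
    2. Use a one-hot encoder on all the indexed columns
    3. Use VectorAssembler to create the feature column containing the feature vector

      Some sample code from the docs for steps 1, 2 and 3: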

      from pyspark.ml import Pipeline
      # OneHotEncoderEstimator is the Spark 2.3/2.4 name; Spark 3.0 renamed it to OneHotEncoder.
      from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

      categoricalColumns = ["workclass", "education", "marital_status",
                            "occupation", "relationship", "race", "sex", "native_country"]
      stages = []  # stages in our Pipeline
      for categoricalCol in categoricalColumns:
          # Category indexing with StringIndexer
          stringIndexer = StringIndexer(inputCol=categoricalCol,
                                        outputCol=categoricalCol + "Index")
          # Use OneHotEncoderEstimator to convert categorical variables into binary SparseVectors
          # encoder = OneHotEncoderEstimator(inputCol=categoricalCol + "Index",
          #                                  outputCol=categoricalCol + "classVec")
          encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                           outputCols=[categoricalCol + "classVec"])
          # Add stages. These are not run here, but will run all at once later on.
          stages += [stringIndexer, encoder]

      numericCols = ["age", "fnlwgt", "education_num", "capital_gain",
                     "capital_loss", "hours_per_week"]
      assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
      assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
      stages += [assembler]

      # Create a Pipeline.
      pipeline = Pipeline(stages=stages)
      # Run the feature transformations.
      #  - fit() computes feature statistics as needed.
      #  - transform() actually transforms the features.
      pipelineModel = pipeline.fit(dataset)
      dataset = pipelineModel.transform(dataset)
      
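    4. Finally, train the model

      After training and evaluation, I can use "model.featureImportances" to get the feature rankings. However, I do not get the feature/column names, only the feature numbers, something like this: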

      print(dtModel_1.featureImportances)
      
      (38895,[38708,38714,38719,38720,38737,38870,38894],[0.0742343395738,0.169404823667,0.100485791055,0.0105823115814,0.0134236162982,0.194124862158,0.437744255667])
      

    How do I map these back to the initial column names and values, so that I can plot them?

        1
  •  8
  •   user9964676    7 years ago

    Extract the metadata, as shown here by user6910411:

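    from itertools import chain

    # Flatten the per-type attribute lists into (index, name) pairs, ordered by vector index
    attrs = sorted(
        (attr["idx"], attr["name"]) for attr in (chain(*dataset
            .schema["features"]
            .metadata["ml_attr"]["attrs"].values())))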
    

    and combine it with the feature importances:

    [(name, dtModel_1.featureImportances[idx])
     for idx, name in attrs
     if dtModel_1.featureImportances[idx]]
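
    To make the shape of that metadata concrete, here is a minimal sketch that runs without Spark. The `metadata` dict and the `importances` list are hypothetical stand-ins for `dataset.schema["features"].metadata` and `dtModel_1.featureImportances`, and the feature names in it are made up for illustration. It applies the same two steps as above, then sorts by importance so the result is ready for plotting.

```python
from itertools import chain

# Hypothetical stand-in for dataset.schema["features"].metadata:
# attributes are grouped by type, each carrying its vector index and name.
metadata = {
    "ml_attr": {
        "attrs": {
            "binary": [
                {"idx": 0, "name": "sexclassVec_Male"},
                {"idx": 1, "name": "sexclassVec_Female"},
            ],
            "numeric": [
                {"idx": 2, "name": "age"},
                {"idx": 3, "name": "hours_per_week"},
            ],
        },
        "num_attrs": 4,
    }
}

# Hypothetical stand-in for dtModel_1.featureImportances (indexable by position).
importances = [0.0, 0.25, 0.6, 0.15]

# Step 1: flatten all attribute groups into (index, name) pairs, ordered by index.
attrs = sorted(
    (attr["idx"], attr["name"])
    for attr in chain(*metadata["ml_attr"]["attrs"].values())
)

# Step 2: attach names to importances, dropping features with zero importance.
named = [(name, importances[idx]) for idx, name in attrs if importances[idx]]

# Sort descending so the most important features come first.
ranked = sorted(named, key=lambda pair: pair[1], reverse=True)
print(ranked)
```

    The names and scores in `ranked` can then be passed to any bar-chart routine.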
    
        2
  •  2
  •   aamirr    7 years ago

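    The transformed dataset's metadata has the required attributes:

    1. Create a pandas DataFrame (the feature list is generally not very large, so holding it in a pandas DataFrame is not a memory problem)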

      import pandas as pd
      attrs = dataset.schema["features"].metadata["ml_attr"]["attrs"]
      # "binary" holds the one-hot encoded columns, "numeric" the numeric ones
      pandasDF = pd.DataFrame(attrs["binary"] + attrs["numeric"]).sort_values("idx")
      
    2. Then create a broadcast dictionary for the mapping. Broadcasting is necessary in a distributed environment.

      feature_dict = dict(zip(pandasDF["idx"],pandasDF["name"])) 
      
      feature_dict_broad = sc.broadcast(feature_dict)
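
    One way the index-to-name dictionary might then be used is sketched below. This is an assumption about the next step, not part of the answer above: `label_sparse_importances` is a hypothetical helper, and the sample dictionary, indices and values are made up. In PySpark the inputs would come from the `indices` and `values` attributes of the sparse `featureImportances` vector, and the dictionary from `feature_dict` (or `feature_dict_broad.value` on executors).

```python
def label_sparse_importances(feature_dict, indices, values):
    """Pair each non-zero importance with its feature name, highest first.

    feature_dict maps vector index -> feature name; indices and values are
    the parallel arrays of a sparse importance vector.
    """
    labelled = [
        # Fall back to a placeholder name if an index is missing from the mapping.
        (feature_dict.get(int(idx), "unknown_%d" % int(idx)), float(val))
        for idx, val in zip(indices, values)
    ]
    return sorted(labelled, key=lambda pair: pair[1], reverse=True)


# Made-up sample data standing in for feature_dict and the sparse vector.
sample_dict = {0: "age", 1: "fnlwgt", 2: "sexclassVec_Male"}
sample_indices = [0, 2]
sample_values = [0.3, 0.7]

top = label_sparse_importances(sample_dict, sample_indices, sample_values)
print(top)
```

    Because only the non-zero entries are visited, this stays cheap even when the assembled vector has tens of thousands of slots, as in the question's output.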