
How can I avoid the "Found array with 0 sample(s)" error (shape=(0, n)) when predicting a single row from a large dataset?

  • Mohamed Abdillah  · Tech Community  · 7 years ago

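    I am currently building a prototype system that includes a prediction task. When I predict on several rows, as in the table below, my code works fine and the results are perfect. [Image: Several rows from the dataset]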

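    However, when I predict on a single row from my dataset, as in the table below, [Image: One row data]

    I get this error: ValueError: Found array with 0 sample(s) (shape=(0, 8)) while a minimum of 1 is required. This seems to mean that I cannot make a prediction based on just one row, which is the main point of my work.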

    Here is my code:

    import numpy as np
    import pandas as pd
    from django.contrib import messages
    from django.http import HttpResponseRedirect
    from django.shortcuts import render
    from django.urls import reverse
    from sklearn import metrics, tree
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import LeaveOneOut, train_test_split

    # Not shown in the original post; assumed to be a LeaveOneOut splitter.
    loo = LeaveOneOut()


    def upload_file(request):
        template = 'upload_file.html'
        if request.method == 'GET':
            return render(request, template)
        CSV_file = request.FILES['csv_file']

        if not CSV_file.name.endswith('.csv'):
            messages.error(request, 'This is not a CSV file')
            return HttpResponseRedirect(reverse('add_pull_requests'))

        train = pd.read_csv(CSV_file)

        features_col = ['Comments', 'LC_added', 'LC_deleted', 'Commits', 'Changed_files', 'Evaluation_time', 'First_status', 'Reputation']
        class_label = ['Label']
        X = train[features_col]  # This is also used for testing
        y = train[class_label]

        random_state = 0
        # test_size=request.GET.get('test_size')
        # Only the last split survives the loop; with a one-row CSV the
        # training part of that split is empty.
        for train_index, test_index in loo.split(X):
            X_train, X_test = X.iloc[train_index], X.iloc[test_index]
            y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        # X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state, test_size=test_size)
        clf = tree.DecisionTreeClassifier()
        clf = clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        print('Up to here is ok')
        try:
            Accuracy = "{0:.2f}%".format(accuracy_score(y_test, y_pred) * 100)
            Precision = "{0:.2f}%".format(metrics.precision_score(y_test, y_pred) * 100)
            Recall = "{0:.2f}%".format(metrics.recall_score(y_test, y_pred) * 100)
            F1_meseaure = "{0:.2f}%".format(2 * metrics.precision_score(y_test, y_pred) * metrics.recall_score(y_test, y_pred) / (metrics.precision_score(y_test, y_pred) + metrics.recall_score(y_test, y_pred)) * 100)
        except ZeroDivisionError:
            print("Error: dividing by zero")
            F1_meseaure = 'nan%'

        print("Accuracy:", Accuracy)
        print("Precision:", Precision)
        print("Recall:", Recall)
        print("F1-measure: ", F1_meseaure)

        importances_feautres = pd.DataFrame({'features': features_col, 'importance': np.round(clf.feature_importances_, 3)})
        importances_feautres = importances_feautres.sort_values('importance', ascending=False).set_index('features')

        print(importances_feautres.shape)
        importances_feautres = [ls[0] for ls in importances_feautres.values.tolist()]

        classification_report = {'accuracy': Accuracy, 'pricision': Precision, 'recall': Recall, 'f1_score': F1_meseaure}

        importance_features = {'importances_feautre': importances_feautres}

        data = {
            'new_data': new_data,  # new_data is not defined anywhere in the posted snippet
            'classification_report': classification_report,
            'importance_feature': importance_features,
            'features': features_col,
        }

        return render(request, template, data)

    The error comes from the following lines:

    for train_index, test_index in loo.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
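
To see why these lines lead to an array with 0 samples, it may help to look at the indices a leave-one-out scheme produces. The sketch below is a hand-written stand-in, not scikit-learn's LeaveOneOut (the name leave_one_out_indices is hypothetical), and the real splitter may behave differently depending on the library version, for example by rejecting a one-row input outright.

```python
def leave_one_out_indices(n_rows):
    """Yield (train_index, test_index) pairs the way a leave-one-out scheme would."""
    for held_out in range(n_rows):
        train_index = [i for i in range(n_rows) if i != held_out]
        test_index = [held_out]
        yield train_index, test_index


# With several rows, every split keeps n_rows - 1 rows for training.
splits_many = list(leave_one_out_indices(4))

# With a single row, the only split has an empty training set, so a
# subsequent fit() would receive an array of shape (0, n_features).
splits_one = list(leave_one_out_indices(1))
print(splits_one)  # [([], [0])]
```

Under this reading, the failure is in training on zero rows rather than in predicting one row: the uploaded one-row file is being used as the training set as well as the test set.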
    

    If I replace those lines with the following line, I get the same error:

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state, test_size=test_size)
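
A random split runs into the same problem, because one row cannot be divided into a non-empty training part and a non-empty test part. The sketch below is a small guard that could run before any split. The helper names split_sizes and can_split are hypothetical, and the rounding (test share rounded up, the remainder used for training) is my understanding of how train_test_split treats a fractional test_size, so treat it as an assumption.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(n_rows * test_size)
    n_train = n_rows - n_test
    return n_train, n_test


def can_split(n_rows, test_size):
    """True only if both the training part and the test part get at least one row."""
    n_train, n_test = split_sizes(n_rows, test_size)
    return n_train >= 1 and n_test >= 1


print(split_sizes(1, 0.25))  # (0, 1): nothing left to train on
print(can_split(1, 0.25))    # False
print(can_split(8, 0.25))    # True
```

When can_split returns False, a view like the one above could skip training and evaluation and send the rows to a model that was fitted earlier.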
    

    1 Answer  |  last activity 7 years ago
  •   Mohamed Abdillah    7 years ago

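    @Vivek Kumar, thank you very much. I understand what you said. I now train the model on the whole dataset and use a separately persisted (pickled) copy of that model to predict on a new single row.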

    import pickle

    model = pickle.dumps(clf)  # clf is the model trained above; dumps() serializes it to bytes
    clf2 = pickle.loads(model)
    clf2.predict(X[i:i+1])  # where i is the index of the row that we want to predict
    

    I think this is fine, unless I have misunderstood your suggestion.
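
The approach described in this reply can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the poster's actual code: the toy arrays are made up, the model is kept in memory as bytes rather than written to a file, and NumPy arrays stand in for the pandas DataFrame used in the question.

```python
import pickle

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy training data (made up): 6 rows, 2 features, binary label.
X = np.array([[0, 1], [1, 1], [2, 0], [8, 5], [9, 6], [10, 7]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# Train once on the whole labeled dataset.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Persist the fitted model to bytes and restore it again.
payload = pickle.dumps(clf)
clf2 = pickle.loads(payload)

# Predict a single row: the slice keeps the input 2-D, shape (1, n_features).
i = 4
single_row = X[i:i + 1]
prediction = clf2.predict(single_row)
print(single_row.shape, prediction)
```

The slice X[i:i + 1] matters because predict expects a two-dimensional input; indexing with X[i] alone would produce a one-dimensional array. Only unpickle payloads that come from a trusted source, since loading a pickle can execute arbitrary code.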