
Reweighted distance in kNN on the iris dataset returns the same results as the regular distance

  • user8270077  ·  7 years ago

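    I am experimenting with how weights on the distance affect the performance of the kNN algorithm. For a reproducible example, I am using the iris dataset.

    To my surprise, weighting 2 of the predictors 100 times more heavily than the other 2 produced exactly the same predictions as the unweighted model. What explains this counterintuitive result?

    My code is as follows: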

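    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler

    iris = load_iris()
    X_original = iris['data']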
    Y = iris['target']
    
    sc = StandardScaler() # Defines the parameters of the Scaler
    
    X = sc.fit_transform(X_original)  # Transforms the original data to standardized data and returns them
    
    from sklearn.model_selection import StratifiedShuffleSplit

    sss = StratifiedShuffleSplit(n_splits = 1, train_size = 0.8, test_size = 0.2)

    # Take the single (train, test) index pair produced by the splitter
    train_index, test_index = next(sss.split(X, Y))

    X_train = X[train_index]
    X_test = X[test_index]
    Y_train = Y[train_index]
    Y_test = Y[test_index]
    
    from sklearn.neighbors import KNeighborsClassifier
    
    knn = KNeighborsClassifier(n_neighbors = 6)
    
    iris_fit = knn.fit(X_train, Y_train)  # The data can be passed as numpy arrays or pandas dataframes/series.
                                                      # All the data should be numeric
                                                      # There should be no NaNs
    
    predictions_w1 = knn.predict(X_test)
    
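    # Weight the two petal features 100 times more heavily than the two sepal features
    weights = np.array([1, 1, 100, 100])
    weights = weights / np.sum(weights)

    # 'wminkowski' computes sum(|w * (x - y)|^p)^(1/p).
    # Note: later scikit-learn releases deprecated this metric name in favor of
    # metric='minkowski' with a 'w' entry in metric_params.
    knn_w = KNeighborsClassifier(n_neighbors = 6, metric='wminkowski', p=2,
                                 metric_params={'w': weights})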
    
    iris_fit_w = knn_w.fit(X_train, Y_train)  # The data can be passed as numpy arrays or pandas dataframes/series.
                                                      # All the data should be numeric
                                                      # There should be no NaNs
    
    predictions_w100 = knn_w.predict(X_test)
    
    (predictions_w1 != predictions_w100).sum()
    # Output: 0
    
1 answer

  • Jan K  ·  7 years ago

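    They are not always identical. Add a random_state to your train/test split and you will see the number of differing predictions change from one seed to another.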

     StratifiedShuffleSplit(n_splits = 1, train_size = 0.8, test_size = 0.2, random_state=3)
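
    To see how the count varies with the split, the comparison can be repeated over several seeds. The following is only a minimal sketch, not the code from the question: the helper name `count_differences` and the seed range 0 to 9 are arbitrary choices made for illustration. To stay independent of the scikit-learn version, it does not pass `metric='wminkowski'`. It instead multiplies each column by its weight and uses the default Euclidean distance, which gives the same distance under the assumption that the weighted metric has the form sum(|w * (x - y)|^2)^(1/2).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris["data"])
Y = iris["target"]

# Same weights as in the question, normalized to sum to 1
weights = np.array([1, 1, 100, 100], dtype=float)
weights /= weights.sum()


def count_differences(seed):
    """Hypothetical helper: number of test predictions that differ
    between the unweighted and the weighted model for one split."""
    sss = StratifiedShuffleSplit(n_splits=1, train_size=0.8, test_size=0.2,
                                 random_state=seed)
    train_index, test_index = next(sss.split(X, Y))

    plain = KNeighborsClassifier(n_neighbors=6)
    plain.fit(X[train_index], Y[train_index])

    # Assumption: scaling column i by w_i and using plain Euclidean distance
    # stands in for the weighted metric sqrt(sum((w_i * d_i) ** 2)).
    weighted = KNeighborsClassifier(n_neighbors=6)
    weighted.fit(X[train_index] * weights, Y[train_index])

    p_plain = plain.predict(X[test_index])
    p_weighted = weighted.predict(X[test_index] * weights)
    return int((p_plain != p_weighted).sum())


# Arbitrary seeds, chosen only for illustration
diffs = {seed: count_differences(seed) for seed in range(10)}
print(diffs)
```

    Each value in `diffs` is the number of the 30 test samples on which the two models disagree for that seed. The exact counts depend on the seed and the library version.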
    

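    Also, a weighted Minkowski distance with such extreme weights on the 3rd (petal length) and 4th (petal width) features gives essentially the same result as running KNN with an unweighted Minkowski distance on those two features alone. Because these two features appear to be highly informative, it is not surprising that the results are very similar to those obtained using all four features. See the image from Wikipedia below.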

    [Image from Wikipedia]
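
    The claim that the two petal features dominate the weighted distance can be checked with a little arithmetic. The sketch below uses only the standard library and assumes the weighted Minkowski form sum(|w_i * (x_i - y_i)|^p)^(1/p). The function `wminkowski` and the two sample points `a` and `b` are made up for illustration and are not taken from the iris data.

```python
def wminkowski(x, y, w, p=2):
    """Weighted Minkowski distance in the assumed form
    (sum_i |w_i * (x_i - y_i)| ** p) ** (1 / p)."""
    return sum(abs(wi * (xi - yi)) ** p
               for xi, yi, wi in zip(x, y, w)) ** (1.0 / p)


# Same weights as in the question, normalized to sum to 1
raw = [1, 1, 100, 100]
w = [v / sum(raw) for v in raw]

# Two made-up standardized samples, in the order
# (sepal length, sepal width, petal length, petal width)
a = [0.5, -1.0, 0.8, 0.9]
b = [-0.7, 0.4, 0.2, 0.1]

d_all = wminkowski(a, b, w)                 # all four features
d_petal = wminkowski(a[2:], b[2:], w[2:])   # petal features only

# Fraction of the squared distance contributed by the two sepal features
sepal_share = 1 - (d_petal / d_all) ** 2

print(d_all, d_petal, sepal_share)
```

    With p = 2 each weight is squared, so a 100:1 ratio in the weights becomes a 10000:1 ratio in the contribution to the squared distance. For these two points the sepal features account for well under 1% of the squared distance, so the ranking of neighbors is decided almost entirely by the petal features.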