
Number of data points classified and plotted does not match the number of points in the dataset

knn scikit-learn machine-learning pandas python

Burak · Tech Community · 7 years ago

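I have a dataset with 54 data points that I want to classify in Python using a k-NN classifier with 20 neighbors. My code classifies the data and plots the result, but I only see 22 data points in the scatter plot rather than 54 classified points.

Is there a reason in machine learning why not all data points would be classified and plotted?

Does the number of neighbors chosen affect the number of data points being classified and plotted? Thanks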

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets
import pandas as pd
from sklearn import preprocessing

# Preprocessing of dataset done here.
n_neighbors = 20
dataset = pd.read_csv('cereal.csv')
X = dataset.iloc[:, [3,5]].values
y = dataset.iloc[:, 1].values
y_set = preprocessing.LabelEncoder()
y_fit = y_set.fit(y)
y_trans = y_set.transform(y)

# Filter the dataset here. Total number of data points: 77, but only 54
# will be selected for use.
j = 0
for i in range(0, 77):
    if y[i] == 'K' or y[i] == 'G' or y[i] == 'P':
        j = j + 1

new_data = np.zeros((j, 2))
new_let = [0] * j
j = 0

for i in range(0, 77):
    if y[i] == 'K' or y[i] == 'G' or y[i] == 'P':
        new_data[j] = X[i]
        new_let[j] = y[i]
        j = j + 1

# Plotting and setting up mesh grid done here

h = .02
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

for weights in ['uniform', 'distance']:
    # Create an instance of the k-NN classifier and fit the data.
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X, y_trans)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].

    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot the data points on top of the decision regions
    plt.scatter(X[:, 0], X[:, 1], c=y_trans, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))

plt.show()

1 Answer | as of 7 years ago

Mihai Chelaru klin 7 years ago

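First of all, you are using all 77 points of the dataset in both the classifier and the plot. The variables you create with 54 points (new_data and new_let) are used neither for fitting the classifier nor for producing the resulting plot.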
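To illustrate the point about the filtered variables, here is a minimal sketch of selecting only the 'K', 'G' and 'P' rows with a boolean mask. The small arrays are hypothetical stand-ins for the real columns of cereal.csv, and the names mask, X_sel, y_sel and y_enc are made up for this example; label encoding is done with np.unique rather than LabelEncoder to keep the sketch NumPy-only.

```python
import numpy as np

# Hypothetical stand-in for the two feature columns and the manufacturer column.
X = np.array([[70, 1], [120, 5], [70, 1], [110, 2], [110, 2], [100, 0]], dtype=float)
y = np.array(['N', 'Q', 'K', 'G', 'P', 'A'])

# Boolean mask: True where the manufacturer is one of the three of interest.
mask = np.isin(y, ['K', 'G', 'P'])

X_sel = X[mask]   # features of the selected rows only
y_sel = y[mask]   # matching labels

# Encode the string labels as integers 0..n_classes-1 (same idea as LabelEncoder).
classes, y_enc = np.unique(y_sel, return_inverse=True)

# Compare the size of the full set with the filtered subset.
print(X.shape, X_sel.shape)
```

In the question's script, it is arrays like X_sel and y_enc (rather than X and y_trans) that would have to be passed to clf.fit(...) and plt.scatter(...) for only the filtered points to be classified and drawn.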

After running the script, you should check the variable explorer in Anaconda to see the sizes of the different variables you are using.
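If no variable explorer is at hand, the same check can be done by printing array shapes. The sketch below uses zero-filled placeholder arrays that only mimic the sizes described in the question; in the real script you would print the shapes of your own X and new_data.

```python
import numpy as np

# Placeholder arrays that only mimic the sizes in the question.
X = np.zeros((77, 2))          # what the script actually fits and plots
new_data = np.zeros((54, 2))   # the filtered array that is built but never used

# Print each array's name and shape to see which one has 77 rows and which has 54.
for name, arr in [("X", X), ("new_data", new_data)]:
    print(f"{name}: shape={arr.shape}")
```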

As for the plot you are producing, if you look at how the data is distributed, you can see why you only see 22 points:

Cereal K-NN

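If you look at the original dataset, several points share identical values in these two columns (fat and calories). Multiple points are therefore stacked on top of each other in the plot, so although you are plotting 77 points, you can only "see" 22 of them. If you want to see all of the points nicely separated, you may want to choose different attributes.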
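One way to confirm the overlap numerically is to count the distinct (x, y) pairs. The sketch below uses a small hypothetical array in place of the real calories and fat columns. The jitter step at the end is an extra technique not mentioned in the answer above, shown only as one possible way to make stacked points visible.

```python
import numpy as np

# Hypothetical (calories, fat) pairs with repeated values.
X = np.array([[110, 1], [110, 1], [100, 0],
              [110, 1], [100, 0], [120, 2]], dtype=float)

# Distinct rows and how many points share each one.
unique_rows, counts = np.unique(X, axis=0, return_counts=True)
print(len(X), "points plotted,", len(unique_rows), "distinct positions")

# Optional: add small random jitter so stacked points no longer coincide.
rng = np.random.default_rng(0)
X_jittered = X + rng.normal(scale=0.1, size=X.shape)
```

Passing X_jittered instead of X to plt.scatter, or setting a low alpha value in plt.scatter, would make the stacking visible without changing the classifier, which should still be fitted on the unjittered data.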