Stratified splitting of a pandas DataFrame into training, validation and test sets

deep-learning machine-learning dataframe pandas python

Oblomov · 7 years ago

The following, extremely simplified DataFrame stands in for a much larger DataFrame containing medical diagnoses:

medicalData = pd.DataFrame({'diagnosis':['positive','positive','negative','negative','positive','negative','negative','negative','negative','negative']})
medicalData

    diagnosis
0   positive
1   positive
2   negative
3   negative
4   positive
5   negative
6   negative
7   negative
8   negative
9   negative

For machine learning, I need to randomly split this DataFrame into three subframes in the following way:

trainingDF, validationDF, testDF = SplitData(medicalData,fractions = [0.6,0.2,0.2])

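Here the fractions array specifies the share of the complete data that goes into each subframe. The data in the subframes must be mutually exclusive, and the fractions must sum to one. Additionally, the fraction of positive diagnoses in each subset needs to be approximately the same.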

Answers to this question recommend using the pandas sample method or the train_test_split function from sklearn. But none of these solutions seem to generalize well to n splits, and none of them provides a stratified split.

1 answer | last active 7 years ago

Oblomov · 7 years ago

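`np.array_split`

To generalize to n splits, `np.array_split` can be used. Note that this split is random but not stratified.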

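import numpy as np

# df is the DataFrame to split (medicalData in the question)
fractions = np.array([0.6, 0.2, 0.2])
# shuffle the input rows
df = df.sample(frac=1)
# cut at 60% and 80% of the rows, which yields 3 parts
train, val, test = np.array_split(
    df, (fractions[:-1].cumsum() * len(df)).astype(int))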
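
To make the cut-point arithmetic concrete, here is a minimal sketch for the 10-row example from the question. The names `n_rows`, `cut_points` and `sizes` are only illustrative, and rounding before the integer cast is my own addition, as a guard against a product such as 5.999... being truncated to 5.

```python
import numpy as np

fractions = np.array([0.6, 0.2, 0.2])
n_rows = 10  # number of rows in the example frame from the question

# Cumulative fractions of all parts but the last mark where each part ends
# (approximately 0.6 and 0.8 here).
cum = fractions[:-1].cumsum()

# Convert the cumulative fractions to row positions; rounding first avoids
# losing a row to floating-point truncation.
cut_points = np.round(cum * n_rows).astype(int)

# Splitting the positions 0..9 at those points gives the size of each part.
parts = np.split(np.arange(n_rows), cut_points)
sizes = [len(p) for p in parts]
```

With three fractions there are two cut points, and in general n fractions produce n - 1 cut points, which is why this approach extends to any number of splits.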

`train_test_split`

For a stratified split, `train_test_split` can be applied twice:

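from sklearn.model_selection import train_test_split

# separate the labels from the features
y = df.pop('diagnosis').to_frame()
X = df

# first split: 60% training, 40% held out, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.4)

# second split: halve the held-out 40% into test and validation (20% each)
X_test, X_val, y_test, y_val = train_test_split(
        X_test, y_test, stratify=y_test, test_size=0.5)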

Here `X` is a DataFrame of your features and `y` is a DataFrame of your labels.
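
The two ideas above can be combined into one helper that is both stratified and works for n splits. The following is only a sketch under stated assumptions, not part of the original answer: `split_data`, `label_col` and `seed` are hypothetical names, the index of the frame is assumed to be unique, and the example data is synthetic. It shuffles each class separately, cuts it at the cumulative fractions, and then joins the pieces part by part.

```python
import numpy as np
import pandas as pd


def split_data(df, fractions, label_col="diagnosis", seed=0):
    """Hypothetical helper: stratified split of df into len(fractions) parts.

    Assumes df has a unique index and that fractions sum to 1.
    """
    fractions = np.asarray(fractions, dtype=float)
    if not np.isclose(fractions.sum(), 1.0):
        raise ValueError("fractions must sum to 1")

    rng = np.random.default_rng(seed)
    parts = [[] for _ in fractions]

    # Shuffle and cut each class separately so every part keeps the class ratio.
    for _, group in df.groupby(label_col):
        idx = rng.permutation(group.index.to_numpy())
        cuts = np.round(fractions[:-1].cumsum() * len(idx)).astype(int)
        for part, chunk in zip(parts, np.split(idx, cuts)):
            part.append(chunk)

    # Each row index lands in exactly one part, so the parts are mutually exclusive.
    return [df.loc[np.concatenate(p)] for p in parts]


# Synthetic example data: 30 positive and 70 negative diagnoses.
medical_data = pd.DataFrame({"diagnosis": ["positive"] * 30 + ["negative"] * 70})
train, val, test = split_data(medical_data, [0.6, 0.2, 0.2])
```

Within each returned part the rows are grouped by class, so a final `part.sample(frac=1)` may be wanted before training. With very small classes the rounding can leave a part without any member of that class, so the class ratios are only approximately preserved.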