
Pandas: get a subset of data without a loop

  •  1
  • mrgloom  · Tech Community  · 7 years ago

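    I am trying to split my training data into train/test sets based on customer_id (several rows in the DataFrame can have the same customer_id). Is there a pandas-idiomatic way to do the "build df_test" and "drop from df_train" parts without a loop?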

    import random
    
    import numpy as np
    import pandas as pd
    
    #Split data for train / test split
    
    df_train = pd.read_csv('data/train.csv')
    print('df_train.shape', df_train.shape)
    
    df_train = df_train.replace(np.nan, 'nan', regex=True)
    
    train_customer_id_set = df_train.customer_id.unique()
    print('len(train_customer_id_set)', len(train_customer_id_set))
    
    #Split train data to train/test by customer_id
    n = 1000
    test_customer_id_set = list(train_customer_id_set)
    random.shuffle(test_customer_id_set)
    test_customer_id_set = test_customer_id_set[:n]
    
    #Q: how to do it without a loop?
    
    #build df_test
    df_list = []
    for customer_id in test_customer_id_set:
        df = df_train[df_train['customer_id']==customer_id]
        df_list.append(df)
    df_test = pd.concat(df_list)
    
    #drop from df_train
    for customer_id in test_customer_id_set:
        df_train = df_train.drop(df_train[df_train.customer_id==customer_id].index)
    
    train_customer_id_set = df_train.customer_id.unique()
    
    print('df_train.shape', df_train.shape)
    print('df_test.shape', df_test.shape)
    
    1 answer  |  as of 7 years ago
  •  2
  •   Ami Tavory    7 years ago

    After the point where you compute test_customer_id_set, it looks like what you are doing is equivalent to:

    df_test = df_train[df_train.customer_id.isin(test_customer_id_set)]
    df_train = df_train[~df_train.customer_id.isin(test_customer_id_set)]
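
    For a version that can be run without data/train.csv, here is a minimal sketch of the same isin-based split. The toy DataFrame, the amount column, the seed, and n = 2 are all made up for illustration and are not from the question. It builds the boolean mask once and reuses its negation, which avoids calling isin twice.

```python
import random

import pandas as pd

# Toy data (hypothetical): several rows can share one customer_id.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3, 4, 5],
    "amount": [10, 20, 30, 40, 50, 60, 70, 80],
})

# Pick n distinct customer ids at random for the test split.
# The seed is only there to make the illustration repeatable.
n = 2
rng = random.Random(0)
test_customer_ids = rng.sample(list(df["customer_id"].unique()), n)

# Build the boolean mask once, then select with it and with its negation.
is_test = df["customer_id"].isin(test_customer_ids)
df_test = df[is_test]
df_train = df[~is_test]

print("df_train.shape", df_train.shape)
print("df_test.shape", df_test.shape)
```

    Every row ends up in exactly one of the two frames, and no customer_id appears in both. In the two-line form above, the order matters: df_test has to be taken before df_train is overwritten.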