
How do I pass a Dask DataFrame as input to a dask-ml model?

  • Christian Alis  · 6 years ago

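    A typical ML pipeline processes a pandas or Dask DataFrame into a form that can be passed to an ML model. However, many dask-ml models cannot accept a Dask DataFrame, because a Dask DataFrame does not track the number of rows in each partition. Calling fit raises the error: Cannot fit on dask.dataframe due to unknown partition lengths. What should I do to pass a Dask DataFrame to a dask-ml model?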

    For example:

    import dask.dataframe as dd
    import pandas as pd
    from dask_ml.cluster import KMeans
    
    df = dd.from_pandas(pd.DataFrame({'A': [1, 2, 3, 4, 5], 
                                      'B': [6, 7, 8, 9, 10]}),
                        npartitions=2)
    
    kmeans = KMeans()
    kmeans.fit(df)
    

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-53-6c1545864b12> in <module>()
          6 
          7 kmeans = KMeans()
    ----> 8 kmeans.fit(df)
    
    ~/anaconda3/envs/pds/lib/python3.6/site-packages/dask_ml/cluster/k_means.py in fit(self, X, y)
        187 
        188     def fit(self, X, y=None):
    --> 189         X = self._check_array(X)
        190         labels, centroids, inertia, n_iter = k_means(
        191             X,
    
    ~/anaconda3/envs/pds/lib/python3.6/site-packages/dask_ml/utils.py in wraps(*args, **kwargs)
        298         def wraps(*args, **kwargs):
        299             with _timer(f.__name__, _logger=logger, level=level):
    --> 300                 results = f(*args, **kwargs)
        301             return results
        302 
    
    ~/anaconda3/envs/pds/lib/python3.6/site-packages/dask_ml/cluster/k_means.py in _check_array(self, X)
        159         elif isinstance(X, dd.DataFrame):
        160             raise TypeError(
    --> 161                 "Cannot fit on dask.dataframe due to unknown " "partition lengths."
        162             )
        163 
    
    TypeError: Cannot fit on dask.dataframe due to unknown partition lengths.
    
    1 answer

  • TomAugspurger  · 6 years ago

    The dask-ml master branch now supports this: https://github.com/dask/dask-ml/pull/393

    This will be included in the dask-ml 0.10 release.