
Converting numpy rows to columns based on ID

  •  3
  • n0shadow  · Tech Community  · 7 years ago

    Suppose I have a numpy array that maps between the IDs of two item types:

    [[1, 12],
     [1, 13],
     [1, 14],
     [2, 13],
     [2, 14],
     [3, 11]]
    

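    I would like to rearrange this array so that each row of the new array holds all the items that match the same ID in the original array. Each column then holds one mapping from the original array, up to a specified limit on the number of columns in the new array. If we wanted this result from the array above while allowing only 2 columns, we would obtain: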

    [[12, 13],  #Represents 1 - 14 was not kept as only 2 columns are allowed
     [13, 14],  #Represents 2
     [11,  0]]  #Represents 3 - 0 was used as padding since 3 did not have 2 mappings
    

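    The naive approach here would be a for loop that fills the new array as it encounters rows of the original array. Is there a more efficient way to do this with numpy functionality?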
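
    For reference, the naive loop described above could look roughly like the sketch below. The name `pack_naive` and the `max_items` parameter are assumptions made for illustration; they do not come from the question.

```python
import numpy as np

def pack_naive(mapping, max_items=2):
    """Hypothetical helper: one zero-padded row per ID, keeping the first max_items items."""
    ids = np.unique(mapping[:, 0])
    row_of = {id_: r for r, id_ in enumerate(ids)}    # ID -> output row index
    out = np.zeros((len(ids), max_items), dtype=mapping.dtype)
    filled = np.zeros(len(ids), dtype=int)            # items stored so far per row
    for id_, item in mapping:
        r = row_of[id_]
        if filled[r] < max_items:                     # drop items beyond the column limit
            out[r, filled[r]] = item
            filled[r] += 1
    return out

a = np.array([[1, 12], [1, 13], [1, 14], [2, 13], [2, 14], [3, 11]])
print(pack_naive(a))
# [[12 13]
#  [13 14]
#  [11  0]]
```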

    4 answers  |  up to 7 years ago
        1
  •  2
  •   Paul Panzer    7 years ago

    def pp(map_, maxitems=2):
        M = sparse.csr_matrix((map_[:, 1], map_[:, 0], np.arange(map_.shape[0]+1)))
        M = M.tocsc()
        sizes = np.diff(M.indptr)
        ids, = np.where(sizes)
        D = np.concatenate([M.data, np.zeros((maxitems - 1,), dtype=M.data.dtype)])
        D = np.lib.stride_tricks.as_strided(D, (D.size - maxitems + 1, maxitems),
                                            2 * D.strides)
        result = D[M.indptr[ids]]
        result[np.arange(maxitems) >= sizes[ids, None]] = 0
        return result
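
    The function above groups the items by ID through a sparse matrix and then reads fixed-width windows from a strided view. As a rough sketch of the same grouping idea in plain NumPy (a swapped-in technique, not the answer's method), one can stable-sort by ID, compute each entry's position within its group, and scatter into a zero-filled array. The name `pack_vectorized` and the `max_items` parameter are assumptions.

```python
import numpy as np

def pack_vectorized(mapping, max_items=2):
    """Hypothetical helper: returns (unique_ids, packed), with packed zero-padded."""
    order = np.argsort(mapping[:, 0], kind="stable")   # group rows by ID, keep input order
    ids_sorted = mapping[order, 0]
    items_sorted = mapping[order, 1]
    uniq, starts, counts = np.unique(ids_sorted, return_index=True, return_counts=True)
    group = np.repeat(np.arange(len(uniq)), counts)    # output row of each entry
    pos = np.arange(len(ids_sorted)) - starts[group]   # position inside its group
    keep = pos < max_items                             # drop entries past the column limit
    out = np.zeros((len(uniq), max_items), dtype=mapping.dtype)
    out[group[keep], pos[keep]] = items_sorted[keep]
    return uniq, out

a = np.array([[1, 12], [1, 13], [1, 14], [2, 13], [2, 14], [3, 11]])
ids, packed = pack_vectorized(a)
print(ids)     # [1 2 3]
print(packed)
# [[12 13]
#  [13 14]
#  [11  0]]
```

    Unlike a split on `np.diff`, this sketch does not require the input to be sorted by ID, because the stable sort handles the grouping.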
    

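    The timings use @chrisz's code, modified to use less repetitive test data. I also added a bit of "validation": chrisz's solution and mine give the same answer; the other two output a different format, so I could not check them.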

    [image: timing plot, relative time vs. N on log-log axes]

    Code:

    from scipy import sparse
    import numpy as np
    from collections import defaultdict, deque
    
    def pp(map_, maxitems=2):
        M = sparse.csr_matrix((map_[:, 1], map_[:, 0], np.arange(map_.shape[0]+1)))
        M = M.tocsc()
        sizes = np.diff(M.indptr)
        ids, = np.where(sizes)
        D = np.concatenate([M.data, np.zeros((maxitems - 1,), dtype=M.data.dtype)])
        D = np.lib.stride_tricks.as_strided(D, (D.size - maxitems + 1, maxitems),
                                            2 * D.strides)
        result = D[M.indptr[ids]]
        result[np.arange(maxitems) >= sizes[ids, None]] = 0
        return result
    
    def chrisz(a):
      return [[*a[a[:,0]==i,1],0][:2] for i in np.unique(a[:,0])]
    
    def piotr(a):
      d = defaultdict(lambda: deque((0, 0), maxlen=2))
      for key, val in a:
        d[key].append(val)
      return d
    
    def karams(arr):
      cols = arr.shape[1]
      ids = arr[:, 0]
      inds = np.where(np.diff(ids) != 0)[0] + 1
      sp = np.split(arr[:,1:], inds)
      result = [a[:2].ravel() if a.size >= cols else np.pad(a.ravel(), (0, cols -1 * (cols - a.size)), 'constant')for a in sp]
      return result
    
    def make(nid, ntot):
        return np.c_[np.random.randint(0, nid, (ntot,)),
                     np.random.randint(0, 2**30, (ntot,))]
    
    from timeit import timeit
    import pandas as pd
    import matplotlib.pyplot as plt
    
    res = pd.DataFrame(
           index=['pp', 'chrisz', 'piotr', 'karams'],
           columns=[10, 50, 100, 500, 1000, 5000, 10000],# 50000],
           dtype=float
    )
    
    for c in res.columns:
    #        l = np.repeat(np.array([[1, 12],[1, 13],[1, 14],[2, 13],[2, 14],[3, 11]]), c, axis=0)
        l = make(c // 2, c * 6)
        assert np.all(chrisz(l) == pp(l))
        for f in res.index:
            stmt = '{}(l)'.format(f)
            setp = 'from __main__ import l, {}'.format(f)
            res.at[f, c] = timeit(stmt, setp, number=30)
    
    ax = res.div(res.min()).T.plot(loglog=True)
    ax.set_xlabel("N");
    ax.set_ylabel("time (relative)");
    
    plt.show()
    
        2
  •  3
  •   Mazdak    7 years ago

    Here is a general, mostly NumPythonic approach:

    In [144]: def array_packer(arr):
         ...:     cols = arr.shape[1]
         ...:     ids = arr[:, 0]
         ...:     inds = np.where(np.diff(ids) != 0)[0] + 1
         ...:     sp = np.split(arr[:,1:], inds)
         ...:     result = [np.unique(a[: cols]) if a.shape[0] >= cols else
         ...:                    np.pad(np.unique(a), (0, (cols - 1) * (cols - a.shape[0])), 'constant')
         ...:                 for a in sp]
         ...:     return result
         ...:     
         ...:     
    

    Demo:

    In [145]: a = np.array([[1, 12, 15, 45],
         ...:  [1, 13, 23, 9],
         ...:  [1, 14, 14, 11],
         ...:  [2, 13, 90, 34],
         ...:  [2, 14, 23, 43],
         ...:  [3, 11, 123, 53]])
         ...:  
    
    In [146]: array_packer(a)
    Out[146]: 
    [array([ 9, 11, 12, 13, 14, 15, 23, 45,  0,  0,  0]),
     array([13, 14, 23, 34, 43, 90,  0,  0,  0,  0,  0,  0]),
     array([ 11,  53, 123,   0,   0,   0,   0,   0,   0,   0,   0,   0])]
    
    In [147]: a = np.array([[1, 12, 15],
         ...:  [1, 13, 23],
         ...:  [1, 14, 14],
         ...:  [2, 13, 90],
         ...:  [2, 14, 23],
         ...:  [3, 11, 123]])
         ...: 
         ...:   
         ...:  
    
    In [148]: array_packer(a)
    Out[148]: 
    [array([12, 13, 14, 15, 23]),
     array([13, 14, 23, 90,  0,  0]),
     array([ 11, 123,   0,   0,   0,   0])]
    
        3
  •  2
  •   Piotr    7 years ago

    For this problem, the naive for loop is actually a fairly efficient solution:

    from collections import defaultdict, deque
    d = defaultdict(lambda: deque((0, 0), maxlen=2))
    
    %%timeit
    for key, val in a:
        d[key].append(val)
    4.43 µs ± 29.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    
    # result: {1: deque([13, 14]), 2: deque([13, 14]), 3: deque([0, 11])}
    

    By comparison, the numpy solution suggested in this thread is about 4 times slower:

    %timeit [[*a[a[:,0]==i,1],0][:2] for i in np.unique(a[:,0])]
    18.6 µs ± 336 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    numpy is great and I use it a lot myself, but I find it cumbersome in this case.
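
    One detail of the deque-based loop above is worth spelling out: a `deque` created with `maxlen=2` discards from the left when a new value is appended, so it keeps the last two values for each ID and leaves the zero padding at the front. That explains the `deque([13, 14])` and `deque([0, 11])` entries in the result, which differ from the layout requested in the question. A minimal illustration, with variable names chosen only for this example:

```python
from collections import deque

# ID 1 has three items: the oldest values are pushed out on the left.
full = deque((0, 0), maxlen=2)
for value in (12, 13, 14):
    full.append(value)
print(list(full))    # [13, 14]

# ID 3 has a single item: one of the initial zeros survives at the front.
short = deque((0, 0), maxlen=2)
short.append(11)
print(list(short))   # [0, 11]
```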

        4
  •  1
  •   user3483203    7 years ago

    Adapting slightly from the near-duplicate question to pad and keep only two elements:

    [[*a[a[:,0]==i,1],0][:2] for i in np.unique(a[:,0])]
    

    Output:

    [[12, 13], [13, 14], [11, 0]]
    

    If you want to keep track of the keys:

    {i:[*a[a[:,0]==i,1],0][:2] for i in np.unique(a[:,0])}
    
    # {1: [12, 13], 2: [13, 14], 3: [11, 0]}
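
    To make the one-liner easier to follow, here is a step-by-step sketch of the same idea, generalised to an arbitrary column limit. The name `pad_and_trim` and the parameter `n` are assumptions. Note that appending a single `0`, as the one-liner does, is only enough padding when the limit is 2, so this sketch appends `n` zeros before slicing.

```python
import numpy as np

def pad_and_trim(a, n=2):
    """Hypothetical helper: per ID, take its items, pad with zeros, keep the first n."""
    out = []
    for i in np.unique(a[:, 0]):
        vals = a[a[:, 0] == i, 1]             # boolean mask selects every item for ID i
        padded = [*vals.tolist(), *[0] * n]   # n zeros guarantee at least n entries
        out.append(padded[:n])                # trim to the column limit
    return out

a = np.array([[1, 12], [1, 13], [1, 14], [2, 13], [2, 14], [3, 11]])
print(pad_and_trim(a))        # [[12, 13], [13, 14], [11, 0]]
print(pad_and_trim(a, n=3))   # [[12, 13, 14], [13, 14, 0], [11, 0, 0]]
```

    Each mask `a[:, 0] == i` scans the whole array, so the cost grows with the number of rows times the number of distinct IDs, which is likely why this approach slows down on data with many IDs.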
    

    Functions

    def chrisz(a):
      return [[*a[a[:,0]==i,1],0][:2] for i in np.unique(a[:,0])]
    
    def piotr(a):
      d = defaultdict(lambda: deque((0, 0), maxlen=2))
      for key, val in a:
        d[key].append(val)
      return d
    
    def karams(arr):
      cols = arr.shape[1]
      ids = arr[:, 0]
      inds = np.where(np.diff(ids) != 0)[0] + 1
      sp = np.split(arr[:,1:], inds)
      result = [a[:2].ravel() if a.size >= cols else np.pad(a.ravel(), (0, cols -1 * (cols - a.size)), 'constant')for a in sp]
      return result
    

    Timings

    from timeit import timeit
    import pandas as pd
    import matplotlib.pyplot as plt
    
    res = pd.DataFrame(
           index=['chrisz', 'piotr', 'karams'],
           columns=[10, 50, 100, 500, 1000, 5000, 10000, 50000],
           dtype=float
    )
    
    for f in res.index:
        for c in res.columns:
            l = np.repeat(np.array([[1, 12],[1, 13],[1, 14],[2, 13],[2, 14],[3, 11]]), c, axis=0)
            stmt = '{}(l)'.format(f)
            setp = 'from __main__ import l, {}'.format(f)
            res.at[f, c] = timeit(stmt, setp, number=30)
    
    ax = res.div(res.min()).T.plot(loglog=True)
    ax.set_xlabel("N");
    ax.set_ylabel("time (relative)");
    
    plt.show()
    

    Results (apparently, @kasramvd is the winner):

    [image: timing plot, relative time vs. N on log-log axes]