
Reading a text file with pandas to get specific lines

  •  2
  • krock1516  · Tech Community  · 7 years ago

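    I am trying to read a text log file with the pandas read_csv method. I need the line that comes immediately before each ---- separator line. I defined column names only to make column-based access easier, but I have not found a way to achieve this.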

    My raw log data:

    myserer143
    -------------------------------
    Stopping Symantec Management Agent for UNIX, Linux and Mac: [  OK  ]
    This will remove the Symantec Management Agent for UNIX, Linux and Mac software from your system.
    
    Are you sure you want to continue [Yy/Nn]?
    
    Uninstalling dependant solutions...
    Unregistering the Altiris Base Task Handlers for UNIX, Linux and Mac sub-agent...
    Unregistering the Script Task Plugin...
    Unregistering the Power Control Task Plugin...
    Unregistering the Service Control Task Plugin...
    Unregistering the Web Service Task Plugin...
    Unregistering the Reset Task Agent Task Plugin...
    Unregistering the Agent Control Task Plugin...
    Unregistering solution...
    Unregistering the SMF cli plug-in...
    Unregistering the Software Management Framework Agent sub-agent...
    Removing wrapper scripts and links for applications...
    Unregistering the Software Management Framework Agent Plugins...
    Removing wrapper scripts and links for applications...
    Unregistering solution...
    Unregistering the CTA cli plug-in...
    Unregistering the Client Task Scheduling sub-agent...
    Unregistering the Altiris Client Task Agent for UNIX, Linux and Mac sub-agent...
    Remove the wrapper script and link for the Task Util application...
    Unregistering the Altiris Client Task Agent for UNIX, Linux and Mac Plugin...
    Unregistering the Client Task Scheduling Plugin...
    Unregistering the Alert User Task Plugin...
    Unregistering the shared library...
    Unregistering solution...
    Unregistering the Inventory Rule Agent...
    Removing wrapper scripts and links for applications...
    Unregistering the Inventory Rule Agent Plugin...
    Removing wrapper scripts and links for applications...
    Unregistering solution...
    Uninstalling dependant solutions finished.
    
    Removing Symantec Management Agent for UNIX, Linux and Mac package from the system...
    Removing wrapper scripts and links for applications...
    Stopping Symantec Management Agent for UNIX, Linux and Mac: [  OK  ]
    Remove non packaged files.
    Symantec Management Agent for UNIX, Linux and Mac Configuration utility.
      Removing aex-* links in /usr/bin
      Removing RC init links and scripts
    Cleaning up after final package removal.
    Removal finished.
    
    Uninstallation has finished.
    dbserer144
    -------------------------------
    Stopping Symantec Management Agent for UNIX, Linux and Mac: [  OK  ]
    This will remove the Symantec Management Agent for UNIX, Linux and Mac software from your system.
    
    Are you sure you want to continue [Yy/Nn]?
    
    Uninstalling dependant solutions...
    Unregistering the Altiris Base Task Handlers for UNIX, Linux and Mac sub-agent...
    Unregistering the Script Task Plugin...
    Unregistering the Power Control Task Plugin...
    Unregistering the Service Control Task Plugin...
    Unregistering the Web Service Task Plugin...
    Unregistering the Reset Task Agent Task Plugin...
    Unregistering the Agent Control Task Plugin...
    Unregistering solution...
    Unregistering the SMF cli plug-in...
    Unregistering the Software Management Framework Agent sub-agent...
    Removing wrapper scripts and links for applications...
    Unregistering the Software Management Framework Agent Plugins...
    Removing wrapper scripts and links for applications...
    Unregistering solution...
    Unregistering the CTA cli plug-in...
    Unregistering the Client Task Scheduling sub-agent...
    Unregistering the Altiris Client Task Agent for UNIX, Linux and Mac sub-agent...
    Remove the wrapper script and link for the Task Util application...
    Unregistering the Altiris Client Task Agent for UNIX, Linux and Mac Plugin...
    Unregistering the Client Task Scheduling Plugin...
    Unregistering the Alert User Task Plugin...
    Unregistering the shared library...
    Unregistering solution...
    Unregistering the Inventory Rule Agent...
    Removing wrapper scripts and links for applications...
    Unregistering the Inventory Rule Agent Plugin...
    Removing wrapper scripts and links for applications...
    Unregistering solution...
    Uninstalling dependant solutions finished.
    Removing Symantec Management Agent for UNIX, Linux and Mac package from the system...
    Removing wrapper scripts and links for applications...
    Stopping Symantec Management Agent for UNIX, Linux and Mac: [  OK  ]
    Remove non packaged files.
    Symantec Management Agent for UNIX, Linux and Mac Configuration utility.
      Removing aex-* links in /usr/bin
      Removing RC init links and scripts
    Cleaning up after final package removal.
    Removal finished.
    
    Uninstallation has finished.
    

    The DataFrame looks like this:

    >>> data = pd.read_csv("alt_1.logs", sep='delimiter', names=["a", "b", "c"], engine="python")
    >>> data
                                                           a   b   c
    0                                              myserer143 NaN NaN
    1                        ------------------------------- NaN NaN
    2      Stopping Symantec Management Agent for UNIX, L... NaN NaN
    3      This will remove the Symantec Management Agent... NaN NaN
    4             Are you sure you want to continue [Yy/Nn]? NaN NaN
    5                    Uninstalling dependant solutions... NaN NaN
    6      Unregistering the Altiris Base Task Handlers f... NaN NaN
    7                Unregistering the Script Task Plugin... NaN NaN
    8         Unregistering the Power Control Task Plugin... NaN NaN
    9       Unregistering the Service Control Task Plugin... NaN NaN
    

    Expected result:

    myserer143
    dbserer144
    

    Or, if it is feasible:

    myserer143 Uninstallation has finished
    dbserer144 Uninstallation has finished
    
    2 answers  |  as of 7 years ago
        1
  •  2
  •   jezrael    7 years ago

    Use shift together with startswith to build boolean masks, then filter with boolean indexing:

    import pandas as pd
    
    data = pd.read_csv("alt_1.logs", sep='delimiter', names=["a"], engine="python")
    
    m1 = data['a'].shift(-1).str.startswith('----', na=False)
    m2 = data['a'].shift(-2).str.startswith('----', na=False)
    

    Then add the last row. DataFrame.append was removed in pandas 2.0, so pd.concat is used here:

    data = pd.concat([data[m1 | m2], data.iloc[[-1]]])
    print (data)
                                   a
    0                     myserer143
    44  Uninstallation has finished.
    45                    dbserer144
    89  Uninstallation has finished.
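
    To check which rows the two masks keep, here is a minimal sketch on a made-up eight-line log. The host-a/host-b names and the step lines are hypothetical, not taken from the question's file, and blank lines are assumed to be gone already (read_csv skips them by default).

```python
import pandas as pd

# Hypothetical miniature log: two server blocks, blank lines already dropped.
lines = [
    "host-a",
    "-------",
    "step 1",
    "done.",
    "host-b",
    "-------",
    "step 1",
    "done.",
]
data = pd.DataFrame({"a": lines})

# True on the row directly above a separator (the server name).
m1 = data["a"].shift(-1).str.startswith("----", na=False)
# True on the row two above a separator (last line of the previous block).
m2 = data["a"].shift(-2).str.startswith("----", na=False)

# The final block has no separator after it, so its last row is added explicitly.
selected = pd.concat([data[m1 | m2], data.iloc[[-1]]])
print(selected["a"].tolist())
```

    The kept rows alternate between a server name and a closing line, which is what the next step relies on.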
    

    Reshape the values and join the text together:

    df = pd.DataFrame(data.values.reshape(-1,2)).apply(' '.join, 1).to_frame('data')
    print (df)
                                          data
    0  myserer143 Uninstallation has finished.
    1  dbserer144 Uninstallation has finished.
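
    The reshape step can be tried in isolation. This is only a sketch: the four values below are hypothetical stand-ins for the filtered rows, and it assumes the row count is even (name, status, name, status), otherwise reshape(-1, 2) raises an error.

```python
import pandas as pd

# Hypothetical filtered result: name, status, name, status.
data = pd.DataFrame({"a": ["host-a", "done.", "host-b", "done."]}, index=[0, 3, 4, 7])

# reshape(-1, 2) turns the flat column into rows of two values each.
pairs = data["a"].to_numpy().reshape(-1, 2)

# Join each pair with a space to get one text line per server.
df = pd.DataFrame({"data": [" ".join(pair) for pair in pairs]})
print(df)
```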
    

    Edit:

    data = pd.read_csv("alt_1.logs", sep='delimiter', names=["a"], engine="python")
    
    L = []
    with open('result.csv', 'r') as f:
        for line in f:
            line = line.strip()
            if line:
                L.append(line)
    L = L[-1:] + L
    
    out = [{'a':L[i-1], 'b':L[i-2]} for i, x in enumerate(L) if x.startswith('---') ]
    print (out)
    [{'a': 'myserer143', 'b': 'Uninstallation has finished.'}, 
     {'a': 'dbserer144', 'b': 'Uninstallation has finished.'}]
    
    df = pd.DataFrame(out)
    df['b'] = df['b'].shift(-1).fillna(df.loc[0,'b'])
    df = df.apply(' '.join, 1).to_frame('data')
    print (df)
                                          data
    0  myserer143 Uninstallation has finished.
    1  dbserer144 Uninstallation has finished.
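
    A variation on the pure Python idea that pairs each server name with the last line of its own block, so no shift is needed afterwards. This is a sketch, not the code above: summarize_blocks and the sample text are hypothetical, and it assumes every block has at least one line after its separator.

```python
def summarize_blocks(lines, marker="---"):
    """Return (name, last_line) for each block introduced by a separator line."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    # Index of every server-name row: the row just above a separator.
    starts = [i - 1 for i, row in enumerate(rows) if row.startswith(marker) and i > 0]
    result = []
    for pos, start in enumerate(starts):
        # A block ends just before the next server name, or at the end of the input.
        end = starts[pos + 1] if pos + 1 < len(starts) else len(rows)
        result.append((rows[start], rows[end - 1]))
    return result


# Hypothetical sample with a blank line inside the first block.
sample = """host-a
-------
step 1

done.
host-b
-------
step 1
failed.
"""
print(summarize_blocks(sample.splitlines()))
```

    The list of tuples can be passed straight to pd.DataFrame with two column names.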
    
        2
  •  1
  •   BernardL    7 years ago

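    Given that many lines in the data are not needed, I think it is better to prepare the data before loading it into a DataFrame.

    The lines you need always sit next to a '-------... separator line, so it makes sense to look for those separators in a generator and load only the 2 lines before each one.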

    from itertools import tee, islice, zip_longest
    import pandas as pd  # used for the DataFrame at the end
    
    results = []
    
    f = open('sample.txt','r')
    n = 2 #number of lines to check
    first = next(f)
    delim = next(f)
    
    results.append(first)
    peek, lines = tee(f)
    
    for idx, val in enumerate(lines):
        if val == delim:
            for val in islice(peek.__copy__(), idx - n, idx):
                results.append(val)
        last = idx
    
    for i in islice(peek.__copy__(), last, last + 1):
        results.append(i)
    
    results
    >> ['myserer143\n',
     'Uninstallation has finished.\n',
     'dbserer144\n',
     'Uninstallation has finished.\n',
     'dbserer144\n',
     'Uninstallation has finished.']
    

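    At this point no memory has been wasted on loading unused lines, and the returned list holds the information you need, obtained by offsetting to the lines before each separator and picking up the last line.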
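
    The same streaming idea can also be sketched with a small sliding window instead of tee and copied iterators. This is an alternative under stated assumptions, not the code above: name_and_status and the sample text are hypothetical, and blank lines are skipped.

```python
from collections import deque
import io


def name_and_status(stream, marker="---"):
    """Yield (name, last_line) per block while holding only two lines in memory."""
    window = deque(maxlen=2)  # the two most recent non-blank lines
    name = None
    for raw in stream:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(marker) and window:
            # window[-1] is the new server name; window[0] closes the previous block.
            if name is not None:
                yield name, window[0]
            name = window[-1]
        window.append(line)
    if name is not None and window:
        yield name, window[-1]  # the last block ends at the end of the input


# Hypothetical sample; io.StringIO stands in for an open file object.
sample = "host-a\n-------\nstep 1\n\ndone.\nhost-b\n-------\nstep 1\nfailed.\n"
pairs = list(name_and_status(io.StringIO(sample)))
print(pairs)
```

    Because the function is a generator over any iterable of lines, it can be fed an open file directly and the file is read only once.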


    You can then group the results into pairs, ready to load into a DataFrame, using the grouper recipe from the Python itertools documentation.

    def grouper(iterable, n, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
        args = [iter(iterable)] * n
        return zip_longest(*args, fillvalue=fillvalue)
    
    results = [i.strip() for i in results]
    data = list(grouper(results, n))
    
    df = pd.DataFrame(data, columns = ['Name','Status'])
    df
    
    >>
             Name                        Status
    0  myserer143  Uninstallation has finished.
    1  dbserer144  Uninstallation has finished.
    2  dbserer144  Uninstallation has finished.
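
    One thing to keep in mind with grouper is what happens when the list has an odd length, for example if the log was cut off right after a server name. A small sketch with hypothetical values shows the padding behaviour:

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


complete = ["host-a", "done.", "host-b", "done."]
truncated = complete + ["host-c"]  # hypothetical: log cut off after a server name

full_pairs = list(grouper(complete, 2))
padded_pairs = list(grouper(truncated, 2, fillvalue="unknown"))
print(full_pairs)
print(padded_pairs)
```

    With the default fillvalue=None the missing status would show up as a missing value in the Status column rather than raising an error.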