Get the third-to-last line in which a string occurs in a file

python

Kade Williams · Tech Community · 7 years ago

I am searching a log file that contains IP addresses.
Sample log:

10.1.177.198 Tue Jun 19 09:25:16 CDT 2018
10.1.160.198 Tue Jun 19 09:25:38 CDT 2018
10.1.177.198 Tue Jun 19 09:25:36 CDT 2018
10.1.160.198 Tue Jun 19 09:25:40 CDT 2018
10.1.177.198 Tue Jun 19 09:26:38 CDT 2018
10.1.177.198 Tue Jun 19 09:27:16 CDT 2018
10.1.177.198 Tue Jun 19 09:28:38 CDT 2018

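I can currently get the IP address from the last line of the log. I can also find the line numbers of all lines that contain that same IP address.

If the last IP address in the log is listed 3 or more times, how can I get the line number of the third-to-last occurrence of that IP address?

For example, I want to get the line number of this line: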

10.1.177.198 Tue Jun 19 09:26:38 CDT 2018

Or, better yet, print the whole line.

Here is a sample of my code:

import re

def run():

    try:
        logfile = open('read.log', 'r')

        for line in logfile:  
            x1 = line.split()[0]
            for num, line in enumerate(logfile, 0):
                if x1 in line:
                    print("Found " + x1 + " at line:", num)

        print ('Last Line: ' + x1)

        logfile.close()
    except OSError as e:
        print (e)

run()

This lists all the line numbers where a specific IP address occurs:

print("Found " + x1 + " at line:", num)

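I would like to print a single line where "num" is the third-to-last entry in that list of line numbers.

My overall goal is to get the IP address from the last line of the log file, then check whether it has been listed more than 3 times. If it has, I want to find the third-to-last listing of that address and get its line number (or just print the address and date listed on that line).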

3 replies | up to 7 years ago

Venkata Gogu 7 years ago

Keep track of all occurrences and print the third from last. This could be optimized by using heapq.

def run():
    try:
        # Map each IP address to the (line number, time) of every occurrence.
        ip_address_line_number = dict()
        # The with-block closes the file even if parsing fails.
        with open('log.txt', 'r') as logfile:
            for index, line in enumerate(logfile, 1):
                fields = line.split()
                x1 = fields[0]
                log_time = fields[4]
                ip_address_line_number.setdefault(x1, []).append((index, log_time))

        # After the loop, x1 holds the IP address from the last line.
        if len(ip_address_line_number[x1]) > 2:
            print('Third-to-last occurrence: ' + str(ip_address_line_number[x1][-3]))
        else:
            print(x1 + ' has 0-2 occurrences')
    except OSError as e:
        print(e)

run()
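
The answer above stores every occurrence of every address. As one possible way to bound that, here is a rough sketch that swaps in collections.deque with maxlen (not the heapq idea mentioned above) so that only the last n occurrences per address are kept. The function name nth_last_occurrence and its parameters are made up for illustration, and it assumes the IP address is the first whitespace-separated field on each line.

```python
from collections import defaultdict, deque


def nth_last_occurrence(lines, n=3):
    """Return (line_number, line) for the n-th-to-last line whose first field
    equals the first field of the last line, or None if there are fewer than n.

    `lines` is any iterable of strings, e.g. an open file object.
    """
    # One bounded deque per address: older entries fall off automatically.
    recent = defaultdict(lambda: deque(maxlen=n))
    last_ip = None
    for number, line in enumerate(lines, 1):
        fields = line.split()
        if not fields:          # ignore blank lines (they still count for numbering)
            continue
        last_ip = fields[0]
        recent[last_ip].append((number, line.rstrip("\n")))

    if last_ip is None or len(recent[last_ip]) < n:
        return None
    # The deque holds the last n occurrences, so its first item is the n-th-to-last.
    return recent[last_ip][0]
```

With a real log this might be called as `with open('read.log') as f: print(nth_last_occurrence(f))`. Memory use then grows with the number of distinct addresses rather than with the number of lines.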

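pylang 7 years ago

If the file is read in reverse:

What is the third observation of the first IP?
The file must contain at least 3+1 observations of the first IP.

There are many tools that could give simpler code, but here is a flexible, general approach aimed at memory efficiency. Roughly, let's:

read the file backwards
count up to 3+1 observations
return the last observation

Given

A file test.log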

# test.log 
10.1.177.198 Tue Jun 19 09:25:16 CDT 2018
10.1.160.198 Tue Jun 19 09:25:38 CDT 2018
10.1.177.198 Tue Jun 19 09:25:36 CDT 2018
10.1.160.198 Tue Jun 19 09:25:40 CDT 2018
10.1.177.198 Tue Jun 19 09:26:38 CDT 2018
10.1.177.198 Tue Jun 19 09:27:16 CDT 2018
10.1.177.198 Tue Jun 19 09:28:38 CDT 2018

and the code for a reverse_readline() generator, we can write:

Code

def run(filename, target=3, min_=3):
    """Return the line number and data of the `target`-last observation.

    Parameters
    ----------
    filename : str or Path
        Filepath or name to file.
    target : int
        Which observation to return, counted from the bottom,
        e.g. 3 means "third-to-last observation".
    min_ : int
        Total observations must exceed this number.

    """
    idx, prior, data, total_obs = 0, (), [], 0
    i = -1                                                 # stays -1 for an empty file
    for i, line in enumerate(reverse_readline(filename)):
        ip, text = line.strip().split(maxsplit=1)
        if i == 0:
            target_ip = ip                                 # IP on the file's last line
        if ip == target_ip:
            total_obs += 1                                 # count every observation
            target -= 1
            prior = i, ip, text
            if target == 0:                                # reached the wanted observation
                idx, *data = prior

    # Edge case: too few observations, or `target` was never reached
    if total_obs <= min_ or not data:
        print(f"Minimum observations was not met.  Got {total_obs} observations.")
        return None

    # Compute line number
    line_num = (i - idx) + 1                               # add 1 (zero-indexed)
    return [line_num] + data

Demo

run("test.log")
# [5, '10.1.177.198', 'Tue Jun 19 09:26:38 CDT 2018']

Second-to-last observation:

run("test.log", 2)
# [6, '10.1.177.198', 'Tue Jun 19 09:27:16 CDT 2018']

Minimum required observations:

run("test.log", 2, 7)
# Minimum observations was not met.  Got 5 observations.

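Add error handling as needed.

Details

Note: an "observation" is a line containing the target IP.

We iterate over the memory-efficient reverse_readline() generator.
The target_ip is determined by the "first" line of the reversed file.
We are only interested in the third observation, so we do not need to save all the information. As we iterate, we temporarily hold only one observation at a time in prior (reducing memory consumption).
target is a counter that is decremented after each observation. When the target counter reaches 0, the prior observation is saved until the generator is exhausted.
prior is a tuple holding the line data of the most recent observation of the target IP address, i.e. index, address and text.
The generator is run to exhaustion to determine total_obs and the length of the file; the latter is used to compute line_num.
The computed line number and line data are returned.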
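
The run() function above calls reverse_readline(), which this answer refers to but does not reproduce. Below is a minimal sketch of what such a generator could look like. It is not the exact code the answer links to. It assumes a UTF-8 encoded file with "\n" line endings, and it skips empty lines, which would throw off the line-number arithmetic in run() if the log contained blank lines. The buf_size parameter is a hypothetical tuning knob.

```python
import os


def reverse_readline(filename, buf_size=8192):
    """Yield the non-empty lines of a text file from last to first,
    reading fixed-size chunks from the end instead of loading the whole file."""
    with open(filename, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        position = fh.tell()
        tail = b""  # start of a line whose beginning lies in an earlier chunk
        while position > 0:
            read_size = min(buf_size, position)
            position -= read_size
            fh.seek(position)
            chunk = fh.read(read_size) + tail
            pieces = chunk.split(b"\n")
            # The first piece may be cut off at the chunk boundary; carry it over.
            tail = pieces.pop(0)
            for raw in reversed(pieces):
                if raw:  # skip empty pieces, e.g. after the trailing newline
                    yield raw.decode("utf-8")
        if tail:
            yield tail.decode("utf-8")  # the very first line of the file
```

Lines are decoded only once they are complete, so a multi-byte character that straddles a chunk boundary should not cause a decode error.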

SpghttCd 7 years ago

With pandas this is quite short:

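import pandas as pd

# Assumes fixed-width lines: the IP fills the first 12 characters and the timestamp starts at column 13.
df = pd.read_fwf('read.log', colspecs=[(None, 12), (13, None)], header=None, names=['IP', 'time'])

lastIP = df.IP[df.index[-1]]                      # IP address on the last line
lastIP_idx = df.groupby('IP').groups[lastIP]      # row labels of every line with that IP

n = 3
if len(lastIP_idx) >= n:
    print('\t'.join(list(df.loc[lastIP_idx[-n]])))
else:
    print('occurrence number of ' + lastIP + ' < ' + str(n))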