Python script to find LF without CR in a text file and replace it with another character

newline string python

HBreshears · Tech Community · 11 months ago

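I obtained several tab-delimited text files from the FFIEC site ( https://cdr.ffiec.gov/public/PWS/DownloadBulkData.aspx ) <Call Reports - Single Period, Schedule RIE> that have LF (line feed) characters without CR (carriage return) characters in one or more fields. The files are being uploaded to SQL Server 2022 (or used in Excel). Each record (row) of a file ends with a CRLF sequence. The problem is that when the text file is read (in Excel, or using SSIS to import into SQL Server), an LF inside a field is interpreted as the start of the next record.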

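I am aware of the \r\n difference between Windows and UNIX/Linux and suspect that Python is treating either of them as a line-ending sequence. I have not tried Latin-1 or cp1252 encoding.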

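I am running Windows 11 Pro. The script is called from a shell command (from a SQL stored procedure or Excel VBA) and is part of a larger group of scripts that clean the files for import.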

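My attempted solution is to read the file, iterate over it one character at a time, find each LF '\n' that is not preceded by a CR '\r', and replace it with a semicolon ';'.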

Python code (v3.12):

import sys

def stripLFwoCR_file(file_path):
    # Read the entire file contents
    with open(file_path, 'r', encoding='utf-8') as file:
        input_data = file.read()

    # Initialize output
    output_data = []

    # Iterate input content 1 character at a time
    # Replace line feed characters '\n' not preceded by carriage return characters '\r' with ';'
    i = 0
    while i < len(input_data):
        if input_data[i] == '\n':
            # If previous character is not '\r' then replace '\n' with ';'
            if i == 0 or input_data[i-1] != '\r':
                output_data.append(';')
            # Skip this '\n'
        else:
            output_data.append(input_data[i])
        i += 1

    # Write the modified content back to the file, overwriting it
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(''.join(output_data))

if __name__ == "__main__":
    args = sys.argv
    # args[0] = current file
    # args[1] = function name
    # args[2:] = function args : (*unpacked)
    globals()[args[1]](*args[2:])

The problem is that the script replaces every LF and every CRLF in the file with ';'.

(Screenshot: sample showing the original document, LF without CR.) Lines 10-14 are part of the same record. Lines 16-21 are one record.

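Update: I needed to read the manual! Since 3.x, Python can optionally leave newline characters alone or automatically translate them to a different one. My original code also had a logic error in the while loop.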

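I ended up using the following, because it required less rewriting of the rest of my code. I did test @JRiggles' answer and marked it as the solution (cleaner, less code):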

import sys

def stripLFwoCR_file(file_path):
    # Read the entire file contents
    with open(file_path, 'r', encoding='utf-8', newline='\r\n') as file:
        input_data = file.read()

    # Initialize output
    output_data = []

    # Iterate input content 1 character at a time
    # Replace line feed characters '\n' not preceded by carriage return characters '\r' with ';'
    i = 0
    while i < len(input_data):
        if input_data[i] == '\n':
            # If previous character is not '\r' then replace '\n' with ';'
            if i == 0 or input_data[i-1] != '\r':
                # Skip this '\n' and replace
                output_data.append(';')
            else:
                output_data.append(input_data[i])
        else:
            output_data.append(input_data[i])
        i += 1

    # Write the modified content back to the file, overwriting it
    with open(file_path, 'w', encoding='utf-8', newline='\n') as file:
        file.write(''.join(output_data))

if __name__ == "__main__":
    args = sys.argv
    # args[0] = current file
    # args[1] = function name
    # args[2:] = function args : (*unpacked)
    globals()[args[1]](*args[2:])
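
To illustrate the newline handling mentioned in the update, here is a minimal sketch showing how the `newline` argument of `open()` changes what `read()` returns. The temporary file and the helper name `read_with_newline` are made up for this demonstration and are not part of the original script. With the default `newline=None`, every `\r\n` is translated to `\n` on read, which would explain why the first version of the loop could not tell a lone LF from a CRLF.

```python
import os
import tempfile


def read_with_newline(path, newline):
    """Read a whole text file using the given `newline` setting (hypothetical helper)."""
    with open(path, 'r', encoding='utf-8', newline=newline) as f:
        return f.read()


# Write raw bytes so that no translation happens when creating the demo file
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(b'field one\nstill field one\tfield two\r\n')

default_text = read_with_newline(demo_path, None)   # universal newlines: '\r\n' is read as '\n'
untranslated = read_with_newline(demo_path, '')     # line endings returned exactly as stored
crlf_only = read_with_newline(demo_path, '\r\n')    # only '\r\n' ends a line; nothing is translated

os.remove(demo_path)

print(repr(default_text))
print(repr(untranslated))
print(repr(crlf_only))
```

In read mode, both `newline=''` and `newline='\r\n'` leave the characters untouched, so either should preserve the `\r` that the loop checks for.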

1 answer | as of 11 months ago

JRiggles 11 months ago

This sounds like a job for re.sub. The pattern (?<!\r)\n will match any LF character \n that is not preceded by a carriage return (CR) \r.

Here is an example file, sample data.txt (screenshot showing line endings)

To avoid any line-ending conversion, open the file in binary read mode 'rb'

import re


pattern = b'(?<!\r)\n'  # match any \n not preceded by \r

with open(r'<path to>\sample data.txt', 'rb') as file:
    data = file.read()
    print('Pre-substitution: ', data)
    # replace any matches with a semicolon ';'
    result = re.sub(pattern, b';', data)
    print('Post-substitution: ', result)

This prints:

Pre-substitution:  b'this line ends with CRLF\r\nthis line ends with LF\nthis line ends with CRLF\r\nthis line ends with LF\nthis line ends with CRLF\r\n'
Post-substitution:  b'this line ends with CRLF\r\nthis line ends with LF;this line ends with CRLF\r\nthis line ends with LF;this line ends with CRLF\r\n'

It is worth mentioning that consecutive \n characters will all be replaced, so \n\n\n becomes ;;; and \r\n\n becomes \r\n; .
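
The consecutive-LF behaviour can be checked with a short sketch. The input byte strings below are invented for illustration, and the pattern is written as a raw bytes literal, which the `re` module interprets the same way as the pattern above.

```python
import re

pattern = rb'(?<!\r)\n'  # LF not immediately preceded by CR

# Three lone LFs in a row: each one is preceded by something other than CR
lone_lfs = re.sub(pattern, b';', b'a\n\n\nb')

# CRLF followed by a lone LF: only the second LF matches
after_crlf = re.sub(pattern, b';', b'a\r\n\nb')

print(lone_lfs)    # b'a;;;b'
print(after_crlf)  # b'a\r\n;b'
```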

Also note that the pattern string and the replacement value are both bytestrings ( b'<str>' ). If you don't do that, you'll get a TypeError !
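
The example above only prints the result. As a rough sketch of how the same substitution might be applied to a file in place, the function below reads and writes in binary mode so that no line-ending translation occurs in either direction. The function name `replace_lone_lf` and its `replacement` parameter are assumptions made for illustration and do not come from the question or the answer.

```python
import re

# Compiled once; bytes pattern because the file is handled in binary mode
LONE_LF = re.compile(rb'(?<!\r)\n')


def replace_lone_lf(file_path, replacement=b';'):
    """Replace every LF not preceded by CR and overwrite the file (hypothetical helper).

    Returns the number of replacements made.
    """
    with open(file_path, 'rb') as f:
        data = f.read()

    # subn returns the new bytes plus how many substitutions were performed
    cleaned, count = LONE_LF.subn(replacement, data)

    with open(file_path, 'wb') as f:
        f.write(cleaned)
    return count
```

Returning the count is a design choice that lets a calling script log how many embedded line feeds were cleaned. Like the script in the question, this reads the whole file into memory, which is assumed to be acceptable for the file sizes involved.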