
How do I remove embedded newlines from a delimited file?

  •  1
  • Clay  · Tech Community  · 11 years ago

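    I have a caret-delimited file. The only carets in the file are delimiters; none appear in the text. Some of the fields are free-text fields and contain embedded newline characters, which makes the file very hard to parse. I need to keep the newline at the end of each record, but remove the newlines from inside the text fields.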

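    The data comes from the Global Integrated Shipping Information System. Below are three records, preceded by a header row. In the first record the ship name is NORMANNIA, in the second it is "Unkown" (spelled that way in the data), and in the third it is KOTA BINTANG.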

    ship_name^ship_flag^tonnage^date^time^imo_num^ship_type^ship_released_on^time_zone^incident_position^coastal_state^area^lat^lon^incident_details^crew_ship_cargo_conseq^incident_location^ship_status_when_attacked^num_involved_in_attack^crew_conseq^weapons_used_by_attackers^ship_parts_raided^lives_lost^crew_wounded^crew_missing^crew_hostage_kidnapped^assaulted^ransom^master_crew_action_taken^reported_to_coastal_authority^reported_to_which_coastal_authority^reporting_state^reporting_intl_org^coastal_state_action_taken
    NORMANNIA^Liberia^24987^2009-09-19^22:30^9142980^Bulk carrier^^^Off Pulau Mangkai,^^South China Sea^3° 04.00' N^105° 16.00' E^Eight pirates armed with long knives and crowbars boarded the ship underway. They broke into 2/O cabin, tied up his hands and threatened him with a long knife at his throat. Pirates forced the 2/O to call the Master. While the pirates were waiting next to the Master’s door, they seized C/E and tied up his hands. The pirates rushed inside the Master’s cabin once it was opened. They threatened him with long knives and crowbars and demanded money. Master’s hands were tied up and they forced him to the aft station. The pirates jumped into a long wooden skiff with ship’s cash and crew personal belongings and escaped. C/E and 2/O managed to free themselves and raised the alarm^Pirates tied up the hands of Master, C/E and 2/O. The pirates stole ship’s cash and master’s, C/E & 2/O cash and personal belongings^In international waters^Steaming^5-10 persons^Threat of violence against the crew^Knives^^^^^^^^SSAS activated and reported to owners^^Liberian Authority^^ICC-IMB Piracy Reporting Centre Kuala Lumpur^-
    Unkown^Marshall Islands^19846^2013-08-28^23:30^^General cargo ship^^^Cam Pha Port^Viet Nam^South China Sea^20° 59.92' N^107° 19.00' E^While at anchor, six robbers boarded the vessel through the anchor chain and cut opened the padlock of the door to the forecastle store. They removed the turnbuckle and lashing of the forecastle store's rope hatch. The robbers escaped upon hearing the alarm activated when they were sighted by the 2nd officer during the turn-over of duty watch keepers.^"There was no injury to the crew however, the padlock of the door to the forecastle store and the rope hatch were cut-opened.
    
    Two centre shackles and one end shackle were stolen"^In port area^At anchor^5-10 persons^^None/not stated^Main deck^^^^^^^-^^^Viet Nam^"ReCAAP ISC via ReCAAP Focal Point (Vietnam)
    
    ReCAAP ISC via Focal Point (Singapore)"^-
    KOTA BINTANG^Singapore^8441^2002-05-12^15:55^8021311^Bulk carrier^^UTC^^^South China Sea^^^Seven robbers armed with long knives boarded the ship, while underway. They broke open accommodation door, held hostage a crew member and forced the Master to open his cabin door. They then tied up the Master and crew member, forced them back onto poop deck from where the robbers jumped overboard and escaped in an unlit boat^Master and cadet assaulted; Cash, crew belongings and ship's cash stolen^In territorial waters^Steaming^5-10 persons^Actual violence against the crew^Knives^^^^^^2^^-^^Yes. SAR, Djakarta and Indonesian Naval Headquarters informed^^ICC-IMB PRC Kuala Lumpur^-
    

    You will notice that the first and third records are fine and easy to parse. The second record, "Unkown", contains embedded newlines.

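    How should I remove the embedded newlines (but not the newline at the end of each record) in a Python script, or by some other means if there is an easier one, so that I can import this data into SAS?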

    3 answers  |  latest 11 years ago
        1
  •  2
  •   Vorsprung    11 years ago

    Load the data into a string a, then run:

    import re
    newa=re.sub('\n','',a)
    

    After this, newa contains no newlines at all. Alternatively:

    newa=re.sub('\n(?!$)','',a)
    

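    This keeps the newline at the very end of the string and strips the rest. Note that without re.MULTILINE, $ matches only at the end of the whole string (or just before its final newline), so the newlines that end each record are removed too. A plain \n pattern cannot tell a record-ending newline from an embedded one.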
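    If a regex is still preferred, one variant is to touch only the newlines that sit inside double-quoted fields. The sketch below assumes that every embedded newline in the file is inside a "..." field, as in the sample record, and that double quotes are not used for anything else. The function name strip_quoted_newlines is made up for this illustration.

```python
import re

def strip_quoted_newlines(text):
    """Collapse newlines found inside double-quoted fields into one space."""
    def _flatten(match):
        # Replace each run of newlines (plus surrounding whitespace) with a space
        return re.sub(r'\s*\n\s*', ' ', match.group(0))
    # Assumption: a quoted field never contains a double quote itself
    return re.sub(r'"[^"]*"', _flatten, text)

sample = 'a^"line one\n\nline two"^c\nd^e^f\n'
print(repr(strip_quoted_newlines(sample)))
```

    Newlines outside the quotes, including the one that ends each record, are left alone.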

        2
  •  2
  •   VooDooNOFX    11 years ago

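    I see you have tagged this as regex, but I would suggest using the built-in csv library to parse it. The csv library will parse the file correctly and keep the newlines where they belong.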

    Python csv examples: http://docs.python.org/2/library/csv.html
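    A minimal sketch of that suggestion, written for Python 3 and assuming (as in the sample) that fields with embedded newlines are wrapped in double quotes. The helper name flatten_records and the in-memory sample are illustrative only; the delimiter and lineterminator arguments are standard csv options.

```python
import csv
import io

def flatten_records(src, dst, delimiter='^'):
    """Copy delimited records from src to dst, one record per output line."""
    reader = csv.reader(src, delimiter=delimiter)
    writer = csv.writer(dst, delimiter=delimiter, lineterminator='\n')
    for row in reader:
        if not row:
            continue  # skip completely blank lines between records
        # Join the non-empty lines of each field with a single space
        writer.writerow(
            [' '.join(p.strip() for p in field.splitlines() if p.strip())
             for field in row]
        )

sample = 'name^note^end\nA^"first\n\nsecond"^-\nB^plain^-\n'
out = io.StringIO()
flatten_records(io.StringIO(sample), out)
print(out.getvalue())
```

    For real files, open both with newline='' as the csv documentation recommends, for example open("events.csv", newline='') for reading. The writer drops quotes that are no longer needed, and may re-quote any field that still contains a double quote.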

        3
  •  1
  •   Clay    11 years ago

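    I solved this by counting the delimiters encountered and manually switching to a new record once the count reached the number associated with a single record. I then stripped out all the newline characters and wrote the data back to a new file. In essence, the result is the original file with the newlines removed from inside the fields, but with a newline at the end of each record. Here is the code: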

    # The header row has 34 fields, so a complete record contains 33 carets
    carets_per_record = 33

    records  = []
    temp_str = ''
    building = False

    # Physical lines in the input may be fragments of a single record
    with open("events.csv", "r") as f:
        for line in f:
            # Start a new record, or append this fragment to the one in progress
            if building:
                temp_str = temp_str + " " + line
            else:
                temp_str = line

            # The record is complete once it holds the expected number of carets
            building = temp_str.count('^') < carets_per_record
            if not building:
                # Strip the embedded newline characters and keep the record
                records.append(temp_str.replace('\n', ''))

    # Write one record per line; text mode ("w", not "wb") so that
    # strings can be written under Python 3 as well
    with open("new_events.csv", "w") as g:
        for record in records:
            g.write(record + '\n')
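    A quick sanity check on the result is to confirm that every output line has the expected number of delimiters. This is only a sketch: the function name find_malformed_lines is hypothetical, and the default of 33 carets is taken from the code above.

```python
def find_malformed_lines(lines, expected_carets=33, delimiter='^'):
    """Return (line_number, caret_count) for each line with the wrong count."""
    problems = []
    for number, line in enumerate(lines, start=1):
        count = line.count(delimiter)
        if count != expected_carets:
            problems.append((number, count))
    return problems

# Tiny in-memory demonstration with 2 expected delimiters per record
demo = ["a^b^c\n", "d^e\n", "f^g^h\n"]
print(find_malformed_lines(demo, expected_carets=2))
```

    Run against the real output, this would be called as find_malformed_lines(open("new_events.csv")); an empty list means every line has exactly 33 carets.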