
How can I extract the inner text from matching lines of a multi-line text in Python?

  •  2
  • MAPK  · 7 years ago

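    I have a text file called test.txt. From test.txt, I want to grab the lines that begin with >lcl, then extract the value that follows locus_tag= up to the closing bracket ]. I want to do the same for the value that follows location=. The result I want is shown below. How can I do this in Python?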

    SS1G_08319  <504653..>506706
    SS1G_12233  complement(<502136..>503461)
    SS1G_02099  <2692251..>2693298
    SS1G_05227  complement(<1032740..>1033620)
    

    test.txt

    >lcl|NW_001820825.1_gene_208 [locus_tag=SS1G_08319] [db_xref=GeneID:5486863] [partial=5',3'] [location=<504653..>506706] [gbkey=Gene]
    ATGGGCAAAGCTTCTAGGAATAAGACGAAGCATCGCGCTGATCCTACCGCAAAAACTGTTAAGCCACCCA
    CTGACCCAGAGCTTGCAGCAATTCGAGTTAACAAAATTCTGCCAATTCTCCAAGATTTACAAAGTGCAGA
    CCAGTCAAAGAGATCAACTGCTGCAACTGCCATTGCGAACCTCGTTGACGATACAAAATGTCGAAAGTTA
    TTCTTGAGAGAGCAAATTGTTCGTATTCTACTCGAACAAACCCTTACAGACTCAAGCATGGAAACTAGAA
    >lcl|NW_001820817.1_gene_205 [locus_tag=SS1G_12233] [db_xref=GeneID:5483157] [partial=5',3'] [location=complement(<502136..>503461)] [gbkey=Gene]
    ATGATCTGTAATACGCTCGGTGTTCCACCCTGCAACAGAATTCTTAAGAAATTCTCCGTTGGCGAGAGTC
    GTCTCGAAATTCAAGACTCAGTACGAGGCAAAGATGTCTACATCATTCAATCGGGTGGAGGAAAGGCCAA
    TGATCACTTCGTGGATCTTTGCATTATGATCTCCGCATGCAAAACTGGCTCTGCCAAGCGCGTCACTGTC
    GTCCTTCCTTTGTTTCCTTATTCACGACAACCTGATCTGCCATACAACAAGATTGGCGCACCACTTGCCA
    >lcl|NW_001820834.1_gene_1034 [locus_tag=SS1G_02099] [db_xref=GeneID:5493612] [partial=5',3'] [location=<2692251..>2693298] [gbkey=Gene]
    ATGGCTTCTGTTTACAAGTCATTATCAAAGACCTCTGGTCATAAAGAAGAAACCCCGACTGGTGTCAAGA
    AAAACAAGCAAAGAGTTTTGATCTTGTCTTCAAGAGGAATAACTTACAGGTATATAAATTTGTACCGATG
    CGATGCAAAAAATCGCAGGAAAATGCTAACTCTACAACTTAGACATCGACATCTCCTCAATGACCTTGCG
    TCCCTACTTCCCCACGGTAGGAAAGATGCGAAACTCGATACCAAGTCAAAGCTTTATCAATTGAATGAAT
    >lcl|NW_001820830.1_gene_400 [locus_tag=SS1G_05227] [db_xref=GeneID:5489764] [partial=5',3'] [location=complement(<1032740..>1033620)] [gbkey=Gene]
    ATGGCGGACGGATGTAAGTTAATTGATGTTCCTACTATTCCAGACTAATATTTGTTCTCGTCCCTACAAT
    GCATTCGGAACGGATGGTACTCAGTTAACTTTGTAACTAATACAACGTCTAGTAAATGACCAAAGAACTG
    

    I am new to Python, so this is what I have tried so far:

    results = []
    f = open("test.txt", 'r')
    
    while True:
        line = f.readline()
        if not line:
            break
        file_name = line.split("locus_tag")[-1].strip()
        f.readline()  # skip line 
        data_seq1 = f.readline().strip()
        f.readline()  
        data_seq2 = f.readline().strip()
        results.append((file_name, data_seq1))
    
    2 answers  |  most recent 7 years ago
        1
  •  4
  •   Chiheb Nexus    7 years ago

    I think the easiest way to solve your problem is to use a regex:

    import re
    
    results = []
    # Open the file in read mode;
    # the with statement takes care of closing the file
    with open('YOUR_FILE_PATH', 'r') as f_file:
        # Read the entire file as one string
        data = f_file.read()
        # Search for the lines that begin with '>lcl' and contain
        # both [locus_tag=...] and [location=...]
        results = re.findall(r'>lcl.*\[locus_tag=(.*?)\].*\[location=(.*?)\]', data)
    
    for locus, location in results:
        print(locus, location)
    

    SS1G_08319 <504653..>506706
    SS1G_12233 complement(<502136..>503461)
    SS1G_02099 <2692251..>2693298
    SS1G_05227 complement(<1032740..>1033620)
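
    The pattern above works on the whole file because `.` does not match a newline, so each match stays inside one header line. As a minimal sketch (not part of the original answer), the same idea can be made a little stricter by anchoring at the start of each line with `re.MULTILINE` and capturing "anything except `]`". The name `HEADER` and the shortened in-memory `sample` are assumptions made for illustration; the real data would come from the file.

```python
import re

# Two header lines copied from test.txt; the sequence lines are shortened sample data
sample = """\
>lcl|NW_001820825.1_gene_208 [locus_tag=SS1G_08319] [db_xref=GeneID:5486863] [partial=5',3'] [location=<504653..>506706] [gbkey=Gene]
ATGGGCAAAGCTTCTAGG
>lcl|NW_001820817.1_gene_205 [locus_tag=SS1G_12233] [db_xref=GeneID:5483157] [partial=5',3'] [location=complement(<502136..>503461)] [gbkey=Gene]
ATGATCTGTAATACGCTC
"""

# ^ with re.MULTILINE matches at the start of every line;
# [^\]]* captures everything up to (not including) the closing bracket
HEADER = re.compile(
    r'^>lcl.*?\[locus_tag=([^\]]*)\].*?\[location=([^\]]*)\]',
    re.MULTILINE,
)

pairs = HEADER.findall(sample)
for locus, location in pairs:
    # Tab-separated output, one header per line
    print(locus, location, sep='\t')
```

    Both patterns assume that `locus_tag` appears before `location` on the header line, as it does in the sample file.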
    

    Another variant stores the results in a dict and processes the file line by line:

    import re
    
    results = {}
    with open('fichier1', 'r') as f_file:
        # Split the file into a list of lines
        data = f_file.readlines()
        for line in data:
            # Search the lines that begin with '>lcl',
            # using the same pattern as in the first attempt
            results.update(re.findall(r'^>lcl.*\[locus_tag=(.*?)\].*\[location=(.*?)\]', line))
    
    for locus, location in results.items():
        print(locus, location)
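
    One consequence of the dict variant is worth spelling out: `re.findall` with two groups returns a list of `(locus, location)` tuples, and `dict.update` treats each tuple as a key/value pair, so a locus tag that appears twice keeps only its last location. The sketch below uses hand-written tuples in place of real matches; the repeated tag and its location are made up for illustration.

```python
results = {}

# Each list stands in for what re.findall would return for one header line
results.update([('SS1G_08319', '<504653..>506706')])
results.update([('SS1G_12233', 'complement(<502136..>503461)')])

# Hypothetical repeat of an earlier locus tag: the old value is overwritten
results.update([('SS1G_08319', '<1..>10')])

# A line without a match yields an empty list, which leaves the dict unchanged
results.update([])

for locus, location in results.items():
    print(locus, location)
```

    On Python 3.7 and later a dict preserves insertion order, so the loci print in the order they were first seen; on older interpreters the order is arbitrary.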
    

    Edit: create a DataFrame and export it to a csv file:

    import re
    from pandas import DataFrame as df
    
    results = {}
    with open('fichier1', 'r') as f_file:
        data = f_file.readlines()
        for line in data:
            results.update(re.findall(
                r'^>lcl.*\[locus_tag=(.*?)\].*\[location=(.*?)\]',
                line
            ))
    
    df_ = df(
        list(results.items()),
        index=range(1, len(results) + 1),
        columns=['locus', 'location']
    )
    print(df_)
    df_.to_csv('results.csv', sep=',')
    

    This prints the following and creates a file named results.csv:

            locus                        location
    1  SS1G_12233    complement(<502136..>503461)
    2  SS1G_08319                <504653..>506706
    3  SS1G_05227  complement(<1032740..>1033620)
    4  SS1G_02099              <2692251..>2693298
    
        2
  •  2
  •   Mad Physicist    7 years ago

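    I would like to offer two alternative solutions. One uses a regular expression to extract any set of named tags from a line; the other is completely facetious, but shows a way to do it without regular expressions.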

    import re
    
    def get_tags(filename, tags, prefix='>lcl'):
        tags = set(tags)
        pattern = re.compile(r'\[(.+?)=(.+?)\]')
    
        def parse_line(line):
            return {m.group(1): m.group(2) for m in pattern.finditer(line) if m.group(1) in tags}
    
        with open(filename) as f:
            return [parse_line(line) for line in f if prefix is None or line.startswith(prefix)]
    

    This function returns a list of dictionaries keyed by the tags you are interested in. You can use it like this:

    tags = ['locus_tag', 'location']
    result = get_tags('test.txt', tags)
    

    for line in get_tags('test.txt', tags):
        print(*(line[tag] for tag in tags))
    

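    The advantage of this approach is that you can use the result later however you choose, and you can configure which tags are extracted.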
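
    One caveat with the usage loop: `line[tag]` raises a `KeyError` if a header lacks one of the requested tags. A minimal sketch of a more tolerant variant follows. `parse_headers` is a hypothetical helper that mirrors `get_tags` but takes any iterable of lines instead of a filename, and the second header is made up to show the missing-tag case.

```python
import re

TAG_PATTERN = re.compile(r'\[(.+?)=(.+?)\]')

def parse_headers(lines, tags, prefix='>lcl'):
    """Hypothetical helper: like get_tags, but reads from an iterable of lines."""
    wanted = set(tags)
    return [
        {m.group(1): m.group(2)
         for m in TAG_PATTERN.finditer(line)
         if m.group(1) in wanted}
        for line in lines
        if line.startswith(prefix)
    ]

lines = [
    ">lcl|NW_001820825.1_gene_208 [locus_tag=SS1G_08319] [location=<504653..>506706] [gbkey=Gene]",
    "ATGGGCAAAGCTTCTAGG",
    ">lcl|example_without_location [locus_tag=EXAMPLE_1]",  # made-up header
]

tags = ['locus_tag', 'location']
# dict.get supplies an empty string when a tag is absent from a header
rows = [[record.get(tag, '') for tag in tags]
        for record in parse_headers(lines, tags)]

for row in rows:
    print('\t'.join(row))
```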

    No-regex solution

    This version is just something I wrote to show that it is possible. Please do not emulate it, since the code is a pointless maintenance burden.

    def get_tags2(filename, tags, prefix='>lcl'):
        tags = set(tags)
    
        def parse_line(line):
            items = [tag.split(']')[0].split('=') for tag in line.split('[')[1:]]
            return dict(tag for tag in items if tag[0] in tags)
    
        with open(filename) as f:
            return [parse_line(line) for line in f if prefix is None or line.startswith(prefix)]
    

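    This function behaves like the first one, but its parsing function is a tangled mess by comparison. It is also much less robust, for example because it assumes that all the square brackets are more or less matched.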
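
    To make the fragility concrete, here is a small sketch with a made-up header whose value contains an `=` sign. `parse_line_split` copies the logic of `parse_line` in `get_tags2`; `parse_line_partition` is a hypothetical variant that splits only at the first `=`. Neither the header nor the `note` tag comes from the sample file.

```python
def parse_line_split(line, tags):
    # Same logic as parse_line in get_tags2: split every item on each '='
    items = [tag.split(']')[0].split('=') for tag in line.split('[')[1:]]
    return dict(tag for tag in items if tag[0] in tags)

def parse_line_partition(line, tags):
    # Hypothetical variant: split each item at the first '=' only
    result = {}
    for chunk in line.split('[')[1:]:
        key, sep, value = chunk.split(']')[0].partition('=')
        if sep and key in tags:
            result[key] = value
    return result

header = ">lcl|example [locus_tag=EXAMPLE_1] [note=ratio=2]"  # made-up header
tags = {'locus_tag', 'note'}

try:
    parse_line_split(header, tags)
    split_failed = False
except ValueError:
    # 'note=ratio=2' splits into three parts, which dict() cannot use as a pair
    split_failed = True

parsed = parse_line_partition(header, tags)
print(split_failed, parsed)
```

    Even the `partition` variant still breaks on a value that contains `[` or `]`, which is one more reason to prefer the regex version.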

    Here is an IDEOne link that demonstrates both approaches: https://ideone.com/X2LKqL