
Correctly splitting a CSV file at repeated sections in pandas

split csv pandas python

martin · Tech Community · 6 years ago

I have a CSV file with 5000 lines, and every few hundred lines it contains a repeated section.

The file looks like this:

Header1
number of Samples1
Content1
a1, aa1, aaa1
b1, bb1, bbb1
....
Header2
number of Samples2
Content2
a2, aa2, aaa2
b2, bb2, bbb2
....
Header3
number of Samples3
Content3
a3, aa3, aaa3
b3, bb3, bbb3

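I want to split the file at each Header line, but I don't know how to do this. I wrote a whole script to process some biology data, but one of the file types (shown above) causes problems because it is really several files in one, and the script cannot handle it.

I have read a lot about splitting files, but I found nothing about splitting after a repeated value in pandas.

In this case the result would be 3 files (but the number of these sub-files varies from file to file).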


vurmux · 6 years ago

I found a better solution than the break I suggested in the comments:

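Read the file line by line and store each block's data as a separate element (a dict, for example) of a list named result. If the line you read is not a header line, it is guaranteed to belong to the current block, which is the last element of result. When you read a header line, just append a new element to result.

If the size of each block is constant, you can use an itertools.cycle iterator to "cycle" your parsing process: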

from itertools import cycle

text1 = """Header1
number of Samples1
Content1
a1, aa1, aaa1
b1, bb1, bbb1
Header2
number of Samples2
Content2
a2, aa2, aaa2
b2, bb2, bbb2"""
size = 5
iterator = cycle(range(size))
result = []
for line in text1.split('\n'):
    i = next(iterator)
    if i == 0:
        result.append({'header': line})
    elif i == 1:
        result[-1]['num_of_samples'] = line
    elif i == 2:
        result[-1]['content_header'] = line
    elif i == 3:
        result[-1]['content'] = [line.split(', ')]
    else:
        result[-1]['content'].append(line.split(', '))

If the block size varies, test each line against your header conditions instead:

text2 = """Header1
number of Samples1
Content1
a1, aa1, aaa1
b1, bb1, bbb1
b1, bb1, bbb1
Header2
number of Samples2
Content2
b2, bb2, bbb2
Header3
number of Samples3
Content3
a3, aa3, aaa3
b3, bb3, bbb3"""
result = []
for line in text2.split('\n'):
    if line.startswith('Header'):  # Your condition for headers
        result.append({'header': line})
    elif line.startswith('number'):  # Your condition for number of samples
        result[-1]['num_of_samples'] = line
    elif line.startswith('Content'):  # Your condition for content headers
        result[-1]['content_header'] = line
    else:
        if 'content' not in result[-1]:  # The content list may not exist yet for this block
            result[-1]['content'] = [line.split(', ')]
        else:
            result[-1]['content'].append(line.split(', '))
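
Since the question asks for separate files, here is a minimal sketch of the same header-based idea that writes each block to its own file. The helper name split_blocks, the part_N.csv file names, and the startswith('Header') test are assumptions for illustration, not part of the original answer; swap in whatever condition identifies your real header lines.

```python
import tempfile
from pathlib import Path


def split_blocks(lines, is_header):
    """Group lines into blocks, starting a new block at every header line."""
    blocks = []
    for line in lines:
        # A header line (or the very first line) opens a new block.
        if is_header(line) or not blocks:
            blocks.append([])
        blocks[-1].append(line)
    return blocks


sample = """Header1
number of Samples1
Content1
a1, aa1, aaa1
b1, bb1, bbb1
Header2
number of Samples2
Content2
a2, aa2, aaa2"""

# Assumption: header lines start with "Header"; replace with your own condition.
blocks = split_blocks(sample.split('\n'), lambda line: line.startswith('Header'))

# Write each block to its own file in a temporary directory.
out_dir = Path(tempfile.mkdtemp())
paths = []
for n, block in enumerate(blocks, start=1):
    path = out_dir / f'part_{n}.csv'  # hypothetical file name pattern
    path.write_text('\n'.join(block) + '\n')
    paths.append(path)
```

Each resulting file still starts with its three preamble lines (header, number of samples, content header). If you want the data rows in pandas, one option is to pass the block text to pandas.read_csv through io.StringIO and skip those lines with skiprows.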
