
Convert TSV file data into a DataFrame that can be pushed to a database

file csv pandas python

Kavya shree · 11 months ago

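We have TSV files that hold IoT data and want to convert them into a table-like structure using pandas. I have looked at the TSV data, which resembles the sample below, and the logic I have in mind is:

Read the file
Add new column names
Transpose
Reindex

As explained, this is somewhat challenging: col1 to col3 are dimension data and the rest are fact data.

The TSV file data looks like this: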

col1	qweqweq
col2	345435
col3	01/01/2024 35:08:09
col4	1
col5	0
col4	0
col5	0
col4	1
col5	1
col4	0
col5	1

I want to project it into a table-like structure:

col1	col2	col3	col4	col5
qweqweq	345435	01/01/2024 35:08:09	1	0
qweqweq	345435	01/01/2024 35:08:09	0	0
qweqweq	345435	01/01/2024 35:08:09	1	1
qweqweq	345435	01/01/2024 35:08:09	0	1

col4 and col5 can differ from one IoT file to another. How can this be achieved with Python and pandas?

1 answer | 11 months ago

mozway · 11 months ago

Assuming you can rely on "col1" to define the groups, you can use pivot after de-duplicating the rows with cumsum and groupby.cumcount, followed by groupby.ffill:

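import pandas as pd

df = (pd.read_csv('input_file.tsv', sep='\t', header=None)
        # each 'col1' row starts a new group; n numbers repeated keys within a group
        .assign(index=lambda x: x[0].eq('col1').cumsum(),
                n=lambda x: x.groupby(['index', 0]).cumcount())
        # one column per key, one row per (group, repeat)
        .pivot(index=['index', 'n'], columns=0, values=1)
        # copy the dimension values (col1 to col3) down within each group
        .groupby(level='index').ffill()
        .reset_index(drop=True).rename_axis(columns=None)
     )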

Output:

      col1    col2                 col3 col4 col5
0  qweqweq  345435  01/01/2024 35:08:09    1    0
1  qweqweq  345435  01/01/2024 35:08:09    0    0
2  qweqweq  345435  01/01/2024 35:08:09    1    1
3  qweqweq  345435  01/01/2024 35:08:09    0    1

Reproducible input:

import io

input_file = io.StringIO('''col1\tqweqweq
col2\t345435
col3\t01/01/2024 35:08:09
col4\t1
col5\t0
col4\t0
col5\t0
col4\t1
col5\t1
col4\t0
col5\t1''')
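
To run the approach end to end without a file on disk, the in-memory object can be passed to read_csv in place of the 'input_file.tsv' path. The following is a minimal sketch of that wiring, not a different method; the dtype=str argument is an added assumption that keeps every value as text so nothing is coerced during parsing.

```python
import io

import pandas as pd

# Same sample as the reproducible input, as an in-memory file.
input_file = io.StringIO(
    "col1\tqweqweq\ncol2\t345435\ncol3\t01/01/2024 35:08:09\n"
    "col4\t1\ncol5\t0\ncol4\t0\ncol5\t0\n"
    "col4\t1\ncol5\t1\ncol4\t0\ncol5\t1"
)

# Assumption: read everything as strings; convert types later if needed.
raw = pd.read_csv(input_file, sep="\t", header=None, dtype=str)

df = (raw
      .assign(index=lambda x: x[0].eq("col1").cumsum(),
              n=lambda x: x.groupby(["index", 0]).cumcount())
      .pivot(index=["index", "n"], columns=0, values=1)
      .groupby(level="index").ffill()
      .reset_index(drop=True).rename_axis(columns=None)
     )

print(df)
```

Because the column names come from the first field of each row, a file whose fact keys are named differently from col4 and col5 should go through the same chain unchanged.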

Intermediates:

# before pivot
       0                    1  index  n
0   col1              qweqweq      1  0
1   col2               345435      1  0
2   col3  01/01/2024 35:08:09      1  0
3   col4                    1      1  0
4   col5                    0      1  0
5   col4                    0      1  1
6   col5                    0      1  1
7   col4                    1      1  2
8   col5                    1      1  2
9   col4                    0      1  3
10  col5                    1      1  3
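
The two helper columns in the table above can be reproduced on their own, which may make the grouping assumption easier to check. This sketch uses a hypothetical input with two "col1" blocks (the values devA and devB are invented for illustration) to show that index increments at every "col1" row and that n counts repeats of a key inside a block.

```python
import pandas as pd

# Hypothetical long-format rows containing two 'col1' blocks.
raw = pd.DataFrame({
    0: ["col1", "col4", "col5", "col4", "col5", "col1", "col4", "col5"],
    1: ["devA", "1", "0", "0", "1", "devB", "1", "1"],
})

# Each 'col1' row opens a new block: the running count of matches is the block id.
helpers = raw.assign(index=raw[0].eq("col1").cumsum())

# Within a block, the k-th occurrence of a key gets number k (0-based).
helpers["n"] = helpers.groupby(["index", 0]).cumcount()

print(helpers)
```

If a file does not start each record with "col1", the block id would need to come from a different marker row.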

# before the cleanup-step:
0           col1    col2                 col3 col4 col5
index n                                                
1     0  qweqweq  345435  01/01/2024 35:08:09    1    0
      1  qweqweq  345435  01/01/2024 35:08:09    0    0
      2  qweqweq  345435  01/01/2024 35:08:09    1    1
      3  qweqweq  345435  01/01/2024 35:08:09    0    1
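
Since the goal in the question is to push the result to a database, here is one possible sketch of that last step using DataFrame.to_sql with the standard-library sqlite3 module. The in-memory SQLite target and the table name iot_readings are hypothetical, and treating every column after the first three as a numeric fact column is an assumption taken from the question's description. col3 is left as text because the sample timestamp has an hour of 35, which is not a valid time of day.

```python
import sqlite3

import pandas as pd

# The wide table produced by the pivot approach; values are still strings.
df = pd.DataFrame({
    "col1": ["qweqweq"] * 4,
    "col2": ["345435"] * 4,
    "col3": ["01/01/2024 35:08:09"] * 4,
    "col4": ["1", "0", "1", "0"],
    "col5": ["0", "0", "1", "1"],
})

# Assumption: the first three columns are dimensions, the rest numeric facts.
fact_cols = df.columns[3:]
df[fact_cols] = df[fact_cols].apply(pd.to_numeric)

# Hypothetical target: in-memory SQLite database, table "iot_readings".
conn = sqlite3.connect(":memory:")
try:
    df.to_sql("iot_readings", conn, if_exists="append", index=False)
    n_rows = conn.execute("SELECT COUNT(*) FROM iot_readings").fetchone()[0]
finally:
    conn.close()

print(n_rows)
```

For another database, a SQLAlchemy engine can be passed to to_sql instead of the sqlite3 connection.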
