我对pyspark还很陌生,仍在努力寻找解决方法。请帮忙,我正在尝试使用pyspark从平面文件中提取数据,并使用长度和位置为其指定列名。
然后,稍后我必须在azure synapse上将其转换为镶木地板格式。我一直在努力把它做好。
我使用了这里提供的建议
https://stackoverflow.com/a/74074532/16562593
以生成以下代码:
df = spark.read.text("PMNTINSTR_RES_24052023163601_00010.txt")
col_idx = {"MDRecordType": {"Position": 1, "Length":2},\
"Channel": {"Position": 3, "Length":10},\
"MessageType": {"Position": 13, "Length":20},\
"ISOCountryCode": {"Position": 33, "Length":2},\
"SWIFTBIC": {"Position": 35, "Length":11},\
"InterchangeId": {"Position": 46, "Length":31},\
"ProcessingIndicator": {"Position": 77, "Length":10},\
"TimeStamp": {"Position": 87, "Length":20},\
"INRecordType": {"Position": 1, "Length":2},\
"InstructionId": {"Position": 3, "Length":31},\
"InterchangeId": {"Position": 34, "Length":31},\
"Source": {"Position": 75, "Length":10},\
"SenderBIC": {"Position": 85, "Length":11}}
df.select(*map(lambda x: trim(df.value.substr(col_idx[x]['Position'], col_idx[x] ['Length'])).alias(x), col_idx)).show()
固定宽度的文件看起来是这样的(为了实用,我已经截断了一些行)
MDDUMMY FILEFEEDBACK ZAXXXXXXXX 49752700 20230524163505
IN49752700 DUMMY XXXXXXXX 20230524163505 1002
IS49752700 49752700 20230524163505 20230524T 0000100001
输出为
+------------+--------+------------+--------------+---------+--------------------+-------------------+--------------+------------+--------------------+------+------+
|MDRecordType| Channel| MessageType|ISOCountryCode| SWIFTBIC| InterchangeId|ProcessingIndicator| TimeStamp|INRecordType| InstructionId|Status|Source|
+------------+--------+------------+--------------+---------+--------------------+-------------------+--------------+------------+--------------------+------+------+
| MD| DUMMY|FILEFEEDBACK| ZA| XXXXXXXX|XXXXXXXXX 49752700| |20230524163505| MD|DUMMY FILEFEE...| | 202|
| IN|49752700| | C|PARSDUMMY| CPARSDUMMY| | | IN| 49752700| | |
所需为
+------------+--------+------------+--------------+---------+--------------------+-------------------+--------------+------------+--------------------+------+---------+
|MDRecordType| Channel| MessageType|ISOCountryCode| SWIFTBIC| InterchangeId|ProcessingIndicator| TimeStamp|INRecordType| InstructionId|Source|SenderBIC|
+------------+--------+------------+--------------+---------+--------------------+-------------------+--------------+------------+--------------------+------+---------+
| MD| DUMMY|FILEFEEDBACK| ZA| XXXXXXXX| 49752700| |20230524163505| IN| 49752700|DUMMY | XXXXXXXX|
请帮助我如何正确阅读所有记录以获得所需的输出