代码之家  ›  专栏  ›  技术社区  ›  Londiwe Ndlangamandla

使用pyspark中以新行结束的多行固定宽度(平面)文件

  •  0
  • Londiwe Ndlangamandla  · 技术社区  · 2 年前

    我对pyspark还很陌生,仍在努力寻找解决方法。请帮忙,我正在尝试使用pyspark从平面文件中提取数据,并使用长度和位置为其指定列名。 然后,稍后我必须在azure synapse上将其转换为镶木地板格式。我一直在努力把它做好。

    我使用了这里提供的建议 https://stackoverflow.com/a/74074532/16562593 以生成以下代码:

    df = spark.read.text("PMNTINSTR_RES_24052023163601_00010.txt")
    col_idx = {"MDRecordType": {"Position": 1, "Length":2},\
        "Channel": {"Position": 3, "Length":10},\
        "MessageType": {"Position": 13, "Length":20},\
        "ISOCountryCode": {"Position": 33, "Length":2},\
        "SWIFTBIC": {"Position": 35, "Length":11},\
        "InterchangeId": {"Position": 46, "Length":31},\
        "ProcessingIndicator": {"Position": 77, "Length":10},\
        "TimeStamp": {"Position": 87, "Length":20},\
        "INRecordType": {"Position": 1, "Length":2},\
        "InstructionId": {"Position": 3, "Length":31},\
        "InterchangeId": {"Position": 34, "Length":31},\
        "Source": {"Position": 75, "Length":10},\
        "SenderBIC": {"Position": 85, "Length":11}}
    df.select(*map(lambda x: trim(df.value.substr(col_idx[x]['Position'], col_idx[x]   ['Length'])).alias(x), col_idx)).show()
    

    固定宽度的文件看起来是这样的(为了实用,我已经截断了一些行)

    MDDUMMY      FILEFEEDBACK             ZAXXXXXXXX   49752700                                 20230524163505                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
    IN49752700                       DUMMY      XXXXXXXX   20230524163505      1002                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
    IS49752700                      49752700                       20230524163505      20230524T                                                                      0000100001
    

    输出为

    +------------+--------+------------+--------------+---------+--------------------+-------------------+--------------+------------+--------------------+------+------+
    |MDRecordType| Channel| MessageType|ISOCountryCode| SWIFTBIC|       InterchangeId|ProcessingIndicator|     TimeStamp|INRecordType|       InstructionId|Status|Source|
    +------------+--------+------------+--------------+---------+--------------------+-------------------+--------------+------------+--------------------+------+------+
    |          MD|   DUMMY|FILEFEEDBACK|            ZA| XXXXXXXX|XXXXXXXXX   49752700|                   |20230524163505|          MD|DUMMY     FILEFEE...|      |   202|
    |          IN|49752700|            |             C|PARSDUMMY|          CPARSDUMMY|                   |              |          IN|            49752700|      |      |
    

    所需为

    +------------+--------+------------+--------------+---------+--------------------+-------------------+--------------+------------+--------------------+------+---------+
    |MDRecordType| Channel| MessageType|ISOCountryCode| SWIFTBIC|       InterchangeId|ProcessingIndicator|     TimeStamp|INRecordType|       InstructionId|Source|SenderBIC|
    +------------+--------+------------+--------------+---------+--------------------+-------------------+--------------+------------+--------------------+------+---------+
    |          MD|   DUMMY|FILEFEEDBACK|            ZA| XXXXXXXX|            49752700|                   |20230524163505|          IN|            49752700|DUMMY | XXXXXXXX|
    

    请帮助我如何正确阅读所有记录以获得所需的输出

    0 回复  |  直到 2 年前
    推荐文章