
Extracting parts of emails from text files

  •  -1
  • Baktaawar David Maust  · Tech Community  · 6 years ago

    I want to do some text processing on a corpus that contains emails. A sample file looks like this:

    Message-ID: <3490571.1075846143093.JavaMail.evans@thyme>
    Date: Wed, 8 Sep 1999 08:50:00 -0700 (PDT)
    From: steven.kean@enron.com
    To: kelly.kimberly@enron.com
    Subject: Re: India And The WTO Services Negotiation
    Mime-Version: 1.0
    Content-Type: text/plain; charset=us-ascii
    Content-Transfer-Encoding: 7bit
    X-From: Steven J Kean
    X-To: Kelly Kimberly
    X-cc: 
    X-bcc: 
    X-Folder: \Steven_Kean_Dec2000_1\Notes Folders\All documents
    X-Origin: KEAN-S
    X-FileName: skean.nsf
    
    fyi
    ---------------------- Forwarded by Steven J Kean/HOU/EES on 09/08/99 03:49 
    PM ---------------------------
    
    
    Joe Hillings@ENRON
    09/08/99 02:52 PM
    To: Joe Hillings/Corp/Enron@Enron
    cc: Sanjay Bhatnagar/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Terence H 
    Thorn/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Ashok 
    Mehta/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, John 
    Ambler/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Steven J Kean/HOU/EES@EES, 
    Jeffrey Sherrick/Corp/Enron@Enron 
    Subject: Re: India And The WTO Services Negotiation  
    
    Sanjay: Some information of possible interest to you. I attended a meeting 
    this afternoon of the Coalition of Service Industries, one of the lead groups 
    promoting a wide range of services including energy services in the upcoming 
    WTO GATTS 2000 negotiations. CSI President Bob Vastine was in Delhi last week 
    and met with CII to discuss the upcoming WTO. CII apparently has a committee 
    looking into the WTO. Bob says that he told them that energy services was 
    among the CSI recommendations and he recalls that CII said that they too have 
    an interest.
    
    Since returning from the meeting I spoke with Kiran Pastricha and told her 
    the above. She actually arranged the meeting in Delhi. She asked that I send 
    her the packet of materials we distributed last week in Brussels and London. 
    One of her associates is leaving for India tomorrow and will take one of 
    these items to Delhi. 
    
    Joe
    
    
    
    Joe Hillings
    09/08/99 11:57 AM
    To: Sanjay Bhatnagar/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT
    cc: Terence H Thorn/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Ashok 
    Mehta/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, John 
    Ambler/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Steven J Kean/HOU/EES@EES, 
    Jeffrey Sherrick/Corp/Enron@Enron (bcc: Joe Hillings/Corp/Enron)
    Subject: India And The WTO Services Negotiation
    
    Sanjay: First some information and then a request for your advice and 
    involvment.
    
    A group of US companies and associations formed the US WTO Energy Services 
    Coalition in late May and has asked the US Government to include "energy 
    services" on their proposed agenda when the first meeting of the WTO GATTS 
    2000 ministerial convenes late this year in Seattle. Ken Lay will be among 
    the CEO speakers. These negotiations are expected to last three years and 
    cover a range of subjects including agriculture, textiles, e-commerce, 
    investment, etc.
    
    This morning I visited with Sudaker Rao at the Indian Embassy to tell him 
    about our coalition and to seek his advice on possible interest of the GOI. 
    After all, India is a leader in data processing matters and has other 
    companies including ONGC that must be interested in exporting energy 
    services. In fact probably Enron and other US companies may be engaging them 
    in India and possibly abroad.
    
    Sudaker told me that the GOI has gone through various phases of opposing the 
    services round to saying only agriculture to now who knows what. He agrees 
    with the strategy of our US WTO Energy Services Coalition to work with 
    companies and associations in asking them to contact their government to ask 
    that energy services be on their list of agenda items. It would seem to me 
    that India has such an interest. Sudaker and I agree that you are a key 
    person to advise us and possibly to suggest to CII or others that they make 
    such a pitch to the GOI Minister of Commerce.
    
    I will ask Lora to send you the packet of materials Chris Long and I 
    distributed in Brussels and London last week. I gave these materials to 
    Sudaker today.
    
    Everyone tells us that we need some developing countries with an interest in 
    this issue. They may not know what we are doing and that they are likely to 
    have an opportunity if energy services are ultimately negotiated.
    
    Please review and advise us how we should proceed. We do need to get 
    something done in October.
    Joe
    
    PS Terry Thorn is moderating a panel on energy services at the upcoming World 
    Services Congress in Atlanta. The Congress will cover many services issues. I 
    have noted in their materials that Mr. Alliwalia is among the speakers but 
    not on energy services. They expect people from all over the world to 
    participate.
    

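    I can run os.walk on the main directory: it traverses each subdirectory, so I can parse every text file in that subdirectory and then repeat the same for the other subdirectories, and so on.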
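
A minimal sketch of the traversal described above, assuming the emails are plain-text files somewhere under one root directory. The helper name `iter_text_files` and the `.txt` suffix are assumptions for illustration, not something from the question; pass `suffix=""` if the files have no extension.

```python
import os


def iter_text_files(root_dir, suffix=".txt"):
    """Yield (path, text) for every file under root_dir whose name ends with suffix."""
    # os.walk visits root_dir and every subdirectory below it
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the walk going if a file contains odd bytes
            with open(path, encoding="utf-8", errors="replace") as handle:
                yield path, handle.read()
```

Each yielded `text` can then be handed to whatever parser extracts the fields, so the directory walk stays separate from the parsing.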

    I need to extract certain parts of each email in a text file and store them as a new row in a dataset (CSV, pandas DataFrame, etc.).

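    The parts listed below are the ones worth extracting; each becomes a column of a row in the dataset. Each row of the dataset corresponds to one email from one text file.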

    Fields:

    Original Email content | From (Sender) | To (Recipient) | cc (Recipient) | Date/Time Sent | Subject of Email |
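
One way to store such rows, sketched with the standard-library `csv` module. The column names in `COLUMNS` and the helper `write_rows` are hypothetical, chosen to mirror the fields listed above; each email is assumed to arrive as a dict of already-extracted fields.

```python
import csv
import io

# Hypothetical column names, one per field listed in the question.
COLUMNS = ["content", "sender", "to", "cc", "date", "subject"]


def write_rows(rows, handle):
    """Write one CSV row per email dict; missing fields become empty cells."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Usage: write to an in-memory buffer (a file opened with newline="" works too).
buffer = io.StringIO()
write_rows(
    [
        {
            "content": "fyi",
            "sender": "steven.kean@enron.com",
            "to": "kelly.kimberly@enron.com",
            "date": "Wed, 8 Sep 1999 08:50:00 -0700 (PDT)",
            "subject": "Re: India And The WTO Services Negotiation",
        }
    ],
    buffer,
)
print(buffer.getvalue())
```

If pandas is available, the same list of dicts could instead be passed to `pandas.DataFrame(rows)`; the CSV version is shown only because it needs no extra dependency.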
    

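    Edit: I looked at the duplicate question that was linked. It assumes a fixed specification and fixed boundaries, which is not the case here. I am looking for a simple regex approach to extract the different fields listed above.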

    1 answer  |  as of 6 years ago
        1
  •  0
  •   Pedro Rodrigues    6 years ago
    ^Date:\ (?P<date>.+?$)
    .+?
    ^From:\ (?P<sender>.+?$)
    .+?
    ^To:\ (?P<to>.+?$)
    .+?
    ^cc:\ (?P<cc>.+?$)
    .+?
    ^Subject:\ (?P<subject>.+?$)
    

    Make sure you are using the dotall, multiline and extended modes on your regex engine.
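
As a small illustration of why those modes matter, here is a sketch in Python, where they correspond to `re.DOTALL`, `re.MULTILINE` and `re.VERBOSE`. The two-header `sample` string is made up for the demonstration and is much shorter than a real email.

```python
import re

# Made-up sample: two headers with an unrelated line in between.
sample = "Date: Wed, 8 Sep 1999\nX-Note: n/a\nFrom: steven.kean@enron.com\n"

pattern = r"""
    ^Date:\ (?P<date>.+?$)    # escaped space survives verbose mode
    .+?                       # needs DOTALL to skip across line breaks
    ^From:\ (?P<sender>.+?$)  # needs MULTILINE so anchors work per line
"""

flags = re.DOTALL | re.MULTILINE | re.VERBOSE
match = re.search(pattern, sample, flags)
print(match.groupdict())

# Without DOTALL, ".+?" cannot cross the newline, so the search finds nothing.
print(re.search(pattern, sample, re.MULTILINE | re.VERBOSE))
```

In verbose mode unescaped whitespace in the pattern is ignored, which is why the literal space after each header name is written as `\ `.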

    Group `date`    63-99   `Wed, 8 Sep 1999 08:50:00 -0700 (PDT)`
    Group `sender`  106-127 `steven.kean@enron.com`
    Group `to`  132-156 `kelly.kimberly@enron.com`
    Group `cc`  650-714 `Sanjay Bhatnagar/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Terence H `
    Group `subject` 930-974 `Re: India And The WTO Services Negotiation  `
    

    https://regex101.com/r/gHUOLi/1

    import re


    def match_email(long_string):
        # verbose pattern: whitespace is ignored, so "\ " marks a literal space
        regex = r'''^Date:\ (?P<date>.+?$)
                    .+?
                    ^From:\ (?P<sender>.+?$)
                    .+?
                    ^To:\ (?P<to>.+?$)
                    .+?
                    ^cc:\ (?P<cc>.+?$)
                    .+?
                    ^Subject:\ (?P<subject>.+?$)'''
        # try to match the thing; the pattern relies on S (dotall),
        # M (multiline) and X (extended) all being set
        long_string = long_string.strip()
        match = re.search(regex, long_string, re.I | re.M | re.S | re.X)

        # if there is no match its over
        if match is None:
            return None, long_string

        # otherwise, get the named groups as a dict
        email = match.groupdict()

        # remove whatever matched from the original string
        long_string = long_string[match.end():]

        # return the email, and the remaining string
        return email, long_string


    # now iterate over the long string
    # (the_long_string is assumed to hold the text read from the file)
    emails = []
    email, tail = match_email(the_long_string)
    while email is not None:
        emails.append(email)
        email, tail = match_email(tail)

    print(emails)
    

    Taken straight from here, with just some names and the like changed.