
Python regex finds no match

  • David J.  · 7 years ago

    I am looking for matches in the following text string:

    '<html xmlns:msdt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:mso="urn:schemas-microsoft-com:office:office">\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   SN G2250-010\n  </title>\n  <!--[if gte mso 9]><xml>\n<mso:CustomDocumentProperties>\r\n<mso:Service_x0020_Note msdt:dt="string">SN</mso:Service_x0020_Note>\r\n<mso:Order msdt:dt="string">1493700.00000000</mso:Order>\r\n<mso:ContentType msdt:dt="string">Document</mso:ContentType>\r\n</mso:CustomDocumentProperties>\n</xml><![endif]-->\n </head>\n <link href="..\\..\\_format.css" rel="stylesheet" type="text/css"/>\n <body>\n  <table>\n   <tr>\n    <td>\n     <img border="0" src="SN_G2250_010//r1_logo1.gif"/>\n    </td>\n    <td align="left" width="178">\n     <img border="0" src="SN_G2250_010//r1_logo2.gif"/>\n    </td>\n    <td>\n     <div class="subtitle2">\n      <b>\n       <font color="red">\n        Life Sciences and Chemical Analysis Service Note\n       </font>\n      </b>\n     </div>\n    </td>\n   </tr>\n  </table>\n  <h2>\n   SERVICE NOTE G2250-010\n  </h2>\n  <pre>Supersedes: None\r\n \r\nINB22000 compatibility with Windows 2000 and ChemStation A.9.01\r\n\r\nSerial Numbers:\r\nUS00000000 - US99999999\r\n\r\nThe CCMode software is in general compatible with Windows 2000 and \r\nChemStation Revision A.9.01. Please see required settings!\r\n\r\nTo Be Performed By:\r\nAgilent-Qualified Personnel\r\n\r\nParts Required:\r\n\r\nNone\r\n\r\nSituation:\r\nChanges of operating software to Windows 2000 and implementation\r\nof ChemStation Rev. A.9.01 required some testing of the CCMode \r\n\r\nsoftware INB22000 / INB22002 / INB22003 and INB22004 Rev. A.03.02.\r\n\r\nSolution/Action:\r\nBefore using the Micro-plate Sampling Software INB22000 / INB22002 \r\n/ INB22003 or INB22004 Rev. A.03.02 (CCMode)  on a PC with \r\nWindows 2000 a minor change in the "Control panel" must be made. \r\nIf this change is not made some icons in the user interface will \r\nnot be represented correctly. 
The functionality itself is not \r\ninfluenced:\r\n\r\nOpen "Settings", "Control Panel", "Display", "Appearance".\r\n\r\nGo to the "Scheme" and select the choice "Windows Classic". \r\nPress "OK" and close the "Control Panel" window.Required "Regional \r\nSettings" for both WIN NT and WIN2000\r\n\r\nIn order to run and edit parameters within CC-Mode your \r\nPC must be setup in this way:\r\n\r\n- Regional settings: English (United States)\r\n- Number format (default for English (United States)) \r\n  Decimal symbol  \'.\'\r\n- Number format (default for English (United States)) \r\n  Digit grouping symbol  \',\'\r\n\r\nNotes about using WIN2000:\r\n\r\n1. The installation and operation of CCMode (A.03.0x) and \r\nPurify SW (A.01.01) on the same PC is not recommended and \r\nnot supported.\r\n\r\n2. CCMode A.03.01 has not been tested. Customers owning \r\nthis version must upgrade to A.03.02 even if the additional \r\nfeatures for preparative analysis are not needed.\r\n\r\n3. The combination CCmode A.03.0x, ChemStation A.08.0x and \r\nWindows 2000 has not been tested and is not supported.\r\n\r\n\r\n\r\nDate:\r\n3/11/02\r\n******************************************************************************\r\n\r\n*                              Information Only                             
    *\r\n******************************************************************************\r\n*             Author/Entity: AG/B404                                         *\r\n*  Additional Information: None                                          
    *\r\n******************************************************************************\r\n</pre>\n </body>\n</html>\n'
    

    In Python 3.6.4 I defined the pattern as a raw string:

    r = r'Supersedes:?[\\r\\n ]+[\w\-\s]+[\\r\\n ]+(.*)[\\r\\n ]+Serial Numbers?:?[ \\r\\n]+.*?[ \\n\\r]\*+[\\n\\r ]+\*([A-Za-z ]+)[ \\n\\r]\*+[\\n\\r]+.*?\*+[ \\n\\r]+.*?\*\s+(?:Author[:\w\/]+ ([\.\w\/\s�]+))'
    

    which is then used to search:

    a = re.search(r, raw_string, re.M|re.S)
    

    This returns no match, so indexing the result fails:

    a[0]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'NoneType' object is not subscriptable
    

    even though the exact same string and regex match on regex101:

    https://regex101.com/r/qgJMbO/1

    Can anyone tell me what the problem is?

    Edit:

    The expected result is:

    a[1] 'INB22000 compatibility with Windows 2000 and ChemStation A.9.01\r\n\r\'

    a[2] 'Information Only'

    a[3] 'AG/B404'

    1 answer
  • johnashu  · 7 years ago

    Here is a solution using BeautifulSoup and re:

    from bs4 import BeautifulSoup as bs4
    import re
    
    docstring = '<html xmlns:msdt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:mso="urn:schemas-microsoft-com:office:office">\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   SN G2250-010\n  </title>\n  <!--[if gte mso 9]><xml>\n<mso:CustomDocumentProperties>\r\n<mso:Service_x0020_Note msdt:dt="string">SN</mso:Service_x0020_Note>\r\n<mso:Order msdt:dt="string">1493700.00000000</mso:Order>\r\n<mso:ContentType msdt:dt="string">Document</mso:ContentType>\r\n</mso:CustomDocumentProperties>\n</xml><![endif]-->\n </head>\n <link href="..\\..\\_format.css" rel="stylesheet" type="text/css"/>\n <body>\n  <table>\n   <tr>\n    <td>\n     <img border="0" src="SN_G2250_010//r1_logo1.gif"/>\n    </td>\n    <td align="left" width="178">\n     <img border="0" src="SN_G2250_010//r1_logo2.gif"/>\n    </td>\n    <td>\n     <div class="subtitle2">\n      <b>\n       <font color="red">\n        Life Sciences and Chemical Analysis Service Note\n       </font>\n      </b>\n     </div>\n    </td>\n   </tr>\n  </table>\n  <h2>\n   SERVICE NOTE G2250-010\n  </h2>\n  <pre>Supersedes: None\r\n \r\nINB22000 compatibility with Windows 2000 and ChemStation A.9.01\r\n\r\nSerial Numbers:\r\nUS00000000 - US99999999\r\n\r\nThe CCMode software is in general compatible with Windows 2000 and \r\nChemStation Revision A.9.01. Please see required settings!\r\n\r\nTo Be Performed By:\r\nAgilent-Qualified Personnel\r\n\r\nParts Required:\r\n\r\nNone\r\n\r\nSituation:\r\nChanges of operating software to Windows 2000 and implementation\r\nof ChemStation Rev. A.9.01 required some testing of the CCMode \r\n\r\nsoftware INB22000 / INB22002 / INB22003 and INB22004 Rev. A.03.02.\r\n\r\nSolution/Action:\r\nBefore using the Micro-plate Sampling Software INB22000 / INB22002 \r\n/ INB22003 or INB22004 Rev. A.03.02 (CCMode)  on a PC with \r\nWindows 2000 a minor change in the "Control panel" must be made. \r\nIf this change is not made some icons in the user interface will \r\nnot be represented correctly. 
The functionality itself is not \r\ninfluenced:\r\n\r\nOpen "Settings", "Control Panel", "Display", "Appearance".\r\n\r\nGo to the "Scheme" and select the choice "Windows Classic". \r\nPress "OK" and close the "Control Panel" window.Required "Regional \r\nSettings" for both WIN NT and WIN2000\r\n\r\nIn order to run and edit parameters within CC-Mode your \r\nPC must be setup in this way:\r\n\r\n- Regional settings: English (United States)\r\n- Number format (default for English (United States)) \r\n  Decimal symbol  \'.\'\r\n- Number format (default for English (United States)) \r\n  Digit grouping symbol  \',\'\r\n\r\nNotes about using WIN2000:\r\n\r\n1. The installation and operation of CCMode (A.03.0x) and \r\nPurify SW (A.01.01) on the same PC is not recommended and \r\nnot supported.\r\n\r\n2. CCMode A.03.01 has not been tested. Customers owning \r\nthis version must upgrade to A.03.02 even if the additional \r\nfeatures for preparative analysis are not needed.\r\n\r\n3. The combination CCmode A.03.0x, ChemStation A.08.0x and \r\nWindows 2000 has not been tested and is not supported.\r\n\r\n\r\n\r\nDate:\r\n3/11/02\r\n******************************************************************************\r\n\r\n*                              Information Only   *\r\n******************************************************************************\r\n*             Author/Entity: AG/B404                                         *\r\n*  Additional Information: None                                          *\r\n******************************************************************************\r\n</pre>\n </body>\n</html>\n'
    
    
    soup = bs4(docstring, 'lxml')
    
    description_source = soup.find('pre')
    
    s = description_source.text
    
    # Raw string with single backslashes, so \r and \n reach the regex engine as CR and LF
    r = r'Supersedes:?[\r\n ]+[\w\-\s]+[\r\n ]+(.*)[\r\n ]+Serial Numbers?:?[ \r\n]+.*?[ \n\r]\*+[\n\r ]+\*([A-Za-z ]+)[ \n\r]\*+[\n\r]+.*?\*+[ \n\r]+.*?\*\s+(?:Author[:\w\/]+ ([\.\w\/\s�]+))'
    
    a = re.search(r, s, re.M|re.S)
    
    s = s.split('\r\n')
    
    print(s[2])
    print(a[2])
    print(a[3])
    

    Returns:

    INB22000 compatibility with Windows 2000 and ChemStation A.9.01
                              Information Only  
    AG/B404
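
A likely explanation for the difference between the question and the answer, sketched on a shortened stand-in for the service note text (the `text` and `pasted` strings and the two pattern names below are made up for illustration). In a raw string, `[\\r\\n ]` is a character class containing a literal backslash, `r`, `n` and a space, so it cannot match real carriage-return or line-feed characters. It would still match on regex101 if the test string was pasted there with `\r\n` as visible text rather than as real line breaks, which is an assumption about the linked demo, not something verified here.

```python
import re

# Shortened stand-in for the <pre> text, with real CR/LF characters.
text = "Supersedes: None\r\n \r\nSerial Numbers:\r\nUS00000000"

# The same text as it might look when pasted with visible "\r\n" sequences
# (assumption about how the regex101 test string was entered).
pasted = r"Supersedes: None\r\n \r\nSerial Numbers:\r\nUS00000000"

# Raw string, doubled backslashes: the class holds '\', 'r', 'n' and ' '.
literal_class = re.compile(r"Supersedes:[\\r\\n ]+(\w+)[\\r\\n ]+Serial")

# Raw string, single backslashes: the class holds CR, LF and ' '.
newline_class = re.compile(r"Supersedes:[\r\n ]+(\w+)[\r\n ]+Serial")

miss = literal_class.search(text)           # no match on real line breaks
hit = newline_class.search(text)            # matches real line breaks
pasted_hit = literal_class.search(pasted)   # matches the visible "\r\n" text

# re.search returns None when nothing matches, so check before indexing.
for name, m in [("literal_class", miss), ("newline_class", hit)]:
    print(name, "->", m[1] if m is not None else "no match")
```

Checking the result for `None` before indexing also avoids the `TypeError: 'NoneType' object is not subscriptable` shown in the question.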