
Python: How to split text scraped from an HTML page

  •  -1
  • CDNthe2nd  · Tech Community  · 6 years ago

    I wrote a small script that prints a message every time my UPS tracking page has an update.

    # Assumed setup (not shown in the original post)
    import sys

    import requests
    from bs4 import BeautifulSoup as soup

    s = requests.Session()

    # `url` holds the tracking number; I can't post it in case someone changes something on my shipment.
    tracking_full_site = 'https://wwwapps.ups.com/WebTracking/track?track=yes&trackNums=' + url

    headers = {
        'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
                       ' (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36')
    }
    resp = s.get(tracking_full_site, headers=headers, timeout=12)
    resp.raise_for_status()

    bs4 = soup(resp.text, 'lxml')
    old_list = []

    # Each tracking event is a <tr valign="top"> row
    for item in bs4.find_all('tr', {'valign': 'top'}):
        # Collapse all whitespace in the row into single spaces
        where_is_it = " ".join(item.text.split())
        old_list.append(where_is_it)

    print(old_list)

    sys.exit()
    

    But the output I get is:

    United States 28.08.2018 6:16 Package departed international carrier facility
    Edgewood, NY, United States 27.08.2018 20:00 Package transferred to post office
    United States 27.08.2018 18:42 Package processed by international carrier
    EDGEWOOD, NY, United States 24.08.2018 15:51 Package processed by UPS Mail Innovations origin facility
    24.08.2018 12:55 Package received for processing by UPS Mail Innovations
    United States 22.08.2018 8:19 Shipment information received by UPS Mail Innovations
    

    " ".join(item.text.split())

    My question is: how can I split this up so that I can print the country, the date, the time, or the description separately?

    Edit:

    <table summary="" border="0" cellpadding="0" cellspacing="0" class="dataTable">
       <tbody>
          <tr>
             <th scope="col">Location</th>
             <th scope="col">Date</th>
             <th scope="col">Local Time</th>
             <th scope="col" class="full">Activity&nbsp;(<a class="btnlnkR helpIconR" href="javascript:helpModLvl('https://www.ups.com/content/se/en/tracking/tracking/description.html')">What's this?</a>)</th>
          </tr>
          <tr valign="top">
             <td class="nowrap">
                United States
             </td>
             <td class="nowrap">
                28.08.2018
             </td>
             <td class="nowrap">
                6:16
             </td>
             <td>Package departed international carrier facility</td>
          </tr>
          <tr valign="top" class="odd">
             <td class="nowrap">
                Edgewood,&nbsp;
                NY,&nbsp;
                United States
             </td>
             <td class="nowrap">
                27.08.2018
             </td>
             <td class="nowrap">
                20:00
             </td>
             <td>Package transferred to post office</td>
          </tr>
          <tr valign="top">
             <td class="nowrap">
                United States
             </td>
             <td class="nowrap">
                27.08.2018
             </td>
             <td class="nowrap">
                18:42
             </td>
             <td>Package processed by international carrier</td>
          </tr>
          <tr valign="top" class="odd">
             <td class="nowrap">
                EDGEWOOD,&nbsp;
                NY,&nbsp;
                United States
             </td>
             <td class="nowrap">
                24.08.2018
             </td>
             <td class="nowrap">
                15:51
             </td>
             <td>Package processed by UPS Mail Innovations origin facility</td>
          </tr>
          <tr valign="top">
             <td class="nowrap">
             </td>
             <td class="nowrap">
                24.08.2018
             </td>
             <td class="nowrap">
                12:55
             </td>
             <td>Package received for processing by UPS Mail Innovations</td>
          </tr>
          <tr valign="top" class="odd">
             <td class="nowrap">
                United States
             </td>
             <td class="nowrap">
                22.08.2018
             </td>
             <td class="nowrap">
                8:19
             </td>
             <td>Shipment information received by UPS Mail Innovations</td>
          </tr>
       </tbody>
    </table>
    

    Country: United States
    Date: 28.08.2018
    Time: 6:16
    Description: Package departed international carrier facility
    

    Edit: this is the output I get with one of the answers:

    ['Sweden', '29.08.2018', '11:08', 'Package arrived at international carrier']
    ['United States', '28.08.2018', '6:16', 'Package departed international carrier facility']
    ['Edgewood,\t\t\t\t\t\t\t\n\n\t\t\t\t            \n\t\t\t\t            \t\n\t\t\t\t            \tNY,\t\t\t\t            \n\n\t\t\t\t            \n\t\t\t\t            \t\n\t\t\t\t            \tUnited States', '27.08.2018', '20:00', 'Package transferred to post office']
    ['United States', '27.08.2018', '18:42', 'Package processed by international carrier']
    ['EDGEWOOD,\t\t\t\t\t\t\t\n\n\t\t\t\t            \n\t\t\t\t            \t\n\t\t\t\t            \tNY,\t\t\t\t            \n\n\t\t\t\t            \n\t\t\t\t            \t\n\t\t\t\t            \tUnited States', '24.08.2018', '15:51', 'Package processed by UPS Mail Innovations origin facility']
    ['', '24.08.2018', '12:55', 'Package received for processing by UPS Mail Innovations']
    ['United States', '22.08.2018', '8:19', 'Shipment information received by UPS Mail Innovations']
    
    2 answers  |  as of 6 years ago
        1
  •  1
  •   GraphicalDot    6 years ago
    # `soup` is the parsed BeautifulSoup document (named `bs4` in the question)
    array = []
    for item in soup.find_all('tr', {'valign': 'top'}):
        # " ".join(text.split()) collapses newlines, tabs and non-breaking spaces
        array.append([" ".join(f.text.split()) for f in item.find_all("td")])

    output = []
    for e in array:
        output.append({"Country": e[0], "Date": e[1], "Time": e[2], "Description": e[3]})

    # If you want to print only the country, just do this:
    for element in output:
        print(element["Country"])
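The same per-cell split can also be sketched without BeautifulSoup. This is only an illustration that swaps in the standard library's `html.parser`; the names `SAMPLE_HTML`, `RowCollector` and `parse_rows` are made up for this example, and the sample contains just two rows copied from the table in the question.

```python
from html.parser import HTMLParser

# Two data rows (plus a header row) copied from the table in the question.
SAMPLE_HTML = """
<table class="dataTable"><tbody>
<tr><th scope="col">Location</th><th scope="col">Date</th></tr>
<tr valign="top">
  <td class="nowrap">
     United States
  </td>
  <td class="nowrap">28.08.2018</td>
  <td class="nowrap">6:16</td>
  <td>Package departed international carrier facility</td>
</tr>
<tr valign="top" class="odd">
  <td class="nowrap">
     Edgewood,&nbsp;
     NY,&nbsp;
     United States
  </td>
  <td class="nowrap">27.08.2018</td>
  <td class="nowrap">20:00</td>
  <td>Package transferred to post office</td>
</tr>
</tbody></table>
"""


class RowCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # split() with no arguments also splits on \xa0 (from &nbsp;)
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # the header row has only <th> cells, so it stays empty
                self.rows.append(self._row)
            self._row = None


def parse_rows(html):
    """Return a list of [location, date, time, description] lists."""
    collector = RowCollector()
    collector.feed(html)
    collector.close()
    return collector.rows


rows = parse_rows(SAMPLE_HTML)
for location, date, time, description in rows:
    print(f"Country: {location}\nDate: {date}\nTime: {time}\nDescription: {description}\n")
```

Each row comes back as a list of four cleaned strings, so any single field can be printed on its own.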
    
        2
  •  0
  •   Ying Li    6 years ago

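    Once you have the GET response, put it in a variable (respString) and parse it. The idea is to read through the HTML and identify where the information you need sits.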

    <tr valign="top" class="odd">
       <td class="nowrap">
          United States
       </td>
       <td class="nowrap">
          22.08.2018
       </td>
       <td class="nowrap">
          8:19
       </td>
       <td>Shipment information received by UPS Mail Innovations</td>
    </tr>
    

    This extracts the "United States" part from the HTML:

    var startIndex = respString.indexOf('<td class="nowrap">');
    var tempRespString = respString.substring(startIndex);
    var tempStartIndex = tempRespString.indexOf('>');
    var tempEndIndex = tempRespString.indexOf('</');
    var country = tempRespString.substring(tempStartIndex + 1, tempEndIndex).trim();
    
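Since the question is about Python, here is a rough Python sketch of the same string-index idea. The variable names (`resp_string`, `marker`) are illustrative, and the sample row is shortened from the one above.

```python
# One row from the question's table, stored as a plain string.
resp_string = """<tr valign="top" class="odd">
   <td class="nowrap">
      United States
   </td>
   <td class="nowrap">
      22.08.2018
   </td>
</tr>"""

marker = '<td class="nowrap">'
start = resp_string.find(marker)        # str.find returns -1 if the marker is missing
if start == -1:
    raise ValueError("marker not found")

content_start = start + len(marker)     # first character after the opening tag
content_end = resp_string.find("</", content_start)   # start of the closing tag

# strip() removes the newlines and indentation around the cell text
country = resp_string[content_start:content_end].strip()
print(country)
```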

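    If there are several similar strings and a single search cannot reach the one you want, say you need to target the third occurrence of: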

    '<td class="nowrap">'
    

    Just get creative and find a way to pull the data you need out of the HTML response.
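One possible way to target the third occurrence is to repeat the search, each time starting just past the previous match. The sketch below is in Python and assumes every cell starts with the same marker; the helper name `nth_cell_text` is hypothetical.

```python
def nth_cell_text(html, marker, n):
    """Return the stripped text that follows the n-th (1-based) occurrence of marker."""
    pos = -1
    for _ in range(n):
        # Resume the search one character after the previous match
        pos = html.find(marker, pos + 1)
        if pos == -1:
            raise ValueError(f"fewer than {n} occurrences of {marker!r}")
    start = pos + len(marker)
    end = html.find("</", start)
    if end == -1:           # no closing tag: take everything that is left
        end = len(html)
    return html[start:end].strip()


# The row shown earlier in this answer.
row = """<tr valign="top" class="odd">
   <td class="nowrap">
      United States
   </td>
   <td class="nowrap">
      22.08.2018
   </td>
   <td class="nowrap">
      8:19
   </td>
   <td>Shipment information received by UPS Mail Innovations</td>
</tr>"""

marker = '<td class="nowrap">'
print(nth_cell_text(row, marker, 1))
print(nth_cell_text(row, marker, 2))
print(nth_cell_text(row, marker, 3))
```

String searching like this is sensitive to any change in the markup, so it is probably best kept for quick one-off scripts.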