
Scraping a table from a web page with Python

  • zlbi  · Tech Community  · 1 year ago

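    I want to get the table content from this website. However, the page is built in an unusual way, and my code below only retrieves the table from the first page: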

    I know that with only three pages I could copy them by hand, but I would still like to write a script that automates the whole process.

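    import time
    from bs4 import BeautifulSoup as bs
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get(url)  # url: address of the page linked above (not shown in the post)
    time.sleep(5)  # crude wait for JavaScript to render the table
    html_str = driver.page_source
    soup = bs(html_str, "html.parser")
    soup.find("table")  # returns only the table for the page currently shown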
    

    This is the paginator part of soup. I have no web development experience and do not understand what happens after "Next" is clicked.

    <ha-paginator data-translation-block="false" data-translation-id="1442"><!-- --><nav aria-label="Page navigation" class="text-center" data-translation-block="false" data-translation-id="1443">
    <ul class="pagination" data-translation-block="false" data-translation-id="1444">
    <!-- -->
    <!-- --><li class="active" data-translation-block="false" data-translation-id="1445">
    <!-- --><a data-translated="false" data-translation-checksum="57ad7d2ec0e248914c2b0ae7efc17011d1435f99d807e43b172697027ffe46ce500c3ff64f5162eaa059c11a23fa5d8c442ab67bd219d74311601bed517cf477" href="#"> 1
            <!-- --><span class="sr-only">(current)</span>
    </a>
    </li><li data-translation-block="false" data-translation-id="1446">
    <!-- --><a data-translated="false" data-translation-checksum="7eece0387dc3c6876397df60e2d7dbe0e2c94ecdc42d7e50d5208a4c84885caa703c487d86900ac97f10ad493893db85144cf7889d8ac8fd008dfd4c8f0e98df" href="#"> 2
            <!-- -->
    </a>
    </li><li data-translation-block="false" data-translation-id="1447">
    <!-- --><a data-translated="false" data-translation-checksum="aa08ec665075172d835562b332e78832e7f9d3b7f3df47d5a32b8f3a1682daaed49831faf19eeaca164d8e94e3449ade2a83d83dfaa83878c832f644fea11f95" href="#"> 3
            <!-- -->
    </a>
    </li><!-- --><li data-translated="false" data-translation-block="true" data-translation-checksum="7d03f54e74b11d46eacd33365a0aa16a3ba2857949c7f795c2d9c07b5689fbc4230dc22c45af2303eba21a7d8016f197d9b474d4149db6d0df059ce00416e192" data-translation-id="1448">
    <a href="#">
              Next
            </a>
    </li>
    </ul>
    </nav>
    <!-- --></ha-paginator>
    <hr class="big" data-translation-block="true" data-translation-id="1449"/>
    </div>
    </div>
    </div>
    </ha-table-search>
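
    One thing the markup above does show is that every pagination link has href="#", so there is no per-page URL to request; the page change is presumably handled by JavaScript attached to the click. The following is a minimal sketch of that check using only the standard library. SAMPLE is a trimmed stand-in for the markup above, and PaginatorLinks is an illustrative name, not something from the page or from any library.

```python
from html.parser import HTMLParser

# Trimmed copy of the paginator markup (attributes other than href removed).
SAMPLE = """
<ul class="pagination">
<li class="active"><a href="#"> 1 <span class="sr-only">(current)</span></a></li>
<li><a href="#"> 2 </a></li>
<li><a href="#"> 3 </a></li>
<li><a href="#"> Next </a></li>
</ul>
"""


class PaginatorLinks(HTMLParser):
    """Collect a (label, href) pair for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []     # collected (label, href) pairs
        self._in_a = False  # True while inside an <a> element
        self._href = None   # href of the <a> currently open
        self._parts = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        if self._in_a:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_a:
            # Collapse runs of whitespace in the link text.
            label = " ".join("".join(self._parts).split())
            self.links.append((label, self._href))
            self._in_a = False


parser = PaginatorLinks()
parser.feed(SAMPLE)
print(parser.links)
# True when no link points at a real URL
print(all(href == "#" for _, href in parser.links))
```

    If the check prints True, following links with a plain HTTP client cannot reach pages 2 and 3; the remaining options are to click the links in a driven browser or to find the request that the page's JavaScript makes.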
    
    1 answer  |  1 year ago
  •   Andrej Kesely    1 year ago

    The data you see on the page is loaded from an external URL via JavaScript, so you can fetch it directly from there:

    import pandas as pd
    import requests
    
    url = "https://immi.homeaffairs.gov.au/_layouts/15/api/data.aspx/GetPriceList"
    
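    # POST a JSON payload to the endpoint and decode the JSON response
    data = requests.post(url, json={"category": "Visa", "onshore": "All"}).json()
    df = pd.DataFrame(data["d"]["data"])  # the rows sit under d -> data
    
    df.pop("note")  # drop the "note" column before printing
    print(df.head(5))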
    

    Prints:

      visaSubclassCode                                           visaSubclassText streamCode streamText onShore    basePrice  over18Price under18Price nonInternetPrice subsequentPrice
    0              100  Partner (Provisional and Migrant) visa (subclass 309/100)                            No  AUD8,850.00  AUD4,430.00  AUD2,215.00              N/A             N/A
    1              101                                  Child visa (subclass 101)                            No  AUD3,055.00  AUD1,530.00    AUD765.00              N/A             N/A
    2              102                               Adoption visa (subclass 102)                            No  AUD3,055.00  AUD1,530.00    AUD765.00              N/A             N/A
    3              117                        Orphan Relative visa (subclass 117)                            No  AUD1,870.00    AUD935.00    AUD470.00              N/A             N/A
    4              124                   Distinguished Talent visa (subclass 124)                            No  AUD4,110.00  AUD2,055.00  AUD1,030.00              N/A             N/A
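
    The price columns in the output above are strings such as AUD8,850.00 or N/A, so they cannot be summed or sorted numerically as they stand. The following is a minimal sketch of one way to convert them, using only the standard library. parse_aud is a hypothetical helper introduced here for illustration, and it assumes every price is either blank, "N/A", or an amount with an optional "AUD" prefix and comma thousands separators, as in the rows shown.

```python
from typing import Optional


def parse_aud(price: str) -> Optional[float]:
    """Convert a string such as 'AUD8,850.00' to a float.

    Returns None for 'N/A' and for blank strings.
    """
    text = price.strip()
    if not text or text.upper() == "N/A":
        return None
    if text.upper().startswith("AUD"):
        text = text[3:]  # drop the currency prefix
    return float(text.replace(",", ""))  # remove thousands separators


sample = ["AUD8,850.00", "AUD765.00", "N/A"]
print([parse_aud(p) for p in sample])  # [8850.0, 765.0, None]
```

    With the DataFrame from the answer, something like df["basePrice"].map(parse_aud) should apply the same conversion to a whole column, assuming the column holds only strings of the shapes described.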