Web scraping a table in Python returns an empty table

  •  2
  • Jamil Rahman  · Tech Community  · 4 years ago

    I need to scrape a table from a website using Python's BeautifulSoup library. The URL is https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html

    When I run this code, I get an empty table:

    import requests
    from bs4 import BeautifulSoup
    #
    vaacineProgressResponse = requests.get("https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html")
    vaacineProgressContent = BeautifulSoup(vaacineProgressResponse.content, 'html.parser')
    vaacineProgressContentTable = vaacineProgressContent.find_all('table', class_="g-summary-table  svelte-2wimac")
    if vaacineProgressContentTable is not None and len(vaacineProgressContentTable) > 0:
        vaacineProgressContentTable = vaacineProgressContentTable[0]
    #
    print ('the table =', vaacineProgressContentTable)
    

    the table = []
    
    Process finished with exit code 0
    

    The screenshot below shows the table on the web page (left) and the corresponding Inspect Element panel (right):

    [screenshot]

    2 answers  |  as of 4 years ago
        1
  •  3
  •   readyplayer77 Espoir Murhabazi    4 years ago

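    It's simple: there is an extra space in the class string you are searching for.

    If you change the class to g-summary-table svelte-2wimac (a single space between the two names),

    the following code should work: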

    import requests
    from bs4 import BeautifulSoup

    # Download the page and parse it.
    response = requests.get("https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html")
    soup = BeautifulSoup(response.content, 'html.parser')
    # Note the single space between the two class names.
    table = soup.find_all('table', class_="g-summary-table svelte-2wimac")
    print(table)
    

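    I have done similar scraping on The New York Times interactive pages, and whitespace can be very tricky. If you add an extra space or leave one out, you get an empty result.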
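    To illustrate the whitespace point, here is a minimal sketch that runs against a small inline HTML string rather than the live page. The markup is a hypothetical stand-in, and the CSS-selector variant is my own suggestion, not something the answer above used.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page markup; the real page is much larger.
html = '<table class="g-summary-table svelte-2wimac"><tr><td>1</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# A multi-class string is compared against the whole class value,
# so a doubled space does not match the single-spaced value.
double_space = soup.find_all("table", class_="g-summary-table  svelte-2wimac")
single_space = soup.find_all("table", class_="g-summary-table svelte-2wimac")

# Matching one class name, or using a CSS selector, does not depend on
# the spacing or order of the class names.
by_one_class = soup.find_all("table", class_="g-summary-table")
by_selector = soup.select("table.g-summary-table.svelte-2wimac")

print(len(double_space), len(single_space), len(by_one_class), len(by_selector))
```

    With the bs4 releases I am aware of, only the doubled-space search should come back empty. Searching by a single class name or by selector is usually the less brittle choice when the class string was copied by hand.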

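    If you can't find a tag, I suggest running print(soup.prettify()) first to locate the tag you want. Be sure to copy the class string exactly as it appears in the prettify output.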
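    As a lighter alternative to reading the whole prettify output, the class strings of the tables can be listed directly. This is only a sketch: the helper name table_classes and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
html = """
<div>
  <table class="g-summary-table svelte-2wimac"><tr><td>1</td></tr></table>
  <table id="other"><tr><td>2</td></tr></table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def table_classes(soup):
    """Return the space-joined class names of every <table> ('' if it has none)."""
    return [" ".join(t.get("class", [])) for t in soup.find_all("table")]

classes = table_classes(soup)
for c in classes:
    # repr() makes stray or doubled spaces visible.
    print(repr(c))
```

    Copying one of the printed strings into class_ avoids retyping it by hand.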

        2
  •  0
  •   Jonathan Leon    4 years ago

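    Alternatively, if you want to download the data as JSON and then read it into pandas, you can do the following. It picks up from the code above, starting at the soup object.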

    import re
    import pandas as pd

    # Each lookup finds the first text node that mentions the keyword,
    # takes its third line, and pulls out a quoted value (here a data URL).
    latest_dataset = soup.find(string=re.compile('latest')).splitlines()[2].split('"')[1]
    requests.get(latest_dataset).json()

    latest_timeseries = soup.find(string=re.compile('timeseries')).splitlines()[2].split('"')[3]
    requests.get(latest_timeseries).json()

    allwithrate = soup.find(string=re.compile('all_with_rate')).splitlines()[2].split('"')[1]
    requests.get(allwithrate).json()
    # Load the last dataset into a DataFrame.
    pd.DataFrame(requests.get(allwithrate).json())
    

    Output of the last one:

        geoid    location last_updated  total_vaccinations  people_vaccinated     display_name  ...                      Region          IncomeGroup                    country  gdp_per_cap  vaccinations_rate people_fully_vaccinated
    0     MUS   Mauritius   2021-02-17              3843.0             3843.0        Mauritius  ...          Sub-Saharan Africa          High income                  Mauritius  11099.24028             0.3037                     NaN
    1     DZA     Algeria   2021-02-19             75000.0                NaN          Algeria  ...  Middle East & North Africa  Lower middle income                    Algeria  3973.964072             0.1776                     NaN
    2     LAO        Laos   2021-03-17             40732.0            40732.0             Laos  ...         East Asia & Pacific  Lower middle income                    Lao PDR   2534.89828             0.5768                     NaN
    3     MOZ  Mozambique   2021-03-23             57305.0            57305.0       Mozambique  ...          Sub-Saharan Africa           Low income                 Mozambique  503.5707727             0.1943                     NaN
    4     CPV  Cape Verde   2021-03-24              2184.0             2184.0       Cape Verde  ...          Sub-Saharan Africa  Lower middle income                 Cabo Verde  3603.781793             0.4016                     NaN
    ..    ...         ...          ...                 ...                ...              ...  ...                         ...                  ...                        ...          ...                ...                     ...
    243   GUF         NaN          NaN                 NaN                NaN    French Guiana  ...                         NaN                  NaN                        NaN          NaN                NaN                     NaN
    244   KOS         NaN          NaN                 NaN                NaN           Kosovo  ...                         NaN                  NaN                        NaN          NaN                NaN                     NaN
    245   CUW         NaN          NaN                 NaN                NaN          Curaçao  ...   Latin America & Caribbean          High income                    Curacao  19689.13982                NaN                     NaN
    246   CHI         NaN          NaN                 NaN                NaN  Channel Islands  ...       Europe & Central Asia          High income            Channel Islands  74462.64675                NaN                     NaN
    247   SXM         NaN          NaN                 NaN                NaN     Sint Maarten  ...   Latin America & Caribbean          High income  Sint Maarten (Dutch part)  29160.10381                NaN                     NaN
    
    [248 rows x 17 columns]
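
    The positional lookups in the code above (third line, then the first or third quoted value) depend on the exact layout of the page's inline script. A less position-sensitive sketch is to search the script text for quoted URLs ending in .json. The script text, variable names, and example.com URLs below are hypothetical; the real page may differ.

```python
import re

# Hypothetical inline-script text; the real page's names and URLs may differ.
script_text = '''
var config = {
  latest:
    "https://example.com/data/latest.json",
  all_with_rate:
    "https://example.com/data/all_with_rate.json"
};
'''

def find_json_urls(text):
    """Return every double-quoted http(s) URL ending in .json, in order of appearance."""
    return re.findall(r'"(https?://[^"\s]+\.json)"', text)

urls = find_json_urls(script_text)
print(urls)
```

    Each URL found this way could then be passed to requests.get(...).json() and on to pd.DataFrame, as in the answer above.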