
BeautifulSoup: iterating over the 24 letter pages (A to Z) fails; reducing complexity to get a first look at the dataset

  •  2
  • lebrochet  · 1 year ago

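    I have a list of Spanish insurance companies on a website, organised under 24 letter headings; see the following:

    Insurers in Spain, full list: https://www.unespa.es/en/directory

    It is divided into 24 pages, from https://www.unespa.es/en/directory/#A to https://www.unespa.es/en/directory/#Z

    Idea and goal: I want to fetch the data from these pages with BS4 and requests and finally save it to a DataFrame. Scraping the list with BeautifulSoup (BS4) and requests in Python seems like a good fit; I think we need to take the following steps:

    a. First, import the necessary libraries: BeautifulSoup, requests and pandas.
    b. Then use requests to fetch the HTML content of each page of interest, i.e. pages A to Z.
    c. Then parse the HTML content with BeautifulSoup.
    d. Next, extract the relevant information (the names of the insurers) from the parsed HTML.
    e. Finally, store the extracted data in a pandas DataFrame.

    But this does not work, and neither does the iteration from A to Z: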

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    # Function to scrape insurers from a given URL
    def scrape_insurers(url):
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            # Extracting insurer names
            insurers = [insurer.text.strip() for insurer in soup.find_all('h3')]
            return insurers
        else:
            print("Failed to retrieve data from", url)
            return []
    
    # Define the base URL
    base_url = "https://www.unespa.es/en/directory/"
    
    # List to store all insurers
    all_insurers = []
    
    # Loop through each page (A to Z)
    for char in range(65, 91):  # ASCII codes for A to Z
        page_url = f"{base_url}#{chr(char)}"
        insurers = scrape_insurers(page_url)
        all_insurers.extend(insurers)
    
    # Convert the list of insurers to a pandas DataFrame
    df = pd.DataFrame({'Insurer': all_insurers})
    
    # Display the DataFrame
    print(df.head())
    
    # Save DataFrame to a CSV file
    df.to_csv('insurers_spain.csv', index=False)
    

    ... and fails with the following output:

    Failed to retrieve data from https://www.unespa.es/en/directory/#A
    Failed to retrieve data from https://www.unespa.es/en/directory/#B
    Failed to retrieve data from https://www.unespa.es/en/directory/#C
    Failed to retrieve data from https://www.unespa.es/en/directory/#D
    Failed to retrieve data from https://www.unespa.es/en/directory/#E
    

    and so on.
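    A note on the URLs in the output above: everything after `#` is a URL fragment, which an HTTP client keeps to itself and does not send to the server. The 26 URLs built by the loop therefore all ask for the same document. A small offline check with the standard library only (the variable names `page_urls` and `requested` are illustrative):

```python
import string
from urllib.parse import urldefrag

base_url = "https://www.unespa.es/en/directory/"

# The 26 URLs the loop builds: .../#A through .../#Z
page_urls = [f"{base_url}#{letter}" for letter in string.ascii_uppercase]

# urldefrag() splits off the fragment; .url is the part that is actually requested
requested = {urldefrag(page_url).url for page_url in page_urls}

print(len(page_urls), "URLs built,", len(requested), "distinct after dropping fragments")
print(requested)
```

    If this holds, looping over the letters cannot fetch different data; a single request to the directory URL returns whatever the server has to offer.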

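    Well, I think the first step, reducing the complexity, is fairly easy.

    I think it is best to take just one of the URLs I want to access and test what result the request returns. Once that is done, I can evaluate the response and then use the BeautifulSoup library to inspect the specific fields the entries have in common. In other words, I should avoid doing three things in one step (which was clearly a mistake).

    So I did this for the first character only, A: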

    import requests
    from bs4 import BeautifulSoup
    
    # Function to scrape insurers from a given URL
    def scrape_insurers(url):
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            # Extracting insurer names
            insurers = [insurer.text.strip() for insurer in soup.find_all('h3')]
            return insurers
        else:
            print("Failed to retrieve data from", url)
            return []
    
    # Define the base URL
    base_url = "https://www.unespa.es/en/directory/#"
    
    # Define the character we want to fetch data for
    char = 'A'
    
    # Construct the URL for the specified character
    url = base_url + char
    
    # Fetch and print data for the specified character
    insurers_char = scrape_insurers(url)
    print(f"Insurers for character '{char}':")
    print(insurers_char)
    

    But see the output here:

    Failed to retrieve data from https://www.unespa.es/en/directory/#A
    Insurers for character 'A':
    []
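    The output above only says that the status code was not 200; it does not say which code came back or why. A minimal sketch of a probe that reports this before any parsing is attempted. The helper names `summarize_response` and `probe` are hypothetical, and the 10-second timeout is an arbitrary choice:

```python
import requests


def summarize_response(response):
    """Collect the facts worth checking before parsing any HTML."""
    return {
        "status": response.status_code,
        "reason": response.reason,
        "content_type": response.headers.get("Content-Type"),
        "bytes": len(response.content),
    }


def probe(url, session=None, **kwargs):
    """Fetch url once and return a summary; extra kwargs go to session.get()."""
    session = session or requests.Session()
    response = session.get(url, timeout=10, **kwargs)
    return summarize_response(response)
```

    Calling `probe(url)` and then `probe(url, headers={"User-Agent": "..."})` with a browser-like value, and comparing the two summaries, would show whether the request headers affect the outcome.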
    
    1 reply
  •  1
  •   Andrej Kesely    1 year ago

    Try:

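    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    # The directory page is requested once, without a #A ... #Z fragment
    url = "https://www.unespa.es/en/directory/"
    
    # Send a browser-like User-Agent instead of the requests default
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0"
    }
    
    soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
    
    data = []
    # Each insurer is selected by the class "contact-item"
    for c in soup.select(".contact-item"):
        # Replace <span> and <a> tags with their contents ...
        for t in c.select("span, a"):
            t.unwrap()
        # ... and merge the resulting adjacent text nodes
        c.smooth()
    
        # The first string is the company name; the rest are "Label: value" strings
        title, *other = c.get_text(separator="|||", strip=True).split("|||")
        data.append(
            {"Title": title, **{(s := d.split(":", maxsplit=1))[0]: s[1] for d in other}}
        )
    
    df = pd.DataFrame(data)
    print(df)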
    

    Prints:

                                                                                          Title                         Tfno.                           Fax                                                         Web                                                                                           Dirección                                          Email
    0                               A.M.A., AGRUPACIÓN MUTUAL ASEGURADORA, MUTUA DE SEGUROS APF                  91 343 47 00                (91) 343 47 68                                   http://www.amaseguros.com                                                              VÍA DE LOS POBLADOS, 3 28033  (MADRID)                                            NaN
    1                                                  ABANCA GENERALES DE SEGUROS Y REASEGUROS         881920742 / 881920744                           NaN                                                         NaN                                                  AV. LINARES RIVAS 30, 3º 15005 A CORUÑA (A CORUÑA)                                            NaN
    2                                     ABANCA VIDA Y PENSIONES DE SEGUROS Y REASEGUROS, S.A.                   981 188 075                           NaN                                                         NaN                                         AVENIDA DE LA MARINA, 1-3ª PLANTA 15001 A CORUÑA (A CORUÑA)                                            NaN
    3                                          ADMIRAL EUROPE COMPAÑIA DE SEGUROS S.A.U. (AECS)                           NaN                           NaN                              https://www.admiraleurope.com/                                               RODRÍGUEZ MARÍN, 61 - 1ª PLANTA 28016 MADRID (MADRID)                                            NaN
    4                                    AEGON ESPAÑA, SOCIEDAD ANÓNIMA DE SEGUROS Y REASEGUROS                  91 563 62 22                           NaN                                         http://www.aegon.es                 VÍA DE LOS POBLADOS, 3 - EDIFICIO 4B - PARQUE EMPRESARIAL CRISTALIA 28033  (MADRID)                                            NaN
    5                                          AGROPELAYO SOCIEDAD DE SEGUROS, SOCIEDAD ANÓNIMA                           NaN                           NaN                                                         NaN                                                             SANTA ENGRACIA, 67 - 69 28010  (MADRID)                                            NaN
    
    
    ...
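    The parsing steps in the answer's loop can be tried offline on a small piece of HTML. The sketch below uses made-up markup: the company name, phone number, web address, and tag layout are assumptions for illustration, not copied from the UNESPA page. It shows what `unwrap()`, `smooth()`, and `get_text()` with a separator do to one entry:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page's structure is assumed, not copied
html = """
<div class="contact-item">
  <h3>EXAMPLE SEGUROS, S.A.</h3>
  <p><span>Tfno.:</span> 91 000 00 00</p>
  <p><span>Web:</span> <a href="http://www.example.com">http://www.example.com</a></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
item = soup.select_one(".contact-item")

# unwrap() replaces each <span>/<a> with its children, leaving bare text nodes
for tag in item.select("span, a"):
    tag.unwrap()

# smooth() merges adjacent text nodes, so each <p> now holds a single string
item.smooth()

# One string per remaining element: the title first, then "Label: value" strings
title, *fields = item.get_text(separator="|||", strip=True).split("|||")

record = {"Title": title}
for field in fields:
    # partition() splits at the first colon only, so URLs keep their "://"
    key, _, value = field.partition(":")
    record[key.strip()] = value.strip()

print(record)
```

    This variant uses `str.partition` rather than `split(":", maxsplit=1)` followed by indexing, so a string without a colon yields an empty value instead of raising `IndexError`; whether such strings occur on the real page is not known from the post.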
    