
Removing tags from Beautiful Soup extracts

  • Abhishek Kulkarni  · 7 years ago

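    I am new to web scraping and am trying to figure out how to remove unwanted tags.

    I want to get the monetary policy announcements and their corresponding dates from the Bank of Canada website. My code is as follows: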

    from bs4 import BeautifulSoup
    import urllib.request  # the request submodule has to be imported explicitly
    r = urllib.request.urlopen('https://www.bankofcanada.ca/content_type/publications/mpr/?post_type%5B0%5D=post&post_type%5B1%5D=page').read()
    soup = BeautifulSoup(r)
    
    soup.prettify()
    letters = soup.find_all("div", class_="media-body")
    lobbying = {}
    for element in letters:
        lobbying[element.a.get_text()] = {}
    print(lobbying)
    

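    The output is attached in the screenshot.

    Expected output:

    April 12, 2017: Canada’s economy is expected to grow by 2 1/2 per cent this year and just below 2 per cent in 2018 and 2019.

    April 13, 2016: Canada’s economy is projected to grow by 1.7 per cent in 2016 and return to potential next year as complex adjustments continue.

    Thanks in advance.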

    1 answer
  • Padraic Cunningham  · 7 years ago

    You want the media-date and media-excerpt tags inside each media div, with the surrounding whitespace stripped:

    from bs4 import BeautifulSoup
    import urllib.request
    
    r = urllib.request.urlopen(
        'https://www.bankofcanada.ca/content_type/publications/mpr/?post_type%5B0%5D=post&post_type%5B1%5D=page').read()
    soup = BeautifulSoup(r, "lxml")
    
    lobbying = {}
    
    # All media/div elements.
    for element in soup.select(".media"):
        # select_one pulls 1 match, pull the text from each tag.
        lobbying[element.select_one(".media-date").text] = element.select_one(".media-excerpt").text.strip()
    print(lobbying)
    

    This will give you:

       {
        'April 18, 2018': 'The Bank’s new forecast calls for economic growth of 2.0 percent this year, 2.1 per cent in 2019 and 1.8 per cent in 2020.',
        'January 17, 2018': 'Growth in the Canadian economy is projected to slow from 3 per cent in 2017 to 2.2 per cent this year and 1.6 per cent in 2019.',
        'October 25, 2017': 'Projections for Canadian economic growth have been increased to 3.1 per cent this year and 2.1 per cent in 2018, with growth of 1.5 per cent forecast for 2019.',
        'July 12, 2017': 'Growth in the Canadian economy is projected to reach 2.8 per cent this year before slowing to 2.0 per cent next year and 1.6 per cent in 2019.',
        'April 12, 2017': 'Canada’s economy is expected to grow by 2 1/2 per cent this year and just below 2 per cent in 2018 and 2019.',
        'January 18, 2017': 'The Canadian economy is expected to expand by 2.1 per cent this year and in 2018.',
        'October 19, 2016': 'Growth in the Canadian economy is expected to increase from 1.1 per cent this year to about 2.0 per cent in 2017 and 2018.',
        'July 13, 2016': 'Canadian economic growth is projected to accelerate from 1.3 per cent this year to 2.2 per cent in 2017.',
        'April 13, 2016': 'Canada’s economy is projected to grow by 1.7 per cent in 2016 and return to potential next year as complex adjustments continue.',
        'January 20, 2016': 'Growth in Canada’s economy is expected to reach 1.4 per cent this year and accelerate to 2.4 per cent in 2017.'}
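
    To try the same selectors without hitting the network, here is a minimal offline sketch. The HTML snippet is hypothetical: it only imitates the class names used above (media, media-body, media-date, media-excerpt), the link title is made up, and the real page markup may differ. It uses the standard-library "html.parser" backend so lxml is not needed, and prints each entry in the "date: text" form from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure the answer relies on.
# The excerpts are copied from the output above; the link text is made up.
SAMPLE_HTML = """
<div class="media">
  <div class="media-body">
    <h3><a href="#">Hypothetical report title A</a></h3>
    <span class="media-date">January 18, 2017</span>
    <div class="media-excerpt">
      The Canadian economy is expected to expand by 2.1 per cent this year and in 2018.
    </div>
  </div>
</div>
<div class="media">
  <div class="media-body">
    <h3><a href="#">Hypothetical report title B</a></h3>
    <span class="media-date">July 13, 2016</span>
    <div class="media-excerpt">
      Canadian economic growth is projected to accelerate from 1.3 per cent this year to 2.2 per cent in 2017.
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

announcements = {}
# ".media" matches elements whose class list contains "media" exactly,
# so the nested "media-body" divs are not selected a second time.
for media in soup.select(".media"):
    date = media.select_one(".media-date").get_text(strip=True)
    excerpt = media.select_one(".media-excerpt").get_text(strip=True)
    announcements[date] = excerpt

# Print in the "date: text" form asked for in the question.
for date, excerpt in announcements.items():
    print(f"{date}: {excerpt}")
```

    Note that element.a.get_text(), as used in the question, would return the link title rather than the date or the excerpt, which is why the selectors target the two more specific classes.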
    

    You can also use a dict comprehension to create the dict:

    lobbying = {el.select_one(".media-date").text: el.select_one(".media-excerpt").text.strip()
                for el in soup.select(".media")}
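
    One caveat with both versions: select_one returns None when nothing matches, so a media block that lacks either tag would raise an AttributeError on .text. The sketch below is one possible way to guard against that, not part of the original answer. The helper name extract_announcements and the sample HTML are hypothetical, and the second sample block deliberately omits its excerpt.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second block has no media-excerpt on purpose.
SAMPLE_HTML = """
<div class="media">
  <span class="media-date">July 12, 2017</span>
  <div class="media-excerpt">
    Growth in the Canadian economy is projected to reach 2.8 per cent this year before slowing to 2.0 per cent next year and 1.6 per cent in 2019.
  </div>
</div>
<div class="media">
  <span class="media-date">October 19, 2016</span>
</div>
"""


def extract_announcements(html):
    """Return {date: excerpt}, skipping blocks that lack either tag."""
    soup = BeautifulSoup(html, "html.parser")
    results = {}
    for el in soup.select(".media"):
        date_tag = el.select_one(".media-date")
        excerpt_tag = el.select_one(".media-excerpt")
        # select_one returns None when there is no match.
        if date_tag is None or excerpt_tag is None:
            continue
        results[date_tag.text.strip()] = excerpt_tag.text.strip()
    return results


lobbying = extract_announcements(SAMPLE_HTML)
print(lobbying)
```

    Because the dates are used as dict keys, two entries published on the same date would overwrite each other; collecting (date, excerpt) tuples in a list avoids that if it matters.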