
BeautifulSoup get_text returns NoneType object

  • Tanay Roman  · 7 years ago

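    I'm trying out BeautifulSoup for web scraping, and I need to extract the headlines from this webpage, specifically from the "more" headline section. This is the code I've tried so far.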

    import requests
    from bs4 import BeautifulSoup
    from csv import writer
    
    response = requests.get('https://www.cnbc.com/finance/?page=1')
    
    soup = BeautifulSoup(response.text,'html.parser')
    
    posts = soup.find_all(id='pipeline')
    
    for post in posts:
        data = post.find_all('li')
        for entry in data:
            title = entry.find(class_='headline')
            print(title)
    

    Running this code prints every headline on the page in the following format:

    <div class="headline">
    <a class=" " data-nodeid="105372063" href="/2018/08/02/after-apple-rallies-to-1-trillion-even-the-uber-bullish-crowd-on-wal.html">
               {{{*HEADLINE TEXT HERE*}}}
    </a> </div>
    

    However, if I call the get_text() method when fetching the title in the code above, I only get the first two headlines.

    title = entry.find(class_='headline').get_text()
    

    They are followed by this error:

    Traceback (most recent call last):
      File "C:\Users\Tanay Roman\Documents\python projects\scrapper.py", line 16, in <module>
        title = entry.find(class_='headline').get_text()
    AttributeError: 'NoneType' object has no attribute 'get_text'
    

    Why does adding the get_text() method return only part of the results, and how do I fix it?

    1 answer

  •   Martijn Pieters    7 years ago

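    You are misreading the error message. It is not the .get_text() call that returns a NoneType object; it is that objects of type NoneType don't have that method.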

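    There is only ever one object of type NoneType, the value None. Here it was returned by entry.find(class_='headline') because no element inside entry matched the search criteria. In other words, for that entry element there is no child element with the class headline.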
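To make this concrete, here is a minimal offline sketch. The HTML snippet is a made-up stand-in for the real page (the class name `sponsored` and the id `ad-slot` are assumptions for illustration only); it shows `find()` returning `None` for an `<li>` without a `headline` child, and the resulting `AttributeError`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <li> has no element with class "headline".
html = """
<ul id="pipeline">
  <li><div class="headline"><a href="/a.html">First headline</a></div></li>
  <li id="ad-slot"><div class="sponsored">Advertisement</div></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
items = soup.find(id="pipeline").find_all("li")

found = items[0].find(class_="headline")    # a Tag object
missing = items[1].find(class_="headline")  # nothing matched, so None

print(type(found).__name__)
print(missing is None)

try:
    missing.get_text()  # calling a method on None fails
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

The exception is raised by the attribute lookup on `None`, before `get_text()` ever runs.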

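    There are two such <li> elements, one with the id nativedvriver3 and the other with nativedvriver9, and you would get the error for both. You need to check first whether there is a matching element: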

    for entry in data:
        headline = entry.find(class_='headline')
        if headline is not None:
            title = headline.get_text()
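The same check can be wrapped in a small helper and tried without network access. This is only a sketch: `headline_texts` is a hypothetical name, and the inline HTML is an invented stand-in for the page, with one `<li>` that deliberately lacks a `headline` element.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; the middle <li> has no headline.
html = """
<ul id="pipeline">
  <li><div class="headline"><a href="/a.html">
      First headline
  </a></div></li>
  <li id="nativedvriver3"><div class="ad">Sponsored</div></li>
  <li><div class="headline"><a href="/b.html">Second headline</a></div></li>
</ul>
"""


def headline_texts(soup):
    """Collect stripped headline text from each <li> under #pipeline that has one."""
    texts = []
    pipeline = soup.find(id="pipeline")
    if pipeline is None:  # the container itself may be absent
        return texts
    for entry in pipeline.find_all("li"):
        headline = entry.find(class_="headline")
        if headline is not None:  # skip entries without a headline
            texts.append(headline.get_text(strip=True))
    return texts


soup = BeautifulSoup(html, "html.parser")
titles = headline_texts(soup)
print(titles)
```

Guarding the `#pipeline` lookup as well means the helper returns an empty list instead of failing if the page layout changes.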
    

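    You avoid the problem entirely if you use a CSS selector, because select() returns only the elements that actually match: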

    headlines = soup.select('#pipeline li .headline')
    for headline in headlines:
        headline_text = headline.get_text(strip=True)
        print(headline_text)
    

    This produces:

    >>> headlines = soup.select('#pipeline li .headline')
    >>> for headline in headlines:
    ...     headline_text = headline.get_text(strip=True)
    ...     print(headline_text)
    ...
    Hedge funds fight back against tech in the war for talent
    Goldman Sachs sees more price pain ahead for bitcoin
    Dish Network shares rise 15% after subscriber losses are less than expected
    Bitcoin whale makes ‘enormous’ losing bet, so now other traders have to foot the bill
    The 'Netflix of fitness' looks to become a publicly traded stock as soon as next year
    Amazon slammed for ‘insult’ tax bill in the UK despite record profits
    Nasdaq could plunge 15 percent or more as ‘rolling bear market’ grips stocks: Morgan Stanley
    Take-Two shares surge 9% after gamemaker beats expectations due to 'Grand Theft Auto Online'
    UK bank RBS announces first dividend in 10 years
    Michael Cohen reportedly secured a $10 million deal with Trump donor to advance a nuclear project
    After-hours buzz: GPRO, AIG & more
    Bitcoin is still too 'unstable' to become mainstream money, UBS says
    Apple just hit a trillion but its stock performance has been dwarfed by the other tech giants
    The first company to ever reach $1 trillion in market value was in China and got crushed
    Apple at a trillion-dollar valuation isn’t crazy like the dot-com bubble
    After Apple rallies to $1 trillion, even the uber bullish crowd on Wall Street believes it may need to cool off
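
The output above came from the live page at the time, so it cannot be reproduced exactly today. As a rough offline check of the selector approach, the sketch below runs the same selector against a made-up HTML fragment (an assumption, not the real page markup) and shows that the `<li>` without a `headline` child is simply absent from the results.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: three <li> elements, only two with a headline.
html = """
<ul id="pipeline">
  <li><div class="headline"><a href="/a.html">First headline</a></div></li>
  <li id="nativedvriver3"><div class="ad">Sponsored</div></li>
  <li><div class="headline"><a href="/b.html">Second headline</a></div></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

all_items = soup.select("#pipeline li")            # every list item
headlines = soup.select("#pipeline li .headline")  # only matching descendants

texts = [headline.get_text(strip=True) for headline in headlines]
print(len(all_items), len(headlines))
print(texts)
```

Because `select()` returns a list of matches, there is never a `None` to call `get_text()` on; an empty page just yields an empty list.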
    