
Unable to fetch article results from a Google search

  •  -1
  • vinita  · Tech Community  · 7 years ago

    I am trying to read the content of this link with BeautifulSoup and then extract the article date from span.f.

    import requests
    import json
    from bs4 import BeautifulSoup
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36'}
    from selenium import webdriver
    link="https://www.google.com/search?q=replican+party+announced&ie=utf-8&oe=utf-8&client=firefox-b"
    browser=webdriver.Firefox()
    browser.get(link)
    s=requests.get(link)
    soup5 =BeautifulSoup(s.content,'html.parser')
    

    Now I want to extract all of the article dates, e.g. <span class="f">Apr 27, 2018 - </span>, along with the corresponding link URLs, but this code doesn't fetch anything for me:

    for i in soup5.find_all("div",{"class":"g"}):
        print (i.find_all("span",{"class":"f"}))
    
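    To check whether the selectors themselves are the problem, here is a minimal sketch against a hypothetical HTML snippet (made-up date and URL) that mimics the markup the loop above expects:

    ```python
    from bs4 import BeautifulSoup

    # Hypothetical snippet mimicking the expected Google result markup.
    html = '''
    <div class="g">
      <span class="st"><span class="f">Apr 27, 2018 - </span>some snippet text</span>
      <cite class="iUh30">https://example.com/article</cite>
    </div>
    '''
    soup5 = BeautifulSoup(html, 'html.parser')
    for i in soup5.find_all("div", {"class": "g"}):
        print(i.find_all("span", {"class": "f"}))
    ```

    If this prints the span while the live response does not, the HTML returned to requests differs from what the browser sees; passing headers=headers to requests.get is the usual first thing to check.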
    2 Answers  |  7 years ago
        1
  •  1
  •   undetected Selenium    7 years ago

    Since you are using Selenium anyway, you don't need requests: you can simply take the page_source, parse it with BeautifulSoup, and call find_all() to print the dates as follows:

    • Code block:

      from bs4 import BeautifulSoup as soup
      from selenium import webdriver
      headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36'}
      link="https://www.google.com/search?q=replican+party+announced&ie=utf-8&oe=utf-8&client=firefox-b"
      browser = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
      browser.get(link)
      soup5 = soup(browser.page_source,'html.parser')
      print("Dates are as follows : ")
      for i in soup5.find_all("span",{"class":"f"}):
          print (i.text)
      print("Link URLs are as follows : ")
      for i in soup5.find_all("cite",{"class":"iUh30"}):
          print (i.text)
      
    • Console output:

      Dates are as follows : 
      Mar 19, 2018 - 
      Apr 27, 2018 - 
      Feb 1, 2018 - 
      Apr 17, 2018 - 
      Jan 9, 2018 - 
      Link URLs are as follows : 
      thehill.com/.../379087-former-gop-lawmaker-announces-hes-leaving-gop-tears-into-tr...
      https://edition.cnn.com/2017/11/10/politics/house-retirement-tracker/index.html
      https://en.wikipedia.org/wiki/Republican_Party_presidential_candidates,_2016
      https://www.cbsnews.com/.../joe-scarborough-announces-hes-leaving...
      

    Update

    If you want to print the dates and the link URLs side by side, use:

    • Code block:

      from bs4 import BeautifulSoup as soup
      from selenium import webdriver
      headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36'}
      link="https://www.google.com/search?q=replican+party+announced&ie=utf-8&oe=utf-8&client=firefox-b"
      browser = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
      browser.get(link)
      soup5 = soup(browser.page_source,'html.parser')
      for i,j in zip(soup5.find_all("span",{"class":"f"}), soup5.find_all("cite",{"class":"iUh30"})):
          print(i.text, j.text)
      
    • Console output:

      Mar 19, 2018 -  thehill.com/.../379087-former-gop-lawmaker-announces-hes-leaving-gop-tears-into-tr...
      Apr 27, 2018 -  https://edition.cnn.com/2017/11/10/politics/house-retirement-tracker/index.html
      Feb 1, 2018 -  https://en.wikipedia.org/wiki/Republican_Party_presidential_candidates,_2016
      Apr 17, 2018 -  https://www.cbsnews.com/.../joe-scarborough-announces-hes-leaving...
      Jan 9, 2018 -  www.travisgop.com/2018_precinct_conventions
      
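    One caveat with the zip() pairing above: zip stops at the shorter sequence, so a result that lacks a date silently drops or misaligns pairs. A quick illustration with made-up lists:

    ```python
    # Made-up lists: three URLs but only two dates.
    dates = ["Mar 19, 2018 - ", "Apr 27, 2018 - "]
    urls = ["thehill.com/...", "https://edition.cnn.com/...", "https://en.wikipedia.org/..."]

    pairs = list(zip(dates, urls))
    print(pairs)  # only two pairs; the third URL is silently dropped
    ```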
        2
  •  2
  •   Zilong Li    7 years ago

    You don't need Selenium for this task. Use Beautiful Soup's .select() method as follows:

    import requests
    from bs4 import BeautifulSoup
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36'}
    
    link = "https://www.google.com/search?q=replican+party+announced&ie=utf-8&oe=utf-8&client=firefox-b"
    
    r = requests.get(link, headers=headers, timeout=4)
    
    encoding = r.encoding if 'charset' in r.headers.get('content-type','').lower() else None
    
    soup = BeautifulSoup(r.content, 'html.parser', from_encoding=encoding)
    
    for d in soup.select("div.s > div"):
        # check if date exists
        if d.select("span.st > span.f"):
            date = d.select("span.st > span.f")
            link = d.select("div.f > cite")
            print(date[0].text)
            print(link[0].text)
    

    Output:

    2018. 4. 27. - 
    https://www.cnn.com/2017/11/10/politics/house.../index.html
    2018. 3. 19. - 
    thehill.com/.../379087-former-gop-lawmaker-announces-hes-leav...
    2018. 4. 11. - 
    https://www.nytimes.com/2018/04/11/us/.../paul-ryan-speaker.htm...
    2017. 10. 24. - 
    https://www.theguardian.com/.../jeff-flake-retire-republican-senat...
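    The > in the selectors above is the CSS child combinator, so each selector only matches direct children. A minimal sketch on a hypothetical snippet (made-up date and URL) mirroring that structure:

    ```python
    from bs4 import BeautifulSoup

    # Hypothetical snippet mirroring the div.s > div structure the selectors target.
    html = '''
    <div class="s">
      <div>
        <div class="f"><cite>https://example.com/story</cite></div>
        <span class="st"><span class="f">Apr 27, 2018 - </span>snippet</span>
      </div>
    </div>
    '''
    soup = BeautifulSoup(html, 'html.parser')
    for d in soup.select("div.s > div"):
        if d.select("span.st > span.f"):
            print(d.select("span.st > span.f")[0].text)  # Apr 27, 2018 -
            print(d.select("div.f > cite")[0].text)      # https://example.com/story
    ```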