
Unable to fetch article results from a Google search

  •  -1
  • vinita  · Tech Community  · 7 years ago

    I am trying to read the content of this link with BeautifulSoup and then extract the article date from span.f.

    import requests
    import json
    from bs4 import BeautifulSoup
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36'}
    from selenium import webdriver
    link="https://www.google.com/search?q=replican+party+announced&ie=utf-8&oe=utf-8&client=firefox-b"
    browser=webdriver.Firefox()
    browser.get(link)
    s=requests.get(link)
    soup5 =BeautifulSoup(s.content,'html.parser')
    

    Now I want to extract all of the article dates, e.g. <span class="f">Apr 27, 2018 - </span>, along with the corresponding link URLs, but this code doesn't fetch anything for me:

    for i in soup5.find_all("div",{"class":"g"}):
        print (i.find_all("span",{"class":"f"}))
    
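    To check whether the selectors themselves are the problem, here is a minimal sketch against a hypothetical HTML snippet (made-up date and URL) that mimics the markup the loop above expects:

    ```python
    from bs4 import BeautifulSoup

    # Hypothetical snippet mimicking the expected Google result markup.
    html = '''
    <div class="g">
      <span class="st"><span class="f">Apr 27, 2018 - </span>some snippet text</span>
      <cite class="iUh30">https://example.com/article</cite>
    </div>
    '''
    soup5 = BeautifulSoup(html, 'html.parser')
    for i in soup5.find_all("div", {"class": "g"}):
        print(i.find_all("span", {"class": "f"}))
    ```

    If this prints the span while the live response does not, the HTML returned to requests differs from what the browser sees; passing headers=headers to requests.get is the usual first thing to check.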
    2 Answers  |  7 years ago
        1
  •  1
  •   undetected Selenium    7 years ago

    Since you are using Selenium anyway, you don't need requests: you can simply take the page_source, parse it with BeautifulSoup, and call find_all() to print the dates as follows:

    • Code block:

      from bs4 import BeautifulSoup as soup
      from selenium import webdriver
      headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36'}
      link="https://www.google.com/search?q=replican+party+announced&ie=utf-8&oe=utf-8&client=firefox-b"
      browser = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
      browser.get(link)
      soup5 = soup(browser.page_source,'html.parser')
      print("Dates are as follows : ")
      for i in soup5.find_all("span",{"class":"f"}):
          print (i.text)
      print("Link URLs are as follows : ")
      for i in soup5.find_all("cite",{"class":"iUh30"}):
          print (i.text)
      
    • Console output:

      Dates are as follows : 
      Mar 19, 2018 - 
      Apr 27, 2018 - 
      Feb 1, 2018 - 
      Apr 17, 2018 - 
      Jan 9, 2018 - 
      Link URLs are as follows : 
      thehill.com/.../379087-former-gop-lawmaker-announces-hes-leaving-gop-tears-into-tr...
      https://edition.cnn.com/2017/11/10/politics/house-retirement-tracker/index.html
      https://en.wikipedia.org/wiki/Republican_Party_presidential_candidates,_2016
      https://www.cbsnews.com/.../joe-scarborough-announces-hes-leaving...
      

    Update

    If you want to print the dates and the link URLs side by side, use:

    • Code block:

      from bs4 import BeautifulSoup as soup
      from selenium import webdriver
      headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36'}
      link="https://www.google.com/search?q=replican+party+announced&ie=utf-8&oe=utf-8&client=firefox-b"
      browser = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
      browser.get(link)
      soup5 = soup(browser.page_source,'html.parser')
      for i,j in zip(soup5.find_all("span",{"class":"f"}), soup5.find_all("cite",{"class":"iUh30"})):
          print(i.text, j.text)
      
    • Console output:

      Mar 19, 2018 -  thehill.com/.../379087-former-gop-lawmaker-announces-hes-leaving-gop-tears-into-tr...
      Apr 27, 2018 -  https://edition.cnn.com/2017/11/10/politics/house-retirement-tracker/index.html
      Feb 1, 2018 -  https://en.wikipedia.org/wiki/Republican_Party_presidential_candidates,_2016
      Apr 17, 2018 -  https://www.cbsnews.com/.../joe-scarborough-announces-hes-leaving...
      Jan 9, 2018 -  www.travisgop.com/2018_precinct_conventions
      
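    One caveat with the zip() pairing above: zip stops at the shorter sequence, so a result that lacks a date silently drops or misaligns pairs. A quick illustration with made-up lists:

    ```python
    # Made-up lists: three URLs but only two dates.
    dates = ["Mar 19, 2018 - ", "Apr 27, 2018 - "]
    urls = ["thehill.com/...", "https://edition.cnn.com/...", "https://en.wikipedia.org/..."]

    pairs = list(zip(dates, urls))
    print(pairs)  # only two pairs; the third URL is silently dropped
    ```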
        2
  •  2
  •   Zilong Li    7 years ago

    You don't need Selenium for this task. Use Beautiful Soup's .select() method as follows:

    import requests
    from bs4 import BeautifulSoup
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36'}
    
    link = "https://www.google.com/search?q=replican+party+announced&ie=utf-8&oe=utf-8&client=firefox-b"
    
    r = requests.get(link, headers=headers, timeout=4)
    
    encoding = r.encoding if 'charset' in r.headers.get('content-type','').lower() else None
    
    soup = BeautifulSoup(r.content, 'html.parser', from_encoding=encoding)
    
    for d in soup.select("div.s > div"):
        # check if date exists
        if d.select("span.st > span.f"):
            date = d.select("span.st > span.f")
            link = d.select("div.f > cite")
            print(date[0].text)
            print(link[0].text)
    

    Output:

    2018. 4. 27. - 
    https://www.cnn.com/2017/11/10/politics/house.../index.html
    2018. 3. 19. - 
    thehill.com/.../379087-former-gop-lawmaker-announces-hes-leav...
    2018. 4. 11. - 
    https://www.nytimes.com/2018/04/11/us/.../paul-ryan-speaker.htm...
    2017. 10. 24. - 
    https://www.theguardian.com/.../jeff-flake-retire-republican-senat...
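    The > in the selectors above is the CSS child combinator, so each selector only matches direct children. A minimal sketch on a hypothetical snippet (made-up date and URL) mirroring that structure:

    ```python
    from bs4 import BeautifulSoup

    # Hypothetical snippet mirroring the div.s > div structure the selectors target.
    html = '''
    <div class="s">
      <div>
        <div class="f"><cite>https://example.com/story</cite></div>
        <span class="st"><span class="f">Apr 27, 2018 - </span>snippet</span>
      </div>
    </div>
    '''
    soup = BeautifulSoup(html, 'html.parser')
    for d in soup.select("div.s > div"):
        if d.select("span.st > span.f"):
            print(d.select("span.st > span.f")[0].text)  # Apr 27, 2018 -
            print(d.select("div.f > cite")[0].text)      # https://example.com/story
    ```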