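You can use one of Google's more programmatic APIs to get results instead of screen-scraping the human search interface. The code below has no error checking, and I make no claim that it complies with all of Google's T&Cs, so I suggest you look into the details of using this URL: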
import requests

def search(query, pages=4, rsz=8):
    url = 'https://ajax.googleapis.com/ajax/services/search/web'
    params = {
        'v': 1.0,     # Version
        'q': query,   # Query string
        'rsz': rsz,   # Result set size - max 8
    }
    # The +1 makes the range inclusive, so pages + 1 requests are sent.
    for s in range(0, pages*rsz+1, rsz):
        params['start'] = s
        r = requests.get(url, params=params)
        for result in r.json()['responseData']['results']:
            yield result
E.g. to get 200 results for 'google':
>>> list(search('google', pages=24, rsz=8))
[{'GsearchResultClass': 'GwebSearch',
  'cacheUrl': 'http://www.google.com/search?q=cache:y14FcUQOGl4J:www.google.com',
  'content': "Search the world's information, including webpages, images, videos and more. \n<b>Google</b> has many special features to help you find exactly what you're looking\xa0...",
  'title': '<b>Google</b>',
  'titleNoFormatting': 'Google',
  'unescapedUrl': 'https://www.google.com/',
  'url': 'https://www.google.com/',
  'visibleUrl': 'www.google.com'},
 ...
]
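Each result is a plain dictionary, so pulling out the fields you care about is straightforward. Here is a minimal sketch, assuming the keys shown in the sample output above; the helper name summarize is made up for illustration:

```python
def summarize(results):
    """Return (title, url) pairs from an iterable of result dicts.

    Assumes each dict has the 'titleNoFormatting' and 'unescapedUrl'
    keys seen in the sample output; summarize is a hypothetical helper.
    """
    return [(r['titleNoFormatting'], r['unescapedUrl']) for r in results]


# A hand-written result in the same shape as the sample output.
sample = [{
    'title': '<b>Google</b>',
    'titleNoFormatting': 'Google',
    'unescapedUrl': 'https://www.google.com/',
    'url': 'https://www.google.com/',
    'visibleUrl': 'www.google.com',
}]

for title, url in summarize(sample):
    print(title, url)
```

Since search() is a generator, you can pass it to summarize directly without building a list first.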
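To use Google's Custom Search API you need to sign up as a developer. You get 100 free queries per day (I'm not sure whether that counts each API call, or whether paging through the same query counts as one query):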
You can use requests to make the query:
import requests

url = 'https://www.googleapis.com/customsearch/v1'
params = {
    'key': '<key>',
    'cx': '<cse reference>',
    'q': '<search>',
    'num': 10,
    'start': 1
}

resp = requests.get(url, params=params)
results = resp.json()['items']
With start you can do paging in a similar way to the above.
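As a rough sketch of that paging, the generator below steps start through 1, 11, 21, ... and stops when a page comes back without an 'items' key. The names paged_items and fetch are hypothetical: fetch is any callable you supply that takes a params dict and returns the decoded JSON, which keeps the paging logic separate from the HTTP call. Treating a missing 'items' key as "no more results" is an assumption on my part.

```python
def paged_items(fetch, base_params, pages=3, num=10):
    """Yield result items across several pages of a Custom Search query.

    fetch is a caller-supplied callable (hypothetical hook) that takes a
    params dict and returns the decoded JSON response as a dict.
    """
    for page in range(pages):
        # start is 1-based: page 0 -> 1, page 1 -> 11, ... when num=10
        params = dict(base_params, num=num, start=page * num + 1)
        items = fetch(params).get('items', [])
        if not items:
            # Assumption: a missing or empty 'items' means no more results.
            return
        for item in items:
            yield item
```

With requests, fetch could be something like lambda p: requests.get(url, params=p).json(), and base_params would hold your key, cx and q values. Each page is a separate HTTP request, so keep the daily quota in mind.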
There are plenty of other parameters available; see the REST documentation for the CSE:
https://developers.google.com/custom-search/json-api/v1/reference/cse/list#request
Google also has a client API library:
pip install google-api-python-client
You can then use it like this:
from googleapiclient import discovery

service = discovery.build('customsearch', 'v1', developerKey='<key>')
params = {
    'q': '<query>',
    'cx': '<cse reference>',
    'num': 10,
    'start': 1
}
query = service.cse().list(**params)
results = query.execute()['items']