
Python BeautifulSoup4 web scraper: findAll() not parsing

findall beautifulsoup web-scraping python-3.x

OrangeOwner · 7 years ago

All,

I am trying to build a Python web scraper that extracts all product names from a retail website. The code I am using (in PyCharm) is as follows:

import requests
from bs4 import BeautifulSoup

def louis_spider(max_pages):
    page = 0
    while page <= max_pages:
        url = 'https://us.testcompany.com/eng-us/women/hanbags/_/N-r4xtxc/to-' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for eachItem in soup.findAll('main', {'class': 'content'}):
            printable = eachItem.get('id')
            print(printable)
            print('Test1')
        page += 1

louis_spider(0)

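As it stands (above), the code prints nothing, not even "Test1". I have had luck running it with other inputs to the .findAll() and .get() methods: .findAll('a', {'class':'skiplinks'}) with .get('href') produced "#content Test1", and .findAll('div', {'id':'privateModeMessage'}) with .get('style') produced "display: none Test1". Here is part of the website's "inspect element" code for reference: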

[Image: a snippet of the website's code, providing context for my mentioned attempts which worked]

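Unfortunately, my code block above produces nothing! I get results when I reference any of the lines before the <main> section, but not when I reference <main> itself. Ideally, I would be able to extract the name of every item on the page (see the other snapshot of the website's code for the specific lines I mean). Those lines sit inside the <main> part of the website's code, so I suspect my for loop is never entered for them for the same reason it is never entered for <main>, unlike the elements in my working attempts above. The way I would write this is .findAll('a', {'class': 'productName'}): and .get('class')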

Even so, I cannot find any reason why <main> would be out of reach for BeautifulSoup. Does anyone know why this is happening? Thanks in advance!
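One way to narrow this down is to check whether the element exists at all in the HTML that the request returned, since the browser inspector shows the page after scripts have run. The sketch below is only an illustration under assumptions: both HTML strings are made up (the real page is not shown here), and find_ids is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def find_ids(html, tag, css_class):
    """Return the id attribute of every <tag class=css_class> found in html."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get("id") for el in soup.find_all(tag, class_=css_class)]


# Hypothetical markup standing in for what the browser inspector shows...
rendered_html = """
<a class="skiplinks" href="#content">Skip</a>
<main class="content" id="content">
  <a class="productName">Example bag</a>
</main>
"""

# ...and for what a plain HTTP request might return before any scripts run.
served_html = '<a class="skiplinks" href="#content">Skip</a><div id="app"></div>'

print(find_ids(rendered_html, "main", "content"))  # one match
print(find_ids(served_html, "main", "content"))    # no matches, so a for loop over them never runs
print("<main" in served_html)                      # quick raw-text check on the response body
```

If the raw-text check on the real response (plain_text in the code above) also comes back False, the parser is not at fault: the element was never in the downloaded document.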

1 answer

drec4s · 7 years ago

Based on the code you posted in the comments, you are getting an empty list because your XPath is wrong. The class productPrice is on a span tag, not a div.

You can get the values you want by doing the following:

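# `browser` is the Selenium WebDriver instance from the code posted in the comments.
# On Selenium 4+, use browser.find_elements(By.XPATH, ...) instead.
namesElements = browser.find_elements_by_xpath("//span[@class='productPrice']")
names = [x.text for x in namesElements]
print(names)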
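The effect of the wrong tag name can be reproduced without a browser. The following is a minimal sketch, not the asker's setup: it uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath) as a stand-in for Selenium, and the markup is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup; the real page's structure is not shown in the thread.
snippet = """
<div class="product">
  <a class="productName">Example bag</a>
  <span class="productPrice">$100</span>
</div>
""".strip()

root = ET.fromstring(snippet)

# Wrong tag name: no div carries this class, so the result is an empty list.
wrong = root.findall(".//div[@class='productPrice']")

# Right tag name: the span carrying the class is found.
right = root.findall(".//span[@class='productPrice']")

prices = [el.text for el in right]
print(len(wrong), prices)
```

An XPath that names the wrong tag does not raise an error; it simply matches nothing, which is why the original query returned an empty list.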
