
BeautifulSoup: get the contents of a specific table

tabular beautifulsoup web-scraping python

Adam Matan · 15 years ago

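My local airport blocks users who are not on IE, and its site looks awful. I want to write a Python script that fetches the contents of the Arrivals and Departures pages every few minutes and shows them in a more readable way.

My tools of choice are mechanize, to trick the site into believing I am using IE, and BeautifulSoup, to parse the page and get the flight data table.

Quite honestly, I got lost in the BeautifulSoup documentation. I can't work out how to get the table (whose title I know) out of the whole document, or how to get a list of rows from that table.

Any ideas?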
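
For the "pretend to be IE" step, here is a minimal sketch that uses the standard library's urllib.request in place of mechanize (a swapped-in technique, not what the question uses). The User-Agent string and the URL are assumptions chosen for illustration, and the request is only built here, not sent.

```python
from urllib.request import Request

# Assumption: an Internet Explorer 11 style User-Agent string.
# The site may require a different value.
IE_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Trident/7.0; rv:11.0) like Gecko"


def build_request(url):
    """Return a Request that identifies itself as Internet Explorer."""
    return Request(url, headers={"User-Agent": IE_USER_AGENT})


# Hypothetical URL, for illustration only.
request = build_request("http://airport.example/arrivals")

# To actually download the page (performs network access):
#   from urllib.request import urlopen
#   html = urlopen(request).read()
```

The resulting bytes can then be handed to BeautifulSoup for parsing.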

3 answers | last activity 7 years ago

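PiperWarrior · 7 years ago

This is not the specific code you need, just a demo of how to work with BeautifulSoup. It finds the table whose id is "Table1" and gets all of its tr elements.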

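from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(url).read()  # url: address of the page to scrape
bs = BeautifulSoup(html, "html.parser")
# Find the <table> whose id is "Table1", then collect all of its <tr> rows
table = bs.find(lambda tag: tag.name == 'table' and tag.get('id') == "Table1")
rows = table.find_all('tr')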
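
The lambda-based lookup above can probably be written more compactly. The sketch below shows two equivalent forms, a keyword filter and a CSS selector, run against made-up sample markup (the HTML and flight values are assumptions, not taken from the real airport page).

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for the real page.
SAMPLE_HTML = """
<table id="Other"><tr><td>ignored</td></tr></table>
<table id="Table1">
  <tr><th>Flight</th><th>Time</th></tr>
  <tr><td>LY001</td><td>10:15</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keyword form: match on the tag name and the id attribute.
table = soup.find("table", id="Table1")
rows_from_find = table.find_all("tr")

# CSS-selector form: every <tr> under the table with id "Table1".
rows_from_select = soup.select("table#Table1 tr")
```

Both lists should hold the same two rows. Which form to use is mostly a matter of taste.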

goggin13 · 15 years ago

from bs4 import BeautifulSoup

soup = BeautifulSoup(HTML, "html.parser")  # HTML holds the page source

# the first argument to find is the name of the tag to search for;
# the second can be a dict of attr -> value pairs that the
# matching tag must have
table = soup.find("table", {"title": "TheTitle"})

# rows contains each tr in the table (as a BeautifulSoup Tag object)
rows = list(table.find_all("tr"))

# you can now search each row to pull out the times
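
To go from tr objects to readable data, one option is to collect the text of each cell. This is a minimal sketch under stated assumptions: the sample table, its title, and the helper name table_to_rows are all hypothetical, and the real page may be laid out differently.

```python
from bs4 import BeautifulSoup

# Made-up flight table; the title and values are placeholders.
SAMPLE_HTML = """
<table title="TheTitle">
  <tr><th>Flight</th><th>From</th><th>Time</th></tr>
  <tr><td>LY001</td><td>New York</td><td>10:15</td></tr>
  <tr><td>BA163</td><td>London</td><td>11:40</td></tr>
</table>
"""


def table_to_rows(html, title):
    """Return the table with the given title as a list of rows of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"title": title})
    if table is None:  # find returns None when nothing matches
        return []
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


flights = table_to_rows(SAMPLE_HTML, "TheTitle")
```

Checking for None before calling find_all avoids the common "'NoneType' object has no attribute" error when the table is missing from the page.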


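user338971 · 15 years ago

In case you care, BeautifulSoup is no longer maintained, and the original maintainer suggests moving to lxml. XPath should do the trick nicely.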
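
For comparison, here is roughly what the XPath route could look like with lxml. This is a sketch only: it assumes lxml is installed, and the sample markup and the title "TheTitle" are placeholders rather than the real page.

```python
from lxml import html

# Made-up sample markup standing in for the real page.
SAMPLE_HTML = """
<html><body>
<table title="TheTitle">
  <tr><td>LY001</td><td>10:15</td></tr>
  <tr><td>BA163</td><td>11:40</td></tr>
</table>
</body></html>
"""

tree = html.fromstring(SAMPLE_HTML)

# XPath: every <tr> inside the <table> whose title attribute is "TheTitle".
rows = tree.xpath('//table[@title="TheTitle"]//tr')

# Text of each <td> in each row.
cells = [[td.text_content().strip() for td in tr.xpath("./td")] for tr in rows]
```

The `//tr` step matches rows at any depth under the table, so it should still work if the markup wraps them in a tbody.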
