
How can I capture HTML without the capturing library altering it?

lxml beautifulsoup web-scraping html python

Daniel Quinn · 8 years ago

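Is there a Python library that will let me get an arbitrary HTML snippet without mangling the markup? As far as I can tell, lxml, BeautifulSoup, and pyquery all make it easy to do something like soup.find(".arbitrary-class"), but the HTML they return is reformatted. I want the raw, original markup.

For example, say I have this: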

<html>
  <head>
    <title>test</title>
  </head>
  <body>
    <div class="arbitrary-class">
      This is some<br />
      markup with <br>
      <p>some potentially problematic</p>
      stuff in it <input type="text" name="w00t">
    </div>
  </body>
</html>

I want to capture exactly:

"
      This is some<br />
      markup with <br>
      <p>some potentially problematic</p>
      stuff in it <input type="text" name="w00t">
    "

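…whitespace and all, and without the tags being rewritten into properly formatted markup (<br />, for example).

The problem seems to be that all three libraries build a DOM internally and hand back a Python object representing what the file should be rather than what it is, so I don't know where or how to get at the original snippet I need.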

1 answer

stx101 · 7 years ago

This code:

from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")

# .contents lists the child nodes of the first element matching the class
print(soup.select(".arbitrary-class")[0].contents)

prints a list like this (output as originally posted, from Python 2, hence the u'' prefixes):

[u'\n      This is some', <br/>, u'\n      markup with ', <br/>, u'\n', <p>some potentially problematic</p>, u'\n      stuff in it ', <input name="w00t" type="text"/>, u'\n']

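Edit:

As Daniel pointed out in the comments, this results in normalized markup.

The only alternative I could find is to use a parser generator such as pyparsing. The code below is adapted from some of the example code for the withAttribute function.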

from pyparsing import SkipTo, makeHTMLTags, withClass

html = """<html>
<head>
    <title>test</title>
</head>
<body>
    <div class="arbitrary-class">
    This is some<br />
    markup with <br>
    <p>some potentially problematic</p>
    stuff in it <input type="text" name="w00t">
    </div>
</body>
</html>"""

# makeHTMLTags returns expressions for the opening and the closing tag
div, div_end = makeHTMLTags("div")

# only match div tag having a class attribute with value "arbitrary-class"
div_grid = div().setParseAction(withClass("arbitrary-class"))
# SkipTo captures the raw source text up to the next opening or closing div
grid_expr = div_grid + SkipTo(div | div_end)("body")
for grid_header in grid_expr.searchString(html):
    print(repr(grid_header.body))

The output of this code is:

'\n    This is some<br />\n    markup with <br>\n    <p>some potentially problematic</p>\n    stuff in it <input type="text" name="w00t">'

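Note that the first <br /> now keeps its space, and the <input> tag no longer has a / added before its closing bracket. The only difference from your spec is that the trailing whitespace is missing. You may be able to resolve this difference by refining this solution.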
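One possible way to keep the trailing whitespace as well, offered as a sketch and not as part of the answer above, is a different technique: let the standard library's html.parser report where each tag sits in the source (via getpos() and get_starttag_text()) and slice the original string between those offsets, so nothing is ever re-serialized. The names raw_inner_html and _RawSliceFinder are made up for illustration. The sketch assumes well-formed nesting, returns only the first element carrying the class, and tracks nesting only for tags with the same name as the match.

```python
from html.parser import HTMLParser


class _RawSliceFinder(HTMLParser):
    """Hypothetical helper: records source offsets around the first match."""

    def __init__(self, source, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        # getpos() reports (line, column); remember where each line starts
        # so that pair can be turned back into an index into the source.
        self.line_starts = [0] + [
            i + 1 for i, ch in enumerate(source) if ch == "\n"
        ]
        self.tag = None    # tag name of the matched element
        self.depth = 0     # nesting depth of same-named tags inside it
        self.start = None  # index just after the matched opening tag
        self.end = None    # index of the matching closing tag

    def _offset(self):
        line, col = self.getpos()
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        if self.end is not None:
            return
        if self.start is None:
            classes = (dict(attrs).get("class") or "").split()
            if self.wanted_class in classes:
                self.tag = tag
                self.depth = 1
                # Inner content begins right after the raw opening tag text.
                self.start = self._offset() + len(self.get_starttag_text())
        elif tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.start is None or self.end is not None or tag != self.tag:
            return
        self.depth -= 1
        if self.depth == 0:
            # Inner content stops where the closing tag begins.
            self.end = self._offset()


def raw_inner_html(source, wanted_class):
    """Return the untouched source between the tags, or None if not found."""
    finder = _RawSliceFinder(source, wanted_class)
    finder.feed(source)
    finder.close()
    if finder.start is None or finder.end is None:
        return None
    return source[finder.start:finder.end]


if __name__ == "__main__":
    sample = (
        '<div class="arbitrary-class">\n'
        "  This is some<br />\n"
        '  stuff in it <input type="text" name="w00t">\n'
        "</div>"
    )
    print(repr(raw_inner_html(sample, "arbitrary-class")))
```

Because the result is a slice of the input string, <br />, <br>, attribute order, and all whitespace come back exactly as written. The parser is used only to locate boundaries, so badly broken markup (for example an unclosed matching tag) simply yields None here.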
