
Reading a web page with Python

libxml2 python

Makis · Tech community · 14 years ago

I am trying to read and process a web page with Python. The page contains lines like the following:

              <div class="or_q_tagcloud" id="tag1611"></div></td></tr><tr><td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td><td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td><td class="or_q_rating" id="rating374717">4.0</td><td class="or_q_ownership" id="ownership374717">CD</td><td class="or_q_tags_td">

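For now I am only interested in the artist name (AC/DC) and the album name (Live). I can read and print them with libxml2dom, but I can't figure out how to tell the links apart, because the node value of every link is None.

One obvious approach would be to read the input one line at a time, but is there a smarter way to process this HTML file, so that I can either create two separate lists in which each index matches the other, or build a structure holding this information?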

import urllib
import libxml2dom

def collect_text(node):
    "A function which collects text inside 'node', returning that text."
    s = ""
    for child_node in node.childNodes:
        if child_node.nodeType == child_node.TEXT_NODE:
            s += child_node.nodeValue
        else:
            s += collect_text(child_node)
    return s

# Python 2 syntax (print statement, urllib.urlopen)
f = urllib.urlopen("/home/x/Documents/rym_list.html")
s = f.read()
f.close()

doc = libxml2dom.parseString(s, html=1)

links = doc.getElementsByTagName("a")
for link in links:
    print "--\nNode ", link.childNodes
    # intended to pick out the artist links; localName holds the tag name
    if link.localName == "artist":
        print "artist"
    print collect_text(link).encode('utf-8')

2 answers | latest 14 years ago

MattH 14 years ago

Given that the HTML is slightly broken, I don't know whether this will work on the whole page, but here is how to extract "AC/DC" and "Live" using lxml.etree and XPath.

>>> from lxml import etree
>>> doc = etree.HTML("""<html>
... <head></head>
... <body>
... <tr>
... <td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td>
... <td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td>
... <td class="or_q_rating" id="rating374717">4.0</td><td class="or_q_ownership" id="ownership374717">CD</td>
... <td class="or_q_tags_td">
... </tr>
... </body>
... </html>
... """)
>>> doc.xpath('//td[@class="or_q_artist"]/a/text()|//td[@class="or_q_album"]/a/text()')
['AC/DC', 'Live']
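
The question also asks for two index-matched lists or a single structure. A minimal sketch of one way to get there with the same library, written for Python 3: query each table row separately so that an artist is always paired with the album from the same row. The helper name `extract_pairs` and the `HTML` sample string are made up for this illustration, and the row layout is assumed to match the snippet in the question.

```python
from lxml import etree

# Sample row copied from the question; a real page would hold many such rows.
HTML = """<html><body><table>
<tr>
<td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td>
<td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td>
<td class="or_q_rating" id="rating374717">4.0</td><td class="or_q_ownership" id="ownership374717">CD</td>
</tr>
</table></body></html>"""


def extract_pairs(html_text):
    """Return a list of (artist, album) tuples, one per row that has both cells.

    Hypothetical helper for illustration, not part of lxml.
    """
    doc = etree.HTML(html_text)
    pairs = []
    # Select only rows that contain an artist cell, then query relative to the row.
    for row in doc.xpath('//tr[td[@class="or_q_artist"]]'):
        # string(...) returns the text content of the first matching node, or "".
        artist = row.xpath('string(td[@class="or_q_artist"]/a)').strip()
        album = row.xpath('string(td[@class="or_q_album"]/a)').strip()
        if artist and album:
            pairs.append((artist, album))
    return pairs


pairs = extract_pairs(HTML)

# Two parallel lists whose indexes line up, as asked for in the question.
artists = [artist for artist, _ in pairs]
albums = [album for _, album in pairs]
print(pairs)
```

Working row by row keeps the two lists aligned even if some row is missing one of the links, which a single page-wide union query cannot guarantee.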

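dhruvbird 14 years ago

See whether you can solve this in JavaScript using jQuery-style DOM/CSS selectors to get at the elements/text you want.
If you can, then get a copy of BeautifulSoup for Python and you should be good to go in a matter of minutes.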
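
A rough sketch of what the selector approach could look like, assuming the current `bs4` package (Beautiful Soup 4) rather than the release that was available when this was posted, and assuming the row layout shown in the question. The `HTML` sample string and the `records` name are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Sample row copied from the question; a real page would hold many such rows.
HTML = """<html><body><table>
<tr>
<td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td>
<td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td>
<td class="or_q_rating" id="rating374717">4.0</td><td class="or_q_ownership" id="ownership374717">CD</td>
</tr>
</table></body></html>"""

# "html.parser" is the parser bundled with Python, so no extra parser is needed.
soup = BeautifulSoup(HTML, "html.parser")

records = []
for row in soup.select("tr"):
    # CSS selectors, evaluated relative to the current row.
    artist = row.select_one("td.or_q_artist a.artist")
    album = row.select_one("td.or_q_album a.album")
    if artist is not None and album is not None:
        records.append({
            "artist": artist.get_text(strip=True),
            "album": album.get_text(strip=True),
        })

print(records)
```

Each dictionary keeps one row's artist and album together, which is one way to build the "structure" the question mentions.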
