
Scrapy response returns an empty array

scrapy xpath web-scraping shell python

0

Rodolfo · 2 years ago

I am scraping this page and trying to extract all the rows of the main table.

The following XPath expression should give me the result I want:

//div[@id='TableWithRules']//tbody/tr

While testing in the Scrapy shell, however, I noticed that this expression actually returns an empty array:

#This response is empty: []
response.xpath("//div[@id='TableWithRules']//tbody").extract()
#This one is not:
response.xpath("//div[@id='TableWithRules']//thead").extract()

My guess is that the site owner is trying to restrict scraping of the table data, but is there any way around it?


1

2

Alexander · 2 years ago

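This happens because you are querying an element that does not exist. The tbody element is usually injected into the HTML by the browser when it renders the page; it is not actually present in the source HTML that Scrapy downloads. You can see this if you view the page source.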
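
One way to confirm this outside the browser is to scan the raw markup for the tag. The sketch below uses Python's standard-library html.parser, which reports only the tags literally present and does not insert tbody the way a browser does. The SAMPLE_SOURCE markup and the TagCollector / tags_in_source names are made up for illustration; in the Scrapy shell you would pass response.text instead.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the raw page source (what Scrapy downloads),
# written without a <tbody>, as described above.
SAMPLE_SOURCE = """
<div id="TableWithRules">
  <table>
    <thead><tr><th>Name</th><th>Description</th></tr></thead>
    <tr><td>CVE-1999-0016</td><td>Land IP denial of service.</td></tr>
  </table>
</div>
"""


class TagCollector(HTMLParser):
    """Records every start tag that literally appears in the markup."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def tags_in_source(html):
    """Return the set of tag names present in the raw HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


found = tags_in_source(SAMPLE_SOURCE)
print("thead" in found)  # True: present in the source
print("tbody" in found)  # False: only a browser's rendered DOM would have it
```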

One possible workaround to get all the rows is simply to skip the tbody tag and query the rows directly:

Example: scrapy shell https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=hp

In [1]: rows = response.xpath("//div[@id='TableWithRules']//tr")

In [2]: len(rows)
Out[2]: 3366

Or, if you want to skip the header row, you can select only the rows that contain td cells:

In [1]: rows = response.xpath("//div[@id='TableWithRules']//tr[td]")

In [2]: len(rows)
Out[2]: 3365
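
To see the three queries side by side without hitting the network, here is a minimal sketch using the standard library's xml.etree.ElementTree, whose limited XPath subset happens to cover these expressions. It is a stand-in for Scrapy's selectors, not the same engine, and SAMPLE_SOURCE is an invented, well-formed fragment shaped like the source described above (a thead, no tbody).

```python
import xml.etree.ElementTree as ET

# Invented fragment for illustration only; the real page has thousands of rows.
SAMPLE_SOURCE = """
<html><body>
<div id="TableWithRules">
  <table>
    <thead><tr><th>Name</th><th>Description</th></tr></thead>
    <tr><td>CVE-1999-0016</td><td>Land IP denial of service.</td></tr>
    <tr><td>CVE-1999-0009</td><td>Inverse query buffer overflow in BIND 4.9 and BIND 8 Releases.</td></tr>
  </table>
</div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE_SOURCE)
TABLE = ".//div[@id='TableWithRules']"

body = root.findall(TABLE + "//tbody")       # tbody is absent from the source
all_rows = root.findall(TABLE + "//tr")      # header row plus data rows
data_rows = root.findall(TABLE + "//tr[td]")  # only rows with a td child

print(len(body), len(all_rows), len(data_rows))  # 0 3 2
```

The one-row difference between the last two counts mirrors the 3366 versus 3365 seen in the shell session: the header row holds th cells, so the [td] predicate filters it out.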

2

1

GIA · 2 years ago

If you run this JavaScript in the browser console, it will extract all the names and descriptions from the page.

let trs = document.querySelectorAll('#TableWithRules tbody tr')

trs.forEach((el) => {
    let tds = el.querySelectorAll('td')
    let name = tds[0].innerText;
    let description = tds[1].innerText;
    console.log(name, description)
})

The same approach using Selenium, for example:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=hp")

trs = driver.find_elements(By.XPATH, "//div[@id='TableWithRules']//tbody//tr")
for tr in trs:
    tds = tr.find_elements(By.XPATH, ".//td")
    name = tds[0].text
    description = tds[1].text
    print(name, description)

driver.close()

Output

...
CVE-1999-0016 Land IP denial of service.
CVE-1999-0014 Unauthorized privileged access or denial of service via dtappgather program in CDE.
CVE-1999-0011 Denial of Service vulnerabilities in BIND 4.9 and BIND 8 Releases via CNAME record and zone transfer.
CVE-1999-0010 Denial of Service vulnerability in BIND 8 Releases via maliciously formatted DNS messages.
CVE-1999-0009 Inverse query buffer overflow in BIND 4.9 and BIND 8 Releases.
...

Code explanation

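First, all tr elements are retrieved from the tbody of the #TableWithRules table. This works here because the code runs against the browser's rendered DOM, where tbody exists. A for loop then iterates over these tr elements and collects the td elements each one contains. There are normally two td elements per row: one for the name and one for the description. The text is then read from tds[0] and tds[1].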
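
The same loop can also be sketched against the raw source, without a browser. The version below is an illustration rather than the code from this answer: it uses the standard library's xml.etree.ElementTree on an invented, well-formed SAMPLE_SOURCE fragment, walks the tr elements directly (the raw source has no tbody), and skips any row with fewer than two td cells. The helper names cell_text and extract_rows are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented fragment shaped like the raw source (no <tbody>).
# The last row is deliberately incomplete to exercise the length check.
SAMPLE_SOURCE = """
<div id="TableWithRules">
  <table>
    <thead><tr><th>Name</th><th>Description</th></tr></thead>
    <tr><td>CVE-1999-0016</td><td>Land IP denial of service.</td></tr>
    <tr><td>CVE-1999-0010</td></tr>
  </table>
</div>
""".strip()


def cell_text(cell):
    """Join all text inside a cell, including text of nested tags."""
    return "".join(cell.itertext()).strip()


def extract_rows(root):
    """Return (name, description) pairs for rows with at least two td cells."""
    pairs = []
    for tr in root.iter("tr"):
        tds = tr.findall("td")
        if len(tds) < 2:
            continue  # header row (th cells only) or incomplete row
        pairs.append((cell_text(tds[0]), cell_text(tds[1])))
    return pairs


rows = extract_rows(ET.fromstring(SAMPLE_SOURCE))
print(rows)  # [('CVE-1999-0016', 'Land IP denial of service.')]
```

The length check is a defensive choice: indexing tds[1] on a row with a single cell would raise an IndexError.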

What about THEAD?

The process for THEAD is similar to the above. The main difference is that you target THEAD instead of TBODY and look at th elements instead of td.
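
As a rough sketch of the header case, again using the standard library's xml.etree.ElementTree on an invented fragment: the header labels "Name" and "Description" in SAMPLE_SOURCE are an assumption for illustration, not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Invented fragment; the header labels are assumed for illustration.
SAMPLE_SOURCE = """
<div id="TableWithRules">
  <table>
    <thead><tr><th>Name</th><th>Description</th></tr></thead>
    <tr><td>CVE-1999-0016</td><td>Land IP denial of service.</td></tr>
  </table>
</div>
""".strip()

root = ET.fromstring(SAMPLE_SOURCE)

# Target thead instead of tbody, and th cells instead of td cells.
headers = ["".join(th.itertext()).strip() for th in root.findall(".//thead//th")]
print(headers)  # ['Name', 'Description']
```

Unlike tbody, the thead element is present in the raw source here (the question shows the thead query returning a non-empty result), so in Scrapy an expression along the lines of //div[@id='TableWithRules']//thead//th should match as well.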