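If you're not familiar with the "Developer Tools" view in your browser, please do some research on it before working through this answer. You'll need to have it open in a fresh browser session before you go to the search page to truly see the flow.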
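A plain GET of the page will not return the results: the <form> element uses POST requests (these show up as XHR requests in the Network pane of most Developer Tools). However, this is a poorly crafted site that is overly complex.

I used curlconverter to triage the POST XHR requests. Select the POST request, find the "Copy as cURL" menu item and select it. Then, with that still on the clipboard, follow the instructions in the curlconverter README and manual pages to get an actual httr function back. I really can't promise to walk you through this part or answer curlconverter questions here.

httr / curl will maintain cookies for you. To get a key session variable that you need to pass in with each call, we need to start with a fresh R session and prime the scraping process with a GET of the search page: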
library(stringi) # I prefer this for extracting matched strings
library(xml2)    # xml_find_first() and xml_text(); not attached by newer rvest versions
library(rvest)
library(httr)
primer <- httr::GET("https://jurispub.admin.ch/publiws/pub/search.jsf")
Now we need to extract the session string that is in the JavaScript on that page:
httr::content(primer, as="text") %>%
  stri_match_first_regex("session: '([[:alnum:]]+)'") %>%
  .[,2] -> ice_session
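If the page layout ever changes, that regex silently yields NA and every later POST fails in confusing ways. A minimal sketch of a guard, assuming the same pattern as above; the helper name extract_session and the sample string are made up for illustration:

```r
library(stringi)

# Hypothetical helper: return the session string or fail loudly when it is missing.
extract_session <- function(page_text) {
  id <- stri_match_first_regex(page_text, "session: '([[:alnum:]]+)'")[, 2]
  if (is.na(id)) {
    stop("no session string found; the page layout may have changed")
  }
  id
}

# Made-up sample text, only to show the shape of the match
extract_session("var cfg = {session: 'abc123XYZ', view: 1};")
# "abc123XYZ"
```

In the real flow you would call it as extract_session(httr::content(primer, as="text")).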
# Pretend to be the form: simulate a click on the "first" pager button
httr::POST(
  url = "https://jurispub.admin.ch/publiws/block/send-receive-updates",
  body = list(
    `$ice.submit.partial` = "true",
    ice.event.target = "form:_id64",
    ice.event.captured = "form:_id63first",
    ice.event.type = "onclick",
    ice.event.alt = "false",
    ice.event.ctrl = "false",
    ice.event.shift = "false",
    ice.event.meta = "false",
    ice.event.x = "51",
    ice.event.y = "336",
    ice.event.left = "true",
    ice.event.right = "false",
    form = "form",
    icefacesCssUpdates = "",
    `form:_id63` = "first",
    `form:_idcl` = "form:_id63first",
    ice.session = ice_session,
    ice.view = "1",
    ice.focus = "form:_id63first",
    rand = "0.38654987905551663\\n\\n"
  ),
  encode = "form"
) -> first_pg
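Now that we have the first page, we need the data from it. I'm not going to solve this completely, but you should be able to extrapolate from what follows. The POST response is XML that wraps the updated page HTML, so we pull out its content node, re-parse it as HTML, and find the results table: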
httr::content(first_pg) %>%
  xml_find_first("//updates/update/content") %>%
  xml_text() %>%
  read_html() -> pg_tbl

data_tbl <- html_node(pg_tbl, xpath=".//table[contains(., 'Dossiernummer')]")
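However, this is a horrible use of HTML (the programmers had no clue how to do web stuff properly) and you can't just use html_table() on it (and you wouldn't want to anyway, since you likely want the links to the PDFs or whatnot). So, we can pull out columns at will: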
html_nodes(data_tbl, xpath=".//td[1]/a") %>%
  html_text()
## [1] "A-3930/2013" "D-7885/2009" "E-5869/2012" "C-651/2011" "F-2439/2017" "D-7416/2009"
## [7] "D-838/2011" "C-859/2011" "E-1927/2017" "E-2606/2011"
html_nodes(data_tbl, xpath=".//td[2]/a") %>%
  html_attr("href")
## [1] "/publiws/download?decisionId=0002b1f8-ea53-40bb-8e38-402d9f3fdfa9"
## [2] "/publiws/download?decisionId=0002da8f-306e-4395-8eed-0b168df8634b"
## [3] "/publiws/download?decisionId=0003ec45-50be-45b2-8a56-5c0d866c2603"
## [4] "/publiws/download?decisionId=000508c2-c852-4aef-bc32-3385ddbbe88a"
## [5] "/publiws/download?decisionId=0006fbb9-228a-4bdc-ac8c-52db67df3b34"
## [6] "/publiws/download?decisionId=0008a971-6795-434d-90d4-7aeb1961606b"
## [7] "/publiws/download?decisionId=00099619-519c-4c8f-9cea-a16ed9ab9fd8"
## [8] "/publiws/download?decisionId=0009ac38-f2b0-4733-b379-05682473b5d9"
## [9] "/publiws/download?decisionId=000a4e0f-b2a2-483b-a49f-6ad12f4b7849"
## [10] "/publiws/download?decisionId=000be307-37b1-4d46-b651-223ceec9e533"
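Those href values are site-relative, so they need the host prepended before you can download anything. A small sketch using xml2::url_absolute(), assuming the search page URL from the priming GET as the base; the decisionId shown is the first one from the output above:

```r
library(xml2)

# Base URL: the search page that was fetched in the priming GET
base_url <- "https://jurispub.admin.ch/publiws/pub/search.jsf"

# One of the relative links returned above
rel_links <- "/publiws/download?decisionId=0002b1f8-ea53-40bb-8e38-402d9f3fdfa9"

# Resolve the relative links against the base URL
pdf_urls <- url_absolute(rel_links, base_url)
pdf_urls
```

Whether the download endpoint needs the same session cookies is something to check in Developer Tools; it is not verified here.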
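Lather, rinse, repeat for the other columns, but you may need to do some work to get them out nicely, and that's an exercise left to you (i.e. I won't answer questions about it).

You'll also want the progress text, so you know when you're done scraping: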
html_node(pg_tbl, xpath=".//span[contains(@class, 'iceOutFrmt')]") %>%
  html_text()
## [1] "57,294 Entscheide gefunden, zeige 1 bis 10. Seite 1 von 5,730. Resultat sortiert nach: Relevanz"
Parsing that into the number of results and which page you're on is an exercise left to the reader.
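For what it's worth, one possible shape for that parsing step is sketched here in base R. It assumes the progress text always follows the German wording shown above with commas as thousands separators; parse_progress and the field names are hypothetical:

```r
# Hypothetical helper: split the progress text into its five numbers.
parse_progress <- function(txt) {
  pattern <- paste0(
    "([0-9,]+) Entscheide gefunden, zeige ([0-9,]+) bis ([0-9,]+)\\. ",
    "Seite ([0-9,]+) von ([0-9,]+)"
  )
  m <- regmatches(txt, regexec(pattern, txt))[[1]]
  # Full match plus five capture groups are expected
  if (length(m) != 6) stop("unexpected progress text: ", txt)
  # Strip the thousands separators before converting
  nums <- as.integer(gsub(",", "", m[-1]))
  setNames(as.list(nums), c("total", "from", "to", "page", "pages"))
}

progress <- parse_progress(
  "57,294 Entscheide gefunden, zeige 1 bis 10. Seite 1 von 5,730. Resultat sortiert nach: Relevanz"
)
progress$page   # 1
progress$pages  # 5730
```

Comparing page against pages gives a stopping condition for the paging loop.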
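Now we need to programmatically click "next page" until done. I'm going to do two manual iterations to prove it works, to try to head off "it doesn't work" comments. You should write an iterator or loop to go through all the next pages and save the data however you want.

Next page (first iteration):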
httr::POST(
  url = "https://jurispub.admin.ch/publiws/block/send-receive-updates",
  body = list(
    `$ice.submit.partial` = "true",
    ice.event.target = "form:_id67",
    ice.event.captured = "form:_id63next",
    ice.event.type = "onclick",
    ice.event.alt = "false",
    ice.event.ctrl = "false",
    ice.event.shift = "false",
    ice.event.meta = "false",
    ice.event.x = "330",
    ice.event.y = "559",
    ice.event.left = "true",
    ice.event.right = "false",
    form = "",
    icefacesCssUpdates = "",
    `form:_id63` = "next",
    `form:_idcl` = "form:_id63next",
    iceTooltipInfo = "tooltip_id=form:resultTable:7:tt_ps; tooltip_src_id=form:resultTable:7:_id57; tooltip_state=hide; tooltip_x=846; tooltip_y=433; cntxValue=",
    ice.session = ice_session,
    ice.view = "1",
    ice.focus = "form:_id63next",
    rand = "0.17641832791084566\\n\\n"
  ),
  encode = "form"
) -> next_pg
httr::content(next_pg) %>%
  xml_find_first("//updates/update/content") %>%
  xml_text() %>%
  read_html() -> pg_tbl

data_tbl <- html_node(pg_tbl, xpath=".//table[contains(., 'Dossiernummer')]")

html_nodes(data_tbl, xpath=".//td[1]/a") %>%
  html_text()
## [1] "D-4059/2011" "D-4389/2006" "E-4019/2006" "D-4291/2008" "E-5642/2012" "E-7752/2010"
## [7] "D-7010/2014" "D-1551/2013" "C-7715/2010" "E-3187/2013"
html_nodes(data_tbl, xpath=".//td[2]/a") %>%
  html_attr("href")
## [1] "/publiws/download?decisionId=000bfd02-4da5-4bb2-a5d0-e9977bf8e464"
## [2] "/publiws/download?decisionId=000e2be1-6da8-47ff-b707-4a3537320a82"
## [3] "/publiws/download?decisionId=000fa961-ecb4-47d2-8ca3-72e8824c2c6b"
## [4] "/publiws/download?decisionId=0010a089-4f19-433e-b106-6d75833fae9a"
## [5] "/publiws/download?decisionId=00111bfc-3522-4a32-9e7a-fa2d9f171427"
## [6] "/publiws/download?decisionId=00126b65-b345-4988-826b-b213080caa45"
## [7] "/publiws/download?decisionId=00127944-5c88-43f6-9ef1-3c822288b0c7"
## [8] "/publiws/download?decisionId=00135a17-f1eb-4b61-9171-ac1d27fd3910"
## [9] "/publiws/download?decisionId=0014c6ea-c229-4129-bbe0-7411d34d9743"
## [10] "/publiws/download?decisionId=00167998-54d2-40a5-b02b-0c4546ac4760"
html_node(pg_tbl, xpath=".//span[contains(@class, 'iceOutFrmt')]") %>%
  html_text()
## [1] "57,294 Entscheide gefunden, zeige 11 bis 20. Seite 2 von 5,730. Resultat sortiert nach: Relevanz"
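Notice that the column values are different and the progress text is different, too. Also note that we got lucky: the incompetent programmers on the site actually had a "next" event, rather than forcing us to figure out pagination numbers and X/Y coordinates.

Next page (second iteration):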
httr::POST(
  url = "https://jurispub.admin.ch/publiws/block/send-receive-updates",
  body = list(
    `$ice.submit.partial` = "true",
    ice.event.target = "form:_id67",
    ice.event.captured = "form:_id63next",
    ice.event.type = "onclick",
    ice.event.alt = "false",
    ice.event.ctrl = "false",
    ice.event.shift = "false",
    ice.event.meta = "false",
    ice.event.x = "330",
    ice.event.y = "559",
    ice.event.left = "true",
    ice.event.right = "false",
    form = "",
    icefacesCssUpdates = "",
    `form:_id63` = "next",
    `form:_idcl` = "form:_id63next",
    iceTooltipInfo = "tooltip_id=form:resultTable:7:tt_ps; tooltip_src_id=form:resultTable:7:_id57; tooltip_state=hide; tooltip_x=846; tooltip_y=433; cntxValue=",
    ice.session = ice_session,
    ice.view = "1",
    ice.focus = "form:_id63next",
    rand = "0.17641832791084566\\n\\n"
  ),
  encode = "form"
) -> next_pg
httr::content(next_pg) %>%
  xml_find_first("//updates/update/content") %>%
  xml_text() %>%
  read_html() -> pg_tbl

data_tbl <- html_node(pg_tbl, xpath=".//table[contains(., 'Dossiernummer')]")

html_nodes(data_tbl, xpath=".//td[1]/a") %>%
  html_text()
## [1] "D-3974/2010" "D-5847/2009" "D-4241/2015" "E-3043/2010" "D-602/2016" "C-2065/2008"
## [7] "D-2753/2007" "E-2446/2010" "C-1124/2015" "B-7400/2006"
html_nodes(data_tbl, xpath=".//td[2]/a") %>%
  html_attr("href")
## [1] "/publiws/download?decisionId=00173ef1-2900-49d4-b7d3-39246e552a70"
## [2] "/publiws/download?decisionId=001a344c-86b7-4f32-97f7-94d30669a583"
## [3] "/publiws/download?decisionId=001ae810-300d-4291-8fd0-35de720a6678"
## [4] "/publiws/download?decisionId=001c2025-57dd-4bc6-8bd6-eedbd719a6e3"
## [5] "/publiws/download?decisionId=001c44ba-e605-455d-9609-ed7dffb17adc"
## [6] "/publiws/download?decisionId=001c6040-4b81-4137-a6ee-bad5a5019e71"
## [7] "/publiws/download?decisionId=001d0811-a5c2-4856-aef3-51a44f7f2b0e"
## [8] "/publiws/download?decisionId=001dbf61-b1b8-468d-936e-30b174a8bec9"
## [9] "/publiws/download?decisionId=001ea85a-0765-4a1f-9b81-3cecb9f36b31"
## [10] "/publiws/download?decisionId=001f2e34-9718-4ef7-a60c-e6bbe208003b"
html_node(pg_tbl, xpath=".//span[contains(@class, 'iceOutFrmt')]") %>%
  html_text()
## [1] "57,294 Entscheide gefunden, zeige 21 bis 30. Seite 3 von 5,730. Resultat sortiert nach: Relevanz"
Ideally, you'd wrap the POST in a function you can call that returns data frames you can rbind or bind_rows into one big data frame.
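A rough sketch of what that wrapping could look like follows. The function names (post_next, extract_rows, scrape_next_pages) and the pause argument are hypothetical, the body fields are copied unchanged from the manual iterations above, and extract_rows assumes every row has exactly one link in each of the first two columns. It also assumes the "first" POST has already been sent in the same R session.

```r
library(httr)
library(xml2)
library(rvest)

# Hypothetical helper: send one "next" click and return the parsed HTML fragment.
post_next <- function(ice_session) {
  res <- httr::POST(
    url = "https://jurispub.admin.ch/publiws/block/send-receive-updates",
    body = list(
      `$ice.submit.partial` = "true",
      ice.event.target = "form:_id67",
      ice.event.captured = "form:_id63next",
      ice.event.type = "onclick",
      ice.event.alt = "false",
      ice.event.ctrl = "false",
      ice.event.shift = "false",
      ice.event.meta = "false",
      ice.event.x = "330",
      ice.event.y = "559",
      ice.event.left = "true",
      ice.event.right = "false",
      form = "",
      icefacesCssUpdates = "",
      `form:_id63` = "next",
      `form:_idcl` = "form:_id63next",
      iceTooltipInfo = "tooltip_id=form:resultTable:7:tt_ps; tooltip_src_id=form:resultTable:7:_id57; tooltip_state=hide; tooltip_x=846; tooltip_y=433; cntxValue=",
      ice.session = ice_session,
      ice.view = "1",
      ice.focus = "form:_id63next",
      rand = "0.17641832791084566\\n\\n"
    ),
    encode = "form"
  )
  httr::stop_for_status(res)  # fail early on HTTP errors
  httr::content(res) %>%
    xml_find_first("//updates/update/content") %>%
    xml_text() %>%
    read_html()
}

# Hypothetical helper: turn one parsed page into a two-column data frame.
extract_rows <- function(pg_tbl) {
  data_tbl <- html_node(pg_tbl, xpath = ".//table[contains(., 'Dossiernummer')]")
  data.frame(
    dossier = html_text(html_nodes(data_tbl, xpath = ".//td[1]/a")),
    href = html_attr(html_nodes(data_tbl, xpath = ".//td[2]/a"), "href"),
    stringsAsFactors = FALSE
  )
}

# Click "next" n_pages times, pausing between requests, and stack the results.
scrape_next_pages <- function(ice_session, n_pages, pause = 1) {
  pages <- vector("list", n_pages)
  for (i in seq_len(n_pages)) {
    pages[[i]] <- extract_rows(post_next(ice_session))
    Sys.sleep(pause)
  }
  do.call(rbind, pages)
}
```

Calling scrape_next_pages(ice_session, n_pages = 2) would reproduce the two manual iterations above as a single data frame. For the full run you would derive n_pages from the progress text rather than hard-coding it.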
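If you've made it this far, an alternative approach is to use RSelenium to orchestrate page clicks on the "next page" selector and retrieve the HTML back (the table will still be horrible, and you'll need column targeting or some other HTML selector magic to get useful info out of it, due to the aforementioned programmer ineptitude). RSelenium introduces an external dependency which, as a search here on SO will show, many R users have trouble getting to work, especially on the equally crappy legacy operating system known as Windows. If you can get Selenium running and working with RSelenium, it may be easier in the long run if all of the above seems daunting (you're still going to have to grok Developer Tools at some point, so the above may be worth the pain anyway, and you'll also need to figure out HTML selector targets for the various buttons for Selenium, too).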
I'd seriously avoid phantomjs, as it's now in a "best effort" maintenance state and you'd have to figure out how to do the above with JavaScript instead of R.