VBA: scraping a URL with nested DIVs

vba
  • Chasester  · 4 months ago

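    I have been following This Tutorial to scrape data from a URL, because it matches my needs closely (3 divs deep). Unfortunately, since StackOverflow no longer supports IE, I cannot test the tutorial's code to see whether it works as written. In my case I cannot use a Chrome plugin, and I have to authenticate with the site before navigating to the URL. I also tried the solutions from question 15191847, specifically gembird's solution, and it gave me the same error.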

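    When I run the code below, I get "Run-time error '91'". I printed ie.Document to a text file and verified that the div ids I am searching for are correct and that they are captured. The error is raised on the line Set Questions = QuestionList.Children. Any idea why it is showing me this error?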

    Dim ie As InternetExplorer
    Dim html As HTMLDocument
    Dim QuestionList As IHTMLElement, QuestionField As IHTMLElement
    Dim Questions As IHTMLElementCollection, QuestionFieldLinks As IHTMLElementCollection, QuestionFields As IHTMLElementCollection
    Dim Question As IHTMLElement
    
    Dim RowNumber As Long
    Dim votes As String, url As String, views As String, QuestionId As String
    
    url = "<<my url>>"
    
    'open Internet Explorer in memory, and go to website
    Set ie = New InternetExplorer
    ie.Visible = True
    ie.navigate url
    
    
    'Wait until IE is done loading page
    Do While ie.READYSTATE <> READYSTATE_COMPLETE
      Application.StatusBar = "Trying to go to " & url
      DoEvents
    Loop
    
    Cells.Clear
    'show text of HTML document returned
    Set html = ie.Document
    
    'close down IE and reset status bar
    Set ie = Nothing
    Application.StatusBar = ""
    
    'put heading across the top of row 3
    Range("A3").Value = "Field"
    Range("B3").Value = "Values"
    
    Set QuestionList = html.getElementByID("fieldgroup ")    
    Set Questions = QuestionList.Children
    
    RowNumber = 4
    
    For Each Question In Questions
      If Question.className = "fieldrow _text-field" Then
        'get a list of all of the parts of this question, and loop over them
        Set QuestionFields = Question.all
        
        For Each QuestionField In QuestionFields
          'if this is the question's votes, store it (get rid of any surrounding text)
          If QuestionField.className = "fieldlabel" Then
            Cells(RowNumber, 1).Value = Trim(QuestionField.innerText)
          End If
    
          'likewise for views (getting rid of any text)
          If QuestionField.className = "fieldvalue" Then
            Cells(RowNumber, 2).Value = Trim(QuestionField.innerText)
          End If
        Next QuestionField
        'go on to next row of worksheet
        RowNumber = RowNumber + 1
        
      End If
    Next
    Set html = Nothing
    

    The HTML output looks like this:

    <div class="fieldgroup " style="" group-title="">
     
    <div class="fieldrow _text-field">
    
    <div class="fieldlabel">Reporting</div>
    <div class="fieldvalue">Yes</div>
     </div>
        
    <div class="fieldrow _text-field">
    <div class="fieldlabel">Annotate ''Yes''</div>
    <div class="fieldvalue">Yes</div>
     </div>
    ...
    
    1 answer
  • FunThomas  · 4 months ago

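    You are mixing up id and class. They are different things; see, for example, What's the difference between an id and a class?.
    Your element has a class attribute, not an ID. To search for elements with a specific class attribute, use the function getElementsByClassName. Note that this is a "plural" function: it returns all elements with that class attribute. Even if it finds only one element, it returns a data structure that can hold any number of elements, and you need an index to access one of them. As far as I know, in VBA this is a 0-based collection.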

    If you are always sure that the element is found, use

    Set QuestionList = html.getElementsByClassName("fieldgroup ")(0)
    Set Questions = QuestionList.Children
    

    Or (but in this case you need to declare QuestionList differently, as a collection rather than a single element):

    Set QuestionList = html.getElementsByClassName("fieldgroup ")
    Set Questions = QuestionList(0).Children
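
    For this second variant, the declarations might look like the sketch below. It assumes the same reference to the Microsoft HTML Object Library that the question's code already relies on; the procedure name ListFieldRows is hypothetical and only there to make the fragment self-contained.

    ```vba
    ' Sketch only: ListFieldRows is a hypothetical wrapper, html is the
    ' HTMLDocument obtained from ie.Document as in the question.
    Sub ListFieldRows(ByVal html As HTMLDocument)
        ' For this variant QuestionList holds the whole collection of matches,
        ' so it is declared as a collection instead of IHTMLElement.
        Dim QuestionList As IHTMLElementCollection
        Dim Questions As IHTMLElementCollection

        Set QuestionList = html.getElementsByClassName("fieldgroup ")

        ' Index 0 picks the first matching element; this still fails
        ' if the collection is empty.
        Set Questions = QuestionList(0).Children

        ' Print how many direct children the first "fieldgroup" element has.
        Debug.Print Questions.Length
    End Sub
    ```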
    

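    I have left out error checking, so if the HTML does not contain any element with that class name, you will still get a runtime error. To write robust code, you should add that check.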
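
    One possible way to add that check is sketched below. Run-time error 91 is "Object variable or With block variable not set": in the question, getElementByID finds nothing, QuestionList stays Nothing, and calling .Children on it fails. The helper name GetFirstByClass is hypothetical (it is not part of MSHTML), and the sketch assumes the Microsoft HTML Object Library reference used in the question.

    ```vba
    ' Hypothetical helper: returns the first element carrying the given class,
    ' or Nothing when no such element exists.
    Function GetFirstByClass(ByVal doc As HTMLDocument, ByVal cls As String) As IHTMLElement
        Dim matches As IHTMLElementCollection

        Set matches = doc.getElementsByClassName(cls)

        ' Leave the return value as Nothing when there is no match.
        If matches Is Nothing Then Exit Function
        If matches.Length = 0 Then Exit Function

        Set GetFirstByClass = matches(0)
    End Function
    ```

    Inside the question's procedure it could then be used like this, testing the result before touching .Children:

    ```vba
    Set QuestionList = GetFirstByClass(html, "fieldgroup")
    If QuestionList Is Nothing Then
        MsgBox "No element with class 'fieldgroup' was found."
        Exit Sub
    End If
    Set Questions = QuestionList.Children
    ```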