
How do I form a complete URL from a partial URL found on a web page?

  •  1
  • rvarcher  · 16 years ago

    Say I retrieve the text of the web page https://stackoverflow.com/questions, which contains some links, both real and made up:

        /questions
        /tags
        /questions?sort=votes
        /questions?sort=active
        randompage.aspx
        ../coolhomepage.aspx
    

    Knowing that my originating page is https://stackoverflow.com/questions, is there a way in .NET to resolve the links to these?

        https://stackoverflow.com/questions
        https://stackoverflow.com/tags
        https://stackoverflow.com/questions?sort=votes
        https://stackoverflow.com/questions?sort=active
        https://stackoverflow.com/questions/randompage.aspx
        https://stackoverflow.com/coolhomepage.aspx
    

    Much like the way a browser is smart enough to resolve the links.

    ====================== UPDATE - using David's solution:

        'Regex to match all <a ... /a> links
        Dim myRegEx As New Regex("\<\s*a                   (?# Find opening <a tag)           " & _
                                 ".+?href\s*=\s*['""]      (?# Then all to href=' or "" )     " & _
                                 "(?<href>.*?)['""]        (?# Then all to the next ' or "" ) " & _
                                 ".*?\>                    (?# Then all to > )                " & _
                                 "(?<name>.*?)\<\s*/a\s*\> (?# Then all to </a> )             ", _
                                 RegexOptions.IgnoreCase Or _
                                 RegexOptions.IgnorePatternWhitespace Or _
                                 RegexOptions.Multiline)
    
        'MatchCollection to hold all the links that are matched
        Dim myMatchCollection As MatchCollection
        myMatchCollection = myRegEx.Matches(Me._RawPageText)
    
        'Loop through all matches and evaluate the value of the href attribute.
        For i As Integer = 0 To myMatchCollection.Count - 1
            Dim thisLink As String = ""
            thisLink = myMatchCollection(i).Groups("href").Value()
            'This checks for Javascript and Mailto links.
            'This is not complete. There are others to check I just haven't encountered them yet.
            If thisLink.ToLower.StartsWith("javascript") Then
                thisLink = "JAVASCRIPT: " & thisLink
            ElseIf thisLink.ToLower.StartsWith("mailto") Then
                thisLink = "MAILTO: " & thisLink
            Else
                Dim baseUri As New Uri(Me.URL)
    
                If Not thisLink.ToLower.StartsWith("http") Then
                    'This is a partial URL so we will assume that it's relative to our originating URL
                    Dim myUri As New Uri(baseUri, thisLink)
                    thisLink = "RELATIVE LOCAL LINK: RESOLVED: " & myUri.ToString() & " ORIGINAL: " & thisLink
                Else
                    'The link starts with HTTP, determine if part of base host or is outside host.
                    Dim ThisUri As New Uri(thisLink)
                    If ThisUri.Host.ToLower = baseUri.Host.ToLower Then
                        thisLink = "INSIDE COMPLETE LINK: " & thisLink
                    Else
                        thisLink = "OUTSIDE LINK: " & thisLink
                    End If
                End If
    
            End If
    
            'I'm storing the found links into a Generic.List(Of String)
            'This link has descriptive text added to it.
            'TODO: Make collection to hold only unique internal links.
            Me._Links.Add(thisLink)
        Next
    
    3 replies  |  up to 16 years ago
        1
  •  2
  •   David McEwing    16 years ago

    Do you mean something like this?

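    // Resolve a relative URL against an absolute base URL.
    Uri baseUri = new Uri("http://www.contoso.com");
    Uri myUri = new Uri(baseUri, "catalog/shownew.htm");
    
    // Prints: http://www.contoso.com/catalog/shownew.htm
    Console.WriteLine(myUri.ToString());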
    

    Sample taken from http://msdn.microsoft.com/en-us/library/9hst1w91.aspx
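
    Applied to the links from the question, the same constructor can be sketched as below. This is only an illustration, not part of the original answer: the `Resolve` helper is a hypothetical name, and the snippet assumes C# 9 or later so that top-level statements compile.

```csharp
using System;

// Hypothetical helper: resolves one href against the URL of the page it came from.
static string Resolve(string pageUrl, string href) =>
    new Uri(new Uri(pageUrl), href).AbsoluteUri;

string page = "https://stackoverflow.com/questions";
string[] hrefs =
{
    "/questions",
    "/tags",
    "/questions?sort=votes",
    "/questions?sort=active",
    "randompage.aspx",
    "../coolhomepage.aspx",
};

// Print each original href next to its resolved absolute form.
foreach (string href in hrefs)
{
    Console.WriteLine($"{href} -> {Resolve(page, href)}");
}
```

    One thing to watch: because the base URL has no trailing slash, `randompage.aspx` resolves to https://stackoverflow.com/randompage.aspx, which is also what a browser would do. Passing https://stackoverflow.com/questions/ as the base yields https://stackoverflow.com/questions/randompage.aspx, the form listed in the question.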

        2
  •  1
  •   John Rasch    16 years ago

    If you mean server-side, you can use ResolveUrl():

    string url = ResolveUrl("~/questions");
    
        3
  •  0
  •   Chad Grant    16 years ago

    I don't understand what you mean by "resolve" in this context, but you could try inserting a base HTML element, since you asked how a browser would handle it.

    "The <base> tag specifies a default address or a default target for all links on a page."

    http://www.w3schools.com/TAGS/tag_base.asp
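
    If the goal is to mimic the browser when scraping, the same idea can be applied in reverse: when the downloaded page declares a <base href>, use that as the base URI instead of the page's own URL. The sketch below is a rough illustration under stated assumptions: `EffectiveBase` is a hypothetical helper, the example.com URLs are made up, the regex only handles a quoted href attribute, and top-level statements need C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the URI that relative links should be resolved against.
// Falls back to the page URL when the HTML contains no <base href="..."> element.
static Uri EffectiveBase(string pageUrl, string html)
{
    var pageUri = new Uri(pageUrl);
    Match m = Regex.Match(
        html,
        @"<base\b[^>]*?\bhref\s*=\s*['""](?<href>[^'""]*)['""]",
        RegexOptions.IgnoreCase);

    // The base href may itself be relative, so resolve it against the page URL.
    return m.Success ? new Uri(pageUri, m.Groups["href"].Value) : pageUri;
}

// Made-up page content for illustration only.
string html = "<html><head><base href=\"https://example.com/docs/\"></head>" +
              "<body><a href=\"guide.html\">Guide</a></body></html>";

Uri baseUri = EffectiveBase("https://example.com/index.html", html);
Console.WriteLine(new Uri(baseUri, "guide.html").AbsoluteUri);
```

    A real HTML parser would be more robust than a regex here; the point is only that the base element, when present, replaces the page URL as the starting point for resolution.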