Selecting elements added to the DOM by a script

  •  0
  • Alex  · 15 years ago

    I am trying to select either an <object> or an <embed> tag using:

    HtmlNode videoObjectNode = doc.DocumentNode.SelectSingleNode("//object");
    HtmlNode videoEmbedNode = doc.DocumentNode.SelectSingleNode("//embed");
    

    This doesn't seem to work.

    An embedded YouTube video looks like this:

        <embed height="385" width="640" type="application/x-shockwave-flash" 
    src="http://s.ytimg.com/yt/swf/watch-vfl184368.swf" id="movie_player" flashvars="..." 
    allowscriptaccess="always" allowfullscreen="true" bgcolor="#000000">
    

    I have a feeling the JavaScript might stop the swf player from working; I hope not...

    Cheers

    1 answer
  •  3
  •   Dan Tao    15 years ago

    I think you're looking at this the wrong way, Alex. Suppose I wrote some C# code like this:

    string codeBlock = "if (x == 1) Console.WriteLine(\"Hello, World!\");";
    

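    That line does not contain a Console.WriteLine call, because in the context of a well-formed C# file the text represents a string being assigned to the codeBlock variable.

    In the same way, the <object> and <embed> elements on a YouTube page are not really elements at all: they are text inside JavaScript.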
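
    To make this concrete, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package and a made-up page fragment (SampleHtml is an illustration, not YouTube's real markup). The XPath query for //embed should find nothing, because the parser keeps everything inside <script> as plain text, while the script node's InnerHtml still contains the markup as a string.

```csharp
using System;
using HtmlAgilityPack;

static class ScriptTextDemo
{
    // Hypothetical page fragment: the <embed> markup exists only inside a
    // JavaScript string literal, with its quotes escaped by backslashes.
    public const string SampleHtml =
        "<html><body><script>" +
        "var swf = \"<embed id=\\\"movie_player\\\">\";" +
        "</script></body></html>";

    public static HtmlDocument Load(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    static void Main()
    {
        HtmlDocument doc = Load(SampleHtml);

        // Look for a real <embed> element in the parsed tree.
        HtmlNode embed = doc.DocumentNode.SelectSingleNode("//embed");
        Console.WriteLine("Found <embed> element: " + (embed != null));

        // Look for the same markup as raw text inside the <script> node.
        HtmlNode script = doc.DocumentNode.SelectSingleNode("//script");
        Console.WriteLine("Script text mentions <embed: " + script.InnerHtml.Contains("<embed"));
    }
}
```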

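    In fact, even if HtmlAgilityPack did treat that text as HTML, you still wouldn't get anywhere with these elements, because inside the JavaScript they are heavily escaped with \ characters (see the Unescape method in the code below for how I worked around this).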
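
    As a rough illustration of the escaping problem, the sketch below uses only the .NET base class library. The input string is a hypothetical fragment written the way it might appear inside a JavaScript string literal, and JsUnescapeDemo is a name invented for this example. It removes the backslash in front of each escaped character.

```csharp
using System;
using System.Text.RegularExpressions;

static class JsUnescapeDemo
{
    // Drops the backslash from every backslash-escaped character,
    // so \" becomes " and \/ becomes /.
    public static string Unescape(string escaped)
    {
        return Regex.Replace(escaped, @"\\(.)", "$1");
    }

    static void Main()
    {
        // Hypothetical fragment; in a verbatim string, "" is one quote and \ is literal.
        string escaped = @"<embed type=\""application\/x-shockwave-flash\"" id=\""movie_player\"">";

        Console.WriteLine(escaped);
        Console.WriteLine(Unescape(escaped));
        // Second line printed: <embed type="application/x-shockwave-flash" id="movie_player">
    }
}
```

    Note that this treats every escape as "drop the backslash", so sequences such as \n or \u0026 would not be decoded correctly. It is only adequate for markup whose escapes are limited to quotes and slashes.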

    Here is a small YouTubeScraper class that fishes the <object> and <embed> elements out of the sea of JavaScript.

    using System;
    using System.Net;
    using System.Text.RegularExpressions;
    using HtmlAgilityPack;
    
    class YouTubeScraper
    {
        public HtmlNode FindObjectElement(string url)
        {
            HtmlNodeCollection scriptNodes = FindScriptNodes(url);
    
            for (int i = 0; i < scriptNodes.Count; ++i)
            {
                HtmlNode scriptNode = scriptNodes[i];
    
                string javascript = scriptNode.InnerHtml;
    
                int objectNodeLocation = javascript.IndexOf("<object");
    
                if (objectNodeLocation != -1)
                {
                    string htmlStart = javascript.Substring(objectNodeLocation);
    
                    int objectNodeEndLocation = htmlStart.IndexOf(">\" :");
    
                    if (objectNodeEndLocation != -1)
                    {
                        string finalEscapedHtml = htmlStart.Substring(0, objectNodeEndLocation + 1);
    
                        string unescaped = Unescape(finalEscapedHtml);
    
                        var objectDoc = new HtmlDocument();
    
                        objectDoc.LoadHtml(unescaped);
    
                        HtmlNode objectNode = objectDoc.GetElementbyId("movie_player");
    
                        return objectNode;
                    }
                }
            }
    
            return null;
        }
    
        public HtmlNode FindEmbedElement(string url)
        {
            HtmlNodeCollection scriptNodes = FindScriptNodes(url);
    
            for (int i = 0; i < scriptNodes.Count; ++i)
            {
                HtmlNode scriptNode = scriptNodes[i];
    
                string javascript = scriptNode.InnerHtml;
    
                int approxEmbedNodeLocation = javascript.IndexOf("<\\/object>\" : \"<embed");
    
                if (approxEmbedNodeLocation != -1)
                {
                    // Skip the 15 characters of <\/object>" : " so that the text starts at <embed.
                    string htmlStart = javascript.Substring(approxEmbedNodeLocation + 15);
    
                    int embedNodeEndLocation = htmlStart.IndexOf(">\";");
    
                    if (embedNodeEndLocation != -1)
                    {
                        string finalEscapedHtml = htmlStart.Substring(0, embedNodeEndLocation + 1);
    
                        string unescaped = Unescape(finalEscapedHtml);
    
                        var embedDoc = new HtmlDocument();
    
                        embedDoc.LoadHtml(unescaped);
    
                        HtmlNode videoEmbedNode = embedDoc.GetElementbyId("movie_player");
    
                        return videoEmbedNode;
                    }
                }
            }
    
            return null;
        }
    
        protected HtmlNodeCollection FindScriptNodes(string url)
        {
            var doc = new HtmlDocument();
    
            WebRequest request = WebRequest.Create(url);
            using (var response = request.GetResponse())
            using (var stream = response.GetResponseStream())
            {
                doc.Load(stream);
            }
    
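            HtmlNode root = doc.DocumentNode;
    
            // SelectNodes returns null when nothing matches; return an empty
            // collection instead so that callers can safely read Count.
            HtmlNodeCollection scriptNodes = root.SelectNodes("//script") ?? new HtmlNodeCollection(root);
    
            return scriptNodes;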
        }
    
        static string Unescape(string htmlFromJavascript)
        {
            // The JavaScript has escaped all of its HTML using backslashes. We need
            // to reverse this.
    
            // DISCLAIMER: I am a TOTAL Regex n00b; I make no claims as to the robustness
            // of this code. If you could improve it, please, I beg of you to do so. Personally,
            // I tested it on a grand total of three inputs. It worked for those, at least.
            return Regex.Replace(htmlFromJavascript, @"\\(.)", UnescapeFromBeginning);
        }
    
        static string UnescapeFromBeginning(Match match)
        {
            string text = match.ToString();
    
            if (text.StartsWith("\\"))
            {
                return text.Substring(1);
            }
    
            return text;
        }
    }
    

    If you're interested, here is a little demo I threw together (super fancy, I know):

    class Program
    {
        static void Main(string[] args)
        {
            var scraper = new YouTubeScraper();
    
            HtmlNode davidAfterDentistEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=txqiwrbYGrs");
            Console.WriteLine("David After Dentist:");
            Console.WriteLine(davidAfterDentistEmbedNode.OuterHtml);
            Console.WriteLine();
    
            HtmlNode drunkHistoryObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=jL68NyCSi8o");
            Console.WriteLine("Drunk History:");
            Console.WriteLine(drunkHistoryObjectNode.OuterHtml);
            Console.WriteLine();
    
            HtmlNode jessicaDailyAffirmationEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=qR3rK0kZFkg");
            Console.WriteLine("Jessica's Daily Affirmation:");
            Console.WriteLine(jessicaDailyAffirmationEmbedNode.OuterHtml);
            Console.WriteLine();
    
            HtmlNode jazzerciseObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=VGOO8ZhWFR4");
            Console.WriteLine("Jazzercise - Move your Boogie Body:");
            Console.WriteLine(jazzerciseObjectNode.OuterHtml);
            Console.WriteLine();
    
            Console.Write("Finished! Hit Enter to quit.");
            Console.ReadLine();
        }
    }
    

    Why not use the element's Id instead?

    HtmlNode videoEmbedNode = doc.GetElementbyId("movie_player");
    

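    Update: Did you notice that the <object> and <embed> elements are inside JavaScript? That is why this doesn't work. (They are not really elements as far as HtmlAgilityPack is concerned; all of that JavaScript is actually the inner text of a <script> tag.) You would need to take the text of the <script> tag itself and parse it from there.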