
Agnostic screen scraper using HtmlAgilityPack

  •  1
  • Robert  · Tech Community  · 10 years ago

    Suppose I want a screen scraper that doesn't care whether you pass it an HTML page, a URL pointing to an XML document, or a URL pointing to a text file.

    Examples:

    http://tonto.eia.doe.gov/oog/info/wohdp/dslpriwk.txt

    http://google.com

    This works if the page is HTML or a text file:

    public class ScreenScrapingService : IScreenScrapingService
    {
        public XDocument Scrape(string url)
        {
            var scraper = new HtmlWeb();
            var stringWriter = new StringWriter();
            var xml = new XmlTextWriter(stringWriter);
            scraper.LoadHtmlAsXml(url, xml);
            var text = stringWriter.ToString();
            return XDocument.Parse(text);
        }
    }
    

    However, if it is an XML file, for example:

    http://www.eia.gov/petroleum/gasdiesel/includes/gas_diesel_rss.xml

    [Test]
    public void Scrape_ShouldScrapeSomething()
    {
        //arrange
        var sut = new ScreenScrapingService();
    
        //act
        var result = sut.Scrape("http://www.eia.gov/petroleum/gasdiesel/includes/gas_diesel_rss.xml");
    
        //assert
    
    }
    

    then I get this error:

     An exception of type 'System.Xml.XmlException' occurred in System.Xml.dll but was not handled in user code
    

    Is it possible to write this so that it doesn't care what the URL ends up pointing to?

    1 answer  |  10 years ago
  •  1
  •   Xi Sigma    10 years ago

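    To get the exact exception in Visual Studio, press Ctrl+Alt+E and enable Common Language Runtime Exceptions. It looks like LoadHtmlAsXml expects HTML, so your best option is probably to download the content with WebClient.DownloadString(url) and load it into an HtmlDocument with the OptionOutputAsXml property set to true, as shown below. When that fails, catch the exception and parse the raw string as XML instead.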

    public XDocument Scrape(string url)
    {
        string htmlOrXml;
        using (var wc = new WebClient())
        {
            htmlOrXml = wc.DownloadString(url);
        }

        // Load the downloaded content so that Save() has something to emit as XML.
        var doc = new HtmlDocument() { OptionOutputAsXml = true };
        doc.LoadHtml(htmlOrXml);
        var stringWriter = new StringWriter();
        doc.Save(stringWriter);
        try
        {
            return XDocument.Parse(stringWriter.ToString());
        }
        catch
        {
            // It only gets here when the string is XML already.
            try
            {
                return XDocument.Parse(htmlOrXml);
            }
            catch
            {
                // Neither interpretation produced well-formed XML.
                return null;
            }
        }
    }
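
    A variation on the approach above, offered only as a sketch: try the strict XML parser first, and fall back to HtmlAgilityPack's lenient parser when the content is not well-formed XML. The class name AgnosticScraper and the helper ParseXmlOrHtml are hypothetical names chosen for illustration. The sketch assumes the HtmlAgilityPack package is referenced, and it has not been run against the URLs in the question.

```csharp
using System.IO;
using System.Net;
using System.Xml;
using System.Xml.Linq;
using HtmlAgilityPack;

// Hypothetical helper class; the names are illustrative, not part of any library.
public static class AgnosticScraper
{
    // Parses content that may be XML, HTML, or plain text into an XDocument.
    public static XDocument ParseXmlOrHtml(string content)
    {
        try
        {
            // Well-formed XML (RSS, XHTML, ...) is returned unchanged.
            return XDocument.Parse(content);
        }
        catch (XmlException)
        {
            // Not well-formed XML: let HtmlAgilityPack re-emit it as XML.
            var doc = new HtmlDocument { OptionOutputAsXml = true };
            doc.LoadHtml(content);
            using (var writer = new StringWriter())
            {
                doc.Save(writer);
                // May still throw XmlException if the output is not parseable.
                return XDocument.Parse(writer.ToString());
            }
        }
    }

    // Downloads the resource at the given URL and parses it.
    public static XDocument Scrape(string url)
    {
        using (var client = new WebClient())
        {
            return ParseXmlOrHtml(client.DownloadString(url));
        }
    }
}
```

    Putting the XML attempt first means a feed such as the RSS file in the question is never rewritten by the HTML parser. Separating the parsing from the download also lets the fallback logic be tested on plain strings without network access.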