
HTML scraping and CSS queries

  •  11
  • Costo  · Tech Community  · 6 years ago

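    What are the pros and cons of the following libraries?

    Of these I have used QP, which failed to parse invalid HTML, and Simple HTML DOM Parser, which does a good job but leaks some memory because of its object model. You can keep that under control by calling $object->clear(); unset($object); when you no longer need the object.

    Are there any other scrapers? What has your experience with them been? I am making this a community wiki so that we can build a useful list of libraries for scraping.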


    I ran some tests based on Byron's answer:

        <?php
        include("lib/simplehtmldom/simple_html_dom.php");
        include("lib/phpQuery/phpQuery/phpQuery.php");
    
    
        echo "<pre>";
    
        $html = file_get_contents("http://stackoverflow.com/search?q=favorite+programmer+cartoon");
        $data['pq'] = $data['dom'] = $data['simple_dom'] = array();
    
        $timer_start = microtime(true);
    
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $x = new DOMXPath($dom);
    
        foreach($x->query("//a") as $node)
        {
             $data['dom'][] = $node->getAttribute("href");
        }
    
        foreach($x->query("//img") as $node)
        {
             $data['dom'][] = $node->getAttribute("src");
        }
    
        foreach($x->query("//input") as $node)
        {
             $data['dom'][] = $node->getAttribute("name");
        }
    
        $dom_time =  microtime(true) - $timer_start;
        echo "dom: \t\t $dom_time . Got ".count($data['dom'])." items \n";

        // phpQuery: iterating a result set yields plain DOMElement nodes,
        // so attributes are read with getAttribute().
        $timer_start = microtime(true);
        $doc = phpQuery::newDocument($html);
        foreach( $doc->find("a") as $node)
        {
           $data['pq'][] = $node->getAttribute("href");
        }

        foreach( $doc->find("img") as $node)
        {
           $data['pq'][] = $node->getAttribute("src");
        }

        foreach( $doc->find("input") as $node)
        {
           $data['pq'][] = $node->getAttribute("name");
        }
        $time =  microtime(true) - $timer_start;
        echo "PQ: \t\t $time . Got ".count($data['pq'])." items \n";

        // Simple HTML DOM
        $timer_start = microtime(true);
        $simple_dom = new simple_html_dom();
        $simple_dom->load($html);
        foreach( $simple_dom->find("a") as $node)
        {
           $data['simple_dom'][] = $node->href;
        }
    
        foreach( $simple_dom->find("img") as $node)
        {
           $data['simple_dom'][] = $node->src;
        }
    
        foreach( $simple_dom->find("input") as $node)
        {
           $data['simple_dom'][] = $node->name;
        }
        $simple_dom_time =  microtime(true) - $timer_start;
        echo "simple_dom: \t $simple_dom_time . Got ".count($data['simple_dom'])." items \n";
    
    
        echo "</pre>";
    

    and got:

    dom:         0.00359296798706 . Got 115 items 
    PQ:          0.010568857193 . Got 115 items 
    simple_dom:  0.0770139694214 . Got 115 items 
    
    1 answer  |  most recent 14 years ago
  •  7
  •   Dan Williams    14 years ago

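    I used Simple HTML DOM exclusively until someone showed me the light. Hallelujah.

    Just use the built-in DOM functions. They are written in C and are part of the PHP core. They are faster and more efficient than any third-party solution. With Firebug, getting an XPath query is very simple. This one change made my PHP-based scrapers run faster while saving me precious time.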
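    To illustrate the built-in DOM approach described above, here is a minimal sketch that parses a small, deliberately sloppy HTML string and pulls attributes out with XPath. The sample markup and the collect_attribute helper are assumptions made for illustration only. libxml_use_internal_errors(true) is used here instead of the @ operator to keep parser warnings quiet.

```php
<?php
// Hypothetical helper: return one attribute from every node an XPath query matches.
function collect_attribute(DOMXPath $xpath, string $query, string $attribute): array
{
    $values = [];
    foreach ($xpath->query($query) as $node) {
        $values[] = $node->getAttribute($attribute);
    }
    return $values;
}

// Made-up sample markup; deliberately sloppy (unclosed <p>, unquoted attribute).
$html = <<<HTML
<html><body>
<p>Links:
<a href="/questions">Questions</a>
<a href=/tags class="nav">Tags</a>
<img src="/logo.png" alt="logo">
<form><input name="q" type="text"><input name="page" type="hidden"></form>
</body></html>
HTML;

// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$links  = collect_attribute($xpath, '//a', 'href');
$images = collect_attribute($xpath, '//img', 'src');
$inputs = collect_attribute($xpath, '//input', 'name');

print_r(compact('links', 'images', 'inputs'));
```

    The same three queries (//a, //img, //input) are the ones used in the benchmark scripts in this thread, so the sketch can be dropped into them with a real page in place of the sample string.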

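    My scrapers used to take about 60 megabytes to scrape 10 sites asynchronously with curl. That was even with the Simple HTML DOM memory fix you mentioned.

    Now my PHP processes never go above 8 megabytes.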
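    One way to keep the footprint small when handling several pages, sketched under the assumption that pages can be processed one at a time: parse inside a function so the DOMDocument is local and can be released on return, and keep only the extracted strings. The extract_hrefs function and the inline pages are hypothetical stand-ins, not the scraper described above.

```php
<?php
// Parse one page and return only the strings we need. The DOMDocument and
// DOMXPath objects are local, so PHP can release them when the function returns.
function extract_hrefs(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ((new DOMXPath($dom))->query('//a[@href]') as $node) {
        $hrefs[] = $node->getAttribute('href');
    }
    return $hrefs;
}

// Hypothetical stand-ins for downloaded pages, keyed by a made-up site name.
$pages = [
    'site-a' => '<html><body><a href="/one">1</a><a href="/two">2</a></body></html>',
    'site-b' => '<html><body><p>No links here</body></html>',
    'site-c' => '<html><body><a name="top">anchor</a><a href="/three">3</a></body></html>',
];

$results = [];
foreach ($pages as $site => $html) {
    $results[$site] = extract_hrefs($html);
}

foreach ($results as $site => $hrefs) {
    echo $site, ': ', count($hrefs), " link(s)\n";
}
```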

    Highly recommended.

    EDIT

    Okay, I did some benchmarks. The built-in DOM is at least an order of magnitude faster (times in seconds):

    Built in php DOM: 0.007061
    Simple html  DOM: 0.117781
    
    <?php
    include("../lib/simple_html_dom.php");
    
    $html = file_get_contents("http://stackoverflow.com/search?q=favorite+programmer+cartoon");
    $data['dom'] = $data['simple_dom'] = array();
    
    $timer_start = microtime(true);
    
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $x = new DOMXPath($dom); 
    
    foreach($x->query("//a") as $node) 
    {
         $data['dom'][] = $node->getAttribute("href");
    }
    
    foreach($x->query("//img") as $node) 
    {
         $data['dom'][] = $node->getAttribute("src");
    }
    
    foreach($x->query("//input") as $node) 
    {
         $data['dom'][] = $node->getAttribute("name");
    }
    
    $dom_time =  microtime(true) - $timer_start;
    
    echo "built in php DOM : $dom_time\n";
    
    $timer_start = microtime(true);
    $simple_dom = new simple_html_dom();
    $simple_dom->load($html);
    foreach( $simple_dom->find("a") as $node)
    {
       $data['simple_dom'][] = $node->href;
    }
    
    foreach( $simple_dom->find("img") as $node)
    {
       $data['simple_dom'][] = $node->src;
    }
    
    foreach( $simple_dom->find("input") as $node)
    {
       $data['simple_dom'][] = $node->name;
    }
    $simple_dom_time =  microtime(true) - $timer_start;
    
    echo "simple html  DOM : $simple_dom_time\n";
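    The scripts in this thread time each library once, and a single run can be noisy. As a rough sketch, the measurement could be repeated and the fastest run reported. The best_time helper, the run count of 20, and the synthetic page are all assumptions for illustration; only the built-in DOM is timed so the example has no third-party dependencies.

```php
<?php
// Hypothetical helper: run $task $runs times and return the fastest
// wall-clock time in seconds.
function best_time(callable $task, int $runs = 20): float
{
    $best = INF;
    for ($i = 0; $i < $runs; $i++) {
        $start = microtime(true);
        $task();
        $best = min($best, microtime(true) - $start);
    }
    return $best;
}

// Synthetic page (an assumption) so the sketch runs without network access.
$html = '<html><body>'
      . str_repeat('<a href="/x"><img src="/y.png"></a><input name="z">', 500)
      . '</body></html>';

libxml_use_internal_errors(true);

$count = 0;
$dom_time = best_time(function () use ($html, &$count) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $x = new DOMXPath($dom);
    // Count the same three node types the benchmark scripts query.
    $count = $x->query('//a')->length
           + $x->query('//img')->length
           + $x->query('//input')->length;
    libxml_clear_errors();
});

printf("built in php DOM : %.6f s (best of 20), %d nodes\n", $dom_time, $count);
```

    A closure for Simple HTML DOM or phpQuery could be passed to the same helper to compare the libraries on identical input.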