How can I make my DOMDocument/DOMXPath PHP script stop hogging memory?

php
  • Nick Woodhams  ·  14 years ago

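    I wrote this script to crawl certain links on a forum and extract the username, post date, and post number.

    It works well; the only problem is that it hogs memory, and after about half an hour it slows down noticeably.

    Does anyone have suggestions for speeding it up? I start the script by running wget against it on the server.

    Thanks, Nick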

    <?php
    //this php script is going to download pages and tear them apart from ###
    
    /*
    Here's the process:
    
    1. prepare url 
    2. get new HTML document from the web
    3. extract xpath data
    4. input in mysql database
    */
    
    
    $baseURL="http://www.###.com";
    
    //end viewtopic.php?p=357850
    for ($post = 325479; $post <= 357850; $post++) {
    
    //connect to mysql
    if (!mysql_connect('localhost','###','###')) echo mysql_error();
    mysql_select_db('###');
    
    //check to see if the post is already indexed
    $result = mysql_query("SELECT postnumber FROM ### WHERE postnumber = '$post'");
    if (mysql_num_rows($result) > 0) {
        //echo "Already in the database." . "<br>";
        mysql_close();
        continue;
    }
    
    $url=$baseURL."/viewtopic.php?p=".$post;
    //echo $url."<br>";
    
    //get new HTML document
    $html = new DOMDocument(); 
    $html->loadHTMLFile($url);
    
    $xpath = new DOMXpath($html);
    
    //select the page elements that you want
    //I want the parent of the TD class = forumRow
    $links = $xpath->query( "//td[@class='forumRow']/parent::tr" ); 
    
        foreach($links as $results){
            $newDom = new DOMDocument;
            $newDom->appendChild($newDom->importNode($results,true));
    
            $xpath = new DOMXpath ($newDom);
    
            //which parts of the selection do you want?
            $time_stamp = trim($xpath->query("//td[2]/table/tr/td/span")->item(0)->nodeValue);
            $user_name = trim($xpath->query("//a[@class='genmed']")->item(0)->nodeValue);
            $post_number = trim($xpath->query("//td/a/@name")->item(0)->nodeValue);
    
            $return[] = array(
                'time_stamp' => $time_stamp,
                'username' => $user_name,
                'post_number' => $post_number,
                );
        }
    
        foreach ($return as $output) {
            if (strlen($output['time_stamp']) > 0 && strlen($output['username']) > 0) 
              {
              //$timestamp = substr($output['time_stamp'],8,25);
              //echo $timestamp . "<br>";
              //$unixtimestamp = strtotime($timestamp);
              //echo $unixtimestamp;
              //echo $output['time_stamp']."<br>";
              //match e.g. "Jan 5, 2010 10:30"; the month names need a group, not a character class
              preg_match("/(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, \d{4} \d{1,2}:\d{2}/", $output['time_stamp'], $matches);
              $unixtimestamp = strtotime($matches[0]);
    
              //YYYY-MM-DD HH:MM:SS
              $phpdate=date("Y-m-d H:i:s",$unixtimestamp);
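              //escape scraped values before they go into the SQL strings below
              $username=mysql_real_escape_string($output['username']);
              $post_number=mysql_real_escape_string($output['post_number']);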
              //echo $phpdate ." by ". $username . " #" . $post_number ;
    
              $result = mysql_query("SELECT postnumber FROM ### WHERE postnumber = '$post_number'");
              if (mysql_num_rows($result) == 0) {         
                if (mysql_query("INSERT INTO ### VALUES('','$url','$username','$phpdate','$post_number')")) echo "Y";
                else echo "N";
                mysql_close();
              }
              echo "<br>";
              }
        }
    }
    ?>
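
    One possible direction, offered as a sketch rather than a definitive fix: in the script above, `$return` is never reset inside the `for` loop, so it grows with every page and the second `foreach` re-checks every earlier row on each pass. The helper below, `extractPosts()` (a hypothetical name, not part of the original script), builds a fresh array per page, runs the row queries relative to each `<tr>` with a single `DOMXPath` instead of copying every row into a new `DOMDocument`, and clears libxml's buffered parse errors. Whether this accounts for all of the growth seen on the original server is an assumption.

```php
<?php
// Hypothetical helper (not from the original post): extract the post fields
// from one page of HTML using a single DOMDocument and a single DOMXPath.
function extractPosts($htmlSource)
{
    // Buffer libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($htmlSource);
    $xpath = new DOMXPath($doc);

    $rows = array(); // fresh array on every call, so nothing carries over
    foreach ($xpath->query("//td[@class='forumRow']/parent::tr") as $tr) {
        // Queries relative to $tr replace the per-row DOMDocument copy.
        $time = $xpath->query(".//td[2]/table/tr/td/span", $tr)->item(0);
        $user = $xpath->query(".//a[@class='genmed']", $tr)->item(0);
        $name = $xpath->query(".//td/a/@name", $tr)->item(0);
        if ($time === null || $user === null || $name === null) {
            continue; // row does not have the expected layout
        }
        $rows[] = array(
            'time_stamp'  => trim($time->nodeValue),
            'username'    => trim($user->nodeValue),
            'post_number' => trim($name->nodeValue),
        );
    }

    // Discard the buffered libxml errors so they do not pile up across pages.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    unset($xpath, $doc);

    return $rows;
}
```

    Inside the `for` loop, the page could then be fetched with `file_get_contents($url)` and passed to `extractPosts()`, with the insert loop iterating over the returned array only. Opening the MySQL connection once before the loop and closing it once after, rather than on every iteration, would be a natural companion change.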
    
    1 answer  |  14 years ago
  • netcoder  ·  14 years ago