
Memory leak when web scraping

  •  0
  • Marco Canora  · Tech Community  · 8 years ago

    www.newpct1.com

    const cheerio = require('cheerio');
    const request = require('request');
    
    
    console.log('\"Site\",\"Title\",\"Size\",\"URL\"');
    const baseURL = 'http://newpct1.com/';
    const sites = ['documentales/pg/', 'peliculas/pg/', 'series/pg/', 'varios/pg/'];
    for (let i = 0; i < sites.length; i++) {
      let site = sites[i].split('/')[0];
      for (let j = 1; true; j++) { // Infinite loop
        let siteURL = baseURL + sites[i] + j;
        // getMediaURLs
        // -------------------------------------------------------------------------
        request(siteURL, (err, resp, body) => {
          if (!err) {
            let $ = cheerio.load(body);
            let lis = $('li', 'ul.pelilist');
            // If exists media
            if (lis.length) {
              $('a', lis).each((k, elem) => {
                let mediaURL = $(elem).attr('href');
                // getMediaAttrs
                //------------------------------------------------------------------
                request(mediaURL, (err, resp, body) => {
                  if (!err) {
                    let $ = cheerio.load(body);
                    let title = $('strong', 'h1').text();
                    let size = $('.imp').eq(1).text().split(':')[1];
                    let torrent = $('a.btn-torrent').attr('href');
                    console.log('\"%s\",\"%s\",\"%s\",\"%s\"', site, title, size,
                      torrent);
                  }
                });
                //------------------------------------------------------------------
              });
            }
          }
        });
        // -------------------------------------------------------------------------
      }
    }
    

    The problem with this code is that it never finishes executing and eventually throws this error (a memory leak):

    <--- Last few GCs --->
    
       22242 ms: Mark-sweep 1372.4 (1439.0) -> 1370.7 (1439.0) MB, 1088.7 / 0.0 ms [allocation failure] [GC in old space requested].
       23345 ms: Mark-sweep 1370.7 (1439.0) -> 1370.7 (1439.0) MB, 1103.0 / 0.0 ms [allocation failure] [GC in old space requested].
       24447 ms: Mark-sweep 1370.7 (1439.0) -> 1370.6 (1418.0) MB, 1102.1 / 0.0 ms [last resort gc].
       25527 ms: Mark-sweep 1370.6 (1418.0) -> 1370.6 (1418.0) MB, 1079.5 / 0.0 ms [last resort gc].
    
    
    <--- JS stacktrace --->
    
    ==== JS stack trace =========================================
    
    Security context: 0x272c0e23fa99 <JS Object>
        1: httpify [/home/marco/node_modules/caseless/index.js:~50] [pc=0x3f51b4a2c2c5] (this=0x1e65c39fbdb9 <JS Function module.exports (SharedFunctionInfo 0x1e65c39fb581)>,resp=0x2906174cf6a9 <a Request with map 0x2efe262dbef9>,headers=0x11e0242443f1 <an Object with map 0x2efe26206829>)
        2: init [/home/marco/node_modules/request/request.js:~144] [pc=0x3f51b4a3ee1d] (this=0x2906174cf6a9 <a Requ...
    
    FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
     1: node::Abort() [node]
     2: 0x10d3f9c [node]
     3: v8::Utils::ReportApiFailure(char const*, char const*) [node]
     4: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [node]
     5: v8::internal::Handle<v8::internal::JSFunction> v8::internal::Factory::New<v8::internal::JSFunction>(v8::internal::Handle<v8::internal::Map>, v8::internal::AllocationSpace) [node]
     6: v8::internal::Factory::NewFunction(v8::internal::Handle<v8::internal::Map>, v8::internal::Handle<v8::internal::SharedFunctionInfo>, v8::internal::Handle<v8::internal::Context>, v8::internal::PretenureFlag) [node]
     7: v8::internal::Factory::NewFunctionFromSharedFunctionInfo(v8::internal::Handle<v8::internal::SharedFunctionInfo>, v8::internal::Handle<v8::internal::Context>, v8::internal::PretenureFlag) [node]
     8: v8::internal::Runtime_NewClosure_Tenured(int, v8::internal::Object**, v8::internal::Isolate*) [node]
     9: 0x3f51b47060c7
    

    I tried running it on a machine with more memory (16 GB), but it threw the same error.

    I also took a heap snapshot, but I cannot see where the problem is. The snapshot is here: https://drive.google.com/open?id=0B5Ysugq64wdLSHdHVHctUXZaNGM

    2 answers  |  as of 8 years ago
        1
  •  2
  •   Alexander Mihalicyn    8 years ago

    You can try running Node with --expose-gc and calling $ = null; global.gc(); before/after console.log.

    If the problem stays the same, we can try changing the algorithm and optimizing memory usage.

    Very useful references: https://github.com/cheeriojs/cheerio/issues/830 https://github.com/cheeriojs/cheerio/issues/263
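
    A minimal sketch of the suggestion above, assuming the script is started as `node --expose-gc scraper.js` (the file name is hypothetical). `global.gc` only exists when that flag is passed, so the helper checks for it before calling it. The names `collectGarbage` and `heapUsedMB` are illustrative and not part of cheerio or request.

```javascript
// Returns true if a garbage collection was forced, false if the process
// was started without --expose-gc (in which case global.gc is undefined).
function collectGarbage() {
  if (typeof global.gc === 'function') {
    global.gc();
    return true;
  }
  return false;
}

// Current V8 heap usage in megabytes, handy for before/after comparisons.
function heapUsedMB() {
  return process.memoryUsage().heapUsed / (1024 * 1024);
}

// Usage: drop the reference to the parsed document, then collect.
let $ = { placeholder: 'stands in for the result of cheerio.load(body)' };
const before = heapUsedMB();
$ = null;
const forced = collectGarbage();
console.log('gc forced: %s, heap before: %s MB, after: %s MB',
  forced, before.toFixed(1), heapUsedMB().toFixed(1));
```

    Forcing a collection this way is a diagnostic aid: if heap usage keeps growing even after forced collections, the objects are still reachable and the algorithm itself has to change.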

        2
  •  1
  •   juvian    8 years ago

    A general idea for getting rid of the infinite loop: start one request per site, and whenever a page finishes, request the next page for that site.

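    for (let i = 0; i < sites.length; i++) {
      let site = sites[i].split('/')[0];
      let siteURL = baseURL + sites[i];
      scrapSite(site, siteURL, 1); // the question's loop starts at page 1
    }
    
    // Requests one page and asks for the next one only after this one has finished.
    function scrapSite(site, siteURL, idx) {
        request(siteURL + idx, (err, resp, body) => {
            if (!err) {
                // Same selectors as in the question
                let $ = cheerio.load(body);
                let lis = $('li', 'ul.pelilist');
                let pageExists = lis.length > 0;
    
                // Placeholder for the question's inner per-link request logic
                scrapMedia(site, $, lis);
    
                if (pageExists) {
                    scrapSite(site, siteURL, idx + 1);
                }
            }
        });
    }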
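
    The reason this helps: the synchronous `for (let j = 1; true; j++)` loop in the question never returns control to the event loop, so no request callback can ever run while new request objects keep being allocated. The sketch below shows the page-after-page pattern in a form that runs without network access. `fakeFetch` is a hypothetical stand-in for `request` plus the cheerio parsing (it pretends pages 1 to 3 contain items), and `scrapeAllPages` is an illustrative name, not code from the original post.

```javascript
// Hypothetical stand-in for request() + cheerio parsing:
// pages 1..3 return two items each, every later page is empty.
function fakeFetch(url, callback) {
  const page = Number(url.split('/').pop());
  const items = page <= 3 ? [`item-${page}a`, `item-${page}b`] : [];
  setImmediate(() => callback(null, items));
}

// Fetches page idx, and only once it has completed decides whether to
// fetch page idx + 1. At most one request per site is in flight at a time.
function scrapeAllPages(fetch, siteURL, idx, results, done) {
  fetch(siteURL + idx, (err, items) => {
    if (err) return done(err, results);
    if (items.length === 0) return done(null, results); // no more pages
    results.push(...items);
    scrapeAllPages(fetch, siteURL, idx + 1, results, done);
  });
}

// Usage: one chain per site, as in the answer's outer loop.
scrapeAllPages(fakeFetch, 'http://example.invalid/peliculas/pg/', 1, [], (err, results) => {
  if (err) throw err;
  console.log('collected %d items', results.length);
});
```

    Because each page is requested from inside the previous page's callback, the chain stops by itself at the first empty page, and memory use is bounded by one page per site instead of growing with the loop counter.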