
mrjob returns non-zero exit status 256

kkesley · asked 7 years ago

    I'm new to MapReduce and I'm trying to use the mrjob Python package, but I'm running into the following error:

    ERROR:mrjob.launch:Step 1 of 1 failed: Command '['/usr/bin/hadoop', 'jar', '/usr/lib/hadoop-mapreduce/hadoop-streaming.jar', '-files', 
    'hdfs:///user/hadoop/tmp/mrjob/word_count.hadoop.20180831.035452.437014/files/mrjob.zip#mrjob.zip,
    hdfs:///user/hadoop/tmp/mrjob/word_count.hadoop.20180831.035452.437014/files/setup-wrapper.sh#setup-wrapper.sh,
    hdfs:///user/hadoop/tmp/mrjob/word_count.hadoop.20180831.035452.437014/files/word_count.py#word_count.py', '-archives', 
    'hdfs:///user/hadoop/tmp/mrjob/word_count.hadoop.20180831.035452.437014/files/word_count_ccmr.tar.gz#word_count_ccmr.tar.gz', '-D', 
    'mapreduce.job.maps=4', '-D', 'mapreduce.job.reduces=4', '-D', 'mapreduce.map.java.opts=-Xmx1024m', '-D', 'mapreduce.map.memory.mb=1200', '-D', 
    'mapreduce.output.fileoutputformat.compress=true', '-D', 'mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec', '-D', 
    'mapreduce.reduce.java.opts=-Xmx1024m', '-D', 'mapreduce.reduce.memory.mb=1200', '-input', 'hdfs:///user/hadoop/test-1.warc', '-output', 
    'hdfs:///user/hadoop/gg', '-mapper', 'sh -ex setup-wrapper.sh python word_count.py --step-num=0 --mapper', '-combiner', 
    'sh -ex setup-wrapper.sh python word_count.py --step-num=0 --combiner', '-reducer', 'sh -ex setup-wrapper.sh python word_count.py --step-num=0 --reducer']' 
    returned non-zero exit status 256
    

    I've tried running the job locally with python ./word_count.py input/test-1.warc > output and it works.
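    For context, word_count.py is an mrjob job along these lines (a minimal sketch only; the real cc-mrjob job parses WARC records rather than plain text lines):

    from mrjob.job import MRJob

    class WordCount(MRJob):
        # emit one (word, 1) pair per token in the input line
        def mapper(self, _, line):
            for word in line.split():
                yield word, 1

        # combiner runs map-side to shrink shuffle traffic
        def combiner(self, word, counts):
            yield word, sum(counts)

        def reducer(self, word, counts):
            yield word, sum(counts)

    if __name__ == '__main__':
        WordCount.run()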

    I'm using:

    1. Python 2.7.14
    2. Hadoop 2.8.3-amzn-1
    3. pip 18.0
    4. mrjob 0.6.4

    Here is the command I use to run the MapReduce job. I took run_hadoop.sh from the cc-mrjob repository and made it executable with chmod +x run_hadoop.sh:

    #!/bin/sh
    
    JOB="$1"
    INPUT="$2"
    OUTPUT="$3"
    
    sudo chmod +x $JOB.py
    
    if [ -z "$JOB" ] || [ -z "$INPUT" ] || [ -z "$OUTPUT" ]; then
        echo "Usage: $0 <job> <input> <outputdir>"
        echo "  Run a CommonCrawl mrjob on Hadoop"
        echo
        echo "Arguments:"
        echo "  <job>     CCJob implementation"
        echo "  <input>   input path"
        echo "  <output>  output path (must not exist)"
        echo
        echo "Example:"
        echo "  $0 word_count input/test-1.warc  hdfs:///.../output/"
        echo
        echo "Note: don't forget to adapt the number of maps/reduces and the memory requirements"
        exit 1
    fi
    
    # strip .py from job name
    JOB=${JOB%.py}
    
    # wrap Python files for deployment, cf. below option --setup,
    # see for details
    # http://pythonhosted.org/mrjob/guides/setup-cookbook.html#putting-your-source-tree-in-pythonpath
    tar cvfz ${JOB}_ccmr.tar.gz *.py
    
    # number of maps resp. reduces 
    NUM_MAPS=4
    NUM_REDUCES=4
    
    if [ -n "$S3_LOCAL_TEMP_DIR" ]; then
        S3_LOCAL_TEMP_DIR="--s3_local_temp_dir=$S3_LOCAL_TEMP_DIR"
    else
        S3_LOCAL_TEMP_DIR=""
    fi
    python $JOB.py \
           -r hadoop \
           --jobconf "mapreduce.map.memory.mb=1200" \
           --jobconf "mapreduce.map.java.opts=-Xmx1024m" \
           --jobconf "mapreduce.reduce.memory.mb=1200" \
           --jobconf "mapreduce.reduce.java.opts=-Xmx1024m" \
           --jobconf "mapreduce.output.fileoutputformat.compress=true" \
           --jobconf "mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec" \
           --jobconf "mapreduce.job.reduces=$NUM_REDUCES" \
           --jobconf "mapreduce.job.maps=$NUM_MAPS" \
           --setup 'export PYTHONPATH=$PYTHONPATH:'${JOB}'_ccmr.tar.gz#/' \
           --no-output \
           --cleanup NONE \
           $S3_LOCAL_TEMP_DIR \
           --output-dir "$OUTPUT" \
           "hdfs:///user/hadoop/$INPUT"
    

    And I run it with ./run_hadoop.sh word_count test-1.warc output

    where

    • word_count is the job (the file is named word_count.py)
    • test-1.warc is the input (located at hdfs:///user/hadoop/test-1.warc)
    • output is the output directory (hdfs:///user/hadoop/output). I also make sure to use a different output path for each job so the directory never already exists.
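    As an aside, the same Hadoop settings that the script passes as --jobconf flags can also be declared on the job class itself through mrjob's JOBCONF attribute; a sketch (the class name is illustrative):

    from mrjob.job import MRJob

    class WordCount(MRJob):
        # equivalent to the --jobconf flags in run_hadoop.sh
        JOBCONF = {
            'mapreduce.job.maps': 4,
            'mapreduce.job.reduces': 4,
            'mapreduce.map.memory.mb': 1200,
            'mapreduce.map.java.opts': '-Xmx1024m',
            'mapreduce.reduce.memory.mb': 1200,
            'mapreduce.reduce.java.opts': '-Xmx1024m',
            'mapreduce.output.fileoutputformat.compress': 'true',
            'mapreduce.output.fileoutputformat.compress.codec':
                'org.apache.hadoop.io.compress.BZip2Codec',
        }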

    *Update*

    I've looked at the syslog through the web interface. There is this error:

    org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Could not deallocate container for task attemptId attempt_1536113332062_0001_r_000003_0

    Is this related to the error I'm getting?

    I also got this in one of the map attempts:

    /bin/sh: run_prestart: line 1: syntax error: unexpected end of file

    No module named boto3
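    To check whether the tasks' Python can import boto3 at all, a throwaway diagnostic job like the sketch below can help (MREnvCheck is a made-up name, not part of cc-mrjob):

    import sys
    from mrjob.job import MRJob

    class MREnvCheck(MRJob):
        # emit one record per mapper reporting whether boto3 is importable
        def mapper_final(self):
            try:
                import boto3
                yield 'boto3', boto3.__version__
            except ImportError:
                yield 'boto3', 'missing from %s' % sys.executable

    if __name__ == '__main__':
        MREnvCheck.run()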

    1 Answer
    kkesley · answered 7 years ago

    I got it working by following this blog post:

    http://benjamincongdon.me/blog/2018/02/02/MapReduce-on-Python-is-better-with-MRJob-and-EMR/

    You have to include a .conf file for the hadoop runner, e.g. mrjob.conf.

    In that file, use this:

    runners:
      hadoop:
        setup:
          - 'set -e'
          - VENV=/tmp/$mapreduce_job_id
          - if [ ! -e $VENV ]; then virtualenv $VENV; fi
          - . $VENV/bin/activate
          - 'pip install boto3'
          - 'pip install warc'
          - 'pip install https://github.com/commoncrawl/gzipstream/archive/master.zip'
        sh_bin: '/bin/bash -x'
    
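    Those setup commands build one virtualenv per job on each task node and install the job's dependencies into it, roughly like this sketch in Python (mrjob actually runs the commands as shell through its generated setup-wrapper.sh; bootstrap_venv is just an illustration):

    import os
    import subprocess

    def bootstrap_venv(job_id):
        venv = '/tmp/%s' % job_id
        if not os.path.exists(venv):
            # build the virtualenv only once per node per job
            subprocess.check_call(['virtualenv', venv])
        pip = os.path.join(venv, 'bin', 'pip')
        for pkg in ('boto3', 'warc',
                    'https://github.com/commoncrawl/gzipstream/archive/master.zip'):
            subprocess.check_call([pip, 'install', pkg])
        return venv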

    And in run_hadoop.sh:

    # --conf-path below points at OUR CONFIG FILE (mrjob.conf)
    python $JOB.py \
            --conf-path mrjob.conf \
            -r hadoop \
            --jobconf "mapreduce.map.memory.mb=1200" \
            --jobconf "mapreduce.map.java.opts=-Xmx1024m" \
            --jobconf "mapreduce.reduce.memory.mb=1200" \
            --jobconf "mapreduce.reduce.java.opts=-Xmx1024m" \
            --jobconf "mapreduce.output.fileoutputformat.compress=true" \
            --jobconf "mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec" \
            --jobconf "mapreduce.job.reduces=$NUM_REDUCES" \
            --jobconf "mapreduce.job.maps=$NUM_MAPS" \
            --setup 'export PYTHONPATH=$PYTHONPATH:'${JOB}'_ccmr.tar.gz#/' \
            --cleanup NONE \
            $S3_LOCAL_TEMP_DIR \
            --output-dir "hdfs:///user/hadoop/$OUTPUT" \
            "hdfs:///user/hadoop/$INPUT"
    

    Now if you call ./run_hadoop.sh word_count input/test-1.warc output, it should work!