Compiling TensorFlow Serving with CUDA 9 for the new AWS P3 instances

  • Mikael Rousson  · Tech Community  · 7 years ago

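    I was able to recompile TensorFlow from Amazon's modified sources (provided in their new Deep Learning AMI).

    I am now trying to compile TF Serving against this TensorFlow "fork", but I get an error: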

    ERROR: /root/.cache/bazel/_bazel_root/98acb40d8921d865487eab808ed364b2/external/org_tensorflow/tensorflow/contrib/nccl/BUILD:68:1: undeclared inclusion(s) in rule '@org_tensorflow//tensorflow/contrib/nccl:nccl_kernels':
    this rule is missing dependency declarations for the following files included by 'external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_rewrite.cc':
      '/root/.cache/bazel/_bazel_root/98acb40d8921d865487eab808ed364b2/external/org_tensorflow/tensorflow/core/common_runtime/optimization_registry.h'
      '/root/.cache/bazel/_bazel_root/98acb40d8921d865487eab808ed364b2/external/org_tensorflow/tensorflow/core/common_runtime/device_set.h'
      '/root/.cache/bazel/_bazel_root/98acb40d8921d865487eab808ed364b2/external/org_tensorflow/tensorflow/core/common_runtime/device.h'
      '/root/.cache/bazel/_bazel_root/98acb40d8921d865487eab808ed364b2/external/org_tensorflow/tensorflow/core/graph/types.h'
      '/root/.cache/bazel/_bazel_root/98acb40d8921d865487eab808ed364b2/external/org_tensorflow/tensorflow/core/graph/costmodel.h'
      '/root/.cache/bazel/_bazel_root/98acb40d8921d865487eab808ed364b2/external/org_tensorflow/tensorflow/core/graph/node_builder.h'
    INFO: Elapsed time: 20.377s, Critical Path: 19.47s
    FAILED: Build did NOT complete successfully
    

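    More information: I am using the master branch of TensorFlow Serving (commit 7a349752c2cbbe741edb91c6c6be1c571e91a5fb) and Bazel version 0.7.0.

    I also made the following change to tools/bazel.rc to work around another compilation error: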

    # git diff tools/bazel.rc 
    diff --git a/tools/bazel.rc b/tools/bazel.rc
    index 9397f97..28476f3 100644
    --- a/tools/bazel.rc
    +++ b/tools/bazel.rc
    @@ -1,4 +1,4 @@
    -build:cuda --crosstool_top=@org_tensorflow//third_party/gpus/crosstool
    +build:cuda --crosstool_top=@local_config_cuda//crosstool:toolchain
     build:cuda --define=using_cuda=true --define=using_cuda_nvcc=true
    
     build --force_python=py2
    

    Any idea what is missing?

    1 answer  |  as of 7 years ago
  •   Chris Fregly    7 years ago

    I usually disable NCCL, since it never seems to build properly:

    https://github.com/PipelineAI/pipeline/blob/6261c4f31105e40ab8b24ccc7834f9181f4e5aaf/package/tensorflow/16d39e9-d690fdd/Dockerfile.full-gpu#L160

    RUN \
      cd $TENSORFLOW_SERVING_HOME \
      # Remove NCCL since it isn't building properly
      && sed -i.bak '/nccl/d' tensorflow/tensorflow/contrib/BUILD \
      && bazel build -c opt --config=cuda \
          --verbose_failures \
          --spawn_strategy=standalone --genrule_strategy=standalone \
          --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.1 --copt=-msse4.2 \
          --crosstool_top=@local_config_cuda//crosstool:toolchain \
           tensorflow_serving/... \
      && chmod a+x bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server \
      && cp bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server /usr/local/bin/ \
      && bazel clean --expunge
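
    Because `sed -i.bak '/nccl/d'` deletes every line that mentions "nccl", it may be worth previewing what it removes before running the full build. The following is only a minimal sketch: the BUILD contents are a hypothetical stand-in (the real tensorflow/contrib/BUILD differs), and `workdir` is an illustrative name.

```shell
#!/bin/sh
set -eu

# Hypothetical stand-in for tensorflow/contrib/BUILD; the real file's contents differ.
workdir=$(mktemp -d)
cat > "$workdir/BUILD" <<'EOF'
py_library(
    name = "contrib_py",
    deps = [
        "//tensorflow/contrib/layers:layers_py",
        "//tensorflow/contrib/nccl:nccl_py",
        "//tensorflow/contrib/rnn:rnn_py",
    ],
)
EOF

# Preview the lines that would be deleted, with their line numbers.
grep -n 'nccl' "$workdir/BUILD"

# Delete every line containing "nccl"; the original is kept as BUILD.bak.
sed -i.bak '/nccl/d' "$workdir/BUILD"

# Show what changed. diff exits with status 1 when the files differ,
# so the "|| true" keeps "set -e" from aborting the script.
diff "$workdir/BUILD.bak" "$workdir/BUILD" || true
```

    On the real tree, the same `grep -n` and `diff` against the `.bak` file would show whether the pattern also removed lines you wanted to keep; the backup makes the change easy to revert.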