
GCMLE NotFoundError: libnccl.so.2: cannot open shared object file

  • reese0106  · Tech Community  · 7 years ago

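    I am running a GCMLE experiment and want to use MirroredStrategy, with train_distribute=tf.contrib.distribute.MirroredStrategy(num_gpus=4) and a config file that requests the complex_model_m_p100 machine type, which should have 4 GPUs. I first get the warning Error reported to Coordinator: libnccl.so.2: cannot open shared object file: No such file or directory, and the job eventually fails with NotFoundError: libnccl.so.2: cannot open shared object file: No such file or directory. An issue report seems to imply that "NCCL2" needs to be installed. Is there anything I can do to avoid this error, or is it a problem in the GCMLE backend that is outside my control?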

    Stack trace:

    The replica master 0 exited with a non-zero status of 1. 
    Traceback (most recent call last):
      [...]
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 368, in _batch_reduce
        value_destination_pairs)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 182, in batch_reduce
        return self._batch_reduce(aggregation, value_destination_pairs)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 524, in _batch_reduce
        [v[0] for v in value_destination_pairs])
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 556, in _batch_all_reduce
        device_grad_packs)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/distribute/python/cross_tower_utils.py", line 38, in aggregate_gradients_using_nccl
        agg_grads = nccl.all_sum(single_grads)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 49, in all_sum
        return _apply_all_reduce('sum', tensors)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 217, in _apply_all_reduce
        _validate_and_load_nccl_so()
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 288, in _validate_and_load_nccl_so
        _maybe_load_nccl_ops_so()
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 274, in _maybe_load_nccl_ops_so
        resource_loader.get_path_to_datafile('_nccl_ops.so'))
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/util/loader.py", line 56, in load_op_library
        ret = load_library.load_op_library(path)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/load_library.py", line 56, in load_op_library
        lib_handle = py_tf.TF_LoadLibrary(library_filename)
    NotFoundError: libnccl.so.2: cannot open shared object file: No such file or directory
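
The last frames of the trace show the failure happening when the dynamic loader tries to open libnccl.so.2 on behalf of _nccl_ops.so. As a rough diagnostic sketch (not a fix, and not part of the original post), the same lookup can be attempted directly with Python's standard ctypes module at the start of the training job. The helper name can_load_shared_library is hypothetical, and this assumes the check runs on the same machine and with the same library search path as the trainer.

```python
import ctypes


def can_load_shared_library(name):
    """Return True if the dynamic loader can open `name`, else False.

    Hypothetical helper for illustration only. ctypes.CDLL raises OSError
    when the library (or one of its own dependencies) cannot be opened,
    which is the same class of failure reported in the stack trace.
    """
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    # Log the result before building the distribution strategy, so the job
    # output shows whether NCCL 2 is visible to the loader at all.
    print("libnccl.so.2 loadable:", can_load_shared_library("libnccl.so.2"))
```

If this reports False on the worker, the library is not on the loader's search path in that environment, which matches the NotFoundError above. It does not say whether the library can be installed there.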
    