
How can I force TensorFlow to use all available GPUs?

  •  9
  • Jonathan  ·  7 years ago

    I have an 8-GPU cluster, and when I run a piece of Tensorflow code from Kaggle (pasted below), it only uses a single GPU rather than all 8. I confirmed this with nvidia-smi.

    import os
    import sys
    import random
    import warnings

    import numpy as np
    import cv2
    import matplotlib.pyplot as plt
    from tqdm import tqdm
    from skimage.io import imread, imshow
    from skimage.transform import resize

    import tensorflow as tf
    from keras import backend as K
    from keras import optimizers
    from keras.models import Model
    from keras.layers import Input, Lambda, Conv2D, Conv2DTranspose, MaxPooling2D, concatenate
    from keras.callbacks import EarlyStopping, ModelCheckpoint

    # Set some parameters
    IMG_WIDTH = 256
    IMG_HEIGHT = 256
    IMG_CHANNELS = 3
    TRAIN_IM = './train_im/'
    TRAIN_MASK = './train_mask/'
    TEST_PATH = './test/'
    
    warnings.filterwarnings('ignore', category=UserWarning, module='skimage')
    num_training = len(os.listdir(TRAIN_IM))
    num_test = len(os.listdir(TEST_PATH))
    # Get and resize train images
    X_train = np.zeros((num_training, IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS), dtype=np.uint8)
    Y_train = np.zeros((num_training, IMG_HEIGHT, IMG_WIDTH, 1), dtype=np.bool)
    print('Getting and resizing train images and masks ... ')
    sys.stdout.flush()
    
    #load training images
    for count, filename in tqdm(enumerate(os.listdir(TRAIN_IM)), total=num_training):
        img = imread(os.path.join(TRAIN_IM, filename))[:,:,:IMG_CHANNELS]
        img = resize(img, (IMG_HEIGHT, IMG_WIDTH), mode='constant', preserve_range=True)
        X_train[count] = img
        name, ext = os.path.splitext(filename)
        mask_name = name + '_mask' + ext
        mask = cv2.imread(os.path.join(TRAIN_MASK, mask_name))[:,:,:1]
        mask = resize(mask, (IMG_HEIGHT, IMG_WIDTH))
        Y_train[count] = mask
    
    # Check if training data looks all right
    ix = random.randint(0, num_training-1)
    print(ix)
    imshow(X_train[ix])
    plt.show()
    imshow(np.squeeze(Y_train[ix]))
    plt.show()
    # Define IoU metric
    def mean_iou(y_true, y_pred):
        prec = []
        for t in np.arange(0.5, 1.0, 0.05):
            y_pred_ = tf.to_int32(y_pred > t)
            score, up_opt = tf.metrics.mean_iou(y_true, y_pred_, 2)
            K.get_session().run(tf.local_variables_initializer())
            with tf.control_dependencies([up_opt]):
                score = tf.identity(score)
            prec.append(score)
        return K.mean(K.stack(prec), axis=0)
    
    # Build U-Net model
    inputs = Input((IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
    s = Lambda(lambda x: x / 255) (inputs)
    width = 64
    c1 = Conv2D(width, (3, 3), activation='relu', padding='same') (s)
    c1 = Conv2D(width, (3, 3), activation='relu', padding='same') (c1)
    p1 = MaxPooling2D((2, 2)) (c1)
    
    c2 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (p1)
    c2 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (c2)
    p2 = MaxPooling2D((2, 2)) (c2)
    
    c3 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (p2)
    c3 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (c3)
    p3 = MaxPooling2D((2, 2)) (c3)
    
    c4 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (p3)
    c4 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (c4)
    p4 = MaxPooling2D(pool_size=(2, 2)) (c4)
    
    c5 = Conv2D(width*16, (3, 3), activation='relu', padding='same') (p4)
    c5 = Conv2D(width*16, (3, 3), activation='relu', padding='same') (c5)
    
    u6 = Conv2DTranspose(width*8, (2, 2), strides=(2, 2), padding='same') (c5)
    u6 = concatenate([u6, c4])
    c6 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (u6)
    c6 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (c6)
    
    u7 = Conv2DTranspose(width*4, (2, 2), strides=(2, 2), padding='same') (c6)
    u7 = concatenate([u7, c3])
    c7 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (u7)
    c7 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (c7)
    
    u8 = Conv2DTranspose(width*2, (2, 2), strides=(2, 2), padding='same') (c7)
    u8 = concatenate([u8, c2])
    c8 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (u8)
    c8 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (c8)
    
    u9 = Conv2DTranspose(width, (2, 2), strides=(2, 2), padding='same') (c8)
    u9 = concatenate([u9, c1], axis=3)
    c9 = Conv2D(width, (3, 3), activation='relu', padding='same') (u9)
    c9 = Conv2D(width, (3, 3), activation='relu', padding='same') (c9)
    
    outputs = Conv2D(1, (1, 1), activation='sigmoid') (c9)
    
    model = Model(inputs=[inputs], outputs=[outputs])
    
    sgd = optimizers.SGD(lr=0.03, decay=1e-6, momentum=0.9, nesterov=True)
    model.compile(optimizer=sgd, loss='binary_crossentropy', metrics=[mean_iou])
    model.summary()
        
    # Fit model
    earlystopper = EarlyStopping(patience=20, verbose=1)
    checkpointer = ModelCheckpoint('nuclei_only.h5', verbose=1, save_best_only=True)
    results = model.fit(X_train, Y_train, validation_split=0.05, batch_size = 32, verbose=1, epochs=100, 
                    callbacks=[earlystopper, checkpointer])
    

    I would like to run this code on all available GPUs, using mxnet or any other approach. However, I don't know how to do that. All the references I have found only show how to do it on the MNIST dataset. I have my own dataset, which I read in a different way, so I'm not sure how to modify the code.

    1 Answer  |  last active 4 years ago
  •  14
  •   Peter Szoldan    5 years ago

    TL;DR: Use tf.distribute.MirroredStrategy() as a scope, like:

    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        [...create model as you would otherwise...]
    

    If no arguments are given, tf.distribute.MirroredStrategy() will use all the available GPUs. You can also specify which ones to use if you wish, like this: mirrored_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])

    See the Distributed training with TensorFlow guide for implementation details and other strategies.
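
    Applied to the question's U-Net, this only means building and compiling the model inside the scope. Below is a minimal sketch, assuming TensorFlow 2.x; the tiny stand-in model is just for illustration and would be replaced by the U-Net layers from the question:

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()  # with no arguments, uses every visible GPU
    print('Replicas in sync:', strategy.num_replicas_in_sync)

    with strategy.scope():
        # Variables of any Keras model built here are mirrored across the GPUs.
        inputs = tf.keras.Input((256, 256, 3))
        x = tf.keras.layers.Conv2D(64, 3, activation='relu', padding='same')(inputs)
        outputs = tf.keras.layers.Conv2D(1, 1, activation='sigmoid')(x)
        model = tf.keras.Model(inputs, outputs)
        model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.03, momentum=0.9, nesterov=True),
                      loss='binary_crossentropy')

    # fit() is then called as usual; each batch is split across the replicas.
    # model.fit(X_train, Y_train, validation_split=0.05, batch_size=32, epochs=100)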

    Earlier answer (now outdated: deprecated, removed as of April 1, 2020): use multi_gpu_model() from Keras.


    TS;WM:

    TensorFlow 2.0 now has the tf.distribute module, "a library for running a computation across multiple devices". It builds on the concept of "distribution strategies": you specify a distribution strategy and then use it as a scope. TensorFlow splits the input, parallelizes the calculations, and joins the outputs essentially transparently; backpropagation is handled as well. Since all the processing now happens behind the scenes, you might want to familiarize yourself with the available strategies and their parameters, as they can greatly affect your training speed. For example, do you want your variables to reside on the CPU? Then use tf.distribute.experimental.CentralStorageStrategy(). See the Distributed training with TensorFlow guide for more information.
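
    Switching strategies only changes the object whose scope you enter. A small sketch, again assuming TensorFlow 2.x; the device names are examples:

    import tensorflow as tf

    # Keep the variables on the CPU while computing on the GPUs:
    strategy = tf.distribute.experimental.CentralStorageStrategy()

    # Or mirror only on a subset of the devices:
    # strategy = tf.distribute.MirroredStrategy(devices=['/gpu:0', '/gpu:1'])

    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
        model.compile(optimizer='sgd', loss='mse')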

    Previous answer (now outdated, left here for reference):

    From the Tensorflow Guide:

    If you have more than one GPU in your system, the GPU with the lowest ID will be selected by default.

    If you would like to use multiple GPUs, unfortunately you have to manually specify which tensors to put on each GPU, like

    with tf.device('/device:GPU:2'):
    

    More information is available in the Tensorflow Guide: Using Multiple GPUs.
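
    A minimal sketch of such manual placement in that older, TF 1.x graph API, running one matmul per GPU and combining the results on the CPU (the two-GPU device list is just an example):

    import tensorflow as tf

    c = []
    for d in ['/device:GPU:0', '/device:GPU:1']:   # example device list
        with tf.device(d):
            a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
            b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
            c.append(tf.matmul(a, b))              # this op is pinned to device d

    with tf.device('/cpu:0'):
        total = tf.add_n(c)                        # gather the per-GPU results on the CPU

    with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
        print(sess.run(total))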

    As for how to distribute a network over multiple GPUs, there are two main approaches.

    1. You can distribute your network layer-wise over the GPUs. This is easier to implement, but it will not yield much of a performance benefit because the GPUs will be waiting for each other to complete operations.

    2. You can create separate copies of your network, called "towers", on each GPU. When you feed your eight-tower network, you break the input batch into 8 parts and distribute them. Let the network forward-propagate, then sum up the gradients and do the backward propagation. This results in an almost-linear speedup with the number of GPUs. It is, however, much harder to implement, because you also have to deal with complexities related to batch normalization, and it is very advisable to make sure you randomize your batches properly. There is a nice tutorial here. You should also review the Inception V3 code referenced there for ideas on how to structure such a thing, especially _tower_loss(), _average_gradients() and the part of train() that starts with for i in range(FLAGS.num_gpus):. A compact sketch of the gradient-averaging pattern is shown below.
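
    A compact sketch of that tower pattern in TF 1.x graph code, assuming a hypothetical build_tower_loss() that builds one replica's loss (batch-norm handling and batch randomization are omitted):

    import tensorflow as tf

    NUM_GPUS = 8

    def build_tower_loss(x, y):
        # Hypothetical stand-in for one replica of the model; returns a scalar loss.
        logits = tf.layers.dense(x, 1)
        return tf.losses.sigmoid_cross_entropy(y, logits)

    x = tf.placeholder(tf.float32, [None, 16])
    y = tf.placeholder(tf.float32, [None, 1])
    x_splits = tf.split(x, NUM_GPUS)   # break the input batch into one piece per GPU
    y_splits = tf.split(y, NUM_GPUS)

    optimizer = tf.train.MomentumOptimizer(0.03, 0.9)
    tower_grads = []
    for i in range(NUM_GPUS):
        # One tower per GPU, all towers sharing the same variables.
        with tf.device('/gpu:%d' % i), tf.variable_scope('model', reuse=(i > 0)):
            loss = build_tower_loss(x_splits[i], y_splits[i])
            tower_grads.append(optimizer.compute_gradients(loss))

    # Average each variable's gradient over the towers, then apply the update once.
    avg_grads = []
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars]
        avg_grads.append((tf.reduce_mean(tf.stack(grads), axis=0), grads_and_vars[0][1]))
    train_op = optimizer.apply_gradients(avg_grads)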

    If you would like to give Keras a try, it now supports multi-GPU training through multi_gpu_model(), which can do all the heavy lifting for you.
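
    A minimal sketch with that (since-removed) utility, assuming Keras 2.x where keras.utils.multi_gpu_model still exists; model, sgd, mean_iou, X_train, Y_train and the callbacks are the objects defined in the question's code:

    from keras.utils import multi_gpu_model

    parallel_model = multi_gpu_model(model, gpus=8)   # replicate the single-GPU U-Net on 8 GPUs
    parallel_model.compile(optimizer=sgd, loss='binary_crossentropy', metrics=[mean_iou])
    parallel_model.fit(X_train, Y_train, validation_split=0.05, batch_size=32, epochs=100,
                       callbacks=[earlystopper, checkpointer])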