
Why does my code perform worse when using torch.compile in PyTorch?

  • MJK  · asked 2 years ago

    I have a couple of questions:

    1. I thought torch.compile() was supposed to improve performance, but in my code the compiled version takes roughly 2% to 30% longer than the uncompiled one. Can you tell me what is wrong with my code?

    2. If I use compile with a pretrained BERT model, the warnings below appear and it becomes tens of times slower. Can you tell me what causes this? (A sketch of common mitigations follows the warning list below.)

    Warning list

    1. torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (64) function: 'forward' (/home/mj/…/bert.py:287) reason: ___check_obj_id(self, 139626116174448) To diagnose recompilation issues, see https://pytorch.org/docs/master/dynamo/troubleshooting.html

    2. torch._inductor.utils: [WARNING] using triton random, expect difference from eager

    <- warning list
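
    For context on the first warning: it means dynamo kept recompiling 'forward' until it hit config.cache_size_limit (64). With a BERT model this usually comes from inputs that keep invalidating the cached guards, for example a different sequence length on every batch, so each shape gets its own compiled graph; constant recompilation would explain a slowdown of tens of times. Below is a minimal sketch of common mitigations, assuming PyTorch 2.x; `model` is a stand-in for the actual BERT module, and the tokenizer call is a hypothetical example:

    import torch
    import torch._dynamo

    model = torch.nn.Linear(768, 768)  # stand-in for the actual BERT module

    # 1) Declare shapes dynamic so one shape-generic graph is compiled
    #    instead of one specialization per sequence length.
    model = torch.compile(model, dynamic=True)

    # 2) Or keep shapes constant, e.g. pad every batch to a fixed length
    #    (hypothetical HuggingFace-style call, adjust to your pipeline):
    # batch = tokenizer(texts, padding="max_length", max_length=128,
    #                   truncation=True, return_tensors="pt")

    # 3) Raising the limit only buys headroom, but helps confirm the diagnosis:
    torch._dynamo.config.cache_size_limit = 128

    The full CIFAR-10 script from question 1 follows: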

    import torch
    import torchvision
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms
    import random
    
    import time
    import numpy as np
    
    
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    torch.manual_seed(1234)
    np.random.seed(1234)
    
    class NeuralNetwork(nn.Module):
    
        def __init__(self):
            super().__init__()
    
            self.sequential = nn.Sequential(
                nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=2),
                nn.ReLU(),
                nn.MaxPool2d(stride=2, kernel_size=3),
                nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=2),
                nn.ReLU(),
                nn.MaxPool2d(stride=2, kernel_size=3),
                nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=2),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(8),
                nn.Flatten(),
                nn.Linear(in_features=128*8*8, out_features=10))
    
    
        def forward(self, x):
            out = self.sequential(x)
            return out
    
    # model1 = NeuralNetwork()
    # model2 = NeuralNetwork()
    # Two separate instances on purpose: model1 is trained, model2 is passed to
    # test() (see the notes below the code for why they were split).
    model1 = torchvision.models.resnet18()
    model2 = torchvision.models.resnet18()
    model1.to(device)
    model2.to(device)
    model1 = torch.compile(model1)
    model2 = torch.compile(model2)
    
    def cifar10_cnn():
        epochs = 5        
        batch_size = 64         
        report_period = 100 
        tr_count = 0           
        te_count = 0        
        data_root = "/data/" 
    
        torch.set_float32_matmul_precision('high')
    
        tr_dset = datasets.CIFAR10(root=data_root, train=True, download=True, transform=transforms.ToTensor())
        te_dset = datasets.CIFAR10(root=data_root, train=False, download=True, transform=transforms.ToTensor())
    
        tr_loader = DataLoader(tr_dset, batch_size=batch_size, shuffle=True)
        te_loader = DataLoader(te_dset, batch_size=batch_size, shuffle=False)
    
        loss_fn = nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(model1.parameters(), lr=1e-3)
    
    
        start_time = time.time()
    
        for i in range(epochs):
            print(f"\nEpoch {i + 1}/{epochs}\n------------------------------")
            train(tr_loader, model1, loss_fn, optimizer, report_period, start_time)
            print(f"\nTest started with {len(te_loader)} data:")
            test(te_loader, model2, loss_fn, start_time)
    
    def train(dataloader, model1, loss_fn, optimizer, report_period, start_time):
        running_loss = 0.0
        train_loss= 0.0
        size = len(dataloader.dataset)
        for batch, (X, y) in enumerate(dataloader):
            X, y = X.to(device), y.to(device)
    
            pred = model1(X)
            loss = loss_fn(pred, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    
            running_loss += loss.item()
            train_loss = running_loss / len(dataloader)
    
            if batch % report_period == 0:
                loss, current = loss.item(), batch * len(X)
                print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")
    
        print(f"train_loss: {train_loss}")
    
    
    def test(dataloader, model2, loss_fn, start_time):
    
        size = len(dataloader.dataset)
        num_batches = len(dataloader)
        model2.eval()
        test_loss, correct = 0, 0
        with torch.no_grad():
            for X, y in dataloader:
                X, y = X.to(device), y.to(device)

                # NOTE: this calls the global model1 (the trained model), not the
                # model2 argument that was switched to eval() above, so the test
                # actually runs model1 while it is still in train mode.
                pred = model1(X)
                test_loss += loss_fn(pred, y).item()
    
                correct += (pred.argmax(1) == y).type(torch.float).sum().item()
        test_loss /= num_batches
        correct /= size
    
        print(f"Test Error: \n Accuracy: {(100 * correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
    
    
    if __name__ == "__main__":
        start = time.time()
        cifar10_cnn()
        print("Done!\n")
        print(f"running time: {time.time()-start}")
    

    First, I tried compiling a single model and running both training and testing with it, but I got an error related to differentiation (or gradients), so I split it into one compiled model for training and another for testing.
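
    As a point of comparison, a single compiled model can normally be switched between modes, with no second instance needed: the first call in each mode triggers a recompile that dynamo then caches. A minimal sketch of that pattern:

    import torch
    import torchvision

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # One model, compiled once; dynamo caches separate graphs per mode.
    model = torch.compile(torchvision.models.resnet18().to(device))

    x = torch.randn(8, 3, 224, 224, device=device)

    # training step: autograd on, train-mode graph
    model.train()
    model(x).sum().backward()

    # evaluation: eval-mode graph, no gradient tracking
    model.eval()
    with torch.no_grad():
        _ = model(x)

    Note also that in the script above the two instances never share weights, and test() is handed model2 but actually runs the global model1 (see the comment in the code), so the two-model split does not behave as intended either way.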

    Second, I tried the models that torchvision provides, because I wanted to check whether my own implementation was the problem, but they were just as slow.

    Third, I tried a larger model, because I thought a small model might look slower due to the unavoidable compilation time, but the result was the same.
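
    One way to tell one-time compilation cost apart from steady-state speed is to time a few warm-up batches separately: torch.compile does most of its work on the first call(s), and CUDA kernels run asynchronously, so a synchronize is needed before reading the clock. A minimal sketch, assuming a CUDA device and the device/loss_fn/optimizer/model1/tr_loader defined in the script above:

    import time
    import torch

    def timed_batches(model, loader, loss_fn, optimizer, warmup=10):
        it = iter(loader)  # assumes the loader has more than `warmup` batches

        # warm-up: the first calls trigger compilation, so time them apart
        torch.cuda.synchronize()
        t0 = time.time()
        for _ in range(warmup):
            X, y = next(it)
            X, y = X.to(device), y.to(device)
            loss = loss_fn(model(X), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        torch.cuda.synchronize()
        print(f"warm-up (incl. compile): {time.time() - t0:.1f} s")

        # steady state: this is the number to compare against eager mode
        t0, n = time.time(), 0
        for X, y in it:
            X, y = X.to(device), y.to(device)
            loss = loss_fn(model(X), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            n += 1
        torch.cuda.synchronize()
        print(f"steady state: {(time.time() - t0) / n * 1000:.1f} ms/batch")

    If only the warm-up is slow, the overhead is expected compilation time; if the steady-state number is still slower than eager mode, that is the part worth investigating.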

    0 answers  |  2 years ago