How to reduce “Cuda Memcpy Async” events and why you should beware of boolean mask operations
This is the third part of a series of posts on the topic of analyzing and optimizing PyTorch models using PyTorch Profiler and TensorBoard. Our intention has been to highlight the benefits of performance profiling and optimization of GPU-based training workloads and their potential impact on the speed and cost of training. In particular, we wish to demonstrate the accessibility of profiling tools such as PyTorch Profiler and TensorBoard to all ML developers. You don’t need to be a CUDA expert in order to derive meaningful performance gains from applying the techniques we discuss in our posts.
In our first post we demonstrated how the different views of the PyTorch Profiler TensorBoard plugin can be used to identify performance issues and reviewed a few popular techniques for accelerating training. In the second post we showed how the TensorBoard plugin Trace View can be used to identify when tensors are being copied from the CPU to the GPU, and back. Such movement of data, which can cause points of synchronization and slow the speed of training considerably, is often unintentional and can sometimes be easily avoided. The topic of this post is situations in which we encounter points of synchronization between the GPU and CPU that are not associated with tensor copies. As in the case of tensor copies, these can cause stagnation in your training step and slow the overall time of your training considerably. We will demonstrate the existence of such occurrences, how they can be identified using PyTorch Profiler and the PyTorch Profiler TensorBoard plugin Trace View, and the potential performance benefits of building your model in a way that minimizes such synchronization events.
As in our previous posts, we will define a toy PyTorch model and then iteratively profile its performance, identify bottlenecks, and attempt to fix them. We run our experiments on an Amazon EC2 g5.2xlarge instance (containing an NVIDIA A10G GPU and 8 vCPUs) using the official AWS PyTorch 2.0 Docker image. Keep in mind that some of the behaviors we describe may vary between versions of PyTorch.
In the following blocks we introduce a toy PyTorch model that performs semantic segmentation on a 256×256 input image, i.e., it takes a 256×256 RGB image and outputs a 256×256 map of “per-pixel” labels from a class of ten semantic categories.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim
import torch.profiler
import torch.utils.data
from torch import Tensor

class Net(nn.Module):
    def __init__(self, num_hidden=10, num_classes=10):
        super().__init__()
        self.conv_in = nn.Conv2d(3, 10, 3, padding='same')
        # stack of hidden conv + ReLU layers
        hidden = []
        for i in range(num_hidden):
            hidden.append(nn.Conv2d(10, 10, 3, padding='same'))
            hidden.append(nn.ReLU())
        self.hidden = nn.Sequential(*hidden)
        self.conv_out = nn.Conv2d(10, num_classes, 3, padding='same')

    def forward(self, x):
        x = F.relu(self.conv_in(x))
        x = self.hidden(x)
        x = self.conv_out(x)
        return x
To train our model we will use the standard cross-entropy loss with a few modifications:
- We will assume that the target labels include an ignore value indicating pixels that we want to exclude from the loss calculation.
- We will assume that one of the semantic labels identifies certain pixels as belonging to the “background” of the image. We define our loss function to treat these as ignore labels.
- We will update our model weights only when we encounter batches with target tensors that include at least two unique values.
While we have chosen these modifications for the purposes of our demonstration, these types of operations are not uncommon and can be found in many “standard” PyTorch models. Since we are already “experts” at performance profiling, we have gone ahead and wrapped each of the operations in our loss function with a torch.profiler.record_function context manager (as described in our second post).
class MaskedLoss(nn.Module):
    def __init__(self, ignore_val=-1, num_classes=10):
        super().__init__()
        self.ignore_val = ignore_val
        self.num_classes = num_classes
        self.loss = torch.nn.CrossEntropyLoss()

    def cross_entropy(self, pred: Tensor, target: Tensor) -> Tensor:
        # create a boolean mask of valid labels
        with torch.profiler.record_function('create mask'):
            mask = target != self.ignore_val
        # permute the logits in preparation for masking
        with torch.profiler.record_function('permute'):
            permuted_pred = torch.permute(pred, [0, 2, 3, 1])
        # apply the boolean mask to the targets and logits
        with torch.profiler.record_function('mask'):
            masked_target = target[mask]
            masked_pred = permuted_pred[mask.unsqueeze(-1).expand(-1, -1, -1,
                                                                  self.num_classes)]
            masked_pred = masked_pred.reshape(-1, self.num_classes)
        # calculate the cross-entropy loss
        with torch.profiler.record_function('calc loss'):
            loss = self.loss(masked_pred, masked_target)
        return loss

    def ignore_background(self, target: Tensor) -> Tensor:
        # discover all indices where target label is "background"
        with torch.profiler.record_function('non_zero'):
            inds = torch.nonzero(target == self.num_classes - 1, as_tuple=True)
        # reset all "background" labels to the ignore index
        with torch.profiler.record_function('index assignment'):
            target[inds] = self.ignore_val
        return target

    def forward(self, pred: Tensor, target: Tensor) -> Tensor:
        # ignore background labels
        target = self.ignore_background(target)
        # retrieve a list of unique elements in target
        with torch.profiler.record_function('unique'):
            unique = torch.unique(target)
        # check if the number of unique items passes the threshold
        with torch.profiler.record_function('numel'):
            ignore_loss = torch.numel(unique) < 2
        # calculate the cross-entropy loss
        loss = self.cross_entropy(pred, target)
        # zero the loss in the case that the number of unique elements
        # is below the threshold
        if ignore_loss:
            loss = 0. * loss
        return loss
Our loss function seems innocent enough, right? Wrong! As we will see below, the loss function includes a number of operations that trigger host-device synchronization events that slow the speed of training considerably, none of which involve copying tensors into or out of the GPU. As in our previous post, we challenge you to try to identify three opportunities for performance optimization before reading on.
For the purposes of our demo, we use randomly generated images and per-pixel label maps, as defined below.
from torch.utils.data import Dataset

# A dataset with random images and label maps
class FakeDataset(Dataset):
    def __init__(self, num_classes=10):
        super().__init__()
        self.num_classes = num_classes
        self.img_size = [256, 256]

    def __len__(self):
        return 1000000

    def __getitem__(self, index):
        rand_image = torch.randn([3]+self.img_size, dtype=torch.float32)
        # labels are drawn from [-1, num_classes), where -1 is the ignore value
        rand_label = torch.randint(low=-1, high=self.num_classes,
                                   size=self.img_size)
        return rand_image, rand_label

train_set = FakeDataset()
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256,
                                           shuffle=True, num_workers=8, pin_memory=True)
Last, we define our training step with the PyTorch Profiler configured to our desire:
device = torch.device("cuda:0")
model = Net().cuda(device)
criterion = MaskedLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()

# training loop wrapped with profiler object
with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=1, warmup=4, active=3, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('/tmp/prof'),
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for step, data in enumerate(train_loader):
        inputs = data[0].to(device=device, non_blocking=True)
        labels = data[1].to(device=device, non_blocking=True)
        # stop once the profiler schedule (wait + warmup + active) has completed
        if step >= (1 + 4 + 3) * 1:
            break
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        prof.step()
If you were to naively run this training script, you would probably see high GPU utilization (~90%) and not know that there was anything wrong with it. It is only through profiling that we are able to identify the underlying performance bottlenecks and potential opportunities for training acceleration. So, without further ado, let’s see how our model performs.
In this post we will focus on the Trace View of the PyTorch Profiler TensorBoard plugin. Please see our previous posts for tips on how to use some of the other views supported by the plugin.
In the image below we show the Trace View of a single training step of our toy model.
We can clearly see that our 1.3 second long training step is completely dominated by the torch.nonzero operator in the first line of our loss function. All the other operations appear bunched together on either side of the huge cudaMemcpyAsync event. What is going on??!! Why would such a seemingly innocent operation cause such a huge eyesore?
Perhaps we shouldn’t be so surprised, as the torch.nonzero documentation does include the following note: “When input is on CUDA, torch.nonzero() causes host-device synchronization.” The need for synchronization arises from the fact that, contrary to other common PyTorch ops, the size of the tensor that is returned by torch.nonzero is not pre-determined. The CPU does not know how many non-zero elements there are in the input tensor ahead of time. It needs to wait for the sync event from the GPU in order to perform the appropriate GPU memory allocation and appropriately prepare the subsequent PyTorch ops.
Note that the length of the cudaMemcpyAsync event is not indicative of the complexity of the torch.nonzero op, but rather reflects the amount of time that the CPU needs to wait for the GPU to finish all of the previous kernels that the CPU launched. For example, were we to make an additional torch.nonzero call immediately after our first one, our second cudaMemcpyAsync event would appear significantly shorter than the first, since the CPU and GPU are already more or less “in sync”. (Keep in mind that this explanation is coming from a non-CUDA expert, so make of it what you will…)
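To make the point about undetermined output sizes concrete, here is a minimal sketch (not part of the original experiment) that runs on the CPU. The tensors `a` and `b` are made-up inputs. It only illustrates the shape behavior: two inputs of identical shape yield torch.nonzero outputs of different sizes, whereas the three-argument form of torch.where always returns a tensor shaped like its input.

```python
import torch

# Two hypothetical inputs with the same shape but different contents.
a = torch.tensor([0, 1, 0, 1, 1])
b = torch.tensor([0, 0, 0, 1, 0])

# torch.nonzero: the number of output rows depends on the values,
# so the output size cannot be known before the op has run.
nz_a = torch.nonzero(a)
nz_b = torch.nonzero(b)
print(nz_a.shape, nz_b.shape)  # torch.Size([3, 1]) torch.Size([1, 1])

# torch.where (three-argument form): the output shape always matches the input.
w_a = torch.where(a == 1, torch.zeros_like(a), a)
w_b = torch.where(b == 1, torch.zeros_like(b), b)
print(w_a.shape, w_b.shape)  # torch.Size([5]) torch.Size([5])
```

On a CUDA tensor, it is the first pattern that forces the host to wait for the device before it can allocate the result.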
Now that we understand the source of the bottleneck, the challenge becomes finding an alternative sequence of operations that performs the same logic but does not trigger a host-device synchronization event. In the case of our loss function, we can easily accomplish this using the torch.where operator, as shown in the code block below:
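def ignore_background(self, target: Tensor) -> Tensor:
    # replace "background" labels with the ignore value without calling nonzero
    with torch.profiler.record_function('update background'):
        target = torch.where(target == self.num_classes - 1,
                             torch.full_like(target, self.ignore_val), target)
    return target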
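As a quick sanity check of this replacement, the following sketch compares the two variants on random labels. It runs on the CPU, and the standalone function names and module-level constants are illustrative stand-ins for the class methods and attributes, not code from the original post.

```python
import torch

IGNORE_VAL = -1    # assumed ignore value, matching the default above
NUM_CLASSES = 10   # assumed number of classes; the last one is "background"

def ignore_background_nonzero(target):
    # original approach: find the indices, then assign in place (on a copy here)
    target = target.clone()
    inds = torch.nonzero(target == NUM_CLASSES - 1, as_tuple=True)
    target[inds] = IGNORE_VAL
    return target

def ignore_background_where(target):
    # sync-free approach: select element-wise, output shape is fixed
    return torch.where(target == NUM_CLASSES - 1,
                       torch.full_like(target, IGNORE_VAL), target)

torch.manual_seed(0)
labels = torch.randint(low=-1, high=NUM_CLASSES, size=(4, 16, 16))
same = torch.equal(ignore_background_nonzero(labels),
                   ignore_background_where(labels))
print(same)  # True
```

One difference worth keeping in mind: the index-assignment version modifies the target tensor in place, while the torch.where version returns a new tensor and leaves its input untouched.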
In the image below we show the Trace View following this change.
While we have succeeded in removing the cudaMemcpyAsync coming from the torch.nonzero op, it has been immediately replaced by one coming from the torch.unique op, and our step time has not budged. Here the PyTorch documentation is less kind, but based on our previous experience we can assume that, once again, we are suffering from a host-device synchronization event due to our use of tensors with undetermined size.
Replacing the torch.unique operator with an equivalent alternative is not always possible. However, in our case we do not actually need to know the values of the unique labels; we need to know only the number of unique labels. This can be calculated by applying the torch.sort op on the flattened target tensor and counting the number of steps in the resultant step function.
def forward(self, pred: Tensor, target: Tensor) -> Tensor:
    # ignore background labels
    target = self.ignore_background(target)
    # sort the list of labels
    with torch.profiler.record_function('sort'):
        sorted_target, _ = torch.sort(target.flatten())
    # identify the steps of the resultant step function
    with torch.profiler.record_function('deriv'):
        deriv = sorted_target[1:] - sorted_target[:-1]
    # count the number of steps
    with torch.profiler.record_function('count_nonzero'):
        num_unique = torch.count_nonzero(deriv) + 1
    # calculate the cross-entropy loss
    loss = self.cross_entropy(pred, target)
    # zero the loss in the case that the number of unique elements
    # is below the threshold
    with torch.profiler.record_function('where'):
        loss = torch.where(num_unique < 2, 0. * loss, loss)
    return loss
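The sort-based count can be checked against torch.unique on small inputs. The sketch below runs on the CPU, and `num_unique_via_sort` is a hypothetical helper name introduced for illustration. It assumes a non-empty input: for an empty tensor this approach would report 1 rather than 0.

```python
import torch

def num_unique_via_sort(t):
    # sort the flattened values, then count the places where neighbors differ
    sorted_vals, _ = torch.sort(t.flatten())
    deriv = sorted_vals[1:] - sorted_vals[:-1]
    # result is a 0-dim tensor, so no value needs to be read back by the host
    return torch.count_nonzero(deriv) + 1

torch.manual_seed(0)
labels = torch.randint(low=-1, high=9, size=(4, 16, 16))
constant = torch.full((4, 16, 16), -1)

matches_unique = int(num_unique_via_sort(labels)) == torch.unique(labels).numel()
constant_count = int(num_unique_via_sort(constant))
print(matches_unique, constant_count)  # True 1
```

Note that the count stays a tensor inside the loss function, which is why the final step uses torch.where rather than a Python `if`: converting the count to a Python number would itself require a synchronization.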
In the image below we capture the Trace View following our second optimization:
Once again, we have solved one bottleneck only to be faced with a new one, this time coming from the boolean mask routine.
Boolean masking is a routine we commonly use in order to reduce the overall number of machine operations that are required. In our case, our intention was to reduce the amount of computation by removing the “ignore” pixels and limiting the cross-entropy calculation to the pixels of interest. Clearly, this has backfired. As before, applying a boolean mask results in a tensor of undetermined size, and the cudaMemcpyAsync that it triggers greatly overshadows any of the savings from excluding the “ignore” pixels.
In our case, fixing this issue is rather simple, since the PyTorch CrossEntropyLoss has a built-in option for setting an ignore_index.
class MaskedLoss(nn.Module):
    def __init__(self, ignore_val=-1, num_classes=10):
        super().__init__()
        self.ignore_val = ignore_val
        self.num_classes = num_classes
        # let the loss skip ignored pixels instead of masking them out ourselves
        self.loss = torch.nn.CrossEntropyLoss(ignore_index=ignore_val)

    def cross_entropy(self, pred: Tensor, target: Tensor) -> Tensor:
        with torch.profiler.record_function('calc loss'):
            loss = self.loss(pred, target)
        return loss
In the image below we show the resultant Trace View:
Holy cow!! Our step time has dropped all the way down to 5.4 milliseconds. That is 240 (!!) times faster than what we started with. By simply changing around a few function calls and without any modification to the loss function logic, we were able to optimize the performance of the training step dramatically.
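The claim that the loss logic is unchanged can be checked numerically. The following sketch runs on the CPU with small random tensors whose shapes are chosen arbitrarily for illustration. It compares the boolean-mask formulation with the ignore_index formulation, both using the default mean reduction.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_classes = 10
pred = torch.randn(2, num_classes, 8, 8)                          # logits [N, C, H, W]
target = torch.randint(low=-1, high=num_classes, size=(2, 8, 8))  # -1 marks "ignore"

# Boolean-mask formulation: drop ignored pixels, then average over the rest.
mask = target != -1
masked_pred = pred.permute(0, 2, 3, 1)[mask]   # shape [num_valid, C]
masked_target = target[mask]                   # shape [num_valid]
loss_masked = nn.CrossEntropyLoss()(masked_pred, masked_target)

# ignore_index formulation: fixed-size inputs, ignored pixels are skipped
# inside the loss and excluded from the mean.
loss_ignore = nn.CrossEntropyLoss(ignore_index=-1)(pred, target)

match = torch.allclose(loss_masked, loss_ignore, atol=1e-5)
print(match)
```

The two values should agree up to floating-point rounding, since the mean reduction with ignore_index averages only over the non-ignored targets.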
Important Note: In the toy example we have chosen, the steps that we took to reduce the number of cudaMemcpyAsync events had a clear impact on the training step time. However, there may be situations where the same types of changes will harm performance rather than improve it. For example, in the case of boolean masking, if our mask is extremely sparse and the original tensors extremely large, the savings in computation from applying the mask might outweigh the price of the host-device synchronization. Importantly, the impact of each optimization should be evaluated on a case-by-case basis.
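One way to evaluate such a trade-off on your own tensors is to time both variants directly. The sketch below is a rough harness, not a substitute for the profiler. `time_fn`, `masked_loss`, and `ignore_index_loss` are hypothetical names introduced here, and the tensor sizes are arbitrary. It falls back to the CPU when no GPU is present, in which case no host-device synchronization is involved and the numbers say nothing about the effect discussed in this post.

```python
import time
import torch
import torch.nn.functional as F

def time_fn(fn, *args, iters=10):
    """Average wall-clock seconds per call (hypothetical helper)."""
    use_cuda = torch.cuda.is_available()
    fn(*args)                      # warm-up call, not timed
    if use_cuda:
        torch.cuda.synchronize()   # drain queued kernels before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if use_cuda:
        torch.cuda.synchronize()   # wait for the GPU before stopping the clock
    return (time.perf_counter() - start) / iters

def masked_loss(pred, target):
    # boolean masking: output size depends on the mask contents
    mask = target != -1
    return F.cross_entropy(pred.permute(0, 2, 3, 1)[mask], target[mask])

def ignore_index_loss(pred, target):
    # fixed-size inputs: ignored pixels are skipped inside the loss
    return F.cross_entropy(pred, target, ignore_index=-1)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)
pred = torch.randn(8, 10, 64, 64, device=device)
target = torch.randint(low=-1, high=10, size=(8, 64, 64), device=device)

t_masked = time_fn(masked_loss, pred, target)
t_ignore = time_fn(ignore_index_loss, pred, target)
print(f"boolean mask: {t_masked * 1e3:.3f} ms, ignore_index: {t_ignore * 1e3:.3f} ms")
```

Timing an op in isolation can understate the cost of a synchronization, because in a real training step the host also loses the ability to queue subsequent kernels ahead of time. Confirming the result in the Trace View of the full training step remains the more reliable check.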
In this post we have focused on performance issues in training applications that are caused by host-device synchronization events. We saw several examples of PyTorch operators that trigger such events, the common property of all of them being that the size of the tensors that they output depends on the input. You might also encounter synchronization events from other operators not covered in this post. We demonstrated how performance analyzers such as PyTorch Profiler and its associated TensorBoard plugin can be used to identify these kinds of events.
In the case of our toy example, we were able to find equivalent alternatives to the problematic operators that use fixed-size tensors and avoid the need for synchronization events. These led to a significant improvement in training time. However, in practice you might find it much harder, or even impossible, to solve these kinds of bottlenecks. Sometimes, overcoming them might require redesigning parts of your model.