DDP GPU utilization problem #10670
Unanswered
dragondx asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment · 4 replies
-
Dear @dragondx, any chance you could provide a reproducible code snippet with mocked data? Best,
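For reference, the kind of snippet being asked for is a minimal, self-contained script trained on mocked data, roughly along these lines (an illustrative sketch only, not the author's actual code; shapes, sizes, and hyperparameters are placeholders):

```python
# Illustrative sketch of a minimal DDP repro with mocked data.
# Launch with: torchrun --nproc_per_node=2 repro.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Mocked data standing in for the real memmap-backed dataset.
    dataset = TensorDataset(torch.randn(100_000, 128), torch.randn(100_000, 1))
    sampler = DistributedSampler(dataset, shuffle=True)
    loader = DataLoader(dataset, batch_size=1024, sampler=sampler,
                        num_workers=4, pin_memory=True, persistent_workers=True)

    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1)).cuda()
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for epoch in range(10):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x = x.cuda(non_blocking=True)
            y = y.cuda(non_blocking=True)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```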
-
Training an MLP using DDP with 2 GPUs. Pretty standard code.
We observe that the training speed slows down over time. In the initial iterations we get 1 s/it; both GPUs sit at 100% utilization, with occasional drops, presumably during gather/backprop ops. After a few thousand iterations we get 2 s/it: one GPU stays at 100% utilization consistently (the master?), while the other waits a long time at low utilization (1-2%) before getting a short burst of 100%. We also noticed that CPU utilization tends to be much higher in the earlier iterations (around 80% across all cores); after a few thousand iterations it drops to about 20%, with occasional spikes to 90% on some cores. Is this some sort of timing problem with data loading? There is no CPU preprocessing other than reading the data from disk.
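One way to check whether this is a data-loading stall (a hedged sketch; `loader`, `model`, `loss_fn`, and `optimizer` stand in for the actual training objects) is to time how long each iteration waits on the next batch versus how long the forward/backward takes:

```python
import time
import torch

# Illustrative instrumentation around the existing loop: if the "data wait"
# number grows over time while "compute" stays flat, the DataLoader / disk
# reads are the bottleneck rather than DDP communication.
data_s, compute_s = 0.0, 0.0
t_end = time.perf_counter()
for step, (x, y) in enumerate(loader):
    data_s += time.perf_counter() - t_end  # time spent waiting on the loader

    t0 = time.perf_counter()
    x = x.cuda(non_blocking=True)
    y = y.cuda(non_blocking=True)
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
    torch.cuda.synchronize()  # make the GPU time attributable to this step
    compute_s += time.perf_counter() - t0

    if step % 100 == 0:
        print(f"step {step:6d}  data wait {data_s:6.1f}s  compute {compute_s:6.1f}s")
        data_s, compute_s = 0.0, 0.0
    t_end = time.perf_counter()
```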
We use a custom dataloader for our use case, since we have a lot of data; it uses a NumPy memmap to fetch each datapoint.
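The question doesn't show the dataset code, but a memmap-backed dataset of this kind presumably looks roughly like the following sketch (class name, file layout, shapes, and dtype are all assumptions):

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class MemmapDataset(Dataset):
    """Illustrative memmap-backed dataset; layout and dtype are placeholders."""

    def __init__(self, path, num_samples, num_features):
        self.path = path
        self.num_samples = num_samples
        self.num_features = num_features
        self._mm = None  # opened lazily so each DataLoader worker gets its own handle

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        if self._mm is None:
            self._mm = np.memmap(self.path, dtype=np.float32, mode="r",
                                 shape=(self.num_samples, self.num_features))
        # Each access reads one row from the page cache / disk.
        row = np.array(self._mm[idx])  # copy out of the memmap
        return torch.from_numpy(row)
```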
DataLoader params:

```python
torch.utils.data.DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    persistent_workers=True,
    pin_memory=True,
    num_workers=16,
    prefetch_factor=128,
)
```
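For comparison, the usual way to build such a loader under DDP is one instance per process with a `DistributedSampler` handling the shuffling (whether that matches the setup here isn't shown in the question):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Typical DDP pattern (illustrative): each rank reads a disjoint shard, and
# shuffling is done by the sampler instead of shuffle=True on the DataLoader.
sampler = DistributedSampler(train_dataset, shuffle=True)
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    sampler=sampler,
    persistent_workers=True,
    pin_memory=True,
    num_workers=16,
    prefetch_factor=128,
)
```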
We wonder what is causing this. Any help will be greatly appreciated.