memory use continuously increasing #486
Comments
Hi,
Hi, thanks for your quick response. Can you say more about the disk cache being trashed at the end of the epoch? I've never successfully reached the end of an epoch with the loader (it stalls at 90% when the memory consumption saturates). Does this mean I need to figure out how to trash the disk cache before the end of the epoch? Any suggestions on how I can do that? Thanks!
I mean that the OS keeps data from the HD in RAM file caches. I guess that in your case the data set may not fully fit into your RAM, so at the end files are accessed not from the RAM cache but from the HD directly.
Ah, ok. I'm not 100% sure how to check RAM file cache usage. Do you know how to do that from within Python so I can verify? I'm surprised the entire dataset has to fit into RAM - can't memory be released after each iteration/batch? When the iteration slows, it doesn't completely stop (but it's very, very slow; instead of finishing in 15 min, it would finish hours later). I have been able to iterate through a smaller data set without any issues. If you have any tips on how to monitor RAM cache usage within Python (or via the Linux terminal), that would be greatly appreciated. In the meantime, I'm making a copy of the ImageNet training set with reduced file sizes. I will test whether I can iterate through this smaller (in GB) training set and post a comment here. Thanks!
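(Not part of the original thread: a minimal sketch of one way to check the OS file cache and the Python process's own memory, assuming a Linux host and the third-party psutil package; the helper name is made up for illustration.)

    # Hypothetical helper, not from the thread: report the kernel page-cache size
    # and this process's resident memory on a Linux host.
    import psutil  # third-party: pip install psutil

    def report_memory():
        meminfo = {}
        with open("/proc/meminfo") as f:       # kernel counters, values in kB
            for line in f:
                key, value = line.split(":", 1)
                meminfo[key] = int(value.split()[0])
        rss_mb = psutil.Process().memory_info().rss // (1024 * 1024)
        print("page cache: {} MB, free: {} MB, process RSS: {} MB".format(
            meminfo["Cached"] // 1024, meminfo["MemFree"] // 1024, rss_mb))

    report_memory()

From a terminal, free -m or vmstat shows the same page-cache figure (the buff/cache column).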
Hi,
I mean that if you have the data on a normal HD, then access to it is rather slow compared to an SSD, and the OS tries to cache its accesses to make it faster. So I just wonder whether, if the HD cache is full, the OS is no longer providing you data from RAM but from the HD directly, and this may be the source of the slowdown, but it is just my guess.
Thanks again for all of your help with this.

First I resized the images in the ImageNet ILSVRC2012 training set to 256x256x3 (first resizing the shortest edge to 256 px, preserving aspect ratio, then center cropping to 256x256). Now the loader iterates at over 21,000 images/s, going through the entire training set in about 60 s (no model training, just iterating through images). So, from a practical perspective, the problem is solved.

Nevertheless, I was curious about the slowdown I was having with the original images, so I followed the Stack Overflow link and monitored disk I/O using the command "sar -u 1 2". For this test, I iterated through the training set with no model training (just looping through images).

sar output before the slowdown: Linux 4.4.0-135-generic (nolan) 01/31/2019 x86_64 (16 CPU)
sar output after the slowdown: Linux 4.4.0-135-generic (nolan) 01/31/2019 x86_64 (16 CPU)

The %iowait jumped from 0.00% to >11.0%, which seems to confirm your guess? The drive is a 4 TB HDD with only a 64 MB cache. I might try replacing the drive with one that has a larger cache (256 MB) to see if that improves things. I don't know enough about how the HDD caching system depends on the actual size of the files, but as noted above, everything runs spectacularly fast after I reduced the file sizes.

Thanks for your time and effort helping address this issue (sorry it turned out to likely be a hardware issue!).
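(Not from the thread itself: a minimal sketch of the offline preprocessing described above, resizing the shortest edge to 256 px and then center-cropping to 256x256, assuming Pillow is available; the paths and helper name are hypothetical.)

    # Hypothetical preprocessing sketch: short edge -> 256 px, then a 256x256
    # center crop, saved back as JPEG to shrink on-disk file sizes.
    import os
    from PIL import Image

    def resize_and_center_crop(src_path, dst_path, size=256):
        img = Image.open(src_path).convert("RGB")
        w, h = img.size
        scale = float(size) / min(w, h)                      # short edge becomes `size`
        img = img.resize((int(round(w * scale)), int(round(h * scale))), Image.BILINEAR)
        w, h = img.size
        left, top = (w - size) // 2, (h - size) // 2         # centered crop box
        img.crop((left, top, left + size, top + size)).save(dst_path, quality=90)

    # Hypothetical paths for illustration only.
    os.makedirs("train_resized/n01440764", exist_ok=True)
    resize_and_center_crop("train/n01440764/img_0001.JPEG",
                           "train_resized/n01440764/img_0001.JPEG")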
Hi,
Hi @JanuszL, I seem to run into the same error. Meanwhile the system gets stuck and responds very slowly. My environment:
Part of my code is:
Updated: the RAM has been eaten up after 5 epochs of iteration! I hope this information will help you guys localize the bug and make DALI better and stronger.
Meet the same error, running on GPU. I use a pipeline almost the same as the official example code for PyTorch. GPU card 0 is used for the DALI pipeline, while GPU cards 1-7 are used for training. I train ResNet18 on the ImageNet dataset with batch size 1792 (256x7). The GPU memory used by card 0 increases continuously until an "out of memory" error. The memory usage increases at about 35 MB/epoch.

    # Imports implied by the snippet (DALI 0.6-era API and torchvision).
    from nvidia.dali.pipeline import Pipeline
    import nvidia.dali.ops as ops
    import nvidia.dali.types as types
    from nvidia.dali.plugin.pytorch import DALIClassificationIterator
    from torchvision import transforms

    class HybridTrainPipe(Pipeline):
        def __init__(self, batch_size, num_threads, device_id, data_dir, crop, dali_cpu=False):
            super(HybridTrainPipe, self).__init__(batch_size, num_threads, device_id, seed=12 + device_id)
            self.input = ops.FileReader(file_root=data_dir, shard_id=0, num_shards=1, random_shuffle=True)
            # let the user decide which pipeline works best for the RN version they run
            if dali_cpu:
                dali_device = "cpu"
                self.decode = ops.HostDecoderRandomCrop(device=dali_device, output_type=types.RGB,
                                                        random_aspect_ratio=[0.8, 1.25],
                                                        random_area=[0.1, 1.0],
                                                        num_attempts=100)
            else:
                dali_device = "gpu"
                # This padding sets the size of the internal nvJPEG buffers to be able to handle
                # all images from full-sized ImageNet without additional reallocations
                self.decode = ops.nvJPEGDecoderRandomCrop(device="mixed", output_type=types.RGB,
                                                          device_memory_padding=211025920,
                                                          host_memory_padding=140544512,
                                                          random_aspect_ratio=[0.8, 1.25],
                                                          random_area=[0.1, 1.0],
                                                          num_attempts=100)
            self.res = ops.Resize(device=dali_device, resize_x=crop, resize_y=crop,
                                  interp_type=types.INTERP_TRIANGULAR)
            self.cmnp = ops.CropMirrorNormalize(device="gpu",
                                                output_dtype=types.FLOAT,
                                                output_layout=types.NCHW,
                                                crop=(crop, crop),
                                                image_type=types.RGB,
                                                mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                                std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
            self.coin = ops.CoinFlip(probability=0.5)
            print('DALI "{0}" variant'.format(dali_device))

        def define_graph(self):
            rng = self.coin()
            self.jpegs, self.labels = self.input(name="Reader")
            images = self.decode(self.jpegs)
            images = self.res(images)
            output = self.cmnp(images.gpu(), mirror=rng)
            return [output, self.labels]

    def ImageNet(batch_sz, num_workers=16):
        world_size = 1
        rootdir = '/home/futian.zp/data/imagenet/'
        # note: this torchvision Normalize is unused; normalization happens in CropMirrorNormalize above
        normalize = transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
        pipe = HybridTrainPipe(batch_size=batch_sz, num_threads=num_workers, device_id=0,
                               data_dir=rootdir + 'train', crop=224, dali_cpu=False)
        pipe.build()
        train_loader = DALIClassificationIterator(pipe, size=int(pipe.epoch_size("Reader") / world_size))
        train_loader.num_classes = 1000
        return train_loader
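(Not part of the original comment: a minimal sketch of how a loader built this way is typically consumed per epoch, following the pattern of the official DALI ResNet-50 PyTorch example linked later in this thread; the epoch count and the training step are placeholders.)

    # Hypothetical consumption sketch: iterate the DALI loader for a few epochs
    # and reset it at the end of each epoch, as the official example does.
    train_loader = ImageNet(batch_sz=256)
    for epoch in range(3):                            # placeholder epoch count
        for i, data in enumerate(train_loader):
            images = data[0]["data"]                  # decoded, normalized batch, already on GPU
            labels = data[0]["label"].squeeze().long().cuda(non_blocking=True)
            # ... forward / backward / optimizer step would go here ...
        train_loader.reset()                          # re-arm the iterator for the next epoch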
Hi,
Hi,
@JinyangGuo - please look into the RN50 example - https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/examples/pytorch/resnet50/pytorch-resnet50.html.
I meet the same problem: the GPU memory usage increases over time, step by step, and finally leads to one of my processes crashing because of out-of-memory under data-parallel distributed training. Here are some error logs:
I think it may be caused by running out of memory.
@un-knight - DALI doesn't free memory after each step/epoch, as allocation on the GPU is very time-consuming. What DALI does instead is lazy reallocation: it grows its buffers only when the currently available memory is not sufficient.
Yep, I have tried a smaller num_threads to avoid out-of-memory. Thanks for your reply!
I'm working from the tutorials for integrating DALI with PyTorch, aiming to train models on ImageNet. But I think I'm running into the "memory leak" / "continuously growing memory" issues mentioned in #344 and #278, although none of the suggestions in those issues solved my problem.
I'm using NVIDIA DALI 0.6.1, with Ubuntu 16.04, CUDA 10.0, cuDNN 7.4.1, and PyTorch v1.0.0.
I'm using a hybrid pipeline and the DALIGenericIterator from the PyTorch plugin.
When I iterate through the dataset (no model training, just iterating), things go blazingly fast (5000 images/s), but only up until about 90% of the dataset has been loaded (so close!), at which point things slow to a near standstill. During that time, my RAM usage steadily increases by 6-7 GB (e.g., starting from 5 GB to about 12.5 GB). I'm not sure why things stall at 12.5 GB (the machine has 128 GB of RAM), but this is consistent across many attempted runs.
I made my own copy of DALIGenericIterator to determine the source of the issue. It seems that calling p._share_outputs() increases the memory use. If I "simulate" iterations without this function call (by calling p._share_outputs() once during the first batch, storing the outputs, and just working with the same outputs on each iteration), then the memory doesn't grow.
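(Not from the original issue: a minimal sketch of the kind of per-batch memory check described above, printing the process's resident memory while doing nothing but iterating the loader; `loader` stands for whatever DALIGenericIterator instance is being tested, and psutil is an assumed third-party dependency.)

    # Hypothetical monitoring loop: watch the Python process's resident memory
    # grow (or not) while just iterating the DALI loader.
    import psutil

    proc = psutil.Process()
    for i, data in enumerate(loader):      # loader: the DALIGenericIterator under test
        if i % 200 == 0:
            rss_gb = proc.memory_info().rss / 1024.0**3
            print("batch {}: RSS = {:.2f} GB".format(i, rss_gb))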
Is it expected that memory use would grow on each iteration/call to p._share_outputs()?
Is it possible that p._release_outputs() is not releasing memory?
Since _share_outputs and _release_outputs are core functions, I wasn't sure how to further debug this issue.
Many thanks in advance for your help.