
create_blob_from_path hangs if file is larger than MAX_SINGLE_PUT_SIZE #190

Closed
the-hof opened this issue Jun 21, 2016 · 10 comments


the-hof commented Jun 21, 2016

Steps to reproduce:

    from azure.storage.blob import BlockBlobService

    class BlobInteraction:
        def __init__(self, ACCOUNT_NAME, ACCOUNT_KEY):
            self.account_name = ACCOUNT_NAME
            self.account_key = ACCOUNT_KEY
            self.blob_service = BlockBlobService(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY)

        def put(self, container_name, blob_name, local_filename):
            if self.blob_service is None:
                self.blob_service = BlockBlobService(account_name=self.account_name, account_key=self.account_key)
            self.blob_service.create_blob_from_path(
                container_name,
                blob_name,
                local_filename
            )

BLOB_ACCOUNT_NAME = "MY_ACCOUNT_NAME"
BLOB_CONTAINER_NAME = "MY_CONTAINER_NAME"
BLOB_ACCOUNT_KEY = "MY_KEY"

blob = BlobInteraction(BLOB_ACCOUNT_NAME, BLOB_ACCOUNT_KEY)
    blob.put(BLOB_CONTAINER_NAME,
             'small_blob.csv',
             'path/to/small.csv')
    blob.put(BLOB_CONTAINER_NAME,
             'large_blob.csv',
             'path/to/large.csv')

Intended behavior: small_blob.csv and large_blob.csv appear in my blob storage.
What happens: small_blob.csv appears in blob storage; the code hangs and cannot be terminated after the second call to create_blob_from_path.

I tried setting the max sizes to something smaller to see whether even the "small_blob.csv" upload would hang, and it does. I added this to __init__(self, ACCOUNT_NAME, ACCOUNT_KEY):

    self.blob_service.MAX_SINGLE_PUT_SIZE = 32 * 1024
    self.blob_service.MAX_BLOCK_SIZE = 4 * 1024


the-hof commented Jun 21, 2016

I tried the same experiment reading the file into a text string and then using create_blob_from_text and experienced the same behavior.


the-hof commented Jun 21, 2016

Setting max_connections = 1 seems to work around what I'm seeing, so it may just be a problem with importing concurrent.futures on Python 2.7?

emgerner-msft (Member) commented:

I'm not able to repro this.

Each time we release we run all of our tests in 2.7 and we have explicit tests for every API in both parallel and non-parallel mode. The tests for this particular API are here and we actually use the same trick you did to make them run faster -- reducing put size and block size. I just tried them in both 2.7 and 3.5 and they pass. I also validated in Fiddler to confirm that they were indeed running in parallel and saw multiple requests, as expected.

  1. Could you make sure all of your packages are up to date and try again?
  2. Could you gather more information on where things are hanging?

marcelvb commented:

I have a problem and I'm not sure if it is related, but when I upload a large file (18 GB) using create_blob_from_path(), my memory usage goes through the roof. Eventually I run out of RAM and Linux kills my process. I upload multiple files concurrently in 16 threads. I'm using Python 2.7.6 with azure-storage 0.33.0.


tjprescott commented Oct 20, 2016

And we just got this bug report, which seems to be exactly related to this:
Azure/azure-cli#1105

I'm using python3, but have seen the same issue with python2.
Image is 30 GB (although only 1.5 GB sparse).
Eventually I managed to upload using --max-connections 1.
I traced the issue down to this little piece of code:
file: storage/blob/_upload_chunking.py, line 70

    if max_connections > 1:
        import concurrent.futures
        executor = concurrent.futures.ThreadPoolExecutor(max_connections)
        range_ids = list(executor.map(uploader.process_chunk, uploader.get_chunk_streams()))
    else:
        range_ids = [uploader.process_chunk(result) for result in uploader.get_chunk_streams()]

The problem is with the line

    range_ids = list(executor.map(uploader.process_chunk, uploader.get_chunk_streams()))

Calling executor.map with the generator uploader.get_chunk_streams() submits every yielded chunk up front, effectively building a list of all the elements the generator yields. That list holds all 30 GB of file data and is built in memory before any results come back from executor.map().

So, if you want to upload with max_connections > 1, you basically need (memory + swap) larger than the file you wish to upload...
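The eager buffering described above can be reproduced with a small stdlib-only sketch (the names get_chunk_streams and process_chunk mirror the snippet above but are illustrative, not the SDK's code): ThreadPoolExecutor.map() submits a task for every item of its input iterable before returning any result, so a generator of file chunks is fully drained, and its data held in memory, up front.

```python
import concurrent.futures

pulled = []  # records each chunk the executor pulls from the generator

def get_chunk_streams(n):
    # Stand-in for the SDK's chunk generator: in the real uploader,
    # each yield would carry a block of file data.
    for i in range(n):
        pulled.append(i)
        yield i

def process_chunk(chunk):
    return chunk * 2

executor = concurrent.futures.ThreadPoolExecutor(max_workers=2)
results = executor.map(process_chunk, get_chunk_streams(100))
# map() has already drained the generator here: all 100 chunks were
# submitted (and their data buffered) before we read a single result.
chunks_pulled_before_first_result = len(pulled)
first = next(results)
executor.shutdown()
print(chunks_pulled_before_first_result)  # 100
print(first)  # 0
```

With real 4 MB blocks instead of integers, those 100 buffered submissions are the memory blow-up reported in this thread.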

marcelvb commented:

I think this is indeed the issue that I'm experiencing as well.


rambho commented Oct 21, 2016

Thanks guys, I will investigate these RAM issues further, but our upcoming release will resolve this issue as we are reworking the upload strategy.

@marcelvb, have you tried reducing your max_connections (threads) closer to 1?
However, if it is indeed the same scenario as @tjprescott referenced, then the workaround for the time-being would be to disable parallelization.

marcelvb commented:

I set max_connections=1 when calling create_blob_from_path() and the problem went away. Since I already do my own threading, this workaround is fine for me. Maybe max_connections=1 should be the default, for now at least?
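For readers who likewise manage their own concurrency, one way to keep memory flat without dropping to max_connections=1 is to bound how many chunks are in flight, refilling from the generator only as tasks finish. This is a sketch of that pattern under stated assumptions, not the SDK's implementation; process_chunks_bounded and its parameters are illustrative names.

```python
import concurrent.futures
import itertools

def process_chunks_bounded(process_chunk, chunk_iter, workers=4, max_inflight=8):
    """Run process_chunk over chunk_iter with at most max_inflight
    chunks materialized at any time. Results are returned unordered."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(workers) as executor:
        # Prime the pipeline with only max_inflight chunks.
        futures = {executor.submit(process_chunk, c)
                   for c in itertools.islice(chunk_iter, max_inflight)}
        while futures:
            done, futures = concurrent.futures.wait(
                futures, return_when=concurrent.futures.FIRST_COMPLETED)
            for f in done:
                results.append(f.result())
            # Refill: pull only as many new chunks as just completed,
            # so the generator is never drained all at once.
            for c in itertools.islice(chunk_iter, len(done)):
                futures.add(executor.submit(process_chunk, c))
    return results

print(sorted(process_chunks_bounded(lambda x: x * 2, iter(range(10)))))
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Unlike executor.map over the whole generator, at most max_inflight chunks ever exist in memory at once.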


matthchr commented Jan 6, 2017

@rambo-msft @tjprescott Any update on this? Switching to max_connections = 1 if you aren't doing your own threading probably has a major perf impact.


rambho commented Feb 18, 2017

@matthchr @marcelvb @tjprescott
Our latest release has fixed the buffering issue and also added a new memory-optimized upload algorithm that is more efficient for the larger block sizes we now support.

Feel free to open a new issue if you run into any problems with the new version.

Thanks!

@rambho rambho closed this as completed Feb 18, 2017