
create_blob_from_path hangs if file is larger than MAX_SINGLE_PUT_SIZE #190

Closed
the-hof opened this issue Jun 21, 2016 · 10 comments


the-hof commented Jun 21, 2016

Steps to reproduce:

    from azure.storage.blob import BlockBlobService

    class BlobInteraction:
        def __init__(self, ACCOUNT_NAME, ACCOUNT_KEY):
            self.account_name = ACCOUNT_NAME
            self.account_key = ACCOUNT_KEY
            self.blob_service = BlockBlobService(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY)

        def put(self, container_name, blob_name, local_filename):
            if self.blob_service is None:
                self.blob_service = BlockBlobService(account_name=self.account_name, account_key=self.account_key)
            self.blob_service.create_blob_from_path(
                container_name,
                blob_name,
                local_filename
            )

BLOB_ACCOUNT_NAME = "MY_ACCOUNT_NAME"
BLOB_CONTAINER_NAME = "MY_CONTAINER_NAME"
BLOB_ACCOUNT_KEY = "MY_KEY"

blob = BlobInteraction(BLOB_ACCOUNT_NAME, BLOB_ACCOUNT_KEY)
    blob.put(BLOB_CONTAINER_NAME,
             'small_blob.csv',
             'path/to/small.csv')
    blob.put(BLOB_CONTAINER_NAME,
             'large_blob.csv',
             'path/to/large.csv')

Intended behavior: small_blob.csv and large_blob.csv appear in my blob storage.
What happens: small_blob.csv appears in blob storage; the code hangs and cannot be terminated after the second call to create_blob_from_path.

I tried setting the max sizes to something smaller to see whether even the "small_blob.csv" upload would hang, and it does. I added this to __init__(self, ACCOUNT_NAME, ACCOUNT_KEY):

    self.blob_service.MAX_SINGLE_PUT_SIZE = 32 * 1024
    self.blob_service.MAX_BLOCK_SIZE = 4 * 1024


the-hof commented Jun 21, 2016

I tried the same experiment reading the file into a text string and then using create_blob_from_text and experienced the same behavior.


the-hof commented Jun 21, 2016

Setting max_connections = 1 seems to work around what I'm seeing, so it may just be a problem with importing concurrent.futures on Python 2.7?

emgerner-msft (Member) commented:

I'm not able to repro this.

Each time we release we run all of our tests in 2.7 and we have explicit tests for every API in both parallel and non-parallel mode. The tests for this particular API are here and we actually use the same trick you did to make them run faster -- reducing put size and block size. I just tried them in both 2.7 and 3.5 and they pass. I also validated in Fiddler to confirm that they were indeed running in parallel and saw multiple requests, as expected.

  1. Could you make sure all of your packages are up to date and try again?
  2. Could you gather more information on where things are hanging?

marcelvb commented:

I have a problem and I'm not sure if it is related, but when I upload a large file (18 GB) using create_blob_from_path(), my memory usage goes through the roof. Eventually I run out of RAM and Linux kills my process. I upload multiple files concurrently in 16 threads. I'm using Python 2.7.6 with azure-storage 0.33.0.


tjprescott commented Oct 20, 2016

And we just got this bug report, which seems to be exactly related to this:
Azure/azure-cli#1105

I'm using python3, but have seen the same issue with python2.
Image is 30 GB (although only 1.5 GB sparse).
Eventually I managed to upload using --max-connections 1.
I traced the issue down to this little piece of code:
file: storage/blob/_upload_chunking.py, line 70

    if max_connections > 1:
        import concurrent.futures
        executor = concurrent.futures.ThreadPoolExecutor(max_connections)
        range_ids = list(executor.map(uploader.process_chunk, uploader.get_chunk_streams()))
    else:
        range_ids = [uploader.process_chunk(result) for result in uploader.get_chunk_streams()]

The problem is with the line

    range_ids = list(executor.map(uploader.process_chunk, uploader.get_chunk_streams()))

Calling executor.map with the generator uploader.get_chunk_streams() submits every yielded chunk up front, effectively building a list of all the elements the generator yields. That list holds all 30 GB of file data and is built in memory before any results come back from executor.map().

So, if you want to upload with max_connections > 1, you basically need (memory + swap) larger than the file you wish to upload...
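The eager buffering described above can be reproduced with a small stdlib-only sketch (the names get_chunk_streams and process_chunk mirror the snippet above but are illustrative, not the SDK's code): ThreadPoolExecutor.map() submits a task for every item of its input iterable before returning any result, so a generator of file chunks is fully drained, and its data held in memory, up front.

```python
import concurrent.futures

pulled = []  # records each chunk the executor pulls from the generator

def get_chunk_streams(n):
    # Stand-in for the SDK's chunk generator: in the real uploader,
    # each yield would carry a block of file data.
    for i in range(n):
        pulled.append(i)
        yield i

def process_chunk(chunk):
    return chunk * 2

executor = concurrent.futures.ThreadPoolExecutor(max_workers=2)
results = executor.map(process_chunk, get_chunk_streams(100))
# map() has already drained the generator here: all 100 chunks were
# submitted (and their data buffered) before we read a single result.
chunks_pulled_before_first_result = len(pulled)
first = next(results)
executor.shutdown()
print(chunks_pulled_before_first_result)  # 100
print(first)  # 0
```

With real 4 MB blocks instead of integers, those 100 buffered submissions are the memory blow-up reported in this thread.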

marcelvb commented:

I think this is indeed the issue that I'm experiencing as well.


rambho commented Oct 21, 2016

Thanks guys, I will investigate these RAM issues further, but our upcoming release will resolve this issue as we are reworking the upload strategy.

@marcelvb, have you tried reducing your max_connections (threads) closer to 1?
However, if it is indeed the same scenario as @tjprescott referenced, then the workaround for the time-being would be to disable parallelization.

marcelvb commented:

I set max_connections=1 when calling create_blob_from_path() and the problem went away. Since I already do my own threading, this workaround is fine for me. Maybe max_connections=1 should be the default, for now at least?
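For readers who likewise manage their own concurrency, one way to keep memory flat without dropping to max_connections=1 is to bound how many chunks are in flight, refilling from the generator only as tasks finish. This is a sketch of that pattern under stated assumptions, not the SDK's implementation; process_chunks_bounded and its parameters are illustrative names.

```python
import concurrent.futures
import itertools

def process_chunks_bounded(process_chunk, chunk_iter, workers=4, max_inflight=8):
    """Run process_chunk over chunk_iter with at most max_inflight
    chunks materialized at any time. Results are returned unordered."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(workers) as executor:
        # Prime the pipeline with only max_inflight chunks.
        futures = {executor.submit(process_chunk, c)
                   for c in itertools.islice(chunk_iter, max_inflight)}
        while futures:
            done, futures = concurrent.futures.wait(
                futures, return_when=concurrent.futures.FIRST_COMPLETED)
            for f in done:
                results.append(f.result())
            # Refill: pull only as many new chunks as just completed,
            # so the generator is never drained all at once.
            for c in itertools.islice(chunk_iter, len(done)):
                futures.add(executor.submit(process_chunk, c))
    return results

print(sorted(process_chunks_bounded(lambda x: x * 2, iter(range(10)))))
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Unlike executor.map over the whole generator, at most max_inflight chunks ever exist in memory at once.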


matthchr commented Jan 6, 2017

@rambo-msft @tjprescott Any update on this? Switching to max_connections = 1 if you aren't doing your own threading probably has a major perf impact.


rambho commented Feb 18, 2017

@matthchr @marcelvb @tjprescott
Our latest release has fixed the buffering issue and also added a new memory-optimized upload algorithm that is more efficient for the larger block sizes we now support.

Feel free to open a new issue if you run into any problems with the new version.

Thanks!

@rambho rambho closed this as completed Feb 18, 2017