-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-39297: [C++][FS]: Inform caller of container not-existing when checking for HNS support #39298
Conversation
|
if (hns_support == HNSSupport::kContainerNotFound || | ||
hns_support == HNSSupport::kEnabled) { | ||
// If the hierarchical namespace is enabled, then the storage account will | ||
// have explicit directories. Neither a file nor a directory was found. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These refactorings preserve the existing semantics as the goal of this PR is just rewriting the HNS check, but I'm changing the semantics of directory operations in a follow-up PR.
Is there a specific reason to combine everything back into a single file? Personally, I thought we should probably move more stuff into |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were a some things I liked about the old implementation but as a C++ noob my opinion probably doesn't have much weight here.
@@ -798,7 +864,7 @@ class AzureFileSystem::Impl { | |||
|
|||
std::unique_ptr<DataLake::DataLakeServiceClient> datalake_service_client_; | |||
std::unique_ptr<Blobs::BlobServiceClient> blob_service_client_; | |||
internal::HierarchicalNamespaceDetector hns_detector_; | |||
HNSSupport cached_hns_support_ = HNSSupport::kUnknown; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I liked having the separate detector class so that the cached value could be kept private from the implementation of the filesystem to prevent inadvertent misuse of the cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This split separates the stateful part of HNS checking from the request and error handling logic. To counter the possibility of inadvertent misuse of the value, I renamed it to cached_...
. There are valid uses for this value directly and the name describes that it's a cached value that could contain much more than the two states of a boolean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are valid uses for this value directly
Can you give some examples? I can't think of any
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An operation could decide to use the DataLakeFileSystemClient
if cached_hns_support
in kUnknown/kContainerNotFound/kEnabled
and set the value of cached_hns_support_
based on the error/success handling of that operation before falling back to the Blob API. This would save the mandatory extra request in the uncached case.
Another scenario is if we were to add threads to the mix, we would like to avoid having multiple HNS check requests going in parallel by having cached_hns_support_
be some kind of atomic variable or protected by a mutex that also protects other member variables in the AzureFileSystem::Impl
class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest I'm not really convinced by either of these, but it probably doesn't matter.
Technically you could avoid the call to check for hierarchical NS but I don't think it really makes sense because of the very small performance cost and the high added complexity cost. DataLakeFileSystemClient
largely works on flat NS accounts so it would require extra error handling on every call and as we know from the hierarchical namespace detection code Azure often gives strange responses and its difficult to determine if the failure is genuine or if the error is just because hierarchical namespace is detected.
I don't really see why we would need to protect other member variables of AzureFileSystem::Impl
behind the same mutex. I would have kept the separate detector class which could have just one mutex. When AzureFileSystem::Impl
tries to check whether hierarchical NS is enabled in concurrently it would hit the mutex but otherwise I would want AzureFileSystem::Impl
to be free to make other calls to blob storage without any mutexs getting in the way. Additionally, I was under the impression that AzureFileSystem
would never be used in multiple threads. I think only RandomAccessFile::ReadAt
is used concurrently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will consider extracting the memoized HierarchicalNamespaceSupport()
check to a separate class (added to my TODO list together with splitting the azurefs.cc
file), but I maintain that having an underlying function that doesn't do any mutable state manipulation (just the request to the backend) makes it easier to reason about the correctness and cost of the calls.
AzureFileSystem would never be used in multiple threads. I think only RandomAccessFile::ReadAt is used concurrently.
I meant AzureFileSystem
itself would be using multiple threads to run the operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...its difficult to determine if the failure is genuine or if the error is just because hierarchical namespace is detected.
I noticed this and I'm working on changes that puts us in a position where we never have to make that distinction. Because we can never cover all possible cases and Azurite being very broken when we make Data Lake Storage API calls to it makes it even worse.
@Tom-Newton This split caused 2 |
I think personally I would have moved |
@@ -117,6 +118,54 @@ struct ARROW_EXPORT AzureOptions { | |||
MakeDataLakeServiceClient() const; | |||
}; | |||
|
|||
namespace internal { | |||
|
|||
enum class HNSSupport { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again I don't feel strongly, but I would not have abbreviated so much. Yes, the docstring explains in detail but someone who just sees that name and Googles "HNS" will find themselves learning about the Croatian Football Federation 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I abbreviated the enum class because it gets repeated a lot, but kept the function names with the full name :-)
HNS is used in Microsoft documentation together with NFS, SFTP... so it's not so bad.
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-feature-support-in-storage-accounts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather have the full name as well, even if unfortunately long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I at least alias it in azurefs.cc
?
We don't need to say "Azure error" because ExceptionToStatus will do that for us at the end of the message. We don't need to say "unexpected" because if we are reporting the error, it is unexpected.
@kou can you look again? I had forgotten to push the unit test fixing commit before. Note that I added some error message cleanups as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
I think that we can merge this after @felipecrv and @Tom-Newton reach a consensus (and apply necessary changes if needed).
I don't have strong opinion for on going discussion.
I would have done this differently but I don't feel strongly. @felipecrv feel free to merge |
/// account. | ||
/// \return kEnabled/kDisabled/kContainerNotFound (kUnknown is never | ||
/// returned). | ||
Result<HNSSupport> CheckIfHierarchicalNamespaceIsEnabled( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why this is exposed in azurefs.h
? Normally the user wouldn't interact with Azure SDK types directly, only through the FileSystem
abstraction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this is called from tests. But note that it's in the internal
namespace.
I removed azurefs_internal.h
and moved everything to azurefs.cc
to minimize the export of utilities that mention Azure SDK types and this was the only one left in the internal
namespace because it's called directly from unit tests (cc @Tom-Newton).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And the types are only forward-declared here. No header-bloat from the SDK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I would have rather kept the azurefs_internal.{h,cc}
as that's a useful separation, and it makes reading the headers easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if that requires exposing more symbols that are now private to azurefs.cc
? I will need to expose:
ExceptionToStatus
(which is a template on my fork at the moment)IsDfsEmulator
IsContainerNotFound
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The solution to avoiding exposing extra symbols is moving the CheckIfHierarchicalNamespaceIsEnabled
to azurefs_internal.h
, but implementing it in azurefs.cc
. I will have that done in the next PR I push.
@pitrou @Tom-Newton follow-up PR addressing your feedback and simplifying error messages #39323 |
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 7265689. There were 6 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them. |
…en checking for HNS support (apache#39298) ### Rationale for this change An operation checking for Hierarchical Namespace support shouldn't fail completely when the reason for the check failing is the container not existing. We can allow the caller to decide what to do in that situation by returning a result that indicates the check didn't succeed because the container doesn't exist. ### What changes are included in this PR? - Removal of the `azurefs_intern.h/cc` files - Implementation of the check as a free-function instead of a class - Memoization of the result in the `AzureFileSystem` class ### Are these changes tested? Yes. The tests were improved to cover all cases. * Closes: apache#39297 Authored-by: Felipe Oliveira Carvalho <[email protected]> Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
…en checking for HNS support (apache#39298) ### Rationale for this change An operation checking for Hierarchical Namespace support shouldn't fail completely when the reason for the check failing is the container not existing. We can allow the caller to decide what to do in that situation by returning a result that indicates the check didn't succeed because the container doesn't exist. ### What changes are included in this PR? - Removal of the `azurefs_intern.h/cc` files - Implementation of the check as a free-function instead of a class - Memoization of the result in the `AzureFileSystem` class ### Are these changes tested? Yes. The tests were improved to cover all cases. * Closes: apache#39297 Authored-by: Felipe Oliveira Carvalho <[email protected]> Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
Rationale for this change
An operation checking for Hierarchical Namespace support shouldn't fail completely when the reason for the check failing is the container not existing. We can allow the caller to decide what to do in that situation by returning a result that indicates the check didn't succeed because the container doesn't exist.
What changes are included in this PR?
azurefs_intern.h/cc
filesAzureFileSystem
classAre these changes tested?
Yes. The tests were improved to cover all cases.