Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-39297: [C++][FS]: Inform caller of container not-existing when checking for HNS support #39298

Merged
merged 5 commits into from
Dec 20, 2023

Conversation

felipecrv
Copy link
Contributor

@felipecrv felipecrv commented Dec 19, 2023

Rationale for this change

An operation checking for Hierarchical Namespace support shouldn't fail completely when the reason for the check failing is the container not existing. We can allow the caller to decide what to do in that situation by returning a result that indicates the check didn't succeed because the container doesn't exist.

What changes are included in this PR?

  • Removal of the azurefs_intern.h/cc files
  • Implementation of the check as a free-function instead of a class
  • Memoization of the result in the AzureFileSystem class

Are these changes tested?

Yes. The tests were improved to cover all cases.

Copy link

⚠️ GitHub issue #39297 has been automatically assigned in GitHub to PR creator.

if (hns_support == HNSSupport::kContainerNotFound ||
hns_support == HNSSupport::kEnabled) {
// If the hierarchical namespace is enabled, then the storage account will
// have explicit directories. Neither a file nor a directory was found.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These refactorings preserve the existing semantics as the goal of this PR is just rewriting the HNS check, but I'm changing the semantics of directory operations in a follow-up PR.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Dec 19, 2023
@felipecrv felipecrv requested a review from pitrou December 19, 2023 15:00
@felipecrv
Copy link
Contributor Author

@kou @Tom-Newton

@Tom-Newton
Copy link
Contributor

Tom-Newton commented Dec 19, 2023

Is there a specific reason to combine everything back into a single file? Personally, I thought we should probably move more stuff into azurefs_internal.cc given the size of azurefs.cc.

Copy link
Contributor

@Tom-Newton Tom-Newton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were a some things I liked about the old implementation but as a C++ noob my opinion probably doesn't have much weight here.

@@ -798,7 +864,7 @@ class AzureFileSystem::Impl {

std::unique_ptr<DataLake::DataLakeServiceClient> datalake_service_client_;
std::unique_ptr<Blobs::BlobServiceClient> blob_service_client_;
internal::HierarchicalNamespaceDetector hns_detector_;
HNSSupport cached_hns_support_ = HNSSupport::kUnknown;
Copy link
Contributor

@Tom-Newton Tom-Newton Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I liked having the separate detector class so that the cached value could be kept private from the implementation of the filesystem to prevent inadvertent misuse of the cache.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This split separates the stateful part of HNS checking from the request and error handling logic. To counter the possibility of inadvertent misuse of the value, I renamed it to cached_.... There are valid uses for this value directly and the name describes that it's a cached value that could contain much more than the two states of a boolean.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are valid uses for this value directly

Can you give some examples? I can't think of any

Copy link
Contributor Author

@felipecrv felipecrv Dec 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An operation could decide to use the DataLakeFileSystemClient if cached_hns_support in kUnknown/kContainerNotFound/kEnabled and set the value of cached_hns_support_ based on the error/success handling of that operation before falling back to the Blob API. This would save the mandatory extra request in the uncached case.

Another scenario is if we were to add threads to the mix, we would like to avoid having multiple HNS check requests going in parallel by having cached_hns_support_ be some kind of atomic variable or protected by a mutex that also protects other member variables in the AzureFileSystem::Impl class.

Copy link
Contributor

@Tom-Newton Tom-Newton Dec 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest I'm not really convinced by either of these, but it probably doesn't matter.

Technically you could avoid the call to check for hierarchical NS but I don't think it really makes sense because of the very small performance cost and the high added complexity cost. DataLakeFileSystemClient largely works on flat NS accounts so it would require extra error handling on every call and as we know from the hierarchical namespace detection code Azure often gives strange responses and its difficult to determine if the failure is genuine or if the error is just because hierarchical namespace is detected.

I don't really see why we would need to protect other member variables of AzureFileSystem::Impl behind the same mutex. I would have kept the separate detector class which could have just one mutex. When AzureFileSystem::Impl tries to check whether hierarchical NS is enabled in concurrently it would hit the mutex but otherwise I would want AzureFileSystem::Impl to be free to make other calls to blob storage without any mutexs getting in the way. Additionally, I was under the impression that AzureFileSystem would never be used in multiple threads. I think only RandomAccessFile::ReadAt is used concurrently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will consider extracting the memoized HierarchicalNamespaceSupport() check to a separate class (added to my TODO list together with splitting the azurefs.cc file), but I maintain that having an underlying function that doesn't do any mutable state manipulation (just the request to the backend) makes it easier to reason about the correctness and cost of the calls.

AzureFileSystem would never be used in multiple threads. I think only RandomAccessFile::ReadAt is used concurrently.

I meant AzureFileSystem itself would be using multiple threads to run the operations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...its difficult to determine if the failure is genuine or if the error is just because hierarchical namespace is detected.

I noticed this and I'm working on changes that puts us in a position where we never have to make that distinction. Because we can never cover all possible cases and Azurite being very broken when we make Data Lake Storage API calls to it makes it even worse.

cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Dec 19, 2023
@felipecrv
Copy link
Contributor Author

felipecrv commented Dec 19, 2023

Is there a specific reason to combine everything back into a single file? Personally, I thought we should probably move more stuff into azurefs_internal.cc given the size of azurefs.cc.

@Tom-Newton This split caused 2 ExceptionToStatus functions to exist and would force me to expose IsDfsEmulator and IsContainerNotFound in the azurefs_internal.h header since I need them to implement the HNS check and other filesystem operations that now live in azurefs.cc. I'm not opposed to partitioning azurefs.cc into smaller files, but that split should be one that minimizes the shared interfaces between the partitions that have to be announced in a header -- a more natural split will be more evident when we finish the implementation.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Dec 20, 2023
@Tom-Newton
Copy link
Contributor

@Tom-Newton This split caused 2 ExceptionToStatus functions to exist and would force me to expose IsDfsEmulator and IsContainerNotFound in the azurefs_internal.h header since I need them to implement the HNS check and other filesystem operations that now live in azurefs.cc. I'm not opposed to partitioning azurefs.cc into smaller files, but that split should be one that minimizes the shared interfaces between the partitions that have to be announced in a header -- a more natural split will be more evident when we finish the implementation.

I think personally I would have moved IsDfsEmulator and IsContainerNotFound in the azurefs_internal.h but I don't feel strongly.

@@ -117,6 +118,54 @@ struct ARROW_EXPORT AzureOptions {
MakeDataLakeServiceClient() const;
};

namespace internal {

enum class HNSSupport {
Copy link
Contributor

@Tom-Newton Tom-Newton Dec 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I don't feel strongly, but I would not have abbreviated so much. Yes, the docstring explains in detail but someone who just sees that name and Googles "HNS" will find themselves learning about the Croatian Football Federation 😄

Copy link
Contributor Author

@felipecrv felipecrv Dec 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I abbreviated the enum class because it gets repeated a lot, but kept the function names with the full name :-)

HNS is used in Microsoft documentation together with NFS, SFTP... so it's not so bad.

https://learn.microsoft.com/en-us/azure/storage/blobs/storage-feature-support-in-storage-accounts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather have the full name as well, even if unfortunately long.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I at least alias it in azurefs.cc?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Dec 20, 2023
We don't need to say "Azure error" because ExceptionToStatus will do
that for us at the end of the message.

We don't need to say "unexpected" because if we are reporting the error,
it is unexpected.
@felipecrv felipecrv requested a review from kou December 20, 2023 01:50
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Dec 20, 2023
@felipecrv
Copy link
Contributor Author

@kou can you look again? I had forgotten to push the unit test fixing commit before. Note that I added some error message cleanups as well.

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

I think that we can merge this after @felipecrv and @Tom-Newton reach a consensus (and apply necessary changes if needed).
I don't have strong opinion for on going discussion.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Dec 20, 2023
@Tom-Newton
Copy link
Contributor

Tom-Newton commented Dec 20, 2023

I would have done this differently but I don't feel strongly. @felipecrv feel free to merge

@felipecrv felipecrv merged commit 7265689 into apache:main Dec 20, 2023
33 of 35 checks passed
@felipecrv felipecrv removed the awaiting merge Awaiting merge label Dec 20, 2023
@felipecrv felipecrv deleted the hns_check branch December 20, 2023 13:43
@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Dec 20, 2023
/// account.
/// \return kEnabled/kDisabled/kContainerNotFound (kUnknown is never
/// returned).
Result<HNSSupport> CheckIfHierarchicalNamespaceIsEnabled(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this is exposed in azurefs.h? Normally the user wouldn't interact with Azure SDK types directly, only through the FileSystem abstraction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this is called from tests. But note that it's in the internal namespace.

I removed azurefs_internal.h and moved everything to azurefs.cc to minimize the export of utilities that mention Azure SDK types and this was the only one left in the internal namespace because it's called directly from unit tests (cc @Tom-Newton).

Copy link
Contributor Author

@felipecrv felipecrv Dec 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the types are only forward-declared here. No header-bloat from the SDK.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I would have rather kept the azurefs_internal.{h,cc} as that's a useful separation, and it makes reading the headers easier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if that requires exposing more symbols that are now private to azurefs.cc? I will need to expose:

  • ExceptionToStatus (which is a template on my fork at the moment)
  • IsDfsEmulator
  • IsContainerNotFound

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The solution to avoiding exposing extra symbols is moving the CheckIfHierarchicalNamespaceIsEnabled to azurefs_internal.h, but implementing it in azurefs.cc. I will have that done in the next PR I push.

@felipecrv
Copy link
Contributor Author

@pitrou @Tom-Newton follow-up PR addressing your feedback and simplifying error messages #39323

Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 7265689.

There were 6 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
…en checking for HNS support (apache#39298)

### Rationale for this change

An operation checking for Hierarchical Namespace support shouldn't fail completely when the reason for the check failing is the container not existing. We can allow the caller to decide what to do in that situation by returning a result that indicates the check didn't succeed because the container doesn't exist.

### What changes are included in this PR?

 - Removal of the `azurefs_intern.h/cc` files
 - Implementation of the check as a free-function instead of a class
 - Memoization of the result in the `AzureFileSystem` class

### Are these changes tested?

Yes. The tests were improved to cover all cases.
* Closes: apache#39297

Authored-by: Felipe Oliveira Carvalho <[email protected]>
Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…en checking for HNS support (apache#39298)

### Rationale for this change

An operation checking for Hierarchical Namespace support shouldn't fail completely when the reason for the check failing is the container not existing. We can allow the caller to decide what to do in that situation by returning a result that indicates the check didn't succeed because the container doesn't exist.

### What changes are included in this PR?

 - Removal of the `azurefs_intern.h/cc` files
 - Implementation of the check as a free-function instead of a class
 - Memoization of the result in the `AzureFileSystem` class

### Are these changes tested?

Yes. The tests were improved to cover all cases.
* Closes: apache#39297

Authored-by: Felipe Oliveira Carvalho <[email protected]>
Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Inform caller of Azure Storage container not-existing when checking for HNS support
4 participants