Automatically inherit Azure credentials from environment #267

Open
daviewales opened this issue Feb 14, 2025 · 6 comments

@daviewales

With Polars, which also uses the object_store crate, I can authenticate to Azure automatically, based on credentials found in my environment.
For example, if a client secret environment variable exists, it will use that. If I am authenticated with Azure CLI, it will use that.
This means that my code looks like this, and works both for local testing and on the server with environment credentials:

import polars as pl
pl.read_parquet('abfs://[email protected]/path/to/file.parquet')

This is handy, because it means that I can write code which is completely agnostic to the location of the file.
The code above works just as well if I swap out the filepath for a local file, because I didn't need to specify any Azure-specific parameters.

obstore is getting much closer to this ideal.
However, for Azure at least, I still need to give it a hint to detect my Azure CLI credentials:

import obstore
from obstore.store import from_url
store = from_url('abfs://[email protected]/path/to/directory', azure_use_azure_cli=True)
obstore.list(store).collect()

Ideally, I would be able to do this instead:

import obstore
from obstore.store import from_url
store = from_url('abfs://[email protected]/path/to/directory')
obstore.list(store).collect()

Note that Polars uses azure.identity.DefaultAzureCredential as the default credential provider for Azure, as it automatically finds available credentials.

(It's not a hard dependency on azure.identity. An import error is only raised if you try to use Azure credential functionality and don't have azure.identity installed.)
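A minimal sketch of that optional-dependency pattern, assuming a lazy import at the point of use. The helper names `require_optional` and `default_azure_credential` are hypothetical and are not taken from Polars or obstore; only `azure.identity.DefaultAzureCredential` is a real class.

```python
import importlib


def require_optional(module_name: str, feature: str):
    """Import an optional dependency only when a feature actually needs it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a message that names the missing package and the feature.
        raise ImportError(
            f"{feature} requires the optional package {module_name!r}; "
            "install it to use this feature"
        ) from exc


def default_azure_credential():
    # azure.identity is imported here, not at module import time, so code
    # that never touches Azure credentials never needs the package.
    identity = require_optional("azure.identity", "Azure credential discovery")
    return identity.DefaultAzureCredential()
```

With this shape, importing the module always succeeds, and the error only surfaces when `default_azure_credential()` is called without `azure.identity` installed.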

@kylebarron
Member

kylebarron commented Feb 14, 2025

I'm on my phone but you should be able to set AZURE_USE_AZURE_CLI=True as an environment variable. (I'll read more on your other azure links later)

@kylebarron
Member

We could implement something like https://developmentseed.org/obstore/latest/api/store/aws/#obstore.store.S3Store.from_session for Azure that uses that azure Python library you mentioned. (I'm less familiar with azure myself)

@daviewales
Author

daviewales commented Feb 14, 2025

Polars provides a credential_provider argument as a standard interface for managing arbitrary credentials. It accepts a function with a specified return signature, which allows you to map in arbitrary credentials. Polars also provides a default wrapper class for the different clouds, so in most cases you don't need to write your own credential_provider function. (Search the link above for CredentialProviderAzure, for example.)

This gets us halfway. If I know I'm using Azure, I can initialise an Azure credential, then pass it in to the credential_provider argument. So I can use DefaultAzureCredential, and it will automatically pick up an appropriate credential.

The second step is to set up defaults for each provider. For example, if Polars detects that a given URL is an Azure URL, it automatically sets up a DefaultAzureCredential, without the user needing to set up or specify a credential_provider. (The credential_provider is still useful as an escape hatch when you want to override DefaultAzureCredential.)

I'm most familiar with Azure, but I'm guessing that this is generalisable to some extent, as Polars seems to be attempting to generalise for the common default case.
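The two steps above (an explicit credential_provider as the escape hatch, plus a per-cloud default chosen from the URL) can be sketched roughly as follows. This is an illustration only, not Polars's implementation: `resolve_credential_provider`, `azure_default_provider`, and the set of URL schemes are assumptions made for the example.

```python
from urllib.parse import urlparse

# Assumed set of URL schemes treated as Azure for this sketch.
AZURE_SCHEMES = {"abfs", "abfss", "az"}


def azure_default_provider():
    # Placeholder: a real default would wrap something like
    # azure.identity.DefaultAzureCredential.
    return {"source": "azure-default"}


def resolve_credential_provider(url, credential_provider=None):
    """Pick the provider to use for a URL.

    An explicit provider always wins; otherwise fall back to a per-cloud
    default, or to None for local paths and unrecognised schemes.
    """
    if credential_provider is not None:
        return credential_provider
    scheme = urlparse(url).scheme.lower()
    if scheme in AZURE_SCHEMES:
        return azure_default_provider
    return None
```

The caller's code stays agnostic to the file location: a local path resolves to no provider at all, while an Azure URL picks up the default without any Azure-specific arguments.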

kylebarron changed the title from "Automatically inherit credentials from environment" to "Automatically inherit Azure credentials from environment" on Feb 14, 2025

@kylebarron
Member

OK, now I'm back at my computer. First, thanks for the issue! I'm not as familiar with Azure in particular.

> With Polars, which also uses the object_store crate,

I know Polars does use object_store in some capacity, but their docs also mention that some operations may require fsspec, s3fs, adlfs, gcsfs. I'm curious what they use object_store natively for, and what goes through fsspec. Maybe it's any data source that gets passed to pyarrow? (FWIW I commented on an issue there about fsspec and obstore).

> based on credentials found in my environment

We do have some docs on how environment variables are found and applied.

> If I am authenticated with Azure CLI, it will use that.
> This means that my code looks like this, and works both for local testing and on the server with environment credentials:

I think you can set AZURE_USE_AZURE_CLI=TRUE in your local env, and then it should accurately handle auth when running both locally and remotely.
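A minimal sketch of that suggestion, assuming (as described in this thread) that obstore picks up AZURE_-prefixed environment variables when it builds the store. The helper names `enable_azure_cli_auth` and `open_store` are hypothetical; `obstore.store.from_url` is the function already used earlier in the thread.

```python
import os


def enable_azure_cli_auth(env=None):
    """Set AZURE_USE_AZURE_CLI unless the caller has already set it."""
    env = os.environ if env is None else env
    # setdefault leaves an existing value (for example one set on the server) alone.
    env.setdefault("AZURE_USE_AZURE_CLI", "True")
    return env


def open_store(url):
    # Imported lazily so this sketch can be loaded where obstore is absent.
    from obstore.store import from_url

    # No Azure-specific keyword arguments: configuration is expected to come
    # from the environment.
    return from_url(url)
```

In practice the variable would more likely be exported in the shell or a local `.env` file than set from Python, which keeps the application code identical in both environments.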

> Polars provides a credential_provider argument as a standard interface for managing arbitrary credentials.

Yes, we'll have something similar soon: #234. The PR works (with either sync or async callbacks!), I just haven't implemented the caching yet (so that credentials aren't fetched on every request).
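For illustration, the kind of caching meant here could look like the sketch below. It is not the implementation from #234: the wrapper class, the `expires_at` key (a Unix timestamp in this example), and the refresh margin are all assumptions.

```python
import time


class CachedCredentialProvider:
    """Wrap a credential callback so it is only re-run near expiry."""

    def __init__(self, fetch, refresh_margin=60.0, clock=time.time):
        self._fetch = fetch                    # callable returning a credential dict
        self._refresh_margin = refresh_margin  # seconds before expiry to refresh
        self._clock = clock                    # injectable for testing
        self._cached = None

    def __call__(self):
        now = self._clock()
        expired = (
            self._cached is None
            or self._cached["expires_at"] - self._refresh_margin <= now
        )
        if expired:
            # Only hit the underlying provider when there is no usable credential.
            self._cached = self._fetch()
        return self._cached
```

Each request calls the wrapper, but the underlying callback only runs when no credential is cached or the cached one is within `refresh_margin` seconds of expiring.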

> Note that Polars uses azure.identity.DefaultAzureCredential as the default credential provider for Azure, as it automatically finds available credentials.

As mentioned above, we can add something like S3Store.from_session for Azure. I haven't really used Azure before, so I'm not as familiar with this ecosystem.

> (It's not a hard dependency on azure.identity. An import error is only raised if you try to use Azure credential functionality and don't have azure.identity installed.)

Does this mean that any time you use any Azure data sources, you need to have azure.identity installed? We don't want that; I think it's important for obstore to work without any external dependencies. This is why S3Store.from_session is a separate constructor, so that users can opt in to using boto3.

> This gets us halfway. If I know I'm using Azure, I can initialise an Azure credential, then pass it in to the credential_provider argument. So I can use DefaultAzureCredential, and it will automatically pick up an appropriate credential.

In #234 we can also add credential_provider as a keyword argument to obstore.store.from_url. This credential provider will be passed down to whichever store ends up getting constructed.

> The second step is to set up defaults for each provider. For example, if Polars detects that a given URL is an Azure URL, it automatically sets up a DefaultAzureCredential, without the user needing to set up or specify a credential_provider.

I'm inclined to keep the defaults aligned with whatever object_store implements under the hood, while letting users opt in to credential handling with third-party libraries like azure.identity.

@daviewales
Author

Regarding fsspec, the Polars docs might be outdated: pola-rs/polars#15043

Regarding environment variables, thanks for the link. I can see that you automatically detect the standard AZURE_CLIENT_SECRET, AZURE_CLIENT_ID and AZURE_TENANT_ID variables, which are the first things that DefaultAzureCredential checks.

I believe that object_store may also automatically detect IMDS (managed?) identities, which I think corresponds to the third step for DefaultAzureCredential.

The missing piece for me is automatically trying azure_use_azure_cli if neither of these are found.

DefaultAzureCredential has a few more resolution steps than this, but this would at least solve my own use case, without needing DefaultAzureCredential installed. (It might be possible to also try some of the other resolution steps.)
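The resolution order being proposed (environment variables, then managed identity, then Azure CLI) can be sketched as a simple ordered chain. This is a hypothetical illustration, not how object_store or DefaultAzureCredential is implemented; the managed identity and CLI probes are left as injectable callables because real checks need network or subprocess access.

```python
CLIENT_SECRET_VARS = ("AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET", "AZURE_TENANT_ID")


def has_client_secret(env):
    """True when all three client-secret variables are present and non-empty."""
    return all(env.get(name) for name in CLIENT_SECRET_VARS)


def first_available(steps):
    """Return the name of the first step whose probe succeeds, else None.

    `steps` is an ordered list of (name, probe) pairs; later probes are not
    run once an earlier one succeeds.
    """
    for name, probe in steps:
        if probe():
            return name
    return None
```

For example, a local machine with no environment credentials and no managed identity would fall through to the Azure CLI step, which is the behaviour requested in this issue.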

One option, which would avoid the hard requirement for azure.identity, would be to use it if it's installed, but fall back gracefully to the default object_store credential process if it's not installed. However, this may be more confusing than helpful, as there are then two behaviours, depending on whether a dependency is installed.

@kylebarron
Member

kylebarron commented Feb 14, 2025

> Regarding environment variables, thanks for the link. I can see that you automatically detect the standard AZURE_CLIENT_SECRET, AZURE_CLIENT_ID and AZURE_TENANT_ID variables, which are the first things that DefaultAzureCredential checks.

Any env variable starting with AZURE_ is evaluated, so it includes many more config values than those three.
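As a rough Python illustration of that prefix-based discovery (the real logic lives in the Rust object_store crate, so this is a sketch under that assumption and not its actual code; `azure_config_from_env` is a hypothetical name):

```python
import os


def azure_config_from_env(env=None):
    """Collect every AZURE_-prefixed variable as a lower-cased config key."""
    env = os.environ if env is None else env
    return {
        key.lower(): value
        for key, value in env.items()
        if key.upper().startswith("AZURE_")
    }
```

This is why AZURE_USE_AZURE_CLI is picked up alongside the three client-secret variables: any variable with the prefix is a candidate config value, not just a fixed list.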

> The missing piece for me is automatically trying azure_use_azure_cli if neither of these are found.

I think you should consider making an upstream issue in the object_store crate for that. I don't want to stray from the defaults that object_store provides. But we can document that users can put AZURE_USE_AZURE_CLI=True in their env to ensure it gets checked.

> One option, which would avoid the hard requirement for azure.identity, would be to use it if it's installed, but fall back gracefully to the default object_store credential process if it's not installed. However, this may be more confusing than helpful, as there are then two behaviours, depending on whether a dependency is installed.

I don't think we want to automatically change behavior depending on whether a dependency is installed. But based on the polars API, I prototyped an opt-in API, which you can see in #269 and #234:

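import obstore as obs
from obstore.google.auth import GoogleAuthAsyncCredentialProvider
from obstore.store import GCSStore

# Opt-in: the provider is constructed explicitly by the user.
credential_provider = GoogleAuthAsyncCredentialProvider()

# _credential_provider is the prototype keyword from the PRs above.
store = GCSStore("bucket", _credential_provider=credential_provider)
list_result = await obs.list(store).collect_async()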

We could do something similar for AzureIdentityCredentialProvider that uses azure.identity under the hood.
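A hedged sketch of what such a provider might look like. The class body, the returned dict shape (`token` and `expires_at`), and the storage scope default are assumptions for illustration, not an existing obstore API; the only azure.identity behaviour relied on is that a credential's `get_token(scope)` returns an object with `token` and `expires_on` (a Unix timestamp).

```python
from datetime import datetime, timezone

# Assumed OAuth scope for Azure Storage.
STORAGE_SCOPE = "https://storage.azure.com/.default"


class AzureIdentityCredentialProvider:
    """Hypothetical opt-in provider backed by azure.identity."""

    def __init__(self, credential=None, scope=STORAGE_SCOPE):
        if credential is None:
            # Imported only when the user opts in to this provider.
            from azure.identity import DefaultAzureCredential

            credential = DefaultAzureCredential()
        self._credential = credential
        self._scope = scope

    def __call__(self):
        access_token = self._credential.get_token(self._scope)
        # The return shape is an assumption, not a documented obstore contract.
        return {
            "token": access_token.token,
            "expires_at": datetime.fromtimestamp(
                access_token.expires_on, timezone.utc
            ),
        }
```

Because the credential is injectable, a user could pass a narrower credential such as `AzureCliCredential()` instead of the default chain, and obstore itself would still have no hard dependency on azure.identity.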
