[FEAT] Create obstore store in fsspec on demand #198
base: main
Conversation
Construct store with from_url using protocol and bucket name
obstore/python/obstore/fsspec.py
```
@@ -45,6 +47,9 @@ def __init__(
    self,
    store: obs.store.ObjectStore,
    *args,
    config: dict[str, Any] = {},
```
If we allow these, `store` should be optional? And before merge we should enable typing overloads for better typing. You can see how `from_url` is implemented.
I use `store` here to decide the store interface (whether it is `S3Store`, `GCSStore`, ...), so that `AsyncFsspecStore` doesn't need to infer the interface from the protocol. Maybe there's a better way of deciding the store interface?
```py
from datetime import timedelta

import fsspec

from obstore.fsspec import AsyncFsspecStore
from obstore.store import S3Store

obstore_fs: AsyncFsspecStore = fsspec.filesystem(
    "s3",
    store=S3Store,
    config={
        "endpoint": "http://localhost:30002",
        "access_key_id": "minio",
        "secret_access_key": "miniostorage",
        "virtual_hosted_style_request": True,  # bucket name in host rather than path
    },
    client_options={"timeout": "99999s", "allow_http": "true"},
    retry_config={
        "max_retries": 2,
        "backoff": {
            "base": 2,
            "init_backoff": timedelta(seconds=2),
            "max_backoff": timedelta(seconds=16),
        },
        "retry_timeout": timedelta(minutes=3),
    },
)
```
I'll have a look at the typing later on
Oh, that's confusing, because `store` is the type of the class and not an instance. We should be able to use the `from_url` top-level function directly here?
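For reference, a minimal sketch of that suggestion, assuming the top-level `from_url` helper infers the concrete store class from the URL scheme and forwards config options as keyword arguments:

```py
from obstore.store import from_url

# from_url picks the store class (S3Store, GCSStore, ...) from the scheme;
# forwarding config options like skip_signature as kwargs is an assumption.
store = from_url("s3://mybucket", skip_signature=True)
```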
obstore/python/obstore/fsspec.py

```py
file_path = "/".join(path_li[1:])
return (bucket, file_path)


@lru_cache(maxsize=10)
```
It would be nice if this cache size could be user-specified, but we can come back to it.
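A hedged sketch of one way that could look, building the cached constructor at `__init__` time so `maxsize` comes from the user; the names here are illustrative, not the PR's API:

```py
from functools import lru_cache

def _make_cached_store_constructor(max_cache_size: int):
    # Return a construct_store callable with a user-sized LRU cache,
    # instead of decorating a module-level function at import time.
    @lru_cache(maxsize=max_cache_size)
    def construct_store(bucket: str):
        ...  # build and return the obstore store for this bucket
    return construct_store
```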
Would there be one fsspec instance per cloud provider? So if you wanted to use s3 and gcs you'd make two separate instances?
Based on what I know, to use fsspec, we will do:

```py
fsspec.register_implementation("s3", AsyncFsspecStore)
fsspec.register_implementation("gs", AsyncFsspecStore)
```

Each will have their own AsyncFsspecStore instance already. To configure them, we can use (based on my current implementation):

```py
s3_fs: AsyncFsspecStore = fsspec.filesystem(
    "s3",
    store=S3Store,
    config={...},
)
gcs_fs: AsyncFsspecStore = fsspec.filesystem(
    "gs",
    store=GCSStore,
    config={...},
)
```
It would be nice to take out the
Force-pushed from 34f79f0 to 29464a7
Specify protocols s3, gs, and abfs
I use obstore/python/obstore/fsspec.py, lines 272 to 279 (at a0d9e1d).
Is it true that a single fsspec class can't be associated with more than one protocol? E.g. Azure has three different protocols.
The latest PRs allow you to access the
I think we can, if those protocols refer to the same object instance. s3fs does have two protocols ("s3", "s3a"); see: https://github.com/fsspec/s3fs/blob/023aecf00b5c6243ff5f8a016dac8b6af3913c6b/s3fs/core.py#L277

I think abfs, adlfs, and az have different implementations, so they export different classes. If we use them in obstore, I think we can define a class with protocols (abfs, adlfs, az), but we need to test if they all work.
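A short sketch of what that multi-protocol grouping could look like, modeled on s3fs; the grouping itself is illustrative:

```py
from fsspec.asyn import AsyncFileSystem

class AsyncFsspecStore(AsyncFileSystem):
    # fsspec accepts a tuple of protocols on a single class, as s3fs does
    # with ("s3", "s3a"); whether these three behave identically would
    # still need testing, per the comment above.
    protocol = ("abfs", "adlfs", "az")
```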
obstore/python/obstore/fsspec.py

```
@@ -104,6 +104,12 @@ def _split_path(self, path: str) -> Tuple[str, str]:
    # no bucket name in path
    return "", path

if path.startswith(self.protocol + "://"):
```
Assuming that this function will always receive something URL-like, such as `s3://mybucket/path/to/file`, I'm inclined for this function to use `urlparse` instead of manually handling the parts of the URL.
It will not always be `s3://mybucket/path/to/file`; it may come without a protocol, like `mybucket/path/to/file`.
I use `urlparse` like this here, which works for both `s3://mybucket/path/to/file` and `mybucket/path/to/file` (obstore/python/obstore/fsspec.py, lines 108 to 112, at 75c738e):

```py
res = urlparse(path)
if res.scheme:
    if res.scheme != self.protocol:
        raise ValueError(f"Expect protocol to be {self.protocol}. Got {res.scheme}")
path = res.netloc + res.path
```
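For illustration, how `urlparse` treats both forms (outputs abbreviated):

```py
from urllib.parse import urlparse

urlparse("s3://mybucket/path/to/file")
# ParseResult(scheme='s3', netloc='mybucket', path='/path/to/file', ...)

urlparse("mybucket/path/to/file")
# ParseResult(scheme='', netloc='', path='mybucket/path/to/file', ...)

# In both cases res.netloc + res.path reassembles "mybucket/path/to/file".
```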
Oh cool! That seems to indicate that we could have a single class that defines supported protocols as:

```py
protocol = ("s3", "s3a", "gs", "az", "abfs", "adlfs")
```

Because the fsspec class used for each is the same? It's just custom kwargs that would need to be passed down for each?
I don't think we can put all the protocols together into one class, as when using obstore/python/obstore/fsspec.py, lines 122 to 124 (at 6614906).

I think the better way is to create
I did a quick look through your PR; it's really good progress, but a few thoughts:
Thanks for the suggestion! I just added the ruff linter and removed the error for path. I also added a check validating that the bucket names from the two paths are the same.

For

Yes! I will update the test in the next few days.
Check if AsyncFsspecStore is registered and test invalid types passed into register
lgtm
Tests are failing; would you be able to fix that?
Oh sorry, it's caused by running out of memory. I'll fix the test then.
If registered multiple times, each with their own instance, the cache does not work and we end up with multiple instances with the same config
Hi @kylebarron
obstore/python/obstore/fsspec.py

```
@@ -296,3 +468,66 @@ def read(self, length: int = -1) -> Any:
    data = self.fs.cat_file(self.path, self.loc, self.loc + length)
    self.loc += length
    return data


def register(protocol: str | list[str], asynchronous: bool = False) -> None:
```
Suggested change:

```diff
-def register(protocol: str | list[str], asynchronous: bool = False) -> None:
+def register(protocol: str | list[str], *, asynchronous: bool = False) -> None:
```
I'm a big proponent of having a small number of positional parameters. This also allows us to make `asynchronous` a positional argument in the future without it being a breaking change.
```
@@ -296,3 +467,50 @@ def read(self, length: int = -1) -> Any:
    data = self.fs.cat_file(self.path, self.loc, self.loc + length)
    self.loc += length
    return data


def register(protocol: str | Iterable[str], *, asynchronous: bool = False) -> None:
```
This can be a follow-up PR, but we should support calling `register` with no arguments, which registers all supported backends.

I have to run, but I can do a final review later today or tomorrow.
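A hedged sketch of that follow-up idea; the protocol list and the `clobber` choice are assumptions, and wiring `asynchronous` through to the class is omitted:

```py
from collections.abc import Iterable

import fsspec

from obstore.fsspec import AsyncFsspecStore

# Hypothetical list; the real set would come from the library itself.
ALL_PROTOCOLS = ("s3", "s3a", "gs", "gcs", "az", "abfs", "http", "https")

def register(protocol: str | Iterable[str] | None = None, *, asynchronous: bool = False) -> None:
    if protocol is None:
        protocol = ALL_PROTOCOLS
    if isinstance(protocol, str):
        protocol = [protocol]
    for p in protocol:
        # clobber=True overwrites any previously registered implementation
        fsspec.register_implementation(p, AsyncFsspecStore, clobber=True)
```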
I just added the type checking in
I tried a basic list operation and couldn't get it to work with this PR. A successful list with s3fs:

```py
import s3fs

fs = s3fs.S3FileSystem(anon=True)
path = "s3://sentinel-cogs/sentinel-s2-l2a-cogs/12/S/UF/2022/6/S2B_12SUF_20220609_0_L2A"
fs.ls(path)
```

prints

```
['sentinel-cogs/sentinel-s2-l2a-cogs/12/S/UF/2022/6/S2B_12SUF_20220609_0_L2A/AOT.tif',
 'sentinel-cogs/sentinel-s2-l2a-cogs/12/S/UF/2022/6/S2B_12SUF_20220609_0_L2A/B01.tif',
 ...
```

Trying to do this with this PR:

```py
# Neither of these two work
store = AsyncFsspecStore(config={"skip_signature": True})
store = AsyncFsspecStore("s3", config={"skip_signature": True})
path = "s3://sentinel-cogs/sentinel-s2-l2a-cogs/12/S/UF/2022/6/S2B_12SUF_20220609_0_L2A"
store.ls(path)
```

raises with

You're checking
"AWS_SKIP_SIGNATURE": "True", | ||
"AWS_ALLOW_HTTP": "true", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"AWS_SKIP_SIGNATURE": "True", | |
"AWS_ALLOW_HTTP": "true", | |
"AWS_SKIP_SIGNATURE": True, | |
"AWS_ALLOW_HTTP": True, |
```py
@pytest.fixture
def s3_store_config(s3: str):
```

Suggested change:

```diff
-def s3_store_config(s3: str):
+def s3_store_config(s3: str) -> S3ConfigInput:
```
```
args: positional arguments passed on to the `fsspec.asyn.AsyncFileSystem`
    constructor.

Keyword Args:
    asynchronous: Set to `True` if this instance is meant to be called using
        the fsspec async API. This should only be set to true when running
        within a coroutine.
    max_cache_size (int, optional): The maximum number of stores the cache
        should keep. A cached store is kept internally for each bucket name.
        Defaults to 10.
    loop: since both fsspec/python and tokio/rust may be using loops, this
        should be kept `None` for now, and will not be used.
    batch_size: some operations on many files will batch their requests; if you
```
The example below in this docstring is no longer valid. Can you update it?
Suggested change:

```diff
-def __init__(
+def __init__(  # noqa: PLR0913
```
It seems you want a `protocol` parameter here, which sets the value of `protocol` onto `self`?
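A minimal sketch of what that might look like, assuming an instance-level override of fsspec's class-level `protocol` attribute:

```py
from fsspec.asyn import AsyncFileSystem

class AsyncFsspecStore(AsyncFileSystem):
    def __init__(self, protocol: str | None = None, *args, **kwargs):
        if protocol is not None:
            # Override the class-level fsspec protocol for this instance
            self.protocol = protocol
        super().__init__(*args, **kwargs)
```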
""" | ||
protocol_with_bucket = ["s3", "s3a", "gcs", "gs", "abfs", "https", "http"] | ||
|
||
if self.protocol not in protocol_with_bucket: |
What are examples of protocols that we support that are not any of the above?
(That is, why are we even doing this check?)
if "/" not in path: | ||
return path, "" |
Aren't you returning a tuple of `(bucket, file_path)`? Then returning `path, ""` doesn't make sense.
It seems we'd want to error in this case? There's no way for us to infer the bucket to use.

As it stands, you're searching through the string twice: once here and again below in `path.split`. Instead, you can call `path.split("/", 1)` once. If the result is a list of length 1, then you know a `"/"` wasn't in the path, and then you can error.
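A sketch of the suggested shape, as a standalone helper with hypothetical naming:

```py
def _split_path(path: str) -> tuple[str, str]:
    parts = path.split("/", 1)
    if len(parts) == 1:
        # No "/" in the path, so there is no way to infer the bucket
        raise ValueError(f"cannot infer bucket name from path: {path!r}")
    bucket, file_path = parts
    return bucket, file_path
```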
```py
path_li = path.split("/")
bucket = path_li[0]
file_path = "/".join(path_li[1:])
```
Much simpler than this is to split on only the first `/` character:

```py
path = "bucket/path/to/file.txt"
path.split("/", 1)
# ["bucket", "path/to/file.txt"]
```

Then you don't need to split and rejoin the path.
```py
super().__init__(
    *args,
    asynchronous=asynchronous,
    loop=loop,
    batch_size=batch_size,
)

def _split_path(self, path: str) -> tuple[str, str]:
```
The only thing this function uses from `self` is `self.protocol`. Let's move `_split_path` into global scope, and then we can test `_split_path` specifically from the test file.

We should validate that we can split the path both for URLs with a protocol and for "paths" without the protocol.
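A hypothetical pytest sketch of that, assuming `_split_path` has been moved to module scope with a `(path) -> (bucket, file_path)` signature:

```py
import pytest

from obstore.fsspec import _split_path  # assumed module-level after the move

@pytest.mark.parametrize(
    ("path", "expected"),
    [
        ("s3://mybucket/path/to/file", ("mybucket", "path/to/file")),  # with protocol
        ("mybucket/path/to/file", ("mybucket", "path/to/file")),  # bare path
    ],
)
def test_split_path(path: str, expected: tuple[str, str]) -> None:
    assert _split_path(path) == expected
```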
```py
loop = asyncio.get_running_loop()
return await loop.run_in_executor(None, super().info, path, **_kwargs)
```
Why are you calling `super().info`? You can call `super()._info`, which is async, and not need to touch the running event loop at all.
```py
return await loop.run_in_executor(None, super().info, path, **_kwargs)

@staticmethod
def _fill_bucket_name(path: str, bucket: str) -> str:
```
This is only used in two places. Can we just copy the f-string above and delete this helper function?
I'd rather not. I'd rather just rely on the static type checker (at least for the
Construct the obstore store instance on demand in fsspec when calling methods. This allows automatic store creation for reads/writes across different buckets, aligning usage with fsspec conventions.
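A hedged usage sketch of the described behavior; `register` and the config key are assumed from the discussion above:

```py
import fsspec

from obstore.fsspec import register

register("s3")
fs = fsspec.filesystem("s3", config={"skip_signature": True})

# Stores are constructed per bucket on first use and cached, so one
# filesystem instance can serve reads across different buckets.
fs.ls("s3://bucket-a/prefix")
fs.ls("s3://bucket-b/prefix")
```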