
[FEAT] Create obstore store in fsspec on demand #198

Open · wants to merge 45 commits into main

Conversation

machichima

Construct the obstore store instance on demand in fsspec when calling methods. This allows automatic store creation for reads/writes across different buckets, aligning usage with fsspec conventions

construct store with from_url using protocol and bucket name
@@ -45,6 +47,9 @@ def __init__(
    self,
    store: obs.store.ObjectStore,
    *args,
    config: dict[str, Any] = {},
Member

If we allow these, store should be optional?

And before merging we should enable typing overloads for better typing. You can see how from_url is implemented.

Author

I use store here for deciding the store interface (whether it is S3Store, GCSStore, ...), so that in AsyncFsspecStore we don't need to decide the interface based on the protocol.

Maybe there's a better way of deciding the store interface?

from datetime import timedelta

import fsspec

from obstore.fsspec import AsyncFsspecStore
from obstore.store import S3Store

# assumes AsyncFsspecStore was registered for "s3" via fsspec.register_implementation
obstore_fs: AsyncFsspecStore = fsspec.filesystem(
    "s3",
    store=S3Store,
    config={
        "endpoint": "http://localhost:30002",
        "access_key_id": "minio",
        "secret_access_key": "miniostorage",
        "virtual_hosted_style_request": True,  # path contain bucket name
    },
    client_options={"timeout": "99999s", "allow_http": "true"},
    retry_config={
        "max_retries": 2,
        "backoff": {
            "base": 2,
            "init_backoff": timedelta(seconds=2),
            "max_backoff": timedelta(seconds=16),
        },
        "retry_timeout": timedelta(minutes=3),
    },
)

Author

I'll have a look at the typing later on

Member

Oh that's confusing because store is the type of the class and not an instance.

We should be able to use the from_url top-level function directly here?
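
For reference, from_url infers the store class from the URL scheme; a minimal sketch (the config kwarg shape here is an assumption):

import obstore as obs

# from_url picks S3Store, GCSStore, ... based on the URL scheme
store = obs.store.from_url("s3://my-bucket", config={"skip_signature": True})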

file_path = "/".join(path_li[1:])
return (bucket, file_path)

@lru_cache(maxsize=10)
Member

It would be nice if this cache size could be user-specified, but we can come back to it.
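
One way to make it user-configurable later would be to wrap the method per instance instead of decorating it at class scope (a sketch; max_cache_size matches the kwarg name that appears later in this PR's docstring):

from functools import lru_cache

class Example:
    def __init__(self, max_cache_size: int = 10) -> None:
        # bind a per-instance cached wrapper around the bound method
        self._construct_store = lru_cache(maxsize=max_cache_size)(self._construct_store)

    def _construct_store(self, bucket: str):
        ...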

@kylebarron
Member

Would there be one fsspec instance per cloud provider? So if you wanted to use s3 and gcs you'd make two separate instances?

@machichima
Author

machichima commented Feb 3, 2025

Would there be one fsspec instance per cloud provider? So if you wanted to use s3 and gcs you'd make two separate instances?

Based on what I know, to use fsspec, we will do:

fsspec.register_implementation("s3", AsyncFsspecStore)
fsspec.register_implementation("gs", AsyncFsspecStore)

Each will have its own AsyncFsspecStore instance already. To configure, we can use (based on my current implementation):

s3_fs: AsyncFsspecStore = fsspec.filesystem(
    "s3",
    store=S3Store,
    config={...}
)

gcs_fs: AsyncFsspecStore = fsspec.filesystem(
    "gs",
    store=GCSStore,
    config={...}
)

@kylebarron
Member

It would be nice to take out the store arg and use from_url directly. from_url will automatically construct the correct store based on the URL protocol.

@machichima force-pushed the obstore-instance-in-fsspec branch from 34f79f0 to 29464a7 on February 4, 2025 14:06
@machichima
Author

I use from_url and removed store in the latest commit. However, by doing this, we need to specify the protocol by inheriting from the AsyncFsspecStore class for each store instance. I added this here:

class S3FsspecStore(AsyncFsspecStore):
    protocol = "s3"


class GCSFsspecStore(AsyncFsspecStore):
    protocol = "gs"


class AzureFsspecStore(AsyncFsspecStore):
    protocol = "abfs"

@kylebarron
Member

Is it true that a single fsspec class can't be associated with more than one protocol? E.g. Azure has three different protocols abfs, adlfs and az, but it looks like adlfs exports three separate classes.

@kylebarron
Member

The latest PRs allow you to access the config back out of a store, which may be useful to you? You can validate that you already have a store in your cache for a specific bucket

@machichima
Author

machichima commented Feb 6, 2025

Is it true that a single fsspec class can't be associated with more than one protocol? E.g. Azure has three different protocols abfs, adlfs and az, but it looks like adlfs exports three separate classes.

I think we can if those protocols refer to the same object instance. s3fs does have 2 protocols ("s3", "s3a"), see: https://github.com/fsspec/s3fs/blob/023aecf00b5c6243ff5f8a016dac8b6af3913c6b/s3fs/core.py#L277

I think abfs, adlfs, and az have different implementations, so they export different classes. If we use them in obstore, I think we can define a class with protocols (abfs, adlfs, az), but we need to test whether they all work.
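
For example, s3fs declares both aliases on a single class (a simplified sketch of the linked source):

from fsspec.asyn import AsyncFileSystem

class S3FileSystem(AsyncFileSystem):
    protocol = ("s3", "s3a")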

@@ -104,6 +104,12 @@ def _split_path(self, path: str) -> Tuple[str, str]:
# no bucket name in path
return "", path

if path.startswith(self.protocol + "://"):
Member

Assuming that this function will always receive a URL like s3://mybucket/path/to/file, I'm inclined for this function to use urlparse instead of manually handling the parts of the URL.

Author

It will not always be s3://mybucket/path/to/file; it may also come without a protocol, like mybucket/path/to/file.

Author

machichima commented Feb 7, 2025

I use urlparse like this here, which works for both s3://mybucket/path/to/file and mybucket/path/to/file

from urllib.parse import urlparse

res = urlparse(path)
if res.scheme:
    if res.scheme != self.protocol:
        raise ValueError(f"Expect protocol to be {self.protocol}. Got {res.scheme}")
    path = res.netloc + res.path
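
For both input shapes, urlparse gives usable results (illustrative REPL output, non-relevant fields omitted):

from urllib.parse import urlparse

urlparse("s3://mybucket/path/to/file")
# ParseResult(scheme='s3', netloc='mybucket', path='/path/to/file', ...)
urlparse("mybucket/path/to/file")
# ParseResult(scheme='', netloc='', path='mybucket/path/to/file', ...)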

@kylebarron
Member

I think we can if those protocols refer to the same object instance. s3fs does have 2 protocols ("s3", "s3a"), see: fsspec/s3fs@023aecf/s3fs/core.py#L277

Oh cool! That seems to indicate that we could have a single class that defines supported protocols as:

    protocol = ("s3", "s3a", "gs", "az", "abfs", "adlfs")

Because the fsspec class used for each is the same? It's just custom kwargs that would need to be passed down for each?

@machichima
Author

Oh cool! That seems to indicate that we could have a single class that defines supported protocols as:

    protocol = ("s3", "s3a", "gs", "az", "abfs", "adlfs")

Because the fsspec class used for each is the same? It's just custom kwargs that would need to be passed down for each?

I don't think we can put all the protocols together into one class, as when using fsspec.register_implementation("s3", AsyncFsspecStore), fsspec wouldn't tell AsyncFsspecStore what the protocol is, so when constructing the store instance, we cannot get the protocol:

def _construct_store(self, bucket: str):
    return from_url(
        url=f"{self.protocol}://{bucket}",

I think the better way is to create obstore.fsspec.register("protocol"), which wraps around fsspec.register_implementation and directly sets the protocol on AsyncFsspecStore (like what's mentioned in this comment); then we do not need more classes. Let me have a try.
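
A rough sketch of that wrapper (creating the subclass dynamically via type() is one option; names are illustrative):

import fsspec

def register(protocol: str) -> None:
    fsspec.register_implementation(
        protocol,
        # subclass with `protocol` baked in, so _construct_store builds the right URL
        type(f"AsyncFsspecStore_{protocol}", (AsyncFsspecStore,), {"protocol": protocol}),
        clobber=False,
    )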

@kylebarron
Member

I did a quick look through your PR; it's really good progress but a few thoughts:

  • There are a bunch of cases where bucket, path = self._split_path(path) doesn't work because path is not in scope. E.g. in _cp_file where path1 and path2 are in scope
  • in _cp_file we need to validate that the buckets of the source and destination paths are the same
  • We need some tests for edits that happen in this PR
  • It's not clear how BufferedFileSimple works, because that subclasses from an upstream fsspec.spec.AbstractBufferedFile but doesn't touch obstore APIs at all
  • If you don't already, I'd highly suggest using a linter like https://docs.astral.sh/ruff/ in your editor, so that you can catch some of these issues before hitting CI

@machichima
Author

  • If you don't already, I'd highly suggest using a linter like https://docs.astral.sh/ruff/ in your editor, so that you can catch some of these issues before hitting CI

Thanks for the suggestion! I just added the ruff linter and removed the error for path. I also added a check validating that the bucket names from the two paths are the same.

  • It's not clear how BufferedFileSimple works, because that subclasses from an upstream fsspec.spec.AbstractBufferedFile but doesn't touch obstore APIs at all

For BufferedFileSimple, when self.fs.cat_file() is called, it dispatches to the _cat_file() function in AsyncFsspecStore.

  • We need some tests for edits that happen in this PR

Yes! I will update the test in the next few days

@Future-Outlier Future-Outlier left a comment

lgtm

@kylebarron
Member

Tests are failing; would you be able to fix that?

@machichima
Author

machichima commented Feb 21, 2025

The failing tests are XFailed ones; I don't know why GitHub is now catching those. Should I take them out?

Oh sorry, it's caused by running out of memory. I'll fix the tests then.

If registered multiple times, each with its own instance, the cache does
not work and we end up with multiple instances with the same config
@machichima
Author

machichima commented Feb 23, 2025

Tests are failing; would you be able to fix that?

Hi @kylebarron
I just added the code to clean up after each test. The OOM error is because the file system instances were not cleaned up.

@@ -296,3 +468,66 @@ def read(self, length: int = -1) -> Any:
    data = self.fs.cat_file(self.path, self.loc, self.loc + length)
    self.loc += length
    return data


def register(protocol: str | list[str], asynchronous: bool = False) -> None:
Member
Suggested change
def register(protocol: str | list[str], asynchronous: bool = False) -> None:
def register(protocol: str | list[str], *, asynchronous: bool = False) -> None:

I'm a big proponent of having a small number of positional parameters. This also allows us to make asynchronous a positional argument in the future without it being a breaking change.

@@ -296,3 +467,50 @@ def read(self, length: int = -1) -> Any:
    data = self.fs.cat_file(self.path, self.loc, self.loc + length)
    self.loc += length
    return data


def register(protocol: str | Iterable[str], *, asynchronous: bool = False) -> None:
Member

This can be a follow-up PR, but we should support calling register with no arguments, which registers for all supported backends.
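
A sketch of that follow-up behavior (SUPPORTED_PROTOCOLS is a hypothetical name):

from collections.abc import Iterable

def register(protocol: str | Iterable[str] | None = None, *, asynchronous: bool = False) -> None:
    if protocol is None:
        # register every backend obstore supports, e.g. ("s3", "gs", "abfs", ...)
        protocol = SUPPORTED_PROTOCOLS
    ...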

@kylebarron
Member

I have to run, but I can do a final review later today or tomorrow

@machichima
Author

machichima commented Feb 25, 2025

I just added type checking in register, as type annotations on parameters in Python only work for static checks and not at runtime, so we should explicitly check types here.

@kylebarron
Member

I tried a basic list operation and couldn't get it to work with this PR.

A successful list with s3fs:

import s3fs
fs = s3fs.S3FileSystem(anon=True)
path = "s3://sentinel-cogs/sentinel-s2-l2a-cogs/12/S/UF/2022/6/S2B_12SUF_20220609_0_L2A"
fs.ls(path)

prints

['sentinel-cogs/sentinel-s2-l2a-cogs/12/S/UF/2022/6/S2B_12SUF_20220609_0_L2A/AOT.tif',
 'sentinel-cogs/sentinel-s2-l2a-cogs/12/S/UF/2022/6/S2B_12SUF_20220609_0_L2A/B01.tif',
...

Trying to do this with this PR:

from obstore.fsspec import AsyncFsspecStore

# Neither of these two works
store = AsyncFsspecStore(config={"skip_signature": True})
store = AsyncFsspecStore("s3", config={"skip_signature": True})
path = "s3://sentinel-cogs/sentinel-s2-l2a-cogs/12/S/UF/2022/6/S2B_12SUF_20220609_0_L2A"
store.ls(path)

raises with

GenericError: Generic URL error: Unable to recognise URL "abstract://"

You're checking self.protocol in multiple places but it's never getting set.

Comment on lines +59 to +60
"AWS_SKIP_SIGNATURE": "True",
"AWS_ALLOW_HTTP": "true",
Member

Suggested change
"AWS_SKIP_SIGNATURE": "True",
"AWS_ALLOW_HTTP": "true",
"AWS_SKIP_SIGNATURE": True,
"AWS_ALLOW_HTTP": True,



@pytest.fixture
def s3_store_config(s3: str):
Member

Suggested change
def s3_store_config(s3: str):
def s3_store_config(s3: str) -> S3ConfigInput:

args: positional arguments passed on to the `fsspec.asyn.AsyncFileSystem`
    constructor.

Keyword Args:
    asynchronous: Set to `True` if this instance is meant to be called using
        the fsspec async API. This should only be set to true when running
        within a coroutine.
    max_cache_size (int, optional): The maximum number of stores the cache
        should keep. A cached store is kept internally for each bucket name.
        Defaults to 10.
    loop: since both fsspec/python and tokio/rust may be using loops, this
        should be kept `None` for now, and will not be used.
    batch_size: some operations on many files will batch their requests; if you
Member

kylebarron commented Feb 26, 2025

The example below in this docstring is no longer valid. Can you update it?


def __init__(
def __init__( # noqa: PLR0913
Member

It seems you want a protocol parameter here, which sets the value of protocol onto self?

"""
protocol_with_bucket = ["s3", "s3a", "gcs", "gs", "abfs", "https", "http"]

if self.protocol not in protocol_with_bucket:
Member

kylebarron commented Feb 26, 2025

What are examples of protocols that we support that are not any of the above?

(That is, why are we even doing this check?)

Comment on lines +178 to +179
if "/" not in path:
return path, ""
Member

Aren't you returning a tuple of (bucket, file_path)? Then returning path, "" doesn't make sense.

Member

It seems we'd want to error in this case? There's no way for us to infer the bucket to use.

As it stands you're searching through the string twice, once here and again below in path.split.

Instead, you can call path.split("/", 1) once. If the result is a list of length 1, then you know a "/" wasn't in the path, and then you can error.

Comment on lines +180 to +182
path_li = path.split("/")
bucket = path_li[0]
file_path = "/".join(path_li[1:])
Member

Much simpler than this is to split on only the first / character:

path = "bucket/path/to/file.txt"
path.split("/", 1)
# ["bucket", "path/to/file.txt"]

Then you don't need to split and rejoin the path

super().__init__(
    *args,
    asynchronous=asynchronous,
    loop=loop,
    batch_size=batch_size,
)

def _split_path(self, path: str) -> tuple[str, str]:
Member

The only thing this function uses from self is self.protocol. Let's move _split_path into global scope, and then we can test _split_path specifically from the test file.

We should validate that we can split the path both for URLs with a protocol and for "paths" without the protocol.
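
A sketch of the module-level version, combining the urlparse handling discussed above (names and error messages are illustrative):

from urllib.parse import urlparse

def _split_path(path: str, protocol: str) -> tuple[str, str]:
    """Split 's3://bucket/key' or 'bucket/key' into (bucket, key)."""
    res = urlparse(path)
    if res.scheme:
        if res.scheme != protocol:
            raise ValueError(f"Expect protocol to be {protocol}. Got {res.scheme}")
        path = res.netloc + res.path
    parts = path.split("/", 1)
    if len(parts) == 1:
        # no "/" in the path, so there is no way to infer the bucket
        raise ValueError(f"Cannot infer bucket name from {path!r}")
    return parts[0], parts[1]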

Comment on lines +377 to +378
loop = asyncio.get_running_loop()
return await loop.run_in_executor(None, super().info, path, **_kwargs)
Member

Why are you calling super().info? You can call super()._info, which is async, and not need to touch the running event loop at all.
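
In other words, a sketch of the suggested shape (fsspec's async counterparts are underscore-prefixed):

async def _info(self, path, **kwargs):
    return await super()._info(path, **kwargs)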

return await loop.run_in_executor(None, super().info, path, **_kwargs)

@staticmethod
def _fill_bucket_name(path: str, bucket: str) -> str:
Member

This is only used in two places. Can we just copy the f-string above and delete this helper function?
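
(Presumably the inlined form is just path = f"{bucket}/{path}" at both call sites; the exact f-string is an assumption here.)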

@kylebarron
Member

I just added type checking in register, as type annotations on parameters in Python only work for static checks and not at runtime, so we should explicitly check types here.

I'd rather not. I'd rather just rely on the static type checker (at least for the register function) and keep the code more concise. Especially since register isn't a confusing API. It shouldn't be surprising that if you call register(123), the code won't work.
