Set a default for "Zarr dir" #1934

Closed
lorenzocerrone opened this issue Oct 16, 2024 · 11 comments · Fixed by #1990

Comments

@lorenzocerrone

Since the user groups have a project directory assigned, setting a reasonable default when people create a dataset would be great.
Something like this:
[Screenshot 2024-10-16 at 11 49 34]

This would be great because many users found this very cumbersome to fill.

Some ideas for default values:

  • /shares/{project_name}/{user_name}/{dataset_name}_{dataset_id}
  • /shares/{project_name}/{user_name}/datasets/{dataset_name}_{dataset_id}
  • /shares/{project_name}/datasets/{user_name}/{dataset_name}_{dataset_id}

I don't have any strong opinions on them, but it would be good if we picked something unlikely to change.
@tcompa @jluethi

@tcompa
Collaborator

tcompa commented Oct 16, 2024

I can definitely see the use for this feature (which probably would live frontend-side rather than here - I'm now transferring the issue). We'd still need to define it a bit better:

  1. A user can be associated to 1 or N user groups - which one takes priority?
  2. Each user group can be associated to 0, 1 or N paths - which one takes priority?
  3. There is no (required) username associated to users (at least for the moment). What do we do if username=null?
  4. For the moment the dataset name can be edited by the user.

Once we have these or similar questions defined, we can come up with a "hint". Additional details:

  • The hint could be either a pre-filled text field or part of an info message like "E.g. you can use /something/example/...".
  • And the feature should fail gracefully if the hint is not found: e.g. if my user groups are not associated to any path, I simply won't see any hint.

> I don't have any strong opinions on them, but it would be good if we picked something unlikely to change.

Several attributes are stable only in practice but not immutable by design: user/usergroup associations, usergroup paths, dataset names. If we aim at guaranteeing that these folders won't change, then we'll need to introduce an appropriate attribute like user.project_folder (and I guess this will be per-user, rather than per-group).

@tcompa
Collaborator

tcompa commented Oct 16, 2024

(keeping the issue backend-side in case we want to introduce a new database column in user settings, otherwise I think it will be handled client-side)

@jluethi
Collaborator

jluethi commented Oct 16, 2024

I'm a big fan of adding something like this when we have a good default we know for users. While direct access to pick a directory is harder and will take a while to get to, this we can do earlier on.

> Since the user groups have a project directory assigned

That would be their Vizarr allowed base directory. Maybe we should consider renaming it to remove the Vizarr part, because we may use it as a default base path, and we may use it to stream data to napari eventually :) What is it called atm @tcompa ? It's shown as "Viewer paths" in the group management

Re what our default should be:
For the science cloud beta deployment, I'd suggest:
$Viewer_paths[0]/{user_name}/fractal/{dataset_id}_{dataset_name}

Things to consider:

  1. Users could have multiple viewer_paths. Let's use the first path from the first group they are assigned to as their default
  2. Handling whether viewer paths end in / or not
  3. Only doing this when a viewer_paths variable is set
  4. Let's add a fractal folder in between => if users also manually add files to their folder on the cluster, those would show up somewhere else
  5. Order-wise, I'd go dataset_id first, because then it may just order by time? I think users rarely delete their datasets. Do we actually reuse lower IDs when they become available? If yes, there is a small potential for a conflict with an existing folder.

We may only do that for the Science Cluster or come up with a way that would also work for other deployments.

For UZH Pelkmans lab:
I'd want to achieve something like
/data/active/{slurm_user}/fractal/{dataset_id}_{dataset_name}

FMI Liberali lab:
I'd want to achieve something like /base_path/{slurm_user}/fractal/{dataset_id}_{dataset_name}

Where the base path is the same for all users
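
A minimal sketch of the default proposed above, assuming the viewer paths are available as one list of paths per group, in group order. The function name `suggest_zarr_dir` and its parameters are hypothetical and not part of fractal-server or fractal-web. It returns `None` whenever no default can be built, so the field would simply stay empty.

```python
from typing import Optional, Sequence


def suggest_zarr_dir(
    group_viewer_paths: Sequence[Sequence[str]],
    user_name: Optional[str],
    dataset_id: int,
    dataset_name: str,
) -> Optional[str]:
    """Return a suggested zarr dir, or None when no default can be built."""
    # Only the first group is considered, and only its first path (item 1).
    if not group_viewer_paths or not group_viewer_paths[0]:
        return None  # no viewer path set: no default (item 3)
    if not user_name:
        return None  # assumption: without a user name there is no suggestion
    # Tolerate base paths with or without a trailing slash (item 2).
    base = group_viewer_paths[0][0].rstrip("/")
    # "fractal" subfolder (item 4), dataset ID before the name (item 5).
    return f"{base}/{user_name}/fractal/{dataset_id}_{dataset_name}"
```

For the lab deployments, the same helper would work with a single-entry list such as `[["/data/active"]]` and the SLURM user passed as `user_name`.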

@jluethi
Collaborator

jluethi commented Oct 16, 2024

Very good questions Tommaso!

> A user can be associated to 1 or N user groups - which one takes priority?

I'd say their first group

> Each user group can be associated to 0, 1 or N paths - which one takes priority?

0 paths => no default shown (or maybe next group checked??)

> There is no (required) username associated to users (at least for the moment). What do we do if username=null?

Great question! Maybe this would be a case for where the username starts to become relevant

> For the moment the dataset name can be edited by the user.

That seems fine to me. We're not forcing data to go there, we just provide a default. They can put it anywhere or change the dataset name later without changing the paths.

Re hint vs default: If we can make it work well enough, I'd strongly prefer a default


> Several attributes are stable only in practice but not immutable by design: user/usergroup associations, usergroup paths, dataset names. If we aim at guaranteeing that these folders won't change, then we'll need to introduce an appropriate attribute like user.project_folder (and I guess this will be per-user, rather than per-group).

That's an interesting idea as well to avoid a lot of heuristics above and make it work for different deployments.
In general, I'm not worried about whether the default proposal changes over time, or whether a user would get the same base path a month later.
What matters more is that we suggest a reasonable default to the user.

@tcompa
Collaborator

tcompa commented Oct 18, 2024

An additional detail here: We cannot naively rely on the dataset ID, since this is only known after the dataset is created.

  1. This would be no problem if we decided (unlikely) to defer the folder choice to the backend. In that case, I'd suggest we define the proper database attributes (user.base_zarr_folder, required) and rules (e.g. anything like base_folder + "dataset_id" + "dataset_name") and proceed with this.
  2. It's technically easy to also expose the two options in the API: trust the server (they will pick a default zarr-dir for you, but you can't know it in advance) or give your own path.
  3. It's not clear at the moment how we could use this rule as part of a "hint", namely something that is known frontend-side but can still be modified by the user.

I don't have a clear solution in mind right now - TBD

@jluethi
Collaborator

jluethi commented Oct 18, 2024

> An additional detail here: We cannot naively rely on the dataset ID, since this is only known after the dataset is created.

Ah, good catch! True, that makes it trickier. Let's discuss at the Monday call whether we can find a good direction for this.

@jluethi
Collaborator

jluethi commented Oct 18, 2024

> It's technically easy to also expose the two options in the API: trust the server (they will pick a default zarr-dir for you, but you can't know it in advance) or give your own path.

This is something we can consider from my perspective. The result is: Setting a zarr_dir becomes optional, because the server will have a way to create one for you if you did not do it

@jluethi
Collaborator

jluethi commented Oct 21, 2024

Discussion results

Reusing IDs: Not intended to happen => @tcompa will double check

Goal: Users do not need to specify a zarr_dir by default. If they don't set one, Fractal creates it.

It would get put into this folder:
Favored: user.project_folder/fractal/{project_id}_{project_name}/{dataset_id}_{dataset_name}
Alternative: user.project_folder/fractal/{dataset_id}_{dataset_name}

user.project_folder sounds like a very useful thing to have:

  • It should be user owned
  • It could be used in all deployments
  • It will likely be optional. If unset => can't create standard dataset folders etc.
  • If it is set, we could create a unique subfolder in user.project_folder

What would user.project_folder get set to:

Science Cluster

/shares/project_share/username

Pelkmans lab

/data/active/slurm_user

Liberali lab

/path/to/gliberal/Users/slurm_user

Open questions:

  • Project sharing? Users may want to specify their own path in those cases, deployment dependent. Write access to a shared folder will get tricky => let's use documentation for this for the time being
  • How does it match with the user_group.viewer_paths? These are per user group. Do we add the user.project_folder to the things a user has access to? Some potential complexities in whether users would have access to all their zarr_dirs
  • Can the user change the user.project_folder? Gets tricky on service-user setups on what we allow => let's not expose this at least in the beginning. Admins set them up
  • How does it interact with the cache directory? => use project_folder as the cache directory as well

@tcompa
Collaborator

tcompa commented Oct 30, 2024

  • Reusing IDs: Not intended to happen => to check/test
  • zarr_dir becomes optional in request body
  • If the endpoint receives a request with zarr_dir, there should be a single db.commit.
  • If the endpoint receives a request without zarr_dir and user_settings.project_dir is unset -> 422
  • If the endpoint receives a request without zarr_dir and user_settings.project_dir is set -> create the zarr dir attribute server-side. This requires two db.commits. The first one sets zarr_dir="PLACEHOLDER", and the second one sets it to
{user.project_folder}/fractal/{project_id}_{project_name}/{dataset_id}_{dataset_name}
  • Open fractal-vizarr-viewer issue about

> How does it match with the user_group.viewer_paths? These are per user group. Do we add the user.project_folder to the things a user has access to? Some potential complexities in whether users would have access to all their zarr_dirs

fractal-analytics-platform/fractal-vizarr-viewer#44

  • Open fractal-server issue about

> How does it interact with the cache directory? => use project_folder as the cache directory as well
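
A minimal sketch of the endpoint rules listed above, written against a hypothetical in-memory store rather than the real fractal-server models or session. `FakeDB`, `Dataset`, `UnprocessableEntity` and `create_dataset` are illustrative names only. The point is the control flow: one commit when zarr_dir is given, a 422-style error when neither zarr_dir nor a project dir is available, and two commits when the path must be derived from the newly assigned dataset ID.

```python
from dataclasses import dataclass
from typing import Dict, Optional

PLACEHOLDER = "PLACEHOLDER"


class UnprocessableEntity(Exception):
    """Stand-in for an HTTP 422 response."""


@dataclass
class Dataset:
    id: int
    name: str
    project_id: int
    project_name: str
    zarr_dir: str


class FakeDB:
    """Hypothetical in-memory store; IDs are assigned on insert."""

    def __init__(self) -> None:
        self.rows: Dict[int, Dataset] = {}
        self.commits = 0

    def insert(self, name: str, project_id: int, project_name: str, zarr_dir: str) -> Dataset:
        dataset = Dataset(len(self.rows) + 1, name, project_id, project_name, zarr_dir)
        self.rows[dataset.id] = dataset
        self.commit()
        return dataset

    def commit(self) -> None:
        self.commits += 1


def create_dataset(
    db: FakeDB,
    *,
    name: str,
    project_id: int,
    project_name: str,
    zarr_dir: Optional[str] = None,
    project_dir: Optional[str] = None,
) -> Dataset:
    if zarr_dir is not None:
        # Explicit path from the client: a single commit.
        return db.insert(name, project_id, project_name, zarr_dir)
    if project_dir is None:
        raise UnprocessableEntity("zarr_dir is missing and no project dir is set")
    # First commit: the ID is only known once the row exists.
    dataset = db.insert(name, project_id, project_name, PLACEHOLDER)
    base = project_dir.rstrip("/")
    dataset.zarr_dir = (
        f"{base}/fractal/{project_id}_{project_name}/{dataset.id}_{dataset.name}"
    )
    db.commit()  # second commit: replace the placeholder
    return dataset
```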

@tcompa
Collaborator

tcompa commented Oct 30, 2024

cc @jluethi

> Reusing IDs: Not intended to happen => @tcompa will double check

@ychiucco checked this, and it's only true with postgres. With sqlite, IDs of deleted objects can be re-used (see #1991), which could lead to duplicate folders. We could not rapidly find a way to change the sqlite configuration.

The way we are proceeding for the moment is to just disable this option on sqlite instances. When using sqlite, you would only be able to create a dataset if you specify zarr_dir.

For the record, this change comes together with writing tests that depend on the specific database that is being used. And neither of the two changes adds any value to our active (postgres-based) instances.
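
The ID reuse can be reproduced with the standard-library `sqlite3` module. This is a standalone sketch, assuming a table whose ID column is a plain `INTEGER PRIMARY KEY`; the table and function names are made up for illustration and do not reflect the actual fractal-server schema.

```python
import sqlite3
from typing import Tuple


def demo_id_reuse() -> Tuple[int, int]:
    """Delete the most recent row, insert a new one, and return both IDs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO dataset (name) VALUES ('a')")
    deleted_id = conn.execute("INSERT INTO dataset (name) VALUES ('b')").lastrowid
    conn.execute("DELETE FROM dataset WHERE id = ?", (deleted_id,))
    # With a plain INTEGER PRIMARY KEY, sqlite picks max(id) + 1 for the new row.
    new_id = conn.execute("INSERT INTO dataset (name) VALUES ('c')").lastrowid
    conn.close()
    return deleted_id, new_id


deleted_id, new_id = demo_id_reuse()
print(deleted_id == new_id)  # True: the deleted dataset's ID is handed out again
```

With an ID-based folder name, the new dataset would then point at the folder left behind by the deleted one.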

@jluethi
Collaborator

jluethi commented Nov 1, 2024

Thanks for checking. Another strong reason to drop sqlite then.

If this is too much effort to write for sqlite, we can accept the edge-case of mistakenly reusing folders in sqlite then until we deprecate sqlite.

@tcompa tcompa added the sqlite label Nov 14, 2024
@tcompa tcompa removed the flexibility Support more workflow-execution use cases label Dec 2, 2024