[EPIC] Automate Data Retrieval Requests #2898

Open
yuvipanda opened this issue Oct 18, 2021 · 14 comments
Labels: enhancement (Issues around improving existing functionality), jira (Issue tracked in JIRA)

Comments

@yuvipanda
Contributor

Currently, students need to raise a request on GitHub to get a copy of their archived files (#2866). This also creates manual work for @felder, and there are also way more requests than I had thought.

In the text file telling students what to do, we ask them to open an issue here. Instead, we can just provide the signed URL automatically there, so they can self-serve the files without having to bother us. This eliminates an entire class of service requests we need to handle.

@felder
Contributor

felder commented Oct 18, 2021

@yuvipanda my understanding for signed urls is their max duration is 7 days.

https://cloud.google.com/storage/docs/gsutil/commands/signurl

We'd really need something that could produce these on the fly, based on user auth, in order to automate.

@yuvipanda
Contributor Author

ah damn.

Yeah, in that case I agree it means we've to write some code here. I don't think it needs us to introduce another layer of auth tho, we can just implement signed URLs ourselves with GCP KMS

  1. During archival, create a URL that contains all the info needed to figure out how to fetch the file
  2. Sign this URL with a KMS key, and add the signature as a query param or something (gotta do this carefully)
  3. Write a simple service that when accessed will check this signature, validate it, and then let the user download the actual file
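
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the implementation discussed in the thread: it swaps in a local HMAC-SHA256 secret for the KMS key, and the URL, the parameter names (`user`, `archive`, `sig`), and the function names are all hypothetical.

```python
import base64
import hashlib
import hmac
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumption: a local secret stands in for a key that would live in KMS.
SECRET_KEY = b"replace-with-a-key-kept-outside-the-code"


def _signature(path: str, params: dict) -> str:
    # Sign a canonical form: the path plus the query params sorted by name.
    canonical = path + "?" + urlencode(sorted(params.items()))
    digest = hmac.new(SECRET_KEY, canonical.encode(), hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")


def sign_url(base_url: str, params: dict) -> str:
    # Steps 1 and 2: build the URL and attach the signature as a query param.
    path = urlsplit(base_url).path
    signed = dict(params, sig=_signature(path, params))
    return base_url + "?" + urlencode(signed)


def verify_url(url: str) -> bool:
    # Step 3: recompute the signature and compare in constant time.
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    sig = params.pop("sig", "")
    expected = _signature(parts.path, params)
    return hmac.compare_digest(sig.encode(), expected.encode())
```

With this shape, changing any query parameter (for example, swapping in another username) makes `verify_url` return `False`, so the parameters do not need to be secret; only the key does. A real service would also need to decide whether links expire and how keys are rotated.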

@balajialg
Contributor

@yuvipanda It will be fantastic to automate such requests considering the FERPA requirements and the additional strain on @felder's bandwidth!

How complex do you think it would be to write a service that does this automation? I am conscious of our backlog and want to avoid adding more requests on your end right now. Let me know!

@felder
Contributor

felder commented Oct 18, 2021

@yuvipanda my concern here is that unless the URL is obfuscated (not a big fan of security by obscurity though), other students may be able to figure out how to gain access to data they should not be able to access. That's the reasoning behind me saying we may want to tie it to auth of some sort.

@yuvipanda
Contributor Author

@felder absolutely agree it shouldn't be security by obscurity, it should be fairly strong crypto. I think a simple signature where we keep the key private would be good enough. If people can guess those signed URLs most of the crypto we rely on would be considered broken.

Good question on complexity, @balajialg. I'll try to investigate that.

@felder
Contributor

felder commented Oct 18, 2021

@yuvipanda yeah I wouldn't expect people to guess the signed URLs themselves! I'm referring more to the query string parameters that would be used to generate them.

@yuvipanda
Contributor Author

yeah, i think the signing means it doesn't matter what the user can guess.

However, I think given my current workload, I won't be able to build this anytime soon. So please don't block on it if other privacy preserving workflow changes need to happen.

@balajialg
Contributor

@yuvipanda @felder We have a couple of options in the short term,

  1. Continue serving requests the same way, hoping that they are automated in the future, or explore creating private issues in GitHub
  2. Shift to a ticketing system of choice for such requests alone.

I am leaning towards 2 for such requests alone. What do you both think?

@felder
Contributor

felder commented Oct 29, 2021

@balajialg I'm inclined toward 2 as well. However, I do not believe these requests should be considered in a vacuum. We may opt to move these requests first, but ultimately we should consider it as a trial run for a general support process for Berkeley specific operational issues.

@balajialg
Contributor

balajialg commented Oct 30, 2021

@felder When you say support process, you mean for the regular requests we get, right? Package requests, admin access, RAM elevation, etc. Or are you also considering reported bugs?

If it is a bug, I wonder how issues such as this would be fixed, as they have an upstream dependency and would require interaction with other developers/admins! Let's discuss more during the sprint planning meeting (let's see whether we will be able to wrap up this discussion in time).

@yuvipanda
Contributor Author

I think it might be helpful to have something else that contains possible private information - but I'd love for most things to stay as public as possible here.

@felder
Contributor

felder commented Nov 1, 2021

@balajialg @yuvipanda Anything reported by a student or regarding a specific student where FERPA would apply.

Basically I'd like to start thinking about datahub the UCB-specific service vs datahub the open source software project (not to be confused with datahub the proposed building 😃), with service-related requests having a private ticketing system. Note that requests that require development resources to resolve can have GitHub issues created for them.

I understand that transparency is important, but I do think there are plenty of support requests that don't really require any development resources to fix and probably would not be of that much interest to anyone else.

Individual issues regarding say rstudio not launching would fall into this as well, as opposed to generalized solutions such as terminating rstudio gracefully on logout which would remain here in github.

@balajialg
Contributor

balajialg commented Nov 1, 2021

@felder Got it! I wonder whether reporting bugs through different systems (based on the nature of the bug) will be a cumbersome support experience, as most users would not care to know whether their issue should be raised via GitHub or a ticketing system. For example, the RStudio use case you highlighted.

I am personally aligned with moving chores to a support system (if that is something you feel strongly about) but keeping feature enhancements and reported issues (since many issues are correlated with package requests) on GitHub, considering that they may have upstream dependencies. Thoughts?

@felder @yuvipanda Did some analysis on the distribution of requests that we get every month. This is what it looks like for the past three months:

August:
Package Requests: 11
Admin Requests: 4
Issues: 1
File Requests: 0
RAM: 0

September:
Package Requests: 11
File Requests: 4
Issues: 3
Admin: 1
RAM: 1

October:
Package Requests: 8
File Requests: 7
Issues: 2
Admin: 1
RAM: 1

Based on the frequency and volume, the routine support requests that really matter are the "package installation/upgrade" and the "retrieval of the file" requests.
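
As a quick check on that conclusion, the monthly counts above can be summed per category. This is just a small tally sketch; the shortened category names are my own labels for the rows listed above.

```python
from collections import Counter

# Monthly counts copied from the breakdown above.
requests_by_month = {
    "August": {"Package": 11, "Admin": 4, "Issues": 1, "File": 0, "RAM": 0},
    "September": {"Package": 11, "File": 4, "Issues": 3, "Admin": 1, "RAM": 1},
    "October": {"Package": 8, "File": 7, "Issues": 2, "Admin": 1, "RAM": 1},
}

# Add up each category across the three months.
totals = Counter()
for counts in requests_by_month.values():
    totals.update(counts)

for category, count in totals.most_common():
    print(f"{category}: {count}")
# Totals: Package 30, File 11, Admin 6, Issues 6, RAM 2
```

Package requests and file requests together account for 41 of the 55 requests over the three months, and file requests are the only category that grew every month.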

@balajialg changed the title from "Provide signed URLs in archive announcement texts so students can download their home directories without involving us" to "Automate Archival Requests: Provide signed URLs in archive announcement texts so students can download their home directories without involving us" on Feb 1, 2023
@balajialg added the enhancement label on Feb 3, 2023
@ryanlovett
Collaborator

It might be useful to create a service that is proxied by the user's server which can generate these URLs or invoke various APIs. For example it could be a tornado/flask app that runs on a random port in the user's pod and is proxied by jupyter-server-proxy. It would be behind the hub's authentication. There could be a launcher in retro's New > dropdown or in the Lab launcher that invokes /user/{username}/data-retrieval, or we could advertise a URL of the form https://{hub}.datahub.berkeley.edu/user-redirect/data-retrieval which would redirect to that user's service.

I'm not sure about the full details of the signed URL and retrieval process, so this idea might require iteration.
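
One way the proxied-service idea could be wired up is through jupyter-server-proxy's server process configuration. The sketch below is only an illustration under assumptions: the `data_retrieval` module, its `--port` flag, and the launcher title are hypothetical, and the function would need to be registered under the `jupyter_serverproxy_servers` entry point group with the name `data-retrieval` for the service to appear at /user/{username}/data-retrieval/.

```python
def setup_data_retrieval() -> dict:
    """Describe a hypothetical data retrieval service to jupyter-server-proxy."""
    return {
        # jupyter-server-proxy substitutes {port} with a free port it picks.
        # "data_retrieval" is an assumed module name for the tornado/flask app.
        "command": ["python", "-m", "data_retrieval", "--port", "{port}"],
        # Seconds to wait for the service to start accepting connections.
        "timeout": 30,
        # Adds an entry to the launcher so users can open the service.
        "launcher_entry": {"title": "Data Retrieval"},
    }
```

Because the process runs inside the user's pod and is reached through the user's server, requests to it pass through the hub's authentication before they ever hit the app.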

@balajialg changed the title from "Automate Archival Requests: Provide signed URLs in archive announcement texts so students can download their home directories without involving us" to "Automate Archival Requests" on Feb 3, 2023
@balajialg changed the title from "Automate Archival Requests" to "[EPIC] Automate Retrieval Requests" on Feb 3, 2023
@balajialg changed the title from "[EPIC] Automate Retrieval Requests" to "[EPIC] Automate Data Retrieval Requests" on Mar 23, 2023
@balajialg added the jira label on Apr 19, 2023