[EPIC] Automate Data Retrieval Requests #2898

yuvipanda · 2021-10-18T19:47:01Z

Currently, students need to raise a request on GitHub to get a copy of their archived files (#2866). This creates manual work for @felder, and there are way more requests than I had expected.

In the text file telling students what to do, we ask them to open an issue here. Instead, we can just provide the signed URL automatically there, so they can self-serve the files without having to bother us. This eliminates an entire class of service requests we need to handle.

felder · 2021-10-18T19:52:07Z

@yuvipanda my understanding is that signed URLs have a max duration of 7 days.

https://cloud.google.com/storage/docs/gsutil/commands/signurl

To automate this, we'd really need something that could produce these on the fly based on user auth.

yuvipanda · 2021-10-18T19:59:03Z

ah damn.

Yeah, in that case I agree it means we have to write some code here. I don't think it needs us to introduce another layer of auth tho; we can just implement signed URLs ourselves with GCP KMS:

- During archival, create a URL that contains all the info needed to figure out how to fetch the file
- Sign this URL with a KMS key, and add the signature as a query param or something (gotta do this carefully)
- Write a simple service that, when accessed, checks and validates this signature, and then lets the user download the actual file
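
A minimal sketch of the three steps above, under stated assumptions. It swaps in a local HMAC-SHA256 secret for the KMS key so that it runs without cloud credentials; the parameter names (`user`, `archive`, `sig`), the `example.invalid` host, and both function names are hypothetical and not taken from any existing datahub code.

```python
import hashlib
import hmac
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical secret. In the proposal above this would be a key held in GCP KMS.
SECRET_KEY = b"replace-with-a-real-secret"


def sign_url(base_url: str, params: dict, key: bytes = SECRET_KEY) -> str:
    """Return base_url with the sorted params plus a 'sig' query parameter."""
    query = urlencode(sorted(params.items()))
    path = urlsplit(base_url).path
    # The signature covers the path and every parameter, in a fixed order.
    message = f"{path}?{query}".encode()
    sig = hmac.new(key, message, hashlib.sha256).hexdigest()
    return f"{base_url}?{query}&sig={sig}"


def verify_url(url: str, key: bytes = SECRET_KEY) -> bool:
    """Recompute the signature from the URL and compare it in constant time."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    sig = next((v for k, v in pairs if k == "sig"), "")
    unsigned = sorted((k, v) for k, v in pairs if k != "sig")
    message = f"{parts.path}?{urlencode(unsigned)}".encode()
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig.encode(), expected.encode())
```

The retrieval service would call something like `verify_url` on each request and only serve the file when it returns `True`. A real version would probably also sign an expiry timestamp so that old links stop working.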

balajialg · 2021-10-18T20:06:23Z

@yuvipanda It will be fantastic to automate such requests, considering the FERPA requirements and the additional load on @felder's bandwidth!

How complex do you think it would be to write a service that does this automation? I am conscious of our backlogs and want to avoid adding more requests at your end currently. Let me know!

felder · 2021-10-18T21:22:39Z

@yuvipanda my concern here is that unless the URL is obfuscated (not a big fan of security by obscurity though), other students may be able to figure out how to access data they should not be able to access. That's the reasoning behind me saying we may want to tie it to auth of some sort.

yuvipanda · 2021-10-18T21:31:52Z

@felder absolutely agree it shouldn't be security by obscurity, it should be fairly strong crypto. I think a simple signature where we keep the key private would be good enough. If people can guess those signed URLs, most of the crypto we rely on would be considered broken.

Good question on complexity, @balajialg. I'll try to investigate that.

felder · 2021-10-18T21:33:16Z

@yuvipanda yeah, I wouldn't expect people to guess the signed URLs themselves! I'm referring more to the query string parameters that would be used to generate them.

yuvipanda · 2021-10-26T20:48:47Z

yeah, i think the signing means it doesn't matter what the user can guess.
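
To illustrate the point, here is a small sketch (assuming an HMAC-style signature over the query string; the key, usernames, and parameter names are all made up): a student who guesses the parameter format still cannot produce a signature the service would accept.

```python
import hashlib
import hmac


def signature(key: bytes, query: str) -> str:
    """Hex HMAC-SHA256 of the query string under the given key."""
    return hmac.new(key, query.encode(), hashlib.sha256).hexdigest()


server_key = b"known-only-to-the-service"  # hypothetical private key
issued_query = "archive=fall-2021&user=alice"
issued_sig = signature(server_key, issued_query)

# A student guesses the parameter format and swaps in another username.
guessed_query = "archive=fall-2021&user=bob"
valid_sig = signature(server_key, guessed_query)

# Reusing the signature that was issued for alice does not match...
reuse_ok = hmac.compare_digest(issued_sig, valid_sig)
# ...and neither does signing with any key other than the server's.
forged_ok = hmac.compare_digest(signature(b"attacker-guess", guessed_query), valid_sig)
```

Both `reuse_ok` and `forged_ok` come out `False`, so knowing or guessing the parameters alone gives no access.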

However, I think given my current workload, I won't be able to build this anytime soon. So please don't block on it if other privacy preserving workflow changes need to happen.

balajialg · 2021-10-27T02:34:11Z

@yuvipanda @felder We have a couple of options in the short term:

1. Continue serving the same way, hoping that such requests are automated in the future, or explore creating private issues in GitHub
2. Shift to a ticketing system of choice for such requests alone

I am leaning towards option 2 for such requests alone. What do you both think?

felder · 2021-10-29T19:58:54Z

@balajialg I'm inclined toward 2 as well. However, I do not believe these requests should be considered in a vacuum. We may opt to move these requests first, but ultimately we should consider it as a trial run for a general support process for Berkeley specific operational issues.

balajialg · 2021-10-30T01:23:19Z

@felder When you say support process, you mean for the regular requests we get, right? Package requests, admin access, RAM elevation, etc.? Or are you also considering bugs being reported?

If it is a bug, I wonder how issues such as this would be fixed, as they have an upstream dependency and would require interaction with other developers/admins! Let's discuss more during the sprint planning meeting (let's see whether we will be able to wrap up this discussion in time).

yuvipanda · 2021-10-30T06:37:22Z

I think it might be helpful to have something else that contains possible private information - but I'd love for most things to stay as public as possible here.

felder · 2021-11-01T19:02:04Z

@balajialg @yuvipanda Anything reported by a student or regarding a specific student where FERPA would apply.

Basically I'd like to start thinking about datahub the UCB specific service vs datahub the open source software project (not to be confused with datahub the proposed building 😃), with service-related requests having a private ticketing system. Note that requests that require development resources to resolve can have GitHub issues created for them.

I understand that transparency is important, but I do think there are plenty of support requests that don't really require any development resources to fix and probably would not be of that much interest to anyone else.

Individual issues regarding say rstudio not launching would fall into this as well, as opposed to generalized solutions such as terminating rstudio gracefully on logout which would remain here in github.

balajialg · 2021-11-01T21:24:45Z

@felder Got it! I wonder whether reporting bugs through different systems (based on the nature of the bugs) will be a cumbersome support experience, as most users would not care to know whether their issue should be raised via GitHub or a ticketing system. For example, the rstudio use case you highlighted.

I am personally aligned with moving chores to a support system (if that is something you feel strongly about) but keeping the feature enhancements and reported issues (since many issues are correlated with package requests) in GitHub, since they may involve upstream dependencies. Thoughts?

@felder @yuvipanda Did some analysis on the distribution of requests that we get every month. This is what it looks like for the past three months:

August:
Package Requests: 11
Admin Requests: 4
Issues: 1
File Requests: 0
RAM: 0

September:
Package Requests: 11
File Requests: 4
Issues: 3
Admin Requests: 1
RAM: 1

October:
Package Requests: 8
File Requests: 7
Issues: 2
Admin Requests: 1
RAM: 1

Based on the frequency and volume, the routine support requests that really matter are the "package installation/upgrade" and the "retrieval of the file" requests.
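
For reference, a quick tally of the monthly counts listed above (a throwaway sketch; the dictionary simply re-enters those numbers, with category names shortened):

```python
from collections import Counter

# Counts copied from the August, September, and October lists above.
monthly = {
    "August": {"Package": 11, "Admin": 4, "Issues": 1, "File": 0, "RAM": 0},
    "September": {"Package": 11, "File": 4, "Issues": 3, "Admin": 1, "RAM": 1},
    "October": {"Package": 8, "File": 7, "Issues": 2, "Admin": 1, "RAM": 1},
}

totals = Counter()
for counts in monthly.values():
    totals.update(counts)  # adds each month's counts per category

# Categories ranked by three-month total, largest first.
ranked = totals.most_common()
```

Over the three months this gives 30 package requests and 11 file requests, against 6 issues, 6 admin requests, and 2 RAM requests.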

ryanlovett · 2023-02-03T00:24:53Z

It might be useful to create a service that is proxied by the user's server which can generate these URLs or invoke various APIs. For example it could be a tornado/flask app that runs on a random port in the user's pod and is proxied by jupyter-server-proxy. It would be behind the hub's authentication. There could be a launcher in retro's New > dropdown or in the Lab launcher that invokes /user/{username}/data-retrieval, or we could advertise a URL of the form https://{hub}.datahub.berkeley.edu/user-redirect/data-retrieval which would redirect to that user's service.
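
A rough sketch of how such a service might be registered with jupyter-server-proxy, assuming the usual server-process definition (a function returning a dict with `command`, `timeout`, and `launcher_entry`). The module name `data_retrieval_app` and the launcher title are hypothetical; the function would be exposed under the `jupyter_serverproxy_servers` entry point with the name `data-retrieval`, which is what makes it reachable at /user/{username}/data-retrieval.

```python
def setup_data_retrieval() -> dict:
    """Server-process definition for jupyter-server-proxy (sketch)."""
    return {
        # {port} is a placeholder that jupyter-server-proxy replaces with a free port.
        # data_retrieval_app is a hypothetical module holding the tornado/flask app.
        "command": ["python", "-m", "data_retrieval_app", "--port", "{port}"],
        # Seconds to wait for the app to start accepting connections.
        "timeout": 20,
        # Adds a tile to the launcher so users can open the service directly.
        "launcher_entry": {"title": "Data Retrieval"},
    }
```

Because the app runs inside the user's pod and is only reachable through the proxy, requests to it are already behind the hub's authentication.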

I'm not sure about the full details of the signed URL and retrieval process, so this idea might require iteration.

balajialg mentioned this issue Oct 19, 2022

Shane Onboarding: Datahub Infra Access Checklist #3820

Closed

9 tasks

balajialg added the enhancement Issues around improving existing functionality label Feb 3, 2023

balajialg changed the title ~~Automate Archival Requests: Provide signed URLs in archive announcement texts so students can download their home directories without involving us~~ Automate Archival Requests Feb 3, 2023

balajialg changed the title ~~Automate Archival Requests~~ [EPIC] Automate Retrieval Requests Feb 3, 2023

balajialg assigned cesarvh Mar 23, 2023

balajialg changed the title ~~[EPIC] Automate Retrieval Requests~~ [EPIC] Automate Data Retrieval Requests Mar 23, 2023

balajialg added the jira Issue tracked in JIRA label Apr 19, 2023
