This repository has been archived by the owner on Dec 7, 2023. It is now read-only.

Dataflow job logs router #150

Closed
wants to merge 44 commits

Conversation

cisaacstern
Member

No description provided.

@cisaacstern cisaacstern temporarily deployed to pforge-pr-150 September 30, 2022 01:24 Inactive
@cisaacstern cisaacstern temporarily deployed to pforge-pr-150 September 30, 2022 02:00 Inactive
@cisaacstern
Member Author

With aed5d97 this will close #145 on merge

@cisaacstern cisaacstern temporarily deployed to pforge-pr-150 September 30, 2022 02:21 Inactive
@cisaacstern cisaacstern temporarily deployed to pforge-pr-150 September 30, 2022 03:20 Inactive
@cisaacstern cisaacstern marked this pull request as ready for review September 30, 2022 03:28
docs/README.md Outdated
Comment on lines 646 to 649
To query logs for recipe runs, see the API docs at the `/docs#/logs` route of the deployment
(https://api.pangeo-forge.org/docs#/logs for the production deployment).
These routes are protected, due to the risk of leaking secrets in the logs. The
`PANGEO_FORGE_API_KEY` is available to admins via the deployment's secrets config:
Member


@cisaacstern, I'm curious... does this mean that we won't be exposing the logs on the front-end in the near future? Is the idea that if something is broken, as an admin I can query the API to figure out what's going on but any other user won't be able to see these logs?

Member Author

@cisaacstern cisaacstern Sep 30, 2022


@andersy005, based on @yuvipanda's comment #145 (comment), I thought the careful thing would be to keep these protected to start, until we've had a chance to think through if/what to do with them.

Is the idea that if something is broken, as an admin I can query the API to figure out what's going on but any other user won't be able to see these logs?

Yes, I was imagining that this might be the approach initially. But to be honest, I've been doing this for a while myself now (albeit without the help of an Orchestrator API route)... and it can be toilsome.

So perhaps the best thing is to just make these public to start with?

A third option would be to leave the raw logs served here protected, but add another layer/route that applies some type of filtering to them, and the filtered/sanitized result could be passed to the frontend?

Member Author


@andersy005, for context, this is an example of the text that this route would return:

https://gist.github.com/cisaacstern/2a79707feaf27c5c0a2d4d93e5738fe5

So entirely aside from leaking secrets, there's the question of how to format this into something that a frontend user would actually find useful (99% of it is apache beam boilerplate)...

Member Author


Basically only the last ~4 lines are relevant to actually debugging the recipe.

Member Author

@cisaacstern cisaacstern Sep 30, 2022


One more thought: this route doesn't offer any type of "follow"/"tail" functionality, and I'm not sure that it would perform especially well in a "stream logs to the frontend" capacity. Though conceivably we could do some string formatting to provide just a short error trace from the end of these logs, and display it on the frontend if recipe_run.conclusion = failed?

Member

@andersy005 andersy005 Sep 30, 2022


Yes, I was imagining that this might be the approach initially.

I see... Perhaps focusing on de-risking our implementation is the right approach for a first try. Once we are ready to make access to the logs fully public, we can revisit this.

One more thought: this route doesn't offer any type of "follow"/"tail" functionality, and I'm not sure that it would perform especially well in a "stream logs to the frontend" capacity.

Are performance concerns the main reason for not supporting "stream-like" events, or is this "stream" feature not available via gcloud logging read? I was wondering if this could be helpful:

https://amittallapragada.github.io/docker/fastapi/python/2020/12/23/server-side-events.html

So entirely aside from leaking secrets, there's the question of how to format this into something that a frontend user would actually find useful (99% of it is apache beam boilerplate)...

Debugging these long tracebacks will probably be a challenge. Nonetheless, I think providing easy access to the logs would be a great feature. Can these logs be filtered and structured (by another service/application) so they can be easily consumed by the front-end?
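(Editorial note: since `gcloud logging read` returns point-in-time snapshots rather than a stream, a "stream-like" experience would have to be emulated by polling and yielding only new lines, which could then feed a server-sent-events response like the one in the linked article. A minimal sketch, with hypothetical helper names and a caller-supplied `fetch` callable standing in for the log query:)

```python
import time
from typing import Callable, Iterator, Optional


def follow_logs(
    fetch: Callable[[], str],
    poll_interval: float = 5.0,
    max_polls: Optional[int] = None,
) -> Iterator[str]:
    """Emulate 'tail -f' over a snapshot-only log API: re-fetch the full
    log text on each poll, and yield only lines not seen previously."""
    seen = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        lines = fetch().splitlines()
        yield from lines[seen:]  # only the lines that are new this poll
        seen = len(lines)
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(poll_interval)
```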

Member Author


Can these logs be filtered and structured ... so they can be easily consumed by the front-end?

@andersy005, in the latest commits I've added a public /logs/trace route which truncates the last portion of the traceback (for failed runs only) and serves it over a public route for the frontend to consume. Your feedback pushed me to figure out some way to make this useful now, and this is what I came up with. I totally agree that this is an essential feature to lighten the load on maintainers (I've been there!) and empower recipe contributors to be more self-sufficient.

There will no doubt be edge cases to deal with, but hopefully this feature will be useful in its current form, at least in certain circumstances. Once this PR is merged, I'll make a recommendation for next steps for the logging feature in a new issue.
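(Editorial note: the behavior described for the public /logs/trace route could be sketched roughly as follows. This is a simplified illustration with hypothetical names, not the actual route code:)

```python
from typing import Optional


def trace_for_frontend(
    conclusion: Optional[str], raw_logs: str, n_lines: int = 4
) -> Optional[str]:
    """Serve a truncated error trace only for failed runs; the raw logs,
    which may contain secrets, stay behind the protected route."""
    if conclusion != "failed":
        return None
    # Keep only the last few non-empty lines, since most of the output
    # is Apache Beam boilerplate.
    tail = [line for line in raw_logs.splitlines() if line.strip()]
    return "\n".join(tail[-n_lines:])
```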

@cisaacstern cisaacstern temporarily deployed to pforge-pr-150 September 30, 2022 03:48 Inactive
@cisaacstern
Member Author

With ef92a15, this will also close #146 on merge.

@cisaacstern cisaacstern temporarily deployed to pforge-pr-150 September 30, 2022 04:09 Inactive
@cisaacstern cisaacstern temporarily deployed to pforge-pr-150 September 30, 2022 04:37 Inactive
@cisaacstern
Member Author

With 22093ed, job_id and job_name are now written to recipe_run.message when jobs are submitted (e.g., this one):

{
    "recipe_id": "noaa-oisst-avhrr-only",
    "bakery_id": 1,
    "feedstock_id": 1,
    "head_sha": "f0e28b410c0fc9bdcaeffea1f1676a623c85e30b",
    "version": "",
    "started_at": "2022-09-30T08:21:28",
    "completed_at": null,
    "conclusion": null,
    "status": "in_progress",
    "is_test": true,
    "dataset_type": "zarr",
    "dataset_public_url": null,
    "message": "{\"job_name\": \"a70666f7267652d70722d3135302e6865726f6b756170702e636f6d2504\", \"job_id\": \"2022-09-30_01_21_42-16154033814144718767\"}",
    "id": 4,
    "bakery": {
        "region": "foo",
        "name": "pangeo-ldeo-nsf-earthcube",
        "description": "bar",
        "id": 1
    },
    "feedstock": {
        "spec": "pforgetest/test-staged-recipes",
        "provider": "github",
        "id": 1
    }
}
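(Editorial note: as the example shows, `message` stores the job metadata as a JSON-encoded string, so consumers need a round trip like the following. Helper names are hypothetical:)

```python
import json
from typing import Dict


def encode_job_info(job_name: str, job_id: str) -> str:
    """Serialize Dataflow job metadata for the recipe_run.message field."""
    return json.dumps({"job_name": job_name, "job_id": job_id})


def decode_job_info(message: str) -> Dict[str, str]:
    """Recover Dataflow job metadata from a recipe_run.message value."""
    return json.loads(message)
```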

@cisaacstern
Member Author

I'm now somewhat blocked because I've deployed a few jobs from the review app to test this feature, but they all appear to be stalled. These are all test runs of recipes, which should not require more than ~20 mins max, but they have all been running for +/- 2 hrs with no obvious progress. Too tired to figure out why just now, so I'll revisit tomorrow.

[screenshot: Screen Shot 2022-09-30 at 1.42.18 AM]

@cisaacstern
Member Author

cisaacstern commented Oct 1, 2022

I've converted this back into a draft, to indicate that getting sidetracked by #156 prevented me from finishing it this week. If I'm not yet on parental leave on Monday, I'll pick it up then. If I am, and someone else would like to finish it (which you would be more than welcome to), here's what I was planning to do next:

  1. Copy https://github.com/yuvipanda/mybinder-analytics/blob/main/logs.py into this repo somewhere (making sure to credit @yuvipanda in the module-level docstring)

  2. Rewrite get_logs (below) to subprocess out to python3.9 logs.py (where logs.py is Yuvi's script ☝️) rather than call gcloud logging directly

    import subprocess

    def get_logs(
        job_id: str,
        severity: str,
        limit: int,
    ):
        log_name_prefix = "projects/pangeo-forge-4967/logs/dataflow.googleapis.com"
        query = (
            'resource.type="dataflow_step" '
            f'AND resource.labels.job_id="{job_id}" '
            f'AND logName=("{log_name_prefix}%2Fjob-message" OR "{log_name_prefix}%2Flauncher") '
            f'AND severity="{severity}"'
        )
        cmd = "gcloud logging read".split() + [query]
        if limit:
            cmd += [f"--limit={limit}"]
        logs = subprocess.check_output(cmd)
        return logs

  3. Think about how to pre-process the logs returned by python3.9 logs.py so we can have some confidence that they will not leak secrets. One idea I had was to dump them to a local tempdir and then subprocess out to https://github.com/zricethezav/gitleaks in "directory mode" on the tempdir. (Note: this would require installing gitleaks in the Dockerfile at the top level of this repo.)

  4. Adjust the query parameters for the routes in https://github.com/pangeo-forge/pangeo-forge-orchestrator/blob/logs-router/pangeo_forge_orchestrator/routers/logs.py to match those provided by the python3.9 logs.py script (so they can be passed through to that script)

  5. Assuming there's a reasonable solution for item 3 (the security point) above, make all logs routes in this PR public
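(Editorial note: item 3 might look something like the sketch below. Function names are hypothetical, and it assumes the gitleaks binary is available, which per item 3 would require installing it in the Dockerfile. gitleaks' `detect --no-git` mode scans a plain directory, and by default it exits non-zero when it finds leaks:)

```python
import pathlib
import subprocess
import tempfile
from typing import List


def gitleaks_cmd(source_dir: str) -> List[str]:
    """Build a gitleaks 'directory mode' invocation for a plain directory."""
    return ["gitleaks", "detect", "--no-git", f"--source={source_dir}"]


def logs_contain_secrets(logs: str) -> bool:
    """Dump logs to a local tempdir and scan them with gitleaks.

    gitleaks exits non-zero when it detects leaks, so a non-zero return
    code is treated as 'do not serve these logs publicly'.
    """
    with tempfile.TemporaryDirectory() as tmp:
        (pathlib.Path(tmp) / "logs.txt").write_text(logs)
        result = subprocess.run(gitleaks_cmd(tmp), capture_output=True)
    return result.returncode != 0
```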

@cisaacstern cisaacstern temporarily deployed to pforge-pr-150 October 5, 2022 00:45 Inactive
@cisaacstern
Member Author

This PR remains incomplete. Here are a few notes for anyone who may be interested in completing it (which you would be welcome to do). If it remains incomplete in early November, I will revisit it at that time.

  • I was playing around with putting the log fetching logic in this PR, as this executable. I've come to feel it should really live in pangeo-forge-runner, as the solution to Log fetching pangeo-forge-runner#20.
  • With log fetching logic in pangeo-forge-runner, the subprocess command in the logs router could become something like:
    cmd = ["pangeo-forge-runner", "get-logs", job_name, "--bakery-class=DataflowBakery", f"--source={source}"]
  • 🚨 The question of how to prevent secrets from leaking in the logs remains unresolved. I have some ideas sketched here, including the beginning of an idea for checking for specific secrets, as well as using gitleaks as a backup/general check. While I think some version of these ideas may be useful, they remain untested and not ready for production.
  • The logging routes added by this PR look roughly as I would expect them to, and could probably be deployed (assuming the remainder of the above points are resolved) with relatively little adjustment.
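(Editorial note: the subprocess call proposed in the second bullet might be wrapped as follows. This is a sketch only — a `get-logs` subcommand does not yet exist in pangeo-forge-runner, and both function names are hypothetical:)

```python
import subprocess
from typing import List


def runner_logs_cmd(job_name: str, source: str) -> List[str]:
    """Build the proposed pangeo-forge-runner invocation for fetching
    Dataflow job logs ('get-logs' is a proposed, not actual, subcommand)."""
    return [
        "pangeo-forge-runner",
        "get-logs",
        job_name,
        "--bakery-class=DataflowBakery",
        f"--source={source}",
    ]


def fetch_job_logs(job_name: str, source: str) -> str:
    """Shell out to the proposed command and return its stdout."""
    return subprocess.check_output(runner_logs_cmd(job_name, source), text=True)
```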
