This repository has been archived by the owner on Dec 7, 2023. It is now read-only.

Dataflow job logs router #150

Closed
wants to merge 44 commits

Conversation

cisaacstern
Member

No description provided.

@cisaacstern cisaacstern temporarily deployed to pforge-pr-150 September 30, 2022 01:24 Inactive
@cisaacstern cisaacstern temporarily deployed to pforge-pr-150 September 30, 2022 02:00 Inactive
@cisaacstern
Member Author

With aed5d97 this will close #145 on merge

@cisaacstern cisaacstern temporarily deployed to pforge-pr-150 September 30, 2022 02:21 Inactive
@cisaacstern cisaacstern temporarily deployed to pforge-pr-150 September 30, 2022 03:20 Inactive
@cisaacstern cisaacstern marked this pull request as ready for review September 30, 2022 03:28
docs/README.md Outdated
Comment on lines 646 to 649
To query logs for recipe runs, see the API docs at the `/docs#/logs` route of the deployment
(https://api.pangeo-forge.org/docs#/logs for the production deployment).
These routes are protected, due to the risk of leaking secrets in the logs. The
`PANGEO_FORGE_API_KEY` is available to admins via the deployment's secrets config:
Member


@cisaacstern, I'm curious... does this mean that we won't be exposing the logs on the front-end in the near future? Is the idea that if something is broken, as an admin I can query the API to figure out what's going on but any other user won't be able to see these logs?

Member Author

@cisaacstern cisaacstern Sep 30, 2022


@andersy005, based on @yuvipanda's comment #145 (comment), I thought the careful thing would be to keep these protected to start, until we've had a chance to think through if/what to do with them.

Is the idea that if something is broken, as an admin I can query the API to figure out what's going on but any other user won't be able to see these logs?

Yes, I was imagining that this might be the approach initially. But to be honest, I've been doing this for a while myself now (albeit without the help of an Orchestrator API route)... and it can be toilsome.

So perhaps the best thing is to just make these public to start with?

A third option would be to leave the raw logs served here protected, but add another layer/route that applies some type of filtering to them, and the filtered/sanitized result could be passed to the frontend?

Member Author


@andersy005, for context, this is an example of the text that this route would return:

https://gist.github.com/cisaacstern/2a79707feaf27c5c0a2d4d93e5738fe5

So entirely aside from leaking secrets, there's the question of how to format this into something that a frontend user would actually find useful (99% of it is apache beam boilerplate)...

Member Author


Basically only the last ~4 lines are relevant to actually debugging the recipe.

Member Author

@cisaacstern cisaacstern Sep 30, 2022


One more thought: this route doesn't offer any type of "follow"/"tail" functionality, and I'm not sure that it would perform especially well in a "stream logs to the frontend" capacity. Though conceivably we could do some string formatting to provide just a short error trace from the end of these logs, and display it on the frontend if recipe_run.conclusion = failed?

Member

@andersy005 andersy005 Sep 30, 2022


Yes, I was imagining that this might be the approach initially.

I see... Perhaps focusing on de-risking our implementation is the right approach for a first try. Once we are ready to make access to the logs fully public, we can revisit this.

One more thought: this route doesn't offer any type of "follow"/"tail" functionality, and I'm not sure that it would perform especially well in a "stream logs to the frontend" capacity.

Are performance concerns the main reason for not supporting "stream-like" events, or is this "stream" feature not available via gcloud logging read? I was wondering if this could be helpful:

https://amittallapragada.github.io/docker/fastapi/python/2020/12/23/server-side-events.html

So entirely aside from leaking secrets, there's the question of how to format this into something that a frontend user would actually find useful (99% of it is apache beam boilerplate)...

Debugging these long tracebacks will probably be a challenge. Nonetheless, I think providing easy access to the logs would be a great feature. Can these logs be filtered and structured (by another service/application) so they can be easily consumed by the front-end?
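(Editorial note: since `gcloud logging read` returns point-in-time snapshots rather than a stream, a "stream-like" experience would have to be emulated by polling and yielding only new lines, which could then feed a server-sent-events response like the one in the linked article. A minimal sketch, with hypothetical helper names and a caller-supplied `fetch` callable standing in for the log query:)

```python
import time
from typing import Callable, Iterator, Optional


def follow_logs(
    fetch: Callable[[], str],
    poll_interval: float = 5.0,
    max_polls: Optional[int] = None,
) -> Iterator[str]:
    """Emulate 'tail -f' over a snapshot-only log API: re-fetch the full
    log text on each poll, and yield only lines not seen previously."""
    seen = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        lines = fetch().splitlines()
        yield from lines[seen:]  # only the lines that are new this poll
        seen = len(lines)
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(poll_interval)
```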

Member Author


Can these logs be filtered and structured ... so they can be easily consumed by the front-end?

@andersy005, in the latest commits I've added a public /logs/trace route which truncates the last portion of the traceback (for failed runs only) and serves it over a public route for the frontend to consume. Your feedback pushed me to figure out some way to make this useful now, and this is what I came up with. I totally agree that this is an essential feature to lighten the load on maintainers (I've been there!) and empower recipe contributors to be more self-sufficient.

There will no doubt be edge cases to deal with, but hopefully this feature will be useful in its current form, at least in certain circumstances. Once this PR is merged, I'll make a recommendation for next steps for the logging feature in a new issue.
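(Editorial note: the behavior described for the public /logs/trace route could be sketched roughly as follows. This is a simplified illustration with hypothetical names, not the actual route code:)

```python
from typing import Optional


def trace_for_frontend(
    conclusion: Optional[str], raw_logs: str, n_lines: int = 4
) -> Optional[str]:
    """Serve a truncated error trace only for failed runs; the raw logs,
    which may contain secrets, stay behind the protected route."""
    if conclusion != "failed":
        return None
    # Keep only the last few non-empty lines, since most of the output
    # is Apache Beam boilerplate.
    tail = [line for line in raw_logs.splitlines() if line.strip()]
    return "\n".join(tail[-n_lines:])
```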

@cisaacstern cisaacstern temporarily deployed to pforge-pr-150 September 30, 2022 03:48 Inactive
@cisaacstern
Member Author

With ef92a15, this will also close #146 on merge.

@cisaacstern cisaacstern temporarily deployed to pforge-pr-150 September 30, 2022 04:09 Inactive
@cisaacstern cisaacstern temporarily deployed to pforge-pr-150 September 30, 2022 04:37 Inactive
@cisaacstern
Member Author

With 22093ed, job_id and job_name are now written to recipe_run.message when jobs are submitted (e.g., this one):

{
    "recipe_id": "noaa-oisst-avhrr-only",
    "bakery_id": 1,
    "feedstock_id": 1,
    "head_sha": "f0e28b410c0fc9bdcaeffea1f1676a623c85e30b",
    "version": "",
    "started_at": "2022-09-30T08:21:28",
    "completed_at": null,
    "conclusion": null,
    "status": "in_progress",
    "is_test": true,
    "dataset_type": "zarr",
    "dataset_public_url": null,
    "message": "{\"job_name\": \"a70666f7267652d70722d3135302e6865726f6b756170702e636f6d2504\", \"job_id\": \"2022-09-30_01_21_42-16154033814144718767\"}",
    "id": 4,
    "bakery": {
        "region": "foo",
        "name": "pangeo-ldeo-nsf-earthcube",
        "description": "bar",
        "id": 1
    },
    "feedstock": {
        "spec": "pforgetest/test-staged-recipes",
        "provider": "github",
        "id": 1
    }
}
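(Editorial note: as the example shows, `message` stores the job metadata as a JSON-encoded string, so consumers need a round trip like the following. Helper names are hypothetical:)

```python
import json
from typing import Dict


def encode_job_info(job_name: str, job_id: str) -> str:
    """Serialize Dataflow job metadata for the recipe_run.message field."""
    return json.dumps({"job_name": job_name, "job_id": job_id})


def decode_job_info(message: str) -> Dict[str, str]:
    """Recover Dataflow job metadata from a recipe_run.message value."""
    return json.loads(message)
```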

@cisaacstern
Member Author

I'm now somewhat blocked because I've deployed a few jobs from the review app to test this feature, but they all appear to be stalled. These are all test runs of recipes, which should not require more than ~20 mins max, but they have all been running for +/- 2 hrs with no obvious progress. Too tired to figure out why just now, so I'll revisit tomorrow.

[screenshot: Screen Shot 2022-09-30 at 1.42.18 AM]

@cisaacstern
Member Author

cisaacstern commented Oct 1, 2022

I've converted this back into a draft, to indicate that getting sidetracked by #156 prevented me from finishing it this week. If I'm not yet on parental leave on Monday, I'll pick it up then. If I am, and someone else would like to finish it (which you would be more than welcome to), here's what I was planning to do next:

  1. Copy https://github.com/yuvipanda/mybinder-analytics/blob/main/logs.py into this repo somewhere (making sure to credit @yuvipanda in the module-level docstring)

  2. Rewrite get_logs (below) to subprocess out to python3.9 logs.py (where logs.py is Yuvi's script ☝️) rather than call gcloud logging directly

    import subprocess

    def get_logs(
        job_id: str,
        severity: str,
        limit: int,
    ):
        log_name_prefix = "projects/pangeo-forge-4967/logs/dataflow.googleapis.com"
        query = (
            'resource.type="dataflow_step" '
            f'AND resource.labels.job_id="{job_id}" '
            f'AND logName=("{log_name_prefix}%2Fjob-message" OR "{log_name_prefix}%2Flauncher") '
            f'AND severity="{severity}"'
        )
        cmd = "gcloud logging read".split() + [query]
        if limit:
            cmd += [f"--limit={limit}"]
        logs = subprocess.check_output(cmd)
        return logs

  3. Think about how to pre-process the logs returned by python3.9 logs.py so we can have some confidence that they will not leak secrets. One idea I had was to dump them to a local tempdir and then subprocess out to https://github.com/zricethezav/gitleaks in "directory mode" on the tempdir. (Note: this would require installing gitleaks in the Dockerfile at the top level of this repo.)

  4. Adjust the query parameters for the routes in https://github.com/pangeo-forge/pangeo-forge-orchestrator/blob/logs-router/pangeo_forge_orchestrator/routers/logs.py to match those provided by the python3.9 logs.py script (so they can be passed through to that script)

  5. Assuming there's a reasonable solution for item 3 (the security point) above, make all logs routes in this PR public
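(Editorial note: item 3 might look something like the sketch below. Function names are hypothetical, and it assumes the gitleaks binary is available, which per item 3 would require installing it in the Dockerfile. gitleaks' `detect --no-git` mode scans a plain directory, and by default it exits non-zero when it finds leaks:)

```python
import pathlib
import subprocess
import tempfile
from typing import List


def gitleaks_cmd(source_dir: str) -> List[str]:
    """Build a gitleaks 'directory mode' invocation for a plain directory."""
    return ["gitleaks", "detect", "--no-git", f"--source={source_dir}"]


def logs_contain_secrets(logs: str) -> bool:
    """Dump logs to a local tempdir and scan them with gitleaks.

    gitleaks exits non-zero when it detects leaks, so a non-zero return
    code is treated as 'do not serve these logs publicly'.
    """
    with tempfile.TemporaryDirectory() as tmp:
        (pathlib.Path(tmp) / "logs.txt").write_text(logs)
        result = subprocess.run(gitleaks_cmd(tmp), capture_output=True)
    return result.returncode != 0
```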

@cisaacstern cisaacstern temporarily deployed to pforge-pr-150 October 5, 2022 00:45 Inactive
@cisaacstern
Member Author

This PR remains incomplete. Here are a few notes for anyone who may be interested in completing it (which you would be welcome to do). If it remains incomplete in early November, I will revisit it at that time.

  • I was playing around with putting the log fetching logic in this PR, as this executable. I've come to feel it should really live in pangeo-forge-runner, as the solution to Log fetching pangeo-forge-runner#20.
  • With log fetching logic in pangeo-forge-runner, the subprocess command in the logs router could become something like:
    cmd = ["pangeo-forge-runner", "get-logs", job_name, "--bakery-class=DataflowBakery", f"--source={source}"]
  • 🚨 The question of how to prevent secrets from leaking in the logs remains unresolved. I have some ideas sketched here, including the beginning of an idea for checking for specific secrets, as well as using gitleaks as a backup/general check. While I think some version of these ideas may be useful, they remain untested and not ready for production.
  • The logging routes added by this PR look roughly as I would expect them to, and could probably be deployed (assuming the remainder of the above points are resolved) with relatively little adjustment.
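(Editorial note: the subprocess call proposed in the second bullet might be wrapped as follows. This is a sketch only — a `get-logs` subcommand does not yet exist in pangeo-forge-runner, and both function names are hypothetical:)

```python
import subprocess
from typing import List


def runner_logs_cmd(job_name: str, source: str) -> List[str]:
    """Build the proposed pangeo-forge-runner invocation for fetching
    Dataflow job logs ('get-logs' is a proposed, not actual, subcommand)."""
    return [
        "pangeo-forge-runner",
        "get-logs",
        job_name,
        "--bakery-class=DataflowBakery",
        f"--source={source}",
    ]


def fetch_job_logs(job_name: str, source: str) -> str:
    """Shell out to the proposed command and return its stdout."""
    return subprocess.check_output(runner_logs_cmd(job_name, source), text=True)
```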
