This repository has been archived by the owner on Dec 7, 2023. It is now read-only.

Dataflow job logs router #150

Closed · wants to merge 44 commits
Changes shown below are from 12 of the 44 commits.

Commits (44):
4b73e73
logs router first commit
cisaacstern Sep 29, 2022
e057902
logs router WIP
cisaacstern Sep 29, 2022
4167666
logs router cont.
cisaacstern Sep 29, 2022
057760f
test logs router first commit
cisaacstern Sep 30, 2022
550bd86
test get_logs helper
cisaacstern Sep 30, 2022
79204ae
test get logs route
cisaacstern Sep 30, 2022
94283aa
drop defaults from get_logs helper
cisaacstern Sep 30, 2022
aed5d97
on run, save job_name & job_id in recipe_run message
cisaacstern Sep 30, 2022
b7244ac
fix issue comment test
cisaacstern Sep 30, 2022
4c85405
preserve existing message, if there is one
cisaacstern Sep 30, 2022
9198d9b
fix existing message syntax
cisaacstern Sep 30, 2022
ef92a15
add query job logs doc
cisaacstern Sep 30, 2022
c345f99
Merge remote-tracking branch 'origin/main' into logs-router
cisaacstern Sep 30, 2022
a484fab
update secrets + bakeries config for pr 150
cisaacstern Sep 30, 2022
cb2ffb4
tweak heroku.yml
cisaacstern Sep 30, 2022
e3da9da
Merge remote-tracking branch 'origin/main' into logs-router
cisaacstern Sep 30, 2022
297b16f
factor logs query into plain funcs
cisaacstern Sep 30, 2022
b66cff1
factor logs routes to return recipe_run as intermediate output
cisaacstern Sep 30, 2022
7391f50
add public trace routes WIP
cisaacstern Sep 30, 2022
0161d09
finish trace route tests
cisaacstern Sep 30, 2022
82b1dad
different assert for trace tests
cisaacstern Sep 30, 2022
0e45aa3
update logs docs
cisaacstern Sep 30, 2022
fd37371
add note about logs retention
cisaacstern Sep 30, 2022
2bf7d8a
ok, i guess drop the details assert
cisaacstern Sep 30, 2022
dae96e9
remove duplicate subprocess call. huge issue.
cisaacstern Sep 30, 2022
dbd4b98
Merge remote-tracking branch 'origin/main' into logs-router
cisaacstern Sep 30, 2022
22093ed
patch for 132
cisaacstern Sep 30, 2022
448aff0
Merge remote-tracking branch 'origin/main' into logs-router
cisaacstern Oct 4, 2022
ccb545b
add fetch_logs.py executable
cisaacstern Oct 4, 2022
2cd67c5
dump dataflow logs to json
cisaacstern Oct 4, 2022
cbe42a7
comment out time.sleep
cisaacstern Oct 4, 2022
fb22a87
check for secrets in logs, rewrite logs router WIP
cisaacstern Oct 4, 2022
f26949a
log router tests rewrite WIP
cisaacstern Oct 4, 2022
6be7691
get logs router tests to pass
cisaacstern Oct 4, 2022
fdb634f
setup config for pr 150 review app
cisaacstern Oct 4, 2022
32ebbff
logs route tweaks
cisaacstern Oct 5, 2022
34fb7ee
move logs test params into shared list
cisaacstern Oct 5, 2022
f3a081a
fix surface secrets func and test it
cisaacstern Oct 5, 2022
f1bd920
add gitleaks to Dockerfile
cisaacstern Oct 5, 2022
c7dd84e
Merge branch 'main' into logs-router
andersy005 Oct 10, 2022
3f4bc3e
Merge branch 'main' into logs-router
andersy005 Oct 19, 2022
a5d8848
Merge branch 'main' into logs-router
andersy005 Oct 20, 2022
87264eb
Merge branch 'main' into logs-router
andersy005 Oct 24, 2022
8805e0f
Merge branch 'main' into logs-router
andersy005 Oct 24, 2022
14 changes: 14 additions & 0 deletions docs/README.md
@@ -32,6 +32,8 @@ external to the above-listed teams to contribute code to this repository.
compute & storage backends.
- [Bakeries: job status monitoring](#bakeries-job-status-monitoring) - This is how the FastAPI app
knows when compute jobs have concluded. Terraform for this infra is run on each Heroku deploy.
- [Bakeries: querying job logs](#bakeries-querying-job-logs) - Query bakery job logs for recipe runs.
Especially useful for diagnosing failed jobs.
- [Security](#security) - Some notes on security.
- [Testing](#testing) - Automated testing in both local envs and containerized contexts.
- [GitHub App: manual API calls](#github-app-manual-api-calls) - How to manually call the GitHub API _as a GitHub App_.
@@ -639,6 +641,18 @@ This project layout is based directly on https://blog.gruntwork.io/how-to-manage

See also [Resource locations](#resource-locations) for a table referencing these terraform envs.

# Bakeries: querying job logs

To query logs for recipe runs, see the API docs at the `/docs#/logs` route of the deployment
(https://api.pangeo-forge.org/docs#/logs for the production deployment).
These routes are protected due to the risk of leaking secrets in the logs. The
`PANGEO_FORGE_API_KEY` is available to admins in the deployment's secrets config:
andersy005 (Member):

@cisaacstern, I'm curious... does this mean that we won't be exposing the logs on the front-end in the near future? Is the idea that if something is broken, as an admin I can query the API to figure out what's going on, but any other user won't be able to see these logs?

cisaacstern (Member Author) · Sep 30, 2022:

@andersy005, based on @yuvipanda's comment #145 (comment), I thought the careful thing would be to keep these protected to start, until we've had a chance to think through if/what to do with them.

> Is the idea that if something is broken, as an admin I can query the API to figure out what's going on, but any other user won't be able to see these logs?

Yes, I was imagining that this might be the approach initially. But to be honest, I've been doing this for a while myself now (albeit without the help of an Orchestrator API route)... and it can be toilsome.

So perhaps the best thing is to just make these public to start with?

A third option would be to leave the raw logs served here protected, but add another layer/route that applies some type of filtering to them, so the filtered/sanitized result could be passed to the frontend?

cisaacstern (Member Author):

@andersy005, for context, this is an example of the text that this route would return:

https://gist.github.com/cisaacstern/2a79707feaf27c5c0a2d4d93e5738fe5

So entirely aside from leaking secrets, there's the question of how to format this into something that a frontend user would actually find useful (99% of it is apache beam boilerplate)...

cisaacstern (Member Author):

Basically only the last ~4 lines are relevant to actually debugging the recipe.

cisaacstern (Member Author) · Sep 30, 2022:

One more thought: this route doesn't offer any type of "follow"/"tail" functionality, and I'm not sure that it would perform especially well in a "stream logs to the frontend" capacity. Though conceivably we could do some string formatting to provide just a short error trace from the end of these logs, and display it on the frontend if recipe_run.conclusion = failed?

andersy005 (Member) · Sep 30, 2022:

> Yes, I was imagining that this might be the approach initially.

I see... Perhaps focusing on de-risking our implementation is the right approach for a first try. Once we are ready to make access to the logs fully public, we can revisit this.

> One more thought: this route doesn't offer any type of "follow"/"tail" functionality, and I'm not sure that it would perform especially well in a "stream logs to the frontend" capacity.

Are performance concerns the main reason for not supporting "stream-like" events, or is this "stream" feature not available via `gcloud logging read`? I was wondering if this could be helpful:

https://amittallapragada.github.io/docker/fastapi/python/2020/12/23/server-side-events.html

> So entirely aside from leaking secrets, there's the question of how to format this into something that a frontend user would actually find useful (99% of it is apache beam boilerplate)...

Debugging these long tracebacks will probably be a challenge. Nonetheless, I think providing easy access to the logs would be a great feature. Can these logs be filtered and structured (by another service/application) so they can be easily consumed by the front-end?
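
For reference, the server-sent-events approach linked above might look roughly like the following FastAPI sketch (a hypothetical illustration, not part of this PR; the `stream_job_logs` generator and its polling logic are assumptions):

```python
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def stream_job_logs(job_id: str):
    # Hypothetical generator: a real version would poll
    # `gcloud logging read` for new entries until the job concludes.
    for i in range(3):  # placeholder polling loop
        yield f"data: log line {i} for job {job_id}\n\n"  # one SSE frame
        await asyncio.sleep(1)


@app.get("/recipe_runs/{job_id}/logs/stream")
async def stream_logs(job_id: str):
    return StreamingResponse(stream_job_logs(job_id), media_type="text/event-stream")
```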

cisaacstern (Member Author):

> Can these logs be filtered and structured ... so they can be easily consumed by the front-end?

@andersy005, in the latest commits I've added a public `/logs/trace` route which extracts the last portion of the traceback (for failed runs only) and serves it publicly for the frontend to consume. Your feedback pushed me to figure out some way to make this useful now, and this is what I came up with. I totally agree that this is an essential feature to lighten the load on maintainers (I've been there!) and empower recipe contributors to be more self-sufficient.

There will no doubt be edge cases to deal with, but hopefully this feature will be useful in its current form, at least in certain circumstances. Once this PR is merged, I'll make a recommendation for next steps for the logging feature in a new issue.
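
A minimal sketch of the tail-truncation such a `/logs/trace` route could perform (illustrative only; the actual implementation lands in commits beyond the 12 shown in this view):

```python
def truncate_trace(raw_logs: str, n_lines: int = 4) -> str:
    """Return only the last few lines of a raw Dataflow log dump.
    Per the discussion above, these are typically the only lines
    relevant to debugging a failed recipe run."""
    lines = raw_logs.strip().splitlines()
    return "\n".join(lines[-n_lines:])
```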


```console
$ sops -d -i secrets/config.${PANGEO_FORGE_DEPLOYMENT}.yaml
# cat secrets/config.${PANGEO_FORGE_DEPLOYMENT}.yaml
```
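
As an example of how an admin might then call the protected route, here is a minimal sketch (assuming the decrypted key is sent in an `X-API-Key` request header; the exact header name is determined by the `check_authentication_header` dependency in `pangeo_forge_orchestrator/dependencies.py`, so verify it there):

```python
import os

import requests

API_URL = "https://api.pangeo-forge.org"  # production deployment
API_KEY = os.environ["PANGEO_FORGE_API_KEY"]  # decrypted via sops, as above

# Fetch the most recent ERROR-severity Dataflow log entry for recipe_run 1.
# `severity` and `limit` mirror the query params declared in routers/logs.py.
resp = requests.get(
    f"{API_URL}/recipe_runs/1/logs",
    params={"severity": "ERROR", "limit": 1},
    headers={"X-API-Key": API_KEY},  # assumption: header name may differ
)
resp.raise_for_status()
print(resp.text)  # PlainTextResponse: raw `gcloud logging read` output
```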

# Security

## Database API
2 changes: 2 additions & 0 deletions pangeo_forge_orchestrator/api.py
@@ -8,6 +8,7 @@
from .http import http_session
from .metadata import app_metadata
from .routers.github_app import github_app_router
from .routers.logs import logs_router
from .routers.model_router import router as model_router
from .routers.repr import repr_router
from .routers.stats import stats_router
@@ -41,6 +42,7 @@ def on_shutdown():
http_session.stop()


app.include_router(logs_router)
app.include_router(model_router)
app.include_router(stats_router)
app.include_router(github_app_router)
10 changes: 10 additions & 0 deletions pangeo_forge_orchestrator/routers/github_app.py
@@ -620,6 +620,16 @@ async def run(
try:
out = subprocess.check_output(cmd)
logger.debug(f"Command output is {out.decode('utf-8')}")
for line in out.splitlines():
p = json.loads(line)
if p["status"] == "submitted":
existing_message = json.loads(recipe_run.message) if recipe_run.message else {}
recipe_run.message = json.dumps(
existing_message | dict(job_name=p["job_name"], job_id=p["job_id"])
)
db_session.add(recipe_run)
db_session.commit()
except subprocess.CalledProcessError as e:
for line in e.output.splitlines():
p = json.loads(line)
109 changes: 109 additions & 0 deletions pangeo_forge_orchestrator/routers/logs.py
@@ -0,0 +1,109 @@
import json
import subprocess

from fastapi import APIRouter, Depends, HTTPException, Query
from fastapi.responses import PlainTextResponse
from sqlmodel import Session, SQLModel, select
from starlette import status

from ..dependencies import check_authentication_header, get_session as get_database_session
from ..models import MODELS

logs_router = APIRouter()

DEFAULT_SEVERITY = Query(
"ERROR",
description=(
"A valid gcloud logging severity as defined in "
"https://cloud.google.com/logging/docs/reference/v2/rest/v2/LogEntry#LogSeverity"
),
)
DEFAULT_LIMIT = Query(1, description="Max number of log entries to return.")


def job_id_from_recipe_run(recipe_run: SQLModel) -> str:
try:
job_id = json.loads(recipe_run.message)["job_id"]
except (KeyError, json.JSONDecodeError) as e:
detail = (
f"Message field of {recipe_run = } missing 'job_id'."
if type(e) == KeyError
else f"Message field of {recipe_run = } not JSON decodable."
)
raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail=detail)

return job_id


def get_logs(
job_id: str,
severity: str,
limit: int,
):
log_name_prefix = "projects/pangeo-forge-4967/logs/dataflow.googleapis.com"
query = (
'resource.type="dataflow_step" '
f'AND resource.labels.job_id="{job_id}" '
f'AND logName=("{log_name_prefix}%2Fjob-message" OR "{log_name_prefix}%2Flauncher") '
f'AND severity="{severity}"'
)
cmd = "gcloud logging read".split() + [query]

if limit:
cmd += [f"--limit={limit}"]

logs = subprocess.check_output(cmd)
return logs


@logs_router.get(
"/recipe_runs/{id}/logs",
summary="Get job logs for a recipe_run, specified by database id.",
tags=["recipe_run", "logs", "admin"],
response_class=PlainTextResponse,
dependencies=[Depends(check_authentication_header)],
)
async def logs_from_recipe_run_id(
id: int,
*,
db_session: Session = Depends(get_database_session),
severity: str = DEFAULT_SEVERITY,
limit: int = DEFAULT_LIMIT,
):
recipe_run = db_session.exec(
select(MODELS["recipe_run"].table).where(MODELS["recipe_run"].table.id == id)
).one()
job_id = job_id_from_recipe_run(recipe_run)
logs = get_logs(job_id, severity, limit)
return logs


@logs_router.get(
"/feedstocks/{feedstock_spec:path}/{commit}/{recipe_id}/logs",
summary="Get job logs for a recipe run, specified by feedstock_spec, commit, and recipe_id.",
tags=["feedstock", "logs", "admin"],
response_class=PlainTextResponse,
dependencies=[Depends(check_authentication_header)],
)
async def logs_from_feedstock_spec_commit_and_recipe_id(
feedstock_spec: str,
commit: str,
recipe_id: str,
*,
db_session: Session = Depends(get_database_session),
severity: str = DEFAULT_SEVERITY,
limit: int = DEFAULT_LIMIT,
):
feedstock = db_session.exec(
select(MODELS["feedstock"].table).where(MODELS["feedstock"].table.spec == feedstock_spec)
).one()
statement = (
select(MODELS["recipe_run"].table)
.where(MODELS["recipe_run"].table.recipe_id == recipe_id)
.where(MODELS["recipe_run"].table.head_sha == commit)
.where(MODELS["recipe_run"].table.feedstock_id == feedstock.id)
)
recipe_run = db_session.exec(statement).one()
job_id = job_id_from_recipe_run(recipe_run)
logs = get_logs(job_id, severity, limit)
return logs
10 changes: 9 additions & 1 deletion tests/github_app/mock_pangeo_forge_runner.py
@@ -44,7 +44,15 @@ def mock_subprocess_check_output(cmd: List[str]):
'{"message": "Expansion complete", "status": "completed", "meta": {"title": "Global Precipitation Climatology Project", "description": "Global Precipitation Climatology Project (GPCP) Daily Version 1.3 gridded, merged ty satellite/gauge precipitation Climate data Record (CDR) from 1996 to present.\\n", "pangeo_forge_version": "0.9.0", "pangeo_notebook_version": "2022.06.02", "recipes": [{"id": "gpcp", "object": "recipe:recipe"}], "provenance": {"providers": [{"name": "NOAA NCEI", "description": "National Oceanographic & Atmospheric Administration National Centers for Environmental Information", "roles": ["host", "licensor"], "url": "https://www.ncei.noaa.gov/products/global-precipitation-climatology-project"}, {"name": "University of Maryland", "description": "University of Maryland College Park Earth System Science Interdisciplinary Center (ESSIC) and Cooperative Institute for Climate and Satellites (CICS).\\n", "roles": ["producer"], "url": "http://gpcp.umd.edu/"}], "license": "No constraints on data access or use."}, "maintainers": [{"name": "Ryan Abernathey", "orcid": "0000-0001-5999-4917", "github": "rabernat"}], "bakery": {"id": "pangeo-ldeo-nsf-earthcube"}}}\n'
)
elif cmd[1] == "bake":
return '{"message": ""}'.encode("utf-8")
loglines = [
{
"message": "Submitted job ABC for recipe run EFG",
"job_id": "ABC",
"job_name": "a132456",
"status": "submitted",
},
]
return "\n".join([json.dumps(line) for line in loglines]).encode("utf-8")
else:
raise NotImplementedError(f"Command {cmd} not implemented in tests.")
else:
Expand Down
36 changes: 31 additions & 5 deletions tests/github_app/test_helpers_run.py
@@ -2,7 +2,9 @@
This module tests the `run` function, which is a special case.
"""

import json
import subprocess
from typing import List

import pytest
import pytest_asyncio
@@ -15,10 +17,7 @@

from ..conftest import clear_database
from .fixtures import _MockGitHubBackend, get_mock_github_session
from .mock_pangeo_forge_runner import (
mock_subprocess_check_output,
mock_subprocess_check_output_raises_called_process_error,
)
from .mock_pangeo_forge_runner import mock_subprocess_check_output_raises_called_process_error


@pytest_asyncio.fixture(
@@ -107,11 +106,38 @@ async def run_fixture(


@pytest.mark.asyncio
async def test_run(mocker, run_fixture):
@pytest.mark.parametrize(
"job_name, job_id",
[
(
"a6170692e70616e67656f2d666f7267652e6f7267251162",
"2022-09-28_13_30_25-9666562633575851936",
)
],
)
async def test_run(mocker, run_fixture, job_name, job_id, async_app_client):
run_kws = run_fixture

def mock_subprocess_check_output(cmd: List[str]):
loglines = [
{
"message": f"Submitted job {job_id} for recipe casm-soil-moisture",
"recipe": "casm-soil-moisture",
"job_name": job_name,
"job_id": job_id,
"status": "submitted",
},
]
return "\n".join([json.dumps(line) for line in loglines]).encode("utf-8")

mocker.patch.object(subprocess, "check_output", mock_subprocess_check_output)
await run(**run_kws)

recipe_run = await async_app_client.get("recipe_runs/1")
recipe_run_message = recipe_run.json()["message"]
assert json.loads(recipe_run_message)["job_name"] == job_name
assert json.loads(recipe_run_message)["job_id"] == job_id


@pytest.mark.xfail(reason="https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/132")
@pytest.mark.asyncio