
Fix enable_logging=True not working in DockerSwarmOperator #35677

Merged
merged 10 commits into apache:main on Dec 6, 2023

Conversation

stavdav143
Contributor

@stavdav143 stavdav143 commented Nov 16, 2023

The Docker Swarm Operator was impractical: the Docker service it created never finished in the Airflow scheduler, and its logs were not received. With this fix, Docker Swarm services become finite and their logs are tracked properly.

The logging logic works like this:
For the current service we ask the Docker API client for the logs with follow=False, because follow=True hangs forever once the service has finished.
We poll the Docker API at 2-second intervals, streaming the logs to the console and checking whether the service has terminated in order to break the loop.
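
For illustration, here is a minimal sketch of that loop using docker-py's low-level APIClient (service_logs and tasks are real docker-py calls; the helper name and the terminal-state set are assumptions for this sketch, not the PR's exact code):

```python
import time

import docker

# Assumed set of task states treated as "terminated" for this sketch.
TERMINAL_STATES = {"complete", "failed", "shutdown", "rejected", "orphaned", "remove"}


def stream_service_logs(client: docker.APIClient, service_id: str, poll_interval: float = 2.0) -> None:
    """Poll service logs with follow=False until every task of the service terminates."""
    printed = 0
    while True:
        # follow=False returns whatever logs exist so far instead of blocking forever.
        chunks = client.service_logs(service_id, stdout=True, stderr=True, follow=False)
        lines = b"".join(chunks).decode(errors="replace").splitlines()
        for line in lines[printed:]:  # emit only lines we have not printed yet
            print(line)
        printed = len(lines)
        # Stop once every task of the service has reached a terminal state.
        tasks = client.tasks(filters={"service": service_id})
        if tasks and all(t["Status"]["State"] in TERMINAL_STATES for t in tasks):
            break
        time.sleep(poll_interval)
```

Note that each poll re-downloads the full log produced so far; the since-based refinement discussed later in this thread addresses that.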

Closes: #28452



@stavdav143 stavdav143 force-pushed the main branch 4 times, most recently from 4b7e9ff to f5d1bcf on November 16, 2023 19:09
@stavdav143 stavdav143 changed the title Fixes #28452: "TaskInstances do not succeed when using enable_logging… Fix #28452: Docker Swarm services succeed and give logs Nov 16, 2023
@eladkal eladkal changed the title Fix #28452: Docker Swarm services succeed and give logs Fix enable_logging=True not working in DockerSwarmOperator Nov 17, 2023
@eladkal
Contributor

eladkal commented Nov 21, 2023

Can you add a unit test to avoid regression?

@eladkal
Contributor

eladkal commented Dec 1, 2023

Hi @stavdav143, please note that we are waiting for tests (or, if the code can't be tested easily, an explanation why).
We rarely accept code changes without tests.

@stavdav143
Contributor Author

Hi @eladkal,

Thanks for reviewing. Yes, I understand. I've been facing some difficulties while trying to provide a demo GIF. I will look into whether I can add tests; to me it is a bit complicated, as Docker infrastructure is involved and the docker_swarm tests here look more mock-based. But I will look into it a bit deeper and see what I can provide.

@stavdav143
Contributor Author

Hi @eladkal
I committed a unit test for logging.
There were small changes I needed to make in the test file in order to comply with the logging algorithm we use.
These were committed separately for better explanation:

  1. follow=True for logs will always result in tasks never ending. This is standard behavior provided upstream by the Docker API.
    Therefore, in DockerSwarmOperator follow is always False.
    041e4c

  2. service_logs is called multiple times, as we continuously poll the Docker API for logs. As indicated in the previous commit, the Docker client hangs when we try to get the logs with follow=True, so we make multiple calls to the API (every 2 seconds) to fetch the new logs.
    c8063a

  3. service_logs is called multiple times. In this test the expected call count increases (6 instead of 5) because we also check whether the service has terminated (+1). As this assertion makes less sense when we make multiple (polling) calls to the Docker API, we might consider removing it or replacing it with something more suitable.
    af2428


And finally, 048ba1 is the commit adding the test case that validates logging in the Docker Swarm Operator: we log two different messages and assert that the two lines come back in the logs in the expected order.
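
For context, here is a rough illustration of the kind of assertions involved, written with unittest.mock against the polling helper sketched earlier in this thread (the mocked return shapes are assumptions, not the PR's actual test code):

```python
from unittest import mock


def test_service_logs_polled_with_follow_false():
    """Illustrative only: logs are fetched repeatedly, always with follow=False."""
    client = mock.MagicMock()
    # The service is still running at the first status check and finished at the second.
    client.tasks.side_effect = [
        [{"Status": {"State": "running"}}],
        [{"Status": {"State": "complete"}}],
    ]
    client.service_logs.return_value = [b"first message\nsecond message\n"]

    stream_service_logs(client, "some-service-id", poll_interval=0)  # helper from the earlier sketch

    assert client.service_logs.call_count == 2
    assert all(call.kwargs.get("follow") is False for call in client.service_logs.call_args_list)
```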

@stavdav143
Contributor Author

Unfortunately @eladkal, I had to revert my test case, as GitHub Actions does not run it on a node in swarm mode, unlike the local test environment with Breeze. This test case highlights what this PR fixes.

235e565

Here is the error from the GitHub Action, which otherwise runs locally:
[screenshot of the GitHub Actions error]

@stavdav143
Contributor Author

The following explains how this PR fixes the Docker Swarm Operator issues:

Bug
Before this fix, the task never ends and the logs are not streamed:
[GIF: output_bug]

With fix
With this fix, the task ends and the logs are streamed synchronously:
[GIF: output]

@potiuk
Member

potiuk commented Dec 4, 2023

Hmm. The way it is implemented, it will generate more and more API traffic the longer the task is running; basically, every two seconds we attempt to download all the logs. I don't think this is a good approach.

I do not know the CLI API well, but that does not sound like a good solution.

@stavdav143
Contributor Author

stavdav143 commented Dec 4, 2023

Well, this issue has been open for quite some time now.

The alternative would be to use follow=True, which by definition blocks and hangs the thread. You'd need a parallel thread to check whether the service has finished, since the call hangs regardless of the service state (Running, Started, Waiting, Done). In the end you don't avoid polling, and you have to introduce additional parallelism, which in Python isn't great, at least to my knowledge.

@potiuk
Member

potiuk commented Dec 4, 2023

Well, this issue has been open for quite some time now.

The alternative would be to use follow=True, which by definition blocks and hangs the thread. You'd need a parallel thread to check whether the service has finished, since the call hangs regardless of the service state (Running, Started, Waiting, Done). In the end you don't avoid polling, and you have to introduce additional parallelism, which in Python isn't great, at least to my knowledge.

We are already doing it for other operators (K8s and others); you do not have to poll the API from Airflow. In many cases, when remote logging is involved, loggers just log to a remote logging service (CloudWatch, for example), which takes care of streaming the logs to the UI. So yes, you can absolutely avoid polling.

You could likely use this:

since (datetime, int, or float) – Show logs since a given datetime, integer epoch (in seconds) or float (in nanoseconds)

And ask the logs to include timestamps and use them.

Especially if you use float (nanoseconds), you could record the nanosecond of the last log line, maybe store the last few lines, add a few nanoseconds of overlap, and de-duplicate the overlapping lines.

This is very similar to your proposal but avoids the potentially huge, ever-increasing traffic and the memory needed to keep the logs around. The way you implemented it might waste MBs or even GBs of memory keeping the whole log in memory.
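
A rough sketch of that refinement, again against docker-py's APIClient.service_logs (which accepts since and timestamps, per the quoted docs above); the epoch bookkeeping and the de-duplication window below are illustrative assumptions, not the PR's exact code:

```python
from __future__ import annotations

from datetime import datetime, timezone

import docker


def poll_new_logs(
    client: docker.APIClient, service_id: str, since: int = 0, seen_tail: list[str] | None = None
) -> tuple[int, list[str]]:
    """Fetch only logs newer than `since` (epoch seconds) and de-duplicate the overlap window."""
    seen_tail = seen_tail or []
    chunks = client.service_logs(
        service_id, stdout=True, stderr=True, follow=False, since=since, timestamps=True
    )
    lines = b"".join(chunks).decode(errors="replace").splitlines()
    for line in lines:
        if line not in seen_tail:  # skip lines already emitted in the previous poll
            print(line)
    if lines:
        # With timestamps=True each line starts with an RFC3339 timestamp, e.g.
        # "2023-12-06T12:34:56.123456789Z some log text".
        last_ts = lines[-1].split(" ", 1)[0][:19]  # keep only whole seconds for parsing
        # Re-request from the second of the last line (a deliberate small overlap);
        # seen_tail removes the duplicated lines on the next call.
        since = int(datetime.strptime(last_ts, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc).timestamp())
        seen_tail = lines[-10:]
    return since, seen_tail
```

Each call then transfers roughly only the lines produced since the previous poll, so traffic and memory stay bounded no matter how long the service runs.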

Member

@potiuk potiuk left a comment

Need to handle memory and traffic better.

@potiuk
Member

potiuk commented Dec 4, 2023

Well, this issue has been open for quite some time now.

So yes, let's get it well implemented and merge it. But we cannot merge a change that we know has performance issues. The problem is that people WILL use it for long-running operations, and it WILL crash their workers, and they WILL come to us when it happens, so we'd better fix it now.

@stavdav143
Contributor Author

OK, valid feedback 👍. I'll see if I can slim it down by using the "since" parameter and timestamps when asking for logs.

@stavdav143
Contributor Author

There you go @potiuk
9434117

Member

@potiuk potiuk left a comment

Nice! That is much better!

@stavdav143
Contributor Author

Thank you!

@potiuk
Member

potiuk commented Dec 5, 2023

Though it needs static check fixes. I recommend pre-commit install; it would fix those while committing.

…ogging=True option in DockerSwarmOperator"

It introduces logging of Docker Swarm services which was previously not
working.
…standard and provided upstream by the Docker API.

Therefore in DockerSwarmOperator follow is always false.
… API for logs. As we indicated in the previous commit,

the docker client malfunctions when we try to get the logs with follow=True. Therefore we make multiple calls to the API (every 2 seconds), to fetch the new logs.
…6 instead of 5) as we check if the service has terminated (+1). As this assertion makes less sense in a situation where we do multiple calls to the Docker API (polling), we might think of removing it or replacing it with something more suitable.
…in the Docker Swarm Operator. We log two times a different message and we assert that the two lines are given back in the logs in the expected sequence.
docker.errors.APIError: 503 Server Error for http+docker://localhost/v1.43/services/create: Service Unavailable ("This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.")

Revert "Final commit of this PR marking the test case that validates logging in the Docker Swarm Operator. We log two times a different message and we assert that the two lines are given back in the logs in the expected sequence."

This reverts commit 048ba1e.
@potiuk potiuk merged commit 8821088 into apache:main Dec 6, 2023

boring-cyborg bot commented Dec 6, 2023

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

@potiuk
Member

potiuk commented Dec 6, 2023

Nice :)

@stavdav143
Contributor Author

Hey, thanks a lot !

@stavdav143
Contributor Author

Dear @eladkal, @potiuk

After your email asking for testing, I found a small bug.
I opened this PR here: #36127

eladkal pushed a commit that referenced this pull request Dec 9, 2023
* Fixes #28452: "TaskInstances do not succeed when using enable_logging=True option in DockerSwarmOperator"
It introduces logging of Docker Swarm services which was previously not
working.

* tty=True/False to be chosen by the user as it was the case before this fix (#28452)

* Follow=true for logs will always result in tasks not ending. This is  standard and provided upstream by the Docker API.
Therefore in DockerSwarmOperator follow is always false.

* service_logs called multiple times as we continuously poll the Docker API for logs. As we indicated in the previous commit,
the docker client malfunctions when we try to get the logs with follow=True. Therefore we make multiple calls to the API (every 2 seconds), to fetch the new logs.

* service_logs called multiple times. In this test the tasks increase (6 instead of 5) as we check if the service has terminated (+1). As this assertion makes less sense in a situation where we do multiple calls to the Docker API (polling), we might think of removing it or replacing it with something more suitable.

* Final commit of this PR marking the test case that validates logging in the Docker Swarm Operator. We log two times a different message and we assert that the two lines are given back in the logs in the expected sequence.

* Formatting ruff

* Reverting as Github actions don't run this test as a swarm node:

docker.errors.APIError: 503 Server Error for http+docker://localhost/v1.43/services/create: Service Unavailable ("This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.")

Revert "Final commit of this PR marking the test case that validates logging in the Docker Swarm Operator. We log two times a different message and we assert that the two lines are given back in the logs in the expected sequence."

This reverts commit 048ba1e.

* Logging "since" timestamp to avoid memory issues
Fix for #28452

* Formatting - Fix for #28452

* Fix bugs (#28452): Correctly assign last_line_logged, last_timestamp

Successfully merging this pull request may close these issues.

TaskInstances do not succeed when using enable_logging=True option in DockerSwarmOperator