Fix enable_logging=True not working in DockerSwarmOperator #35677
Conversation
4b7e9ff to f5d1bcf (Compare)
Can you add a unit test to avoid regression? |
Hi @stavdav143, please note that we are waiting for tests (or, if the code can't be tested easily, an explanation why). |
Hi @eladkal, thanks for reviewing. Yes, I understand; I've been facing some difficulties while trying to provide a demo gif. I will look into whether I can add tests. To me it is a bit complicated, as Docker infrastructure is involved and the existing docker_swarm tests look more mock-based. But I will dig a bit deeper and see what I can provide. |
Hi @eladkal,
And finally, with 048ba1 we have the final commit marking the test case that validates logging in the Docker Swarm Operator. We log two different messages and assert that the two lines are given back in the logs in the expected sequence. |
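For illustration, a rough sketch of what such a test could look like (names and setup are assumptions, not the actual test added in 048ba1 and later reverted on CI): it needs a Docker daemon running in swarm mode ("docker swarm init") and assumes the operator's log records end up in pytest's caplog.

```python
import logging

from airflow.providers.docker.operators.docker_swarm import DockerSwarmOperator


def test_two_log_lines_arrive_in_order(caplog):
    # Hypothetical integration test: requires a swarm-enabled Docker daemon.
    caplog.set_level(logging.INFO)
    operator = DockerSwarmOperator(
        task_id="swarm_logging_test",
        image="alpine:latest",
        command="sh -c 'echo first message && echo second message'",
        enable_logging=True,
    )
    operator.execute(context={})

    # Both messages must show up, and in the order in which they were emitted.
    first = caplog.text.index("first message")
    second = caplog.text.index("second message")
    assert first < second
```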
Hmm. The way it is implemented, it will generate more and more API traffic the longer the task is running. Basically, every two seconds we are attempting to download all logs. I think this is not a good approach. I do not know the CLI API, but that does not sound like a good solution. |
Well, this issue has been open for quite some time now. The alternative would be to use follow=True. This will stop and hang the thread by definition. You'd need a parallel thread to check if the service has finished, as the call always hangs no matter what the state of the service (Running, Started, Waiting, Done). In the end you don't avoid polling, and you have to introduce additional code with parallelism, which in Python isn't great, at least to my knowledge. |
We are already doing it for other operators - K8S and others - you do not have to poll the Airflow API. In many cases, when remote logging is involved, loggers are just logging to a remote logging service (CloudWatch, for example) which takes care of streaming logs to the UI - so yes, you can absolutely avoid polling. You could likely use this:
And ask the logs to include timestamps and use them. Especially if you use a float (nanoseconds), you could record the last log nanosecond, maybe store the last few lines, add a few nanoseconds of overlap, and de-duplicate the overlapping lines. This is very similar to your proposal but will avoid potentially huge, increasing traffic and potentially huge memory used to keep the logs in memory. The way you implemented it might waste MBs or even GBs of memory keeping the whole log in memory. |
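For illustration, a minimal sketch of that suggestion, assuming docker-py's APIClient.service_logs() and APIClient.tasks() (the helper names are made up; this is not the operator's actual code): it fetches timestamped logs since the last second seen, keeps only a short tail of recent lines to de-duplicate the overlap, and stops once every task of the service reaches a terminal state.

```python
import time
from datetime import datetime, timezone

from docker import APIClient


def _rfc3339_to_unix(ts: str) -> float:
    # Docker emits RFC3339Nano timestamps; trim to microseconds so fromisoformat can parse them.
    head, _, frac = ts.rstrip("Z").partition(".")
    dt = datetime.fromisoformat(f"{head}.{(frac + '000000')[:6]}")
    return dt.replace(tzinfo=timezone.utc).timestamp()


def _service_terminated(client: APIClient, service_id: str) -> bool:
    # A swarm service is done once all of its tasks reached a terminal state.
    terminal = {"complete", "failed", "shutdown", "rejected", "orphaned", "remove"}
    tasks = client.tasks(filters={"service": service_id})
    return bool(tasks) and all(t["Status"]["State"] in terminal for t in tasks)


def stream_service_logs(client: APIClient, service_id: str, poll_interval: float = 2.0) -> None:
    last_ts = 0.0                   # timestamp of the newest line printed so far
    seen_tail: set[bytes] = set()   # short tail of recent lines, used to drop overlapping duplicates

    while True:
        raw = client.service_logs(
            service_id,
            stdout=True,
            stderr=True,
            timestamps=True,
            since=int(last_ts),     # whole seconds; the same-second overlap is removed below
        )
        data = raw if isinstance(raw, bytes) else b"".join(raw)
        lines = [line for line in data.splitlines() if line]

        for line in lines:
            if line in seen_tail:
                continue            # already printed during the previous, overlapping fetch
            ts_str, _, message = line.partition(b" ")
            print(message.decode(errors="replace"))
            last_ts = max(last_ts, _rfc3339_to_unix(ts_str.decode()))

        seen_tail = set(lines[-200:])  # keep only a short tail in memory, never the whole log

        if _service_terminated(client, service_id):
            break
        time.sleep(poll_interval)
```

This keeps traffic roughly proportional to the new output per poll instead of to the whole log, and memory bounded by the small de-duplication tail.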
Need to handle memory and traffic better.
So yes, let's get it well implemented and merge it. But we cannot merge a change that we know has performance issues. The problem is that people WILL use it for long-running operations and it WILL crash their workers, and they WILL come to us when it happens, so we had better fix it now. |
Ok, valid feedback 👍. I'll see if I can slim it down using the "since" parameter and timestamps when asking for logs. |
Nice! That is much better!
Thank you! |
Though - it needs static check fixes. I recommend |
…ogging=True option in DockerSwarmOperator" It introduces logging of Docker Swarm services which was previously not working.
…standard and provided upstream by the Docker API. Therefore in DockerSwarmOperator follow is always false.
… API for logs. As we indicated in the previous commit, the docker client malfunctions when we try to get the logs with follow=True. Therefore we make multiple calls to the API (every 2 seconds) to fetch the new logs.
…6 instead of 5) as we check if the service has terminated (+1). As this assertion makes less sense in a situation where we do multiple calls to the Docker API (polling), we might think of removing it or replacing it with something more suitable.
…in the Docker Swarm Operator. We log two times a different message and we assert that the two lines are given back in the logs in the expected sequence.
docker.errors.APIError: 503 Server Error for http+docker://localhost/v1.43/services/create: Service Unavailable ("This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.") Revert "Final commit of this PR marking the test case that validates logging in the Docker Swarm Operator. We log two times a different message and we assert that the two lines are given back in the logs in the expected sequence." This reverts commit 048ba1e.
Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions. |
Nice :) |
Hey, thanks a lot! |
* Fixes #28452: "TaskInstances do not succeed when using enable_logging=True option in DockerSwarmOperator". It introduces logging of Docker Swarm services, which was previously not working.
* tty=True/False to be chosen by the user, as was the case before this fix (#28452).
* follow=True for logs will always result in tasks not ending. This is standard and provided upstream by the Docker API. Therefore, in DockerSwarmOperator follow is always False.
* service_logs called multiple times as we continuously poll the Docker API for logs. As we indicated in the previous commit, the docker client malfunctions when we try to get the logs with follow=True. Therefore we make multiple calls to the API (every 2 seconds) to fetch the new logs.
* service_logs called multiple times. In this test the tasks increase (6 instead of 5) as we check if the service has terminated (+1). As this assertion makes less sense in a situation where we do multiple calls to the Docker API (polling), we might think of removing it or replacing it with something more suitable.
* Final commit of this PR marking the test case that validates logging in the Docker Swarm Operator. We log two different messages and assert that the two lines are given back in the logs in the expected sequence.
* Formatting: ruff.
* Reverting as GitHub Actions don't run this test as a swarm node: docker.errors.APIError: 503 Server Error for http+docker://localhost/v1.43/services/create: Service Unavailable ("This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again."). Revert "Final commit of this PR marking the test case that validates logging in the Docker Swarm Operator. We log two different messages and assert that the two lines are given back in the logs in the expected sequence." This reverts commit 048ba1e.
* Logging "since" timestamp to avoid memory issues. Fix for #28452.
* Formatting - Fix for #28452.
* Fix bugs (#28452): Correctly assign last_line_logged, last_timestamp.
The DockerSwarmOperator was impractical, as the Docker service it created never finished in the Airflow scheduler, and logs were not received. With this fix we make Docker Swarm services finite and track their logs properly.
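As a hypothetical usage example (image, schedule, and IDs are illustrative), with the fix the option behaves as a user would expect: the task streams the service output into the Airflow task log and finishes once the swarm service terminates.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker_swarm import DockerSwarmOperator

with DAG(
    dag_id="docker_swarm_logging_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
):
    DockerSwarmOperator(
        task_id="hello_swarm",
        image="alpine:latest",
        command="echo hello from the swarm service",
        enable_logging=True,  # the option this PR fixes
    )
```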
The logging logic works like this:
For the current service we ask the Docker API client to give us the logs with follow=False; follow=True hangs forever once the service has finished. We poll the Docker API in 2-second intervals, sending the logs to the console and checking whether the service has terminated in order to break the loop.
Closes: #28452
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.