Revert "Improve diagnostic task runner" #1097

Merged

Conversation

mattiarighi
Contributor

Reverts #1075

Contributor

@valeriupredoi valeriupredoi left a comment

A bit confused about this PR - in any case, @bouweandela knows more about its functionality, so I'll leave it up to him. But please do hurry up, @mattiarighi is waiting on this for the frequency PR testing 🍺

@@ -408,13 +416,10 @@ def _start_diagnostic_script(self, cmd, env):
        try:
            process = subprocess.Popen(
                cmd,
                universal_newlines=True,
                bufsize=1,
                stdout=subprocess.PIPE,
Contributor

I don't get it - why do we need to remove this (and why this PR in general)? All this does is flush the output as text, one line at a time 😕

Contributor Author

See the discussion in #1059 for the reason behind reverting this PR.

Contributor

Yeah, cheers - it looks pretty hairy. Check out my comment there; I suspect it's all about the I/O encoding.
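
For context, here is a minimal, self-contained sketch (not the ESMValTool code itself; the command below is purely illustrative) of what the two removed arguments do: with universal_newlines=True and bufsize=1, Popen yields decoded text line by line, so the parent can handle the child's output as it arrives.

```python
import subprocess
import sys

# Purely illustrative child command: prints a couple of lines.
cmd = [sys.executable, "-c", "print('line 1'); print('line 2')"]

# Line-buffered text mode, as added in #1075 and removed by this revert.
process = subprocess.Popen(
    cmd,
    universal_newlines=True,   # stdout is decoded str instead of bytes
    bufsize=1,                 # line buffering (only valid in text mode)
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)
for line in process.stdout:    # yields each line as soon as it is available
    print(line, end="")
process.wait()
```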

            while returncode is None:
                returncode = process.poll()
                txt = process.stdout.read()
                txt = txt.decode(encoding='utf-8', errors='ignore')
Contributor

No need to set the encoding explicitly: it is picked up automatically from the locale, and if the locale encoding is not UTF-8 it is not safe to hard-code UTF-8 at the decode step.
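
A minimal sketch of that point, with illustrative variable names (not taken from the ESMValTool code): the locale's preferred encoding can be queried and used instead of hard-coding UTF-8 when decoding the child's output.

```python
import locale

raw = b"some diagnostic output\n"              # bytes read from process.stdout
encoding = locale.getpreferredencoding(False)  # encoding implied by the locale
txt = raw.decode(encoding, errors="replace")   # do not crash on stray bytes
print(txt, end="")
```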

@axel-lauer
Contributor

This PR solves my problems with the tool "hanging" when writing the provenance info. Tests with recipe_smpi.yml were successful (again).

@mattiarighi
Contributor Author

Any decision about this? We need to solve this problem; otherwise we can't run any tests on other PRs.

@axel-lauer
Contributor

I would propose merging this PR now, so we can continue working on the other PRs while trying to find a solution for the problems caused by the changes reverted here.

@bouweandela
Member

> This PR solves my problems with the tool "hanging" when writing the provenance info. Tests with recipe_smpi.yml were successful (again).

The tool does not hang when writing provenance info; it hangs when executing an NCL diagnostic script that writes a lot of text to stdout or stderr.

I suspect that the actual problem is that NCL crashes when the stdout pipe buffer is full (instead of waiting until it can write to it again). This can be merged, but it does not fix the problem; it just makes it less likely to occur, because it changes the buffer size from one line of text to a few thousand bytes.
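
A hedged sketch of the failure mode described here, using an illustrative child process rather than an actual NCL script: if the parent never drains stdout, a child that writes more than the OS pipe buffer (typically around 64 KiB on Linux) blocks on its next write; reading the pipe continuously until EOF avoids that.

```python
import subprocess
import sys

# Illustrative "chatty" child that writes far more than one pipe buffer.
cmd = [sys.executable, "-c", "print('x' * 200_000)"]

process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

# Drain stdout in chunks until EOF so the child never blocks on a full pipe.
for chunk in iter(lambda: process.stdout.read(4096), b""):
    sys.stdout.write(chunk.decode("utf-8", errors="ignore"))
returncode = process.wait()
```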

@mattiarighi
Contributor Author

Could this be related to the large output from the NetCDF library mentioned here?

@mattiarighi mattiarighi merged commit 6bee39d into version2_development May 29, 2019
@mattiarighi mattiarighi deleted the revert-1075-better_diagnostic_runner branch May 29, 2019 09:30
@bouweandela
Member

Yes, I suspect that is why we did not see the issue before.

@axel-lauer
Contributor

I don't think this is the problem. At least in my case, the log file contains the output from the "leave message" issued by the NCL diagnostic, which shows that the very last line of the NCL script was executed successfully. There are also no crash or error messages from NCL in the log files (at least in the cases I tested), so I doubt that NCL is the problem here.

@mattiarighi
Contributor Author

But wait, @axel-lauer does not have the NetCDF error in his log, since he is using an older environment (right?).

Another interesting thing: after commenting out the calls to log_provenance in perfmetrics, my run was successful.

It looks like an interplay of different issues.

@bouweandela
Member

The problem was made worse, not better, by #1075, which is why Axel was now seeing it even without the large NetCDF debug messages.

> Another interesting thing is that commenting out the calls to log_provenance in perfmetrics my run was successful.

I think this was coincidental, because I was also able to run perfmetrics without error occasionally.

@axel-lauer
Contributor

@mattiarighi: I am using an "old" environment and did not see such netCDF-related error messages.
