Seacas snapshot push causes Nalu throws #13773

Closed
spdomin opened this issue Feb 3, 2025 · 9 comments
Labels
type: bug The primary issue is a bug in Trilinos code or tests

Comments

@spdomin
Contributor

spdomin commented Feb 3, 2025

Bug Report

It looks like a new STK throw is occurring in the create_mesh() method of Nalu:

what(): global_minmax - Attempting mpi while in barrier owned by 2
I will start a bisect now and report back. FYI:

Good:

NaluCFD/Nalu SHA1: aa35b4d3d1dd9cc2d63ea79e1a1d34c3970ed25e
Trilinos/develop SHA1: dd813e0c9dd54e4a9cbc3b1c6cc07f255ed9cff6

Bad:

The following tests FAILED:
         32 - heliumPlume (Failed)
NaluCFD/Nalu SHA1: aa35b4d3d1dd9cc2d63ea79e1a1d34c3970ed25e
Trilinos/develop SHA1: 221f025eb60ea91ca64efca74ac6f8689393afcd
@spdomin spdomin added the type: bug The primary issue is a bug in Trilinos code or tests label Feb 3, 2025
@alanw0
Contributor

alanw0 commented Feb 3, 2025

ok, keep us posted. The last stk snapshot into Trilinos was approx. December 13.
(See packages/stk/CHANGELOG.md)

@spdomin
Contributor Author

spdomin commented Feb 3, 2025

@alanw0 - strange... Since I am on top of Trilinos updates, the number of bisect iterations is very small. I should have an answer in the next hour. I will change the title to be more representative as well:)
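
Note on why the bisect is short: a bisect halves the range between the known-good and known-bad commit on every build, so the number of builds grows only with the logarithm of the number of commits in between. The following is a minimal sketch of that idea, not the `git bisect` implementation; the commit names, the `first_bad_commit` helper, and the `is_bad` callback are hypothetical and stand in for "build Trilinos at this SHA and run heliumPlume".

```python
from typing import Callable, Sequence, Tuple


def first_bad_commit(
    commits: Sequence[str], is_bad: Callable[[str], bool]
) -> Tuple[str, int]:
    """Return (first bad commit, number of commits tested).

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and every commit after the first bad one is bad.
    """
    lo, hi = 0, len(commits) - 1  # lo is always good, hi is always bad
    steps = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        steps += 1
        if is_bad(commits[mid]):
            hi = mid  # the culprit is at mid or earlier
        else:
            lo = mid  # the culprit is after mid
    return commits[hi], steps


# Hypothetical history of 17 commits in which "c11" introduces the failure.
history = [f"c{i}" for i in range(17)]
bad_from = history.index("c11")

found, steps = first_bad_commit(history, lambda c: history.index(c) >= bad_from)
```

With 16 commits between the good and bad endpoints, the sketch tests four commits before it isolates the culprit.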

@spdomin
Contributor Author

spdomin commented Feb 3, 2025

64d17de is the first bad commit
commit 64d17de
Author: Greg Sjaardema
Date: Thu Jan 30 08:40:55 2025 -0700

Automatic snapshot commit from seacas at 94e88d4519

@gdsjaar - I can build a debug executable now to see if I can learn more about this throw. heliumPlume is the test that is failing, and it fails quite early. Aside from the two meshes it creates (plus a physics realm and an IO realm), it's not clear what is special about it. Oh, right: this is the test that has failed in the past because of the following output settings:

  serialized_io_group_size: 2
  output_data_base_name: heliumPlume.e
  output_frequency: 4 
  output_node_set: no
  compression_level: 9
  compression_shuffle: yes

@spdomin spdomin changed the title from "STK create mesh throw: what(): global_minmax" to "Seacas snapshot push causes Nalu throws" Feb 3, 2025
@gsjaardema
Contributor

OK, it looks like we have a collective call that doesn't realize it is inside a serialized IO code block, which doesn't work.

I will see if I can figure out where that is and add a test to catch it in the future.
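
The failure mode described above can be sketched in a few lines. This is a single-process illustration only, not the SEACAS/IOSS code: `SerializedSection` and this `global_minmax` are hypothetical stand-ins, and the reduction is faked with `min`/`max` rather than MPI. The point is that inside a serialized block only some ranks are active at a time, so a collective that needs every rank cannot complete; a guard that raises, as in the message reported in this issue, turns a would-be hang into an error that a test can catch.

```python
class SerializedSection:
    """Hypothetical guard marking a block in which ranks take turns doing IO."""

    _owner = None  # owner currently inside the section, or None

    def __init__(self, owner: int):
        self.owner = owner

    def __enter__(self):
        SerializedSection._owner = self.owner
        return self

    def __exit__(self, exc_type, exc, tb):
        SerializedSection._owner = None
        return False  # never swallow exceptions

    @classmethod
    def current_owner(cls):
        return cls._owner


def global_minmax(local_values):
    """Stand-in for a collective min/max reduction across all ranks."""
    owner = SerializedSection.current_owner()
    if owner is not None:
        # A real collective would wait on ranks that are not running yet.
        raise RuntimeError(
            f"global_minmax - Attempting mpi while in barrier owned by {owner}"
        )
    return min(local_values), max(local_values)


# Outside the serialized block the collective is fine.
outside = global_minmax([3, 1, 2])

# Inside the block the guard fires; this is the shape of a regression test.
try:
    with SerializedSection(owner=2):
        global_minmax([3, 1, 2])
    message = None
except RuntimeError as err:
    message = str(err)
```

A regression test along these lines would run the IO path with serialization enabled and assert that no such error is raised.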

@gsjaardema
Contributor

@spdomin If you do have a debug build, it would be helpful to see the stacktrace at the time of the failure. Otherwise I can try to reproduce and/or look at the code changes and infer where we went wrong.

@spdomin
Contributor Author

spdomin commented Feb 4, 2025

I can start a debug build now and report back.

@spdomin
Contributor Author

spdomin commented Feb 4, 2025

I have seen all of this before, the last time this serialized IO was broken:
Info: found non-zero serialized_io_group_size in input file= 2

Image of the trace (in TV) is as follows:

[Image: screenshot of the stack trace]

Otherwise, the throw above is noted. Let me know if you want a link to the executable and test (separate email).

@gsjaardema
Contributor

Thanks. I'm pretty sure I found the issue and have added the fix to the PR.

@spdomin
Contributor Author

spdomin commented Feb 5, 2025

Looks clean:

100% tests passed, 0 tests failed out of 84
NaluCFD/Nalu SHA1: aa35b4d3d1dd9cc2d63ea79e1a1d34c3970ed25e
Trilinos/develop SHA1: ef936a4

@spdomin spdomin closed this as completed Feb 5, 2025