
Nalu diffs due to MueLu_Amesos2Smoother change #12383

Closed
spdomin opened this issue Oct 9, 2023 · 23 comments
Labels
type: bug The primary issue is a bug in Trilinos code or tests

Comments

@spdomin
Contributor

spdomin commented Oct 9, 2023

Bug Report

New diffs appeared in Nalu's nightly test suite due to recent MueLu changes.

Good: 7da08904d1d2a40ae8c33338d22ae4e9bfb99620
Bad: 9c30d791fe38aac21669aa9337d672804ff649b9

Bisect result (cc @cgcgcg):

eebc8109506cd267837b15d2fe342105d74c2460 is the first bad commit
commit eebc8109506cd267837b15d2fe342105d74c2460
Author: Christian Glusa <[email protected]>
Date:   Fri Sep 29 12:42:23 2023 -0600

    MueLu_Amesos2Smoother: Fixes for projection and non-contiguous maps
    
    - Projection was incorrect for nullspaces of dim>1.
    - Check for non-contiguous map was insufficient.

This one looks like a bug fix. Let me know whether: 1) the change is intended, 2) it improves code quality, and 3) the Nalu diffs should be accepted.

@spdomin spdomin added the type: bug The primary issue is a bug in Trilinos code or tests label Oct 9, 2023
@github-actions

github-actions bot commented Oct 9, 2023

Automatic mention of the @trilinos/muelu team

@cgcgcg
Contributor

cgcgcg commented Oct 9, 2023

@spdomin Thanks for reporting this. How significant are the diffs? We've had some applications bug out due to a loss of convergence, but that should be fixed. I take it that's not the case here?

@spdomin
Contributor Author

spdomin commented Oct 9, 2023

The loss of convergence that I recently reported was from a new eigenvalue approach that Chris pushed; however, I thought I closed that issue once we agreed it was a better change (I also increased the Chebyshev eigenvalue iteration count). Perhaps you mean another change, or this one?

In my cases, I do not see large diffs, just differences at the order-of-operations (round-off) level. In fact, only the "strict" GNU platform found the diffs; the Mac platform is more lax in its diff checking.

Moreover, in almost all of the cases I spot-checked, the average linear solver iteration count for the continuity solve was the same, although this was not an exhaustive check since it is currently done by hand with "diff" or "tkdiff".

@cgcgcg
Contributor

cgcgcg commented Oct 9, 2023

No, I managed to break Albany with the commit you referenced. I did push a fix the next day, but wanted to make sure it's not that. If there is indeed no change in iterations, I would suggest approving the diffs.

@spdomin
Contributor Author

spdomin commented Oct 9, 2023

Okay, I suppose I will check the diffs between these two changes to make sure all is well.

To be clear, the latest code base in MueLu is an improvement over the previous implementation, correct? If so, I can proceed with the re-bless.

@cgcgcg
Contributor

cgcgcg commented Oct 9, 2023

Yes. That commit and the subsequent one made two changes:

  • An improved option for using the direct solver as a pseudoinverse by shifting the nullspace away from zero (see the sketch after this list);
  • a fix for solves with non-contiguous maps. (This essentially only affected unit tests that are so small that MueLu just constructs a direct solver.)
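To sketch the pseudoinverse idea in the first bullet (this describes the general shift-by-nullspace technique and may differ in detail from MueLu's exact implementation): for a symmetric singular matrix A with an orthonormal nullspace basis Q, the coarse direct solver can factor the shifted operator

    A_shifted = A + Q * Q^T

instead of A. A_shifted is nonsingular, and for right-hand sides b orthogonal to the nullspace, x = A_shifted^{-1} * b coincides with the pseudoinverse solution A^+ * b; the result can additionally be projected so that Q^T * x = 0. For a nullspace of dimension greater than 1, that projection must use all columns of Q, which is consistent with the "Projection was incorrect for nullspaces of dim>1" note in the commit message.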

@spdomin
Contributor Author

spdomin commented Oct 9, 2023

I will proceed with the re-bless. Before I do that, do you have the set of MueLu XML parameters that activate this new code path/formulation? It looks like the number of diffing cases is much smaller than the full set of MueLu cases.

An XML is under:

https://github.com/NaluCFD/Nalu/blob/master/reg_tests/xml/milestone.xml

I have not checked whether all of the diffs point to this XML; I will do that while the cases are running.

spdomin added a commit to spdomin/NaluCFDFork that referenced this issue Oct 9, 2023
*  Resolving Trilinos push: trilinos/Trilinos@eebc810 that caused Nalu diffs: trilinos/Trilinos#12383
spdomin added a commit to NaluCFD/Nalu that referenced this issue Oct 9, 2023
@cgcgcg
Contributor

cgcgcg commented Oct 9, 2023

In order to use the pseudoinverse, you would need to set:

<ParameterList name="coarse: params">
   <Parameter name="fix nullspace" type="bool" value="true"/>
</ParameterList>

I would only recommend using that if you are 100% sure that the near-nullspace of the multigrid preconditioner is the actual nullspace because you are solving a singular problem.
For example, Poisson with pure Neumann is such a case. You are specifying (maybe implicitly, because MueLu might do that for you) that the near-nullspace is the constant vector.

For the non-contiguous maps nothing has to be changed.
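For completeness, if the MueLu parameters are assembled in C++ rather than read from the XML file, the equivalent setting would look roughly like this (a minimal sketch; it assumes the standard Teuchos::ParameterList API, the parameter names quoted above, and that the list is later handed to MueLu in the usual way):

#include <Teuchos_ParameterList.hpp>

Teuchos::ParameterList mueluParams;            // top-level MueLu parameter list
// ... the rest of the preconditioner settings ...
mueluParams.sublist("coarse: params")          // parameters passed to the coarse-level solver
           .set("fix nullspace", true);        // enable the pseudoinverse fix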

@spdomin
Contributor Author

spdomin commented Oct 9, 2023

Hmmm. None of the cases that diffed involved a singular system. We only have one test for that, and it was the one that was failing from the eigenvalue iteration change a couple of weeks ago.

@spdomin
Contributor Author

spdomin commented Oct 9, 2023

Moreover, I have clearly not activated this option. Finally, I do not believe we have non-contiguous maps. Did something else change?

@cgcgcg
Contributor

cgcgcg commented Oct 9, 2023

Ok, just to make sure: eebc810 gives you a "bad" result.
Can you also check 1d16146? That's the fix for the previous commit.

@spdomin
Contributor Author

spdomin commented Oct 9, 2023

It did not seem that the diff magnitude changed over the past week.

@cgcgcg
Contributor

cgcgcg commented Oct 9, 2023

So the diff appeared with eebc810 and did not change since? Weird.

@cgcgcg
Contributor

cgcgcg commented Oct 10, 2023

Looking at eebc810, the first two chunks should not be active since you're not setting "fix nullspace". So the diff must be due to the last chunk, which deals with non-contiguous maps. You could revert that and see if it is indeed the culprit.

@spdomin
Contributor Author

spdomin commented Oct 10, 2023

Could you refresh my memory on what constitutes a non-contiguous map? I know that in certain use cases, such as sliding mesh, overset, or periodic, we may have some sort of advanced matrix connectivity. However, many of the cases that failed are routine cases with standard inflow, open, and wall boundaries. One was a hybrid mesh for a duct flow.

Any info would be nice to have so that we can all be sure nothing else changed unknowingly. Possibly one of the smallest cases could be used to dump a matrix file.

@cgcgcg
Contributor

cgcgcg commented Oct 10, 2023

Have a look at the section "Contiguous or noncontiguous" in the Tpetra documentation:
https://docs.trilinos.org/dev/packages/tpetra/doc/html/classTpetra_1_1Map.html
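As a rough illustration (a standalone hypothetical example, not code from Nalu; it assumes Tpetra's default template parameters and a single MPI rank for simplicity), the two kinds of construction and the Map's own classification look like this:

#include <iostream>
#include <vector>
#include <Teuchos_ArrayView.hpp>
#include <Teuchos_OrdinalTraits.hpp>
#include <Tpetra_Core.hpp>
#include <Tpetra_Map.hpp>

int main(int argc, char* argv[]) {
  Tpetra::ScopeGuard tpetraScope(&argc, &argv);
  {
    using map_type = Tpetra::Map<>;
    using GO = map_type::global_ordinal_type;
    auto comm = Tpetra::getDefaultComm();

    // Contiguous: Tpetra hands each rank a consecutive block of the 100 GIDs.
    map_type contigMap(100, /*indexBase=*/0, comm);

    // Built from an explicit GID list (here with gaps), similar in spirit to
    // how Nalu builds its owned-rows map from Exodus global IDs.
    std::vector<GO> myGids = {0, 2, 5, 7};
    map_type gidMap(Teuchos::OrdinalTraits<Tpetra::global_size_t>::invalid(),
                    Teuchos::ArrayView<const GO>(myGids.data(), myGids.size()),
                    /*indexBase=*/0, comm);

    std::cout << "contigMap.isContiguous() = " << contigMap.isContiguous() << "\n"
              << "gidMap.isContiguous()    = " << gidMap.isContiguous() << std::endl;
  }
  return 0;
}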

@spdomin
Contributor Author

spdomin commented Oct 10, 2023

@alanw0, could you please add perspective on why a routine Nalu inflow/open/wall bc case would have a noncontiguous map? Is this something that STK might be "suggesting" due to global id mapping?

To review: new Nalu diffs showed up due to a recent MueLu change that touched applications that have non-contiguous maps.

@alanw0
Contributor

alanw0 commented Oct 10, 2023

We construct a Tpetra_Map like this, on line 317 of TpetraLinearSystem.C in Nalu:

ownedRowsMap_ = Teuchos::rcp(new LinSys::Map(
    Teuchos::OrdinalTraits<Tpetra::global_size_t>::invalid(),
    ownedGids, 1, tpetraComm));

The ownedGids may indeed be noncontiguous; those are the global IDs coming from the Exodus mesh. They can be arbitrary: you could have a mesh of 10 elements with IDs starting from 1 billion, etc.
But I was under the impression that Tpetra Maps internally map the user IDs to a (globally) zero-based contiguous set of IDs. Is a Tpetra Map non-contiguous based on the user IDs?
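One way to check this directly from the running code (a hedged sketch, reusing the ownedRowsMap_ name from the snippet above) is to ask the Map itself; it keeps the user-supplied GIDs rather than renumbering them, and it reports its own contiguity classification:

// Does the Map consider itself contiguous?
const bool contiguous = ownedRowsMap_->isContiguous();

// The Map hands back the original user GIDs: local index 0 maps to the
// first owned Exodus global ID, not to a renumbered zero-based ID.
const auto firstGid = ownedRowsMap_->getGlobalElement(0);

// Quick sanity check on whether the global index range is gap-free.
const auto span = ownedRowsMap_->getMaxAllGlobalIndex()
                - ownedRowsMap_->getMinAllGlobalIndex() + 1;
const bool gapFree = (static_cast<Tpetra::global_size_t>(span)
                      == ownedRowsMap_->getGlobalNumElements());

Even if the range happens to be gap-free, a Map built from an explicit GID list may still classify itself as non-contiguous, so checking isContiguous() on the actual Nalu map would settle the question.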

@alanw0
Contributor

alanw0 commented Oct 10, 2023

@spdomin for the diffing case, is it diffing on 1 processor, or only diffing in parallel?
On 1 processor, the ownedGids would be contiguous, whereas in parallel they may be messed up by the decomposition and all bets are off.
You can ncdump the un-decomposed Exodus mesh and see if it has a node map. If not, then the IDs are contiguous in serial. If it has a node map, then it will also show the (potentially arbitrary) global IDs.

@spdomin
Contributor Author

spdomin commented Oct 10, 2023

Hi @alanw0, all of the cases are parallel. I can try a test in serial once we understand the rules of what is and is not contiguous. Based on how the global IDs are served up from Exodus, we may never have a contiguous system :)

Thanks for chiming in:)

@cgcgcg
Contributor

cgcgcg commented Oct 10, 2023

There should be no change in behavior in parallel.

@spdomin
Contributor Author

spdomin commented Oct 11, 2023

Okay, this saves me extra testing; however, we are still not sure why these standard cases are interpreted as having non-contiguous maps. Again, the premise set forth is that diffs will only show up when the maps are non-contiguous. As such, the open question is why they are non-contiguous in these simple cases.

I care about this because, if the map is contiguous, then something else must have changed to cause the diffs.

@spdomin
Contributor Author

spdomin commented Oct 31, 2023

Diffs accepted; there are still some questions on maps, but we can take those offline.

@spdomin spdomin closed this as completed Oct 31, 2023
@jhux2 jhux2 added this to MueLu Aug 12, 2024
@jhux2 jhux2 moved this to Done in MueLu Aug 12, 2024