MueLu: MueLu hangs when trying to "export data" such as matrices after repartitioning has occurred #3991

Closed
pwxy opened this issue Dec 4, 2018 · 19 comments
Assignees
Labels
client: ATDM Any issue primarily impacting the ATDM project client: EMPIRE All issues that most directly target the ATDM EMPIRE code CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. lacks reproducer Lacking enough information for developers to realistically reproduce the problems themselves MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area pkg: MueLu type: bug The primary issue is a bug in Trilinos code or tests

Comments

@pwxy

pwxy commented Dec 4, 2018

MueLu hangs when trying to "export data" such as matrices after repartitioning has occurred.
The MPI processes that drop out after repartitioning throw, and the run hangs:

p=3: *** Caught standard std::exception of type 'Teuchos::bad_any_cast' :

 ../../packages/muelu/src/Interface/../MueCentral/MueLu_VariableContainer.hpp:103:
 
 Throw number = 17
 
 Throw test that evaluated to true: data_->type() != typeid(T)
 
 Error, cast to type Data<Teuchos::RCP<Xpetra::Matrix<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > >> failed since the actual underlying type is 'Teuchos::RCP<Xpetra::Operator<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > >!
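The throw test quoted above (`data_->type() != typeid(T)`) compares exact static types, so a pointer stored as "pointer to Operator" is rejected when requested as "pointer to Matrix", even if the object behind it really is a matrix. The sketch below illustrates that mechanism only. It uses `std::any` and `std::shared_ptr` as stand-ins for the MueLu variable container and `Teuchos::RCP`; the `Operator` and `Matrix` structs and both helper functions are hypothetical and are not MueLu or Xpetra code.

```cpp
#include <any>
#include <cassert>
#include <memory>
#include <typeinfo>

// Hypothetical stand-ins for Xpetra::Operator and Xpetra::Matrix.
struct Operator { virtual ~Operator() = default; };
struct Matrix : Operator { int numRows = 0; };

// True only if `slot` holds exactly a shared_ptr<T>. As in the quoted throw
// test, this is an exact type comparison: inheritance is not considered.
template <class T>
bool holdsExactly(const std::any& slot) {
  return slot.type() == typeid(std::shared_ptr<T>);
}

// Tolerant lookup (an assumption, not the MueLu API): accept data stored
// either as a Matrix pointer or as an Operator pointer, and return nullptr
// when no matrix is available instead of throwing.
std::shared_ptr<Matrix> getAsMatrix(const std::any& slot) {
  if (holdsExactly<Matrix>(slot))
    return std::any_cast<std::shared_ptr<Matrix>>(slot);
  if (holdsExactly<Operator>(slot))
    return std::dynamic_pointer_cast<Matrix>(
        std::any_cast<std::shared_ptr<Operator>>(slot));
  return nullptr;
}
```

With a value stored as `std::shared_ptr<Operator>`, a direct `std::any_cast<std::shared_ptr<Matrix>>` throws `std::bad_any_cast`, which mirrors the `Teuchos::bad_any_cast` in the trace, while `getAsMatrix` falls back to a dynamic cast.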

This is develop Trilinos cloned this morning (Dec 4, 2018), SHA1 573e3290b0500eee45e582cb8fcee0b1c6476cec

Example MueLu_Driver.exe run that exhibits this issue:

mpirun -n 4 MueLu_Driver.exe --matrixType=Laplace3D --nx=50 --ny=50 --nz=4 --mx=2 --my=2 --mz=1


[ptlin@ceerws3709 scaling]$ cat scaling.xml
<ParameterList name="MueLu">

  <!--
    For a generic symmetric scalar problem, these are the recommended settings for MueLu.
  -->

  <!-- ===========  GENERAL ================ -->
    <Parameter        name="verbosity"                            type="string"   value="high"/>

    <Parameter        name="coarse: max size"                     type="int"      value="1000"/>

    <Parameter        name="multigrid algorithm"                  type="string"   value="sa"/>

    <!-- reduces setup cost for symmetric problems -->
    <Parameter        name="transpose: use implicit"              type="bool"     value="true"/>

    <!-- start of default values for general options (can be omitted) -->
    <Parameter        name="max levels"                           type="int"      value="10"/>
    <Parameter        name="number of equations"                  type="int"      value="1"/>
    <Parameter        name="sa: use filtered matrix"              type="bool"     value="true"/>
    <!-- end of default values -->

  <!-- ===========  AGGREGATION  =========== -->
    <Parameter        name="aggregation: type"                    type="string"   value="uncoupled"/>
    <Parameter        name="aggregation: drop scheme"             type="string"   value="classical"/>
    <!-- Uncomment the next line to enable dropping of weak connections, which can help AMG convergence
         for anisotropic problems.  The exact value is problem dependent. -->
    <!-- <Parameter        name="aggregation: drop tol"                type="double"   value="0.02"/> -->

  <!-- ===========  SMOOTHING  =========== -->
    <Parameter        name="smoother: type"                       type="string"   value="CHEBYSHEV"/>
    <ParameterList    name="smoother: params">
      <Parameter      name="chebyshev: degree"                    type="int"      value="2"/>
      <Parameter      name="chebyshev: ratio eigenvalue"          type="double"   value="7"/>
      <Parameter      name="chebyshev: min eigenvalue"            type="double"   value="1.0"/>
      <Parameter      name="chebyshev: zero starting solution"    type="bool"     value="true"/>
    </ParameterList>

  <!-- ===========  REPARTITIONING  =========== -->
    <Parameter        name="repartition: enable"                  type="bool"     value="true"/>
    <Parameter        name="repartition: partitioner"             type="string"   value="zoltan2"/>
    <Parameter        name="repartition: start level"             type="int"      value="2"/>
    <Parameter        name="repartition: min rows per proc"       type="int"      value="800"/>
    <Parameter        name="repartition: max imbalance"           type="double"   value="1.1"/>
    <Parameter        name="repartition: remap parts"             type="bool"     value="false"/>
    <!-- start of default values for repartitioning (can be omitted) -->
    <Parameter name="repartition: remap parts"                type="bool"     value="true"/>
    <Parameter name="repartition: rebalance P and R"          type="bool"     value="false"/>
    <ParameterList name="repartition: params">
       <Parameter name="algorithm" type="string" value="multijagged"/>
    </ParameterList> 
    <!-- end of default values -->

    <ParameterList name="export data">
      <Parameter name="A" type="string" value="{2}"/>
    </ParameterList> 


</ParameterList>
[ptlin@ceerws3709 scaling]$ 
@pwxy pwxy added pkg: MueLu type: bug The primary issue is a bug in Trilinos code or tests labels Dec 4, 2018
@pwxy pwxy added client: ATDM Any issue primarily impacting the ATDM project client: EMPIRE All issues that most directly target the ATDM EMPIRE code labels Dec 4, 2018
@pwxy
Author

pwxy commented Dec 4, 2018

I'm assuming @jhux2 and @cgcgcg are automatically added to anything with the MueLu label?

@csiefer2
Member

csiefer2 commented Dec 4, 2018

@pwxy That would be nice, but that's not the way GitHub works.
@trilinos/muelu

@cgcgcg
Contributor

cgcgcg commented Dec 4, 2018

@pwxy I started looking into fixing this. Using a try-catch block, A gets dumped correctly. I'm trying to track down the missing coordinates now.
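A minimal sketch of the try-catch idea mentioned above, under stated assumptions: it is not the change that was actually made in Trilinos, and it uses `std::any` and `std::shared_ptr` in place of the MueLu level data and `Teuchos::RCP`. The `Operator` and `Matrix` structs and the `exportIfPresent` function are hypothetical. The point is that a rank with no matrix on a level skips the export and returns normally rather than letting the cast exception escape, so it can still reach whatever collective calls the other ranks are waiting in.

```cpp
#include <any>
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-ins for Xpetra::Operator and Xpetra::Matrix.
struct Operator { virtual ~Operator() = default; };
struct Matrix : Operator {};

// Try to "export" the matrix held in `levelData` for the given rank.
// Returns true if a matrix was found and recorded in `written`, false if
// this rank has nothing to export. The bad cast is caught here so that it
// never propagates out of the export step.
bool exportIfPresent(const std::any& levelData, int rank,
                     std::vector<int>& written) {
  assert(rank >= 0);
  try {
    auto A = std::any_cast<std::shared_ptr<Matrix>>(levelData);
    if (!A) return false;     // right type, but no matrix on this rank
    written.push_back(rank);  // placeholder for writing A to disk
    return true;
  } catch (const std::bad_any_cast&) {
    return false;             // stored under a different type: skip
  }
}
```

Calling `exportIfPresent` once per simulated rank, with some slots holding a `Matrix` pointer and others holding an `Operator` pointer, records only the ranks that have a matrix and completes for all of them.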

@bartlettroscoe bartlettroscoe added the PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area label Dec 4, 2018
@bartlettroscoe
Member

CC: @srajama1 (Trilinos Linear Solvers Product Area Lead)

@pwxy, can you please provide reproducibility info? Is this using the ATDM Trilinos build configuration described at:

If not, then this is not something we can support as an ATDM issue.

Please see the standard ATDM Trilinos GitHub issue template at:

@srajama1
Contributor

srajama1 commented Dec 5, 2018

@bartlettroscoe This is not an ATDM issue in the builds-to-dashboard sense, but it is still an ATDM issue. We can support this. If we cannot support runs of @pwxy then we are in real trouble :)

@bartlettroscoe
Member

@bartlettroscoe This is not an ATDM issue in the builds-to-dashboard sense, but it is still an ATDM issue. We can support this. If we cannot support runs of @pwxy then we are in real trouble :)

@srajama1, I simply mean that if we can't reproduce this problem with the ATDM Trilinos build configuration, then we can't support it through the triaging and resolution process described in Triaging and Addressing ATDM Trilinos Failures, because we can't provide reproducibility instructions. It is out of our hands. But if you can support it, then that is fine.

NOTE: According to the policy:

please make sure that someone first adds a test that exposes this defect (one that can be run in the ATDM Trilinos builds) and then fixes the code. Please don't just "fix the code" and move on.

@pwxy
Author

pwxy commented Dec 5, 2018

@bartlettroscoe, yes I used the ATDM Trilinos build configuration described at: https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

However, I used the SEMS rhel6 environment to build Trilinos on the CEE LAN after my question yesterday morning regarding the current CEE rhel6 environment being hardwired to load SPARC modules.

@bartlettroscoe
Member

@pwxy said:

yes I used the ATDM Trilinos build configuration described at: https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

Excellent!

@jhux2, why are we not seeing an automated test failing? Do we need to add a test that runs:

mpirun -n 4 MueLu_Driver.exe --matrixType=Laplace3D --nx=50 --ny=50 --nz=4 --mx=2 --my=2 --mz=1

?

@jhux2
Member

jhux2 commented Dec 5, 2018

why are we not seeing an automated test failing?

Because this is a newly discovered bug.

@bartlettroscoe
Member

@jhux2 said:

why are we not seeing an automated test failing?

Because this is a newly discovered bug.

@jhux2, got you. As long as someone adds or updates a native MueLu test to expose this defect before fixing it, we should be good to go.

cgcgcg added a commit to cgcgcg/Trilinos that referenced this issue Dec 6, 2018
cgcgcg added a commit to cgcgcg/Trilinos that referenced this issue Dec 6, 2018
cgcgcg added a commit to cgcgcg/Trilinos that referenced this issue Dec 6, 2018
tjfulle pushed a commit to tjfulle/Trilinos that referenced this issue Dec 6, 2018
@bartlettroscoe
Member

CC: @jhux2, @csiefer2

@cgcgcg, was this resolved by PR #4008? That PR did not appear to touch any tests. Was a defect fixed that was not exposed by any native MueLu tests?

@cgcgcg
Contributor

cgcgcg commented Dec 11, 2018

@bartlettroscoe I'm waiting on some feedback from an application. It did appear to fix things for me, but I want to wait before I declare success. The defect was not exposed by any tests.

@bartlettroscoe
Member

I'm waiting on some feedback from an application. It did appear to fix things for me, but I want to wait before I declare success. The defect was not exposed by any tests.

@cgcgcg, would it be possible to add a test to MueLu that showed this bug before your change and then verified that you fixed the bug after your change?

@cgcgcg
Contributor

cgcgcg commented Dec 11, 2018

@bartlettroscoe I can do that. The only thing I'm afraid of is that I will be creating another test that breaks the ATDM builds. It seems that tests that do I/O fail at random quite frequently.

@bartlettroscoe
Member

I can do that. The only thing I'm afraid of is that I will be creating another test that breaks the ATDM builds. It seems that tests that do I/O fail at random quite frequently.

@cgcgcg, tests that do I/O don't need to be flaky. You just need to not have a race between writing and reading the same file in the same executable. If you read that file in a separate process it seems to help. I have found that to fix problems like this.
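One way to picture the ordering constraint described above is the sketch below. It only shows the weaker, single-process version of the idea: finish, flush, and close the write before anything opens the file for reading. The recommendation in the comment goes further and reads the file from a separate process. The function names `writeFile` and `readFile` are hypothetical and are not part of any Trilinos test harness.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Write `contents` to `path`, then close the stream explicitly so the data
// is flushed before any reader opens the file. Returns false on failure.
bool writeFile(const std::string& path, const std::string& contents) {
  assert(!path.empty());
  std::ofstream out(path, std::ios::trunc);
  if (!out) return false;
  out << contents;
  out.close();  // flush and release the file before it is read back
  return !out.fail();
}

// Read the whole file back. Returns an empty string if it cannot be opened.
std::string readFile(const std::string& path) {
  std::ifstream in(path);
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}
```

A test built this way would call `writeFile`, check its return value, and only then compare the result of `readFile` against the expected contents.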

@cgcgcg cgcgcg self-assigned this Dec 15, 2018
@srajama1
Contributor

@pwxy Can you please check whether this is fixed?

@bartlettroscoe bartlettroscoe added the lacks reproducer Lacking enough information for developers to realistically reproduce the problems themselves label Dec 11, 2019
@bartlettroscoe
Member

Can you please check whether this is fixed?

Better question, was an automated test ever written to demonstrate this defect and show that it was fixed?

@github-actions

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Aug 18, 2021
@github-actions

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Sep 18, 2021
@jhux2 jhux2 added this to MueLu Aug 12, 2024
@jhux2 jhux2 moved this to Done in MueLu Aug 12, 2024