
Failure to Restart on Large Meshes with Time-Averaged Data #949

Closed
GomerOfDoom opened this issue Apr 24, 2020 · 35 comments · Fixed by #980

@GomerOfDoom
Contributor

GomerOfDoom commented Apr 24, 2020

sd2_case1b_ddes_v7.cfg.txt

Hello,

We are currently unable to restart SU2 in DDES mode from a restart file that includes time-averaged data on very large (~180 million cell) meshes.

Compiled in release mode, the code gives the error "FGMRES orthogonalization failed, linear solver diverged."

Compiled in debug mode, the code issues an assertion failure at line 1881 of $SU2_HOME/SU2_CFD/src/numerics_structure.cpp, which is a check in the CNumerics::SetRoe_Dissipation(...) method to make sure that the variable 'Dissipation_j' is between zero and one.

This problem only appears when attempting to restart from solution files that include "TIME_AVERAGE" data on very large meshes.

Note that the above behavior occurs with commit 382e82f of the "develop" branch.

I have pulled the latest commits of develop (c093a62) and master (d9c867d), but get segfaults during Jacobian structure initialization when attempting to restart on multiple cores.

All help is appreciated.

-Paul

To Reproduce
Config file attached, but mesh file is quite large... 17.6 GB.

Desktop (please complete the following information):

  • Department of Defense Unclassified System: "Onyx"
  • System type: Cray XC40/50
  • OS: Variant of SuSe Linux 12.3 and/or Cray Linux Environment
  • Compiler: Intel 19.0.1.144
  • MPI: cray-mpich 7.6.3
  • SU2 v. 7.0.1, develop branch, commit 382e82f (and segfaults with the latest commits of develop (c093a62) and master (d9c867d)).
@economon
Member

Hey Paul! Sorry to hear this is an issue for you. ~180 million cells is quite large... any chance you can recreate it on something much smaller? If you roll back some commits, does it work again? If you can find the exact commit, that would be helpful.

On a smaller case, you might check whether the time averaged data is being written as you expect to the restart file by using an ASCII file and viewing it (add RESTART_ASCII to your OUTPUT_FILES). It is just a normal CSV. Might give you some clues.

If the number of columns in the restart is incorrect, or not what you expect, then I am guessing the loading of the restart is to blame and some quantities are not getting initialized correctly. That is done in CEulerSolver::LoadRestart() for the compressible code.
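For reference, a minimal sketch of the RESTART_ASCII suggestion above, assuming the usual SU2 option syntax (the other entries in the list are just placeholders for whatever you already write out):

% add an ASCII (CSV) restart to the list of outputs
OUTPUT_FILES= (RESTART, RESTART_ASCII, PARAVIEW)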

@GomerOfDoom
Contributor Author

Hello Tom,

Thanks for responding. I have been trying to recreate the problem on smaller meshes, but so far, no luck.

However, I am seeing a problem that may be related on a smaller mesh. This is occurring with the latest commit (c093a62) of develop:

When compiled with MPI, with a buildtype of either release or debug, I am getting the following when attempting to restart a DDES:

  1. Compiled in release mode: segmentation faults following "initialize jacobian structure (SA model)"
  2. Compiled in debug mode:
|            Total Time|             1|             1|         s|             1|
|             Time Step|      1.25e-06|             1|         s|      1.25e-06|
+------------------------------------------------------------------------------+
Initialize Jacobian structure (Navier-Stokes). MG level: 0.
Initialize Jacobian structure (SA model).
SU2_CFD_MPI_DEBUG: ../SU2_CFD/src/output/../../../Common/include/geometry/../toolboxes/C2DContainer.hpp:294: const Scalar_t& container_helpers::AccessorImpl<Index_t, Scalar_t, StorageType::ColumnMajor, AlignSize, 0, 1>::operator()(Index_t) const [with Index_t = long unsigned int; Scalar_t = double; long unsigned int AlignSize = 64]: Assertion `i>=0 && i<m_rows' failed.

This problem appears on a relatively small (~7 million cell) mesh of the same geometry as the larger one I mentioned in my initial post.

Running the simulation in serial completes successfully and does not produce any errors.

I have put the RANS result, mesh, and DDES configure file and restart files on a shared folder:

https://drive.google.com/open?id=1IDpK_0xYktI-lbG-ozre9Sf_3ksC7xtt

Any/all help is greatly appreciated.

-Paul

@GomerOfDoom
Contributor Author

An update to the issue mentioned above...

I was able to attach debuggers to a 4-core run of the case I posted yesterday.

The code runs fine until it gets into CEulerSolver::SetUpwind_Ducros_Sensor.

At some point in this method, during a loop through points, an invalid number is returned by the call to:
geometry->node[iPoint]->GetPoint(iNeigh)
(line 3759 of CEulerSolver.cpp).

I have attached backtraces from each of the processes I was running. In each case, you can see that "jPoint" gets assigned a value here that looks to me like garbage (for example jPoint=18,446,744,073,709,551,615, i.e. 2^64 - 1, the value an unsigned 64-bit integer takes when -1 or an out-of-range result wraps around). This value is then passed to nodes->GetGradient_Primitive(jPoint,iDim+1,iDim).

I have no idea why a number like this would be returned... Any ideas?

Thanks!

-Paul
backtrace_1.txt
backtrace_2.txt
backtrace_3.txt
backtrace_4.txt

@GomerOfDoom
Contributor Author

One more thing... it looks like the assignment of the number of neighbors for iPoint is using
geometry->node[iPoint]->GetnNeighbor()
But, shouldn't it be using:
geometry->node[iPoint]->GetnPoint()
instead?
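
To make the question concrete, here is a minimal sketch of the neighbor loop as I understand it, with the size taken from GetnPoint() instead of GetnNeighbor() (types, the accumulator name, and the surrounding variables are illustrative, not the verbatim SU2 source):

// Size the loop from the adjacency (GetnPoint), so that GetPoint(iNeigh)
// is never asked for an index beyond the adjacency list.
const unsigned short nNeigh = geometry->node[iPoint]->GetnPoint();

su2double div_vel = 0.0;
for (unsigned short iNeigh = 0; iNeigh <= nNeigh; iNeigh++) {
  // iNeigh == nNeigh is the point's own contribution; otherwise fetch a neighbor.
  const unsigned long jPoint = (iNeigh == nNeigh) ? iPoint
                                                  : geometry->node[iPoint]->GetPoint(iNeigh);
  for (unsigned short iDim = 0; iDim < nDim; iDim++) {
    div_vel += nodes->GetGradient_Primitive(jPoint, iDim+1, iDim);
  }
}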

@bmunguia
Member

It also looks like the loops over iNeigh in CEulerSolver::SetUpwind_Ducros_Sensor in the branches you mentioned either have the exit condition > nNeigh or <= nNeigh, so I guess the bug was almost fixed. Try changing it to <

@GomerOfDoom
Contributor Author

Hi Brian,

I'm not seeing the exit condition > nNeigh you mentioned. As for the <= nNeigh, I believe that when iNeigh = nNeigh, the point's own value is called for, rather than its neighbors'. I could be wrong about that, so feel free to double-check me.

-Paul

@bmunguia
Member

382e82f has iNeigh > geometry->node[iPoint]->GetnNeighbor().

And yes, you're right about when iNeigh == nNeigh. My bad.

@GomerOfDoom
Contributor Author

Hello,

So, to clarify, there were two issues:

  1. With the old commit (382e82f), we were seeing the assertion failure at line 1881 of numerics_structure.cpp, but only with the very large mesh (180 million cells).

  2. With the newer commit (c093a62), we were seeing the assertion failure at line 294 of C2DContainer.hpp, which was occurring even with smaller meshes (7 million cells).

The solution to (2) appears to be to change geometry->node[iPoint]->GetnNeighbor() to geometry->node[iPoint]->GetnPoint() at line 3759 of CEulerSolver.cpp, in the CEulerSolver::SetUpwind_Ducros_Sensor() method.

I made this change locally, and attempted to run on our large mesh. Issue (2) seems to be fixed, but we still run into issue (1).

I have now gone through the read restart routines, and have found a potential issue:

For reference, the restart file for our large mesh with averaging data included consists of:
39 fields * 75,107,967 points = 2,929,210,713 variable values (which is larger than INT_MAX).

Beginning at line 3931 of CSolver.cpp, in method CSolver::Read_SU2_Restart_Binary(...), we have the following:

int *blocklen = new int[geometry->GetnPointDomain()];
int *displace = new int[geometry->GetnPointDomain()];
int counter = 0;
for (iPoint_Global = 0; iPoint_Global < geometry->GetGlobal_nPointDomain(); iPoint_Global++ ) {
  if (geometry->GetGlobal_to_Local_Point(iPoint_Global) > -1) {
    blocklen[counter] = nFields;
    displace[counter] = iPoint_Global*nFields; // int assignment: overflows once iPoint_Global*nFields exceeds INT_MAX
    counter++;
  }
}
MPI_Type_indexed(geometry->GetnPointDomain(), blocklen, displace, MPI_DOUBLE, &filetype);

The problem here is that for our case, where iPoint_Global can get up to 75,107,967 and nFields = 39, the value assigned to displace[counter] in the loop can overrun INT_MAX. This would result in potentially garbage / incorrect displace values being passed to MPI_Type_indexed(...).

Unfortunately, simply changing displace to a long int * won't work, as the expected argument type for MPI_Type_indexed(...) is int.

Could it be that, given the limitations of MPI here, such large restart data has to be read serially by one rank, with the data then split and broadcast to the other ranks?

I am not an MPI expert, so there may be another way to do this. Thoughts?

-Paul

@economon
Member

How about a quick fix like removing some fields from the output if possible (39 is a lot)? Are there some auxiliary variables apart from the conservatives and time averages that you don't need?

We'll have to look into the MPI issue with INT_MAX.. we will probably hit other snags like this throughout the code as we keep increasing mesh size. I bet there are some int vs. long issues hidden.

@GomerOfDoom
Contributor Author

GomerOfDoom commented Apr 30, 2020 via email

@talbring
Member

We can just look at the field names and read in the coordinates, solution, and grid velocity (if needed). That's actually quite an easy fix. I wanted to do that anyway at some point. Another solution would be to read chunks of "columns" if we really have very big meshes ...

@talbring
Member

The strange thing is that OpenMPI already uses a long long type for some index types ... e.g. for MPI_Offset ...

@talbring
Member

@GomerOfDoom I would say that this issue is the perfect place to discuss that :)

@economon
Member

Actually, @vdweide predicted exactly this issue in #696. I can report that, after updating the CGNS reader in #728, it is possible to run the solver on meshes with over 1 billion grid points (it was a hex cube, so also roughly 1 billion cells). I did not go much beyond that or try to do a restart, but I am certain we will uncover additional issues like this. @talbring, good ideas for the restart, we should start there.

I am fine with keeping this issue (maybe pin it if necessary), or we can reopen #696.

@GomerOfDoom
Contributor Author

In a separate email, Edwin mentioned using MPI_Type_create_hindexed(...), which I also came across in looking at this issue. MPI_Type_create_hindexed() specifies displacements in MPI_Aint, rather than int, and should allow for displacement values that get around the INT_MAX limit.

However, there are probably other repercussions that need to be explored when making this kind of change, as I'm sure there are other places in the code that use similar functionality.
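
To make that concrete, a minimal sketch of the swap, reusing the loop quoted earlier from CSolver::Read_SU2_Restart_Binary (note that, unlike MPI_Type_indexed, MPI_Type_create_hindexed takes its displacements in bytes, hence the extra sizeof(double) scaling):

int      *blocklen = new int[geometry->GetnPointDomain()];
MPI_Aint *displace = new MPI_Aint[geometry->GetnPointDomain()];
int counter = 0;
for (iPoint_Global = 0; iPoint_Global < geometry->GetGlobal_nPointDomain(); iPoint_Global++ ) {
  if (geometry->GetGlobal_to_Local_Point(iPoint_Global) > -1) {
    blocklen[counter] = nFields;
    // MPI_Aint is an 8-byte integer on 64-bit systems, so this product no
    // longer wraps around at INT_MAX; displacements are given in bytes here.
    displace[counter] = (MPI_Aint)iPoint_Global * nFields * sizeof(double);
    counter++;
  }
}
MPI_Type_create_hindexed(geometry->GetnPointDomain(), blocklen, displace, MPI_DOUBLE, &filetype);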

@vdweide
Contributor

vdweide commented May 1, 2020

The MPI_Type_create_hindexed will indeed solve the integer overflow encountered in MPI_Type_indexed, as the former uses MPI_Aint (8-byte integers) for the displacements and the latter regular 4-byte integers. However, as @GomerOfDoom mentioned, there may be issues for the discrete adjoint (I saw that Type_indexed is actually present in medi) if we simply replace MPI_Type_indexed by MPI_Type_create_hindexed. So @talbring, @MaxSagebaum and @economon, do you foresee any problems here? If not, then it is a very simple change of a couple of lines.

@talbring
Member

talbring commented May 1, 2020

@vdweide The file reading/writing does not involve any AD types, so we can just change it right away I think.

@vdweide
Contributor

vdweide commented May 1, 2020

Good, that should be an easy fix then (I hope).

@pcarruscag
Member

The solution to (2) appears to be to change geometry->node[iPoint]->GetnNeighbor() to geometry->node[iPoint]->GetnPoint() at line 3759 of CEulerSolver.cpp, in the CEulerSolver::SetUpwind_Ducros_Sensor() method.

@GomerOfDoom thanks for tracking that down. I fixed the exit condition on that method, but only now that I'm refactoring CPoint in #966 have I realized the difference between nPoint and nNeighbor... Does your case have periodicity?
I will fix incorrect uses of nNeighbor in other areas of the code as part of #966; time domain with moving grids will have similar issues.

@GomerOfDoom
Contributor Author

I have created a branch "fix_large_mesh" in which I am changing those MPI_Type_indexed calls. I am testing the changes now on our large case. Will let you all know how it goes, and then submit a pull request.

As for the node[...]->GetnNeighbor issue, the case I have been working on does not have periodicity, but other cases may. Can you clarify the difference between nNeighbor and nPoint? Also, note that this issue could have been caused by an incorrect setting of nNeighbor due to the MPI_Type_indexed bug.

@pcarruscag
Member

pcarruscag commented May 1, 2020

nPoint is given by the adjacency, i.e. the number of edges connected to a given point.
nNeighbor is (initially) the same, but because it is used in some numerical schemes, some MPI communication is done for it, and so I guess in some circumstances it will not be in sync with nPoint.
Moral of the story: nNeighbor is for numerical schemes, nPoint is for looping over neighboring edges/points (arguably the naming is not great).

@vdweide
Contributor

vdweide commented May 7, 2020

This one is related to the integer overflow problem for very large meshes when writing paraview XML files. In the function CParaviewXMLFileWriter::WriteDataArray the variable totalByteSize is defined as an integer. When writing lots of data, especially the connectivity data, this variable overflows, as @GomerOfDoom experienced. Is it possible to define this byte size to be of type Int64 in paraview instead of Int32?

@talbring
Member

talbring commented May 7, 2020

@vdweide I have not found any information in the documentation of the XML paraview format whether that is possible. The size of the data is just dumped as an int32 in front of the binary data blob, so there must be a flag to tell paraview that the size is encoded using a different datatype ...

But anyway, that is not related to the restart files in any way.

@BeckettZhou

Folks, I have also been experiencing issues with large mesh restarts, but I am not sure if it's related to what Paul was experiencing. Here's the problem:

We are running a multi-zone mesh with 65M elements. The simulation starts just fine on our cluster. After 2000 time steps, I stop the run, switch on the COMPUTE_AVERAGE flag and restart the simulation. And it hangs right when it is reading the first restart file.

The problem occurs only when I restart with 20 nodes (480 cores). If I use 15 nodes, it runs just fine.

If I start from freestream, I can use any number of nodes.

@GomerOfDoom
Contributor Author

Hey guys... I just want to clarify, as there seems to be three issues being discussed:

Issue 1) SOLVED - This was related to a crash in the CEulerSolver::SetUpwind_Ducros_Sensor() method - This can be fixed by changing a call from GetnNeighbor() to GetnPoint() (see my comments on April 30). I don't believe this ended up being related to the overrun in the restart file (Issue 2).

Issue 2) SOLVED - This was a crash when starting with very large meshes, caused by an integer overrun in CSolver::Read_SU2_Restart_Binary(). This issue is solved by converting the displacement variable to MPI_Aint and using MPI_Type_create_hindexed() instead of MPI_Type_indexed().

Issue 3) This is another issue I've been seeing with large simulations: the paraview .vtu files output by SU2 cannot be read by Paraview. The error in Paraview is

"ERROR: In /local/paraview/VTK/IO/XML/vtkXMLDataReader.cxx, line 400
vtkXMLUnstructuredGridReader (0x55b6b838a330): Cannot read point data array "Density" from PointData in piece 0. The data array in the element may be too short."

This is the error Edwin is referring to today. I am changing the type of totalByteSize to a long, and will test to see if that solves the issue.

Note that the solutions to Issues 1 and 2 are not yet in the develop branch. I will create a pull request (should be by end of the day today) when I have verified that the solutions work.

@talbring
Member

talbring commented May 7, 2020

@GomerOfDoom I just read that the size can be an unsigned int for the xml paraview writer. I don't think that a long will work...

@GomerOfDoom
Contributor Author

Hi Tim, can you point me to the documentation that says that? I'm not seeing it.

@talbring
Member

talbring commented May 7, 2020

That's exactly the problem. The documentation (https://vtk.org/wp-content/uploads/2015/04/file-formats.pdf) does not say a single word about the fact that the size has to go in front of the binary blob. I had to dig through some paraview mailing lists (https://public.kitware.com/pipermail/paraview/2007-October/006064.html) in order to get that information.

@vdweide
Contributor

vdweide commented May 7, 2020

@talbring, so far we used a standard integer for this size, while apparently it has to be an unsigned int according to the paraview mailing list. So maybe those integers are not used at all by the actual reader in paraview, and only the offsets in the header are used.

@GomerOfDoom, could you change the type of totalByteSize to an unsigned int and try again? For this particular case an unsigned int is just fine. If the problem still shows up when reading the file in paraview, then it is something else.

@BeckettZhou, I think that for 65M elements you should not run into integer overflow issues. Hence that is most likely a different problem.

@BeckettZhou

BeckettZhou commented May 7, 2020 via email

@vdweide
Contributor

vdweide commented May 7, 2020

I found the following: https://vtk.org/Wiki/VTK_XML_Formats#Uncompressed_Data
So it looks like it is possible to specify the type of the byte-count integer. UInt32 is the default value.
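
For illustration, a rough sketch of what that could look like on the writer side (this is not the actual CParaviewXMLFileWriter code; the function name and signature are made up for the example): if the VTKFile element declares header_type="UInt64", the byte count placed in front of each appended binary blob can be written as a 64-bit integer, which removes the overflow for very large arrays.

#include <cstdint>
#include <fstream>

// Assumes the file was opened with a VTKFile element along the lines of
//   <VTKFile type="UnstructuredGrid" byte_order="LittleEndian" header_type="UInt64">
void WriteAppendedBlob(std::ofstream &file, const double *data, uint64_t nValues) {
  // 64-bit byte count: does not wrap around at 2^31 like the old int totalByteSize.
  const uint64_t totalByteSize = nValues * sizeof(double);
  file.write(reinterpret_cast<const char*>(&totalByteSize), sizeof(totalByteSize));
  file.write(reinterpret_cast<const char*>(data), static_cast<std::streamsize>(totalByteSize));
}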

@talbring
Member

talbring commented May 9, 2020

@vdweide That was the information I was always looking for. Somehow I was not able to find it ... Thanks!

@vdweide
Contributor

vdweide commented May 9, 2020

I already changed it in the branch fix_large_mesh. If the tests of @GomerOfDoom are successful it can be merged into develop.

GomerOfDoom mentioned this issue May 11, 2020
@GomerOfDoom
Contributor Author

I was finally able to re-run the case, and it appears that the changes Edwin made have worked. The .vtu files are now readable in Paraview.

I believe this solves (for now) all of the issues we were seeing with these large meshes. I have submitted a pull request. This is my first PR, so I may have done something wrong... please let me know if I need to do additional things for the PR.

GomerOfDoom linked a pull request May 11, 2020 that will close this issue
@vdweide
Contributor

vdweide commented May 13, 2020

So I guess we can close this one.

vdweide closed this as completed May 13, 2020