
Failure to Restart on Large Meshes with Time-Averaged Data #949

Closed
GomerOfDoom opened this issue Apr 24, 2020 · 35 comments · Fixed by #980

@GomerOfDoom
Contributor

GomerOfDoom commented Apr 24, 2020

sd2_case1b_ddes_v7.cfg.txt

Hello,

We are currently unable to restart SU2 in DDES mode from a restart file that includes time-averaged data on very large (~180 million cell) meshes.

Compiled in release mode, the code gives the error "FGMRES orthogonalization failed, linear solver diverged."

Compiled in debug mode, the code issues an assertion failure at line 1881 of $SU2_HOME/SU2_CFD/src/numerics_structure.cpp, which is a check in the CNumerics::SetRoe_Dissipation(...) method to make sure that the variable 'Dissipation_j' is between zero and one.

This problem only appears when attempting to restart from solution files that include "TIME_AVERAGE" data on very large meshes.

Note that the above behavior occurs with commit 382e82f of the "develop" branch.

I have pulled the latest commits of develop (c093a62) and master (d9c867d), but get segfaults during Jacobian structure initialization when attempting to restart on multiple cores.

All help is appreciated.

-Paul

To Reproduce
Config file attached, but mesh file is quite large... 17.6 GB.

Desktop (please complete the following information):

  • Department of Defense Unclassified System: "Onyx"
  • System type: Cray XC40/50
  • OS: Variant of SuSe Linux 12.3 and/or Cray Linux Environment
  • Compiler: Intel 19.0.1.144
  • MPI: cray-mpich 7.6.3
  • SU2 v. 7.0.1, develop branch, commit 382e82f (and segfaults with the latest commits of develop (c093a62) and master (d9c867d)).
@economon
Member

Hey Paul! Sorry to hear this is an issue for you. ~180 million cells is quite large... any chance you can recreate it on something much smaller? If you roll back some commits, does it work again? If you can find the exact commit, that would be helpful.

On a smaller case, you might check whether the time averaged data is being written as you expect to the restart file by using an ASCII file and viewing it (add RESTART_ASCII to your OUTPUT_FILES). It is just a normal CSV. Might give you some clues.

If the number of columns in the restart is incorrect, or not what you expect, then I am guessing the loading of the restart is to blame and some quantities are not getting initialized correctly. That is done in CEulerSolver::LoadRestart() for the compressible code.
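For reference, a minimal sketch of the RESTART_ASCII suggestion above, assuming the usual SU2 option syntax (the other entries in the list are just placeholders for whatever you already write out):

% add an ASCII (CSV) restart to the list of outputs
OUTPUT_FILES= (RESTART, RESTART_ASCII, PARAVIEW)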

@GomerOfDoom
Contributor Author

Hello Tom,

Thanks for responding. I have been trying to recreate the problem on smaller meshes, but so far, no luck.

However, I am seeing a problem that may be related on a smaller mesh. This is occurring with the latest commit (c093a62) of develop:

When compiled with MPI, with a buildtype of either release or debug, I am getting the following when attempting to restart a DDES:

  1. Compiled in release mode: segmentation faults following "initialize jacobian structure (SA model)"
  2. Compiled in debug mode:
|            Total Time|             1|             1|         s|             1|
|             Time Step|      1.25e-06|             1|         s|      1.25e-06|
+------------------------------------------------------------------------------+
Initialize Jacobian structure (Navier-Stokes). MG level: 0.
Initialize Jacobian structure (SA model).
SU2_CFD_MPI_DEBUG: ../SU2_CFD/src/output/../../../Common/include/geometry/../toolboxes/C2DContainer.hpp:294: const Scalar_t& container_helpers::AccessorImpl<Index_t, Scalar_t, StorageType::ColumnMajor, AlignSize, 0, 1>::operator()(Index_t) const [with Index_t = long unsigned int; Scalar_t = double; long unsigned int AlignSize = 64]: Assertion `i>=0 && i<m_rows' failed.

This problem appears on a relatively small (~7 million cell) mesh of the same geometry as the larger one I mentioned in my initial post.

Running the simulation in serial completes successfully and does not produce any errors.

I have put the RANS result, mesh, and DDES configure file and restart files on a shared folder:

https://drive.google.com/open?id=1IDpK_0xYktI-lbG-ozre9Sf_3ksC7xtt

Any/all help is greatly appreciated.

-Paul

@GomerOfDoom
Contributor Author

An update to the issue mentioned above...

I was able to attach debuggers to a 4-core run of the case I posted yesterday.

The code runs fine until it gets into CEulerSolver::SetUpwind_Ducros_Sensor.

At some point in this method, during a loop through points, an invalid number is returned by the call to:
geometry->node[iPoint]->GetPoint(iNeigh)
(line 3759 of CEulerSolver.cpp).

I have attached backtraces from each of the processes I was running. In each case, you can see that "jPoint" gets assigned a value here that looks to me like garbage (for example jPoint=18,446,744,073,709,551,615, i.e. 2^64 - 1, the value an unsigned 64-bit integer takes when -1 or an out-of-range result wraps around). This value is then passed to nodes->GetGradient_Primitive(jPoint,iDim+1,iDim).

I have no idea why a number like this would be returned... Any ideas?

Thanks!

-Paul
backtrace_1.txt
backtrace_2.txt
backtrace_3.txt
backtrace_4.txt

@GomerOfDoom
Contributor Author

One more thing... it looks like the assignment of the number of neighbors for iPoint is using
geometry->node[iPoint]->GetnNeighbor()
But, shouldn't it be using:
geometry->node[iPoint]->GetnPoint()
instead?
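
To make the question concrete, here is a minimal sketch of the neighbor loop as I understand it, with the size taken from GetnPoint() instead of GetnNeighbor() (types, the accumulator name, and the surrounding variables are illustrative, not the verbatim SU2 source):

// Size the loop from the adjacency (GetnPoint), so that GetPoint(iNeigh)
// is never asked for an index beyond the adjacency list.
const unsigned short nNeigh = geometry->node[iPoint]->GetnPoint();

su2double div_vel = 0.0;
for (unsigned short iNeigh = 0; iNeigh <= nNeigh; iNeigh++) {
  // iNeigh == nNeigh is the point's own contribution; otherwise fetch a neighbor.
  const unsigned long jPoint = (iNeigh == nNeigh) ? iPoint
                                                  : geometry->node[iPoint]->GetPoint(iNeigh);
  for (unsigned short iDim = 0; iDim < nDim; iDim++) {
    div_vel += nodes->GetGradient_Primitive(jPoint, iDim+1, iDim);
  }
}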

@bmunguia
Member

It also looks like the loops over iNeigh in CEulerSolver::SetUpwind_Ducros_Sensor in the branches you mentioned either have the exit condition > nNeigh or <= nNeigh, so I guess the bug was almost fixed. Try changing it to <

@GomerOfDoom
Contributor Author

Hi Brian,

I'm not seeing the exit condition > nNeigh you mentioned. As for the <= nNeigh, I believe that when iNeigh = nNeigh, the point's own value is called for, rather than its neighbors'. I could be wrong about that, so feel free to double-check me.

-Paul

@bmunguia
Member

382e82f has iNeigh > geometry->node[iPoint]->GetnNeighbor().

And yes, you're right about when iNeigh == nNeigh. My bad.

@GomerOfDoom
Contributor Author

Hello,

So, to clarify, there were two issues:

  1. With the old commit (382e82f), we were seeing the assertion failure at line 1881 of numerics_structure.cpp, but only with the very large mesh (180 million cells).

  2. With the newer commit (c093a62), we were seeing the assertion failure at line 294 of C2DContainer.hpp, which was occurring even with smaller meshes (7 million cells).

The solution to (2) appears to be to change geometry->node[iPoint]->GetnNeighbor() to geometry->node[iPoint]->GetnPoint() at line 3759 of CEulerSolver.cpp, in the CEulerSolver::SetUpwind_Ducros_Sensor() method.

I made this change locally, and attempted to run on our large mesh. Issue (2) seems to be fixed, but we still run into issue (1).

I have now gone through the read restart routines, and have found a potential issue:

For reference, the restart file for our large mesh with averaging data included consists of:
39 fields * 75,107,967 points = 2,929,210,713 variable values (which is larger than INT_MAX).

Beginning at line 3931 of CSolver.cpp, in method CSolver::Read_SU2_Restart_Binary(...), we have the following:

int *blocklen = new int[geometry->GetnPointDomain()];
int *displace = new int[geometry->GetnPointDomain()];
int counter = 0;
for (iPoint_Global = 0; iPoint_Global < geometry->GetGlobal_nPointDomain(); iPoint_Global++ ) {
  if (geometry->GetGlobal_to_Local_Point(iPoint_Global) > -1) {
    blocklen[counter] = nFields;
    displace[counter] = iPoint_Global*nFields; // int assignment: overflows once iPoint_Global*nFields exceeds INT_MAX
    counter++;
  }
}
MPI_Type_indexed(geometry->GetnPointDomain(), blocklen, displace, MPI_DOUBLE, &filetype);

The problem here is that for our case, where iPoint_Global can get up to 75,107,967 and nFields = 39, the value assigned to displace[counter] in the loop can overrun INT_MAX. This would result in potentially garbage / incorrect displace values being passed to MPI_Type_indexed(...).

Unfortunately, simply changing displace to a long int * won't work, as the expected argument type for MPI_Type_indexed(...) is int.

Could it be that, given the limitations of MPI here, such large restart data has to be read serially by one rank, with the data then split and broadcast to the other ranks?

I am not an MPI expert, so there may be another way to do this. Thoughts?

-Paul

@economon
Member

How about a quick fix like removing some fields from the output if possible (39 is a lot)? Are there some auxiliary variables apart from the conservatives and time averages that you don't need?

We'll have to look into the MPI issue with INT_MAX.. we will probably hit other snags like this throughout the code as we keep increasing mesh size. I bet there are some int vs. long issues hidden.

@GomerOfDoom
Contributor Author

GomerOfDoom commented Apr 30, 2020 via email

@talbring
Member

We can just look at the field names and read in the coordinates, solution, and grid velocity (if needed). That's actually quite an easy fix. I wanted to do that anyway at some point. Another solution would be to read chunks of "columns" if we really have very big meshes ...

@talbring
Member

The strange thing is that OpenMPI already uses a long long type for some index types ... e.g. for MPI_Offset ...

@talbring
Member

@GomerOfDoom I would say that this issue is the perfect place to discuss that :)

@economon
Member

Actually, @vdweide predicted exactly this issue in #696. I can report that, after updating the CGNS reader in #728, it is possible to run the solver on meshes with over 1 billion grid points (it was a hex cube, so also roughly 1 billion cells). I did not go much beyond that or try to do a restart, but I am certain we will uncover additional issues like this. @talbring, good ideas for the restart, we should start there.

I am fine with keeping this issue (maybe pin it if necessary), or we can reopen #696.

@GomerOfDoom
Contributor Author

In a separate email, Edwin mentioned using MPI_Type_create_hindexed(...), which I also came across in looking at this issue. MPI_Type_create_hindexed() specifies displacements in MPI_Aint, rather than int, and should allow for displacement values that get around the INT_MAX limit.

However, there are probably other repercussions that need to be explored when making this kind of change, as I'm sure there are other places in the code that use similar functionality.
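
To make that concrete, a minimal sketch of the swap, reusing the loop quoted earlier from CSolver::Read_SU2_Restart_Binary (note that, unlike MPI_Type_indexed, MPI_Type_create_hindexed takes its displacements in bytes, hence the extra sizeof(double) scaling):

int      *blocklen = new int[geometry->GetnPointDomain()];
MPI_Aint *displace = new MPI_Aint[geometry->GetnPointDomain()];
int counter = 0;
for (iPoint_Global = 0; iPoint_Global < geometry->GetGlobal_nPointDomain(); iPoint_Global++ ) {
  if (geometry->GetGlobal_to_Local_Point(iPoint_Global) > -1) {
    blocklen[counter] = nFields;
    // MPI_Aint is an 8-byte integer on 64-bit systems, so this product no
    // longer wraps around at INT_MAX; displacements are given in bytes here.
    displace[counter] = (MPI_Aint)iPoint_Global * nFields * sizeof(double);
    counter++;
  }
}
MPI_Type_create_hindexed(geometry->GetnPointDomain(), blocklen, displace, MPI_DOUBLE, &filetype);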

@vdweide
Contributor

vdweide commented May 1, 2020

The MPI_Type_create_hindexed will indeed solve the integer overflow encountered in MPI_Type_indexed, as the former uses MPI_Aint (8-byte integers) for the displacements and the latter regular 4-byte integers. However, as @GomerOfDoom mentioned, there may be issues for the discrete adjoint (I saw that Type_indexed is actually present in medi) if we simply replace MPI_Type_indexed by MPI_Type_create_hindexed. So @talbring, @MaxSagebaum and @economon, do you foresee any problems here? If not, then it is a very simple change of a couple of lines.

@talbring
Member

talbring commented May 1, 2020

@vdweide The file reading/writing does not involve any AD types, so we can just change it right away I think.

@vdweide
Contributor

vdweide commented May 1, 2020

Good, that should be an easy fix then (I hope).

@pcarruscag
Member

The solution to (2) appears to be to change geometry->node[iPoint]->GetnNeighbor() to geometry->node[iPoint]->GetnPoint() at line 3759 of CEulerSolver.cpp, in the CEulerSolver::SetUpwind_Ducros_Sensor() method.

@GomerOfDoom thanks for tracking that down. I fixed the exit condition on that method, but only now that I'm refactoring CPoint in #966 have I realized the difference between nPoint and nNeighbor... Does your case have periodicity?
I will fix incorrect uses of nNeighbor in other areas of the code as part of #966; time domain with moving grids will have similar issues.

@GomerOfDoom
Contributor Author

I have created a branch "fix_large_mesh" in which I am changing those MPI_Type_indexed calls. I am testing the changes now on our large case. Will let you all know how it goes, and then submit a pull request.

As for the node[...]->GetnNeighbor issue, the case I have been working on does not have periodicity, but other cases may. Can you clarify the difference between nNeighbor and nPoint? Also, note that this issue could have been caused by an incorrect setting of nNeighbor due to the MPI_Type_indexed bug.

@pcarruscag
Member

pcarruscag commented May 1, 2020

nPoint is given by the adjacency, i.e. the number of edges connected to a given point.
nNeighbor is (initially) the same, but because it is used in some numerical schemes, some MPI communication is done for it, and so I guess in some circumstances it will not be in sync with nPoint.
Moral of the story: nNeighbor is for numerical schemes, nPoint is for looping over neighboring edges/points (arguably the naming is not great).

@vdweide
Contributor

vdweide commented May 7, 2020

This one is related to the integer overflow problem for very large meshes when writing paraview XML files. In the function CParaviewXMLFileWriter::WriteDataArray the variable totalByteSize is defined as an integer. When writing lots of data, especially the connectivity data, this variable overflows, as @GomerOfDoom experienced. Is it possible to define this byte size to be of type Int64 in paraview instead of Int32?

@talbring
Member

talbring commented May 7, 2020

@vdweide I have not found any information in the documentation of the XML paraview format whether that is possible. The size of the data is just dumped as an int32 in front of the binary data blob, so there must be a flag to tell paraview that the size is encoded using a different datatype ...

But anyway, that is not related to the restart files in any way.

@BeckettZhou

Folks, I have also been experiencing issues with large mesh restarts, but I am not sure if it's related to what Paul was experiencing. Here's the problem:

We are running a multi-zone mesh with 65M elements. The simulation starts just fine on our cluster. After 2000 time steps, I stop the run, switch on the COMPUTE_AVERAGE flag and restart the simulation. And it hangs right when it is reading the first restart file.

The problem occurs only when I restart with 20 nodes (480 cores). If I use 15 nodes, it runs just fine.

If I start from freestream, I can use any number of nodes.

@GomerOfDoom
Contributor Author

Hey guys... I just want to clarify, as there seems to be three issues being discussed:

Issue 1) SOLVED - This was related to a crash in the CEulerSolver::SetUpwind_Ducros_Sensor() method - This can be fixed by changing a call from GetnNeighbor() to GetnPoint() (see my comments on April 30). I don't believe this ended up being related to the overrun in the restart file (Issue 2).

Issue 2) SOLVED - This was a crash when starting with very large meshes, caused by an integer overrun in CSolver::Read_SU2_Restart_Binary(). This issue is solved by converting the displacement variable to MPI_Aint and using MPI_Type_create_hindexed() instead of MPI_Type_indexed().

Issue 3) This is another issue I've been seeing with large simulations: the paraview .vtu files output by SU2 cannot be read by Paraview. The error in Paraview is

"ERROR: In /local/paraview/VTK/IO/XML/vtkXMLDataReader.cxx, line 400
vtkXMLUnstructuredGridReader (0x55b6b838a330): Cannot read point data array "Density" from PointData in piece 0. The data array in the element may be too short."

This is the error Edwin is referring to today. I am changing the type of totalByteSize to a long, and will test to see if that solves the issue.

Note that the solutions to Issues 1 and 2 are not yet in the develop branch. I will create a pull request (should be by end of the day today) when I have verified that the solutions work.

@talbring
Member

talbring commented May 7, 2020

@GomerOfDoom I just read that the size can be an unsigned int for the xml paraview writer. I don't think that a long will work...

@GomerOfDoom
Contributor Author

Hi Tim, can you point me to the documentation that says that? I'm not seeing it.

@talbring
Member

talbring commented May 7, 2020

That's exactly the problem. The documentation (https://vtk.org/wp-content/uploads/2015/04/file-formats.pdf) does not say a single word about the fact that the size has to go in front of the binary blob. I had to dig through some paraview mailing lists (https://public.kitware.com/pipermail/paraview/2007-October/006064.html) in order to get that information.

@vdweide
Contributor

vdweide commented May 7, 2020

@talbring, so far we used a standard integer for this size, while apparently it has to be an unsigned int according to the paraview mailing list. So maybe those integers are not used at all by the actual reader in paraview, and only the offsets in the header are used.

@GomerOfDoom, could you change the type of totalByteSize to an unsigned int and try again? For this particular case an unsigned int is just fine. If the problem still shows up when reading the file in paraview, then it is something else.

@BeckettZhou, I think that for 65M elements you should not run into integer overflow issues. Hence that is most likely a different problem.

@BeckettZhou

BeckettZhou commented May 7, 2020 via email

@vdweide
Contributor

vdweide commented May 7, 2020

I found the following: https://vtk.org/Wiki/VTK_XML_Formats#Uncompressed_Data
So it looks like it is possible to specify the type of the byte-count integer. UInt32 is the default value.
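
For illustration, a rough sketch of what that could look like on the writer side (this is not the actual CParaviewXMLFileWriter code; the function name and signature are made up for the example): if the VTKFile element declares header_type="UInt64", the byte count placed in front of each appended binary blob can be written as a 64-bit integer, which removes the overflow for very large arrays.

#include <cstdint>
#include <fstream>

// Assumes the file was opened with a VTKFile element along the lines of
//   <VTKFile type="UnstructuredGrid" byte_order="LittleEndian" header_type="UInt64">
void WriteAppendedBlob(std::ofstream &file, const double *data, uint64_t nValues) {
  // 64-bit byte count: does not wrap around at 2^31 like the old int totalByteSize.
  const uint64_t totalByteSize = nValues * sizeof(double);
  file.write(reinterpret_cast<const char*>(&totalByteSize), sizeof(totalByteSize));
  file.write(reinterpret_cast<const char*>(data), static_cast<std::streamsize>(totalByteSize));
}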

@talbring
Member

talbring commented May 9, 2020

@vdweide That was the information I was always looking for. Somehow I was not able to find it ... Thanks!

@vdweide
Contributor

vdweide commented May 9, 2020

I already changed it in the branch fix_large_mesh. If the tests of @GomerOfDoom are successful it can be merged into develop.

GomerOfDoom mentioned this issue May 11, 2020
@GomerOfDoom
Contributor Author

I was finally able to re-run the case, and it appears that the changes Edwin made have worked. The .vtu files are now readable in Paraview.

I believe this solves (for now) all of the issues we were seeing with these large meshes. I have submitted a pull request. This is my first PR, so I may have done something wrong... please let me know if I need to do additional things for the PR.

GomerOfDoom linked a pull request May 11, 2020 that will close this issue
@vdweide
Contributor

vdweide commented May 13, 2020

So I guess we can close this one.

vdweide closed this as completed May 13, 2020