Failure to Restart on Large Meshes with Time-Averaged Data #949
Comments
Hey Paul! Sorry to hear this is an issue for you. ~180 million cells is quite large... any chance you can recreate it on something much smaller? If you roll back some commits, does it work again? If you can find the exact commit, that would be helpful. On a smaller case, you might check whether the time-averaged data is being written to the restart file as you expect by using an ASCII file and viewing it (add RESTART_ASCII to your OUTPUT_FILES). It is just a normal CSV. It might give you some clues. If the number of columns in the restart is incorrect, or not what you expect, then I am guessing the loading of the restart is to blame and some quantities are not getting initialized correctly. That is done in CEulerSolver::LoadRestart() for the compressible code.
Hello Tom, Thanks for responding. I have been trying to recreate the problem on smaller meshes, but so far, no luck. However, I am seeing a problem on a smaller mesh that may be related. This is occurring with the latest commit (c093a62) of develop: when compiled with MPI and a build type of either release or debug, attempting to restart a DDES gives the following:
This problem appears on a relatively small (~7 million cell) mesh of the same geometry as the larger one I mentioned in my initial post. Running the simulation in serial mode completes successfully and does not produce any errors. I have put the RANS result, mesh, DDES config file, and restart files in a shared folder: https://drive.google.com/open?id=1IDpK_0xYktI-lbG-ozre9Sf_3ksC7xtt Any/all help is greatly appreciated. -Paul
An update to the issue mentioned above... I was able to attach debuggers to a 4-core run of the case I posted yesterday. The code runs fine until it gets into CEulerSolver::SetUpwind_Ducros_Sensor. At some point in this method, during a loop through points, an invalid number is returned by the call to: I have attached backtraces from each of the processes I was running. In each case, you can see that "jPoint" gets assigned a value here that looks to me like garbage (for example, jPoint=18,446,744,073,709,551,615). This value is then passed to I have no idea why a number like this would be returned... Any ideas? Thanks! -Paul
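For reference, 18,446,744,073,709,551,615 is 2^64 − 1: it is what −1 (or an underflowed count) looks like once it has been stored in a 64-bit unsigned integer. A minimal standalone sketch of the wraparound (illustrative only, not SU2 code):

```cpp
#include <cstdio>

int main() {
  // On an LP64 platform (e.g. 64-bit Linux), unsigned long is 64 bits wide, so
  // a signed -1 (or an underflowed count such as 0 - 1) stored in it wraps to
  // 2^64 - 1 and prints as 18446744073709551615 -- the "garbage" value seen
  // for jPoint in the backtraces.
  unsigned long jPoint = static_cast<unsigned long>(-1);
  std::printf("jPoint = %lu\n", jPoint);

  // The same value can show up if a neighbor loop runs past the number of
  // neighbors actually stored, e.g. because the loop bound comes from a
  // different (larger) count than the one used to fill the adjacency.
  return 0;
}
```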
One more thing... it looks like the assignment of the number of neighbors for iPoint is using
It also looks like the loops over
Hi Brian, I'm not seeing the exit condition -Paul
382e82f has And yes, you're right about when
Hello, So, to clarify, there were two issues:
The solution to (2) appears to be to change I made this change locally and attempted to run on our large mesh. Issue (2) seems to be fixed, but we still run into issue (1). I have now gone through the read-restart routines and have found a potential issue. For reference, the restart file for our large mesh with averaging data included consists of: Beginning at line 3931 of CSolver.cpp, in method
The problem here is that for our case, where Unfortunately, simply changing It may be that, given the limitations of MPI here, reading such large restart information requires the restart file to be read serially by one rank, with the data then split and broadcast to the other ranks? I am not an MPI expert, so there may be another way to do this. Thoughts? -Paul
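To make the size of the overrun concrete, here is a back-of-the-envelope sketch with assumed numbers (~180 million points, 39 restart fields of 8-byte values); it is an illustration of the arithmetic, not the actual SU2 read path:

```cpp
#include <climits>
#include <cstdio>

int main() {
  // Assumed numbers for illustration: ~180 million points, 39 restart fields per point.
  const long long nPointsGlobal = 180000000LL;
  const long long nFields       = 39LL;

  // MPI_Type_indexed takes displacements as plain int, counted in elements
  // (here: doubles). The displacement of point i is i * nFields.
  const long long lastDispl = (nPointsGlobal - 1) * nFields;   // ~7.0e9 elements

  std::printf("last displacement = %lld elements\n", lastDispl);
  std::printf("INT_MAX           = %d\n", INT_MAX);            // 2147483647
  std::printf("overflows int?    = %s\n", lastDispl > INT_MAX ? "yes" : "no");

  // In bytes (what an MPI_Aint-based displacement would carry) the same offset
  // is i * nFields * sizeof(double), which still fits easily in 64 bits.
  std::printf("last byte offset  = %lld bytes\n", lastDispl * (long long)sizeof(double));
  return 0;
}
```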
How about a quick fix like removing some fields from the output if possible (39 is a lot)? Are there some auxiliary variables apart from the conservatives and time averages that you don't need? We'll have to look into the MPI issue with INT_MAX... we will probably hit other snags like this throughout the code as we keep increasing mesh size. I bet there are some int vs. long issues hidden.
Hi Tom,
Yes, we can temporarily strip off some of the excess variables, and we may just do that in the short-term to continue the simulation. In fact, I'm not sure the averaged variables are used (is the time-averaging continued from restart, or is it started from scratch at restart?).
If the time-averaged variables from the restart aren't used, then another fix would be to change what is output in the restart file vs. what is output in the visualization files (vtu, szplt, etc).
Regardless, in an earlier simulation, we were attempting to run a 450 million cell mesh, which would have probably caused the same overrun with just the conservatives, so I agree that we'll have to start planning on how to deal with this issue down the road when larger meshes become more commonplace.
Not sure the best forum to discuss this kind of thing. Github? Slack? Email?
Hope you are well,
Paul
We can just look at the field names and read in the coordinates, solution, and grid velocity (if needed). That's actually quite an easy fix. I wanted to do that anyway at some point. Another solution would be to read chunks of "columns" if we really have very big meshes ...
The strange thing is that in OpenMPI some index types already use a long long type ... e.g. MPI_Offset ...
@GomerOfDoom I would say that this issue is the perfect place to discuss that :)
Actually, @vdweide predicted exactly this issue in #696. I can report that, after updating the CGNS reader in #728, it is possible to run the solver on meshes with over 1 billion grid points (it was a hex cube, so also roughly 1 billion cells). I did not go much beyond that or try to do a restart, but I am certain we will uncover additional issues like this. @talbring, good ideas for the restart, we should start there. I am fine with keeping this issue (maybe pin it if necessary), or we can reopen #696.
In a separate email, Edwin mentioned using MPI_Type_create_hindexed(...), which I also came across in looking at this issue. MPI_Type_create_hindexed() specifies displacements in MPI_Aint, rather than int, and should allow for displacement values that get around the INT_MAX limit. However, there are probably other repercussions that need to be explored when making this kind of change, as I'm sure there are other places in the code that use similar functionality.
The MPI_Type_create_hindexed will indeed solve the integer overflow encountered in MPI_Type_indexed, as the former uses MPI_Aint (8-byte integers) for the displacements and the latter regular 4-byte integers. However, as @GomerOfDoom mentioned, there may be issues for the discrete adjoint (I saw that Type_indexed is actually present in medi) if we simply replace MPI_Type_indexed by MPI_Type_create_hindexed. So @talbring, @MaxSagebaum and @economon, do you foresee any problems here? If not, then it is a very simple change of a couple of lines.
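A hedged sketch of the kind of substitution being discussed. Note that MPI_Type_create_hindexed takes displacements in bytes (as MPI_Aint), whereas MPI_Type_indexed takes them in multiples of the old type, so the conversion has to multiply by the element extent. The helper name BuildRowType and the variable myGlobalRows are made up for illustration; this is not necessarily how the SU2 routine is organized:

```cpp
#include <mpi.h>
#include <vector>

// Build a file-view datatype for this rank's rows of the restart file using
// MPI_Type_create_hindexed (MPI_Aint byte displacements, no INT_MAX limit)
// instead of MPI_Type_indexed (int element displacements).
MPI_Datatype BuildRowType(const std::vector<long long>& myGlobalRows, int nFields) {
  const int nRows = static_cast<int>(myGlobalRows.size());
  std::vector<int> blockLens(nRows, nFields);      // nFields doubles per row
  std::vector<MPI_Aint> byteDispls(nRows);

  MPI_Aint lb, extent;
  MPI_Type_get_extent(MPI_DOUBLE, &lb, &extent);   // size of one element in bytes

  for (int i = 0; i < nRows; ++i) {
    // The element displacement row*nFields can exceed INT_MAX for large meshes;
    // as a byte displacement held in an MPI_Aint it does not overflow.
    byteDispls[i] = static_cast<MPI_Aint>(myGlobalRows[i]) * nFields * extent;
  }

  MPI_Datatype rowType;
  MPI_Type_create_hindexed(nRows, blockLens.data(), byteDispls.data(),
                           MPI_DOUBLE, &rowType);
  MPI_Type_commit(&rowType);
  return rowType;
}
```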
@vdweide The file reading/writing does not involve any AD types, so we can just change it right away I think.
Good, that should be an easy fix then (I hope).
@GomerOfDoom thanks for tracking that, I fixed the exit condition on that method, but only now that I'm refactoring CPoint in #966 have I realized the difference between nPoint and nNeighbor... Does your case have periodicity?
I have created a branch "fix_large_mesh" in which I am changing those MPI_Type_indexed calls. I am testing the changes now on our large case. Will let you all know how it goes, and then submit a pull request. As for the node[...]->GetnNeighbor issue, the case I have been working on does not have periodicity, but other cases may. Can you clarify the difference between nNeighbor and nPoint? Also, note that this issue could have been caused by an incorrect setting of nNeighbor due to the MPI_Type_indexed bug.
nPoint is given by the adjacency, i.e., the number of edges connected to a given point.
This one is related to the integer overflow problem for very large meshes when writing paraview XML files. In the function CParaviewXMLFileWriter::WriteDataArray the variable totalByteSize is defined as an integer. When writing lots of data, especially the connectivity data, this variable overflows, as @GomerOfDoom experienced. Is it possible to define this byte size to be of type Int64 in paraview instead of Int32?
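A back-of-the-envelope illustration of how totalByteSize can overflow, with assumed cell counts (this is a sketch, not the actual writer code):

```cpp
#include <climits>
#include <cstdint>
#include <cstdio>

int main() {
  // Assumed numbers: ~1 billion hex cells, 8 connectivity entries per cell,
  // written as 8-byte integers in the appended data block.
  const std::uint64_t nCells = 1000000000ULL;
  const std::uint64_t nBytes = nCells * 8ULL * sizeof(std::int64_t);  // ~64 GB

  std::printf("connectivity bytes = %llu\n", (unsigned long long)nBytes);
  std::printf("INT_MAX            = %d\n", INT_MAX);

  // Accumulating this count in a 32-bit int wraps around; a 64-bit type keeps it.
  int           badSize  = static_cast<int>(nBytes);   // wrapped / implementation-defined
  std::uint64_t goodSize = nBytes;
  std::printf("as int      = %d\n", badSize);
  std::printf("as uint64_t = %llu\n", (unsigned long long)goodSize);
  return 0;
}
```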
@vdweide I have not found any information in the documentation of the XML paraview format whether that is possible. The size of the data is just dumped as an int32 in front of the binary data blob, so there must be a flag to tell paraview that the size is encoded using a different datatype ... But anyway, that is not related to the restart files in any way.
Folks, I have also been experiencing issues with large mesh restarts, but I am not sure if it's related to what Paul was experiencing. Here's the problem: We are running a multi-zone mesh with 65M elements. The simulation starts just fine on our cluster. After 2000 time steps, I stop the run, switch on the COMPUTE_AVERAGE flag, and restart the simulation. And it hangs right when it is reading the first restart file. The problem occurs only when I restart with 20 nodes (480 cores). If I use 15 nodes, it runs just fine. If I start from freestream, I can use any number of nodes.
Hey guys... I just want to clarify, as there seem to be three issues being discussed:

Issue 1) SOLVED - This was related to a crash in the

Issue 2) SOLVED - This was a crash when starting with very large meshes, caused by an integer overrun in

Issue 3) This is another issue I've been seeing with large simulations: the paraview .vtu files output by SU2 cannot be read in by Paraview. The error in Paraview is "ERROR: In /local/paraview/VTK/IO/XML/vtkXMLDataReader.cxx, line 400

This is the error Edwin is referring to today. I am changing the type of totalByteSize to a long, and will test to see if that solves the issue. Note that the solutions to Issues 1 and 2 are not yet in the develop branch. I will create a pull request (should be by end of the day today) when I have verified that the solutions work.
@GomerOfDoom I just read that the size can be an
Hi Tim, Can you point me to the documentation that says that? I'm not seeing it.
That's exactly the problem. The documentation (https://vtk.org/wp-content/uploads/2015/04/file-formats.pdf) does not say a single word about the fact that the size has to be in front of the binary blob. I had to dig through some paraview mailing lists (https://public.kitware.com/pipermail/paraview/2007-October/006064.html) in order to get that information.
@talbring, so far we used a standard integer for this size, while apparently it has to be an unsigned int according to the paraview mailing list. So maybe those integers are not used at all by the actual reader in paraview and only the offsets in the header are used. @GomerOfDoom, could you change the type of totalByteSize to an unsigned int and try again? For this particular case an unsigned int is just fine. If the problem still shows up when reading the file in paraview, then it is something else. @BeckettZhou, I think that for 65M elements you should not run into integer overflow issues. Hence that is most likely a different problem.
Hi Edwin,
I agree -- the fact that it hangs for 20 nodes but not 15 nodes for the exact same mesh indicates this is not related to the integer overflow issue.
Best,
Beckett
Dr. -Ing. Beckett Y. Zhou
Scientific Computing Group, TU Kaiserslautern
Bldg/Geb 34, Paul-Ehrlich-Strasse
67663 Kaiserslautern, Germany
I found the following
@vdweide That was the information I was always looking for. Somehow I was not able to find it ... Thanks!
I already changed it in the branch fix_large_mesh. If the tests of @GomerOfDoom are successful it can be merged into develop.
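For context on the mechanism being used here: the VTK XML format lets the size prefix in front of each binary data block be declared as 64-bit via the header_type attribute on the VTKFile element. A hedged sketch of what writing such a block looks like (illustrative only; the function name WriteAppendedBlock is made up, and this is not necessarily the exact change in the fix_large_mesh branch):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Write one appended raw data block the way VTK XML readers expect it when the
// file declares header_type="UInt64": an 8-byte size prefix in front of the
// binary blob instead of the default 4-byte (UInt32) one.
void WriteAppendedBlock(std::FILE* f, const std::vector<double>& data) {
  const std::uint64_t nBytes = data.size() * sizeof(double);  // can exceed 2^31
  std::fwrite(&nBytes, sizeof(nBytes), 1, f);                 // 64-bit size header
  std::fwrite(data.data(), sizeof(double), data.size(), f);   // raw data blob
}

// The corresponding XML root element would declare, e.g.:
//   <VTKFile type="UnstructuredGrid" version="1.0"
//            byte_order="LittleEndian" header_type="UInt64">
```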
I was finally able to re-run the case, and it appears that the changes Edwin made have worked. The .vtu files are now readable in Paraview. I believe this solves (for now) all of the issues we were seeing with these large meshes. I have submitted a pull request. This is my first PR, so I may have done something wrong... please let me know if I need to do additional things for the PR.
So I guess we can close this one.
sd2_case1b_ddes_v7.cfg.txt
Hello,
We are currently unable to restart SU2 in DDES mode from a restart file that includes time-averaged data on very large (~180 million cell) meshes.
Compiled in release mode, the code gives the error "FGMRES orthogonalization failed, linear solver diverged."
Compiled in debug mode, the code issues an assertion failure at line 1881 of $SU2_HOME/SU2_CFD/src/numerics_structure.cpp, which is a check in the CNumerics::SetRoe_Dissipation(...) method to make sure that variable 'Dissipation_j' is between zero and one.
This problem only appears when attempting to restart from solution files that include "TIME_AVERAGE" data on very large meshes.
Note the above behavior is occurring with commit 382e82f of the "develop" branch.
I have pulled the latest commits of develop (c093a62) and master (d9c867d), but get segfaults during Jacobian structure initialization when attempting to restart on multiple cores.
All help is appreciated.
-Paul
To Reproduce
Config file attached, but the mesh file is quite large... 17.6 GB.