Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kokkos Promotion: 2.6 -> 2.7 #2735

Closed
crtrott opened this issue May 11, 2018 · 34 comments
Closed

Kokkos Promotion: 2.6 -> 2.7 #2735

crtrott opened this issue May 11, 2018 · 34 comments

Comments

@crtrott
Copy link
Member

crtrott commented May 11, 2018

This is to track progress on the Kokkos promotion. Part of this promotion is to remove utilization of deprecated features from Trilinos. See https://github.com/kokkos/kokkos/wiki/DeprecationPage for a list of features (may not yet be complete). The branch for this is kokkos-promotion. Add a symlink into the root Trilinos directory to a separate git clone of kokkos and Kokkos-kernels with their develop branches checked out. Then add

-D Kokkos_SOURCE_DIR_OVERRIDE:STRING=kokkos \
-D KokkosKernels_SOURCE_DIR_OVERRIDE:STRING=kokkos-kernels \
-D KOKKOS_ENABLE_DEPRECATED_CODE:BOOL=OFF \

To your configure script.

@crtrott
Copy link
Member Author

crtrott commented May 11, 2018

Tpetra issues:
Fixed compilation but have failing tests now (mostly due to changes in deep_copy for copying views of unequal dimensions.

	 11 - TpetraCore_gemm_MPI_1 (Failed)
	 14 - TpetraCore_BlockCrsMatrix_MPI_4 (Failed)
	 16 - TpetraCore_BlockView_MPI_1 (Failed)
	 20 - TpetraCore_BlankRowBugTest_MPI_2 (Failed)
	 41 - TpetraCore_CrsMatrix_UnitTests2_MPI_4 (Failed)
	 44 - TpetraCore_CrsMatrix_NonlocalAfterResume_MPI_4 (Failed)
	 47 - TpetraCore_CrsMatrix_WithGraph_OpenMP_MPI_4 (Failed)
	 48 - TpetraCore_CrsMatrix_ReplaceDomainMapAndImporter_MPI_4 (Failed)
	 50 - TpetraCore_CrsMatrix_NonlocalSumInto_Ignore_MPI_4 (Failed)
	 63 - TpetraCore_Albany182_MPI_4 (Failed)
	 68 - TpetraCore_Issue1752_MPI_2 (Failed)
	 83 - TpetraCore_ImportExport2_UnitTests_MPI_4 (Failed)
	 85 - TpetraCore_MatrixMarket_Tpetra_CrsMatrix_InOutTest_MPI_4 (Failed)
	 87 - TpetraCore_MatrixMarket_Operator_Test_MPI_4 (Failed)
	116 - TpetraCore_MatrixMatrix_UnitTests_MPI_4 (Failed)
	117 - TpetraCore_AddProfiling_UnitTests_MPI_4 (Failed)
	156 - TpetraCore_lesson05_redistribution_MPI_4 (Failed)
	159 - TpetraCore_FEMAssembly_InsertGlobalIndicesDP_MPI_4 (Failed)
	160 - TpetraCore_FEMAssembly_LocalElementLoopDP_MPI_4 (Failed)
	161 - TpetraCore_FEMAssembly_TotalElementLoopDP_MPI_4 (Failed)
	162 - TpetraCore_FEMAssembly_TotalElementLoopSP_MPI_4 (Failed)

@mhoemmen
Copy link
Contributor

@crtrott wrote:

mostly due to changes in deep_copy for copying views of unequal dimensions

Just curious -- what happened there? Was Tpetra copying a lot of Views of unequal dimensions?

@crtrott
Copy link
Member Author

crtrott commented May 11, 2018

@mhoemmen Yeah apparently. Its not fixed yet though but you get nice runtime messages like:

11:  p=0: *** Caught standard std::exception of type 'std::runtime_error' :
11:  
11:   Deprecation Error: Kokkos::deep_copy extents of views don't match: C2(1,1) A_orig(1,2) 
11:   Traceback functionality not available
11:   
11:  [FAILED]  (0.00184 sec) Blas_longlong_Gemm_UnitTest
11:  Location: /home/crtrott/Trilinos/packages/tpetra/core/test/Blas/gemm.cpp:380

@crtrott
Copy link
Member Author

crtrott commented May 11, 2018

Sacado is fixed for compile time stuff but got some runtime errors:

97% tests passed, 10 tests failed out of 296

Label Time Summary:
Sacado    =  28.03 sec (296 tests)

Total Test time (real) =  28.15 sec

The following tests FAILED:
	  8 - Sacado_Fad_Kokkos_CommTests_Serial_MPI_4 (Failed)
	  9 - Sacado_FadKokkosTests_Serial_MPI_1 (Failed)
	 10 - Sacado_FadKokkosTests_NoViewSpec_Serial_MPI_1 (Failed)
	 11 - Sacado_FadFadKokkosTests_Serial_MPI_1 (Failed)
	 12 - Sacado_Fad_Kokkos_CommTests_OpenMP_MPI_4 (Failed)
	 13 - Sacado_FadKokkosTests_OpenMP_MPI_1 (Failed)
	 14 - Sacado_FadKokkosTests_NoViewSpec_OpenMP_MPI_1 (Failed)
	 15 - Sacado_FadFadKokkosTests_OpenMP_MPI_1 (Failed)
	 16 - Sacado_ViewFactoryTests_MPI_1 (Failed)
	296 - Sacado_view_example_MPI_1 (Failed)

Looks like this might be stuff like view constructors given more arguments than the rank is large etc.

@ndellingwood
Copy link
Contributor

ndellingwood commented May 14, 2018

Linking separately filed Trilinos issues related to changes for deprecated code:
Intrepid2 #2719
Seacas #2720
Sacado #2717 (closed)
Stokhos #2718 (closed)
Tpetra #2705, #2707 (closed)

@ndellingwood
Copy link
Contributor

I'll take care of the ShyLU changes, looks like straightfoward sed scripting to replace dimension_ with extent(...) and ptr_on_device with data.

@mperego
Copy link
Contributor

mperego commented May 16, 2018

I've pushed a few changes to several packages:
Intrepid2, Seacas, ShyLU, Amesos2, MiniTensor, Zoltan2.
Here are some of the current issues:

ShyLU_Node:
2 - ShyLU_NodeTacho_Tacho_TestUtil_MPI_1 (Failed)
issue:
2: what(): Kokkos allocation "TaskQueue" is being deallocated after Kokkos::finalize was called

Phalanx:
13 - Phalanx_tKokkosPerf_MPI_1 (Failed)
14 - Phalanx_tKokkosNestedLambda_MPI_1 (Failed)
issues:
13: Kokkos::abort: Requested Team Size is too large!

14: terminate called after throwing an instance of 'std::runtime_error'
14: what(): Kokkos allocation "c" is being deallocated after Kokkos::finalize was called

@rppawlo please check if you are fine with the changes, in particular for Phalanx_MDField.hpp. We might want to remove all the dimension** functions.

Panzer:
112/156 Test #112: PanzerAdaptersSTK_tDomainInterface_MPI_2 .........................***Failed 0.34 sec
Start 113: PanzerAdaptersSTK_node_normals_MPI_2
113/156 Test #113: PanzerAdaptersSTK_node_normals_MPI_2 .............................***Failed 0.40 sec
issues:
112: Constructor for Kokkos View 'gids' has mismatched number of arguments. Number of arguments = 2 but dynamic rank = 1
113: what(): Kokkos allocation "subcellParam" is being deallocated after Kokkos::finalize was called

@srajama1
Copy link
Contributor

@kyungjoo-kim : ShyLU Tacho stuff.

@ndellingwood
Copy link
Contributor

@kyungjoo-kim It may be sufficient to place scope guards between the Kokkos::initialize() and Kokkos::finalize() calls in the unit test for ShyLU_NodeTacho_Tacho_TestUtil_MPI_1. If the "TaskQueue" is static variable or of global scope this may be causing the problem.

@rppawlo
Copy link
Contributor

rppawlo commented May 17, 2018

@mperego - Thanks for taking care of this! Much appreciated! The only issue would be in the file Phalanx_MDField.hpp. While it is ok to change the kokkos array accessor from dimension_N() to extent(N), I would prefer to keep the MDField methods named dimension_N for now. We already have an accessor for extent() defined in MDField directly about your changes. All of the functions you changed in that file should really just be deleted, but we have lots of code using it in empire, drekar, charon, etc... To rename or make those changes, I need to make the corresponding changes in the apps. I'd like to do a coordinated push to trilinos and the apps to prevent disruption. Is this ok?

@mperego
Copy link
Contributor

mperego commented May 17, 2018

@kyungjoo-kim I fixed the issue with Kokkos::finalize() for test ShyLU_NodeTacho_Tacho_TestUtil_MPI_1. It was trivial as @ndellingwood pointed out.

@mperego
Copy link
Contributor

mperego commented May 17, 2018

@rppawlo agreed.

@ndellingwood
Copy link
Contributor

I updated the unit tests and performance tests in Sacado and Stokhos, Serial build completes and tests pass with disabled deprecated code; quick check that nothing is broken with deprecated code enabled then I'll update the kokkos-promotion branch.

@ndellingwood
Copy link
Contributor

Sacado and Stokhos test updates pushed to kokkos-promotion branch.

@ndellingwood
Copy link
Contributor

@mhoemmen I tried a build with deprecated code disabled using the kokkos-promotion branch and kokkos and kokkos-kernels develop branches (gcc/4.8.4 + openmpi/1.10.1, OpenMP and Serial backends enabled) on kokkos-dev and ran into errors like this:

In file included from /home/ndellin/trilinos-ssh/Trilinos/packages/tpetra/classic/NodeAPI/Kokkos_DefaultNode.cpp:42:0:
/home/ndellin/trilinos-ssh/Trilinos/packages/tpetra/classic/NodeAPI/Kokkos_DefaultNode.hpp:88:33: error: ‘KokkosOpenMPWrapperNode’ is not a member of ‘Kokkos::Compat’
   extern template Teuchos::RCP< ::Kokkos::Compat::KokkosOpenMPWrapperNode> getNode< ::Kokkos::Compat::KokkosOpenMPWrapperNode> (const Teuchos::RCP<Teuchos::ParameterList>& );

I explicitly enabled both OpenMP and Serial backends for Tpetra during configuration:

  -D Tpetra_INST_SERIAL:BOOL=ON \
  -D Tpetra_INST_OPENMP:BOOL=ON \

Any suggestions what may be going wrong?

@mhoemmen
Copy link
Contributor

@ndellingwood Is KOKKOS_ENABLE_OPENMP defined?

@ndellingwood
Copy link
Contributor

@mhoemmen I was on the develop branch, sorry, getting lost in the builds I'm juggling...

@mperego
Copy link
Contributor

mperego commented May 17, 2018

@dridzal @karapeterson I disabled all Kokkos code in Intrepid. Eventually we want Intrepid to be Kokkos free, so there is no point in spending a significant amount to make it compliant with the new changes in Kokkos.

@ndellingwood
Copy link
Contributor

Phalanx fixes straightforward

  1. tKokkosNestedLambda.cpp: add scope guards between initialize/finalize calls
  2. tKokkosPerf.cpp: use AUTO() to determine team_size in the tests rather than hard-coded size of 8 to avoid a team size too large, which is no longer resized to the max team size Deprecate team_size auto adjusting to maximal value possible kokkos/kokkos#1618. @rppawlo is this change acceptable with you?

@ndellingwood
Copy link
Contributor

ndellingwood commented May 17, 2018

The Panzer test adapters-stk/test/evaluator_tests/domain_interface.cpp error occurs at line 123

  auto worksets = worksetFactory.getWorksets(worksetDescriptor,
                                             worksetNeeds);

From the message there is a call to a View ctor for 'gids' that passes more arguments than the dynamic rank...

@rppawlo
Copy link
Contributor

rppawlo commented May 17, 2018

@ndellingwood - phalanx changes are fine. Thanks!

@karapeterson
Copy link
Contributor

@mperego, makes sense - intrepid should be kokkos free. Users who want kokkos functionality should use intrepid2

@ndellingwood
Copy link
Contributor

We missed a static variable in Intrepid2 which caused the error in PanzerAdaptersSTK_node_normals_MPI_2. Adding the necessary finalize hooks fixes the problem.
Still digging into the domain_interface error...

@ndellingwood
Copy link
Contributor

@mhoemmen The Panzer issue I have been chasing down occurs due to more args being passed to a View ctor than the dynamic rank (which is now deprecated behavior resulting in a runtime error), occurring in Tpetra at line 154 of Tpetra_Details_PackTraits.hpp. The accompanying comment states that it is intentional based on the prior View ctor behavior:

    // This exploits the fact that Kokkos::View's constructor ignores
    // size arguments beyond what the View's type specifies.  For
    // value_type = Stokhos::UQ::PCE<S>, numValuesPerScalar returns
    // something other than 1, and the constructor will actually use
    // that value.

The assumption in the first line no longer holds except in cases where the specialize trait is non-void (e.g. when constructing View's of FadType), which may be when Stokhos makes use of this case.
I added the following change which fixed the issue and seems correct since numVals appears to be used only in the special case for Stokhos based on the comment:

    if ( std::is_same< typename view_type::traits::specialize, void >::value ) {
      return view_type (label, static_cast<size_type> (numEnt) );
    } 
    else {
      return view_type (label, static_cast<size_type> (numEnt), numVals);
    }

I'll test more fully, if tests pass are you comfortable with me pushing this into the kokkos-promotion branch?

@mhoemmen
Copy link
Contributor

@ndellingwood I'm OK with this, as long as it compiles -- thanks!

@tjfulle may find this of interest too.

@mhoemmen
Copy link
Contributor

@ndellingwood Also please correct the comment as appropriate if you have a chance; thanks! :D

@ndellingwood
Copy link
Contributor

With gcc/4.8.4 + openmpi/1.10.1 (to match Trilinos' tested versions) and serial + openmp backends enabled these tests compiled and passed:

100% tests passed, 0 tests failed out of 656

Label Time Summary:
Intrepid2    =  80.31 sec (240 tests)
Panzer       = 894.23 sec (117 tests)
Phalanx      =   7.73 sec (22 tests)
Sacado       =  43.02 sec (277 tests)

I'll update the comment and push :)

@ndellingwood
Copy link
Contributor

ndellingwood commented May 18, 2018

@rppawlo Does the file in phalanx/src/deprecated/Phalanx_KokkosUtilities.cpp need to be maintained? I can update for the deprecated changes to initialize/finalize calls if so, otherwise I'll leave it as is.

@rppawlo
Copy link
Contributor

rppawlo commented May 18, 2018

@ndellingwood - that directory is not compiled into the code. You do not have to update/edit those files. This code is an alternate implementation of some phalanx objects.

@ndellingwood
Copy link
Contributor

Fixes have been pushed to kokkos-promotion branch detected during first round of integration testing, starting the next round shortly.

@ndellingwood
Copy link
Contributor

Integration testing successfully complete with deprecated code enabled.

@ndellingwood
Copy link
Contributor

Added more Trilinos fixes to the kokkos-promotion branch for bugs found testing with deprecated code disabled with a cuda+serial build using the white integration script with -D KOKKOS_ENABLE_DEPRECATED_CODE:BOOL=OFF \ added to the script.
Along with the ROL fix and TrilinosCouplings fix, 100% build :)

@ndellingwood
Copy link
Contributor

PR #2822 submitted.

@ndellingwood
Copy link
Contributor

PR #2822 merged, thanks all for your help and contributing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants