
Commit

Merge Pull Request #12793 from trilinos/Trilinos/master_merge_20240301_175926

Automatically Merged using Trilinos Master Merge AutoTester
PR Title: b'Trilinos Master Merge PR Generator: Auto PR created to promote from master_merge_20240301_175926 branch to master'
PR Author: trilinos-autotester
trilinos-autotester authored Mar 2, 2024
2 parents a791e19 + a469209 commit bee69a6
Showing 123 changed files with 3,242 additions and 1,482 deletions.
20 changes: 19 additions & 1 deletion RELEASE_NOTES
@@ -1,75 +1,93 @@

###############################################################################
# #
# Trilinos Release 15.1 Release Notes TBD, 2024 #
# Trilinos Release 15.1.0 Release Notes February 26, 2024 #
# #
###############################################################################

Amesos2

- The interface to SuperLU_DIST now also works for the CUDA-enabled
variant of the library.
https://github.com/trilinos/Trilinos/pull/12524


Framework

- Began using semantic versioning for Trilinos with 15.1.0 release.


Ifpack2

- BlockRelaxation can now generate blocks using Zoltan2.
https://github.com/trilinos/Trilinos/pull/12728


Kokkos & Kokkos Kernels

- Inclusion of version 4.2.1 of Kokkos and Kokkos Kernels
https://github.com/trilinos/Trilinos/pull/12707


MueLu

- The reformulated Maxwell solver (RefMaxwell) was generalized to
also work for grad-div / Darcy flow problems.
https://github.com/trilinos/Trilinos/pull/12142

- In an effort to consolidate the old non-Kokkos code path with the
newer Kokkos code path, the following factories were deprecated
and should be removed from input decks: NullspaceFactory_kokkos,
SaPFactory_kokkos, UncoupledAggregationFactory_kokkos.
https://github.com/trilinos/Trilinos/pull/12720
https://github.com/trilinos/Trilinos/pull/12740


Panzer

- MiniEM can now also assemble and solve Darcy problems using first
or higher order mixed finite elements.
https://github.com/trilinos/Trilinos/pull/12142


PyTrilinos2

- New package that auto-generates Python interfaces for Trilinos
packages. Currently, most of Tpetra is exposed. We are planning on
adding other packages.
https://github.com/trilinos/Trilinos/pull/12332


ROL

- An auto-generated Python interface was added. A standalone Python
package can be downloaded from rol.sandia.gov
https://github.com/trilinos/Trilinos/pull/12770


Teko

- Block Jacobi and Gauss-Seidel methods now allow specifying
preconditioners for the iterative solves of the diagonal blocks.
https://github.com/trilinos/Trilinos/pull/12675


Tpetra

- Tpetra will now assume by default that the MPI library is GPU
aware, unless automatic detection or the user indicates otherwise.
https://github.com/trilinos/Trilinos/pull/12517

- Reject unrecognized TPETRA_* environment variables. Misspelled or
removed environment variables are no longer silently ignored.
https://github.com/trilinos/Trilinos/pull/12722

- In order to allocate in shared host/device space (i.e.
CudaUVMSpace, HIPManagedSpace or SYCLSharedUSMSpace) by default,
please use the CMake options
KokkosKernels_INST_MEMSPACE_CUDAUVMSPACE=ON
Tpetra_ALLOCATE_IN_SHARED_SPACE=ON
https://github.com/trilinos/Trilinos/pull/12622


###############################################################################
6 changes: 3 additions & 3 deletions Version.cmake
@@ -59,10 +59,10 @@
# for release mode and set the version.
#

SET(Trilinos_VERSION 15.1)
SET(Trilinos_VERSION 15.2.0)
SET(Trilinos_MAJOR_VERSION 15)
SET(Trilinos_MAJOR_MINOR_VERSION 150100)
SET(Trilinos_VERSION_STRING "15.1 (Dev)")
SET(Trilinos_MAJOR_MINOR_VERSION 150200)
SET(Trilinos_VERSION_STRING "15.2.0-dev")
SET(Trilinos_ENABLE_DEVELOPMENT_MODE_DEFAULT ON) # Change to 'OFF' for a release

# Used by testing scripts and should not be used elsewhere
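
The two Trilinos_MAJOR_MINOR_VERSION values in this hunk (150100 and 150200) are consistent with packing the version as major * 10000 + minor * 100. The C++ sketch below illustrates that reading only. The encoding is inferred from these two values, and the helper names `encodeMajorMinor` and `isAtLeast` are hypothetical, not part of Trilinos.

```cpp
#include <cassert>

// Assumed encoding, inferred only from the two values in this diff:
// 15.1 -> 150100 and 15.2 -> 150200. The last two digits stay at zero.
int encodeMajorMinor(int major, int minor) {
  assert(major >= 0 && minor >= 0 && minor < 100);
  return major * 10000 + minor * 100;
}

// Hypothetical helper: true when an encoded version is at least major.minor.
bool isAtLeast(int encoded, int major, int minor) {
  return encoded >= encodeMajorMinor(major, minor);
}
```

Under this assumption a single integer comparison is enough to order versions, which is the usual reason for carrying such a field next to the human-readable string.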
4 changes: 2 additions & 2 deletions packages/framework/ini-files/config-specs.ini
@@ -2409,9 +2409,9 @@ use CUDA11-RUN-SERIAL-TESTS
opt-set-cmake-var ROL_example_PinT_parabolic-control_AugmentedSystem_test_MPI_2_DISABLE BOOL FORCE : ON


[rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_all]
[rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_pr]
use rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables
use PACKAGE-ENABLES|ALL
use PACKAGE-ENABLES|PR
opt-set-cmake-var Trilinos_ENABLE_TESTS BOOL FORCE : OFF


2 changes: 1 addition & 1 deletion packages/ifpack2/src/Ifpack2_BlockTriDiContainer_def.hpp
@@ -187,7 +187,7 @@ namespace Ifpack2 {
const bool useSeqMethod = false;
const bool overlapCommAndComp = false;
initInternal(matrix, importer, overlapCommAndComp, useSeqMethod);
n_subparts_per_part_ = 1;
n_subparts_per_part_ = -1;
IFPACK2_BLOCKHELPER_TIMER_FENCE(typename BlockHelperDetails::ImplType<MatrixType>::execution_space)
}

112 changes: 98 additions & 14 deletions packages/ifpack2/src/Ifpack2_BlockTriDiContainer_impl.hpp
@@ -842,22 +842,83 @@ namespace Ifpack2 {
return Teuchos::null;
}

template<typename local_ordinal_type>
local_ordinal_type costTRSM(const local_ordinal_type block_size) {
return block_size*block_size;
}

template<typename local_ordinal_type>
local_ordinal_type costGEMV(const local_ordinal_type block_size) {
return 2*block_size*block_size;
}

template<typename local_ordinal_type>
local_ordinal_type costTriDiagSolve(const local_ordinal_type subline_length, const local_ordinal_type block_size) {
return 2 * subline_length * costTRSM(block_size) + 2 * (subline_length-1) * costGEMV(block_size);
}

template<typename local_ordinal_type>
local_ordinal_type costSolveSchur(const local_ordinal_type num_parts,
const local_ordinal_type num_teams,
const local_ordinal_type line_length,
const local_ordinal_type block_size,
const local_ordinal_type n_subparts_per_part) {
const local_ordinal_type subline_length = ceil(double(line_length - (n_subparts_per_part-1) * 2) / n_subparts_per_part);
if (subline_length < 1) {
return INT_MAX;
}

const local_ordinal_type p_n_lines = ceil(double(num_parts)/num_teams);
const local_ordinal_type p_n_sublines = ceil(double(n_subparts_per_part)*num_parts/num_teams);
const local_ordinal_type p_n_sublines_2 = ceil(double(n_subparts_per_part-1)*num_parts/num_teams);

const local_ordinal_type p_costApplyE = p_n_sublines_2 * subline_length * 2 * costGEMV(block_size);
const local_ordinal_type p_costApplyS = p_n_lines * costTriDiagSolve((n_subparts_per_part-1)*2,block_size);
const local_ordinal_type p_costApplyAinv = p_n_sublines * costTriDiagSolve(subline_length,block_size);
const local_ordinal_type p_costApplyC = p_n_sublines_2 * 2 * costGEMV(block_size);

if (n_subparts_per_part == 1) {
return p_costApplyAinv;
}
return p_costApplyE + p_costApplyS + p_costApplyAinv + p_costApplyC;
}

template<typename local_ordinal_type>
local_ordinal_type getAutomaticNSubparts(const local_ordinal_type num_parts,
const local_ordinal_type num_teams,
const local_ordinal_type line_length,
const local_ordinal_type block_size) {
local_ordinal_type n_subparts_per_part_0 = 1;
local_ordinal_type flop_0 = costSolveSchur(num_parts, num_teams, line_length, block_size, n_subparts_per_part_0);
local_ordinal_type flop_1 = costSolveSchur(num_parts, num_teams, line_length, block_size, n_subparts_per_part_0+1);
while (flop_0 > flop_1) {
flop_0 = flop_1;
flop_1 = costSolveSchur(num_parts, num_teams, line_length, block_size, (++n_subparts_per_part_0)+1);
}
return n_subparts_per_part_0;
}

template<typename ArgActiveExecutionMemorySpace>
struct SolveTridiagsDefaultModeAndAlgo;

///
/// setup part interface using the container partitions array
///
template<typename MatrixType>
BlockHelperDetails::PartInterface<MatrixType>
createPartInterface(const Teuchos::RCP<const typename BlockHelperDetails::ImplType<MatrixType>::tpetra_block_crs_matrix_type> &A,
const Teuchos::Array<Teuchos::Array<typename BlockHelperDetails::ImplType<MatrixType>::local_ordinal_type> > &partitions,
const typename BlockHelperDetails::ImplType<MatrixType>::local_ordinal_type n_subparts_per_part) {
const typename BlockHelperDetails::ImplType<MatrixType>::local_ordinal_type n_subparts_per_part_in) {
IFPACK2_BLOCKHELPER_TIMER("createPartInterface");
using impl_type = BlockHelperDetails::ImplType<MatrixType>;
using local_ordinal_type = typename impl_type::local_ordinal_type;
using local_ordinal_type_1d_view = typename impl_type::local_ordinal_type_1d_view;
using local_ordinal_type_2d_view = typename impl_type::local_ordinal_type_2d_view;
using size_type = typename impl_type::size_type;

const auto blocksize = A->getBlockSize();
constexpr int vector_length = impl_type::vector_length;
constexpr int internal_vector_length = impl_type::internal_vector_length;

const auto comm = A->getRowMap()->getComm();
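
The cost model added in this hunk can be tried outside Ifpack2. The sketch below restates the same formulas with plain `int` in place of `local_ordinal_type` and folds the repeated `ceil(double(...))` expressions into a helper. `ceilDiv` is a hypothetical name introduced here, and the `assert` precondition is an addition that the diff does not contain.

```cpp
#include <cassert>
#include <climits>
#include <cmath>

// Cost of one triangular solve with a block_size x block_size block.
int costTRSM(int block_size) { return block_size * block_size; }

// Cost of one block matrix-vector product.
int costGEMV(int block_size) { return 2 * block_size * block_size; }

// Cost of a block tridiagonal solve over `length` blocks.
int costTriDiagSolve(int length, int block_size) {
  return 2 * length * costTRSM(block_size) + 2 * (length - 1) * costGEMV(block_size);
}

// Hypothetical helper: ceiling of a / b, computed in floating point as in the diff.
int ceilDiv(double a, double b) { return static_cast<int>(std::ceil(a / b)); }

// Estimated per-team cost of the Schur-based solve for a given split.
int costSolveSchur(int num_parts, int num_teams, int line_length,
                   int block_size, int n_subparts_per_part) {
  assert(num_teams > 0 && n_subparts_per_part >= 1);
  const int subline_length =
      ceilDiv(line_length - (n_subparts_per_part - 1) * 2, n_subparts_per_part);
  if (subline_length < 1) return INT_MAX;  // no subline of at least one block

  const int p_n_lines = ceilDiv(num_parts, num_teams);
  const int p_n_sublines = ceilDiv(double(n_subparts_per_part) * num_parts, num_teams);
  const int p_n_sublines_2 = ceilDiv(double(n_subparts_per_part - 1) * num_parts, num_teams);

  const int costAinv = p_n_sublines * costTriDiagSolve(subline_length, block_size);
  if (n_subparts_per_part == 1) return costAinv;

  const int costE = p_n_sublines_2 * subline_length * 2 * costGEMV(block_size);
  const int costS = p_n_lines * costTriDiagSolve((n_subparts_per_part - 1) * 2, block_size);
  const int costC = p_n_sublines_2 * 2 * costGEMV(block_size);
  return costE + costS + costAinv + costC;
}

// Increase the number of subparts while the estimated cost keeps dropping.
int getAutomaticNSubparts(int num_parts, int num_teams, int line_length, int block_size) {
  int n = 1;
  int flop_0 = costSolveSchur(num_parts, num_teams, line_length, block_size, n);
  int flop_1 = costSolveSchur(num_parts, num_teams, line_length, block_size, n + 1);
  while (flop_0 > flop_1) {
    flop_0 = flop_1;
    flop_1 = costSolveSchur(num_parts, num_teams, line_length, block_size, (++n) + 1);
  }
  return n;
}
```

With one line of 10 blocks, block size 2 and 4 teams, this sketch selects 3 subparts, because splitting gives work to teams that would otherwise sit idle. With 4 such lines and 4 teams every team already has a line, the extra Schur work is not repaid, and the result stays at 1. The search stops at the first split that fails to lower the estimate, so it finds a local minimum of the model rather than a guaranteed global one.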

@@ -867,6 +928,40 @@ namespace Ifpack2 {
const local_ordinal_type A_n_lclrows = A->getLocalNumRows();
const local_ordinal_type nparts = jacobi ? A_n_lclrows : partitions.size();

typedef std::pair<local_ordinal_type,local_ordinal_type> size_idx_pair_type;
std::vector<size_idx_pair_type> partsz(nparts);

if (!jacobi) {
for (local_ordinal_type i=0;i<nparts;++i)
partsz[i] = size_idx_pair_type(partitions[i].size(), i);
std::sort(partsz.begin(), partsz.end(),
[] (const size_idx_pair_type& x, const size_idx_pair_type& y) {
return x.first > y.first;
});
}

local_ordinal_type n_subparts_per_part;
if (n_subparts_per_part_in == -1) {
// If the number of subparts is set to -1, the user lets the algorithm
// decide the value automatically
using execution_space = typename impl_type::execution_space;

const int line_length = partsz[0].first;

const local_ordinal_type team_size =
SolveTridiagsDefaultModeAndAlgo<typename execution_space::memory_space>::
recommended_team_size(blocksize, vector_length, internal_vector_length);

const local_ordinal_type num_teams = execution_space().concurrency() / (team_size * vector_length);

n_subparts_per_part = getAutomaticNSubparts(nparts, num_teams, line_length, blocksize);

printf("Automatically chosen n_subparts_per_part = %d for nparts = %d, num_teams = %d, team_size = %d, line_length = %d, and blocksize = %d;\n", n_subparts_per_part, nparts, num_teams, team_size, line_length, blocksize);
}
else {
n_subparts_per_part = n_subparts_per_part_in;
}

// Total number of sub lines:
const local_ordinal_type n_sub_parts = nparts * n_subparts_per_part;
// Total number of sub lines + the Schur complement blocks.
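
The hunk above sorts the parts by size before the automatic choice so that `partsz[0].first` is the longest line, and it treats `-1` as "choose automatically". The sketch below shows both steps in isolation. The function names `sortPartsBySize`, `longestPart` and `resolveNSubparts` are hypothetical, `std::stable_sort` replaces the diff's `std::sort` so that ties keep a predictable order, and the `assert` checks are additions that the diff does not contain.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using SizeIdxPair = std::pair<int, int>;  // (part size, original part index)

// Returns (size, index) pairs ordered from the largest part to the smallest.
std::vector<SizeIdxPair> sortPartsBySize(const std::vector<std::vector<int>>& partitions) {
  std::vector<SizeIdxPair> partsz(partitions.size());
  for (std::size_t i = 0; i < partitions.size(); ++i)
    partsz[i] = SizeIdxPair(static_cast<int>(partitions[i].size()), static_cast<int>(i));
  // stable_sort (an assumption of this sketch) keeps equal-sized parts in input order.
  std::stable_sort(partsz.begin(), partsz.end(),
                   [](const SizeIdxPair& x, const SizeIdxPair& y) { return x.first > y.first; });
  return partsz;
}

// Length of the longest part; requires at least one part.
int longestPart(const std::vector<SizeIdxPair>& sorted) {
  assert(!sorted.empty());
  return sorted.front().first;
}

// -1 requests the automatic value; any other request is used unchanged.
int resolveNSubparts(int requested, int automaticChoice) {
  assert(requested == -1 || requested >= 1);
  return requested == -1 ? automaticChoice : requested;
}
```

The second element of each pair is what the later reordering step reads back as the permutation `p[i] = partsz[i].second`.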
@@ -896,14 +991,6 @@ namespace Ifpack2 {
// reorder parts to maximize simd packing efficiency
p.resize(nparts);

typedef std::pair<local_ordinal_type,local_ordinal_type> size_idx_pair_type;
std::vector<size_idx_pair_type> partsz(nparts);
for (local_ordinal_type i=0;i<nparts;++i)
partsz[i] = size_idx_pair_type(partitions[i].size(), i);
std::sort(partsz.begin(), partsz.end(),
[] (const size_idx_pair_type& x, const size_idx_pair_type& y) {
return x.first > y.first;
});
for (local_ordinal_type i=0;i<nparts;++i)
p[i] = partsz[i].second;

@@ -2074,9 +2161,6 @@ namespace Ifpack2 {
};
#endif

template<typename ArgActiveExecutionMemorySpace>
struct SolveTridiagsDefaultModeAndAlgo;

template<typename impl_type, typename WWViewType>
KOKKOS_INLINE_FUNCTION
void
@@ -3251,7 +3335,7 @@ namespace Ifpack2 {

{
#ifdef IFPACK2_BLOCKTRIDICONTAINER_USE_PRINTF
printf("Star ComputeSchurTag\n");
printf("Start ComputeSchurTag\n");
#endif
IFPACK2_BLOCKHELPER_TIMER("BlockTriDi::NumericPhase::ComputeSchurTag");
writeBTDValuesToFile(part2packrowidx0_sub.extent(0), scalar_values_schur, "before_schur.mm");
@@ -3270,7 +3354,7 @@ namespace Ifpack2 {

{
#ifdef IFPACK2_BLOCKTRIDICONTAINER_USE_PRINTF
printf("Star FactorizeSchurTag\n");
printf("Start FactorizeSchurTag\n");
#endif
IFPACK2_BLOCKHELPER_TIMER("BlockTriDi::NumericPhase::FactorizeSchurTag");
Kokkos::TeamPolicy<execution_space,FactorizeSchurTag>
4 changes: 3 additions & 1 deletion packages/ifpack2/src/Ifpack2_SparseContainer_decl.hpp
@@ -172,7 +172,7 @@ class SparseContainer
using inverse_mv_type = Tpetra::MultiVector<InverseScalar, InverseLocalOrdinal, InverseGlobalOrdinal, InverseNode>;
using InverseCrs = Tpetra::CrsMatrix<InverseScalar, InverseLocalOrdinal, InverseGlobalOrdinal, InverseNode>;
using InverseMap = typename Tpetra::Map<InverseLocalOrdinal, InverseGlobalOrdinal, InverseNode>;

using InverseGraph = typename InverseCrs::crs_graph_type;
using typename Container<MatrixType>::HostView;
using typename Container<MatrixType>::ConstHostView;
using HostViewInverse = typename inverse_mv_type::dual_view_type::t_host;
@@ -287,6 +287,8 @@ class SparseContainer

//! Extract the submatrices identified by the local indices set by the constructor.
void extract ();
void extractGraph ();
void extractValues ();

/// \brief Post-permutation, post-view version of apply().
///