qr_mumps
Users Guide
Table of Contents
1 Introduction
qr_mumps
is a software package for the solution of sparse, linear
systems on multicore computers. It implements a direct solution
method based on the \(QR\) of Cholesky factorization of the input
matrix. It is suited to solving sparse least-squares problems
\(\min_x\|Ax-b\|_2\), to computing the minimum-norm solution of
sparse, underdetermined problems and to the solution of sparse
symmetric positive definite linear systems. It can obviously be used
for solving unsymmetric square problems in which case the stability
provided by the use of orthogonal transformations comes at the cost
of a higher operation count with respect to solvers based on, e.g.,
the \(LU\) factorization such as MUMPS. qr_mumps
supports real and
complex, single or double precision arithmetic.
As in all the sparse, direct solvers, the solution is achieved in three distinct phases:
- Analysis
- in this phase an analysis of the structural properties of the input matrix is performed in preparation for the numerical factorization phase. This includes computing a column permutation which reduces the amount of fill-in coefficients (i.e., nonzeroes introduced by the factorization). This step does not perform any floating-point operation and is, thus, commonly much faster than the factorization and solve (depending on the number of right-hand sides) phases.
- Factorization
- at this step, the actual \(QR\) or Cholesky factorization is computed. This step is the most computationally intense and, therefore, the most time consuming.
- Solution
- once the factorization is done, the factors can be
used to compute the solution of the problem through
two operations:
- Solve
- this operation computes the solution of the triangular system \(Rx=b\) or \(R^Tx=b\);
- Apply
- this operation applies the \(Q\) orthogonal matrix to a vector, i.e., \(y = Qx\) or \(y = Q^Tx\).
These three steps have to be done in order but each of them can be performed multiple times. If, for example, the problem has to be solved against multiple right-hand sides (not all available at once), the analysis and factorization can be done only once while the solution is repeated for each right-hand side. By the same token, if the coefficients of a matrix are updated but not its structure, the analysis can be performed only once for multiple factorization and solution steps.
qr_mumps
is based on the multifrontal factorization method. This
method was first introduced by Duff and Reid dure:83 as a
method for the factorization of sparse, symmetric linear systems and,
since then, has been the object of numerous studies and the method of
choice for several, high-performance, software packages such as
MUMPS adkl:00 and UMFPACK davis:04. The method
used in qr_mumps
is described in full details
in b:13,a.b.g.l:16.
qr_mumps
is built upon the large knowledge base and know-how
developed by the members of the MUMPS project. However, qr_mumps
does
not share any code with the MUMPS package and it is a completely
independent software. qr_mumps
is developed and maintained in a
collaborative effort by the APO team at the IRIT laboratory in
Toulouse and the LaBRI laboratory in Bordeaux, France.
2 Features
2.1 Types of problems
qr_mumps
can handle unsymmetric and symmetric, positive definite
problems. In the first case it will use a \(QR\) factorization
whereas, in the second, it will use a Cholesky factorization. In
order to choose one or the other method, qr_mumps
must be informed
about the type of the problem through the sym
field of the
zqrm_spmat_type
data structure: 0 means that the problem is
unsymmetric and >0 means symmetric, positive definite. Note that in
the second case, only half of the matrix must be provided, i.e., if
the coefficient \((i,j)\) is provided \((j,i)\) must not be given.
2.2 Multithreading
qr_mumps
is a parallel, multithreaded software based on the StarPU
runtime system a.t.n.w:11 and it currently supports multicore or,
more generally, shared memory multiprocessor computers. qr_mumps
does
not run on distributed memory (e.g. clusters) parallel
computers. Parallelism is achieved through a decomposition of the
workload into fine-grained computational tasks which basically
correspond to the execution of a BLAS or LAPACK operation on a
blocks. It is strongly recommended to use sequential BLAS and LAPACK
libraries and let qr_mumps
have full control of the parallelism.
The number of threads/cores used by qr_mumps
can be controlled in two
different ways:
- by setting
QRM_NCPU
environment variable to the desired number of threads. In this case the number of threads will be the same throughout the execution of your program/application; - through the
qrm_init
. This method has higher priority than theQRM_NCPU
environment variable.
The granularity of the tasks is controlled by the qrm_mb_
and
qrm_nb_
parameters (see Section qrm_set
) which set the block
size for partitioning internal data. Smaller values mean more
parallelism; however, because this blocking factor is an upper-bound
for the granularity of operations (or, more precisely for the
granularity of calls to BLAS and LAPACK routines), it is recommended
to choose reasonably large values in order to achieve high efficiency.
2.3 GPU acceleration
qr_mumps
can leverage the computing power of Nvidia GPU, commonly
available on modern supercomputing systems, to accelerate the
solution of linear systems, especially large size ones. The use of
GPUs is achieved through the StarPU runtime ag.b.g.l:15*1.
The number of GPUs used by qr_mumps
can be controlled in two
different ways:
- by setting
QRM_NGPU
environment variable to the desired number of GPUs. In this case the number of GPUs will be the same throughout the execution of your program/application; through the
qrm_init
. This method has higher priority than theQRM_NGPU
environment variable.Note that it is possible to use multiple streams per GPU; this can be controlled through the StarPU
STARPU_NWORKER_PER_CUDA
environment variable.
2.4 Fill-reducing permutations
qr_mumps
supports multiple methods for reducing the factorization
fill-in through matrix column permutations. The choice is
controlled through the qrm_ordering_
control
parameter. Nested-dissection based methods are available through
the packages Metis metis and
SCOTCH pelle:07 packages as well as average minimum degree
through the COLAMD Davis:2004:CAM:1024074.1024079
one. Nested-dissection based methods usually lead to lower fill-in
which ultimately results in faster and less memory consuming
factorization. COLAMD, instead, typically leads to a faster
execution of the analysis phase although is not as effective in
reducing the fill-in which may result in a slower and more memory
consuming factorization. Because the overall execution time is
commonly dominated by the factorization, nested-dissection methods
are usually more effective especially for large size
problems. qr_mumps
also allows the user to provide their own
permutation, as explained in Secion [[*Control parameters.
2.5 Memory consumption control
qr_mumps
allows for controlling the amount of memory used in the parallel
factorization stage. In the multifrontal method, the memory
consumption varies greatly throughout the sequential factorization
reaching a maximum value which is referred to as the sequential peak
(\(sp\)). Parallelism can considerably increase this peak because, in
order to feed the working threads, more data is allocated at the same
time which results in higher concurrency. In qr_mumps
it is possible to
bound the memory consumption of the factorization phase through the
qrm_mem_relax
parameter. If this parameter is set to a real
value \(x \ge 1\), the memory consumption will be bounded by
\(x \times sp\). Clearly, the tighter is this upper bound, the slower
the factorization will proceed. Note that \(sp\) only includes the
memory consumed by the factorization operation; moreover, although in
practice it is possible to precisely pre-compute this value in the
analysis phase, this may be expensive and thus qrm_analyse
only computes a (hopefully) slight overestimation. The value of \(sp\)
is available upon completion of the analysis phase through the
qrm_e_facto_mempeak
information parameter (see
Section 5).
2.6 Asynchronous interface
An asynchronous interface is provided for the analysis, factorization
apply and solve operations, respectively qrm_analyse_async
,
qrm_factorize_async
, qrm_apply_async
and qrm_solve_async
. These
routines are non-blocking variants of the zqrm_analyse
,
zqrm_factorize
, zqrm_apply
and zqrm_solve
; this means that they
will submit to the StarPU runtime system all the tasks the
corresponding operating is composed of and will return control to the
calling program as soon as possible. The completion of the tasks
will be achieved asynchronously and can only be ensured through a
call to the qrm_barrier
routine. This has a number of advantages;
for example it allows for executing concurrently operations that work
on different data (e.g. the factorization of different matrices) or
to pipeline the execution of operations which work on the same data
(for example factorization and solve with the same matrix), in which
case StarPU will take care of ensuring that the precedence
constraints between tasks are respected. The asynchronous routine
take an additional argument qrm_dscr
which is a communication
descriptor, i.e. a container for the submitted tasks; this has to be
initialized through the qrm_dscr_init
routine and destroyed using
the qrm_dscr_destroy
one.
Two main differences exist with respect to the synchronous interface:
- Right-hand sides must be registered to
qr_mumps
by means of thezqrm_rhs_init
routine which associates a rank-1 or rank-2 Fortran array to azqrm_rhs_type
data structure. - a communication descriptor must be initialized and passed to the operation routines: all the associated tasks will be submitted to this descriptor.
The completion of the operation can be guaranteed by calling a
qrm_barrier
routine either with the (optional) descriptor
argument, in which case the routine will wait for all the tasks in
that descriptor, or without, in which case the routine will wait for
all the previously submitted tasks in all descriptors.
There is currently no support for asynchronous execution in the
qr_mumps
C interface.
See Section 2.6 for an example.
3 API
qr_mumps
is developed in the Fortran 2008 language but includes a
portable C interface developed through the Fortran iso_c_binding
feature. Most of the qr_mumps
features are available from both
interfaces although the Fortran one takes full advantage of the
language features, such as the interface overloading, that are not
available in C. The naming convention used in qr_mumps
groups all
the routine or data type names into two families depending on
whether they depend on the arithmetic or not. Typed names always
begin by _qrm_
where the first underscore becomes d
, s
, z
,
c
for real double, real single, complex double or complex single
arithmetic, respectively. Untyped names, instead, simply begin by
qrm_
. Note that thanks to interface overloading in Fortran all the
typed interfaces of a routine can be conveniently grouped into a
single untyped one; this is described in details in
Section 3.4. All the interfaces described
in the remainder of this section are for the complex, double
precision case. The interfaces for complex single, real double and
real single can be obtained by replacing zqrm
with cqrm
, dqrm
and sqrm
, respectively and complex(real64)
with
complex(real32)
, real(real64)
, real(real32)
,
respectively. Note that real64
and real32
are defined in the
iso_fortran_env
Fortran 2008 module and correspond to kind(1.d0)
and kind(1.e0)
or 8
and 4
, respectively on basically all
common compilers/architectures. All the routines that take vectors
as input (e.g., zqrm_apply
) can be called with either one vector
(i.e. a rank-1 Fortran array x(:)
) or multiple ones (i.e., a
rank-2 Fortran array x(:,:)
) through the same interface thanks to
interface overloading. This is not possible for the C interface, in
which case an extra argument is present in order to specify the
number of vectors which are expected to be stored in column-major
(i.e., Fortran style) format.
In this section only the Fortran API is presented. For each Fortran
name (either of a routine or of a data type) the corresponding C
name is obtained by adding the _c
suffix. The number, type and
order of arguments in the C routines is the same except for those
routines that take dense vectors in which case, the C interface
needs an extra argument specifying the number of vectors passed
trough the same pointer. The user can refer to the code examples and
to the zqrm_mumps.h
file for the full details of the C interface.
3.1 Data types
3.1.1 zqrm_spmat_type
This data type is used to store a sparse matrix in the COO
(or
coordinate) format through the irn
, jcn
and
val
fields containing the row indices, column indices and
values, respectively and the m
, n
and nz
containing the number of rows, columns and nonzeroes,
respectively. qr_mumps
uses a Fortran-style 1-based numbering and thus
all row indices are expected to be between 1 and m
and all
the column indices between 1 and n
. Duplicate entries are
summed during the factorization, out-of-bound entries are
ignored. The sym
field is used to specify if the matrix is
symmetric (\(>0\)) or unsymmetric (\(=0\)).
type zqrm_spmat_type integer :: m integer :: n integer :: nz integer :: sym integer, pointer :: irn(:) integer, pointer :: jcn(:) complex(real64), pointer :: val(:) end type zqrm_spmat_type
3.1.2 zqrm_spfct_type
This type is used to set the parameters that control the behavior of a sparse factorization, to collect information about its execution (number of flops, memory consumpnion etc) and store the result of the factorization, namely, the factors with all the symbolic information needed to use them in the solve phase.
type zqrm_spfct_type integer :: icntl(:) real(real32) :: rcntl(:) integer(int64) :: gstats(:) integer, pointer :: cperm_in(:) end type zqrm_spfct_type
cperm_in
: this array can be used to provide a matrix column permutation and is only accessed byqr_mumps
in this case.icntl
: this array contains all the integer control parameters. Its content can be modified either directly or indirectly through theqrm_set
routine (see Section 3.3.3).gstats
: this array contains all the statistics collected byqr_mumps
. Its content can be accessed either directly or indirectly through theqrm_get
routine (see Section 3.3.4).
3.2 Computational routines
3.2.1 zqrm_analyse
This routine performs the analysis phase (see Section 1) on \(A\) or \(A^T\).
interface qrm_analyse subroutine zqrm_analyse(qrm_spmat, qrm_spfct, transp, info) type(zqrm_spmat_type) :: qrm_spmat type(zqrm_spfct_type) :: qrm_spfct character, optional :: transp integer, optional :: info end subroutine zqrm_analyse end interface qrm_analyse
Arguments:
qrm_spmat
: the input matrix ofzqrm_spmat_type
.qrm_spfct
: the sparse factorization object ofzqrm_spfct_type
.transp
: whether the input matrix should be transposed or not. Can be either't'
('c'
in in complex arithmetic) or'n'
. In the Fortran interface this parameter is optional and set by default to'n'
if not passed.info
: an optional output parameter that returns the exit status of the routine.
3.2.2 zqrm_factorize
This routine performs the factorization phase (see Section 1 ) on \(A\) or \(A^T\). It can only be executed if the analysis is already done.
interface qrm_factorize subroutine zqrm_factorize(qrm_spmat, qrm_spfct, transp, info) type(zqrm_spmat_type) :: qrm_spmat type(zqrm_spfct_type) :: qrm_spfct character, optional :: transp integer, optional :: info end subroutine zqrm_factorize end interface qrm_factorize
Arguments:
qrm_spmat
: the input matrix ofzqrm_spmat_type
.qrm_spfct
: the sparse factorization object ofzqrm_spfct_type
.transp
: whether the input matrix should be transposed or not. Can be either't'
('c'
in in complex arithmetic) or'n'
. In the Fortran interface this parameter is optional and set by default to'n'
if not passed.info
: an optional output parameter that returns the exit status of the routine.
3.2.3 zqrm_apply
This routine computes \(b=Q\cdot b\) or \(b=Q^T\cdot b\). It can only be executed once the factorization is done.
interface zqrm_apply subroutine zqrm_apply2d(qrm_spfct, transp, b, info) type(zqrm_spfct_type) :: qrm_spfct complex(real64) :: b(:,:) character(len=*) :: transp integer, optional :: info end subroutine zqrm_apply2d subroutine zqrm_apply1d(qrm_spfct, transp, b, info) type(zqrm_spfct_type) :: qrm_spfct complex(real64) :: b(:) character(len=*) :: transp integer, optional :: info end subroutine zqrm_apply1d end interface zqrm_apply
Arguments:
qrm_spfct
: the sparse factorization object resulting fromzqrm_factorize
transp
: whether to apply \(Q\) or \(Q^T\). Can be either't'
('c'
in complex arithmetic) or'n'
.b
: the \(b\) vector(s) to which \(Q\) or \(Q^T\) is applied.info
: an optional output parameter that returns the exit status of the routine.
3.2.4 zqrm_solve
This routine solves the triangular system \(R\cdot x = b\) or \(R^T\cdot x =b\). It can only be executed once the factorization is done.
interface zqrm_solve subroutine zqrm_solve2d(qrm_spfct, transp, b, x, info) type(zqrm_spfct_type) :: qrm_spfct complex(real64) :: b(:,:) complex(real64) :: x(:,:) character(len=*) :: transp integer, optional :: info end subroutine zqrm_solve2d subroutine zqrm_solve1d(qrm_spfct, transp, b, x, info) type(zqrm_spfct_type) :: qrm_spfct complex(real64) :: b(:) complex(real64) :: x(:) character(len=*) :: transp integer, optional :: info end subroutine zqrm_solve1d end interface zqrm_solve
Arguments:
qrm_spfct
: the sparse factorization object resulting fromzqrm_factorize
transp
: whether to solve for \(R\) or \(R^T\). Can be either't'
('c'
in complex arithmetic) or'n'
.b
: the \(b\) right-hand side(s).x
: the \(x\) solution vector(s).info
: an optional output parameter that returns the exit status of the routine.
3.2.5 zqrm_least_squares
This subroutine can be used to solve a linear least squares problem \(\min_x\|Ax-b\|_2\) in the case where the input matrix is square or overdetermined. It is a shortcut for the sequence
call zqrm_analyse(qrm_spmat, qrm_spfct, 'n', info) call zqrm_factorize(qrm_spmat, qrm_spfct, 'n', info) call zqrm_apply(qrm_spfct, 'c', b, info) call zqrm_solve(qrm_spfct, 'n', b, x, info)
Note that \(A^T\) can be used instead of \(A\), in which case \(A^T\) must be overdetermined.
interface qrm_least_squares subroutine zqrm_least_squares2d(qrm_spmat, b, x, transp, cperm, info) type(zqrm_spmat_type) :: qrm_spmat complex(real64) :: b(:,:), x(:,:) character, optional :: transp integer, optional :: cperm(:) integer, optional :: info end subroutine zqrm_least_squares2d subroutine zqrm_least_squares1d(qrm_spmat, b, x, transp, cperm, info) type(zqrm_spmat_type) :: qrm_spmat complex(real64) :: b(:), x(:) character, optional :: transp integer, optional :: cperm(:) integer, optional :: info end subroutine zqrm_least_squares1d end interface qrm_least_squares
Arguments:
qrm_spmat
: the input matrix.b
: the \(b\) right-hand side(s). Will be modified.x
: the \(x\) solution vector(s).transp
: whether to use \(A\) or \(A^T\).cperm
: an optional integer array to pass a fill-reducing matrix column permutation.info
: an optional output parameter that returns the exit status of the routine.
3.2.6 zqrm_min_norm
This subroutine can be used to solve a linear minimum norm problem in the case where the input matrix is square or underdetermined. It is a shortcut for the sequence
call qrm_analyse(qrm_spmat, 'c', info) call qrm_factorize(qrm_spmat, 'c', info) call qrm_solve(qrm_spmat, 'c', b, x, info) call qrm_apply(qrm_spmat, 'n', x, info)
Note that \(A^T\) can be used instead of \(A\), in which case \(A^T\) must be underdetermined.
interface qrm_min_norm subroutine zqrm_min_norm2d(qrm_spmat, b, x, transp, cperm, info) type(zqrm_spmat_type) :: qrm_spmat complex(real64) :: b(:,:), x(:,:) character, optional :: transp integer, optional :: cperm(:) integer, optional :: info end subroutine zqrm_min_norm2d subroutine zqrm_min_norm1d(qrm_spmat, b, x, transp, cperm, info) type(zqrm_spmat_type) :: qrm_spmat complex(real64) :: b(:), x(:) character, optional :: transp integer, optional :: cperm(:) integer, optional :: info end subroutine zqrm_min_norm1d end interface qrm_min_norm
Arguments:
qrm_spmat
: the input matrix.b
: the \(b\) right-hand side(s).x
: the \(x\) solution vector(s).transp
: whether to use \(A\) or \(A^T\)cperm
: an optional integer array to pass a fill-reducing matrix row permutation (i.e., a column permutation for \(A^T\)).info
: an optional output parameter that returns the exit status of the routine.
3.2.7 zqrm_spposv
This subroutine can be used to solve a linear symmetric, positive definite problem. It is a shortcut for the sequence
x = b call qrm_analyse(qrm_spmat, 'n', info) call qrm_factorize(qrm_spmat, 'n', info) call qrm_solve(qrm_spmat, 'c', x, b, info) call qrm_solve(qrm_spmat, 'n', b, x, info)
interface qrm_min_norm subroutine zqrm_min_norm2d(qrm_spmat, b, x, cperm, info) type(zqrm_spmat_type) :: qrm_spmat complex(real64) :: b(:,:), x(:,:) integer, optional :: cperm(:) integer, optional :: info end subroutine zqrm_min_norm2d subroutine zqrm_min_norm1d(qrm_spmat, b, x, cperm, info) type(zqrm_spmat_type) :: qrm_spmat complex(real64) :: b(:), x(:) integer, optional :: cperm(:) integer, optional :: info end subroutine zqrm_min_norm1d end interface qrm_min_norm
Arguments:
qrm_spmat
: the input matrix.b
: the \(b\) right-hand side(s). Will be modified.x
: the \(x\) solution vector(s).cperm
: an optional integer array to pass a fill-reducing matrix row permutation (i.e., a column permutation for \(A^T\)).info
: an optional output parameter that returns the exit status of the routine.
3.2.8 zqrm_spmat_mv
This subroutine performs a matrix-vector product of the type \(y =\alpha Ax + \beta y\) or \(y =\alpha A^Tx + \beta y\).
interface zqrm_spmat_mv subroutine zqrm_spmat_mv_2d(qrm_spmat, transp, alpha, x, beta, y) type(zqrm_spmat_type) :: qrm_spmat complex(real64), :: y(:,:) complex(real64), :: x(:,:) complex(real64), :: alpha, beta character(len=*) :: transp end subroutine zqrm_spmat_mv_2d subroutine zqrm_spmat_mv_1d(qrm_spmat, transp, alpha, x, beta, y) type(zqrm_spmat_type) :: qrm_spmat complex(real64), :: y(:) complex(real64), :: x(:) complex(real64), :: alpha, beta character(len=*) :: transp end subroutine zqrm_spmat_mv_1d end interface zqrm_spmat_mv
Arguments:
qrm_spmat
: the input matrix.transp
: whether to multiply by \(A\) or \(A^T\). Can be either't'
('c'
if in complex arithmetic) or'n'
.alpha
,beta
the \(\alpha\) and \(\beta\) scalarsy
: the \(y\) vector(s).x
: the \(x\) vector(s).
3.2.9 qrm_spmat_nrm
This routine computes the one \(\|A\|_1\) or the infinity \(\|A\|_\infty\) or the Frobenius \(\|A\|_F\) norm of a matrix.
interface qrm_spmat_nrm subroutine zqrm_spmat_nrm(qrm_spmat, ntype, nrm, info) type(zqrm_spmat_type) :: qrm_spmat real(real64) :: nrm character :: ntype integer, optional :: info end subroutine zqrm_spmat_nrm end interface qrm_spmat_nrm
Arguments:
qrm_spmat
: the input matrix.ntype
: the type of norm to be computed. It can be either'i'
,'1'
or'f'
for the infinity, one and Frobenius norms, respectively.nrm
: the computed norm.info
: an optional output parameter that returns the exit status of the routine.
3.2.10 zqrm_vecnrm
This routine computes the one-norm \(\|x\|_1\), the infinity-norm \(\|x\|_\infty\) or the two-norm \(\|x\|_2\) of a vector.
interface qrm_vecnrm subroutine zqrm_vecnrm2d(vec, n, ntype, nrm, info) complex(real64) :: vec(:,:) real(real64) :: nrm(:) integer :: n character :: ntype integer, optional :: info end subroutine zqrm_vecnrm2d subroutine zqrm_vecnrm1d(vec, n, ntype, nrm, info) complex(real64) :: vec(:) real(real64) :: nrm integer :: n character :: ntype integer, optional :: info end subroutine zqrm_vecnrm1d end interface qrm_vecnrm
Arguments:
x
: the \(x\) vector(s).n
: the size of the vector.ntype
: the type of norm to be computed. It can be either'i'
,'1'
or'2'
for the infinity, one and two norms, respectively.nrm
the computed norm(s). Ifx
is a rank-2 array (i.e., a multivector) this argument has to be a rank-1 arraynrm(:)
and each of its elements will contain the norm of the corresponding column ofx
.info
: an optional output parameter that returns the exit status of the routine.
3.2.11 qrm_residual_norm
This routine computes the scaled norm of the residual \(\frac{\|b-Ax\|_\infty}{\|b\|_\infty + \|x\|_\infty\|A\|_\infty}\), i.e., the normwise backward error. It is a shortcut for the sequence
call qrm_vecnrm(b, qrm_spmat%m, 'i', nrmb) call qrm_vecnrm(x, qrm_spmat%n, 'i', nrmx) call qrm_spmat_mv(qrm_spmat, 'n', -1, x, 1, b) call qrm_spmat_nrm(qrm_spmat, 'i', nrma) call qrm_vecnrm(b, qrm_spmat%m, 'i', nrmr) nrm = nrmr/(nrmb+nrma*nrmx)
Note that \(A^T\) can be used instead of \(A\).
interface qrm_residual_norm subroutine zqrm_residual_norm2d(qrm_spmat, b, x, nrm, transp, info) real(real64) :: nrm(:) type(zqrm_spmat_type) :: qrm_spmat complex(real64) :: b(:,:), x(:,:) character, optional :: transp integer, optional :: info end subroutine zqrm_residual_norm2d subroutine zqrm_residual_norm1d(qrm_spmat, b, x, nrm, transp, info) real(real64) :: nrm type(zqrm_spmat_type) :: qrm_spmat complex(real64) :: b(:), x(:) character, optional :: transp integer, optional :: info end subroutine zqrm_residual_norm1d end interface qrm_residual_norm
Arguments:
qrm_spmat
: the input matrix.b
: the \(b\) right-hand side(s). On output this argument contains the residual.x
: the \(x\) solution vector(s).nrm
the scaled residual norm. This argument is of typereal
for single precision arithmetic (both real and complex) andreal(kind(1.d0))
for double precision ones (both real and complex). Ifx
andb
are rank-2 arrays (i.e., multivectors) this argument has to be a rank-1 arraynrm(:)
and each coefficient will contain the scaled norm of the residual for the corresponding column ofx
andb
.transp
: whether to use \(A\) or \(A^T\)info
: an optional output parameter that returns the exit status of the routine.
3.2.12 qrm_residual_orth
Computes the quantity \(\frac{\|A^Tr\|_2}{\|r\|_2}\) which can be used to evaluate the quality of the solution of a least squares problem (see bjor:96, page 34). It is a shortcut for the sequence
call qrm_spmat_mv(qrm_spmat, 'c', 1, r, 0, atr) call qrm_vecnrm(r, qrm_spmat%m, '2', nrmr) call qrm_vecnrm(atr, qrm_spmat%n, '2', nrm) nrm = nrm/nrmr
Note that \(A^T\) can be used instead of \(A\).
interface qrm_residual_orth subroutine zqrm_residual_orth2d(qrm_spmat, r, nrm, transp, info) real(real64) :: nrm(:) type(zqrm_spmat_type) :: qrm_spmat complex(real64) :: r(:,:) character, optional :: transp integer, optional :: info end subroutine zqrm_residual_orth2d subroutine zqrm_residual_orth1d(qrm_spmat, r, nrm, transp, info) real(real64) :: nrm type(zqrm_spmat_type) :: qrm_spmat complex(real64) :: r(:) character, optional :: transp integer, optional :: info end subroutine zqrm_residual_orth1d end interface qrm_residual_orth
Arguments:
qrm_spmat
: the input problem.r
: the \(r\) residual(s).nrm
the scaled \(A^Tr\) norm. This argument is of typereal(real64)
for double precision arithmetic (both real and complex) andreal(real32)
for single precision ones (both real and complex). Ifr
is a rank-2 array (i.e., a multivector) this argument has to be a rank-1 arraynrm(:)
and each coefficient will contain the scaled norm of \(A^Tr\) for the corresponding column ofr
.transp
: whether to use \(A\) or \(A^T\)info
: an optional output parameter that returns the exit status of the routine.
3.3 Management routines
3.3.1 qrm_init
This routine initializes qr_mumps
and should be called prior to any other
qr_mumps
routine.
subroutine qrm_init(ncpu, ngpu, info) integer, optional :: ncpu, ngpu, info end subroutine qrm_init
Arguments:
ncpu
: an optional input parameter that sets the number of working threads. If not specified, theQRM_NCPU
is used.ngpu
: an optional input parameter that sets the number of working threads. If not specified, theQRM_NGPU
is used.info
: an optional output parameter that returns the exit status of the routine.
3.3.2 qrm_finalize
This routine finalizes qr_mumps
and no other qr_mumps
routine should
be called afterwards.
subroutine qrm_finalize() end subroutine qrm_finalize
3.3.3 qrm_set
This family of routines is used to set control parameters that define
the behavior of qr_mumps
; it is possible to set default parameters
which are applied to all the following factorizations or the
parameters of a specific sparse factorization object of type
zqrm_spfct_type
. In the Fortran API the qrm_set
interfaces
overloads all of them. These control parameters are explained in full
details in Section 4.
interface qrm_set subroutine qrm_glob_set_i4(string, ival, info) character(len=*) :: string integer :: ival integer, optional :: info end subroutine qrm_glob_set_i4 subroutine qrm_glob_set_r4(string, rval, info) character(len=*) :: string real(real32) :: rval integer, optional :: info end subroutine qrm_glob_set_r4 subroutine zqrm_spfct_set_i4(qrm_spfct, string, ival, info) type(zqrm_spfct_type) :: qrm_spfct character(len=*) :: string integer :: ival integer, optional :: info end subroutine zqrm_spfct_set_i4 subroutine zqrm_spfct_set_r4(qrm_spfct, string, rval, info) type(zqrm_spfct_type) :: qrm_spfct character(len=*) :: string real(real32) :: rval integer, optional :: info subroutine zqrm_spfct_set_r4 end interface qrm_set
Arguments:
qrm_fct
: the sparse factorization object.string
: a string describing the parameter to be set (see Section 4 for a full list).val
: the parameter value.info
: an optional output parameter that returns the exit status of the routine.
3.3.4 qrm_get
This family of routines can be used to get the value of a control
parameter or the get the value of information collected by
qr_mumps
during the execution (see Section 4 for a full list).
interface qrm_get subroutine qrm_glob_get_i4(string, ival, info) character(len=*) :: string integer :: ival integer, optional :: info end subroutine qrm_glob_get_i4 subroutine qrm_glob_get_i8(string, iival, info) character(len=*) :: string integer(int64) :: iival integer, optional :: info end subroutine qrm_glob_get_i8 subroutine qrm_glob_get_r4(string, rval, info) character(len=*) :: string real(real32) :: rval integer, optional :: info end subroutine qrm_glob_get_r4 subroutine zqrm_spfct_get_i4(qrm_spfct, string, ival, info) type(zqrm_spfct_type) :: qrm_spfct character(len=*) :: string integer :: ival integer, optional :: info end subroutine zqrm_spfct_get_i4 subroutine zqrm_spfct_get_i8(qrm_spfct, string, ival, info) type(zqrm_spfct_type) :: qrm_spfct character(len=* ) :: string integer(int64) :: ival integer, optional :: info end subroutine zqrm_spfct_get_i8 end interface qrm_get
Arguments:
3.3.5 zqrm_spmat_init
This routine initializes a zqrm_spmat_type
data structure. This
amounts to nullifying all the pointers and setting the rest of the
data fileds to 0.
interface qrm_spmat_init subroutine zqrm_spmat_init(qrm_spmat, info) type(zqrm_spmat_type) :: qrm_spmat integer, optional :: info end subroutine zqrm_spmat_init end interface qrm_spmat_init
Arguments:
qrm_spmat
: the matrix to be initialized.info
: an optional output parameter that returns the exit status of the routine.
3.3.6 zqrm_spfct_init
This routine initializes a zqrm_spfct_type
data structure. This
amounts to setting all the control parameters to the default values.
interface qrm_spfct_init subroutine zqrm_spfct_init(qrm_spfct, qrm_spmat, info) type(zqrm_spfct_type) :: qrm_spfct type(zqrm_spmat_type) :: qrm_spmat integer, optional :: info end subroutine zqrm_spfct_init end interface qrm_spfct_init
Arguments:
qrm_spfct
: the sparse factorization object to be initialized.info
: an optional output parameter that returns the exit status of the routine.
3.3.7 zqrm_spfct_destroy
This routine cleans up a zqrm_spfct_type
data structure by
deleting the result of a sparse factorization.
interface qrm_spfct_destroy subroutine zqrm_spfct_destroy(qrm_spfct, info) type(zqrm_spfct_type) :: qrm_spfct integer, optional :: info end subroutine zqrm_spfct_destroy end interface qrm_spfct_destroy
Arguments:
qrm_spfct
: the sparse factorization object to be destroyed.info
: an optional output parameter that returns the exit status of the routine.
3.3.8 qrm_alloc
and qrm_dealloc
These routines are used to allocate and deallocate Fortran
pointers
or allocatables
. They're essentially wrappers around
the Fortran allocate
function and they're mostly used internally
by qr_mumps
too keep track of the amount of memory
allocated. Input pointers and allocatables can be either 1D or 2D,
integer, real or complex, single precision or double precision
(all of these are available regardless of the arithmetic with
which qr_mumps
has been compiled). For the sake of brevity, only
the interface of the 1D and 2D, double precision, complex versions is
given below.
interface qrm_alloc subroutine qrm_aalloc_z(a, m, info) complex(real64), allocatable :: a(:) integer :: m integer, optional :: info end subroutine qrm_aalloc_z subroutine qrm_aalloc_2z(a, m, n, info) complex(real64), allocatable :: a(:,:) integer :: m, n integer, optional :: info end subroutine qrm_aalloc_2z subroutine qrm_palloc_z(a, m, info) complex(real64), pointer :: a(:) integer :: m integer, optional :: info end subroutine qrm_palloc_z subroutine qrm_palloc_2z(a, m, n, info) complex(real64), pointer :: a(:,:) integer :: m, n integer, optional :: info end subroutine qrm_palloc_2z end interface qrm_alloc interface qrm_dealloc subroutine qrm_adealloc_z(a, info) complex(real64), allocatable :: a(:) integer, optional :: info end subroutine qrm_adealloc_z subroutine qrm_adealloc_2z(a, info) complex(real64), allocatable :: a(:,:) integer, optional :: info end subroutine qrm_adealloc_2z subroutine qrm_pdealloc_z(a, info) complex(real64), pointer :: a(:) integer, optional :: info end subroutine qrm_pdealloc_z subroutine qrm_pdealloc_2z(a, info) complex(real64), pointer :: a(:,:) integer, optional :: info end subroutine qrm_pdealloc_2z end interface qrm_dealloc
Arguments:
a
: the input 1D or 2D pointer or allocatable array.m
: the row size.n
: the column size.info
: an optional output parameter that returns the exit status of the routine.
3.4 Interface overloading
The interface overloading feature of the Fortran language is heavily
used inside qr_mumps
. First of all, all the typed routines of the
type _qrm_xyz
are overloaded with a generic qrm_xyz
interface. This means that, for example, a call to the qrm_factorize
routine will result in a call to sqrm_factorize
or as a call to
dqrm_factorize
depending on whether the input matrix is of type
sqrm_spmat_type
or dqrm_spmat_type
, respectively (i.e., single or
double precision real, respectively). As said in
Sections 3.3.3 and 3.3.4, the qrm_set
and qrm_get
interfaces overload the routines in the corresponding families and the
same holds for the allocation/deallocation routines (see
Section 3.3.8). The advantages of the
overloading are obvious. Take the following example:
type(sqrm_spmat_type) :: qrm_spmat real(real32), allocatable :: b(:), x(:) ! initialize the control data structure. call qrm_spmat_init(qrm_spmat) ... ! allocate arrays for the input matrix call qrm_alloc(qrm_spmat%irn, nz) call qrm_alloc(qrm_spmat%jcn, nz) call qrm_alloc(qrm_spmat%val, nz) call qrm_alloc(b, m) call qrm_alloc(x, n) ! initialize the data ... ! solve the problem call qrm_least_squares(qrm_spmat, b, x) ...
In case the user wants to switch to double precision, only the declarations on the first two lines have to be modified and the rest of the code stays unchanged.
4 Control parameters
Control parameters define the behavior of qr_mumps
and can be set in
two modes:
- global mode: in this mode it possible to either set generic
parameters (e.g., the unit for output or error messages) or default
parameter values (e.g., the ordering method to be used on the
problem) that apply to all
zqrm_spfct_type
objects that are successively initialized through thezqrm_spfct_init
routine. - problem mode: these parameters control the behavior of
qr_mumps
on a specific sparse factorization problem. Because thezqrm_spfct_init
routine sets the control parameters to their default values, these have to be modified after the sparse factorization object initialization.
All the control parameters can be set through the qrm_set
routine
(see the interface in Section 3.3); problem
specific control parameters can also be set by manually changing the
coefficients of the qrm_spfct_type%icntl
array (note the
underscore in this case); alternatively, the default values of all
parameters can be set through the corresponding environment
variables:
type(zqrm_spmat_type) :: qrm_spmat type(zqrm_spfct_type) :: qrm_spfct1, qrm_spfct2 ! set default ordering method to scotch call qrm_set('qrm_ordering', qrm_scotch_) call qrm_spfct_init(qrm_spfct1, qrm_spmat) call qrm_spfct_init(qrm_spfct2, qrm_spmat) ! set the block size to 256 only for qrm_spfct2 call qrm_set(qrm_spfct2, 'qrm_mb', 256) ! set the block size to 128 only for qrm_spfct1 qrm_spfct1%icntl(qrm_mb_) = 128 ...
Here is a list of the parameters, their meaning and the accepted
values; for each parameter omp_param
a corresponding OMP_PARAM
environment variable exists which can be used to set its default
value.
qrm_ncpu
(global, int32):val
is an integer specifying the number of CPU cores to use for the subsequentqr_mumps
calls. This has to be set prior to the call toqrm_init
. This value can also be set either through theQRM_NCPU
environment variable (lowest priority) or passed directly as an argument to theqrm_init
routine (highest priority). Default is 1. This is a global parameter and cannot be set for a specific problem only.qrm_ngpu
(global, int32):val
is an integer specifying the number of GPUs to use for the subsequentqr_mumps
calls. This has to be set prior to the call toqrm_init
. This value can also be set either through theQRM_NGPU
environment variable (lowest priority) or passed directly as an argument to theqrm_init
routine (highest priority. Default is 0. This is a global parameter and cannot be set for a specific problem only.qrm_ounit
(global, int32):val
is an integer specifying the unit for output messages; if negative, output messages are suppressed. Default is 6. This is a global parameter and cannot be set for a specific problem only.qrm_eunit
(global, int32):val
is an integer specifying the unit for error messages; if negative, error messages are suppressed. Default is 0. This is a global parameter and cannot be set for a specific problem only.qrm_ordering
(both, int32): this parameter specifies what permutation to apply to the columns of the input matrix in order to reduce the fill-in and, consequently, the operation count of the factorization and solve phases. This parameter is used byqr_mumps
during the analysis phase and, therefore, has to be set before it starts. The following pre-defined values are accepted:qrm_auto_
: the choice is automatically made byqr_mumps
. This is the default.qrm_natural_
: no permutation is applied.qrm_given_
: a column permutation is provided by the user through theqrm_spmat_type%cperm_in
.qrm_colamd_
: the COLAMD software package (if installed) is used for computing the column permutation.qrm_scotch_
: the SCOTCH software package (if installed) is used for computing the column permutation.qrm_metis_
: the Metis software package (if installed) is used for computing the column permutation.
qrm_keeph
(both, int32): this parameter says whether the \(H\) matrix should be kept for later use or discarded. This parameter is used byqr_mumps
during the factorization phase and, therefore, has to be set before it starts. Accepted value are:qrm_yes_
: the \(H\) matrix is kept. This is the default.qrm_no_
: the \(H\) matrix is discarded.
qrm_mb
andqrm_nb
(both, int32): These parameters define the block-size (rows and columns, respectively) for data partitioning and, thus, granularity of parallel tasks. Smaller values mean higher concurrence. This parameter, however, implicitly defines an upper bound for the granularity of call to BLAS and LAPACK routines (defined by theqrm_ib
parameter described below); therefore, excessively small values may result in poor performance. This parameter is used byqr_mumps
during the analysis and factorization phases and, therefore, has to be set before these start. The default value is 256 for both. Note thatqrm_mb
has to be a multiple ofqrm_nb
.qrm_ib
(both, int32): this parameter defines the granularity of BLAS/LAPACK operations. Larger values mean better efficiency but imply more fill-in and thus more flops and memory consumption (please refer to a.b.g.l:16 for more details). The value of this parameter is upper-bounded by theqrm_nb
parameter described above. This parameter is used byqr_mumps
during the factorization phase and, therefore, has to be set before it starts. The default value is 32. It is strongly advised to choose, for this parameter, a submultiple ofqrm_nb
.qrm_bh
(both, int32): this parameter defines the type of algorithm for the communication-avoiding QR factorization of frontal matrices (see the details in a.b.g.l:16. Smaller values mean more concurrency but worse tasks efficiency; if lower or equal to zero the largest possible value is chosen for each front. Default value is -1.qrm_rhsnb
(both, int32): in the case where multiple right-hand sides are passed to theqrm_apply
or theqrm_solve
routines, this parameter can be used to define a blocking of the right-hand sides. This parameter is used byqr_mumps
during the solve phase and, therefore, has to be set before it starts. By default, all the right-hand sides are treated in a single block.qrm_pinth
(both, int32): an integer value to control memory pinning when GPUs are used: all frontal matrices whose size (min(rows,cols)) is bigger than this value will be pinned.qrm_mem_relax
(both, real32): areal(real32)
value (\(>=1\)) that sets a relaxation parameter, with respect to the sequential peak, for the memory consumption in the factorization phase. If negative, the memory consumption is not bounded. Default value is \(-1.0\). See Section 2.5 for the details of this feature.qrm_rd_eps
(both, int32): areal(real32)
value setting a threshold to estimate the rank of the problem. If \(>0\) thezqrm_factorize
routine will count the number of diagonal coefficients of the \(R\) factor whose absolute value is smaller than the provided value. This number can be retrieved through theqrm_rd_num
information parameter described in the next section.
5 Information parameters
Information parameters return information about the behavior of
qr_mumps
and can be either global or problem specific.
All the information parameters can be gotten through the qrm_get
routine (see the interface in Section 3.3.4); problem
specific control parameters can also be retrieved by manually reading the
coefficients of the qrm_spfct_type%gstats
array.
The qrm_get
routine can also be used to retrieve the values
of all the control parameters described in the previous section with
the obvious usage.
qrm_max_mem
(global, int64): this parameter, of typeinteger(int64)
returns the maximum amount of memory allocated byqr_mumps
during its execution.qrm_tot_mem
(global, int64): this parameter, of typeinteger(int64)
returns the total amount of memory allocated byqr_mumps
at the moment when theqrm_get
routine is called.qrm_e_nnz_r
(local, int64): this parameter, of typeinteger(int64)
returns an estimate, computed during the analysis phase, of the number of nonzero coefficients in the \(R\) factor. This value is only available after theqrm_analyse
routine is executed.qrm_e_nnz_h
(local, int64): this parameter, of typeinteger(int64)
returns an estimate, computed during the analysis phase, of the number of nonzero coefficients in the \(H\) matrix. This value is only available after theqrm_analyse
routine is executed.qrm_e_facto_flops
(local, int64): this parameter, of typeinteger(int64)
returns an estimate, computed during the analysis phase, of the number of floating point operations performed during the factorization phase. This value is only available after theqrm_analyse
routine is executed.qrm_nnz_r
(local, int64): this parameter, of typeinteger(int64)
returns the actual number of the nonzero coefficients in the \(R\) factor after the factorization is done. This value is only available after theqrm_factorize
routine is executed.qrm_nnz_h
(local, int64): this parameter, of typeinteger(int64)
returns the actual number of the nonzero coefficients in the \(H\) matrix after the factorization is done. This value is only available after theqrm_factorize
routine is executed.qrm_e_facto_mempeak
(local, int64): this parameter, of typeinteger(int64)
returns an estimate of the peak memory consumption of the factorization operation.qrm_rd_num
(local, int32): this information parameter returns the number of diagonal coefficients of the \(R\) factor whose absolute value is lower thanqrm_rd_eps
if this control parameter was set to a value greater than 0.
6 Error handling
Most qr_mumps
routines have an optional argument info
(which is
always last) that returns the exit status. If the routine succeeded
info
will be equal to 0 otherwise it will have a positive value. A
message will be printed on the qrm_eunit
unit (see
Section 4 upon occurrence of an error. A
list of error codes:
- 1: The provided sparse matrix format is not supported.
- 3:
qrm_spfct%cntl
is invalid. - 4: Trying to allocate an already allocated allocatable or pointer.
- 5-6: Memory allocation problem.
- 8: Input column permutation not provided or invalid.
- 9: The requested ordering method is unknown.
- 10: Internal error: insufficient size for array .
- 11: Internal error: Error in lapack routine.
- 12: Internal error: out of memory.
- 13: The analysis must be done before the factorization.
- 14: The factorization must be done before the solve.
- 15: This type of norm is not implemented.
- 16: Requested ordering method not available (i.e., has not been installed).
- 17: Internal error: error from call to subroutine…
- 18: An error has occured in a call to COLAMD.
- 19: An error has occured in a call to SCOTCH.
- 20: An error has occured in a call to Metis.
- 23: Incorrect argument to
qrm_set=/=qrm_get
. - 25: Internal error: problem opening file.
- 27: Incompatible values in
qrm_spfct%icntl
. - 28: Incorrect value for
qrm_mb_=/=qrm_nb_=/=qrm_ib_
. - 29: Incorrect value for
qrm_spmat%m=/=n=/=nz
. - 30:
qrm_apply
cannot be called if the H matrix is discarded. - 31: StarPU initialization error.
- 32: Matrix is rank deficient.
7 Examples
7.1 Standard interface
The code below shows a basic example program that allocates and fills up a sparse matrix, runs the analysis, factorization and solve on it, computes the solution backward error and finally prints some information collected during the process.
program zqrm_example use zqrm_mod implicit none type(zqrm_spmat_type) :: qrm_spmat complex(r64), allocatable :: b(:), x(:), r(:), xe(:) integer :: info real(r64) :: anrm, bnrm, xnrm, rnrm, onrm, fnrm call qrm_init() ! initialize the matrix data structure. call qrm_spmat_init(qrm_spmat) call qrm_spmat_alloc(qrm_spmat, 13, 7, 5, 'coo') qrm_spmat%irn = (/1, 1, 1, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7/) qrm_spmat%jcn = (/1, 3, 5, 2, 3, 5, 1, 4, 4, 5, 2, 1, 3/) qrm_spmat%val = (/1.d0, 2.d0, 3.d0, 1.d0, 1.d0, 2.d0, 4.d0, & 1.d0, 5.d0, 1.d0, 3.d0, 6.d0, 1.d0/) call qrm_alloc(b, qrm_spmat%m, info) call qrm_alloc(r, qrm_spmat%m, info) call qrm_alloc(x, qrm_spmat%n, info) call qrm_alloc(xe, qrm_spmat%n, info) b = (/22.d0, 5.d0, 13.d0, 8.d0, 25.d0, 5.d0, 9.d0/) xe = (/1.d0, 2.d0, 3.d0, 4.d0, 5.d0/) r = b call qrm_vecnrm(b, size(b,1), '2', bnrm) call qrm_least_squares(qrm_spmat, b, x) call qrm_residual_norm(qrm_spmat, r, x, rnrm) call qrm_vecnrm(x, qrm_spmat%n, '2', xnrm) call qrm_spmat_nrm(qrm_spmat, 'f', anrm) call qrm_residual_orth(qrm_spmat, r, onrm) write(*,'("Expected result is x= 1.00000 2.00000 3.00000 4.00000 5.00000")') write(*,'("Computed result is x=",5(x,f7.5))')x xe = xe-x; call qrm_vecnrm(xe, qrm_spmat%n, '2', fnrm) write(*,'(" ")') write(*,'("Forward error norm ||xe-x|| = ",e7.2)')fnrm write(*,'("Optimality residual norm ||A^T*r|| = ",e7.2)')onrm call qrm_spmat_destroy(qrm_spmat) call qrm_dealloc(b) call qrm_dealloc(r) call qrm_dealloc(x) stop end program zqrm_example
7.2 Asynchronous interface
program zqrm_example use zqrm_mod implicit none type(zqrm_spmat_type) :: qrm_spmat type(zqrm_spfct) :: qrm_spfct type(qrm_dscr) :: qrm_dscr type(zqrm_sdata_type) :: x_rhs, b_rhs complex(r64), allocatable :: b(:), x(:), r(:), xe(:) integer :: info real(r64) :: anrm, bnrm, xnrm, rnrm, onrm, fnrm call qrm_init() ! initialize the matrix data structure. call qrm_spmat_init(qrm_spmat) call qrm_spmat_alloc(qrm_spmat, 13, 7, 5, 'coo') qrm_spmat%irn = (/1, 1, 1, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7/) qrm_spmat%jcn = (/1, 3, 5, 2, 3, 5, 1, 4, 4, 5, 2, 1, 3/) qrm_spmat%val = (/1.d0, 2.d0, 3.d0, 1.d0, 1.d0, 2.d0, 4.d0, & 1.d0, 5.d0, 1.d0, 3.d0, 6.d0, 1.d0/) call qrm_alloc(b, qrm_spmat%m, info) call qrm_alloc(r, qrm_spmat%m, info) call qrm_alloc(x, qrm_spmat%n, info) call qrm_alloc(xe, qrm_spmat%n, info) b = (/22.d0, 5.d0, 13.d0, 8.d0, 25.d0, 5.d0, 9.d0/) xe = (/1.d0, 2.d0, 3.d0, 4.d0, 5.d0/) r = b call qrm_vecnrm(b, size(b,1), '2', bnrm) ! init the sparse fato object call zqrm_spfct_init(qrm_spfct, qrm_spmat, err) ! init the descriptor call qrm_dscr_init(qrm_dscr) ! init the rhs and solution data call zqrm_sdata_init(b_rhs, b) call zqrm_sdata_init(x_rhs, x) call zqrm_analyse_async(qrm_dscr, qrm_mat, qrm_spfct) call zqrm_factorize_async(qrm_dscr, qrm_mat, qrm_spfct) call qrm_apply_async(qrm_dscr, qrm_spfct, qrm_conj_transp, b_rhs) call qrm_solve_async(qrm_dscr, qrm_spfct, qrm_no_transp, b_rhs, x_rhs) call qrm_barrier() call qrm_residual_norm(qrm_spmat, r, x, rnrm) call qrm_vecnrm(x, qrm_spmat%n, '2', xnrm) call qrm_spmat_nrm(qrm_spmat, 'f', anrm) call qrm_residual_orth(qrm_spmat, r, onrm) write(*,'("Expected result is x= 1.00000 2.00000 3.00000 4.00000 5.00000")') write(*,'("Computed result is x=",5(x,f7.5))')x xe = xe-x; call qrm_vecnrm(xe, qrm_spmat%n, '2', fnrm) write(*,'(" ")') write(*,'("Forward error norm ||xe-x|| = ",e7.2)')fnrm write(*,'("Optimality residual norm ||A^T*r|| = ",e7.2)')onrm call qrm_spmat_destroy(qrm_spmat) call qrm_spfct_destroy(qrm_spfct) call qrm_dscr_destroy(qrm_dscr) call qrm_sdata_destroy(b_rhs) call qrm_sdata_destroy(x_rhs) call qrm_dealloc(b) call qrm_dealloc(r) call qrm_dealloc(x) stop end program zqrm_example
7.3 C interface
int main(){ struct zqrm_spmat_type_c qrm_spmat; int i; double rnrm, onrm, anrm, bnrm, xnrm; int irn[13] = {1, 1, 1, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7}; int jcn[13] = {1, 3, 5, 2, 3, 5, 1, 4, 4, 5, 2, 1, 3}; double _Complex val[13] = {1.0, 2.0, 3.0, 1.0, 1.0, 2.0, 4.0, 1.0, 5.0, 1.0, 3.0, 6.0, 1.0}; double _Complex b[7] = {22.0, 5.0, 13.0, 8.0, 25.0, 5.0, 9.0}; double _Complex r[7] = {22.0, 5.0, 13.0, 8.0, 25.0, 5.0, 9.0}; double _Complex xe[5] = {1.0, 2.0, 3.0, 4.0, 5.0}; double _Complex x[5]; qrm_init_c(-1, -1); /* initialize the matrix data structure */ zqrm_spmat_init_c(&qrm_spmat); qrm_spmat.m = 7; qrm_spmat.n = 5; qrm_spmat.nz = 13; qrm_spmat.irn = irn; qrm_spmat.jcn = jcn; qrm_spmat.val = val; zqrm_least_squares_c(&qrm_spmat, b, x, 1); zqrm_residual_norm_c(&qrm_spmat, r, x, 1, &rnrm); zqrm_residual_orth_c(&qrm_spmat, r, 1, &onrm); zqrm_vecnrm_c(x, qrm_spmat.n, 1, '2', &xnrm); zqrm_vecnrm_c(b, qrm_spmat.m, 1, '2', &bnrm); zqrm_spmat_nrm_c(&qrm_spmat, 'f', &anrm); printf("Expected result is x= 1.00000 2.00000 3.00000 4.00000 5.00000\n"); printf("Computed result is x= "); for(i=0; i<5; i++){ printf("%7.5f ",creal(x[i])); x[i] -= xe[i]; } printf("\n"); zqrm_vecnrm_c(x, qrm_spmat.n, 1, '2', &xnrm); printf("Forward error ||xe-x|| = %10.5e\n",xnrm); printf("Optimality residual norm ||A^T*r|| = %10.5e\n",onrm); zqrm_spmat_destroy_c(&qrm_spmat); qrm_finalize_c(); return 0; }
8 Performance tuning
The performance of qr_mumps
depends on a number of
parameters. Default values are provided for these parameters that
are expected to achieve reasonably good performance on a wide range
of problems and architectures but for optimal performance these
should be tuned. In this section we provide a list of these
parameters and explain how do they have an effect on performance.
- Block size
qr_mumps
decomposes frontal matrices into blocks of size \(mb \times mb\) (set through theqrm_mb
control parameter described in section 4); this decomposition provides an additional level of parallelism (other than that already expressed by the elimination tree) because it is possible to execute concurrently tasks that operate on different blocks. On the one hand, small values of \(mb\) provide high parallelism; on the other hand, high values of \(mb\) provide high efficiency for each task and make the tasks scheduling overhead negligible. This parameter should be, therefore, chosen as to provide the best compromise between parallelism and tasks efficiency. The optimal value depends on the size and structure of the problem, the number and features of processing units, the efficiency and scalability of BLAS operations etc. On current CPUs block sizes of \(128\) or \(256\) achieve close to optimal task performance and good parallelism on moderately sized problems; is GPUs are used, higher block sizes (1024) provide better performance. Choosing a large \(mb\) value to achieve high performance on GPU devices can severely reduce parallelism and lead to CPU starvation. In this case the \(nb\) parameter (qrm_nb
) can be used to generate additional parallelism; if this parameter is set to a submultiple of \(mb\), the dynamic, hierarchical partitioning technique described in ag.b.g.l:15*1 is used which can lead to better performance. Finally, some tasks use an internal block size; this is set by the \(ib\) parameter (qrm_ib
which has to be a submultiple of \(mb\) and \(nb\)) and defines a compromise between efficiency of tasks and overall amount of floating point operations. Again, when GPUs are used, larger values of \(ib\) lead to better speed whereas on CPUs values of \(32/64\) provide satisfactory speed.- Reduction tree shape
- the \(bh\) parameter (
qrm_bh
) defines the shape of the reduction tree in the \(QR\) panel reduction as explained in h.l.a.d:10 or a.b.g.l:16 . A value of \(k\) means that a panel is divided in groups of size \(k\), intra-group reduction is done with a flat tree, inter-group reduction with a binary tree. Therefore, a value of one achieves the highest parallelism because the whole panel is reduced through a binary tree. Conversely a value which is equal or higher than the number of blocks in a panel leads to lower parallelism because all the blocks in the panels are reduced one after the other; a zero or negative value sets a flat tree on all panels in all fronts of the multifrontal factorization. Nevertheless it must be noted that excessively small values of \(bh\) may lead to inefficient computations because of the nature of the involved tasks. A flat tree typically achieves high performance on a wide range of problems but for very overdetermined problems it may be beneficial to use hybrid trees. - Ordering
- fill-reducing ordering is essential to limit the
fill-in produced by the factorization. This ordering
(set through the
qrm_ordering
control parameter) is computed during the analysis phase and corresponds to a matrix permutation that defines the order in which unknowns are eliminated. The ordering will also affect the shape of the elimination tree which can be more or less balanced or deep with obvious consequences on parallelism, efficiency and, ultimately, execution time. Nested Dissection g:73 methods, such as those implemented in the Metis and SCOTCH packages, usually provide the best results and their running time may be high; local orderings such as AMD/COLAMD a.d.d:04 typically have a lower running time, which results in a faster analysis step, but lead to higher fill-in and thus higher running time and memory consumption for the factorization and the solve. - GPU streams
- when GPUs are used, it can be helpful (and it
usually is) to use multiple streams per GPU to
allow a single GPU to execute multiple tasks
concurrently. Using multiple GPU streams is
especially beneficial to achieve high GPU occupancy
when a relatively small block size \(mb\) is chosen to prevent
CPU starvation. This can be controlled through the
STARPU_NWORKER_PER_CUDA
StarPU environment variable. By default one stream is active per GPU device and higher performance can be commonly achieved with values of 2 up to 20.
9 Credits
- The following people have contributed to the development of
qr_mumps
code: Emmanuel Agullo, Alfredo Buttari, Abdou Guermouche, Florent Lopez, Ian Masliah, Antoine Jego. - The design of
qr_mumps
is deeply influenced by the countless advices from Patrick Amestoy, Jean-Yves L'Excellent and Chiara Puglisi, the developers of the MUMPS and HSL MA49 solvers. - We are also grateful to the StarPU development team for their constant support in the use of this runtime system.
- Thanks Arnaud Legrand, Lucas Schnorr and Luka Stanisic for their feedback on performance analysis through SimGrid and StarVZ.
qr_mumps
was partially funded by the ANR SOLHAR and SOLHARIS projects.- The development, testing and performance evaluation of
qr_mumps
are possible thanks to the computing resources provided by the CALMIP and PlaFRIM supercomputing centers as well as those of the GENCI consortium.
10 Bibliography
Bibliography
- [dure:83] Duff & Reid, The multifrontal solution of indefinite sparse symmetric linear systems, TOMS, 9, 302-325 (1983).
- [adkl:00] Amestoy, Duff, Koster & L'Excellent, MUMPS: a general purpose distributed memory sparse solver, 122-131, in in: Proceedings of PARA2000, the Fifth International Workshop on Applied Parallel Computing, Bergen, June 18-21, edited by Gebremedhin, Manne, Moe & S\orevik, Springer-Verlag (2000)
- [davis:04] Davis, Algorithm 832: UMFPACK V4.3 -- an unsymmetric-pattern multifrontal method, TOMS, 30(2), 196-199 (2004).
- [b:13] Buttari, Fine-Grained Multithreading for the Multifrontal QR Factorization of Sparse Matrices, SIAM Journal on Scientific Computing, 35(4), C323-C345 (2013). link. doi.
- [a.b.g.l:16] Agullo, Buttari, Guermouche & Lopez, Implementing Multifrontal Sparse Solvers for Multicore Architectures with Sequential Task Flow Runtime Systems, ACM Trans. Math. Softw., 43(2), 13:1-13:22 (2016). link. doi.
- [a.t.n.w:11] Augonnet, Thibault, Namyst, & Wacrenier, StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures, Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par 2009, 23, 187-198 (2011). link. doi.
- [ag.b.g.l:15*1] Agullo, Buttari, , Guermouche & Lopez, Task-Based Multifrontal QR Solver for GPU-Accelerated Multicore Architectures., 54-63, in in: HiPC, edited by IEEE Computer Society (2015)
- [metis] Karypis & Kumar, A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs, SIAM J. Sci. Comput., 20, 359-392 (1998). link. doi.
- [pelle:07] Pellegrini, Scotch 5.0 User's guide, LaBRI, Universit\'e Bordeaux~I, (2007).
- [Davis:2004:CAM:1024074.1024079] Davis, Gilbert, Larimore & Ng, A column approximate minimum degree ordering algorithm, ACM Trans. Math. Softw., 30, 353-376 (2004). link. doi.
- [bjor:96] Bj\"orck, Numerical methods for Least Squares Problems, SIAM (1996).
- [h.l.a.d:10] "Hadri, Ltaief, Agullo, & Dongarra, Tile QR factorization with parallel panel processing for multicore architectures, 1-10, in in: "IPDPS", edited by IEEE (2010)
- [g:73] George, Nested dissection of a regular finite-element mesh, SIAM J. Numer. Anal., 10(2), 345-363 (1973). link. doi.
- [a.d.d:04] Amestoy, Davis, Duff & , Algorithm 837: AMD, an approximate minimum degree ordering algorithm, TOMS, 33(3), 381-388 (2004).