Personal tools
You are here: Home cm Wiki Strategy for Introducing Distributed Memory Multiprocessing to CMISS
Views

History for Strategy for Introducing Distributed Memory Multiprocessing to CMISS

changed:
-
Strategy for Introducing Distributed Memory Multiprocessing to CMISS

Initial draft: 13 July 2004.


* Motivation


There are two current motivations for implementing distributed memory
multiprocessing in CMISS: configuring CMISS for the Bioengineering Institute's
own future high performance computing (HPC) resources and for resources
available through e-Science.  Adding distributed memory multiprocessing to
CMISS, however, will also enable it to more fully utilize other resources
if and when they become available.


** Replacement of the Bioengineering Institute's HPC Resources

The Bioengineering Institute intends to replace its HPC resources either at
the end of this year or soon afterwards.  It appears that it may be possible
to buy significantly more computing power for the same price if we purchase a
(distributed memory) cluster of shared memory nodes (with 2 to 8 processors in
each node) than if we purchase one shared memory machine.

When designing the parallelization of algorithms, maximum efficiency (in
utilization of computational resources) is usually achieved by selecting the
parallel tasks so as to minimize communication between them.  Each parallel
task is made as large and independent as possible so that it performs as much
of the algorithm as possible before it needs to report back to or wait for
information from other parallel tasks.  For example, it is usually more
efficient to parallelize the outermost loop in a serial algorithm if each task
performed in the loop is independent.

With CMISS, the largest possible task is the processing of one command file.
The processing of one command file will be referred to as one _job_.  Jobs are
usually independent of other CMISS jobs (but there are exceptions).

With our current HPC resources, they are usually fully utilized as there are
plenty of people with plenty of independent jobs to run.  It is therefore
usually more efficient to run the non-communicating jobs in parallel, than to
run each job as several communicating parallel tasks, one job after the other.
There is little motivation for dividing jobs into parallel tasks.
Doing so could make one job complete sooner but would normally increase the
average time taken by all the jobs.  The exceptions to this are when there is
a job that is demanding such a large proportion of memory (or some other
resource of the computer) so as to hinder the performance of other jobs
running concurrently.  It is then more efficient to run the resource-hungry
job as fast as possible (by spreading its tasks across available processors)
to free up the resource for other jobs to use.

If the Institute's HPC Resources are replaced with a distributed memory
cluster, then this would provide two differences that would motivate the
division of one job into parallel tasks executed on different processors.

One difference is that there would most likely be sufficiently more processors
available in the new resource that there would be more processors available
than jobs to run.

The other more significant difference is that the distributed memory cluster
would have less memory available to each process than in a shared memory
system.  If one job requires more memory than is available one
one shared memory node then it must divide its data storage across shared
memory nodes (and attempt to divide its tasks so that they are performed on
the same node as the data is placed).


** Utilization of HPC Resources available through E-Science

Through the E-Science project, large computing resources in the United Kingdom
are available to us.  These resources include HPCx, which is a cluster of 50
IBM p690+ Regatta nodes, each with 32 1.7 GHz POWER4 processors and 32 GB of
memory.

The administrators of HPCx and the E-Science project are not so interested in
using HPCx in the most efficient way possible as they are in running the
largest jobs possible on the system.  The terms _capability_ and _capacity_
are used in describing different approaches to utilizing the system.  Running
many non-communicating jobs in parallel is said to utilize the system to its
full capacity, while running one job communicating across all processors
utilizes the system to full capability.

A speed up of 1.7 (wall-clock time) per doubling of the number of processors
used is considered a good scaling.  With such scaling, one _capability_ job
running on 1024 processors will run at 45% ((1.7/2)5) of the efficiency of 32
similar independent _capacity_ jobs each on one 32 processor node, but such
capability jobs that run on 512 or 1024 processors are encouraged and capacity
jobs that use less than 32 processors (and run for more than an hour) are
discouraged.

Utilizing HPCx to its full capability with good scaling will require
distribution of a CMISS jobs data and tasks across many shared memory nodes
while minimizing communication between the tasks.


*** HPCx Partitioning

It is interesting to note that HPCx was initially partitioned with each
Regatta divided into four logical partitions (LPARs).  Each shared memory node
then had only 8 processors and 8 GB memory.  It appears that this was done so
as to improve the bandwidth of the interconnect between the p690 machines.
Although there may have been an improvement in local memory access on each of
the LPARs, it seems that this was not sufficient to motivate the partitioning.

This is an excerpt from the IBM Redbook, ``Performance and Tuning
Considerations for the p690 in a Cluster 1600'':

``Each LPAR can have two SP Switch2 PCI Attachment Adapters (on a dual-plane SP
Switch2 system). So you can effectively increase the interconnect bandwidth
off a pSeries 690 by configuring it with multiple LPARs as compared with
leaving it in Full System Partition mode. (Note that there is a limit of 16 SP
Switch2 PCI Attachment Adapters per pSeries 690.)''

The following paragraphs are excerpts from the HPCx User Guide version 2.1:

``The introduction of the HPS switch has a number of knock-on effects. In order
to increase the throughput over the Colony switch, the HPCx phase1 system was
configured with each 32-way frame logically divided into four 8-way LPARs each
running its own copy of the AIX operating system. This allowed four pairs of
adaptors per frame to attach to the Colony network. Due to the better
specification of the HPS, dividing the frames into LPARs of 8 processors is no
longer necessary. On HPCx phase2 will be operating as a cluster of 32-way SMP
nodes, with just one copy of AIX per frame. For production sized jobs, each
32-way SMP node will be dedicated and charged to a single compute job
(exclusive access).''

``On the HPCx phase1 system, with its 8-way LPARs, each LPAR was automatically
mapped onto one of the MCMs (Multi-Chip Module) of the p690 frame. On phase2,
with its larger LPARs, user applications can now span across MCMs. It is
possible for a process (MPI task) to find itself running on one MCM and its
memory to be located in another. We therefore recommend setting memory
affinity (set the MEMORY_AFFINITY environment variable to MCM) so that AIX
attempts to allocate memory from the MCM where the process is running. This
improves locality of reference. There are also issues around placement of
processes with respect to MCMs (process affinity), which are currently under
investigation.''


* History of Multiprocessing in CMISS


** Distributed Memory Mechanics

Distributed memory multiprocessing has been implemented in CMISS previously
for finite deformation mechanics to calculate element stiffness matrix
calculations of different elements on potentially different processors.  This
parallelization sent most geometry data to all slave processes, and then a
master process sent dependent variable information for elements of its choice
to slave processes of its choice for slave processes to calculate their
assigned element stiffness matrices.  The slave processes then returned the
element stiffness matrices to the master process, which assembled them
into a global matrix and solved the system.


** OpenMP Multithreading

Current multiprocessing in CMISS requires processors with shared memory and is
implemented through OpenMP.  _Multithreading_ is probably a more accurate name
for this as the CMISS job is divided into tasks that are executed by different
threads.  All threads share the same physical memory and so there is no need
to explicitly pass data from one thread to another.  It is necessary however
to ensure that memory used by one thread to write data for its calculations is
not used to write data from other threads.

The scalability provided by the current shared memory multithreading
implementation varies from one problem type to another.  Currently users
solving mechanics problems do not see sufficient speed up and hence prefer to
run multiple jobs simultaneously.  However the cell modelling code often
shows very good speed ups.

The current implementation of IO within parallel threads is often poor. The
subroutine WRITES is usually called after placing the message in a global
buffer.  This can interfere with other threads of the same process that use
the same global buffer.  A thread-private buffer should be used instead.
Unfortunately WRITES expects the message buffer to be large even when the
message is not; this requirement should also be removed.


* Differences with Distributed Memory Multiprocessing


With distributed memory multiprocessing, all data is by default local to the
process, and so there is little risk of one processes's data begin corrupted
by another process.  Instead the issue is to get the necessary data to the
processes that need it by the time that the need it, and to get information
back without holding up other processes.


** Data and Task Distribution

Because of the architecture of clusters, data transfer between nodes is
usually significantly slower than memory access on a shared memory system, and
so minimization of data transfer becomes more critical.

All data is available to each thread in shared memory multithreading, but if
all data were made available to all processes of a distributed memory
multiprocessing job, the memory requirements would be multiplied by the number
of processes.  Instead the data distribution needs to be designed so that only
the necessary data is stored in each process.  Furthermore, if there is master
process controlling the other processes, it will often not have available
enough memory to store a master copy of all data and so will need to release
memory that it is no longer using.  This will require more dynamic memory
management policies than are currently implemented in CMISS.

Because all data is readily available to threads in shared memory
multithreading, it is easy to spawn threads when they are required for a
parallel section, then re-spawn them again for another section (although the
OpenMP implementation may keep the threads spinning as an optimization).
However, if a new process in distributed memory multiprocessing were spawned
each time a parallel section was encountered, then the input data would need
to be sent each time.  As much data does not change between parallel sections
it would be preferable to leave processes running between tasks.  (If there
are distinct parallel tasks that don't share any data, however, then it may be
reasonable to have separate processes, possibly on the same machine.)  Each
process would usually have an initialization phase, during which constant data
was set up, and then a series of computation phases, before each of which some
data is updated.

The locallization of data in distributed memory multiprocessing makes it
difficult to decide dynamically which tasks should be performed by which
process.  It will be significantly more efficient to keep using the same data
in the same process than to pass it to another process.  The decision as to
which process should perform which tasks so as to share the load evenly across
processors should be made before the data is set up if at all possible.  This
means that jobs with distributed memory multiprocessing will work much more
efficiently if they have exclusive access to a certain number of processors.
If there is another job competing for processor time with just one process of
a multiprocessing job then either all the other processes will end up waiting
for the results from that job or the tasks and associated data assigned to
that process will need to be moved to other processes.


** Designation of Single or Multi- Processing

OpenMP provides a mechanism to compile the code with and without the
parallelizations, through conditional compilation.  Distributed memory
programming standards, however, do not provide such a mechanism, and so it
will be necessary to determine a way for the code to behave differently when
using and not using distributed memory multiprocessing.  This could be
determined at run time, but may be more efficiently done at compile time by
preprocessing the source files.


* Tools


** MPI and PVM

It is difficult to compare Message Passing Interface (MPI) and Parallel
Virtual Machine (PVM) as both standards are continually developing.  A feature
that is provided by a version of one standard is often later incorporated into
a version of the other standard.  Furthermore, the actual implementations of
the APIs lag behind the standards, introducing some of the new features before
others.  However, MPI appears to be more widely used than PVM and some
advantageous features that it may have (but may have subsequently been
introduced to PVM) include non-blocking sends, predefined reduction functions,
and derived data types.

Two freely available implementations of MPI are MPICH and MPI/LAM.  These both
implement MPI 1.2 with some of the features of MPI 2.0.  Hardware vendors
often provide their own implementations of MPI and PVM.


** Mosix

Mosix can dynamically and transparently move processes from machine to machine
as loads indicate.  Openmosix (which forked from Mosix) has been ported to the
IA64 architecture but is not supported on this architecture, the developers
claiming that there was not enough support provided by the manufacturer.
There are plans to port Openmosix to the Opteron architecture.


** PETSc

The Portable, Extensible Toolkit for Scientific Computation (PETSc) provides a
suite of parallel computational tools for distributed memory architectures.
This includes solvers of many kinds but also an implementation of vectors and
matrices that shares storage across the processes.

The PETSc copyright includes the phrase: 

``Permission is hereby granted to use, reproduce, prepare
derivative works, and to redistribute to others, so long as this original
copyright notice is retained.''

It is unclear whether or not this licence would be suitable for use in CMISS.
Even if not suitable, it could be helpful to investigate the design decisions
made in PETSc as it implements many of the functions required in CMISS.

PETSc is not recommended as a linear solver for matrices from a
sequential program.  From the abstract to the PETSc Users Manual:

``PETSc should not be used to attempt to provide a parallel linear solver in
an otherwise sequential code. Certainly all parts of a previously sequential
code need not be parallelized but the matrix generation portion must be to
expect any kind of reasonable performance. Do not expect to generate your
matrix sequentially and then use PETSc to solve the linear system in
parallel.''

It is interesting to note that PETSc matrices are partitioned by contiguous
chunks of rows.  There is currently no implementation for matrices partitioned
by blocks.  It is unclear whether this is a due to ease of implementation or
to expected performance benefit.


* General Code Improvements


** Documentation of Data Usage

It would be very useful for each subroutine that is called from within a
parallel region (and for other subroutines also) to state which variables are
modified in the subroutine (i.e. which parameters are input and which are
output).  This includes not only subroutine parameters but also common blocks.


** Common Blocks

Common blocks can make it difficult to determine which variables are actually
used in subroutines.  If variables are only used and not modified this is not
so bad because the common blocks can be duplicated across all processes just
once (provided they are not too large).  If the common blocks are modified,
however, it is necessary to ensure that the other machines are updated if and
before they use the old value.  If the common blocks are modified during
computation, a more tightly controlled data storage method would facilitate
maintaining code that keeps the data synchronized.


** Memory Usage

The multi-dimensional rectangular array storage in CMISS can often result in
CMISS using much more memory and having its data much more spread out than
necessary.  The benefits of rectangular array storage are often only in the
contiguous storage of data in the first dimension (or column) of the array.
A data structure that retained the contiguous storage in the first dimension
but allowed for these to be stored as vectors of varying length, could enable
CMISS to significantly economize on memory usage.


* Selecting a CMISS Problem Type with which to Begin


Because each problem type is CMISS is implemented somewhat independently,
distributed memory multiprocessing will need to be introduced into each
problem type independently.  If it is introduced to one problem type first,
then the techniques used and lessons learnt can be applied to other problem
types.

Three problem types in CMISS that are computationally demanding and/or use
large amounts of memory are lung modelling, finite deformation mechanics, and
cell modelling.  Of these cell modelling stands out as lending itself most
easily to distributed memory parallelization.

In cell modelling, each cell may contain many state variables, only a few of
which interact with neighbouring cells.  The cell-internal state variables
need only be initialized and stored in the process that does the computation
for that cell, and there is much cell-internal computation that can be
performed with little communication with other processes.

Often even within problem types there are many options that utilize different
parts of CMISS.  In cell modelling there are, for example, different types of
cell integrators, different diffusion operators, and different linear
solvers.  It is important to determine which of these options work the best
so as to spend effort on parallelization of these options, and save work on
options that will not be used.


* Profiling


Once one problem type with one set of options has been selected for initial
implementation of distributed memory multiprocessing, it is important to
profile a job of this type to determine which code is spending the most time
on calculations and which data structures are using the most memory.  The
parallelization effort can then focus on this code and these structures.
There is little point in carefully dividing the read-only data in small arrays
amongst many process if it does not consume much memory, and there is little
point parallelizing a section of code that does not work with large data
structures and consumes very little processor time.

For parallelization of cell modelling, the assumptions that much of the
computation is cell-internal and that much of the data is the cell-internal
state variables needs to be verified.

Single process profiling may identify unexpected areas of time-consuming
computation or large memory usage.  Such areas should be investigated to see
if they have been implemented as efficiently as possible.  A little tuning
could make big gains in the performance of both single and multiprocessor jobs.


* Object-Oriented Data Management


An efficient distributed memory multiprocessing implementation relies on the
efficiency of the data management.  Once profiling has identified the data
structures that occupy the most memory, these structures will need redesigning
so that the data can be distributed across the processes.


** A Flexible Procedural Interface

The new data structures need to be flexible so that they can be optimized for
either single or multi- processor jobs, and it would be useful to be able to
test the efficiency of different storage methods without having to modify
large quantities of code.  They should also make it easy to work out where in
the code data is modified.  It may be useful if the data storage automatically
handled migration of data from one process to another.

These design criteria suggest an object-oriented data management system with a
procedural interface.  The interface might initially simply provide set
value and get value procedures, but could be extended to allow operations with
multiple values, and perhaps add instructions about which processes should
store which values.  Completely different storage mechanisms could be
implemented for single/multiprocessor applications, etc.

PETSc implements such an interface for its vector and matrix storage.  The
PETSc interface should be considered to see if it is applicable to CMISS,
either for direct use or merely for borrowing ideas.

A procedural interface to the large data arrays may incur the overhead of a
procedure call to access the data, but there are techniques to avoid this.
Normally only a much smaller subset of values from the large data structures
is required for computation at any one time.  If multiple values can be
fetched through one call then the overhead becomes small.

For example, if the CMISS arrays YP and ZP (used to hold dependent variable
information) were converted to a data structure that was accessed through a
procedural interface, then the interface might provide a procedure analogous
to CMISS subroutine ZPXE, which provides the values for one element.  (For
cell modelling the initial array of interest might well be YQS, but profiling
will confirm this or otherwise.)

If the procedure call overhead is still expensive, some procedures could be
inlined.


** Data Management Implementation

The data management code could be developed to store data more efficiently
with ragged arrays or other techniques, to automatically calculate the
size of memory required for the storage, and to release memory when it is no
longer needed.

Even though it is possible to write the data management code behind the
procedural interface predominantly in Fortran (with a few calls to C
subroutines for memory (de)allocation), a more flexible language such as C
may be more appropriate for storing pointers to vectors in ragged arrays and
for providing derived data types.  Fortran 90 provides these features but does
not have freely available optimized compilers and is not as flexible as C.
C++ adds features such as inheritance but may not be as portable as C.
The procedural interface would still be available to Fortran code whatever
language was used for the data management.


* Benchmarking


The high performance computer vendors can provide specifications on CPU
speeds, number of floating point units, memory latency times, etc., but to
really find out how well the system performs overall for us, it will be
necessary to run some benchmarks.

In the past we have run single thread and shared memory multithread benchmarks
on different systems.  In addition to these it would be interesting to run the
same single-thread job on all processors in a shared memory node to see what
effect the total load on the memory system has on performance.

The above benchmarks test only the performance of one shared memory node.
While this is important, it doesn't provide information about the speed of
communication between nodes.  Even, if we attempted to quickly put some
message passing code into CMISS, the resulting benchmarks would not be
indicative of the performance we could expect to achieve with well tuned
message passing.

In order to test the performance of the interconnects on various systems, it
would be more advantageous to analyse benchmarks of codes that have already
been tested and tuned.  One possible candidate is a PETSc example program, but
this may not provide any more information than the readily available LINPACK
benchmark results.  LINPACK performs a solution of a dense system of
equations.  This is not too different from computation done within CMISS, but
LINPACK may give performance closer to peak theoretical performance due the
regular nature of the problem.


From andre Wed Jul 28 18:44:08 +1200 2004
From: andre
Date: Wed, 28 Jul 2004 18:44:08 +1200
Subject: current plan?
Message-ID: <20040728184408+1200@www.bioeng.auckland.ac.nz/cms>

Might be worth adding in the current plan for what is going to happen once Chris gets here, just so we can have more of an idea about what will actually happen...

From karl Fri Jul 30 11:29:31 +1200 2004
From: karl
Date: Fri, 30 Jul 2004 11:29:31 +1200
Subject: summary of initial steps
Message-ID: <20040730112931+1200@www.bioeng.auckland.ac.nz/cms>

There is no concise summary of the initial steps suggested in the
report so I'll list them here as I see them.

1 The first thing that I thought needed doing was deciding on a
  problem type to focus on.

  Peter and David have already discussed this and decided on coupled
  electromechanics.  I understand this decision was based on the focus of the
  grants involved and on selecting a fairly challenging problem.

2 Determine which are the best options to use in solving this
  problem type.

3 Profile the solution of this problem type with these options to
  determine the arrays that use the most memory and the code that consumes the
  most cpu time.

4 Investigate whether PETSc or some other existing solver package
  is suitable for use as a linear solver for CMISS.

5 Use the knowledge from the profiling (and to some extent solver
  selection) to decide how storage can best be designed for the
  largest arrays so as to give us flexibility of storage, better
  control of memory access, more efficient memory management, and
  of course the ability to distribute arrays across processes.