Strategy for Introducing Distributed Memory Multiprocessing to CMISS
Strategy for Introducing Distributed Memory Multiprocessing to CMISS
Initial draft: 13 July 2004.
- Motivation
There are two current motivations for implementing distributed memory multiprocessing in CMISS: configuring CMISS for the Bioengineering Institute's own future high performance computing (HPC) resources and for resources available through e-Science. Adding distributed memory multiprocessing to CMISS, however, will also enable it to more fully utilize other resources if and when they become available.
** Replacement of the Bioengineering Institute's HPC Resources
The Bioengineering Institute intends to replace its HPC resources either at the end of this year or soon afterwards. It appears that it may be possible to buy significantly more computing power for the same price if we purchase a (distributed memory) cluster of shared memory nodes (with 2 to 8 processors in each node) than if we purchase one shared memory machine.
When designing the parallelization of algorithms, maximum efficiency (in utilization of computational resources) is usually achieved by selecting the parallel tasks so as to minimize communication between them. Each parallel task is made as large and independent as possible so that it performs as much of the algorithm as possible before it needs to report back to or wait for information from other parallel tasks. For example, it is usually more efficient to parallelize the outermost loop in a serial algorithm if each task performed in the loop is independent.
With CMISS, the largest possible task is the processing of one command file. The processing of one command file will be referred to as one _job_. Jobs are usually independent of other CMISS jobs (but there are exceptions).
With our current HPC resources, they are usually fully utilized as there are plenty of people with plenty of independent jobs to run. It is therefore usually more efficient to run the non-communicating jobs in parallel, than to run each job as several communicating parallel tasks, one job after the other. There is little motivation for dividing jobs into parallel tasks. Doing so could make one job complete sooner but would normally increase the average time taken by all the jobs. The exceptions to this are when there is a job that is demanding such a large proportion of memory (or some other resource of the computer) so as to hinder the performance of other jobs running concurrently. It is then more efficient to run the resource-hungry job as fast as possible (by spreading its tasks across available processors) to free up the resource for other jobs to use.
If the Institute's HPC Resources are replaced with a distributed memory cluster, then this would provide two differences that would motivate the division of one job into parallel tasks executed on different processors.
One difference is that there would most likely be sufficiently more processors available in the new resource that there would be more processors available than jobs to run.
The other more significant difference is that the distributed memory cluster would have less memory available to each process than in a shared memory system. If one job requires more memory than is available one one shared memory node then it must divide its data storage across shared memory nodes (and attempt to divide its tasks so that they are performed on the same node as the data is placed).
** Utilization of HPC Resources available through E-Science
Through the E-Science project, large computing resources in the United Kingdom are available to us. These resources include HPCx, which is a cluster of 50 IBM p690+ Regatta nodes, each with 32 1.7 GHz POWER4 processors and 32 GB of memory.
The administrators of HPCx and the E-Science project are not so interested in using HPCx in the most efficient way possible as they are in running the largest jobs possible on the system. The terms _capability_ and _capacity_ are used in describing different approaches to utilizing the system. Running many non-communicating jobs in parallel is said to utilize the system to its full capacity, while running one job communicating across all processors utilizes the system to full capability.
A speed up of 1.7 (wall-clock time) per doubling of the number of processors used is considered a good scaling. With such scaling, one _capability_ job running on 1024 processors will run at 45% ((1.7/2)5) of the efficiency of 32 similar independent _capacity_ jobs each on one 32 processor node, but such capability jobs that run on 512 or 1024 processors are encouraged and capacity jobs that use less than 32 processors (and run for more than an hour) are discouraged.
Utilizing HPCx to its full capability with good scaling will require distribution of a CMISS jobs data and tasks across many shared memory nodes while minimizing communication between the tasks.
*** HPCx Partitioning
It is interesting to note that HPCx was initially partitioned with each Regatta divided into four logical partitions (LPARs). Each shared memory node then had only 8 processors and 8 GB memory. It appears that this was done so as to improve the bandwidth of the interconnect between the p690 machines. Although there may have been an improvement in local memory access on each of the LPARs, it seems that this was not sufficient to motivate the partitioning.
This is an excerpt from the IBM Redbook, ``Performance and Tuning Considerations for the p690 in a Cluster 1600'':
``Each LPAR can have two SP Switch2 PCI Attachment Adapters (on a dual-plane SP Switch2 system). So you can effectively increase the interconnect bandwidth off a pSeries 690 by configuring it with multiple LPARs as compared with leaving it in Full System Partition mode. (Note that there is a limit of 16 SP Switch2 PCI Attachment Adapters per pSeries 690.)''
The following paragraphs are excerpts from the HPCx User Guide version 2.1:
``The introduction of the HPS switch has a number of knock-on effects. In order to increase the throughput over the Colony switch, the HPCx phase1 system was configured with each 32-way frame logically divided into four 8-way LPARs each running its own copy of the AIX operating system. This allowed four pairs of adaptors per frame to attach to the Colony network. Due to the better specification of the HPS, dividing the frames into LPARs of 8 processors is no longer necessary. On HPCx phase2 will be operating as a cluster of 32-way SMP nodes, with just one copy of AIX per frame. For production sized jobs, each 32-way SMP node will be dedicated and charged to a single compute job (exclusive access).''
``On the HPCx phase1 system, with its 8-way LPARs, each LPAR was automatically mapped onto one of the MCMs (Multi-Chip Module) of the p690 frame. On phase2, with its larger LPARs, user applications can now span across MCMs. It is possible for a process (MPI task) to find itself running on one MCM and its memory to be located in another. We therefore recommend setting memory affinity (set the MEMORY_AFFINITY environment variable to MCM) so that AIX attempts to allocate memory from the MCM where the process is running. This improves locality of reference. There are also issues around placement of processes with respect to MCMs (process affinity), which are currently under investigation.''
- History of Multiprocessing in CMISS
** Distributed Memory Mechanics
Distributed memory multiprocessing has been implemented in CMISS previously for finite deformation mechanics to calculate element stiffness matrix calculations of different elements on potentially different processors. This parallelization sent most geometry data to all slave processes, and then a master process sent dependent variable information for elements of its choice to slave processes of its choice for slave processes to calculate their assigned element stiffness matrices. The slave processes then returned the element stiffness matrices to the master process, which assembled them into a global matrix and solved the system.
** OpenMP Multithreading
Current multiprocessing in CMISS requires processors with shared memory and is implemented through OpenMP. _Multithreading_ is probably a more accurate name for this as the CMISS job is divided into tasks that are executed by different threads. All threads share the same physical memory and so there is no need to explicitly pass data from one thread to another. It is necessary however to ensure that memory used by one thread to write data for its calculations is not used to write data from other threads.
The scalability provided by the current shared memory multithreading implementation varies from one problem type to another. Currently users solving mechanics problems do not see sufficient speed up and hence prefer to run multiple jobs simultaneously. However the cell modelling code often shows very good speed ups.
The current implementation of IO within parallel threads is often poor. The subroutine WRITES is usually called after placing the message in a global buffer. This can interfere with other threads of the same process that use the same global buffer. A thread-private buffer should be used instead. Unfortunately WRITES expects the message buffer to be large even when the message is not; this requirement should also be removed.
- Differences with Distributed Memory Multiprocessing
With distributed memory multiprocessing, all data is by default local to the process, and so there is little risk of one processes's data begin corrupted by another process. Instead the issue is to get the necessary data to the processes that need it by the time that the need it, and to get information back without holding up other processes.
** Data and Task Distribution
Because of the architecture of clusters, data transfer between nodes is usually significantly slower than memory access on a shared memory system, and so minimization of data transfer becomes more critical.
All data is available to each thread in shared memory multithreading, but if all data were made available to all processes of a distributed memory multiprocessing job, the memory requirements would be multiplied by the number of processes. Instead the data distribution needs to be designed so that only the necessary data is stored in each process. Furthermore, if there is master process controlling the other processes, it will often not have available enough memory to store a master copy of all data and so will need to release memory that it is no longer using. This will require more dynamic memory management policies than are currently implemented in CMISS.
Because all data is readily available to threads in shared memory multithreading, it is easy to spawn threads when they are required for a parallel section, then re-spawn them again for another section (although the OpenMP implementation may keep the threads spinning as an optimization). However, if a new process in distributed memory multiprocessing were spawned each time a parallel section was encountered, then the input data would need to be sent each time. As much data does not change between parallel sections it would be preferable to leave processes running between tasks. (If there are distinct parallel tasks that don't share any data, however, then it may be reasonable to have separate processes, possibly on the same machine.) Each process would usually have an initialization phase, during which constant data was set up, and then a series of computation phases, before each of which some data is updated.
The locallization of data in distributed memory multiprocessing makes it difficult to decide dynamically which tasks should be performed by which process. It will be significantly more efficient to keep using the same data in the same process than to pass it to another process. The decision as to which process should perform which tasks so as to share the load evenly across processors should be made before the data is set up if at all possible. This means that jobs with distributed memory multiprocessing will work much more efficiently if they have exclusive access to a certain number of processors. If there is another job competing for processor time with just one process of a multiprocessing job then either all the other processes will end up waiting for the results from that job or the tasks and associated data assigned to that process will need to be moved to other processes.
** Designation of Single or Multi- Processing
OpenMP provides a mechanism to compile the code with and without the parallelizations, through conditional compilation. Distributed memory programming standards, however, do not provide such a mechanism, and so it will be necessary to determine a way for the code to behave differently when using and not using distributed memory multiprocessing. This could be determined at run time, but may be more efficiently done at compile time by preprocessing the source files.
- Tools
** MPI and PVM
It is difficult to compare Message Passing Interface (MPI) and Parallel Virtual Machine (PVM) as both standards are continually developing. A feature that is provided by a version of one standard is often later incorporated into a version of the other standard. Furthermore, the actual implementations of the APIs lag behind the standards, introducing some of the new features before others. However, MPI appears to be more widely used than PVM and some advantageous features that it may have (but may have subsequently been introduced to PVM) include non-blocking sends, predefined reduction functions, and derived data types.
Two freely available implementations of MPI are MPICH and MPI/LAM. These both implement MPI 1.2 with some of the features of MPI 2.0. Hardware vendors often provide their own implementations of MPI and PVM.
** Mosix
Mosix can dynamically and transparently move processes from machine to machine as loads indicate. Openmosix (which forked from Mosix) has been ported to the IA64 architecture but is not supported on this architecture, the developers claiming that there was not enough support provided by the manufacturer. There are plans to port Openmosix to the Opteron architecture.
** PETSc
The Portable, Extensible Toolkit for Scientific Computation (PETSc) provides a suite of parallel computational tools for distributed memory architectures. This includes solvers of many kinds but also an implementation of vectors and matrices that shares storage across the processes.
The PETSc copyright includes the phrase:
``Permission is hereby granted to use, reproduce, prepare derivative works, and to redistribute to others, so long as this original copyright notice is retained.''
It is unclear whether or not this licence would be suitable for use in CMISS. Even if not suitable, it could be helpful to investigate the design decisions made in PETSc as it implements many of the functions required in CMISS.
PETSc is not recommended as a linear solver for matrices from a sequential program. From the abstract to the PETSc Users Manual:
``PETSc should not be used to attempt to provide a parallel linear solver in an otherwise sequential code. Certainly all parts of a previously sequential code need not be parallelized but the matrix generation portion must be to expect any kind of reasonable performance. Do not expect to generate your matrix sequentially and then use PETSc to solve the linear system in parallel.''
It is interesting to note that PETSc matrices are partitioned by contiguous chunks of rows. There is currently no implementation for matrices partitioned by blocks. It is unclear whether this is a due to ease of implementation or to expected performance benefit.
- General Code Improvements
** Documentation of Data Usage
It would be very useful for each subroutine that is called from within a parallel region (and for other subroutines also) to state which variables are modified in the subroutine (i.e. which parameters are input and which are output). This includes not only subroutine parameters but also common blocks.
** Common Blocks
Common blocks can make it difficult to determine which variables are actually used in subroutines. If variables are only used and not modified this is not so bad because the common blocks can be duplicated across all processes just once (provided they are not too large). If the common blocks are modified, however, it is necessary to ensure that the other machines are updated if and before they use the old value. If the common blocks are modified during computation, a more tightly controlled data storage method would facilitate maintaining code that keeps the data synchronized.
** Memory Usage
The multi-dimensional rectangular array storage in CMISS can often result in CMISS using much more memory and having its data much more spread out than necessary. The benefits of rectangular array storage are often only in the contiguous storage of data in the first dimension (or column) of the array. A data structure that retained the contiguous storage in the first dimension but allowed for these to be stored as vectors of varying length, could enable CMISS to significantly economize on memory usage.
- Selecting a CMISS Problem Type with which to Begin
Because each problem type is CMISS is implemented somewhat independently, distributed memory multiprocessing will need to be introduced into each problem type independently. If it is introduced to one problem type first, then the techniques used and lessons learnt can be applied to other problem types.
Three problem types in CMISS that are computationally demanding and/or use large amounts of memory are lung modelling, finite deformation mechanics, and cell modelling. Of these cell modelling stands out as lending itself most easily to distributed memory parallelization.
In cell modelling, each cell may contain many state variables, only a few of which interact with neighbouring cells. The cell-internal state variables need only be initialized and stored in the process that does the computation for that cell, and there is much cell-internal computation that can be performed with little communication with other processes.
Often even within problem types there are many options that utilize different parts of CMISS. In cell modelling there are, for example, different types of cell integrators, different diffusion operators, and different linear solvers. It is important to determine which of these options work the best so as to spend effort on parallelization of these options, and save work on options that will not be used.
- Profiling
Once one problem type with one set of options has been selected for initial implementation of distributed memory multiprocessing, it is important to profile a job of this type to determine which code is spending the most time on calculations and which data structures are using the most memory. The parallelization effort can then focus on this code and these structures. There is little point in carefully dividing the read-only data in small arrays amongst many process if it does not consume much memory, and there is little point parallelizing a section of code that does not work with large data structures and consumes very little processor time.
For parallelization of cell modelling, the assumptions that much of the computation is cell-internal and that much of the data is the cell-internal state variables needs to be verified.
Single process profiling may identify unexpected areas of time-consuming computation or large memory usage. Such areas should be investigated to see if they have been implemented as efficiently as possible. A little tuning could make big gains in the performance of both single and multiprocessor jobs.
- Object-Oriented Data Management
An efficient distributed memory multiprocessing implementation relies on the efficiency of the data management. Once profiling has identified the data structures that occupy the most memory, these structures will need redesigning so that the data can be distributed across the processes.
** A Flexible Procedural Interface
The new data structures need to be flexible so that they can be optimized for either single or multi- processor jobs, and it would be useful to be able to test the efficiency of different storage methods without having to modify large quantities of code. They should also make it easy to work out where in the code data is modified. It may be useful if the data storage automatically handled migration of data from one process to another.
These design criteria suggest an object-oriented data management system with a procedural interface. The interface might initially simply provide set value and get value procedures, but could be extended to allow operations with multiple values, and perhaps add instructions about which processes should store which values. Completely different storage mechanisms could be implemented for single/multiprocessor applications, etc.
PETSc implements such an interface for its vector and matrix storage. The PETSc interface should be considered to see if it is applicable to CMISS, either for direct use or merely for borrowing ideas.
A procedural interface to the large data arrays may incur the overhead of a procedure call to access the data, but there are techniques to avoid this. Normally only a much smaller subset of values from the large data structures is required for computation at any one time. If multiple values can be fetched through one call then the overhead becomes small.
For example, if the CMISS arrays YP and ZP (used to hold dependent variable information) were converted to a data structure that was accessed through a procedural interface, then the interface might provide a procedure analogous to CMISS subroutine ZPXE, which provides the values for one element. (For cell modelling the initial array of interest might well be YQS, but profiling will confirm this or otherwise.)
If the procedure call overhead is still expensive, some procedures could be inlined.
** Data Management Implementation
The data management code could be developed to store data more efficiently with ragged arrays or other techniques, to automatically calculate the size of memory required for the storage, and to release memory when it is no longer needed.
Even though it is possible to write the data management code behind the procedural interface predominantly in Fortran (with a few calls to C subroutines for memory (de)allocation), a more flexible language such as C may be more appropriate for storing pointers to vectors in ragged arrays and for providing derived data types. Fortran 90 provides these features but does not have freely available optimized compilers and is not as flexible as C. C++ adds features such as inheritance but may not be as portable as C. The procedural interface would still be available to Fortran code whatever language was used for the data management.
- Benchmarking
The high performance computer vendors can provide specifications on CPU speeds, number of floating point units, memory latency times, etc., but to really find out how well the system performs overall for us, it will be necessary to run some benchmarks.
In the past we have run single thread and shared memory multithread benchmarks on different systems. In addition to these it would be interesting to run the same single-thread job on all processors in a shared memory node to see what effect the total load on the memory system has on performance.
The above benchmarks test only the performance of one shared memory node. While this is important, it doesn't provide information about the speed of communication between nodes. Even, if we attempted to quickly put some message passing code into CMISS, the resulting benchmarks would not be indicative of the performance we could expect to achieve with well tuned message passing.
In order to test the performance of the interconnects on various systems, it would be more advantageous to analyse benchmarks of codes that have already been tested and tuned. One possible candidate is a PETSc example program, but this may not provide any more information than the readily available LINPACK benchmark results. LINPACK performs a solution of a dense system of equations. This is not too different from computation done within CMISS, but LINPACK may give performance closer to peak theoretical performance due the regular nature of the problem.