subprojects for development of distributed parallelization

This is a list of subprojects that could be experimented with independently in order to understand and develop techniques for handling the parallel communication involved.

  • Cell models (connected by diffusion)

    Parallelization is simple because there is little communication between cells. (At least initially, the parallelization of the diffusion operator can be excluded from this subproject.)

    Communication of transmembrane potential may be the only communication involved here (after initialization).

    One area for optimization is ensuring that computation starts as soon as possible after the required information has been received: computation for one cell can start before the transmembrane potential has been received for all the other cells on the node.
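
    As a concrete (hypothetical) illustration of this overlap, the following sketch posts non-blocking receives for the neighbours' transmembrane potentials, could integrate the cells that need no remote data while the messages are in flight, and handles the remaining cells in whatever order the data arrives. The 1D ring of neighbours and the commented-out compute routines are assumptions for the example only.

    PROGRAM CELL_OVERLAP_SKETCH
      USE MPI
      IMPLICIT NONE
      INTEGER :: RANK, NPROCS, LEFT, RIGHT, I, IDX, IERR
      INTEGER :: RECV_REQ(2), SEND_REQ(2), STATUS(MPI_STATUS_SIZE)
      DOUBLE PRECISION :: VM_LOCAL, VM_RECV(2)

      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROCS, IERR)
      LEFT  = MOD(RANK-1+NPROCS, NPROCS)   ! example neighbours: a 1D ring
      RIGHT = MOD(RANK+1, NPROCS)
      VM_LOCAL = -80.0D0                   ! this process's transmembrane potential

      ! Post the non-blocking receives (and sends) first.
      CALL MPI_IRECV(VM_RECV(1), 1, MPI_DOUBLE_PRECISION, LEFT, 0, MPI_COMM_WORLD, RECV_REQ(1), IERR)
      CALL MPI_IRECV(VM_RECV(2), 1, MPI_DOUBLE_PRECISION, RIGHT, 0, MPI_COMM_WORLD, RECV_REQ(2), IERR)
      CALL MPI_ISEND(VM_LOCAL, 1, MPI_DOUBLE_PRECISION, LEFT, 0, MPI_COMM_WORLD, SEND_REQ(1), IERR)
      CALL MPI_ISEND(VM_LOCAL, 1, MPI_DOUBLE_PRECISION, RIGHT, 0, MPI_COMM_WORLD, SEND_REQ(2), IERR)

      ! Cells that need no remote data could be integrated here, while the
      ! messages are in flight (a hypothetical COMPUTE_INTERIOR_CELLS call).

      ! Handle the remaining cells as soon as their data arrives, in whatever
      ! order the messages complete.
      DO I = 1, 2
        CALL MPI_WAITANY(2, RECV_REQ, IDX, STATUS, IERR)
        ! hypothetical COMPUTE_CELLS_NEEDING(IDX, VM_RECV(IDX)) call here
      END DO

      CALL MPI_WAITALL(2, SEND_REQ, MPI_STATUSES_IGNORE, IERR)
      CALL MPI_FINALIZE(IERR)
    END PROGRAM CELL_OVERLAP_SKETCH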

  • Matrix assembly

    The code for this would depend on field storage format (how elements, basis functions, etc. are accessed) and on whether the mesh is structured or unstructured. The algorithm would likely depend on the domain decomposition algorithm.
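
    To make these dependencies concrete, here is a minimal serial sketch of an element-by-element assembly loop; the dense global matrix, the ELEM_NODES connectivity array and the constant stand-in element matrix are illustrative assumptions, not the cmiss data structures, and a distributed version would additionally have to decide which process adds which element contributions.

    ! Sketch only: scatter-add element matrices into a (dense) global matrix
    ! using the element-to-node connectivity.  The storage format and the
    ! element matrix computation are placeholders.
    SUBROUTINE ASSEMBLE(NE, NNE, NN, ELEM_NODES, GLOBAL_K)
      IMPLICIT NONE
      INTEGER, INTENT(IN) :: NE                   ! number of (local) elements
      INTEGER, INTENT(IN) :: NNE                  ! nodes per element
      INTEGER, INTENT(IN) :: NN                   ! number of global nodes
      INTEGER, INTENT(IN) :: ELEM_NODES(NNE, NE)  ! element-to-node connectivity
      DOUBLE PRECISION, INTENT(OUT) :: GLOBAL_K(NN, NN)
      DOUBLE PRECISION :: KE(NNE, NNE)
      INTEGER :: E, I, J

      GLOBAL_K = 0.0D0
      DO E = 1, NE
        KE = 1.0D0                                ! stand-in for the real element matrix
        DO J = 1, NNE
          DO I = 1, NNE
            GLOBAL_K(ELEM_NODES(I, E), ELEM_NODES(J, E)) = &
              & GLOBAL_K(ELEM_NODES(I, E), ELEM_NODES(J, E)) + KE(I, J)
          END DO
        END DO
      END DO
    END SUBROUTINE ASSEMBLE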

  • Solver

    It should be possible to use existing solver libraries, probably Krylov methods that make use of user-provided matrix-vector multiplication and preconditioner functions.

  • Matrix-vector multiplication

    A fundamental operation that could be a good place to start in order to understand the issues.

    There may be existing libraries that provide this function for a data distribution that we can use.
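
    A minimal sketch of the simplest case, assuming the matrix is distributed by equal-sized contiguous blocks of rows and the whole vector is gathered before the local multiply; a real implementation would exchange only the halo entries each process actually needs, and an existing library could well replace this.

    ! Sketch only: y = A*x with A distributed by contiguous blocks of rows.
    SUBROUTINE DIST_MATVEC(N_LOCAL, N_GLOBAL, A_LOCAL, X_LOCAL, Y_LOCAL)
      USE MPI
      IMPLICIT NONE
      INTEGER, INTENT(IN) :: N_LOCAL, N_GLOBAL
      DOUBLE PRECISION, INTENT(IN)  :: A_LOCAL(N_LOCAL, N_GLOBAL)  ! locally owned rows
      DOUBLE PRECISION, INTENT(IN)  :: X_LOCAL(N_LOCAL)            ! locally owned x entries
      DOUBLE PRECISION, INTENT(OUT) :: Y_LOCAL(N_LOCAL)            ! locally owned y entries
      DOUBLE PRECISION :: X_GLOBAL(N_GLOBAL)
      INTEGER :: IERR

      ! Gather every process's piece of x (assumes equal-sized blocks); in
      ! practice only the required halo values would be communicated.
      CALL MPI_ALLGATHER(X_LOCAL, N_LOCAL, MPI_DOUBLE_PRECISION, &
        & X_GLOBAL, N_LOCAL, MPI_DOUBLE_PRECISION, MPI_COMM_WORLD, IERR)

      ! Local multiply of the owned rows against the full vector.
      Y_LOCAL = MATMUL(A_LOCAL, X_GLOBAL)
    END SUBROUTINE DIST_MATVEC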

  • General database and cache

    Something to handle access to and modification of data, decide which nodes to store the data on (unless explicitly directed otherwise), automatically invalidate (or update) caches on other nodes, etc.

    This could be used by the matrix assembly (and possibly other operations, including all those using fields).

    The idea here is a distributed database that can be accessed from any node or process, but may store the data on any one or more nodes either as usage patterns suggest or as explicitly directed.

    Having a general automated database would make writing software much easier, but at least some explicit direction of data storage would likely be required for optimal performance.

    The distribution of data may be partially or entirely determined by a domain decomposition algorithm, but even then some values would be accessed and changed by more than one process.

    A process that needs to use the latest value of a piece of data may either explicitly fetch (or preferably prefetch) the data when (or before) required, or register for automatic updates of changed data, or maybe the data could be explicitly pushed from another node.
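
    To make the idea more concrete, here is a hypothetical interface sketch; every name is invented for illustration and is not an existing cmiss API. Data items are identified by a handle, the library chooses where to store them unless directed otherwise, and users either fetch/prefetch values on demand or register for automatic updates.

    ! Hypothetical interface sketch for the distributed database/cache; all
    ! names here are invented for illustration only.
    MODULE DISTRIBUTED_DATA_SKETCH
      IMPLICIT NONE
      INTEGER, PARAMETER :: DP = KIND(1.0D0)
    CONTAINS

      ! Register a data item; the library chooses the owning process(es)
      ! unless PREFERRED_RANK is given explicitly.
      SUBROUTINE DD_CREATE(HANDLE, N_VALUES, PREFERRED_RANK)
        INTEGER, INTENT(OUT) :: HANDLE
        INTEGER, INTENT(IN)  :: N_VALUES
        INTEGER, INTENT(IN), OPTIONAL :: PREFERRED_RANK
        HANDLE = 0   ! placeholder
      END SUBROUTINE DD_CREATE

      ! Fetch (or prefetch ahead of need) the current values, wherever stored.
      SUBROUTINE DD_FETCH(HANDLE, VALUES)
        INTEGER, INTENT(IN) :: HANDLE
        REAL(DP), INTENT(OUT) :: VALUES(:)
        VALUES = 0.0_DP   ! placeholder
      END SUBROUTINE DD_FETCH

      ! Change values; cached copies on other processes would be invalidated
      ! (or updated) automatically.
      SUBROUTINE DD_UPDATE(HANDLE, VALUES)
        INTEGER, INTENT(IN) :: HANDLE
        REAL(DP), INTENT(IN) :: VALUES(:)
      END SUBROUTINE DD_UPDATE

      ! Ask to be notified/refreshed whenever another process changes the item.
      SUBROUTINE DD_REGISTER_FOR_UPDATES(HANDLE)
        INTEGER, INTENT(IN) :: HANDLE
      END SUBROUTINE DD_REGISTER_FOR_UPDATES

    END MODULE DISTRIBUTED_DATA_SKETCH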

  • Direction of computation

    A mechanism for determining what actions happen on what nodes, keeping track of which actions have been completed, and reassessing the distribution of tasks dynamically based on performance.

    The distribution of data may be explicitly specified and altered.

    Domain decomposition algorithms are involved here.

    It may not be easy to develop this as a separate independent subproject.

  • General input of data into the computational kernel.

    By this I don't mean actual I/O, as that will be handled outside the computational kernel. What I mean is that eventually there will be a call along the lines of "create a bilinear basis function with 2x2 Gauss points", etc. This interacts with the MPI program because the calling process (GUI, script, another program, etc.) will only call one process of the computational kernel. There are then issues of how this call is communicated to the other processes, how memory is allocated to store the object, and how defaults are handled (I would think that we would want the computational kernel library to be as simple as possible to use). My general idea is that the input for every "object" that the computational kernel needs to be concerned about proceeds as follows:

    1) The external program starts with a start_create_object call. This syncs the processes and lets them set up any variables that they might need for object creation.

    2) The external program then issues a create_object call, e.g. create_basis(number=1, type=bilinear) or create_node(number=1, position=(x, y, z)). In general only the most basic information is given in the create call; the rest assumes a default value (just like the old cm commands, where the most sensible options were the defaults and did not appear in the actual command). The "master" process (process 0) then stores/caches the creation call values and creates a "default" object (i.e., with all default values filled in) in a list. The more complicated values can be filled in later, e.g. start_create_coordinate_system(), create_coordinate_system(system_number=1, number_of_dim=3), set_coordinate_system_type(system_number=1, type=prolate), set_coordinate_system_focus(system_number=1, focus=1.1), etc. All these set calls alter the default values in the list of "semi-created" objects on the master process.

    3) The objects are finished off with a finish_create_object call. This call then goes through and actually creates the objects where it has to. All the processes are synced at this point and the appropriate memory is allocated (as we now know how many objects to create, we can use more efficient structures such as arrays rather than lists). MPI communication then takes place and sends the data from the pre-object list to the appropriate processes. (A sketch of this calling sequence is given below, after the alternative approaches.)

    Some other approaches are

    • Provide all the necessary creation data on the create call. This is probably inefficient for complicated objects.

    • Consider all objects temporary or initial until 'added' to an MPI processing space or suchlike. So you just create and modify your object; then, when it is complete, you have an object representing the MPI space and you add it to that, with a function like MPI_space_add_node(my_processing_pool, my_node). At that point the object becomes available to the necessary partitions of the MPI processing pool.

      The really tricky component of this is if you have more than one gateway node; however, I don't think we want this in the first instance.
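
    A sketch of the first (start/create/set/finish) approach for a single hypothetical object type follows. The master process caches the values supplied by the external program, and the finish call syncs the kernel processes and broadcasts the default-filled object; how the non-master processes are brought into the collective calls is left open here, and all routine and type names are placeholders.

    ! Sketch only: the start/create/set/finish pattern for a hypothetical
    ! coordinate-system object.  All names are placeholders.
    MODULE CREATE_PATTERN_SKETCH
      USE MPI
      IMPLICIT NONE
      TYPE COORDINATE_SYSTEM_TYPE
        INTEGER :: SYSTEM_NUMBER = 0
        INTEGER :: NUMBER_OF_DIM = 3        ! defaults filled in up front
        INTEGER :: SYSTEM_TYPE   = 1        ! e.g. 1 = rectangular Cartesian
        DOUBLE PRECISION :: FOCUS = 1.0D0
      END TYPE COORDINATE_SYSTEM_TYPE
      TYPE(COORDINATE_SYSTEM_TYPE), SAVE :: PENDING   ! the "semi-created" object
    CONTAINS

      SUBROUTINE START_CREATE_COORDINATE_SYSTEM()
        INTEGER :: IERR
        CALL MPI_BARRIER(MPI_COMM_WORLD, IERR)   ! sync before creation begins
      END SUBROUTINE START_CREATE_COORDINATE_SYSTEM

      SUBROUTINE CREATE_COORDINATE_SYSTEM(SYSTEM_NUMBER, NUMBER_OF_DIM)
        INTEGER, INTENT(IN) :: SYSTEM_NUMBER, NUMBER_OF_DIM
        PENDING%SYSTEM_NUMBER = SYSTEM_NUMBER    ! only basic info on the create call
        PENDING%NUMBER_OF_DIM = NUMBER_OF_DIM
      END SUBROUTINE CREATE_COORDINATE_SYSTEM

      SUBROUTINE SET_COORDINATE_SYSTEM_FOCUS(FOCUS)
        DOUBLE PRECISION, INTENT(IN) :: FOCUS
        PENDING%FOCUS = FOCUS                    ! optional set calls alter defaults
      END SUBROUTINE SET_COORDINATE_SYSTEM_FOCUS

      SUBROUTINE FINISH_CREATE_COORDINATE_SYSTEM()
        INTEGER :: IERR
        ! Sync all processes and send the cached values from the master;
        ! the real code would also allocate permanent storage here.
        CALL MPI_BCAST(PENDING%SYSTEM_NUMBER, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, IERR)
        CALL MPI_BCAST(PENDING%NUMBER_OF_DIM, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, IERR)
        CALL MPI_BCAST(PENDING%SYSTEM_TYPE, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, IERR)
        CALL MPI_BCAST(PENDING%FOCUS, 1, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, IERR)
      END SUBROUTINE FINISH_CREATE_COORDINATE_SYSTEM

    END MODULE CREATE_PATTERN_SKETCH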

  • Investigate what data types need to be stored in a distributed fashion.

    Initially, I guess this can be thought of as scalar, vector and matrix data. The question is what type of scalar, vector and matrix: will simple types (e.g., integer, real) do, or are more complicated (defined) types needed? What does or doesn't fit into this in the current cmiss framework?

  • Indexing of objects.

    There will probably need to be three levels of indexing: 1) the user level, e.g. node 112; 2) the global level, e.g. the user might have started the numbering at node 101, so user node 112 is actually global node 12; and 3) the local level, e.g. global nodes 1-6 are on process 0 and global nodes 7-12 are on process 1, so when the external application issues set_node_solution_value(node=112, value=2.0), user node 112 is global node 12, which is local node 6 on process 1 (we need this level because we don't want to allocate more memory than we have to). Because of the multiple levels of indirection, we need to consider efficient data structures for converting between the user, global and local numberings. We could just set up an indexing array and indirectly index the labels, but this might be inefficient for large user numbers with a non-zero base, e.g. if the user creates user node 112, do we allocate an array with 112 entries when only 101-112 are used? Perhaps something like a tree representation (e.g., a red-black tree) would allow efficient searching and indirection.

    I (Shane) prefer to think that we will be indexing most objects by pointers, of which I may locally hold an array. I think this makes the process simpler?
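
    A small sketch of the user-to-global lookup follows, using a sorted array of user numbers with a binary search as a stand-in for the tree suggested above (the global number is simply the position in that array, and separate arrays record the owning process and local number). All names here are illustrative only.

    ! Sketch of the three numbering levels; not an existing cmiss structure.
    MODULE NODE_NUMBERING_SKETCH
      IMPLICIT NONE
      ! User numbers in ascending order; position in this array = global number.
      INTEGER, ALLOCATABLE :: USER_NUMBERS(:)
      ! For each global number: owning process and local number on that process.
      INTEGER, ALLOCATABLE :: OWNER_PROCESS(:), LOCAL_NUMBER(:)
    CONTAINS

      ! Return the global number for a user number (0 if not found).
      ! Assumes USER_NUMBERS has been allocated and filled during setup.
      FUNCTION GLOBAL_FROM_USER(USER_NUMBER) RESULT(GLOBAL)
        INTEGER, INTENT(IN) :: USER_NUMBER
        INTEGER :: GLOBAL, LOW, HIGH, MID
        GLOBAL = 0
        LOW = 1
        HIGH = SIZE(USER_NUMBERS)
        DO WHILE (LOW <= HIGH)                 ! binary search
          MID = (LOW + HIGH) / 2
          IF (USER_NUMBERS(MID) == USER_NUMBER) THEN
            GLOBAL = MID
            RETURN
          ELSE IF (USER_NUMBERS(MID) < USER_NUMBER) THEN
            LOW = MID + 1
          ELSE
            HIGH = MID - 1
          END IF
        END DO
      END FUNCTION GLOBAL_FROM_USER

    END MODULE NODE_NUMBERING_SKETCH

    For the example above, with user nodes 101-112 split over two processes, USER_NUMBERS would hold 101..112, GLOBAL_FROM_USER(112) would return 12, and the owner/local arrays would map global node 12 to local node 6 on process 1.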

  • Solvers and how they need their data stored for the solution?

    Whilst it is nice to think that the solvers are separate and that we can design our own distribution library without them, it would be useful to know how different distributed parallel solvers require their input matrices to be stored. That is, if we ultimately need to distribute and store the global matrices in a certain way, then we should consider that now, as it will probably not be efficient (in terms of both computation and memory footprint) to compute the global matrix in one way and then copy it to another format/structure for the actual solve.

  • Domain decomposition.

    What are the best algorithms for deciding how to decompose the computational domain? What data structures do we need here? What sort of topological ideas should be considered, and how can that topology be exploited when we map the sub-domains onto actual computational nodes?

  • How best to store allocatable components within types.

    Consider:

    TYPE ELEMENT_TYPE
      INTEGER(INTG) :: NUMBER
      INTEGER(INTG), ALLOCATABLE :: ELEMENT_NODES(:)
    END TYPE ELEMENT_TYPE
    

    You want ELEMENT_NODES to be allocatable because, in general, different element bases have different numbers of nodes. The problem is that the address of ELEMENT_NODES, once allocated, will be different on different computational processes, as they will in general either have a different address space or a different allocation/deallocation pattern. Thus, if you want to send this element data structure between processes, you can't simply send the type as-is, because the data will not make sense on the other process. This can be done with special care (MPI has ways of sending this kind of structure using an MPI_Address/MPI_Get_address call and an hindexed derived type, or you can pack and unpack your own data), but the trade-offs need to be considered if this sort of thing happens often or in critical regions.
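
    A sketch of the pack/unpack route for ELEMENT_TYPE is given below; it sends the node count ahead of the node list so the receiver can allocate ELEMENT_NODES before unpacking. It assumes ELEMENT_TYPE (above) is available from a module, that INTG is the default integer kind (so MPI_INTEGER matches), and that the fixed buffer size is large enough; the surrounding routine names are placeholders.

    ! Sketch only: send/receive an ELEMENT_TYPE with an allocatable component
    ! by packing it into a contiguous buffer.  The derived-type route
    ! (MPI_GET_ADDRESS plus an hindexed/struct type) is the alternative.
    SUBROUTINE SEND_ELEMENT(ELEMENT, DEST)
      USE MPI
      IMPLICIT NONE
      TYPE(ELEMENT_TYPE), INTENT(IN) :: ELEMENT
      INTEGER, INTENT(IN) :: DEST
      INTEGER, PARAMETER :: BUF_SIZE = 1024
      CHARACTER(LEN=1) :: BUFFER(BUF_SIZE)
      INTEGER :: POSITION, N_NODES, IERR

      N_NODES = SIZE(ELEMENT%ELEMENT_NODES)
      POSITION = 0
      ! Pack the scalar data and the node count first, then the node list.
      CALL MPI_PACK(ELEMENT%NUMBER, 1, MPI_INTEGER, BUFFER, BUF_SIZE, POSITION, MPI_COMM_WORLD, IERR)
      CALL MPI_PACK(N_NODES, 1, MPI_INTEGER, BUFFER, BUF_SIZE, POSITION, MPI_COMM_WORLD, IERR)
      CALL MPI_PACK(ELEMENT%ELEMENT_NODES, N_NODES, MPI_INTEGER, BUFFER, BUF_SIZE, POSITION, MPI_COMM_WORLD, IERR)
      CALL MPI_SEND(BUFFER, POSITION, MPI_PACKED, DEST, 0, MPI_COMM_WORLD, IERR)
    END SUBROUTINE SEND_ELEMENT

    SUBROUTINE RECV_ELEMENT(ELEMENT, SOURCE)
      USE MPI
      IMPLICIT NONE
      TYPE(ELEMENT_TYPE), INTENT(OUT) :: ELEMENT
      INTEGER, INTENT(IN) :: SOURCE
      INTEGER, PARAMETER :: BUF_SIZE = 1024
      CHARACTER(LEN=1) :: BUFFER(BUF_SIZE)
      INTEGER :: POSITION, N_NODES, IERR, STATUS(MPI_STATUS_SIZE)

      CALL MPI_RECV(BUFFER, BUF_SIZE, MPI_PACKED, SOURCE, 0, MPI_COMM_WORLD, STATUS, IERR)
      POSITION = 0
      CALL MPI_UNPACK(BUFFER, BUF_SIZE, POSITION, ELEMENT%NUMBER, 1, MPI_INTEGER, MPI_COMM_WORLD, IERR)
      CALL MPI_UNPACK(BUFFER, BUF_SIZE, POSITION, N_NODES, 1, MPI_INTEGER, MPI_COMM_WORLD, IERR)
      ALLOCATE(ELEMENT%ELEMENT_NODES(N_NODES))   ! allocate once the size is known
      CALL MPI_UNPACK(BUFFER, BUF_SIZE, POSITION, ELEMENT%ELEMENT_NODES, N_NODES, MPI_INTEGER, MPI_COMM_WORLD, IERR)
    END SUBROUTINE RECV_ELEMENT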