OpenMP for Finite Elasticity
This page documents the OpenMP performance of the finite elasticity code inside cm.
I have been investigating parallelisation on a 96-element tri-cubic Hermite model of a phantom that I have been working with in my PhD. The phantom is made of a neo-Hookean material and is loaded under gravity at an angle of 19.8 degrees. It's not really relevant, I know, but I thought I'd give some problem background.
I calculated the percentage of time spent in the different phases of the solution process for one Newton iteration when the code is run as a single-processor job:
- Calculation of element residuals: 0.44%
- Calculation of element stiffness matrices: 81%
- Constraint reduction of the global system of equations: 5.39%
- Solution of the linear system of equations: 10.19%
- Other: 2.98%
It is well known that the assembly of the global stiffness matrix takes the largest share of the time. I wanted to make sure that I was getting good speed up as I increased the number of threads used to assemble the stiffness matrix. In theory, stiffness matrix assembly is very well suited to parallelisation, since each element stiffness matrix can be computed independently of the others, so I should be able to obtain a speed up factor of 2 when moving from 1 processor to 2 processors.
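For reference, the loop being parallelised has roughly the following shape. This is only a minimal sketch: zees_like, gk and nedof are illustrative stand-ins, not the actual cm routines and arrays (ZEES.f and friends), and the element matrix is just filled with a dummy value.

    program assemble_sketch
      ! Minimal sketch of the parallelised element-assembly loop. The names
      ! here (zees_like, gk, nedof) are illustrative stand-ins, not the real
      ! cm routines and arrays.
      implicit none
      integer, parameter :: nel = 96        ! number of elements, as in the phantom mesh
      integer, parameter :: nedof = 4       ! toy number of element degrees of freedom
      double precision :: gk(nel*nedof, nel*nedof)   ! "global stiffness matrix"
      double precision :: ke(nedof, nedof)           ! element stiffness matrix
      integer :: ne, i, j

      gk = 0.0d0
      ! Each element stiffness matrix is computed independently, so the
      ! element loop is, in principle, embarrassingly parallel.
      !$omp parallel do private(ne, ke, i, j)
      do ne = 1, nel
        call zees_like(ne, ke)              ! stands in for ZEES.f
        do j = 1, nedof
          do i = 1, nedof
            ! Protect the scatter into the shared global matrix. (In this toy
            ! the element blocks do not overlap, so the atomic is not strictly
            ! needed; on a real mesh, elements sharing nodes write to the same
            ! global entries.)
            !$omp atomic
            gk((ne-1)*nedof+i, (ne-1)*nedof+j) = gk((ne-1)*nedof+i, (ne-1)*nedof+j) + ke(i, j)
          end do
        end do
      end do
      !$omp end parallel do

      print *, 'sum of global matrix entries =', sum(gk)

    contains

      subroutine zees_like(ne, ke)
        ! Dummy element stiffness routine: fills ke with an element-dependent value.
        integer, intent(in)  :: ne
        double precision, intent(out) :: ke(:, :)
        ke = dble(ne)
      end subroutine zees_like

    end program assemble_sketch

Apart from the scatter into gk, the iterations do not interact, which is why a near-linear speed up is expected.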
The speed up factor (SpF) here is given by the formula

    SpF = (time taken to calculate the global stiffness matrix with 1 processor) / (time taken to calculate it with N processors)
So, theoretically, SpF for N=2 is 2.
Now, we know that there is overhead associated with parallelisation, so I should not expect a speed up of exactly 2, but it should still be reasonably close.
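To make the numbers concrete: using the assembly times reported in the tests below (roughly 50 s on 1 processor and about 43 s on 2), the measured value works out to

    SpF = 50 / 43 ≈ 1.16

which is consistent with the speed up of roughly 1.2 reported throughout this page, rather than anything close to the theoretical 2.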
currentSpeeds.jpg (just click on the contents to look at currentSpeeds.jpg; I can't seem to get the image to display inline with the text right now, but hopefully I'll figure it out when I get time) shows a graph of the speed up factor against the number of threads used to calculate the stiffness matrix. The times recorded and used in the SpF calculation are for the stiffness matrix assembly alone, i.e. they do not include the linear solve or the calculation of element residuals. There are two plots, each compared against the ideal linear speed up.
_(I) Comparing with Single Processor Compilation_: This plot takes the reference single-processor time to be the time it takes cm64 to assemble the stiffness matrix; that is, I am comparing my speed up against the assembly time of the cm64 executable. You will notice that the speed up is quite bad: going from 1 to 2 processors, you only get a speed up of 1.2! This, I am sure you would agree, is not acceptable. I have tried a number of things to see if I can improve it.
- The old hpc2 SGI machine gave really good speed up; I remember getting a speed up factor of close to 2 when moving from 1 to 2 processors for my phantom problem. However, the CPUs on that machine were so slow that the parallelised code on the SGI was still slower than the serial code on the new hpc. My thinking was that the slow SGI CPUs were effectively doing a lot of work per element, and I wondered whether the new, faster hpc CPUs were simply not being given enough work per element for the benefits of parallelisation to show. The routine that takes most of the time on the new hpc is ZEES.f, which assembles each element stiffness matrix in about 0.5 s on average, so on a single processor the total time for assembling the global matrix is around 0.5 x 96, i.e. roughly 50 s. As mentioned earlier, you would expect the time with 2 processors to drop to (theoretically) 25 s, but it only goes down to 43 s or so. I therefore artificially increased the amount of work per element by increasing the number of Gauss points and by repeatedly calling ZEES.f in a do loop for each element (see the first sketch after this list). This increased the time per element to about 4 min, and the overall time for one Newton iteration of this artificially inflated workload was about 8 hrs on 1 processor and around 7 hrs on 2. The speed up factor for the assembly of the global matrix was still 1.2!
- I thought that the atomic statement in ASSEMBLE5_DYNAM.f might be slowing things down, so I commented it out to see if the timing improved. I found that removing the atomic statement still gave the correct answer for one Newton iteration, which I take to be a coincidence (see the note on ATOMIC after this list). There was no change in the speed up factor (still 1.2), so the atomic statement was evidently not what was slowing the code down for this particular problem anyway.
- I wondered whether the scheduling of the elements would change the speed, i.e. using the OMP SCHEDULE(kind, chunk size) clause. The STATIC schedule should incur the least scheduling overhead. With the default dynamic scheduling, each processor is given one element at a time and, once it finishes, it goes back for a new element; since the chunk size is only 1, there are 96 of these hand-offs. By setting SCHEDULE(STATIC,48) I ensure there are only 2 hand-offs, partitioning elements 1..48 to one thread and 49..96 to the other (see the scheduling sketch after this list). However, the speed up was still only 1.2.
- I looked through ZEES and recorded times for different parts of the code, trying to see whether there were any do loops I could reorder to make better use of the cache (see the loop-ordering note after this list). However, nothing obvious stood out.
- I tried example 8113 and found that the speed up factor was still 1.2. All of the tests above were done on my own version of CMISS, but I have also run some tests with my phantom on the global version of cmiss and I still get the same SpF of 1.2. So, is it to do with how I am measuring time? I am using the timing information cmiss gives via the CPU_TIMER call. I have also used the compute partition timer (the emails we receive on submitting and completing a job) and found the same speed up factor of 1.2, so I wonder whether it is down to the compilation options (see the timing note after this list).
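Sketch of the artificial work-per-element test from the first bullet above. It is a drop-in variation of the element loop in the first sketch; nrepeat is illustrative (chosen so that roughly 0.5 s per element becomes roughly 4 min), and in the real test the number of Gauss points was also increased and ZEES.f itself was the routine being called repeatedly.

    ! Drop-in variation of the element loop in the first sketch: the element
    ! stiffness routine is called nrepeat times so that each loop iteration
    ! carries much more work.
    integer, parameter :: nrepeat = 500   ! illustrative: 500 x 0.5 s ~ 4 min per element
    integer :: irep

    !$omp parallel do private(ne, irep, ke, i, j)
    do ne = 1, nel
      do irep = 1, nrepeat
        call zees_like(ne, ke)            ! repeated calls = more work per element
      end do
      ! ... scatter ke into gk exactly as in the first sketch ...
    end do
    !$omp end parallel do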
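On the ATOMIC test in the second bullet: the directive only matters when two threads can add into the same global matrix entry at the same time, which happens whenever neighbouring elements share nodes (and hence global rows and columns). Getting the right answer without the directive on a particular run does not make the unguarded update safe in general, but the unchanged timing does support the conclusion that the atomic is not the bottleneck here. In the notation of the first sketch (irow and jcol standing for the global row and column indices):

    ! With the directive: safe for any mesh; each update into the shared
    ! global matrix is synchronised.
    !$omp atomic
    gk(irow, jcol) = gk(irow, jcol) + ke(i, j)

    ! Without the directive: a data race whenever two elements contribute to
    ! the same (irow, jcol). It can still happen to give the right answer on a
    ! given run, as observed above, but that is not guaranteed.
    gk(irow, jcol) = gk(irow, jcol) + ke(i, j)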
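Sketch of the scheduling variants from the third bullet; the clause simply goes on the PARALLEL DO of the first sketch.

    ! One element at a time: after finishing an element, a thread goes back to
    ! the runtime for the next one, so with 96 elements there are ~96 hand-offs.
    !$omp parallel do schedule(dynamic, 1) private(ne, ke, i, j)

    ! Two fixed chunks, decided once at loop entry: with 2 threads, elements
    ! 1..48 go to one thread and 49..96 to the other, i.e. only 2 hand-offs.
    !$omp parallel do schedule(static, 48) private(ne, ke, i, j)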
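On the loop reordering in the fourth bullet: the main thing to look for in Fortran is that the innermost loop runs over the first array index, because Fortran stores arrays column by column. A generic illustration (not taken from ZEES.f):

    ! Cache-friendly: the inner loop walks contiguously down a column.
    do j = 1, n
      do i = 1, n
        c(i, j) = c(i, j) + a(i, j)*b(j)
      end do
    end do

    ! Cache-unfriendly: the inner loop strides across a row, touching a new
    ! column (and usually a new cache line) on every iteration.
    do i = 1, n
      do j = 1, n
        c(i, j) = c(i, j) + a(i, j)*b(j)
      end do
    end do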
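On the timing question in the last bullet: I am assuming here that a CPU-time counter sums the time used by all threads, in which case a well-parallelised loop can show little or no reduction in reported CPU time even when the elapsed time halves; the speed up factor should really be based on wall-clock time. Since the compute partition (elapsed) timer gave the same 1.2, this does not explain the result on its own, but it is worth timing the assembly loop directly with a wall-clock timer, e.g. (assemble_global is a placeholder for whatever wraps the parallel element loop):

    ! Wall-clock timing of the assembly loop alone.
    use omp_lib
    double precision :: t0, t1

    t0 = omp_get_wtime()
    call assemble_global()                ! placeholder for the parallel element loop
    t1 = omp_get_wtime()
    write(*,'(a,f10.3,a)') 'assembly wall time: ', t1 - t0, ' s'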
_(II) Comparing with Multiprocessor Compilation with 1 Processor_:
In this plot, my reference time was the time cm64-mt took to assemble my stiffness matrix using a single processor. Now, from 1 to 2 processors, the speed up factor is 1.8, but that is only because the single-processor cm64-mt job took longer than the cm64 executable. You will also notice that beyond 4 processors the speed up factor deteriorates.
Just as a quick test, I commented out the call to assemble5_dynam inside assemble5 and ran cm64 and cm64-mt with 1 processor on my problem. I wanted to know how much time is spent purely on the memory allocation and deallocation calls inside the parallelised loop; these calls set up the arrays used inside assemble5_dynam -> zees. In terms of CPU time, the output for cm64 said 0 s, while the output for cm64-mt with 1 processor said 0.1324e-2 s. To me this just means that the allocation and deallocation of memory does not drastically increase the computation time in a multi-threaded compilation of cmiss.
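For reference, the pattern being timed there is roughly the following (a sketch with illustrative array names and sizes; in cm the arrays are the ones set up for ASSEMBLE5_DYNAM -> ZEES):

    ! Per-iteration work arrays allocated and freed inside the parallel loop,
    ! with the element computation itself commented out, as in the test above.
    double precision, allocatable :: work(:), ke(:, :)
    integer :: ne, istat

    !$omp parallel do private(ne, work, ke, istat)
    do ne = 1, nel
      allocate(work(1000), ke(nedof, nedof), stat=istat)
      ! call zees_like(ne, ke)            ! <- call commented out for the test
      deallocate(work, ke)
    end do
    !$omp end parallel do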