Random number generation with OpenMP
Generating random numbers in the shared-memory parallel processing environment of OpenMP has some traps for the unwary. The problem is immediately seen when using the Fortran intrinsic subroutine random_number(), which generates a pseudo-random sequence of uniformly distributed random variates (RVs). Invoking random_number() from different threads can make an OpenMP program run surprisingly slowly. What's more, the statistical quality of the generated sequence may be compromised. The reason for this problem is discussed, and a solution is described.
The Cause of the Problem
All (pseudo) random number generators (RNGs) work in basically the same way. The current ‘state’ of the RNG is held in a bit vector (typically 4 or more bytes) and every time the RNG is called a new state vector is computed from the old one, in a deterministic fashion. The new state replaces the old state, and the RNG function returns a value that is determined from the state value. In the case of
random_number(), the value returned is a real value uniformly distributed between 0.0 and 1.0. Other RNGs return random integers, normally distributed reals, etc. In every case at the heart of the RNG there is a state vector, which without loss of generality we can take to be stored as a 4-byte integer IR4.
Because each new state value is computed from the previous one, the integer variable IR4 must have the
SAVE attribute (equivalently, in C, it is static), i.e. its value is retained in memory between calls to
random_number(). The fact that this variable is not local means that invocations of
random_number() from different threads write to the same memory address for IR4. This leads to contention between threads, and results in big delays.
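To make the shared-state pattern concrete, here is a minimal sketch of a generator of this kind. It uses the xorshift update from Marsaglia's SHR3 generator (the same update that appears inside the Ziggurat code discussed later); it is not the actual implementation of random_number(), whose internals are processor dependent, and the function name is chosen for illustration only.

```fortran
! Sketch only: the SAVEd state ir4 plays the role of IR4 above. Every
! call, from every thread, reads and writes this one word of memory --
! this is the source of the contention described in the text.
function shared_uni() result( fn_val )
   real :: fn_val
   integer, save :: ir4 = 123456789      ! one state word, shared by ALL threads
   integer :: jz
   jz  = ir4
   ir4 = ieor( ir4, ishft( ir4,  13 ) )  ! deterministic update of the state
   ir4 = ieor( ir4, ishft( ir4, -17 ) )  ! (Marsaglia's SHR3 xorshift step)
   ir4 = ieor( ir4, ishft( ir4,   5 ) )
   fn_val = 0.5 + 0.2328306e-9 * real( jz + ir4 )  ! map integer state to (0,1)
end function shared_uni
```

With one OpenMP thread this is harmless; with several, all threads hammer the same cache line holding ir4, and the unsynchronized read-modify-write can also corrupt the sequence.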
The simple test program test_random_number.f90 illustrates the behaviour. The program should be built with the command:
xlf90_r -q64 -qsmp=omp -qsuffix=f=f90 test_random_number.f90 -o test_random_f90
and executed with the environment variable OMP_NUM_THREADS set to N, where N is the number of threads to use.
The following table shows execution time for generating 100 million uniform random variates using different numbers of threads.
These results show that using
random_number() naively with OpenMP can lead to extreme inefficiency. Adding more threads makes the program run more slowly.
Probably the simplest way to overcome the contention problem with an RNG is to create a number of separate RNGs, one to be used by each thread. If there is a straightforward way to do this using the
random_number() intrinsic, I’m not aware of it. In any case, for performance (i.e. raw speed) I haven’t found anything faster than Marsaglia’s Ziggurat algorithm for normal RVs, and I generally implement this in Monte Carlo programs that make very heavy use of a normal RNG. Having the source code for an RNG (in this case Ziggurat) makes it easy to modify it for parallel operation.
The Fortran90 version of Ziggurat has several variables and arrays with the
SAVE attribute. Here are the relevant lines from the original code:
integer, save :: jsr=123456789, kn(0:127), ke(0:255)
real(DP), save :: wn(0:127), fn(0:127), we(0:255), fe(0:255)
Of these variables, only
jsr is used for the generation of uniform RVs (this is the ‘state’ variable IR4 referred to above). The other variables are used for generating RVs from the normal and exponential distributions.
Focussing on the uniform RNG case initially, the scalar variable jsr needs to become an array, say par_jsr() of length NT, where NT is the number of threads that will be required. For flexibility it is advantageous to make
par_jsr() an allocatable array, and allocate it to the desired size when the RNG is initialized. In the scalar case,
jsr is initialized to the specified seed value, so in the parallel case the array
par_jsr() must be initialized to the array of seed values passed to the initialization subroutine.
In order to parallelize the uniform RNG, then, it appears that the required changes are:
Change the declaration from:
integer, save :: jsr
to:
integer, allocatable, save :: par_jsr(:)
Change the initialization subroutine from:
subroutine zigset( jsrseed )
   integer, intent(in) :: jsrseed
   jsr = jsrseed
   …
end
to:
subroutine par_zigset( npar, par_jsrseed )
   integer, intent(in) :: npar, par_jsrseed(npar)
   integer :: kpar
   allocate( par_jsr(npar*par_step) )
   do kpar = 1, npar
      par_jsr( (kpar-1)*par_step + 1 ) = par_jsrseed(kpar)
   enddo
   …
end
Then when the function
par_uni() is called to return a uniform RV, we must pass the thread number
kpar, and use the corresponding state value.
Instead of:

function uni() result( fn_val )
   real(DP) :: fn_val
   …

we now have:

function par_uni( kpar ) result( fn_val )
   integer :: kpar
   real(DP) :: fn_val
   …
In the calling program the thread number (in the range 0 to npar-1) is obtained by an OpenMP function call:
kpar = omp_get_thread_num()
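The calling pattern can be sketched as follows. The module name par_zig_mod is assumed from the file name par_zig_mod.f90, the seed values are arbitrary, and par_zigset() is shown with the two-argument interface above (the final module also receives the spacing argument described later):

```fortran
! Sketch of using the parallelized Ziggurat uniform RNG from OpenMP.
! Run with OMP_NUM_THREADS=4 to match npar.
program use_par_uni
   use par_zig_mod          ! assumed module name, from par_zig_mod.f90
   use omp_lib
   implicit none
   integer, parameter :: npar = 4
   integer :: seeds(npar), kpar, i
   real(8) :: x
   seeds = (/ (123456789 + 7*i, i = 1, npar) /)   ! distinct seed per thread
   call par_zigset( npar, seeds )
!$omp parallel private(kpar, i, x)
   kpar = omp_get_thread_num()    ! 0 .. npar-1
   do i = 1, 1000
      x = par_uni(kpar)           ! each thread touches only its own state
   enddo
!$omp end parallel
end program use_par_uni
```

Giving each thread a distinct seed is essential: threads seeded identically would produce identical (correlated) streams.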
The complete code for these modifications is not shown, because in fact the OpenMP performance on the IBM Regatta/AIX with these changes is abysmal. The reason for this (thanks Karl Tomlinson) turned out to be the fetch size and caching protocols employed on the IBM machine. Memory accesses are in 128-byte chunks, implying that a thread attempting to update an entry in the
par_jsr() array is likely to clash with other threads that are accessing other entries in the array. A way to get around this problem is to use an enlarged
par_jsr() array, and space out the values of interest with a gap of at least 128 bytes between each value. In this way we ensure that there will be no caching interaction between calls to
par_uni() from different threads. The size of the spacing between the state values must also be passed to the initialization subroutine and SAVEd, to be used in accessing the correct entry in
par_jsr(). In the RNG code
par_step is the spacing in 4-byte integers, therefore on the IBM machine
par_step should be at least 32.
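Putting the pieces together, par_uni() might look like the sketch below, assuming DP and the module variables par_jsr and par_step are declared in the module as in the Ziggurat source:

```fortran
! Sketch: par_jsr() has size npar*par_step, and thread kpar (0-based)
! owns the entry at index kpar*par_step + 1. With par_step = 32
! (i.e. 128 bytes / 4 bytes per integer), neighbouring threads' states
! lie on different 128-byte lines, so there is no false sharing.
function par_uni( kpar ) result( fn_val )
   integer, intent(in) :: kpar
   real(DP) :: fn_val
   integer :: jsr, jz
   jsr = par_jsr( kpar*par_step + 1 )     ! load this thread's private state
   jz  = jsr
   jsr = ieor( jsr, ishft( jsr,  13 ) )   ! SHR3 xorshift update
   jsr = ieor( jsr, ishft( jsr, -17 ) )
   jsr = ieor( jsr, ishft( jsr,   5 ) )
   par_jsr( kpar*par_step + 1 ) = jsr     ! store it back
   fn_val = 0.5_DP + 0.2328306e-9_DP * real( jz + jsr, DP )
end function par_uni
```

The cost of the scheme is memory only: the state array is par_step times larger than strictly needed, a negligible price for contention-free access.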
Normal and Exponential RNGs
The uniform integer generator par_shr3() that is called by par_uni() is also at the heart of the normal and exponential RNGs, par_rnor() and par_rexp(). In addition these functions use a number of SAVEd arrays, which must be replaced by their parallel versions, e.g.
wn(0:127) is replaced by
par_wn(0:127,0:NT-1) (where of course allocatable arrays are used as with
par_jsr()). Note that the problem of thread contention in accessing adjacent array entries is bypassed here by making the thread number the second array index. If it were made the first index the same problem would arise. Because Fortran arrays are stored in memory column-wise (first index varying fastest), the size of these data arrays (at least 128 8-byte values) ensures that values accessed by different threads are always more than 128 bytes apart.
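A sketch of what the parallel table declarations might look like (allocatable, as with par_jsr(); the par_ names follow the convention above, and the exact shapes are assumptions based on the original declarations):

```fortran
! Thread number is the SECOND index: in Fortran's column-major storage
! each thread's column of 128 8-byte reals (1024 bytes) is contiguous,
! so different threads' working sets sit well over 128 bytes apart.
integer,  allocatable, save :: par_kn(:,:), par_ke(:,:)
real(DP), allocatable, save :: par_wn(:,:), par_fn(:,:), &
                               par_we(:,:), par_fe(:,:)
```

with allocations such as allocate( par_wn(0:127, 0:npar-1) ) performed in par_zigset().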
Ziggurat code incorporating all these modifications is shown in par_zig_mod.f90. A test program to investigate timing of the uniform RNG function
par_uni() is test_par_zig.f90, which generates 1000 million RVs. Timing results with different numbers of processors are shown in the table below. The load average on the IBM machine was <10%.