Personal tools
You are here: Home openCMISS Wiki Random number generation with OpenMP
Views

Random number generation with OpenMP

last edited 7 years ago by mbog002

Introduction

Generating random numbers in the shared-memory parallel processing environment of OpenMP has some traps for the unwary. The problem is immediately seen when using the Fortran intrinsic function random_number(), which generates a pseudo-random sequence of uniformly distributed random variates (RVs). Invoking random_number() from different threads can make an OpenMP program run surprisingly slowly. What’s more, the random performance of the generator may be compromised. The reason for this problem is discussed, and a solution is described.

The Cause of the Problem

All (pseudo) random number generators (RNGs) work in basically the same way. The current ‘state’ of the RNG is held in a bit vector (typically 4 or more bytes) and every time the RNG is called a new state vector is computed from the old one, in a deterministic fashion. The new state replaces the old state, and the RNG function returns a value that is determined from the state value. In the case of random_number(), the value returned is a real value uniformly distributed between 0.0 and 1.0. Other RNGs return random integers, normally distributed reals, etc. In every case at the heart of the RNG there is a state vector, which without loss of generality we can take to be stored as a 4-byte integer IR4.

Because each new state value is computed from the previous one, the integer variable IR4 must have the SAVE attribute (equivalently, in C, it is static), i.e. its value is retained in memory between calls to random_number(). The fact that this variable is not local means that invocations of random_number() from different threads write to the same memory address for IR4. This leads to contention between threads, and results in big delays.

The simple test program test_random_number.f90 illustrates the behaviour. The program should be built with the command:

  xlf90_r –q64 –qsmp=omp –qsuffix=f=f90 test_random_number.f90 –o test_random_f90

and executed with:

  test_random_number N

where N is the number of threads to use.

The following table shows execution time for generating 100 million uniform random variates using different numbers of threads.

Threads Seconds
1 1.6
2 17.8
3 19.6
4 18.3
5 19.6
6 17.8
7 17.6
8 17.1

These results show that using random_number() naively with OpenMP can lead to extreme inefficiency. Adding more threads makes the program run more slowly.

A Solution

Probably the simplest way to overcome the contention problem with an RNG is to create a number of separate RNGs, one to used by each thread. If there is a straightforward way to do this using the random_number() intrinsic, I’m not aware of it. In any case, for performance (i.e. raw speed) I haven’t found anything faster than Marsaglia’s Ziggurat algorithm for normal RVs, and I generally implement this in Monte Carlo programs that make very heavy use of a normal RNG. Having the source code for an RNG (in this case Ziggurat) makes it easy to modify it for parallel operation.

The Fortran90 version of Ziggurat has several variables and arrays with the SAVE attribute. Here are the relevant lines from the original code:

  integer, save  :: jsr=123456789, kn(0:127), ke(0:255) 
  real(DP), save :: wn(0:127), fn(0:127), we(0:255), fe(0:255)

Of these variables, only jsr is used for the generation of uniform RVs (this is the ‘state’ variable IR4 referred to above). The other variables are used for generating RVs from the normal and exponential distributions.

Uniform RNG

Focussing on the uniform RNG case initially, the scalar variable jsr needs to become an array, say par_jsr(NT), where NT is the number of threads that will be required. For flexibility it is advantageous to make par_jsr() an allocatable array, and allocate it to the desired size when the RNG is initialized. In the scalar case, jsr is initialized to the specified seed value, so in the parallel case the array par_jsr() must be initialized to the array of seed values passed to the initialization subroutine.

In order to parallelize the uniform RNG, then, it appears that the required changes are:

Replace:

  integer, save :: jsr

with:

  integer, allocatable, save :: par_jsr(:)

Change the initialization subroutine from:

  subroutine zigset(jsrseed)
  integer, intent(in) :: jsrseed
  jsr = jsrseed
  …
  end

to:

  subroutine par_zigset( npar, par_jsrseed)
  integer, intent(in) :: npar, par_jsrseed(0:1)
  integer :: kpar
  allocate(par_jsr(npar*par_step))
  do kpar = 1,npar
      par_jsr(kpar) = par_jsrseed(kpar)
  enddo
  …
  end

Then when the function par_uni() is called to return a uniform RV, we must pass the thread number kpar, and use the corresponding state value.

Instead of:

  function uni() result( fn_val )
  real(DP) :: fn_val
  …

we now have:

  function par_uni(kpar) result( fn_val )
  integer :: kpar
  real(DP):: fn_val
  …

In the calling program the thread number (in the range 0 - npar -1) is obtained by an OpenMP function call:

  kpar = omp_get_thread_num()

The complete code for these modifications is not shown, because in fact the OpenMP performance on the IBM Regatta/AIX with these changes is abysmal. The reason for this (thanks Karl Tomlinson) turned out to be the fetch size and caching protocols employed on the IBM machine. Memory accesses are in 128-byte chunks, implying that a thread attempting to update an entry in the par_jsr() array is likely to clash with other threads that are accessing other entries in the array. A way to get around this problem is to use an enlarged par_jsr() array, and space out the values of interest with a gap of at least 128 bytes between each value. In this way we ensure that there will be no caching interaction between calls to par_uni() from different threads. The size of the spacing between the state values must also be passed to the initialization subroutine, and SAVE d to be used in accessing the correct entry in par_jsr(). In the RNG code par_step is the spacing in 4-byte integers, therefore on the IBM machine par_step should be at least 32.

Normal and Exponential RNGs

The function par_shr3() that is called by par_uni() is also at the heart of the normal and exponential RNGs, par_rnor() and par_rexp(). In addition these functions use a number of SAVE d arrays, which must be replaced by their parallel versions, e.g. wn(0:127) is replaced by par_wn(0:127,0:NT-1) (where of course allocatable arrays are used as with par_jsr()). Note that the problem of thread contention in accessing adjacent array entries is bypassed here by making the thread number the second array index. If it were made the first index the same problem would arise. Because Fortran arrays are stored in memory column-wise (first index varying fastest), the size of these data arrays (at least 128 8-byte values) ensures that values accessed by different threads are always more than 128 bytes apart.

Results

Ziggurat code incorporating all these modifications is shown in par_zig_mod.f90. A test program to investigate timing of the uniform RNG function par_uni() is test_par_zig.f90, which generates 1000 million RVs. Timing results with different numbers of processors are shown in the table below. The load average on the IBM machine was <10%.

Threads Seconds
1 48.0
2 24.6
4 12.7
8 6.63
16 3.63
32 1.36