Random number generation with OpenMP

Introduction

Generating random numbers in the shared-memory parallel processing environment of OpenMP has some traps for the unwary. The problem is immediately seen when using the Fortran intrinsic function 'random_number()', which generates a pseudo-random sequence of uniformly distributed random variates (RVs). Invoking 'random_number()' from different threads can make an OpenMP program run surprisingly slowly. What’s more, the random performance of the generator may be compromised. The reason for this problem is discussed, and a solution is described.

The Cause of the Problem

All (pseudo) random number generators (RNGs) work in basically the same way. The current ‘state’ of the RNG is held in a bit vector (typically 4 or more bytes) and every time the RNG is called a new state vector is computed from the old one, in a deterministic fashion. The new state replaces the old state, and the RNG function returns a value that is determined from the state value. In the case of 'random_number()', the value returned is a real value uniformly distributed between 0.0 and 1.0. Other RNGs return random integers, normally distributed reals, etc. In every case at the heart of the RNG there is a state vector, which without loss of generality we can take to be stored as a 4-byte integer IR4.

Because each new state value is computed from the previous one, the integer variable IR4 must have the 'SAVE' attribute (equivalently, in C, it is static), i.e. its value is retained in memory between calls to 'random_number()'. The fact that this variable is not local means that invocations of 'random_number()' from different threads write to the same memory address for IR4. This leads to contention between threads, and results in big delays.

The simple test program <a href="test_random_number.f90">test_random_number.f90</a>

illustrates the behaviour. The program should be built with the command:

xlf90_r –q64 –qsmp=omp –qsuffix=f=f90 test_random_number.f90 –o test_random_f90

and executed with:

test_random_number N

where N is the number of threads to use.

The following table shows execution time for generating 100 million uniform random variates using different numbers of threads.

<table width="150" border="0" cellspacing="0" cellpadding="0">

<tr>: <td><strong>Threads</strong></td> <td><strong>Seconds</strong></td>

</tr> <tr>

<td>1</td> <td> 1.6</td>

</tr> <tr>

<td>2</td> <td>17.8</td>

</tr> <tr>

<td>3</td> <td>19.6</td>

</tr> <tr>

<td>4</td> <td>18.3</td>

</tr> <tr>

<td>5</td> <td>19.6</td>

</tr> <tr>

<td>6</td> <td>17.8</td>

</tr> <tr>

<td>7</td> <td>17.6</td>

</tr> <tr>

<td>8</td> <td>17.1</td>

</tr>

</table> <br />

These results show that using 'random_number()' naively with OpenMP can lead to extreme inefficiency. Adding more threads makes the program run more slowly.

A Solution

Probably the simplest way to overcome the contention problem with an RNG is to create a number of separate RNGs, one to used by each thread. If there is a straightforward way to do this using the 'random_number()' intrinsic, I’m not aware of it. In any case, for performance (i.e. raw speed) I haven’t found anything faster than Marsaglia’s Ziggurat algorithm for normal RVs, and I generally implement this in Monte Carlo programs that make very heavy use of a normal RNG. Having the source code for an RNG (in this case Ziggurat) makes it easy to modify it for parallel operation.

The Fortran90 version of Ziggurat has several variables and arrays with the 'SAVE' attribute. Here are the relevant lines from the original code:

integer, save  :: jsr=123456789, kn(0:127), ke(0:255)
real(DP), save :: wn(0:127), fn(0:127), we(0:255), fe(0:255)

Of these variables, only 'jsr' is used for the generation of uniform RVs (this is the ‘state’ variable IR4 referred to above). The other variables are used for generating RVs from the normal and exponential distributions.

Uniform RNG

Focussing on the uniform RNG case initially, the scalar variable jsr needs to become an array, say 'par_jsr(NT)', where 'NT' is the number of threads that will be required. For flexibility it is advantageous to make 'par_jsr()' an allocatable array, and allocate it to the desired size when the RNG is initialized. In the scalar case, 'jsr' is initialized to the specified seed value, so in the parallel case the array 'par_jsr()' must be initialized to the array of seed values passed to the initialization subroutine.

In order to parallelize the uniform RNG, then, it appears that the required changes are:

Replace:

integer, save :: jsr

with:

integer, allocatable, save :: par_jsr(:)

Change the initialization subroutine from:

subroutine zigset(jsrseed)
integer, intent(in) :: jsrseed
jsr = jsrseed
…
end

to:

subroutine par_zigset( npar, par_jsrseed)
integer, intent(in) :: npar, par_jsrseed(0:1)
integer :: kpar
allocate(par_jsr(npar*par_step))
do kpar = 1,npar
    par_jsr(kpar) = par_jsrseed(kpar)
enddo
…
end

Then when the function 'par_uni()' is called to return a uniform RV, we must pass the thread number 'kpar', and use the corresponding state value.

Instead of:

function uni() result( fn_val )
real(DP) :: fn_val
…

we now have:

function par_uni(kpar) result( fn_val )
integer :: kpar
real(DP):: fn_val
…

In the calling program the thread number (in the range 0 - 'npar' -1) is obtained by an OpenMP function call:

kpar = omp_get_thread_num()

The complete code for these modifications is not shown, because in fact the OpenMP performance on the IBM Regatta/AIX with these changes is abysmal. The reason for this (thanks Karl Tomlinson) turned out to be the fetch size and caching protocols employed on the IBM machine. Memory accesses are in 128-byte chunks, implying that a thread attempting to update an entry in the 'par_jsr()' array is likely to clash with other threads that are accessing other entries in the array. A way to get around this problem is to use an enlarged 'par_jsr()' array, and space out the values of interest with a gap of at least 128 bytes between each value. In this way we ensure that there will be no caching interaction between calls to 'par_uni()' from different threads. The size of the spacing between the state values must also be passed to the initialization subroutine, and 'SAVE' d to be used in accessing the correct entry in 'par_jsr()'. In the RNG code 'par_step' is the spacing in 4-byte integers, therefore on the IBM machine 'par_step' should be at least 32.

Normal and Exponential RNGs

The function 'par_shr3()' that is called by 'par_uni()' is also at the heart of the normal and exponential RNGs, 'par_rnor()' and 'par_rexp()'. In addition these functions use a number of 'SAVE' d arrays, which must be replaced by their parallel versions, e.g. 'wn(0:127)' is replaced by 'par_wn(0:127,0:NT-1') (where of course allocatable arrays are used as with 'par_jsr()'). Note that the problem of thread contention in accessing adjacent array entries is bypassed here by making the thread number the second array index. If it were made the first index the same problem would arise. Because Fortran arrays are stored in memory column-wise (first index varying fastest), the size of these data arrays (at least 128 8-byte values) ensures that values accessed by different threads are always more than 128 bytes apart.

Results

Ziggurat code incorporating all these modifications is shown in <a href="par_zig_mod.f90">par_zig_mod.f90</a>. A test program to investigate timing of the uniform RNG function 'par_uni()' is <a href="test_par_zig.f90">test_par_zig.f90</a>, which generates 1000 million RVs. Timing results with different numbers of processors are shown in the table below. The load average on the IBM machine was <10%.

<table width="150" border="0" cellspacing="0" cellpadding="0">

<tr>: <td><strong>Threads</strong></td> <td><strong>Seconds</strong></td>

</tr> <tr>

<td>1</td> <td>48.0</td>

</tr> <tr>

<td>2</td> <td>24.6</td>

</tr> <tr>

<td>4</td> <td>12.7</td>

</tr> <tr>

<td>8</td> <td>6.63</td>

</tr> <tr>

<td>16</td> <td>3.63</td>

</tr> <tr>

<td>32</td> <td>1.36</td>

</tr>

</table> <br />

Personal tools

Random number generation with OpenMP