An application of OpenMP and Graphics Processing Units parallel threads
Numerical distribution functions of unit root tests

Lately, I have been learning to programme on a Graphics Processing Unit (GPU). Thanks to the CUDA computing platform, it is relatively easy to programme on these devices. Although the syntax is simple, it requires some experimentation, trial and error, and knowledge of some details of the architecture of these systems; sometimes the algorithm also needs to be adapted to the application at hand. Thus, it is not always easy to know beforehand whether implementing an application on the GPU will bring any advantage over parallelization on the CPU based, for example, on OpenMP.

In this post I summarize the results of a simulation exercise in which I replicate a small part of the simulations described in Díaz-Emparanza (2014)[1]. I give the details of the exercise in this draft document, which I am planning to complete and update. The code is still a development version; it is available upon request.

The table below reports the timings in different settings. The first column labels the kind of implementation: a sequential version based on the GSL interface to fit a linear regression model (GSL), a version parallelized with OpenMP where the suffix gives the number of threads (GSL-OpenMP-1, -2, -4 or -8), or a version written in CUDA to run on the GPU.
The platforms (processor or environment) where the program was run are: a PC with an Intel Pentium(R) G2030 processor @3.00GHz with two cores and one thread per core; a computer with an Intel(R) i7-2760QM processor @2.40GHz with four cores and two threads per core; and the GeForce GTX 660 GPU (installed on the PC with the G2030 processor).

Times of computations
Code            Platform     Time
GSL             G2030        12m 39s
GSL-OpenMP-2    G2030         6m 28s
GSL-OpenMP-4    G2030         6m 26s
CUDA            GTX 660       2m 04s
GSL-OpenMP-1    i7-2760QM    11m 31s
GSL-OpenMP-4    i7-2760QM     3m 59s
GSL-OpenMP-8    i7-2760QM     2m 53s
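
The speed-ups implied by the table can be checked with a few lines of C. This is only a small arithmetic sketch: the function names (to_seconds, speedup, print_speedups) are made up for this illustration, and the timings are simply copied from the table above.

```c
#include <assert.h>
#include <stdio.h>

/* Convert a timing given as minutes and seconds into seconds. */
int to_seconds(int minutes, int seconds)
{
    assert(minutes >= 0 && seconds >= 0 && seconds < 60);
    return 60 * minutes + seconds;
}

/* Speed-up of a run relative to a baseline, both measured in seconds. */
double speedup(int baseline_s, int run_s)
{
    assert(baseline_s > 0 && run_s > 0);
    return (double)baseline_s / (double)run_s;
}

/* Print the ratios implied by the table (values copied from it). */
void print_speedups(void)
{
    int gsl_g2030   = to_seconds(12, 39); /* GSL,          G2030     */
    int omp4_g2030  = to_seconds(6, 26);  /* GSL-OpenMP-4, G2030     */
    int cuda_gtx660 = to_seconds(2, 4);   /* CUDA,         GTX 660   */
    int omp1_i7     = to_seconds(11, 31); /* GSL-OpenMP-1, i7-2760QM */
    int omp8_i7     = to_seconds(2, 53);  /* GSL-OpenMP-8, i7-2760QM */

    printf("CUDA vs GSL on G2030:          %.2f\n",
           speedup(gsl_g2030, cuda_gtx660));
    printf("CUDA vs GSL-OpenMP-4 on G2030: %.2f\n",
           speedup(omp4_g2030, cuda_gtx660));
    printf("OpenMP-8 vs OpenMP-1 on i7:    %.2f\n",
           speedup(omp1_i7, omp8_i7));
}
```

Calling print_speedups() from a main function gives ratios of roughly 6.1, 3.1 and 4.0 respectively.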

On the PC with the G2030 processor, the CUDA implementation is the fastest: it is around six times faster than the sequential version (2m 04s against 12m 39s) and around three times faster than the best time achieved on this CPU (6m 26s). The kernel was launched with 5 blocks and 200 threads per block; each thread carries out 100 iterations, which gives the total of 5 x 200 x 100 = 100,000 iterations.
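
To make the launch configuration concrete, the sketch below walks the same grid on the CPU in plain C and checks that the 100,000 iterations are covered exactly once. It is not the kernel used in the exercise: the function names are hypothetical, and the assumption that each thread takes a contiguous chunk of 100 iterations is mine. The only CUDA fact used is that, inside a kernel, the global index of a thread in a one-dimensional grid is blockIdx.x * blockDim.x + threadIdx.x.

```c
#include <assert.h>
#include <stdlib.h>

/* Launch configuration reported in the text. */
enum { BLOCKS = 5, THREADS_PER_BLOCK = 200, ITERS_PER_THREAD = 100 };

/* Global index of a thread; in a CUDA kernel this would be
   blockIdx.x * blockDim.x + threadIdx.x. */
int global_thread_id(int block, int thread)
{
    assert(block >= 0 && block < BLOCKS);
    assert(thread >= 0 && thread < THREADS_PER_BLOCK);
    return block * THREADS_PER_BLOCK + thread;
}

/* First iteration handled by a thread.
   Contiguous chunks per thread are an assumption of this sketch. */
int first_iteration(int tid)
{
    return tid * ITERS_PER_THREAD;
}

/* Emulate the grid sequentially and return 1 if every one of the
   BLOCKS * THREADS_PER_BLOCK * ITERS_PER_THREAD iterations is visited
   exactly once, 0 otherwise. */
int grid_covers_all_iterations(void)
{
    const int total = BLOCKS * THREADS_PER_BLOCK * ITERS_PER_THREAD;
    unsigned char *visits = calloc((size_t)total, 1);
    int ok = 1;

    if (visits == NULL)
        return 0;

    for (int b = 0; b < BLOCKS; b++) {
        for (int t = 0; t < THREADS_PER_BLOCK; t++) {
            int start = first_iteration(global_thread_id(b, t));
            for (int k = 0; k < ITERS_PER_THREAD; k++)
                visits[start + k]++;
        }
    }
    for (int i = 0; i < total; i++) {
        if (visits[i] != 1)
            ok = 0;
    }
    free(visits);
    return ok;
}
```

With this mapping, the last thread of the last block (block 4, thread 199) has global index 999 and handles iterations 99,900 to 99,999.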

The i7-2760QM processor can handle up to eight threads. In that case, the timing of the version based on OpenMP (Code=GSL-OpenMP-8, 2m 53s) is close to the timing observed for the GPU (2m 04s).

In a more powerful environment, for example the Arina cluster, increasing the number of threads run on the CPU could probably beat the GPU. But a more powerful GPU, such as the Tesla K20, might reduce the timings of the CUDA programme as well.

The parallel structure of the GPU was useful for the simulations involved in this exercise. Although the syntax is simple, using it required adapting the whole process and the algorithm used to obtain the desired test statistics (details are given in the draft document linked above).

[1] Díaz-Emparanza, I. (2014). "Numerical Distribution Functions for Seasonal Unit Root Tests". Computational Statistics and Data Analysis, 76, pp. 237-247.
DOI: 10.1016/j.csda.2013.03.006.