Skip to content

GPGPU Verification

Bob Rubbens edited this page Nov 3, 2021 · 7 revisions

This section explains how to verify GPGPU programs in VerCors. The tool supports both OpenCL and CUDA languages. The synchronization feature (i.e., barrier) in these languages is also supported by VerCors. To demonstrate GPGPU verification, we show two examples in both OpenCl and CUDA, one without barrier and the other one with barrier.

OpenCL/CUDA without Barrier

This simple method shows an OpenCL program that adds two arrays and stores the result in another array:

#include <opencl.h>
/* 1  */ /*@
         context \pointer_index(a, get_global_id(0), read);
         context \pointer_index(b, get_global_id(0), read);
         context \pointer_index(c, get_global_id(0), write);
         ensures c[get_global_id(0)] == a[get_global_id(0)] + b[get_global_id(0)];
         @*/
/* 7  */ __kernel void openCLAdd(int a[], int b[], int c[]) {
/* 8  */    int tid = get_global_id(0);
/* 9  */    c[tid] = a[tid] + b[tid];
/* 10 */ }  

and this method shows the same example in CUDA:

#include <cuda.h>
/* 1  */  /*@
          context \pointer_index(a, threadIdx.x, read);
          context \pointer_index(b, threadIdx.x, read);
          context \pointer_index(c, threadIdx.x, write);
          ensures c[threadIdx.x] == a[threadIdx.x] + b[threadIdx.x];
          @*/
/* 7  */  __global__ void CUDAAdd(int a[], int b[], int c[]) {
/* 8  */     int tid = threadIdx.x;
/* 9  */     c[tid] = a[tid] + b[tid];
/* 10 */  }  

In both examples, first we should include the header files (i.g., opencl.h and cuda.h). Next we obtain thread identifiers and then each thread does the computation (lines 8-9 of OpenCL and 8-9 of CUDA examples). As we can see obtaining the global thread identifiers are different in OpenCL and CUDA.

In the specification of the methods, we specify read permission for each thread in arrays "a" and "b" and write permission in array "c" as pre- and postcondition. The keyword "\pointer_index" is used with three arguments, array name, index and permission to indicate which thread has what permission to which location. Finally in the last postcondition we specify the result of the methods that each location in array "c" contains the sum of the values in the corresponding locations in arrays "a" and "b". Note that in the specification we can use the shorthand keyword "\gid" instead of "get_global_id(0)" and "threadIdx.x" in the tool.

Note: In CUDA, to compute the global id, normally threadIdx is combined with blockDim. Unfortunately, blockDim is not yet supported in VerCors, so the examples in this chapter only use threadIdx.

OpenCL/CUDA with Barrier

In GPU programming, when there is only one thread block, there is a mechanism to sunchronize threads. This example shows an OpenCL program that uses barrier to synchronize threads:

#include <opencl.h>
/* 1  */ 
/* 2  */
/* 3  */ /*@
         context_everywhere opencl_gcount == 1;
         context_everywhere array != null && array.length == size;
         requires get_local_id(0) != size-1 ==> \pointer_index(array, get_local_id(0)+1, 1\2);
         requires get_local_id(0) == size-1 ==> \pointer_index(array, 0, 1\2);
         ensures \pointer_index(array, get_local_id(0), 1);
         ensures get_local_id(0) != size-1 ==> array[get_local_id(0)] == \old(array[get_local_id(0)+1]);
         ensures get_local_id(0) == size-1 ==> array[get_local_id(0)] == \old(array[0]);
         @*/
/* 12 */ __kernel void openCLLeftRotation(int array[], int size) {
/* 13 */    int tid = get_local_id(0);
/* 14 */    int temp;
/* 15 */    if(tid != size-1){
/* 16 */     temp = array[tid+1];
/* 17 */    }else{
/* 18 */     temp = array[0];
/* 19 */    }
/* 20 */
/* 21 */    /*@
            requires get_local_id(0) != size-1 ==> \pointer_index(array, get_local_id(0)+1, 1\2);
            requires get_local_id(0) == size-1 ==> \pointer_index(array, 0, 1\2);
            ensures \pointer_index(array, get_local_id(0), 1);
            @*/
/* 26 */    barrier(CLK_LOCAL_MEM_FENCE);
/* 27 */
/* 28 */    array[tid] = temp;
/* 29 */ }  

And this is the CUDA version of the example:

#include <cuda.h>
/* 1  */ 
/* 2  */
/* 3  */ /*@
         context_everywhere opencl_gcount == 1;
         context_everywhere array.length == size;
         requires threadIdx.x != size-1 ==> \pointer_index(array, threadIdx.x+1, 1\2);
         requires threadIdx.x == size-1 ==> \pointer_index(array, 0, 1\2);
         ensures \pointer_index(array, threadIdx.x, 1);
         ensures threadIdx.x != size-1 ==> array[threadIdx.x] == \old(array[threadIdx.x+1]);
         ensures threadIdx.x == size-1 ==> array[threadIdx.x] == \old(array[0]);
         @*/
/* 12 */ __global__ void CUDALeftRotation(int array[], int size) {
/* 13 */    int tid = threadIdx.x;
/* 14 */    int temp;
/* 15 */    if(tid != size-1){
/* 16 */     temp = array[tid+1];
/* 17 */    }else{
/* 18 */     temp = array[0];
/* 19 */    }
/* 20 */
/* 21 */    /*@
            requires threadIdx.x != size-1 ==> \pointer_index(array, threadIdx.x+1, 1\2);
            requires threadIdx.x == size-1 ==> \pointer_index(array, 0, 1\2);
            ensures \pointer_index(array, threadIdx.x, 1);
            @*/
/* 26 */    __syncthreads();
/* 27 */
/* 28 */    array[tid] = temp;
/* 29 */ }  

The above examples illustrate GPU programs that rotates the elements of an array to the left. Each thread ("tid") stores its right neighbor in a temporary location (i.e., "temp"), except thread "size-1" which stores the first element in the array (lines 15-19). Then each thread synchronizes in a barrier (line 26). After that, each thread writes the value read into its own location at index "tid" in the array (line 28).

We specify that there is only one thread block in the specification of the programs using the keyword "opencl_gcount" (the first precondition). The precondition of the barrier follows from the precondition of the function and the computations before the barrier. If the precondition of the barrier holds, the postcondition can be assumed after the barrier. Then, the postcondition of the method can follow from the postcondition of the barrier. Note that in the specification we can use the shorthand keyword "\lid" instead of "get_local_id(0)" and "threadIdx.x" in the tool.

Clone this wiki locally