GPU Specifics

gpu-cuda-diffusion

cuda_boundaries.cu

Implementation of boundary condition functions with CUDA acceleration.

Functions

void apply_initial_conditions(fp_t **conc, const int nx, const int ny, const int nm)

Initialize flat composition field with fixed boundary conditions.

The boundary conditions are fixed values of \( c_{hi} \) along the lower half of the left wall and the upper half of the right wall, with no flux everywhere else and an initial value of \( c_{lo} \) everywhere. These conditions represent a carburizing process, with partial exposure (rather than the entire left and right walls) to produce an inhomogeneous workload and highlight numerical errors at the boundaries.
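As a minimal sketch of the conditions described above, the following C function fills the field with \( c_{lo} \) and then pins \( c_{hi} \) on the lower half of the left wall and the upper half of the right wall. The function name, the values of C_LO and C_HI, the choice of row 0 as the bottom wall, and the boundary-layer width of nm/2 cells are assumptions made for illustration; none of them is taken from the source.

```c
typedef double fp_t; /* assumption: double precision */

/* Hypothetical values; the documentation does not state c_lo or c_hi. */
static const fp_t C_LO = 0.0;
static const fp_t C_HI = 1.0;

/* Fill the whole field with C_LO, then pin C_HI on the lower half of the
   left wall and the upper half of the right wall. The outermost nm/2
   columns are treated as the wall (assumes nm >= 3 and row 0 = bottom). */
void sketch_initial_conditions(fp_t **conc, const int nx, const int ny, const int nm)
{
    const int h = nm / 2;

    for (int j = 0; j < ny; j++)
        for (int i = 0; i < nx; i++)
            conc[j][i] = C_LO;

    for (int j = 0; j < ny / 2; j++)      /* lower half, left wall */
        for (int i = 0; i < h; i++)
            conc[j][i] = C_HI;

    for (int j = ny / 2; j < ny; j++)     /* upper half, right wall */
        for (int i = nx - h; i < nx; i++)
            conc[j][i] = C_HI;
}
```

Because only half of each wall is pinned, the region of high composition is unevenly distributed across the grid.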

void boundary_kernel(fp_t *d_conc, const int nx, const int ny, const int nm)

Enable double-precision floats.

Boundary condition kernel for execution on the GPU.

This function accesses 1D data rather than the 2D array representation of the scalar composition field.
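The 1D access pattern can be sketched in plain C as follows. Both helper names are hypothetical, and the row-major layout with nx cells per row is an assumption. In a CUDA kernel, the arguments of global_coord would be the blockDim, blockIdx, and threadIdx components for one axis.

```c
#include <stddef.h>

/* Global coordinate of a thread along one axis, written as plain C.
   Mirrors the usual CUDA expression blockDim * blockIdx + threadIdx. */
static inline int global_coord(const int block_dim, const int block_idx,
                               const int thread_idx)
{
    return block_dim * block_idx + thread_idx;
}

/* Row-major mapping from 2D grid coordinates to a flat offset.
   Assumes rows are stored contiguously with nx cells per row. */
static inline size_t flat_index(const int row, const int col, const int nx)
{
    return (size_t)row * (size_t)nx + (size_t)col;
}
```

With this mapping, conc[row][col] on the host corresponds to d_conc[flat_index(row, col, nx)] on the device.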

cuda_discretization.cu

Implementation of discretization functions with CUDA acceleration.

Functions

void convolution_kernel(fp_t *d_conc_old, fp_t *d_conc_lap, const int nx, const int ny, const int nm)

Tiled convolution algorithm for execution on the GPU.

This function accesses 1D data rather than the 2D array representation of the scalar composition field, mapping into 2D tiles on the GPU with halo cells before computing the convolution.

Note:

  • The source matrix (conc_old) and destination matrix (conc_lap) must be identical in size

  • One CUDA thread operates on one array index: there is no nested loop over matrix elements

  • The halo cells (the nm/2 perimeter cells) in conc_lap are never written and hold uninitialized garbage

  • The same cells in conc_old are boundary values, and contribute to the convolution

  • conc_tile is the shared tile of input data, accessible by all threads in this block
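For comparison with the notes above, here is a serial reference in plain C, offered as a sketch rather than the kernel itself. It uses the nested loop over matrix elements that the GPU version avoids by assigning one thread per index, and it omits the shared tile entirely. The function name and the flat, row-major mask layout are assumptions.

```c
typedef double fp_t; /* assumption: double precision */

/* Serial reference: apply an nm x nm mask to the interior of conc_old and
   write the result to conc_lap. Cells within nm/2 of any edge are skipped,
   so conc_lap keeps whatever values it already held there, while the same
   cells in conc_old are read as boundary values. */
void sketch_convolution(const fp_t *conc_old, fp_t *conc_lap, const fp_t *mask,
                        const int nx, const int ny, const int nm)
{
    const int h = nm / 2;

    for (int j = h; j < ny - h; j++) {
        for (int i = h; i < nx - h; i++) {
            fp_t value = 0.0;
            for (int mj = 0; mj < nm; mj++)
                for (int mi = 0; mi < nm; mi++)
                    value += mask[mj * nm + mi]
                           * conc_old[(j + mj - h) * nx + (i + mi - h)];
            conc_lap[j * nx + i] = value;
        }
    }
}
```

Applied with a 3 × 3 five-point Laplacian mask to a field that varies as the square of the column index, this yields 2 at every interior cell and leaves the perimeter of conc_lap untouched.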

void diffusion_kernel(fp_t *d_conc_old, fp_t *d_conc_new, fp_t *d_conc_lap, const int nx, const int ny, const int nm, const fp_t D, const fp_t dt)

Vector addition algorithm for execution on the GPU.

This function accesses 1D data rather than the 2D array representation of the scalar composition field. Memory allocation, data transfer, and array release are handled in cuda_init(), with arrays on the host and device managed through CudaData, which is a struct passed by reference into the function. In this way, device kernels can be called in isolation without incurring the cost of data transfers and with reduced risk of memory leaks.
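The "vector addition" can be sketched in serial C as below. This assumes a forward-Euler update, conc_new = conc_old + dt * D * conc_lap, applied to interior cells only. The source does not state the update formula, so both the formula and the function name are assumptions inferred from the parameter list.

```c
typedef double fp_t; /* assumption: double precision */

/* Assumed explicit update on the interior of a flat, row-major field:
   conc_new = conc_old + dt * D * conc_lap.
   Cells within nm/2 of any edge are left for the boundary routine. */
void sketch_diffusion_update(const fp_t *conc_old, fp_t *conc_new,
                             const fp_t *conc_lap,
                             const int nx, const int ny, const int nm,
                             const fp_t D, const fp_t dt)
{
    const int h = nm / 2;

    for (int j = h; j < ny - h; j++) {
        for (int i = h; i < nx - h; i++) {
            const int idx = j * nx + i;
            conc_new[idx] = conc_old[idx] + dt * D * conc_lap[idx];
        }
    }
}
```

On the GPU, each thread would perform the body of the inner loop for its own index.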

void device_boundaries(fp_t *conc, const int nx, const int ny, const int nm, const int bx, const int by)

Apply boundary conditions on device.

void device_convolution(fp_t *conc_old, fp_t *conc_lap, const int nx, const int ny, const int nm, const int bx, const int by)

Compute convolution on device.
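The bx and by arguments presumably set the thread-block dimensions. Under the common tiled-convolution layout, each block loads a tile that includes its halo but produces only bx - nm + 1 outputs per axis; that layout is an assumption, not something confirmed by the source. The helpers below are hypothetical and sketch the resulting block count along one axis.

```c
/* Ceiling division for positive integers. */
static inline int ceil_div(const int a, const int b)
{
    return (a + b - 1) / b;
}

/* Blocks needed along one axis of n cells when each block of b threads
   loads a halo-inclusive tile and writes (b - nm + 1) results.
   Assumes b >= nm so that the output tile is at least one cell wide. */
static inline int blocks_along_axis(const int n, const int b, const int nm)
{
    const int tile = b - nm + 1;
    return ceil_div(n, tile);
}
```

For example, a 512-cell axis with 32-thread blocks and a 3 × 3 mask gives 30 outputs per block, so 18 blocks are required.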

void device_composition(fp_t *conc_old, fp_t *conc_new, fp_t *conc_lap, const int nx, const int ny, const int nm, const int bx, const int by, const fp_t D, const fp_t dt)

Step diffusion equation on device.

void read_out_result(fp_t **conc, fp_t *d_conc, const int nx, const int ny)

Read data from device.

void compute_convolution(fp_t **conc_old, fp_t **conc_lap, fp_t **mask_lap, const int bx, const int by, const int nm, const int nx, const int ny)

Reference showing how to invoke the convolution kernel.

A stand-alone function like this incurs the cost of host-to-device data transfer each time it is called: it is a teaching tool, not reusable code. It is the basis for cuda_diffusion_solver(), which achieves much better performance by bundling CUDA kernels together and intelligently managing data transfers between the host (CPU) and device (GPU).

void cuda_diffusion_solver(struct CudaData *dev, fp_t **conc_new, const int bx, const int by, const int nm, const int nx, const int ny, const fp_t D, const fp_t dt, struct Stopwatch *sw)

Reference optimized code for solving the diffusion equation.

Solve diffusion equation on the GPU.

Compare with compute_convolution(): this function accomplishes the same result, but without the memory allocation, data transfer, and array release. These are handled in cuda_init(), with arrays on the host and device managed through CudaData, which is a struct passed by reference into the function. In this way, device kernels can be called in isolation without incurring the cost of data transfers and with reduced risk of memory leaks.
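The allocate-once pattern can be sketched on the host in plain C. SketchData is a hypothetical stand-in for CudaData, whose real fields are not listed here; the step function and its pointer swap are likewise assumptions. In the CUDA code, the allocation and release calls would be their device-side equivalents.

```c
#include <stdlib.h>

typedef double fp_t; /* assumption: double precision */

/* Hypothetical stand-in for CudaData: buffers allocated once, reused every step. */
struct SketchData {
    fp_t *conc_old;
    fp_t *conc_new;
    fp_t *conc_lap;
};

/* Release all buffers and null the pointers so a second call is harmless. */
void sketch_cleanup(struct SketchData *dev)
{
    free(dev->conc_old);
    free(dev->conc_new);
    free(dev->conc_lap);
    dev->conc_old = dev->conc_new = dev->conc_lap = NULL;
}

/* Allocate zeroed buffers once, up front. Returns 0 on success, -1 on failure. */
int sketch_init(struct SketchData *dev, const int nx, const int ny)
{
    const size_t n = (size_t)nx * (size_t)ny;

    dev->conc_old = calloc(n, sizeof(fp_t));
    dev->conc_new = calloc(n, sizeof(fp_t));
    dev->conc_lap = calloc(n, sizeof(fp_t));
    if (!dev->conc_old || !dev->conc_new || !dev->conc_lap) {
        sketch_cleanup(dev);
        return -1;
    }
    return 0;
}

/* One step: no allocation and no copies, only an update and a pointer swap. */
void sketch_step(struct SketchData *dev, const int nx, const int ny,
                 const fp_t D, const fp_t dt)
{
    const size_t n = (size_t)nx * (size_t)ny;

    for (size_t k = 0; k < n; k++)
        dev->conc_new[k] = dev->conc_old[k] + dt * D * dev->conc_lap[k];

    fp_t *tmp = dev->conc_old;
    dev->conc_old = dev->conc_new;
    dev->conc_new = tmp;
}
```

Keeping allocation and release in one init/cleanup pair means the per-step function never owns memory, which is the source of the reduced leak risk noted above.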

Variables

fp_t d_mask[5 * 5]

Convolution mask array on the GPU, allocated in protected memory.

gpu-openacc-diffusion

openacc_boundaries.c

Implementation of boundary condition functions with OpenACC threading.

Functions

void apply_initial_conditions(fp_t **conc, const int nx, const int ny, const int nm)

Initialize flat composition field with fixed boundary conditions.

The boundary conditions are fixed values of \( c_{hi} \) along the lower half of the left wall and the upper half of the right wall, with no flux everywhere else and an initial value of \( c_{lo} \) everywhere. These conditions represent a carburizing process, with partial exposure (rather than the entire left and right walls) to produce an inhomogeneous workload and highlight numerical errors at the boundaries.

void boundary_kernel(fp_t **__restrict__ conc, const int nx, const int ny, const int nm)

void apply_boundary_conditions(fp_t **conc, const int nx, const int ny, const int nm)

Set fixed value \( (c_{hi}) \) along left and bottom, zero-flux elsewhere.
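The zero-flux part of this condition can be sketched as follows, assuming it is imposed by copying the nearest interior value outward into the nm/2 boundary cells so that the gradient across each wall vanishes. The function name is hypothetical, and the fixed-value cells (\( c_{hi} \)) would be set separately.

```c
typedef double fp_t; /* assumption: double precision */

/* Impose zero flux on all four walls by mirroring the nearest interior
   value into the nm/2 boundary cells. Fixed-value segments are not
   handled here and would be written by a separate pass. */
void sketch_zero_flux(fp_t **conc, const int nx, const int ny, const int nm)
{
    const int h = nm / 2;

    /* Left and right walls: copy the first interior column outward. */
    for (int j = 0; j < ny; j++) {
        for (int i = 0; i < h; i++) {
            conc[j][i] = conc[j][h];
            conc[j][nx - 1 - i] = conc[j][nx - 1 - h];
        }
    }

    /* Bottom and top walls: copy the first interior row outward. */
    for (int j = 0; j < h; j++) {
        for (int i = 0; i < nx; i++) {
            conc[j][i] = conc[h][i];
            conc[ny - 1 - j][i] = conc[ny - 1 - h][i];
        }
    }
}
```

Running the column pass first means the row pass also fills the corner cells with consistent values.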

openacc_discretization.c

Implementation of discretization functions with OpenACC threading.

Functions

void convolution_kernel(fp_t **conc_old, fp_t **conc_lap, fp_t **mask_lap, const int nx, const int ny, const int nm)

Tiled convolution algorithm for execution on the GPU.

void diffusion_kernel(fp_t **conc_old, fp_t **conc_new, fp_t **conc_lap, const int nx, const int ny, const int nm, const fp_t D, const fp_t dt)

Vector addition algorithm for execution on the GPU.

gpu-opencl-diffusion

opencl_boundaries.c

Implementation of boundary condition functions with OpenCL acceleration.

Functions

void apply_initial_conditions(fp_t **conc, const int nx, const int ny, const int nm)

Initialize flat composition field with fixed boundary conditions.

The boundary conditions are fixed values of \( c_{hi} \) along the lower half of the left wall and the upper half of the right wall, with no flux everywhere else and an initial value of \( c_{lo} \) everywhere. These conditions represent a carburizing process, with partial exposure (rather than the entire left and right walls) to produce an inhomogeneous workload and highlight numerical errors at the boundaries.

opencl_discretization.c

Implementation of discretization functions with OpenCL acceleration.

Functions

void device_boundaries(struct OpenCLData *dev, const int flip, const int nx, const int ny, const int nm, const int bx, const int by)

Apply boundary conditions on OpenCL device.

void device_convolution(struct OpenCLData *dev, const int flip, const int nx, const int ny, const int nm, const int bx, const int by)

Compute convolution on OpenCL device.

void device_diffusion(struct OpenCLData *dev, const int flip, const int nx, const int ny, const int nm, const int bx, const int by, const fp_t D, const fp_t dt)

Solve diffusion equation on OpenCL device.

void read_out_result(struct OpenCLData *dev, const int flip, fp_t **conc, const int nx, const int ny)

Copy data out of OpenCL device.
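The flip argument shared by these functions is not described here. One plausible reading, offered purely as an assumption, is that it selects which of two device buffers currently holds the old field, so that the buffers can trade roles between steps without copying. The struct and helpers below are hypothetical and sketch that ping-pong scheme in plain C.

```c
typedef double fp_t; /* assumption: double precision */

/* Hypothetical pair of buffers standing in for two device arrays. */
struct PingPong {
    fp_t *buf[2];
};

/* Buffer holding the current (old) field for a given flip value. */
static inline fp_t *pp_old(const struct PingPong *p, const int flip)
{
    return p->buf[flip & 1];
}

/* Buffer that receives the updated (new) field for a given flip value. */
static inline fp_t *pp_new(const struct PingPong *p, const int flip)
{
    return p->buf[1 - (flip & 1)];
}

/* Flip value for the next step: the roles of the two buffers are exchanged. */
static inline int pp_next(const int flip)
{
    return 1 - (flip & 1);
}
```

Under this reading, the host would call the boundary, convolution, and diffusion routines with the same flip value, then advance it with pp_next before the following step.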
