GPU Specifics¶
gpu-cuda-diffusion¶
cuda_boundaries.cu¶
Implementation of boundary condition functions with CUDA acceleration.
Functions

void apply_initial_conditions(fp_t **conc, const int nx, const int ny, const int nm)
Initialize flat composition field with fixed boundary conditions.
The boundary conditions are fixed values of \( c_{hi} \) along the lower-left half and upper-right half walls, no flux everywhere else, with an initial value of \( c_{lo} \) everywhere. These conditions represent a carburizing process, with partial exposure (rather than the entire left and right walls) to produce an inhomogeneous workload and to highlight numerical errors at the boundaries.
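A minimal CPU sketch of the layout described above, assuming row 0 is the bottom wall and using illustrative `C_HI`/`C_LO` values (the library's actual constants, array types, and wall layout may differ):

```c
#include <assert.h>

typedef double fp_t;

#define C_HI 1.0  /* hypothetical carburizing concentration */
#define C_LO 0.0  /* hypothetical bulk concentration */

/* Sketch of the initial condition described above: c_lo everywhere,
   c_hi pinned along the lower half of the left wall and the upper
   half of the right wall. Names, values, and the assumption that
   row 0 is the bottom are illustrative, not taken from the library. */
void sketch_initial_conditions(fp_t conc[], const int nx, const int ny)
{
    for (int j = 0; j < ny; j++)
        for (int i = 0; i < nx; i++)
            conc[j * nx + i] = C_LO;

    /* lower half of the left wall */
    for (int j = 0; j < ny / 2; j++)
        conc[j * nx + 0] = C_HI;

    /* upper half of the right wall */
    for (int j = ny / 2; j < ny; j++)
        conc[j * nx + (nx - 1)] = C_HI;
}
```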

void boundary_kernel(fp_t *d_conc, const int nx, const int ny, const int nm)
Boundary condition kernel for execution on the GPU.
This function accesses 1D data rather than the 2D array representation of the scalar composition field.
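The flat access pattern can be sketched as a row-major index map; one GPU thread would evaluate one such offset (the formula here is the standard convention, shown for illustration):

```c
#include <assert.h>

/* Sketch of the flat indexing used by the GPU kernels: the 2D field
   point (x, y) on an nx-wide grid maps to one 1D offset. One thread
   handles one offset; no nested loop over matrix elements is needed. */
static inline int flat_index(const int x, const int y, const int nx)
{
    return nx * y + x;
}
```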
cuda_discretization.cu¶
Implementation of discretization functions with CUDA acceleration.
Functions

void convolution_kernel(fp_t *d_conc_old, fp_t *d_conc_lap, const int nx, const int ny, const int nm)
Tiled convolution algorithm for execution on the GPU.
This function accesses 1D data rather than the 2D array representation of the scalar composition field, mapping into 2D tiles on the GPU with halo cells before computing the convolution.
Note:
The source matrix (conc_old) and destination matrix (conc_lap) must be identical in size
One CUDA core operates on one array index: there is no nested loop over matrix elements
The halo (nm/2 perimeter cells) in conc_lap contains uninitialized garbage
The same cells in conc_old are boundary values, and contribute to the convolution
conc_tile is the shared tile of input data, accessible by all threads in this block
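The notes above can be sketched as a serial CPU analogue of the stencil arithmetic each thread performs. The real kernel additionally stages conc_old into the shared-memory tile; names here are illustrative:

```c
#include <assert.h>

typedef double fp_t;

/* CPU sketch of the convolution the GPU kernel performs per point:
   each interior cell of conc_old is convolved with an nm-by-nm mask.
   The nm/2-cell halo of conc_lap is left untouched, while the same
   cells of conc_old supply boundary values to the stencil. */
void sketch_convolution(const fp_t *conc_old, fp_t *conc_lap,
                        const fp_t *mask, const int nx, const int ny,
                        const int nm)
{
    const int h = nm / 2; /* halo width */
    for (int y = h; y < ny - h; y++) {
        for (int x = h; x < nx - h; x++) {
            fp_t value = 0.;
            for (int j = -h; j <= h; j++)
                for (int i = -h; i <= h; i++)
                    value += mask[(j + h) * nm + (i + h)]
                           * conc_old[(y + j) * nx + (x + i)];
            conc_lap[y * nx + x] = value;
        }
    }
}
```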

void diffusion_kernel(fp_t *d_conc_old, fp_t *d_conc_new, fp_t *d_conc_lap, const int nx, const int ny, const int nm, const fp_t D, const fp_t dt)
Vector addition algorithm for execution on the GPU.
This function accesses 1D data rather than the 2D array representation of the scalar composition field. Memory allocation, data transfer, and array release are handled in cuda_init(), with arrays on the host and device managed through CudaData, which is a struct passed by reference into the function. In this way, device kernels can be called in isolation without incurring the cost of data transfers and with reduced risk of memory leaks.
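The per-point update amounts to one explicit Euler step, \( c_{new} = c_{old} + \Delta t \, D \, \nabla^2 c \). A serial sketch of the arithmetic a single thread performs on its array index:

```c
#include <assert.h>
#include <math.h>

typedef double fp_t;

/* Sketch of the "vector addition" the diffusion kernel applies: one
   fused multiply-add per flat array index. On the GPU, one thread
   handles one index i; the loop here stands in for the thread grid. */
void sketch_diffusion_step(const fp_t *conc_old, fp_t *conc_new,
                           const fp_t *conc_lap, const int n,
                           const fp_t D, const fp_t dt)
{
    for (int i = 0; i < n; i++)
        conc_new[i] = conc_old[i] + dt * D * conc_lap[i];
}
```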

void device_boundaries(fp_t *conc, const int nx, const int ny, const int nm, const int bx, const int by)¶
Apply boundary conditions on device.

void device_convolution(fp_t *conc_old, fp_t *conc_lap, const int nx, const int ny, const int nm, const int bx, const int by)¶
Compute convolution on device.

void device_composition(fp_t *conc_old, fp_t *conc_new, fp_t *conc_lap, const int nx, const int ny, const int nm, const int bx, const int by, const fp_t D, const fp_t dt)¶
Step diffusion equation on device.

void compute_convolution(fp_t **conc_old, fp_t **conc_lap, fp_t **mask_lap, const int bx, const int by, const int nm, const int nx, const int ny)¶
Reference showing how to invoke the convolution kernel.
A standalone function like this incurs the cost of host-to-device data transfer each time it is called: it is a teaching tool, not reusable code. It is the basis for cuda_diffusion_solver(), which achieves much better performance by bundling CUDA kernels together and intelligently managing data transfers between the host (CPU) and device (GPU).

void cuda_diffusion_solver(struct CudaData *dev, fp_t **conc_new, const int bx, const int by, const int nm, const int nx, const int ny, const fp_t D, const fp_t dt, struct Stopwatch *sw)¶
Reference optimized code for solving the diffusion equation.
Solve diffusion equation on the GPU.
Compare compute_convolution(): this function accomplishes the same result, but without the per-call memory allocation, data transfer, and array release. These are handled in cuda_init(), with arrays on the host and device managed through CudaData, which is a struct passed by reference into the function. In this way, device kernels can be called in isolation without incurring the cost of data transfers and with reduced risk of memory leaks.
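The performance contrast can be sketched with a mock transfer counter standing in for PCIe traffic. None of these names are the library's API; this only illustrates why keeping arrays resident on the device (as CudaData does) amortizes the transfer cost:

```c
#include <assert.h>

/* Mock of the two call patterns: a standalone step pays a transfer in
   each direction on every call, while a resident solver uploads once,
   runs many kernels back to back, and downloads once at read-out. */
static int transfers = 0;

static void mock_transfer(void) { transfers++; }

/* per-call style, like the standalone reference function */
static void standalone_step(void)
{
    mock_transfer(); /* host -> device */
    /* ... kernel ... */
    mock_transfer(); /* device -> host */
}

/* resident style: data stays on the device between steps */
static void resident_solver(const int steps)
{
    mock_transfer();              /* one upload at setup */
    for (int i = 0; i < steps; i++) {
        /* ... kernels operate on device-resident arrays ... */
    }
    mock_transfer();              /* one download of the result */
}
```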
Variables

fp_t d_mask[5 * 5]
Convolution mask array on the GPU, allocated in protected memory.
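One mask that could fill this 5 × 5 slot is the five-point Laplacian stencil; a sketch of its construction follows. On the device, the filled mask would presumably be copied into read-only memory with cudaMemcpyToSymbol; the stencil choice and names here are illustrative:

```c
#include <assert.h>

typedef double fp_t;

/* Sketch of a five-point Laplacian stencil centered in an nm-by-nm
   mask and scaled by the grid spacings. Discrete conservation requires
   the weights to sum to zero. */
void sketch_five_point_mask(fp_t *mask, const int nm,
                            const fp_t dx, const fp_t dy)
{
    const int c = nm / 2; /* center index */
    for (int i = 0; i < nm * nm; i++)
        mask[i] = 0.;
    mask[(c - 1) * nm + c] = 1. / (dy * dy); /* south neighbor */
    mask[(c + 1) * nm + c] = 1. / (dy * dy); /* north neighbor */
    mask[c * nm + (c - 1)] = 1. / (dx * dx); /* west neighbor  */
    mask[c * nm + (c + 1)] = 1. / (dx * dx); /* east neighbor  */
    mask[c * nm + c] = -2. / (dx * dx) - 2. / (dy * dy);
}
```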
gpu-openacc-diffusion¶
openacc_boundaries.c¶
Implementation of boundary condition functions with OpenACC acceleration.
Functions

void apply_initial_conditions(fp_t **conc, const int nx, const int ny, const int nm)
Initialize flat composition field with fixed boundary conditions.
The boundary conditions are fixed values of \( c_{hi} \) along the lower-left half and upper-right half walls, no flux everywhere else, with an initial value of \( c_{lo} \) everywhere. These conditions represent a carburizing process, with partial exposure (rather than the entire left and right walls) to produce an inhomogeneous workload and to highlight numerical errors at the boundaries.

void boundary_kernel(fp_t **__restrict__ conc, const int nx, const int ny, const int nm)
Boundary condition kernel for execution on the GPU.

void apply_boundary_conditions(fp_t **conc, const int nx, const int ny, const int nm)
Set fixed value \( c_{hi} \) along left and bottom, zero-flux elsewhere.
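A sketch of this split in OpenACC-annotated C. The wall layout and names are illustrative, and a plain C compiler simply ignores the pragmas; zero flux is imposed by mirroring the nearest interior value into the wall cell:

```c
#include <assert.h>

typedef double fp_t;

#define C_HI 1.0 /* hypothetical fixed boundary value */

/* Sketch: fixed value on the left and bottom walls, zero-flux
   (mirrored neighbor) on the right and top walls, with each wall
   loop offered to the accelerator as an independent parallel loop. */
void sketch_boundaries(fp_t *conc, const int nx, const int ny)
{
    /* fixed value along the left and bottom walls */
    #pragma acc parallel loop
    for (int j = 0; j < ny; j++)
        conc[j * nx + 0] = C_HI;
    #pragma acc parallel loop
    for (int i = 0; i < nx; i++)
        conc[0 * nx + i] = C_HI;

    /* zero flux on the right and top walls: copy the neighbor in */
    #pragma acc parallel loop
    for (int j = 0; j < ny; j++)
        conc[j * nx + (nx - 1)] = conc[j * nx + (nx - 2)];
    #pragma acc parallel loop
    for (int i = 0; i < nx; i++)
        conc[(ny - 1) * nx + i] = conc[(ny - 2) * nx + i];
}
```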
openacc_discretization.c¶
Implementation of discretization functions with OpenACC acceleration.
gpu-opencl-diffusion¶
opencl_boundaries.c¶
Implementation of boundary condition functions with OpenCL acceleration.
Functions

void apply_initial_conditions(fp_t **conc, const int nx, const int ny, const int nm)
Initialize flat composition field with fixed boundary conditions.
The boundary conditions are fixed values of \( c_{hi} \) along the lower-left half and upper-right half walls, no flux everywhere else, with an initial value of \( c_{lo} \) everywhere. These conditions represent a carburizing process, with partial exposure (rather than the entire left and right walls) to produce an inhomogeneous workload and to highlight numerical errors at the boundaries.
opencl_discretization.c¶
Implementation of discretization functions with OpenCL acceleration.
Functions

void device_boundaries(struct OpenCLData *dev, const int flip, const int nx, const int ny, const int nm, const int bx, const int by)
Apply boundary conditions on OpenCL device.

void device_convolution(struct OpenCLData *dev, const int flip, const int nx, const int ny, const int nm, const int bx, const int by)
Compute convolution on OpenCL device.

void device_diffusion(struct OpenCLData *dev, const int flip, const int nx, const int ny, const int nm, const int bx, const int by, const fp_t D, const fp_t dt)
Solve diffusion equation on OpenCL device.

void read_out_result(struct OpenCLData *dev, const int flip, fp_t **conc, const int nx, const int ny)
Copy data out of OpenCL device.
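The flip argument threaded through these functions suggests ping-pong buffering: the device holds two concentration buffers and alternates their roles each step, so no device-to-device copy is needed between iterations. Whether the library uses exactly this scheme is an assumption; the selection logic can be sketched as a CPU mock:

```c
#include <assert.h>

/* Mock of ping-pong buffer selection: flip picks which of the two
   device buffers is "current" (read) and which is "next" (written).
   All names here are illustrative, not the library's API. */
typedef struct {
    double *buffer[2]; /* stand-ins for the two device buffers */
} MockDevice;

static double *current_buffer(MockDevice *dev, const int flip)
{
    return dev->buffer[flip % 2];
}

static double *next_buffer(MockDevice *dev, const int flip)
{
    return dev->buffer[(flip + 1) % 2];
}
```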