High performance computing and optimization, particularly on graphics processing units (GPUs) using CUDA, OpenCL, and directive-based languages
Applications of high performance computing, specifically as used for financial applications, benchmark suites, and real-time stereo analysis/motion estimation from a pair/sequence of images
Initial PolyBench/GPU Benchmark Suite used for results in the paper, and an updated version that matches the format of PolyBench 3.2 and adds OpenACC/OpenMP versions of each benchmark (any work using either code should cite the above paper)
Describes additional CUDA code optimizations with evaluation on multiple GPUs including Tesla V100.
Introduces parallel CPU implementation using OpenMP, SIMD instructions, and the same optimizations as the CUDA code, with evaluation on multiple CPUs.
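As a rough illustration of what combining OpenMP threading with SIMD looks like on the CPU side, here is a minimal C++ sketch. It is not taken from the implementation described above: the function `scaled_add` and its element-wise operation are hypothetical stand-ins chosen only to show the directive.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the paper's code): out[i] = a[i] + scale * b[i].
// The pragma asks OpenMP to split the iterations across threads and to
// vectorize each thread's chunk. If OpenMP is not enabled at compile time
// (e.g. no -fopenmp), the pragma is ignored and the loop runs serially.
std::vector<float> scaled_add(const std::vector<float>& a,
                              const std::vector<float>& b,
                              float scale) {
    const std::ptrdiff_t n =
        static_cast<std::ptrdiff_t>(std::min(a.size(), b.size()));
    std::vector<float> out(static_cast<std::size_t>(n));

    #pragma omp parallel for simd
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        out[i] = a[i] + scale * b[i];
    }
    return out;
}
```

Each iteration writes only its own output element, which is what makes the loop safe to parallelize and vectorize without synchronization.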
TLDR #1: Was able to get a speedup of over 2x on many tests with no loss of accuracy by (1) getting rid of many cudaMalloc()/cudaFree() calls, (2) using 16-bit half precision data rather than 32-bit floats, and (3) improving data alignment.
TLDR #2: Parallel/optimized CPU implementation is much faster than the initial non-parallel code, but the optimized CUDA implementation on a Tesla V100 is still 1.9x-4.2x faster than the optimized CPU implementation on a 24-core Xeon CPU.
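Two of the ideas in TLDR #1, fewer allocation calls and better data alignment, can be sketched in plain host-side C++. This is an analogy under stated assumptions, not the CUDA code itself: `padded_width` and `ScratchBuffer` are hypothetical names, and in a CUDA version the buffer would wrap `cudaMalloc()`/`cudaFree()` rather than a `std::vector`.

```cpp
#include <cstddef>
#include <vector>

// Round a row width (in elements) up to the next multiple of align_elems,
// so that every row of a 2D array starts on an aligned boundary.
// Assumes align_elems > 0.
constexpr std::size_t padded_width(std::size_t width, std::size_t align_elems) {
    return ((width + align_elems - 1) / align_elems) * align_elems;
}

// Hypothetical scratch buffer that grows on demand and is then reused,
// instead of being allocated and freed on every call.
class ScratchBuffer {
public:
    // Returns storage for at least `count` floats; allocates only when
    // the request exceeds the current capacity.
    float* get(std::size_t count) {
        if (count > storage_.size()) {
            storage_.resize(count);
            ++allocations_;
        }
        return storage_.data();
    }

    // Number of times the buffer actually had to grow.
    std::size_t allocations() const { return allocations_; }

private:
    std::vector<float> storage_;
    std::size_t allocations_ = 0;
};
```

With a row of 100 floats and a 16-element alignment, `padded_width(100, 16)` gives 112, so each row wastes 12 elements in exchange for aligned row starts. Repeated `get()` calls with the same or a smaller size reuse the existing storage, so the allocation cost is paid once rather than once per call.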