High performance computing and optimization,
particularly on graphics processing units (GPUs) using
CUDA, OpenCL, and directive-based languages
Applications of high performance computing,
specifically in finance,
benchmark suites, and real-time stereo analysis/motion
estimation from a pair/sequence of images
Initial PolyBench/GPU benchmark suite used for the results
in the paper, and an updated
version that matches the format of PolyBench 3.2 and
adds OpenACC/OpenMP versions of each benchmark (any
work using either codebase should cite the above paper)
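To make the directive-based versions concrete, here is a minimal sketch, not taken from the suite itself, of what an OpenMP variant of a PolyBench-style kernel looks like (a GEMM-like loop nest is assumed as the example):

```cpp
#include <vector>

// Minimal sketch (not from the suite): an OpenMP version of a
// PolyBench-style GEMM kernel, C = alpha*A*B + beta*C for n x n
// row-major matrices. The pragma parallelizes the outer loop across
// CPU threads; if compiled without -fopenmp it is ignored and the
// code simply runs serially.
void gemm(int n, float alpha, float beta,
          const std::vector<float>& A,
          const std::vector<float>& B,
          std::vector<float>& C) {
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
      float acc = beta * C[i * n + j];
      for (int k = 0; k < n; ++k) {
        acc += alpha * A[i * n + k] * B[k * n + j];
      }
      C[i * n + j] = acc;
    }
  }
}
```

An OpenACC version of the same kernel would replace the pragma with `#pragma acc parallel loop` to target a GPU instead of CPU threads.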
Describes additional CUDA code optimizations with evaluation on multiple GPUs including Tesla V100.
Introduces parallel CPU implementation using OpenMP, SIMD instructions, and the same optimizations as the CUDA code, with evaluation on multiple CPUs.
TLDR #1: Achieved a speedup of over 2x on many tests with no loss of accuracy by (1) eliminating many cudaMalloc()/cudaFree() calls, (2) using 16-bit half-precision data rather than 32-bit floats, and (3) improving data alignment.
TLDR #2: The parallel/optimized CPU implementation is much faster than the initial non-parallel code, but the optimized CUDA implementation on a Tesla V100 is still 1.9x-4.2x faster than the optimized CPU implementation on a 24-core Xeon CPU.
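The allocation optimization in TLDR #1 can be sketched in host code. This is an illustrative analogue (class and method names are hypothetical, not from the paper's code): instead of allocating and freeing working buffers on every run, analogous to repeated cudaMalloc()/cudaFree() calls, one arena is allocated up front and sub-buffers are handed out from it, so the per-run allocation cost disappears:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of the allocate-once/reuse pattern: a single
// backing allocation is made at construction, allocate() only bumps an
// offset into it, and reset() makes the whole arena reusable for the
// next input without freeing memory.
class BufferArena {
 public:
  explicit BufferArena(std::size_t total_floats)
      : storage_(total_floats), next_(0) {}

  // Hand out a view into preallocated storage; no allocation happens here.
  float* allocate(std::size_t count) {
    if (next_ + count > storage_.size()) return nullptr;  // arena exhausted
    float* p = storage_.data() + next_;
    next_ += count;
    return p;
  }

  // Reuse the arena for the next run instead of freeing and reallocating.
  void reset() { next_ = 0; }

 private:
  std::vector<float> storage_;
  std::size_t next_;
};
```

In the CUDA setting the backing storage would be one cudaMalloc() at startup rather than a std::vector, with the same bump-and-reset logic on top.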
ARM CPUs: Amazon Graviton4, Azure Cobalt, NVIDIA Grace
Abstract: Parallel processing is a common way to speed up many computer vision algorithms, including stereo matching. This work
examines optimized parallel implementations of belief propagation for stereo processing on NVIDIA GPU and x86/ARM
CPU architectures and presents runtime comparisons across multiple GPU and CPU processors on a variety of input stereo
sets. The work goes on to present results of retrieving an optimized parallel configuration for each input
stereo set, the speedups/slowdowns when using 16-bit floats and 64-bit doubles compared to 32-bit floats, and the speedups
when using templated disparity counts that allow the iteration counts of loops over possible disparities
to be known at compile time.