We did some research on how we could use the computing power of GPUs with SciDB. The main problem was getting enough work to the GPU: if the GPU is idle at any point during the query, the speedup will be quite low.
To solve this problem, we introduced the following elements:
Async memory copy from pinned memory to the GPU. We didn't use the pin() function of SciDB. The CPU converts the RLE chunk into a reused pinned memory area (the CPU also needs something to do…). In the work of  the chunk is converted on the GPU; since we need the GPU for long-running calculations, we do not want to offload this work to it. (see utils/DataAccess.h)
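To make the staging step concrete, here is a minimal sketch of the pattern, assuming a placeholder decodeRleChunk() function (the real decode logic lives in utils/DataAccess.h, not shown here):

```cuda
#include <cuda_runtime.h>

// Hypothetical CPU-side decoder; stands in for the real RLE conversion.
void decodeRleChunk(const void *rleChunk, float *dst, size_t nElems);

static float *h_pinned = nullptr;  // reused pinned (page-locked) staging buffer
static float *d_buf    = nullptr;  // device-side chunk buffer

void stageChunk(const void *rleChunk, size_t nElems, cudaStream_t stream) {
    if (!h_pinned) {
        // Allocate once and reuse; pinned memory is required for the
        // host-to-device copy to be truly asynchronous.
        cudaMallocHost(&h_pinned, nElems * sizeof(float));
        cudaMalloc(&d_buf, nElems * sizeof(float));
    }
    decodeRleChunk(rleChunk, h_pinned, nElems);  // CPU does the RLE decode
    // Returns immediately; the copy overlaps with CPU work on the next chunk.
    cudaMemcpyAsync(d_buf, h_pinned, nElems * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
}
```

The key point is that the reused pinned buffer avoids both repeated allocation cost and the synchronous staging copy the driver would otherwise do for pageable memory.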
Each SciDB instance works on three chunks concurrently. This allows us to use three GPU streams per instance; in the best case, the two memory copy engines and the compute engine are busy simultaneously. This pipelines more work (chunks) to the GPU. (see utils/GPUHandler.h)
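The three-stream pipeline can be sketched as follows (a simplified illustration of the idea in utils/GPUHandler.h; myKernel and the hSrc/dDst/hRes buffer accessors are placeholders):

```cuda
#include <cuda_runtime.h>

const int NUM_STREAMS = 3;  // three in-flight chunks per instance

extern float *hSrc(int c), *dDst(int c), *hRes(int c);  // placeholder buffers
__global__ void myKernel(float *data);                  // placeholder kernel

void processChunks(int numChunks, size_t chunkElems, dim3 grid, dim3 block) {
    cudaStream_t streams[NUM_STREAMS];
    for (int i = 0; i < NUM_STREAMS; ++i)
        cudaStreamCreate(&streams[i]);

    for (int c = 0; c < numChunks; ++c) {
        cudaStream_t s = streams[c % NUM_STREAMS];
        // Within one stream, copy-in, kernel, and copy-out run in order,
        // but the three streams overlap on the two copy engines and the
        // compute engine.
        cudaMemcpyAsync(dDst(c), hSrc(c), chunkElems * sizeof(float),
                        cudaMemcpyHostToDevice, s);
        myKernel<<<grid, block, 0, s>>>(dDst(c));
        cudaMemcpyAsync(hRes(c), dDst(c), chunkElems * sizeof(float),
                        cudaMemcpyDeviceToHost, s);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < NUM_STREAMS; ++i)
        cudaStreamDestroy(streams[i]);
}
```

With chunk c+1 copying in while chunk c computes and chunk c-1 copies out, the compute engine rarely waits for data.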
The CUDA MPS (Multi-Process Service) layer enables multiple CPU processes to use the GPU concurrently without context-switching overhead. There are some downsides, though: in particular, dynamic parallelism and callback functions can no longer be used. But it really is a performance boost if your kernel on a single chunk doesn't use 100% of the GPU resources.
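For reference, enabling MPS is just a matter of starting the control daemon before launching the SciDB instances (paths shown are the CUDA defaults; adjust for your setup):

```shell
# Restrict MPS to one device and start the control daemon
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d

# ... run the SciDB instances; their CUDA contexts are funneled
# through the shared MPS server process ...

# Shut the daemon down again when done
echo quit | nvidia-cuda-mps-control
```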
This setup works well for our use cases and really keeps the compute engine of the GPU busy 100% of the time. As the CPU and the GPU work completely asynchronously with respect to each other, we can hide the entire runtime of the faster part behind that of the slower one.
We ran some tests in the Amazon cloud and reached a speedup of >7000 for our algorithm. But I'm a little disappointed by the scalability of the SciDB cluster: it is far from linear. Is this the expected scalability, or do you think the cloud setup throttled it (see this plot)?
Do you guys have any experience with SciDB and GPUs? Did you make any other observations? It would be nice if we could discuss this topic and make further progress. More details of our setup here.