SciDB and GPUs


#1

We did some research on how to use the computing power of GPUs with SciDB. The main problem was getting enough work to the GPU: if the GPU sits idle at any time during the query, your speedup will be quite low.
To solve this problem, we introduced the following elements:

  1. Asynchronous memory copies from pinned memory to the GPU. We didn’t use the pin() function of SciDB. The CPU converts the RLE chunk into a reused pinned memory area (as the CPU also needs to do some work…). In the work of [1] they convert the chunk on the GPU; as we need the GPU for long-running calculations, we do not want to offload this work to it. (see utils/DataAccess.h, and the sketch after this list)

  2. Each SciDB instance works on three chunks concurrently, which allows us to use three GPU streams per instance. In the best case, we use the two memory-copy engines and the compute engine simultaneously. This pipelines more work (chunks) to the GPU. (see utils/GPUHandler.h)

  3. The CUDA MPS (Multi-Process Service) layer lets multiple CPU cores use the GPU concurrently without context-switching overhead. There are some downsides, though: in particular, you can no longer use dynamic parallelism or callback functions. But it really is a performance boost if the kernel for a single chunk doesn’t use 100% of the GPU’s resources.
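
Roughly, the per-instance pipeline from points 1 and 2 looks like the sketch below. This is a minimal, self-contained illustration, not the actual utils/DataAccess.h / utils/GPUHandler.h code: the chunk size, the chunk count and the kernel are placeholders, and the CPU fill loop only stands in for the real RLE unpacking.

```cpp
// Minimal sketch of the pinned-buffer / three-stream pipeline (points 1 and 2).
// Not the real utils/DataAccess.h / utils/GPUHandler.h code: chunk size, chunk
// count and the kernel are placeholders; the CPU fill stands in for RLE unpacking.
#include <cuda_runtime.h>
#include <cstdio>

#define NSTREAMS    3            // three chunks in flight per SciDB instance
#define CHUNK_CELLS (512 * 512)  // placeholder chunk size

__global__ void processChunk(const short *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];   // stands in for the long-running per-chunk kernel
}

int main()
{
    short *h_pinned[NSTREAMS];   // reused pinned staging buffers (point 1)
    short *d_in[NSTREAMS];
    float *d_out[NSTREAMS];
    cudaStream_t stream[NSTREAMS];

    for (int s = 0; s < NSTREAMS; ++s) {
        cudaMallocHost((void **)&h_pinned[s], CHUNK_CELLS * sizeof(short));
        cudaMalloc((void **)&d_in[s],  CHUNK_CELLS * sizeof(short));
        cudaMalloc((void **)&d_out[s], CHUNK_CELLS * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }

    for (int chunk = 0; chunk < 64; ++chunk) {    // chunks assigned to this instance
        int s = chunk % NSTREAMS;
        cudaStreamSynchronize(stream[s]);         // wait until this staging buffer is free

        // CPU work: unpack the RLE chunk into the pinned buffer (placeholder fill)
        for (int i = 0; i < CHUNK_CELLS; ++i)
            h_pinned[s][i] = (short)((chunk + i) & 0x7fff);

        // Copy and kernel are queued asynchronously on the chunk's stream, so the
        // H2D copy engine and the compute engine overlap across streams; a
        // cudaMemcpyAsync back to pinned host memory would use the D2H engine.
        cudaMemcpyAsync(d_in[s], h_pinned[s], CHUNK_CELLS * sizeof(short),
                        cudaMemcpyHostToDevice, stream[s]);
        processChunk<<<(CHUNK_CELLS + 255) / 256, 256, 0, stream[s]>>>(
            d_in[s], d_out[s], CHUNK_CELLS);
    }

    cudaDeviceSynchronize();
    printf("all chunks processed\n");
    return 0;
}
```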

This setup works well for our use cases and keeps the compute engine of the GPU busy 100% of the time. As the CPU and the GPU work completely asynchronously with respect to each other, we can hide the entire runtime of the faster part behind the runtime of the slower one.

We ran some tests in the Amazon cloud and reached a speedup of >7000 for our algorithm. But I’m a little disappointed by the scalability of the SciDB cluster; it is far from linear. Is this the expected scalability, or do you think the cloud setup throttled it (see this plot)?
Do you guys have any experience with SciDB and GPUs? Did you make any other observations? It would be nice if we could discuss this topic and make further progress. More details of our setup here.

Cheers,
Simon


#2

Hello, Simon!

Most impressive work! I can say that we are starting to prototype using SciDB Streaming together with TensorFlow. That lets us leverage the GPU, but in a very different way from your implementation. Some of our very early notes are here: https://github.com/Paradigm4/stream/blob/master/r_pkg/vignettes/tensorflow.Rmd
We are watching this space with great interest.

Regarding your question about scalability, I can offer a few notes.

  1. The config.ini file should increase the number of SciDB instances as the number of cores increases: usually 1 instance per core, or 1 instance per 2 cores if you experience a lot of multi-user traffic. Also, the settings “sg-send-queue-size” and “sg-receive-queue-size” should be set to the total number of instances in the cluster (see the sketch after this list). You didn’t provide your config.ini files, but if you share your settings I can take a look.

  2. I noticed (if I’m not mistaken) that your array dimensions are 4096 x 4096 in total and you increase the chunk size up to 512 x 512, is that right? With 512 x 512 chunks you only have 64 chunks in total (per attribute). The hash distribution function won’t always distribute them perfectly, and per-attribute chunks for the same position will be collocated. So that might explain what you’re seeing as well. If this is the case, running on a larger image may show a better curve.
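
For point 1, a config.ini sketch for a hypothetical 2-server cluster with 8 cores per server (16 instances total). Host names, the section name and everything not mentioned in this thread are placeholders, and if I remember the server-N syntax right, the number after the comma is the count of additional instances started on that host:

```ini
# Hypothetical 2-server cluster, 8 cores per server -> 8 instances per server
# (server-N = host, number of additional instances besides the first one).
[mydb]
server-0=node1.example.com,7
server-1=node2.example.com,7
# queue sizes set to the total number of instances in the cluster (16)
sg-send-queue-size=16
sg-receive-queue-size=16
```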


#3

SciDB and TensorFlow sounds very interesting; I’ll have a look at this ASAP. Using the Streaming interface also sounds quite promising.

Thanks for your feedback. The config is always 1 instance per CPU core available on the node, but I didn’t tune the sg-send and sg-receive parameters. I will re-run the benchmarks and tell you how they perform. Since I had to increase the PostgreSQL limits to handle the communication of all instances, could there also be a bottleneck there? The pg database runs on the coordinator node. Regarding the chunk size you’re right, but I’m aware of this and expected the behavior you mentioned. So I’m talking more about the small-chunk configs.

I’ll keep you posted.


#4

Yes - Postgres overhead would probably be measurable but I expect that to be fairly consistent. Postgres overhead for a read query would typically not change based on chunk size or the size of the array. It would be a constant-sized penalty that changes only if you change the config.

Are you running on SciDB 16.9 or older?
In SciDB 16.9 you can also set the config option
perf-wait-timing=1
and then scidb.log (at DEBUG level) will output a breakdown of where we think the query time was spent; search for a line called “QueryTiming”.

And yes, it could be affected by the cloud setup. Depending on how you do it, there are some settings for “instance placement groups” and “tenancy” that may affect latency between instances.


#5

Just another thing to keep in mind: when the plot says the chunk size is “8KB”, does the create array statement have a value of “8” in the chunk-size field? I think that’s what the code does, but I wanted to double-check. In the create array statement that number is not in kilobytes but in cells per dimension, so an 8x8 chunk would be 64 cells, which is rather small.
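
To make the units concrete, here is a hedged sketch of the two extremes as create array statements, assuming the 15.12/16.9 dimension syntax (name=start:end,chunk_length,overlap), a single 16-bit attribute and a 4096 x 4096 image; the array and attribute names are made up:

```
CREATE ARRAY img_small <v:int16> [y=0:4095,8,0,   x=0:4095,8,0];
CREATE ARRAY img_large <v:int16> [y=0:4095,512,0, x=0:4095,512,0];
```

The first gives 512 x 512 = 262,144 chunks of 64 cells each; the second gives only 64 chunks of 262,144 cells each.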

If that is all true, then looking at the plot again, the overhead goes down significantly as you increase the chunk size from 8x8 to 512x512. So I wonder if this is actually the chunk-map / chunk-open overhead, which of course would dominate when chunks are small.

It might be especially significant for 3 nodes, since the number of chunks isn’t evenly divisible by the number of instances, so some instance is stuck with extra data. Just a hypothesis.


#6

I used 15.12. 8KB corresponds to a chunk size of 64x64 (x 16 bit). I think the overhead really is the chunk-open part (as it is huge for small chunks and gets smaller for bigger chunks). Does this also create a PostgreSQL query for each chunk? I’ll check the QueryTiming output of the logs and post it here. SciDB is built in Release mode, so I have to rebuild it with RelWithDebugOutput (something like that). I need a day or two to set this up :wink:
What scaling do you see on your clusters, where you also control the network and disk setup? Is it near linear?


#7

No, there is no Postgres query for each chunk. The chunk maps are completely independent and per-instance. Postgres stores only metadata such as array schemas and instance names. So when you start the query, Postgres takes some hits in the beginning, but after that there is no per-chunk Postgres business.

RelWithDebInfo has to do with debug symbols. I think you should still be able to turn on DEBUG logging even if it’s built in Release mode. Have you tried? It’s controlled in /opt/scidb/…/share/scidb/log4cxx.properties and the log file is written independently by each instance.
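
If it helps, switching to DEBUG usually just means raising the level on the rootLogger line of that log4cxx.properties file. A hedged sketch only; the appender name (“file” here) and the other lines in the shipped file will differ:

```properties
# log4cxx.properties (log4j-style syntax): raise the root logger to DEBUG.
# "file" stands for whatever appender name your shipped file already uses.
log4j.rootLogger=DEBUG, file
```

Each instance then writes the extra detail into its own scidb.log.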

In our case, scaling of course depends on the operation performed. Things like sort / redimension and some joins have to share and move data around, so they’re not going to be perfect. For things like sums and compute-parallel tasks you can get pretty close. See an example here: https://github.com/paradigm4/elastic_resource_groups

Of course, if one instance gets stuck with more data than the others, it will slow the whole query down. Also, if the task is sub-second, the little setup/teardown handshakes start to dominate. In essence, because of Amdahl’s law you rarely see just a straight line: eventually the line flattens, but the shape depends on the task, the data distribution and so on.
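
For reference, Amdahl’s law with parallel fraction p of the work and N instances bounds the speedup as:

```
speedup(N) = 1 / ((1 - p) + p / N)
```

So even with p = 0.99, 64 instances give at most about a 39x speedup.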

Another good test is to double the cores AND double the data at the same time - expecting the curve to stay flat. That is another useful measurement that is less sensitive to overhead.


#8

Thanks a lot for your input. Unfortunately my AWS request for more GPU instances has already expired, so I have to create another one before I can re-run the test. I quickly ran some tests on a 2-node cluster; the sg-send and sg-receive parameters didn’t change the overall runtime much.
As soon as I have more machines online I will do further tests. I’ll run the GPU convolution operator to check the SciDB scaling, as each chunk needs the same amount of work and is completely independent (thanks to the overlaps). The DEM algorithm we use solves an inverse problem, and some cells take much longer than others, which leads to an unbalanced distribution of work across the cluster. I will dig deeper to check whether this is the main cause (and how bad the distribution is) and keep you posted. Of course I don’t expect a perfectly straight line, but I think it can/should be a bit better than what we got earlier. But let’s see how the convolution performs.

And yes, I can use the DEBUG logging. As soon as I re-run the test on the cluster, I’ll post the logs.


#9

Sorry for my late reply; I was quite busy the last few weeks. I tested the scalability again with the image convolution, and it is in fact near linear. The same is true for the scientific algorithm if I redistribute the hard pixels to different chunks or simply increase the number of images to solve. The following plot shows the speedup factor compared to an implementation of the algorithm running on one CPU core. We reach a speedup of 10⁴ with SciDB and GPUs :smiley: