SciDB Performance curves


#1

Hello,

I’m conducting some analysis on SciDB and I’d like to generate a series of performance curves for each analytic we propose to use. We have built a secondary script which modifies the SciDB configuration and then restarts the database. Is this approach correct?

On the server we are using, we have 12 SciDB instances on 24 physical cores. If I scale the analysis across 1, 2, 4, 8, and 12 cores, how does the data potentially get reorganized to accommodate fewer cores while maintaining the original number of instances? I don’t plan on changing the number of instances, because doing so would wipe out our original data stores.


#2

Hey David!

Most queries will use up to 1 core per query per instance. But what happens if you have 4 cores and 8 instances? Usually the OS will time-slice the CPU efficiently and you can expect the query to take roughly 2x as much time as it would with 8 cores. There may be some variation depending on what exactly you’re doing. Sometimes hyper-threading is viewed as “having another core” and, depending on what you’re running, you could see a number better than 2x. For real-world examples, see Figures 5 and following in this paper: https://www.sciencedirect.com/science/article/pii/S0924271617300898

What if you have 8 instances and 16 cores? If only one query is running at a time, then usually the other 8 cores will be idle - unless you launch some multithreaded streaming or do something special. But sometimes folks do this when they know they will reasonably have at least 2 users running queries at the same time. Also, in environments like EC2, adding cores often means adding RAM too, and you may see a benefit from that.

For more examples, see: https://github.com/Paradigm4/elastic_resource_groups


#3

Hey Alex,

Well, I guess I am asking if there is a way in SciDB to limit the number of cores, or even track the number of cores, that are processing a query. When we set up the database we disabled hyper-threading. We wrote a command that modifies the SciDB config; it allows us to specify the operator threads and fetch threads. But after running the analytics at 4, 8, and 12, I don’t seem to be seeing a strong difference. Perhaps the datasets are too small?


#4

Hey David, sorry about the delay.

What you’re seeing is expected - those “-threads” settings do concurrency control. They make a difference when you have multiple users running queries at the same time. If only one query is running at a time, you won’t see a difference from changing those.
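For reference, those settings live in each instance's config.ini and need a restart to take effect. A rough sketch of that kind of change-and-restart step - the parameter names, values, paths, and cluster name below are placeholders, so check the configuration reference for your SciDB version:

# placeholder parameter names/values - verify against your version's config reference
sudo sed -i 's/^execution-threads=.*/execution-threads=5/' /opt/scidb/16.9/etc/config.ini
sudo sed -i 's/^result-prefetch-threads=.*/result-prefetch-threads=4/' /opt/scidb/16.9/etc/config.ini

# restart the cluster so the new values are picked up
# (cluster name 'mydb' is a placeholder; the admin tool name varies by release)
scidb.py stopall mydb
scidb.py startall mydb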

So, like I mentioned, a query will typically use 1 core per instance, so changing the number of instances gives you the most control. Unfortunately, if you don’t have the Enterprise Edition, then changing the number of instances takes longer to do. The best way to do it with the Community Edition is to do a parallel opaque save, then re-load the files into a bigger or smaller cluster: https://paradigm4.atlassian.net/wiki/spaces/scidb/pages/241041478/Backing+Up+Restoring+Arrays
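Concretely, that round trip looks roughly like this - the array and file names are placeholders, the re-created array must have the same schema, and the wiki page above covers the full procedure:

# parallel opaque save: with instance id -1, every instance writes its local
# chunks to a file of this name inside its own data directory
iquery -anq "save(MY_ARRAY, 'MY_ARRAY.opaque', -1, 'opaque')"

# after re-initializing the cluster with the new instance count and re-creating
# MY_ARRAY with the same schema, re-load the per-instance files in parallel
iquery -anq "store(input(MY_ARRAY, 'MY_ARRAY.opaque', -1, 'opaque'), MY_ARRAY)"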

With the elastic resource groups plugin (link I gave above), you can somewhat “fake it out.” If you have 16 instances you can say “create an array that lives on instances 0,1,2,3 only”. Then, initially, queries against that array will use only those four instances, so you’d expect aggregates and filters to slow down in proportion. But if you then do big sorts, redimensions, or joins, SciDB is liable to re-scatter the data across all 16 instances during the query.

Hope it helps!


#5

I might consider that for another paper, but it seems a bit difficult. I think the dense datasets that we are loading are pushing this to the max. We are loading 90 meter continental US data into 2-dimensional arrays with dimensions 104423 x 161189 = 16,831,838,947 pixels, and we are thinking about pushing that to 3 meter and potentially sub-meter resolution. The biggest issues are the loading and the redimensioning. Any strategies around those would be appreciated.
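For context, the 90 meter target array looks something like this (the attribute name/type and chunk sizes here are just illustrative):

# dense 2-D grid, 104423 x 161189 cells, chunked 2048 x 2048 (illustrative values)
iquery -aq "CREATE ARRAY conus_90m <elev:double> [y=0:104422,2048,0, x=0:161188,2048,0]"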

However, I am interested in whether there are queries that will provide some information about how the data is scattered across the cluster of instances.


#6

Which version of SciDB are you on? If you are on 16.9, then use the summarize operator from the GitHub plugin, like so:

iquery -aq "summarize(temp, 'per_instance=1')"

If you are on SciDB version 18.1 (the latest release), then summarize is built into core SciDB. The syntax has changed slightly; see the official docs:

summarize(temp, by_instance:true);

#7

Assuming a 64-bit representation, your current dataset would be about 16.8 billion cells x 8 bytes, or ~135 GB, right?

If you go to 3 meters, would the data grow 30*30 = 900 times? (The factor of 30 is for going from 90 meters to 3 meters.) We would then be looking at data on the order of 120 TB.

Independent of the above answers you probably need to consider the following:

loading

Parallel binary loading is the fastest way to load data into SciDB. One benchmark I worked on reached an 8 GB/min load rate using a 40-instance SciDB cluster. Some links for you to see in this realm:
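Links aside, the basic shape of a parallel binary load is roughly the following - the file name, attribute types, and one-dimensional staging schema are placeholders:

# with instance id -1, each instance reads a file named 'slab.bin' from its own
# data directory into a 1-D staging array of (y, x, value) triplets;
# you then redimension that staging array into the 2-D target
iquery -anq "
  store(
    input(<y:int64, x:int64, elev:double> [i=0:*,1000000,0],
          'slab.bin', -1, '(int64, int64, double)'),
    STAGING_ARRAY)"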

redimensioning

One simple trick comes to mind.

Instead of

insert(redimension(LARGE_ARRAY, TARGET_ARRAY), TARGET_ARRAY)

Do it in chunks

# bash / scidbR / scidbPy equivalent of
for i in 1:num_chunks
    insert(
        redimension(
            filter(LARGE_ARRAY, filter_on_some_dimension_at_index_i),
            TARGET_ARRAY),
        TARGET_ARRAY)

BTW, in 16.9 you would need between() to get fast dimensional slicing.
As of 18.1, filter() handles that under the hood.
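A concrete bash/iquery version of that loop might look like the following - the array names and slab width are placeholders, and between() takes one low/high pair per dimension of LARGE_ARRAY (one pair here, assuming a 1-D staging array):

# hedged sketch: redimension one dimensional slab at a time (16.9-style, using between)
num_slabs=16
slab_width=1000000000      # placeholder: roughly (dimension length) / num_slabs
for i in $(seq 0 $((num_slabs - 1))); do
  lo=$((i * slab_width))
  hi=$(((i + 1) * slab_width - 1))
  iquery -anq "
    insert(
      redimension(
        between(LARGE_ARRAY, ${lo}, ${hi}),
        TARGET_ARRAY),
      TARGET_ARRAY)"
done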

Feel free to keep the discussion going.


#8

@Kriti_Sen_Sharma

Thanks for all the helpful information. I should clarify that the “loading” issues have really been about my lack of understanding of how SciDB loads data. I have a small Python library built to handle reading the GeoTIFF files and subsequently loading the data.

The current process reads and writes binary tiles whose extents were determined by the chunk size, and then loads and redimensions those tiles into the large array. I implemented this in parallel through Python, but it is really inefficient compared to a parallel load and subsequent redimension. I hope to have time to rewrite the algorithm soon. I will read through the paper you suggested.

Question: are there any performance benefits to the looping strategy you suggest, compared to a single large parallel load and redimension? Perhaps there are ways to use both for very big datasets.