Increasing memory usage of SciDB after starting up


#1

I am currently using SciDB 14.3. I found that each time I import data into SciDB, its memory usage after start-up increases. My guess is that arrays are cached in memory when SciDB starts, though only a small portion of each array rather than the whole thing. I created many chunks (3.1 KB per chunk) to hold the data, so perhaps what is cached in memory is the address of each chunk. Is this the reason? Is it possible to keep SciDB’s memory usage at a constant level after initialization?

PS: small chunks are preferred since they can indeed facilitate the querying process.


#2

Hmmmm …

We’re in the habit of advising people to go for chunk sizes of between 1 Meg and 16 Meg, rather than ~3K per chunk. There are several reasons for this, chief among which is that our internal testing has more or less convinced us that (as of the 13.X and 14.X releases and this will change) those chunk sizes yield the best overall performance for a wide range of workload queries.

Now … the reason you’re seeing the memory bloat is that (as of 14.8, and this will change in the next couple of releases) we maintain a single, shared, and in-memory data structure to hold all of the per-chunk metadata. Smaller chunks means more chunks which means more instances of the per-chunk metadata structure. More chunks also means more time spent in per-chunk look-up. Our storage manager needs to go from < Array_Name, Attribute ID, Chunk ID, Version# > to the < file, physical offset > of the chunk’s data within the datastore data file.
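To make that look-up concrete, here is a minimal Python sketch of a single shared in-memory map from a chunk key to a chunk location. This is not SciDB’s actual data structure: the names `ChunkKey`, `ChunkLocation`, `ChunkMap`, `load_array` and the file name `datastore.data` are all hypothetical. It only illustrates that the number of entries held in memory grows with the chunk count.

```python
from typing import Dict, NamedTuple


class ChunkKey(NamedTuple):
    """Hypothetical key: < Array_Name, Attribute ID, Chunk ID, Version# >."""
    array_name: str
    attribute_id: int
    chunk_id: int
    version: int


class ChunkLocation(NamedTuple):
    """Hypothetical value: < file, physical offset >."""
    file: str
    offset: int


class ChunkMap:
    """One shared in-memory map with an entry for every chunk (illustrative)."""

    def __init__(self) -> None:
        self._entries: Dict[ChunkKey, ChunkLocation] = {}

    def register(self, key: ChunkKey, location: ChunkLocation) -> None:
        self._entries[key] = location

    def lookup(self, key: ChunkKey) -> ChunkLocation:
        return self._entries[key]

    def __len__(self) -> int:
        return len(self._entries)


def load_array(chunk_map: ChunkMap, array_name: str, n_chunks: int,
               chunk_bytes: int, datafile: str = "datastore.data") -> None:
    """Register n_chunks chunks of attribute 0, version 1, laid out back to back."""
    for chunk_id in range(n_chunks):
        key = ChunkKey(array_name, 0, chunk_id, 1)
        chunk_map.register(key, ChunkLocation(datafile, chunk_id * chunk_bytes))


chunk_map = ChunkMap()
load_array(chunk_map, "small_chunks", n_chunks=1000, chunk_bytes=4096)
load_array(chunk_map, "large_chunks", n_chunks=4, chunk_bytes=1024 * 1024)

# Every chunk costs one map entry, regardless of how little data it holds.
print(len(chunk_map))
print(chunk_map.lookup(ChunkKey("small_chunks", 0, 3, 1)))
```

In this toy version, the array with small chunks contributes 1000 entries while the one with large chunks contributes only 4, even though the latter stores about the same amount of data.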

We’re planning to change the whole way this works. The initial “put everything in one place” was OK so long as there weren’t many arrays and there weren’t many users. But it’s a point of creakiness in the current implementation – unbounded memory usage, contention between query readers / writers on the shared data structure, slow look-up times for databases with lots of arrays/chunks/attributes – so we’re going to fix it. But for now, I’m afraid the only thing you can do is to increase your chunk size and thereby reduce your chunk count.
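As a rough illustration of how much the chunk count drops, the Python sketch below counts the chunks needed for a fixed data volume at ~3.1 KB, 1 Meg and 16 Meg per chunk. The 10 GiB total is an assumed figure chosen only for the example, “Meg” is read as MiB, and `chunk_count` is a hypothetical helper, not a SciDB function.

```python
def chunk_count(total_bytes: int, chunk_bytes: int) -> int:
    """Number of chunks needed to hold total_bytes (integer ceiling division)."""
    return -(-total_bytes // chunk_bytes)


# Assumed total data volume, purely illustrative.
TOTAL_BYTES = 10 * 1024**3

# Chunk sizes mentioned in this thread, converted to bytes.
CHUNK_SIZES = {
    "~3.1 KB": int(3.1 * 1024),
    "1 Meg": 1024**2,
    "16 Meg": 16 * 1024**2,
}

for label, size in CHUNK_SIZES.items():
    # One per-chunk metadata entry is needed for each of these chunks.
    print(f"{label:>8}: {chunk_count(TOTAL_BYTES, size):>10,} chunks")
```

Since there is one metadata entry per chunk, the entry count scales the same way: moving from ~3 KB to 1 Meg chunks cuts it by a factor of a few hundred.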

But I’m curious … you write that “small chunks are preferred since they can indeed facilitate the querying process”. Is the workload goal to pull / slice a relatively tiny segment out of an array (a couple of K)? I’m asking because we’re not seeing too much of that style of workload. We’re seeing a lot of medium -> large slices of an array being addressed and the data processing / analytics are pretty CPU intensive, so we’ve prioritized scalability over slice read performance. Eager to learn what you’re up to …


#3

So is the whole chunk map, i.e. every < Array_Name, Attribute ID, Chunk ID, Version# > entry, held in memory, or only part of it? I ask because, according to my tests, 14.3 seems to hold only part of the chunk map, so it takes a long time to retrieve the chunk map of a large array.

Small chunks are preferable for time series extraction. The dataset is three-dimensional precipitation data (float). Each grid is 4000 x 4000, and one day has 96 time steps, i.e. the time resolution is 15 min. For hydrologic purposes, the time series for a single location is queried frequently. For a 96 x 4000 x 4000 array, this means the result is 96 x 1 x 1. My tests show that, for a 24 x 4000 x 4000 array, time series extraction is faster with a chunk size of 100 x 100 x 1 than with 800 x 800 x 1.
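A back-of-the-envelope Python sketch of why the smaller chunks could help here: it counts the chunks and cells that a single-location time series extraction would touch for the two chunk sizes above. It assumes that every chunk containing the location is read in full, that cells are 4-byte floats stored uncompressed, and that dimensions are ordered (time, y, x). The helper names are hypothetical, and none of this reflects how SciDB actually executes the query.

```python
import math

CELL_BYTES = 4  # assumed: 4-byte float, uncompressed


def chunks_in_array(array_shape, chunk_shape):
    """Total number of chunks needed to tile the array."""
    return math.prod(math.ceil(a / c) for a, c in zip(array_shape, chunk_shape))


def time_series_cost(array_shape, chunk_shape):
    """Chunks touched and cells read to extract all time steps at one (y, x).

    Assumes each chunk that contains the location is read in full.
    """
    time_len = array_shape[0]
    chunks_touched = math.ceil(time_len / chunk_shape[0])
    cells_read = chunks_touched * math.prod(chunk_shape)
    return chunks_touched, cells_read


ARRAY_SHAPE = (24, 4000, 4000)  # (time, y, x), the test array in this post

for chunk_shape in [(1, 100, 100), (1, 800, 800)]:
    total = chunks_in_array(ARRAY_SHAPE, chunk_shape)
    touched, cells = time_series_cost(ARRAY_SHAPE, chunk_shape)
    print(f"chunk {chunk_shape}: {total} chunks in array, "
          f"{touched} touched, {cells * CELL_BYTES} bytes read")
```

Under these assumptions both layouts touch 24 chunks, but the 800 x 800 chunks pull in 64 times as many cells, which would be consistent with the faster extraction measured for the small chunks. The trade-off is the total chunk count: 38,400 chunks versus 600 for the same array.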

Time series extraction is indeed a problem for the NetCDF classic format, which uses a contiguous storage structure. The aim of my research is to investigate whether a multidimensional array can perform better on such a query. So far, small chunks perform well, which is also why I guess the addresses of an array’s chunks are loaded into memory, so that SciDB can retrieve the relevant chunks quickly.

I did not find a specific implementation of a space-filling curve (SFC) or an index such as an R-tree in the source code (maybe I missed some part). I guess this is because SciDB is general-purpose, i.e. intended not only for spatial data but also for business data, for example.


#4

You are right, SciDB does not have space-filling curves or R-trees.