SciDB Memory Usage


#1

Hello,

I have a general question regarding memory usage for SciDB. I have close to 260 GB of data on a single-instance system and the SciDB process is consistently using at least 12 GB of system memory. Does this sound right, and how will this change with an increasing data volume?

Here is my current config.ini file:

[code][testDB]
server-0=localhost,0
db_user=testDB
db_passwd=testDB
install_root=/opt/scidb/12.3
metadata=/opt/scidb/12.3/share/scidb/meta.sql
pluginsdir=/opt/scidb/12.3/lib/scidb/plugins
logconf=/opt/scidb/12.3/share/scidb/log4cxx.properties
base-path=/home/scidb/data
base-port=1239
interface=eth0
no-watchdog=true
redundancy=0
merge-sort-buffer=1024
network-buffer=1024
mem-array-threshold=1024
smgr-cache-size=1024
execution-threads=16
result-prefetch-queue-size=4
result-prefetch-threads=4
chunk-segment-size=10485760[/code]
Thanks,

Nick


#2

Hello Nick,

It could be a lot of different things.
Are you seeing 12GB at rest or 12GB when running a query?

What is your chunking scheme for your largest arrays?
Try to run

iquery -r /dev/null -avq "scan(A)"

where A is one of the larger arrays that you have stored. What does it say in terms of “cells/chunk”?

It could be a bug we’ve fixed in 12.10, or it could be that your chunks are too small…


#3

Thanks for the reply. The 12 GB usage is at rest; it becomes much larger when running a query.

Here is the output from one of the larger arrays:

[code]iquery -r /dev/null -avq 'scan(GHCND_sparse)'
Result schema: GHCND_sparse@1 <ID:string, lat:float, lon:float, TMAX:int64, TMIN:int64, PRCP:int64, SNOW:int64, SNWD:int64, colDist:double, EmptyTag:indicator>[row=50:58650,58601,0, col=50:138350,138301,0, dayNumber=0:77600,100,0]
Result size (bytes): 7914811607 chunks: 776 cells: 300085490 cells/chunk: 386708
Query execution time: 60ms
Logical plan:
[lPlan]:

[lInstance] children 0
[lOperator] scan ddl 0
[paramArrayReference] object GHCND_sparse inputNo -1 objectNo -1 inputScheme 1
[opParamPlaceholder] PLACEHOLDER_ARRAY_NAME requiredType void ischeme 1
schema: GHCND_sparse@1<ID:string,lat:float,lon:float,TMAX:int64,TMIN:int64,PRCP:int64,SNOW:int64,SNWD:int64,colDist:double> [rowGHCND_sparse=50:58650,58601,0,colGHCND_sparse=50:138350,138301,0,dayNumberGHCND_sparse=0:77600,100,0]

Physical plans:
[pPlan]:

[pInstance] physicalScan agg 0 ddl 0 tile 1 children 0
schema GHCND_sparse@1<ID:string,lat:float,lon:float,TMAX:int64,TMIN:int64,PRCP:int64,SNOW:int64,SNWD:int64,colDist:double> [rowGHCND_sparse=50:58650,58601,0,colGHCND_sparse=50:138350,138301,0,dayNumberGHCND_sparse=0:77600,100,0]
props sgm 1 sgo 1
distr roro
bound start [50, 50, 0] end [54600, 137974, 77581] density 1 cells 583722830939850 chunks 776 est_bytes 2.02552e+17
;
[/code]

This is an array of weather stations I am projecting onto a global grid. I am keeping them in a very sparse array to avoid collisions. I plan on regridding in subsequent steps but would like to keep both arrays (the averaged and the raw) for queries. When I run a regrid (100x coarser in row and col) on this array, the SciDB process gradually consumes all system memory (32 GB) before crashing.
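
For reference, the regrid step I run looks roughly like the query below; the target array name and the avg(TMAX) aggregate are just placeholders here, since the real query aggregates several of the attributes:

[code]iquery -anq "store(regrid(GHCND_sparse, 100, 100, 1, avg(TMAX) as TMAX_avg), GHCND_regrid)"[/code]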

Also, I dropped some of the larger arrays (in terms of number of values stored) from the current database, and that didn’t seem to change the memory use.

Any advice would be greatly appreciated.

Thanks,

Nick


#4

Hi Nick,

This is a curious case. I was wondering if maybe you are suffering from the “many small chunks” problem - but that’s not what the output says. Your array is pretty well organized and 386+ thousand elements per chunk is good.

We’ve had a bug where very large, very sparse chunks at the edge of the array occupy too much space. For example, in the array you have, the last chunk is at coordinates {50,77599}. The chunk dimensions are 138301 by 100, but the array ends at day=77600, so the system used to create a mask containing 138301 “run lengths” of 1 to denote that there is no data after day=77600. That chunk could occupy a lot of memory, and if it were placed in the cache it could account for the large memory footprint you are seeing. We’ve fixed that bug in 12.10.

We also saw a few similar bugs with caching and with regrid over very sparse arrays; we’ve fixed those in 12.10 too. It’s likely that once you get to 12.10, many of these issues will go away.

Meanwhile, here are some things to try:

  1. It could still be that another array has too many small chunks. Do you have many arrays in the system? Care to perform that kind of query on several arrays and check to make sure? The numbers “chunks: 776” and “cells/chunk: 386708” are what you want to look for (see the sketch after this list).
  2. Have you done a lot of updates or repeated stores? Do you, maybe, have an array that’s at a high version number? I see that GHCND_sparse is only at version 1 - what about the others? In 12.3 the header for each chunk of each version is kept in memory; in 12.10 this gets better.
  3. Try stopping and starting the system - i.e. scidb.py stopall / scidb.py startall. What does the memory footprint look like? Does it grow at startup, or after you run a query or two? Does startup itself take a long time?
  4. I don’t know what your setup is like, but there is an unreleased, unofficial “12.7” kit you can try. You’d have to build it yourself; the source tarball is at scidb.org/tutorial_link/
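
For points 1 through 3, here is a rough shell sketch of the kind of checking I mean. The array names are placeholders, the ps line is just a crude way to eyeball resident memory, and you can skip the versions() call if it isn’t in your build:

[code]#!/bin/bash
# Substitute the arrays you actually have stored; these names are placeholders.
ARRAYS="GHCND_sparse some_other_array"

for A in $ARRAYS; do
    echo "=== $A ==="
    # Look at "chunks:" and "cells/chunk:" in the verbose output -
    # many chunks with only a few cells each is the "many small chunks" problem.
    iquery -r /dev/null -avq "scan($A)" 2>&1 | grep -E "chunks|cells"
    # How many stored versions does the array carry?
    iquery -aq "versions($A)"
done

# Restart the system and check the resident memory of the SciDB processes
# before and after running a query or two.
scidb.py stopall testDB
scidb.py startall testDB
ps aux | grep -i scidb | grep -v grep[/code]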

Please keep me posted.
-Alex Poliakov