Redimension, store large array


#1

I am trying to redimension a 2.5 terabyte dataset in SciDB. I was able to load it successfully using parallel load. However, I keep receiving a network error, which I traced to a memory error. Is there a reasonable way to calculate when you will encounter these limits?

I’m taking the approach the plumber suggested and will write, parallel load, and redimension in smaller chunks. But I was somewhat surprised that the redimension failed.


#2

Hi @dahaynes! Sorry for the trouble and thanks for letting us know.

Could you give us a couple more pieces of information:

scidb --version
cat /opt/scidb/[VER]/etc/config.ini
iquery -aq "show(INPUT_ARRAY)"
iquery -aq "summarize(INPUT_ARRAY)"
iquery -aq "show(TARGET_ARRAY)"

Where INPUT_ARRAY is the result of your load and TARGET_ARRAY is the array you’re redimensioning into. If you’re not doing a plain redimension/insert, please let us know what query you’re running as well.


#3

Sure, here you go.

SciDB 16.9

AFL% show(LoadArray);
{i} schema
{0} 'LoadArray<y1:int64,x1:int64,value_1:uint8> [xy=0:*:0:1000000]'
iquery -anq "load(LoadArray, 'pdataset.scidb', -1, '(int64, int64, uint8)')";
iquery -anq "create array meris_2010_3m_500 <value:uint8> [y=0:894999,500,0; x=0:2081299,500,0]";


#4

Hello, dahaynes!

Sorry to interject, but you asked an interesting question.
I also deal with SciDB data loading: how did you generate your input array? Do you have a piece of software that serializes data into the SciDB storage format?

Thank you!


#5

@raro

Yes, I have a function in Python that takes a NumPy array and writes it out as binary. I work mostly with geospatial data (2D and 3D arrays).
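
Roughly, the writer looks like the sketch below (a simplified version, not my exact code; the function name and offsets are just placeholders). It packs each cell as an (int64 y, int64 x, uint8 value) record to match the '(int64, int64, uint8)' binary load format:

import numpy as np

def write_scidb_binary(array_2d, y_offset, x_offset, path):
    """Dump a 2D uint8 NumPy array as packed (int64 y, int64 x, uint8 value)
    records, i.e. the '(int64, int64, uint8)' binary load format."""
    # Global y/x coordinates for every cell of this piece.
    ys, xs = np.meshgrid(
        np.arange(array_2d.shape[0], dtype=np.int64) + y_offset,
        np.arange(array_2d.shape[1], dtype=np.int64) + x_offset,
        indexing='ij')
    # Packed record layout: 8 + 8 + 1 = 17 bytes per cell.
    records = np.empty(
        array_2d.size,
        dtype=np.dtype([('y', '<i8'), ('x', '<i8'), ('value', 'u1')]))
    records['y'] = ys.ravel()
    records['x'] = xs.ravel()
    records['value'] = array_2d.ravel().astype(np.uint8)
    records.tofile(path)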


#6

@apoliakov

I adjusted the SciDB config file, but it still failed after 1-2 days of trying to load:

small-memalloc-size=65536
large-memalloc-limit=2147483647

But I have a second question regarding strategies for redimensioning large files. I am sure the strategy I am using is naive.

I am reading from a large 2D array (GeoTIFF) in Python and writing it out as binary, with proper x,y columns. I have 12 SciDB instances available and my Python script uses multiprocessing. I ran out of memory in the parallel Python processes by accident before, so right now each process has a threshold of 10 million pixels to read. Reading and writing takes about 10 seconds. Loading takes 10-20 seconds with parallel load. Redimension and store take 6-7 minutes.

When I originally started this program my files were much smaller, so I would partition the large array (height & width) along a single dimension (height) into X number of chunks. When I am working through this large array, I read and write in very different areas of the array. Imagine the blocks below are the sections of the image. I read one row at a time across blocks 0, 1, 2…, then parallel load and redimension (a stripped-down sketch of this approach follows the diagram). Would I get better performance from the redimension operator if I read all of block 0 and then moved through the blocks sequentially?

[00000000]
[00000000]
[00000000]
[00000000]

[11111111]
[11111111]
[11111111]
[11111111]

[22222222]
[22222222]
[22222222]
[22222222]
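
For reference, here is the stripped-down sketch of the partitioning mentioned above (the rasterio reader and the file paths are placeholders, and write_scidb_binary is the writer sketched in #5):

from multiprocessing import Pool
import rasterio
from rasterio.windows import Window

TIF_PATH = '/data/meris_2010_3m.tif'        # placeholder input path
OUT_TEMPLATE = '/data/pdataset_%07d.scidb'  # placeholder output pieces
MAX_PIXELS = 10_000_000                     # per-process read threshold
N_WORKERS = 12                              # one worker per SciDB instance

def partition(height, width):
    """Yield (row_start, n_rows) bands that stay under the pixel threshold."""
    rows_per_band = max(1, MAX_PIXELS // width)
    for row_start in range(0, height, rows_per_band):
        yield row_start, min(rows_per_band, height - row_start)

def process_band(band_args):
    """Read one horizontal band of the raster and dump it as SciDB binary."""
    row_start, n_rows = band_args
    with rasterio.open(TIF_PATH) as src:
        block = src.read(1, window=Window(0, row_start, src.width, n_rows))
    # write_scidb_binary is the (int64, int64, uint8) writer from #5.
    write_scidb_binary(block, row_start, 0, OUT_TEMPLATE % row_start)

if __name__ == '__main__':
    with rasterio.open(TIF_PATH) as src:
        bands = list(partition(src.height, src.width))
    with Pool(N_WORKERS) as pool:
        pool.map(process_band, bands)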


#7

Hey @dahaynes

It depends on the schema of the target array and a few other factors. Optimizing redimension can be a bit tricky before you get a feel for it.

What redimension does is sort data in parallel, then split it up into chunks - according to the target array schema. Those chunks are then passed to insert. If the data is too big to fit in memory, redimension also tries to spill pieces to disk while sorting.

So the run time will depend on the chunking of the target array (square or skinny), how the redimensioned data lands in the target chunks (1 chunk, 3 chunks, or a bunch of small fragments all over), and how big a job it is - if it's too small, you may not use all the nodes; if it's too big, it will run out of memory, start spilling to disk, and go slower. Then the data is fed into insert, and insert can slow down if your new data overlaps chunks you've loaded before. This all depends on the chunk sizes too.
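
For a plain redimension/insert with the arrays from your earlier post, the query would look something like the line below; since your load attributes are named y1/x1/value_1, you would need an apply (or a cast) to line the names up with the target's y/x/value first:

iquery -anq "insert(redimension(apply(LoadArray, y, y1, x, x1, value, value_1), meris_2010_3m_500), meris_2010_3m_500)"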

The config that affects redimension most is actually mem-array-threshold - I wouldn't touch those memalloc configs you pointed out.

On 16.9 (this changed in 18.1) you should initially set mem-array-threshold such that (mem-array-threshold + smgr-cache-size) * num_instances <= TOTAL_RAM*0.5 - in megabytes. Basically that tells SciDB to use half your total cluster RAM for caches and temp arrays. The other half would go to mid-query scratch space and sort buffers (depending on your concurrency levels). Then you can reduce smgr-cache-size to as little as 8 and increase mem-array-threshold if needed. Then, if no one else is doing anything on the machine, try increasing that 0.5 number.
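
For example, with your 12 instances and, say, 192 GB of total cluster RAM (a made-up figure, just for the arithmetic): 0.5 * 192 GB is roughly 98304 MB, and 98304 / 12 = 8192 MB per instance, so something like smgr-cache-size=8 with mem-array-threshold=8184 would fit inside that half.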

The best strategy for load is to ingest, say, 120 chunks at a time (10 per instance, maybe) and load it in such a way that new loads never touch chunks you’ve already loaded - if possible.
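
For example, with the 500x500 chunking of meris_2010_3m_500, ingesting complete horizontal bands whose row ranges line up on multiples of 500 (rows 0-499, then 500-999, and so on) means every redimension/insert writes brand-new chunks and never has to rewrite a chunk filled by an earlier load.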

There are more notes here: SciDB's MAC(tm) Storage Explained.
If you show me your array schema and your config.ini I can help more.


#8

I’ll see if I can capture all of those changes


#9

Thanks for the suggestions. Looks like it will take me about 6 days to load 2.5 terabytes.

iquery -anq "create array meris_2010_3m_chunk <value:uint8> [y=0:894999,1000,0; x=0:2081299,1000,0]";

Estimated time for loading the dataset in minutes 8809.91573805213: LoadTime: 10.713575050234795 seconds, RedimensionTime: 46.458048194646835 seconds
6 Days.

The chunk size, as you said, affects the redimension time. This is the estimate for the same array with 500x500 chunks:
Estimated time for loading the dataset in minutes 13035.94879022638: LoadTime: 14.66616940498352 seconds, RedimensionTime: 53.50885558873415 seconds
9 Days.

Loading 20 million values per instance on the parallel load.


#10

Hey @dahaynes how is it going?
Did you try increasing mem-array-threshold as well?

Some folks we work with see much faster load rates (sometimes multiple TB per day), but that is affected by many factors, including hardware - the number of nodes, instances, and cores, and disk read and write speeds all play a role. And there are some code performance improvements we are making as well.

Please do keep us posted on how it goes.


#11

Everything finished as far as I can tell. I don’t know whether mem-array-threshold was at the default setting or not.

I do have another question… How do you kill a query? Especially load …


#12

When you are in iquery command line mode, CTRL+C should kill an ongoing command.


#13

I need something that is more external to the system, like:

SELECT pg_cancel_backend(pid);


#14

Sorry I misinterpreted your question the first time.

See this post: Cancel/kill query.

Please let me know on that thread if you have further questions regarding query cancellation.