Slow insert operator performance


#1

I have 800 staging arrays that I am trying to merge into one final array.
Each staging array is about 100MB, so the total size of the final array is roughly 80GB.

I am using a 4-node cluster with 64GB of RAM and 16 CPUs per node.
I have 4 SciDB instances per node.

The staging data is evenly distributed over the cluster instances.
The issue I am running into is that the insert time keeps increasing, roughly in proportion to the size of the final array. When the target array was empty, an insert into it was very quick (20s), but right after that the time started increasing significantly: after the third insert it was 1 min, after the 4th it went up to 3 min, and now after 130 inserts it is up to 15 min per insert. Since I have 800 inserts to do, I run them in parallel with 20 insert threads, but 15 min per insert is killing me.

This is my schema.
The final target array:
CREATE ARRAY current_speed <speed:double> [
lat=-2000000:6000000,500000,0,
lon=8000000:16000000,2000000,0,
level=-10000:0,5,0,
time=1979010100:2025123100,100000,0];

All the staging arrays have the same schema:
CREATE ARRAY current_speed1979010100 <speed:double> [lat=-2000000:6000000,500000,0, lon=8000000:16000000,2000000,0, level=-10000:0,5,0, time=1979010100:2025123100,100000,0];
The staging arrays obviously don't overlap in the time dimension.

This is the insert command I am using; there are 800 of them:
insert(current_speed1979010100, current_speed);


#2

This document shows the run-time performance of the first 150 inserts:
https://drive.google.com/file/d/0B0RN46PshZjSd1lqcEJCdkhocW8/view?usp=sharing


#3

I also noticed this slow performance. The reason it occurs is that every insert() operation creates a new version of the array, and every new insert() then has to concern itself with multiple versions. I am not sure it has to be this way; theoretically it should only have to look at the latest version of each chunk, since versions are not (yet) stored as diffs. However, perhaps there is some overhead of which I am unaware.

In any case, to solve my problem, I twisted the developers' arms to get them to create the remove_versions() command (http://www.paradigm4.com/HTMLmanual/14.12/scidb_ug/re44.html), which you can use every so often (how often is determined empirically) to clean up these old, and probably unwanted, versions. Note that you'll have to keep track of which version number you are creating as you go along.
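For what it's worth, a minimal sketch of that cleanup (the version number 130 here is only an example; you supply the latest version you have created):

  project(versions(current_speed), version_id);   -- see which versions currently exist
  remove_versions(current_speed, 130);            -- keep version 130 and drop everything older

remove_versions() keeps the version you name and removes all of the older ones.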


#4

Ah, this makes sense. I also kept running out of disk space, and keeping old versions around explains it.
There used to be an option for immutable arrays, where you could only insert new data and not update it. Now it seems that option was removed in v13.
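If I remember the old syntax right, it was just an extra keyword on the DDL, something like this (no longer accepted in the newer releases):

  CREATE IMMUTABLE ARRAY current_speed <speed:double> [
    lat=-2000000:6000000,500000,0,
    lon=8000000:16000000,2000000,0,
    level=-10000:0,5,0,
    time=1979010100:2025123100,100000,0];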


#5

I am reading the suggestions on how to remove array versions:
paradigm4.com/HTMLmanual/14.3/sc … 09s04.html
Is there a better way of doing this, instead of making a copy of the entire array every time?

If you want to keep only the latest version of the array, you can perform the following steps to remove all other versions:

AFL% store(stocks,stocks_tmp);

AFL% remove(stocks);

AFL% rename(stocks_tmp,stocks);

To verify, use the versions() operator again:

AFL% project(versions(stocks), version_id);

{VersionNo} version_id
{1} 1

To keep one of the intermediate versions of stocks, rather than the latest, use the version syntax when specifying the array in the store() query. For example, to save the second version of stocks, you would run the following query:

AFL% store(stocks@2,stocks_tmp);


#6

I have tried
remove_versions(current_speed, aggregate(versions(current_speed), max(version_id)));

It does not work because remove_versions needs a number and aggregate returns a single-row array.
How do I convert a single-row array to a number?


#7

Hello,

The time it takes for an individual insert to complete should be proportional to the size of the chunks that are modified by the insertion. If you have a 2D matrix with chunk sizes of 1000 x 1000, the quickest way to add new data to this matrix is to add N * 1000 rows at a time to the "bottom" of the matrix, or N * 1000 columns at a time to the "right edge". The slowest way would be to do random insertions of values, one at a time. If you can insert in a way that only adds new chunks, that should speed things up.
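To make that concrete, here is a toy sketch (the array M and the staging array new_rows are hypothetical, not from your schema):

  -- 2D matrix with 1000 x 1000 chunks:
  CREATE ARRAY M <v:double> [row=0:999999,1000,0, col=0:999999,1000,0];

  -- Fast: if new_rows covers only rows 0..999 that M does not yet hold, then
  -- redimensioning and inserting it creates brand-new chunks and never rewrites
  -- an existing one.
  insert(redimension(new_rows, M), M);

  -- Slow: scattering single values across chunks that already hold data forces
  -- each of those chunks to be rewritten (and re-versioned) on every insert.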

The versioning is, indeed, a little annoying. We’re planning to make that disappear soon. Unfortunately at the moment you have to do this:

  MAX_VERSION=`iquery -ocsv -aq "aggregate(versions($ARRAY_NAME), max(version_id) as max_version)" | tail -n 1`
  iquery -anq "remove_versions($ARRAY_NAME, $MAX_VERSION)" > /dev/null

It's important to point out, though, that versions are applied on a per-chunk basis. If you don't change a chunk with an insert, then a new version of that chunk isn't created. Hope it helps.


#8

Alex, I have a 4D array (timestamp, lat, lon, depth) with a speed attribute.
The data I am inserting is non-overlapping (e.g. every 12 hours), so in theory nothing should be modified, because every timestamp is unique. If you look at the Excel chart you will see the insertion time increasing rapidly in proportion to the number of inserts.
So it seems that versions are created regardless of whether the data being inserted overlaps in dimensional coordinates or not.


#9

Aha.

I’m very sorry, I missed the link to the spreadsheet.
I generated a graph like that myself, not too long ago, using 10,000 files in an insertion that took more than 24 hours total:


In my case I was putting up to 50 files in one chunk, so I got a seesaw plot. I was also issuing a “remove versions” query after every insertion but that should not be a big effect. It took sub-second time every time.

So I would be very surprised if something changed fundamentally between my experiment and yours. I wonder if the extra spikes could be caching effects, or maybe remove_versions is more important than I thought. I am pretty confident the difference is not fundamental and should be correctable.


#10

Another possibility is that you’re creating a great many entries in some of the system catalogs.

I find it hard to believe that this is the problem with only a few hundred arrays, but it's worth checking. Try running the PostgreSQL VACUUM command on the postgres database.

[plumber@localhost trunk]$ sudo -u postgres psql mydb2
Password: 
psql (8.4.20)
Type "help" for help.

mydb2=# VACUUM ANALYZE;
VACUUM
mydb2=# 

We use some fairly sophisticated Postgres queries to pull metadata and to insert information into the catalogs for each new array version, including repeated details about each array's dimensions and attributes. I suspect that by updating the statistics, you'll find PostgreSQL chooses different physical plans for these queries.

Easy to try … let us know how it goes.


#11

Also,

Your chunk size along time is 100000. So (I am not sure if this is true) if one file has time = 1979010100 and another file has time = 1979010112, then those two files would, indeed, go into the same chunk. Does that make sense?
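A quick back-of-the-envelope check (using the chunk origin and chunk length from the schema in post #1):

  # The time dimension starts at 1979010100 with a chunk length of 100000, so two
  # timestamps land in the same chunk whenever this integer division gives the same result:
  echo $(( (1979010100 - 1979010100) / 100000 ))   # chunk index 0
  echo $(( (1979010112 - 1979010100) / 100000 ))   # chunk index 0 -- same chunk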


#12

Every time I start SciDB now I get an endless stream of messages like this. How do I stop it?
2015-03-09 22:34:38,458 [0x7f830ff007c0] [ERROR]: Network error in handleSendMessage #104(‘Connection reset by peer’), instance 3 (10.10.101.101)
2015-03-09 22:34:38,458 [0x7f830ff007c0] [DEBUG]: Recovering connection to instance 3
2015-03-09 22:34:38,458 [0x7f830ff007c0] [ERROR]: Network error in handleSendMessage #104(‘Connection reset by peer’), instance 5 (10.10.101.101)
2015-03-09 22:34:38,458 [0x7f830ff007c0] [DEBUG]: Recovering connection to instance 5
2015-03-09 22:34:38,458 [0x7f830ff007c0] [ERROR]: Network error in handleSendMessage #104(‘Connection reset by peer’), instance 15 (10.10.101.104)
2015-03-09 22:34:38,458 [0x7f830ff007c0] [DEBUG]: Recovering connection to instance 15
2015-03-09 22:34:38,459 [0x7f830ff007c0] [ERROR]: Could not get the remote IP from connected socket to/frominstance 3. Error:107(‘Transport endpoint is not connected’)
2015-03-09 22:34:38,459 [0x7f830ff007c0] [DEBUG]: Connected to instance 3, 10.10.101.101:1242
2015-03-09 22:34:38,459 [0x7f830ff007c0] [DEBUG]: Connected to instance 5 (10.10.101.101), 10.10.101.101:1244
2015-03-09 22:34:38,459 [0x7f830ff007c0] [DEBUG]: Connected to instance 15 (10.10.101.104), 10.10.101.104:1247


#13

The versions feature is big trouble. Because it increases the size of the data so much, I ran out of disk space (80GB of data, 1.6TB of disk in use before the crash). Once it crashes, SciDB cannot be restored; the only way to recover the installation is to reinstall the cluster from scratch.

Alex, what size of raw CSV files are you using in your benchmark test?


#14

So … the problem isn't "versions" per se. They're just bookkeeping we use to keep transactions separated. The problem is that SciDB (at the moment) uses a pretty simple-minded approach to handling changes to chunk data on each insert(…) operation. We're looking at ways of remedying this …

Anyway … here are some notes about your overall approach that I think will really help. But before we get into too many details, take a look at the "Loading Multiple Images" slide of the Best Practices Loading Guide, which covers a lot of what follows with lots of colorful pictures.

Looking at your arrays …

CREATE ARRAY current_speed <speed:double> 
[
  lat=-2000000:6000000,500000,0,
  lon=8000000:16000000,2000000,0,
  level=-10000:0,5,0,
  time=1979010100:2025123100,100000,0  <-- this is a potential problem.
];

CREATE ARRAY current_speed1979010100 
<speed:double> [
  lat=-2000000:6000000,500000,0,
  lon=8000000:16000000,2000000,0,
  level=-10000:0,5,0,
  time=1979010100:2025123100,100000,0
];

I think Alex P. is onto something. Even though, as you point out, each load array won't have any overlap with its peers, the target array (current_speed) will have multiple data copies coming into each slice along the time dimension. What I suspect is going on is that you're repeatedly inserting into the same chunk and modifying it, which is what's causing the size to blow up.

So … how do we avoid this problem? Here's how I would modify your loading approach. The overall idea is to batch the files into one time-slice of the data at a time.

  1. Modify your load operation to load multiple images into a single, 4D load array. That is, have an array named (say):
CREATE ARRAY current_speed_1979010100_to_1980010100 
<speed:double> [
  lat=-2000000:6000000,500000,0,
  lon=8000000:16000000,2000000,0,
  level=-10000:0,5,0,
  time=1979010100:2025123100,1,0  <-- NOTE: one "slice" per input file
];
 Some notes:

i. This intermediate array is the same size/shape as the eventual target, but the time dimension is chunked into one slice per original file. You will load multiple original files into it. (See below …)

ii. I worry very, very much about the per-dimension chunking decisions you've made here and elsewhere. How many cells are you getting per chunk here? (Also, see below …)

  2. Load all of the original files that belong together into that array. From your original script, this will involve a loop over each file, with a per-file load query that looks (something) like this:
for file_name in $(ls -1 | grep etc); do

TIME_INDEX=$(some_complex_little_sed_using_$file_name);

iquery -aq "
insert (
  redimension (
    apply (
      input (
        < speed : double, lat_val : double, lon_val : double, level : int32 >
        [ row=0:*,1000000,0 ],    -- a 1D schema for the binary load
        '${file_name}',
        0,
        '(double,double,double,int32)'
      ),
      time,  ${TIME_INDEX},
      lat,   some_function ( lat_val ),
      lon,   some_function ( lon_val )
    ),
    current_speed_1979010100_to_1980010100
  ),
  current_speed_1979010100_to_1980010100
); "

done
Some notes: 

i. The ${TIME_INDEX} is something you substitute in from each file's name, or from wherever you get the original "1979010100" values.

ii. The calculation of lat and lon (which in the target are going to be integers) is going to depend on the nature of your original data. But I have a couple of questions here …

   Q1: Is this data derived from a list of location/time/speed readings? How do you handle cases where more than one recording "collides" in one cell of the target?

   Q2: I think it might be useful to retain the original values. But that's up to you.

Something else to think about from an application point of view: it might be useful to retain the complete precision of each original lat / lon value from the original data. The per-dimension representation is very useful as an index, but of less utility when calculating things like precise distances.

  3. Having loaded all of the images for the current_speed_1979010100_to_1980010100 array, the next step is to insert these values into the current_speed array. But before doing that, try to do a better job of picking the per-dimension chunk lengths. At this point you have all of the data you want to go into one "slice" of the current_speed array, so use the "USING" facility to get SciDB to pick the chunk lengths for you …
CREATE ARRAY calc_current_speed_per_dimension_chunk_lengths 
< 
   speed : double 
> 
[
   lat=?:?,?,0,               <--- Using ?:? here will get SciDB to pick the 
   lon=?:?,?,0,               <--- lower and upper from the data.
   level=?:?,5,0,
   time=?:?,100000,0          <--- keep the desired length here 
] USING current_speed_1979010100_to_1980010100; 

The query above will create an array (but it won't put any data into it), and it will calculate, for all of the dimensions, values for low / high / per-dimension chunk length, replacing each of the '?' values with what it finds. It works these values out by looking at the data in the current_speed_1979010100_to_1980010100 array and trying to build chunks with as close to 1,000,000 values as possible.

Some notes: 

When you use USING, you'll get a candidate array, and there will be no data in it. There's no need to use precisely what it suggests; depending on what you want to do, experiment with a few options. To get a handle on what SciDB has decided on your behalf, use the show(…) command …

show ( calc_current_speed_per_dimension_chunk_lengths ); 

  4. Finally, having created a "current_speed" array with a chunk organization that makes sense, store(redimension(…)) the contents of the intermediate array into it:
store ( 
  redimension ( 
    current_speed_1979010100_to_1980010100,
    current_speed
  ),
  current_speed
);

Then you can remove the current_speed_1979010100_to_1980010100 array, and repeat steps 2 and 4 until you've loaded all the image files.
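Dropping the intermediate array is just a remove() call:

remove ( current_speed_1979010100_to_1980010100 );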

 Some notes: 

i. Proceeding in this manner means you will have only one "slice" along the time dimension for each insert. This means you won't be increasing the storage overhead each time.

ii. There should be no need to repeat the per-dimension-chunk-length step for each new "year" of data. I am going to assume that the data is fairly consistently distributed, year to year.

iii. You can proceed in a step-wise, parallel manner. That is, you can repeat step 2 (loading a year's worth of files into a new intermediate array) at the same time that you're doing step 4 on the last set of files you loaded (see the sketch after these notes).
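A rough sketch of that pipelining as a driver script (the two helper scripts are hypothetical placeholders for steps 2 and 4, and it assumes the first year's intermediate array was loaded before the loop starts):

  #!/bin/bash
  # While year N is redimensioned into current_speed (step 4), year N+1 is
  # already being loaded into a fresh intermediate array (step 2).
  for year in $(seq 1979 2024); do
    ./load_year_into_intermediate.sh $(( year + 1 )) &   # step 2 for the next batch (hypothetical script)
    ./redim_intermediate_into_target.sh ${year}          # step 4 for the batch just loaded (hypothetical script)
    wait                                                 # let the background load finish before moving on
  done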

Hope this helps!


#15

Some answers to your questions:

For this test, I had 10,000 gzipped GVCF text files. They ranged in size from 50MB to 180MB uncompressed; the total uncompressed size was about 850GB. The load time in my case included unzipping each file, loading it, and redimensioning it into several target arrays. So that (as can be seen in the plot) added up to an average of 20 seconds per file.

The exact script I used is actually here: github.com/Paradigm4/vcf_tools/tree/master/gvcf
You can see my load_file script has a "remove versions" step at the end. I do that after each file, for now.

Now I am curious - were you able to start the system? Was the start issue you reported resolved?


#16

I had to reinstall the SciDB clusters from scratch…
I have two clusters, community and enterprise, and both went dead after running out of disk space.
I have to watch disk space usage super carefully.
In the future I need to test how both the community and enterprise versions react to a disk-shortage condition when running under heavy load with replication.
After the recovery I added a remove_versions step to the end of my insert script, and I noticed it limited the elapsed time to 20s per insert, whereas without remove_versions the insert time keeps rising exponentially.
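For reference, my per-insert step now looks roughly like this (adapted from Alex's snippet above, with the array names from my schema):

  iquery -anq "insert(current_speed1979010100, current_speed)" > /dev/null
  MAX_VERSION=`iquery -ocsv -aq "aggregate(versions(current_speed), max(version_id) as max_version)" | tail -n 1`
  iquery -anq "remove_versions(current_speed, $MAX_VERSION)" > /dev/null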
Thank you for pointing out the chunking issue with the time dimension and for providing more scripts. I will test them and let you know.