Delete array versions


#1

Hi,
Recently I have been working on a task that repeatedly imports data (a precipitation grid) at successive time steps into an array in SciDB, and the problem I have encountered is SciDB's versioning.
The schema of the array is MPE_A<IRRATE:float> [T_idx=0:*,1,0, Y_idx=0:3999,4000,0, X_idx=0:3999,4000,0], where T_idx, Y_idx and X_idx are the time, latitude and longitude dimensions respectively. Each imported grid is a 4000 x 4000 raster at one time step. After importing 8 grids, 8 chunks are created (checked through "chunk map") and 8 versions are established; each chunk stores one grid. However, since I only use the latest version, I discarded all previous versions by copying the array with store(MPE_A, MPE_B). MPE_B still contains 8 chunks with the same chunk storage sizes as MPE_A. I then benchmarked the two arrays with time-series selection, aggregation, etc., and found that MPE_B does process queries faster.
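
For reference, each step of this workflow looks roughly like the following AFL (a minimal sketch; GRID_T8 is a placeholder for one freshly loaded 4000 x 4000 slice, and the load itself is omitted):

[code]-- append one new slice (each insert creates one more array version):
insert(redimension(apply(GRID_T8, T_idx, int64(8)), MPE_A), MPE_A);
versions(MPE_A);        -- lists one version per insert
store(MPE_A, MPE_B);    -- MPE_B receives only the latest contents[/code]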

Question 1: why do MPE_A and MPE_B contain the same chunks (at least the same chunk storage sizes) but show different query performance?
Question 2: I only tested 8 grids. If many more grids are imported, say 1000, the copying step after each new import, i.e. the store(), can take a long time in total. Is it possible to turn the versioning off? (I do appreciate the versioning functionality of SciDB, which makes it possible to retrieve historical arrays.)
Question 3: a more interesting problem occurs when the schema of MPE_A is set to MPE_A<IRRATE:float> [T_idx=0:*,4,0, Y_idx=0:3999,4000,0, X_idx=0:3999,4000,0]. In this case, MPE_A contains 8 chunks with duplicated data while MPE_B contains only 2. Is there a way to remove previous versions or useless chunks directly, e.g. just through remove() (not currently possible), without copying all the data into another array?


#2

Hi,

The next release of SciDB (expected in about a month) will have two features of interest:

  1. an efficient way to remove the old versions of an array;
  2. temporary array objects, which are unversioned and not protected by transactions; they can be modified without the versioning overhead and then stored to a persistent array (rough sketch below).
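
Roughly, and purely for illustration (the final syntax may differ; MPE_TMP and MPE_FINAL are made-up names):

[code]-- (1) drop every version of MPE_A older than version 8, in place:
remove_versions(MPE_A, 8);

-- (2) accumulate data in an unversioned temp array, persist once at the end:
create temp array MPE_TMP <IRRATE:float> [T_idx=0:*,1,0, Y_idx=0:3999,4000,0, X_idx=0:3999,4000,0];
store(MPE_TMP, MPE_FINAL);[/code]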

Hopefully both of these features will make your life easier. Question 1 is still interesting; it could be a problem we need to investigate. Could you attach a chunk map dump of MPE_A and MPE_B? Thanks!


#3

Good to hear there will be improvements to versioning. The temporary array object presumably lives in memory and can then be stored into a persistent array, which means data are only copied to disk once. However, I wonder what happens if the temporary array is very large, say larger than memory. Will the temporary array then be written to disk? If so, then storing the temporary array into another array copies the data again, i.e. two copies in total, the same as the current versioning implementation.

I used "SELECT A.name, C.attid, C.dguid, C.dhdrp, C.doffs, C.addrs, C.coord, C.nelem, C.csize, C.usize, C.asize FROM list('arrays') AS A CROSS JOIN list('chunk map') AS C WHERE A.id = C.uaid AND A.name = 'arrayName'" to acquire the chunk maps.

Chunk map of MPE_A

[code]{No,inst,n} name,attid,dguid,dhdrp,doffs,addrs,coord,nelem,csize,usize,asize
{24,0,97321} 'MPE_A',0,3405,1217791312,8388608,1038229872,'{0, 0, 0}',16000000,5147496,5147496,8388608
{24,0,97322} 'MPE_A',0,3405,1217981344,16777216,1038374496,'{1, 0, 0}',16000000,5238880,5238880,8388608
{24,0,97323} 'MPE_A',0,3405,1218171376,25165824,1038518800,'{2, 0, 0}',16000000,5280028,5280028,8388608
{24,0,97324} 'MPE_A',0,3405,1218361408,33554432,1038663152,'{3, 0, 0}',16000000,5259716,5259716,8388608
{24,0,97325} 'MPE_A',0,3405,1218373840,41943040,1038676160,'{4, 0, 0}',16000000,5264896,5264896,8388608
{24,0,97326} 'MPE_A',0,3405,1218386272,50331648,1038689184,'{5, 0, 0}',16000000,5243300,5243300,8388608
{24,0,97327} 'MPE_A',0,3405,1218443104,58720256,1038735072,'{6, 0, 0}',16000000,5209524,5209524,8388608
{24,0,97328} 'MPE_A',0,3405,1221385936,67108864,1040914912,'{7, 0, 0}',16000000,5262640,5262640,8388608
{24,0,97329} 'MPE_A',1,3405,1217792200,0,1038227408,'{0, 0, 0}',16000000,48,48,512
{24,0,97330} 'MPE_A',1,3405,1217982232,512,1038372032,'{1, 0, 0}',16000000,48,48,512
{24,0,97331} 'MPE_A',1,3405,1218172264,1024,1038516336,'{2, 0, 0}',16000000,48,48,512
{24,0,97332} 'MPE_A',1,3405,1218362296,1536,1038660688,'{3, 0, 0}',16000000,48,48,512
{24,0,97333} 'MPE_A',1,3405,1218374728,2048,1038651568,'{4, 0, 0}',16000000,48,48,512
{24,0,97334} 'MPE_A',1,3405,1218387160,2560,1038686240,'{5, 0, 0}',16000000,48,48,512
{24,0,97335} 'MPE_A',1,3405,1218443992,3072,1038732064,'{6, 0, 0}',16000000,48,48,512
{24,0,97336} 'MPE_A',1,3405,1221386824,3584,1040911968,'{7, 0, 0}',16000000,48,48,512[/code]

Chunk map of MPE_B

[code]{No,inst,n} name,attid,dguid,dhdrp,doffs,addrs,coord,nelem,csize,usize,asize
{25,0,1618618} 'MPE_B',0,4001,1211555776,25165824,1051131808,'{0, 0, 0}',16000000,5147496,5147496,8388608
{25,0,1618619} 'MPE_B',0,4001,1211554000,8388608,1051138576,'{1, 0, 0}',16000000,5238880,5238880,8388608
{25,0,1618620} 'MPE_B',0,4001,1211554888,16777216,1051136112,'{2, 0, 0}',16000000,5280028,5280028,8388608
{25,0,1618621} 'MPE_B',0,4001,1211556664,33554432,1051132352,'{3, 0, 0}',16000000,5259716,5259716,8388608
{25,0,1618622} 'MPE_B',0,4001,1211562880,58720256,1051125856,'{4, 0, 0}',16000000,5264896,5264896,8388608
{25,0,1618623} 'MPE_B',0,4001,1211560216,50331648,1051134784,'{5, 0, 0}',16000000,5243300,5243300,8388608
{25,0,1618624} 'MPE_B',0,4001,1211558440,41943040,1051133568,'{6, 0, 0}',16000000,5209524,5209524,8388608
{25,0,1618625} 'MPE_B',0,4001,1211563768,67108864,1051126464,'{7, 0, 0}',16000000,5262640,5262640,8388608
{25,0,1618626} 'MPE_B',1,4001,1211561104,1024,1051124640,'{0, 0, 0}',16000000,48,48,512
{25,0,1618627} 'MPE_B',1,4001,1211559328,512,1051134176,'{1, 0, 0}',16000000,48,48,512
{25,0,1618628} 'MPE_B',1,4001,1211557552,0,1051132960,'{2, 0, 0}',16000000,48,48,512
{25,0,1618629} 'MPE_B',1,4001,1211561992,1536,1051125248,'{3, 0, 0}',16000000,48,48,512
{25,0,1618630} 'MPE_B',1,4001,1211566432,3072,1051128384,'{4, 0, 0}',16000000,48,48,512
{25,0,1618631} 'MPE_B',1,4001,1211564656,2048,1051127072,'{5, 0, 0}',16000000,48,48,512
{25,0,1618632} 'MPE_B',1,4001,1211565544,2560,1051127728,'{6, 0, 0}',16000000,48,48,512
{25,0,1618633} 'MPE_B',1,4001,1211567320,3584,1051129040,'{7, 0, 0}',16000000,48,48,512[/code]

Very interestingly, the chunks in MPE_A have regular "doffs" values (the offset of the chunk data within the datastore) while those of MPE_B are chaotic. I ran a sub-selection query, between(arrayName, 0,327,3643, 7,327,3643), 20 times for each array and calculated the average execution time with the largest 3 and smallest 3 values removed, i.e. an average over 14 runs per array. MPE_A takes 321 ms while MPE_B takes 195 ms. I also tested the query aggregate(between(arrayName, 0,327,3643, 7,327,3643), avg(IRRATE), Y_idx, X_idx), with similar results. With the between operator the "doffs" should not matter that much, so I cannot find the reason for the different performance in the chunk maps.

Interestingly, I found another problem. Selecting the chunk map seems to be an expensive operation; I constantly receive the error:

Error id: scidb::SCIDB_SE_NO_MEMORY::SCIDB_LE_MEMORY_ALLOCATION_ERROR
Error description: Not enough memory. Error 'std::bad_alloc' during memory allocation.

My SciDB configuration file is

[code][test]
server-0=localhost,1
db_user=xxxx
db_passwd=xxxx
install_root=/opt/scidb/14.3
pluginsdir=/opt/scidb/14.3/lib/scidb/plugins
logconf=/opt/scidb/14.3/share/scidb/log4cxx.properties
base-path=/home/scidb/data
tmp-path=/tmp
base-port=1239
interface=eth0
enable-catalog-upgrade=true
max-memory-limit=6144[/code]

After rebooting the system, selecting the chunk map works again.


#4

I am having a hard time (with the latest trunk, so 14.7 in essence) reproducing your "different times for MPE_A and MPE_B". Would you take a look at my efforts below, and let me know where you see differences between what you're seeing and what I'm doing here?

--
--   File:  Examples/Insert_Performance_Forum/schema.aql 
--
--  About: 
--
--    The purpose of this script is to try to understand the performance 
--  issue identified in http://www.scidb.org/forum/viewtopic.php?f=11&t=1393
--
--  Hygiene ... 
SET LANG AQL;
DROP ARRAY MPE_SLICE;
DROP ARRAY MPE_A;
DROP ARRAY MPE_B;
--
CREATE ARRAY MPE_SLICE 
< IRRATE:float NULL> 
[ Y_idx=0:3999,4000,0, X_idx=0:3999,4000,0 ];
--
CREATE ARRAY MPE_A
< IRRATE:float NULL> 
[ T_idx=0:*,1,0, Y_idx=0:3999,4000,0, X_idx=0:3999,4000,0 ];
--
CREATE ARRAY MPE_B
< IRRATE:float NULL> 
[ T_idx=0:*,1,0, Y_idx=0:3999,4000,0, X_idx=0:3999,4000,0 ];
--
SET NO FETCH;
SELECT *
  INTO MPE_SLICE
  FROM build ( MPE_SLICE, float(random()%1000)/100.0 );
--
--------------------------------------------------------------------------------
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
SET LANG AFL;
SET NO FETCH;
insert ( 
 (
  SELECT MIN ( IRRATE ) AS IRRATE
    FROM (
        SELECT int64(0) AS T_idx,
               IRRATE
          FROM MPE_SLICE
       )
  REDIMENSION BY [ T_idx=0:*,1,0, Y_idx=0:3999,4000,0, X_idx=0:3999,4000,0 ]
 ),
 MPE_A
);
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
--   Took from '14:31:09' to '14:36:55', or 346 seconds. 
--
SET LANG AQL;
SET FETCH;
SELECT COUNT(*) FROM MPE_A;
--
--  {i} COUNT
--  {0} 16000000
--
--   What we expect to see. 
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
SET LANG AFL;
SET NO FETCH;
insert (
 (
  SELECT MIN ( IRRATE ) AS IRRATE
    FROM (
        SELECT int64(1) AS T_idx,
               IRRATE
          FROM MPE_SLICE
       )
  REDIMENSION BY [ T_idx=0:*,1,0, Y_idx=0:3999,4000,0, X_idx=0:3999,4000,0 ]
 ),
 MPE_A
);
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
-- 
--   Took from '14:39:14' to '14:44:57', or 343 seconds.
--
SET LANG AQL;
SET FETCH;
SELECT COUNT(*) FROM MPE_A;
--
-- {i} COUNT
-- {0} 32000000
--
--   Which again, we expect to see. 
--
------------------------------------------------------------------------------
--
--  Sidebar. What do the chunks look like? 
--
SET LANG AFL;
SET FETCH;
aggregate (
  filter (
    cross_join (
      filter (
        list ('arrays'),
        name = 'MPE_A'
      ) AS A,
      list('chunk map') AS C
    ),
    A.id = C.uaid
  ),
  count(*) AS num_chunks,
  sum ( C.nelem ) AS total_cnt,
  min ( C.nelem ) AS min_per_chunk,
  max ( C.nelem ) AS max_per_chunk,
  sum ( C.usize ) AS total_size_in_bytes
);
--
-- 
-- {i} num_chunks, total_cnt, min_per_chunk, max_per_chunk, total_size_in_bytes
-- {0}          4,  64000000,      16000000,      16000000,           128645264
--
--  NOTE: So there are 2 "attributes" in this array: the data value attribute
--        IRRATE, and a second, "hidden" attribute that we use to track
--        whether or not a particular cell is empty or full. Now, in this
--        case, all of the cells are occupied (we say, "are non-empty"), so
--        the physical size of each bitmask chunk is trivial, as we RLE
--        compress them.
--
--        In this case, there are 2 "data" chunks, and each "data" chunk
--        holds 16,000,000 x 4-byte values, which is ~61 megabytes, for a
--        total of ~122 megabytes. From the total_size_in_bytes we see that
--        the array has 128,645,264 bytes assigned to it, so the overhead of
--        all of this is 128,645,264 - 2 x 64,000,000 = 645,264 bytes, or
--        about 0.6 megabytes. This figure includes all of the bit masks,
--        the per-chunk overhead, and some additional padding that you get
--        when aligning chunk data to physical allocation boundaries.
--
--        What this tells us is that we didn't get much data compaction using
--        RLE on the data chunks. That's because there aren't any runs of
--        repeating values in this particular data set.
--
------------------------------------------------------------------------------
-- 
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
SET LANG AFL;
SET NO FETCH;
insert (
 (
  SELECT MIN ( IRRATE ) AS IRRATE
    FROM (
        SELECT int64(2) AS T_idx,
               IRRATE
          FROM MPE_SLICE
       )
  REDIMENSION BY [ T_idx=0:*,1,0, Y_idx=0:3999,4000,0, X_idx=0:3999,4000,0 ]
 ),
 MPE_A
);
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true); 
--
--  From '15:53:56' to '15:59:41', which is 345 seconds. 
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
SET LANG AFL;
SET NO FETCH;
insert (
 (
  SELECT MIN ( IRRATE ) AS IRRATE
    FROM (
        SELECT int64(3) AS T_idx,
               IRRATE
          FROM MPE_SLICE
       )
  REDIMENSION BY [ T_idx=0:*,1,0, Y_idx=0:3999,4000,0, X_idx=0:3999,4000,0 ]
 ),
 MPE_A
);
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
--  From '16:01:52' to '16:07:47', which is 355 seconds. 
--
SET LANG AQL;
SET FETCH;
SELECT COUNT(*) FROM MPE_A;
--
-- {i} COUNT
-- {0} 64000000
--
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
SET LANG AFL;
SET NO FETCH;
insert (
 (
  SELECT MIN ( IRRATE ) AS IRRATE
    FROM (
        SELECT int64(4) AS T_idx,
               IRRATE 
          FROM MPE_SLICE
       )
  REDIMENSION BY [ T_idx=0:*,1,0, Y_idx=0:3999,4000,0, X_idx=0:3999,4000,0 ]
 ),
 MPE_A
);
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
--  From '16:12:32' to '16:18:13', which is 341 seconds. 
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--  
SET LANG AFL;
SET NO FETCH;
insert (
 (
  SELECT MIN ( IRRATE ) AS IRRATE
    FROM (
        SELECT int64(5) AS T_idx,
               IRRATE
          FROM MPE_SLICE
       )
  REDIMENSION BY [ T_idx=0:*,1,0, Y_idx=0:3999,4000,0, X_idx=0:3999,4000,0 ]
 ),
 MPE_A
);
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
--  From '16:22:52' to '16:28:35', which is 343 seconds. 
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--  
SET LANG AFL;
SET NO FETCH;
insert ( 
 (  
  SELECT MIN ( IRRATE ) AS IRRATE
    FROM (     
        SELECT int64(6) AS T_idx,
               IRRATE
          FROM MPE_SLICE          
       )                          
  REDIMENSION BY [ T_idx=0:*,1,0, Y_idx=0:3999,4000,0, X_idx=0:3999,4000,0 ]
 ),
 MPE_A
);
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
--  From '16:29:37' to '16:35:21', which is 344 seconds. 
--
--  
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
-- 
SET LANG AFL;
SET NO FETCH;
insert (
 ( 
  SELECT MIN ( IRRATE ) AS IRRATE
    FROM (    
        SELECT int64(7) AS T_idx,
               IRRATE
          FROM MPE_SLICE
       )
  REDIMENSION BY [ T_idx=0:*,1,0, Y_idx=0:3999,4000,0, X_idx=0:3999,4000,0 ]
 ),
 MPE_A
);
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
--  From '17:47:13' to '17:52:58', which is 345 seconds. 
--
SET LANG AQL;
SET FETCH;
SELECT COUNT(*) FROM MPE_A;
--
-- {i} COUNT
-- {0} 128000000
--
--  So. 8 "slices", each of which has 16,000,000 cells. 
--
------------------------------------------------------------------------------
--
--  Sidebar. What do the chunks in MPE_A look like?
--
SET LANG AFL;
SET FETCH;
aggregate (
  filter (
    cross_join (
      filter (
        list ('arrays'),
        name = 'MPE_A'
      ) AS A,
      list('chunk map') AS C
    ),
    A.id = C.uaid
  ),
  count(*) AS num_chunks,
  sum ( C.nelem ) AS total_cnt,
  min ( C.nelem ) AS min_per_chunk,
  max ( C.nelem ) AS max_per_chunk,
  sum ( C.usize ) AS total_size_in_bytes
);
--
-- {i} num_chunks, total_cnt, min_per_chunk, max_per_chunk, total_size_in_bytes
-- {0}         16, 256000000,      16000000,      16000000,           514581056
--
--  Once again, there are 16 chunks: 8 time-step chunks x 2 "attributes". 
--
-------------------------------------------------------------------------------
--
--   Let's do a suite of queries that do simple aggregations, apply, filter, 
--  slice, etc. 
--
--   NOTE: What I am not showing in this script is the way I bounced the 
--         server between runs of each query, and ensured that my Linux 
--         cache was flushed. That's why each query is repeated: the 
--         first run "warms up the cache", and the second simply re-uses 
--         whatever cached data it can. That's also why, just looking at 
--         the numbers, the second query takes much less time than the first. 
--
--  scidb.py stopall [your_instance_name_goes_here]
--  echo "free && sync && echo 3 > /proc/sys/vm/drop_caches && free" | sudo sh 
--  scidb.py startall [your_instance_name_goes_here]
--
--  T1: Global aggregate. 
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
SET LANG AQL;
SELECT COUNT(*) AS CNT,
       AVG ( IRRATE ) AS AVG_IRRATE
  FROM MPE_A;
--
SELECT COUNT(*) AS CNT,
       AVG ( IRRATE ) AS AVG_IRRATE
  FROM MPE_A; 
--
-- {i} CNT,AVG_IRRATE
-- {0} 128000000,4.99574
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
--  From '19:13:41' to '19:13:48', which is ~7secs. (Ran this a few times, 
--  with about the same result each time.) 
--
--  T2: Group-By aggregate. 
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
SET LANG AQL;
SELECT COUNT(*) AS CNT,
       AVG ( IRRATE ) AS AVG_IRRATE
  FROM MPE_A
 GROUP BY T_idx;
--
SET LANG AQL;
SELECT COUNT(*) AS CNT,
       AVG ( IRRATE ) AS AVG_IRRATE
  FROM MPE_A
 GROUP BY T_idx;
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
--   From '19:20:02' to '19:23:38', which is 216 secs. 
--
--  T3: Complex (apply(...)) Group-By aggregate. 
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
SET LANG AQL;
SELECT AVG ( log ( abs ( IRRATE ) + 1.0 ) * 10.0 ) AS AVG_IRRATE
  FROM MPE_A
 GROUP BY T_idx;
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
SET LANG AQL;
SELECT AVG ( log ( abs ( IRRATE ) + 1.0 ) * 10.0 ) AS AVG_IRRATE
  FROM MPE_A
 GROUP BY T_idx;
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
--  From '21:48:02' to '21:51:52', which is 230 seconds. 
--
--  T4: Slice(...) and aggregate.
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
SET LANG AQL;
SELECT AVG ( log ( abs ( IRRATE ) + 1.0 ) * 10.0 ) AS AVG_IRRATE
  FROM slice ( MPE_A, T_idx, 3 );
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
SET LANG AQL;
SELECT AVG ( log ( abs ( IRRATE ) + 1.0 ) * 10.0 ) AS AVG_IRRATE
  FROM slice ( MPE_A, T_idx, 3 );
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
--  From '22:01:04' to '22:01:59', which is 55 seconds. 
-- 
-------------------------------------------------------------------------------
--
-- OK. Now copy the data from MPE_A into MPE_B.
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
SET LANG AQL;
SET NO FETCH;
SELECT * 
  INTO MPE_B
  FROM MPE_A;
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
--   From '22:21:48' to '22:22:24', which is 36 seconds. 
--
------------------------------------------------------------------------------
--
--  Sidebar. What do the chunks in MPE_B look like?
--
SET LANG AFL;
SET FETCH;
aggregate (
  filter (
    cross_join (
      filter (
        list ('arrays'),
        name = 'MPE_B'
      ) AS A,
      list('chunk map') AS C
    ),
    A.id = C.uaid
  ),
  count(*) AS num_chunks,
  sum ( C.nelem ) AS total_cnt,
  min ( C.nelem ) AS min_per_chunk,
  max ( C.nelem ) AS max_per_chunk,
  sum ( C.usize ) AS total_size_in_bytes
);
--
-- {i} num_chunks, total_cnt, min_per_chunk, max_per_chunk, total_size_in_bytes
-- {0}         16, 256000000,      16000000,      16000000,           514581056
--
--  Identical to MPE_A. 
--
-------------------------------------------------------------------------------
--
-- Now, repeat the 4 queries above. 
--
--  T1: Global aggregate. 
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
SET LANG AQL;
SELECT COUNT(*) AS CNT,
       AVG ( IRRATE ) AS AVG_IRRATE
  FROM MPE_B;
--
SELECT COUNT(*) AS CNT,
       AVG ( IRRATE ) AS AVG_IRRATE
  FROM MPE_B;
--
-- {i} CNT,AVG_IRRATE
-- {0} 128000000,4.99574
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
-- From '22:30:14' to '22:30:21', being 7 seconds. 
--
--  T2: Group-By aggregate.
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
SET LANG AQL;
SELECT COUNT(*) AS CNT,
       AVG ( IRRATE ) AS AVG_IRRATE
  FROM MPE_B
 GROUP BY T_idx;
--
SET LANG AQL;
SELECT COUNT(*) AS CNT,
       AVG ( IRRATE ) AS AVG_IRRATE
  FROM MPE_B
 GROUP BY T_idx;
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
--  From '22:32:15' to '22:35:46', which is 211 seconds. 
--
--  T3: Complex (apply(...)) Group-By aggregate.
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
SET LANG AQL;
SELECT AVG ( log ( abs ( IRRATE ) + 1.0 ) * 10.0 ) AS AVG_IRRATE
  FROM MPE_B
 GROUP BY T_idx;
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
SET LANG AQL;
SELECT AVG ( log ( abs ( IRRATE ) + 1.0 ) * 10.0 ) AS AVG_IRRATE
  FROM MPE_B
 GROUP BY T_idx;
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
--  From '22:45:25' to '22:49:11', which is 226 seconds. 
--
--  T4: Slice(...) and aggregate.
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
SET LANG AQL;
SELECT AVG ( log ( abs ( IRRATE ) + 1.0 ) * 10.0 ) AS AVG_IRRATE
  FROM slice ( MPE_B, T_idx, 3 );
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
SET LANG AQL;
SELECT AVG ( log ( abs ( IRRATE ) + 1.0 ) * 10.0 ) AS AVG_IRRATE
  FROM slice ( MPE_B, T_idx, 3 );
--
SET LANG AQL;
SET FETCH;
SELECT now() FROM build(<dummy:bool>[I=0:0,1,0], true);
--
--  From '22:55:40' to '22:56:35', which is 55 seconds. 
--
-------------------------------------------------------------------------------
--
--  So:       MPE_A          MPE_B 
--
--    T1:         7              7 
--    T2:       216            211
--    T3:       230            226 
--    T4:        55             55
-- 
-------------------------------------------------------------------------------
--
--    I'm not seeing significant differences. 
--
-------------------------------------------------------------------------------

#5

As you may know, my data indeed contain a lot of identical values, namely 0, which is very common for precipitation data. With RLE encoding such data compress a great deal, as the chunk map shows. I ran a more intensive test on a larger portion of the precipitation data, 16 time steps, using your 4 types of queries. My test script is as follows:

[code]#!/bin/bash
#Timemeasure.sh

printf "MPEinSciDB\n" >> MPEinSciDB_time
printf "MPEinSciDB_2\n" >> MPEinSciDB_2_time
printf "Q1\n" >> MPEinSciDB_time
printf "Q1\n" >> MPEinSciDB_2_time
for (( i=1; i<=20; i++ ))
do
/usr/bin/time -f "%e" -a -o MPEinSciDB_time iquery -nq "SELECT COUNT(*) AS CNT, AVG ( IRRATE ) AS AVG_IRRATE FROM MPEinSciDB"
/usr/bin/time -f "%e" -a -o MPEinSciDB_2_time iquery -nq "SELECT COUNT(*) AS CNT, AVG ( IRRATE ) AS AVG_IRRATE FROM MPEinSciDB_2"
done

printf "\nQ2\n" >> MPEinSciDB_time
printf "\nQ2\n" >> MPEinSciDB_2_time
for (( i=1; i<=20; i++ ))
do
/usr/bin/time -f "%e" -a -o MPEinSciDB_time iquery -nq "SELECT COUNT(*) AS CNT, AVG ( IRRATE ) AS AVG_IRRATE FROM MPEinSciDB GROUP BY T_idx"
/usr/bin/time -f "%e" -a -o MPEinSciDB_2_time iquery -nq "SELECT COUNT(*) AS CNT, AVG ( IRRATE ) AS AVG_IRRATE FROM MPEinSciDB_2 GROUP BY T_idx"
done

printf "\nQ3\n" >> MPEinSciDB_time
printf "\nQ3\n" >> MPEinSciDB_2_time
for (( i=1; i<=20; i++ ))
do
/usr/bin/time -f "%e" -a -o MPEinSciDB_time iquery -nq "SELECT AVG ( log ( abs ( IRRATE ) + 1.0 ) * 10.0 ) AS AVG_IRRATE FROM MPEinSciDB GROUP BY T_idx"
/usr/bin/time -f "%e" -a -o MPEinSciDB_2_time iquery -nq "SELECT AVG ( log ( abs ( IRRATE ) + 1.0 ) * 10.0 ) AS AVG_IRRATE FROM MPEinSciDB_2 GROUP BY T_idx"
done

printf "\nQ4\n" >> MPEinSciDB_time
printf "\nQ4\n" >> MPEinSciDB_2_time
for (( i=1; i<=20; i++ ))
do
/usr/bin/time -f "%e" -a -o MPEinSciDB_time iquery -nq "SELECT AVG ( log ( abs ( IRRATE ) + 1.0 ) * 10.0 ) AS AVG_IRRATE FROM slice ( MPEinSciDB, T_idx, 3 )"
/usr/bin/time -f "%e" -a -o MPEinSciDB_2_time iquery -nq "SELECT AVG ( log ( abs ( IRRATE ) + 1.0 ) * 10.0 ) AS AVG_IRRATE FROM slice ( MPEinSciDB_2, T_idx, 3 )"
done[/code]
Here MPEinSciDB has the same schema as MPE_A. Just by running store(MPEinSciDB, MPEinSciDB_2), I observed that the schema of MPEinSciDB_2 differs from MPEinSciDB: the upper boundary of its T_idx dimension is 15 instead of *. The same is true for MPE_B. But looking at the chunk map, there is no storage difference between the bounded and unbounded schemas, so I do not think this influences query performance.
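
(The schema difference is easy to see with show(); the comments here just restate what I observed:)

[code]show(MPEinSciDB);     -- T_idx=0:*   (unbounded)
show(MPEinSciDB_2);   -- T_idx=0:15  (upper bound materialized by the store)[/code]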

For all 4 queries, the response times of MPEinSciDB are:

[code]MPEinSciDB
Q1
0.38
0.35
0.32
0.33
0.33
0.33
0.33
0.32
0.33
0.34
0.34
0.33
0.33
0.34
0.38
0.35
0.33
0.33
0.42
0.34

Q2
71.21
71.06
70.69
71.32
70.70
69.92
71.01
70.47
70.49
70.84
70.12
71.01
70.74
70.88
70.81
69.02
72.44
70.70
70.41
70.37

Q3
69.52
70.39
69.38
70.05
70.48
71.63
70.55
69.54
70.78
70.96
69.33
69.78
70.38
70.73
70.75
70.65
69.84
70.76
69.37
70.95

Q4
3.98
4.05
4.00
3.94
4.12
4.02
3.97
4.03
4.20
4.10
4.06
4.11
4.02
4.11
4.07
4.00
4.03
3.96
4.00
4.00[/code]
While for MPEinSciDB_2, they are:

[code]MPEinSciDB_2
Q1
0.25
0.23
0.25
0.22
0.22
0.22
0.22
0.22
0.22
0.21
0.22
0.22
0.22
0.23
0.29
0.23
0.22
0.23
0.22
0.22

Q2
71.34
70.83
70.77
72.47
71.05
70.99
71.05
70.49
70.57
70.87
70.55
70.50
70.30
70.39
70.85
70.86
70.19
70.57
72.12
70.97

Q3
67.56
67.47
67.92
67.45
69.67
68.43
68.13
67.37
67.71
67.80
69.41
67.88
69.37
68.46
68.05
67.98
68.05
67.58
69.26
67.68

Q4
4.19
3.93
4.04
4.01
4.15
4.09
4.07
4.01
4.23
4.02
4.05
4.04
4.05
4.17
4.06
4.23
3.97
4.09
4.01
4.01[/code]
The unit is seconds. The mean response time for each query, calculated with the largest 3 and smallest 3 values removed, is:

        MPEinSciDB    MPEinSciDB_2
Q1:     0.336s        0.223s
Q2:     70.73s        70.78s
Q3:     70.3s         68s
Q4:     4.03s         4.06s
These are similar results to what you measured.

I also constructed MPE_A and MPE_B using your method, and used the following script with my own queries:

[code]#!/bin/bash
#Timemeasure.sh

printf "MPE_A\n" >> MPE_A_time
printf "MPE_B\n" >> MPE_B_time
printf "Q1\n" >> MPE_A_time
printf "Q1\n" >> MPE_B_time
for (( i=1; i<=20; i++ ))
do
/usr/bin/time -f "%e" -a -o MPE_A_time iquery -anq "between(MPE_A, 0, 267, 2034, 7, 267, 2034)"
/usr/bin/time -f "%e" -a -o MPE_B_time iquery -anq "between(MPE_B, 0, 267, 2034, 7, 267, 2034)"
done

printf "\nQ2\n" >> MPE_A_time
printf "\nQ2\n" >> MPE_B_time
for (( i=1; i<=20; i++ ))
do
/usr/bin/time -f "%e" -a -o MPE_A_time iquery -anq "aggregate(between(MPE_A, 0, 267, 2034, 7, 267, 2034), avg(IRRATE))"
/usr/bin/time -f "%e" -a -o MPE_B_time iquery -anq "aggregate(between(MPE_B, 0, 267, 2034, 7, 267, 2034), avg(IRRATE))"
done[/code]

The average response times for the two arrays are:

        MPE_A    MPE_B
Q1:     0.11s    0.11s
Q2:     1.95s    1.61s

However, with the MPEinSciDB and MPEinSciDB_2 arrays, the average response times are:

        MPEinSciDB    MPEinSciDB_2
Q1:     0.29s         0.18s
Q2:     0.31s         0.19s
From the above tests, although the difference might not be significant (actually, if we calculate the relative difference, it is quite noticeable for my queries and data), the fact that MPEinSciDB_2 is consistently faster than MPEinSciDB (and likewise MPE_B faster than MPE_A) indicates a systematic effect: something must be different between the two arrays given their different version histories.
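
One way to confirm that the version history is the only catalog difference (a quick check, not something I have run here) is to list the versions of each array:

[code]versions(MPEinSciDB);     -- should list 16 versions, one per insert
versions(MPEinSciDB_2);   -- should list 1 version, from the single store()[/code]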

I do not know how to explain your tests, where both arrays perform at the same level. My feeling is that our differing results are due to the different operators, i.e. query types, we use. Or it might be that the overhead of MPEinSciDB relative to MPEinSciDB_2 is basically a constant that does not grow with data scale, which would be a good sign.


#6

I tested some more. With MPEinSciDB and MPEinSciDB_2, the between operator does not show a significant difference, while for various aggregation types like avg, max and stdev the two arrays do indeed perform differently, with MPEinSciDB_2 responding faster than MPEinSciDB.
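
The aggregation comparisons were along these lines (illustrative, following the pattern of the earlier scripts):

[code]aggregate(MPEinSciDB,   avg(IRRATE));
aggregate(MPEinSciDB_2, avg(IRRATE));
aggregate(MPEinSciDB,   max(IRRATE));
aggregate(MPEinSciDB_2, max(IRRATE));
aggregate(MPEinSciDB,   stdev(IRRATE));
aggregate(MPEinSciDB_2, stdev(IRRATE));[/code]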