Order of dimensions, data structure and query performance


#1

From my practical experience with SciDB and the behavior of “adddim” and “deldim”, I infer that the left-most dimension is the outer-most dimension used to organize the data. For example, suppose we have two spatial grids at two time steps:

 \ X -->
  \       Grid at first time step    Grid at second time step
 Y \         00    01                   00    01
 |        +-----+-----+              +-----+-----+
 |     00 |  A  |  B  |              |  E  |  F  |
 v        +-----+-----+              +-----+-----+
       01 |  C  |  D  |              |  G  |  H  |
          +-----+-----+              +-----+-----+
If we want the data stored on disk as ABCDEFGH, then the schema we use should be letter:string[T=0:1,2,0,Y=0:1,2,0,X=0:1,2,0]. However, if we want the data stored as AEBFCGDH, then we should use the schema letter:string[Y=0:1,2,0,X=0:1,2,0,T=0:1,2,0]. That is a permutation of the dimensions. By the usual reasoning, retrieving data that is stored close together should be faster than retrieving data that is scattered. So for extracting a time series in our case, e.g. BF, the second schema should be more efficient than the first, because B and F are adjacent under it.
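For concreteness, here is a minimal AFL/AQL sketch of the two layouts; the array names are hypothetical, and letters_T_first is assumed to be populated already (e.g. via input):

-- T outermost vs. T innermost versions of the example array.
CREATE ARRAY letters_T_first <letter:string> [T=0:1,2,0, Y=0:1,2,0, X=0:1,2,0];
CREATE ARRAY letters_T_last  <letter:string> [Y=0:1,2,0, X=0:1,2,0, T=0:1,2,0];

-- The same cells can be rearranged into the second layout; redimension matches
-- dimensions by name, so only their order (and thus the intended physical
-- ordering) changes.
store(redimension(letters_T_first, letters_T_last), letters_T_last);

-- The time series at (Y=0, X=1), i.e. the cells B and F:
between(letters_T_last, 0, 1, 0, 0, 1, 1);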

To verify this understanding, I ran a benchmark. The data I use has five dimensions: it is a forecast dataset with modelrun (M), ensemble (E), forecast (F), Y and X. M has 1 value, E has 20 members, F has 40 steps, Y has 181 latitudes and X has 360 longitudes. In total, 8 attributes are stored in the array. I use two schemas to store the array:

Schema 1
GEFS_S1<TMP:float,TMAX:float,TMIN:float,RH:float,APCP:float,TCDC:float,UGRD:float,VGRD:float> [M_idx=0:675,1,0,E_idx=0:19,20,0,F_idx=0:39,1,0,Y_idx=0:180,181,0,X_idx=0:359,360,0]

Schema 2
GEFS_S2<TMP:float,TMAX:float,TMIN:float,RH:float,APCP:float,TCDC:float,UGRD:float,VGRD:float> [M_idx=0:675,1,0,F_idx=0:39,1,0,Y_idx=0:180,181,0,X_idx=0:359,360,0,E_idx=0:19,20,0]

Note that M_idx holds only the single value 675.
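As an aside, since the two arrays share dimension names and ranges, the second can be populated directly from the first; a hedged sketch, assuming GEFS_S1 has already been loaded:

-- redimension matches dimensions by name, so this only permutes their order.
store(redimension(GEFS_S1, GEFS_S2), GEFS_S2);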

Some details about the storage:

Array name | Dimension order | Chunk size | Chunk count | Average chunk storage size | Array storage size
GEFS_S1 | Modelrun, ensemble, forecast, Y, X | 1 x 20 x 1 x 181 x 360 | 40 | 6.7 MB | 268 MB
GEFS_S2 | Modelrun, forecast, Y, X, ensemble | 1 x 1 x 181 x 360 x 20 | 40 | 6.2 MB | 247 MB

The query I executed extracts a time series of total precipitation (APCP) at a single spot location, covering all ensemble members. The corresponding queries for the two schemas are:

consume(project(between(GEFS_S1,675,0,0,38,184,675,19,39,38,184),APCP))

consume(project(between(GEFS_S2,675,0,38,184,0,675,39,38,184,19),APCP))
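As an aside, the same selection can be written with slice, which pins the single-valued dimensions by name and leaves E and F free; a sketch that should be equivalent in result (though not necessarily in speed):

-- Pin M, Y and X to the point of interest; the result is the full
-- 20-member, 40-step APCP time series at that location.
consume(project(slice(GEFS_S1, M_idx, 675, Y_idx, 38, X_idx, 184), APCP));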

I ran each of the between queries above 20 times per schema. According to the understanding described above, schema 2 should be faster, but my test results are (in seconds):

Run | Schema 1 | Schema 2
1 | 2.50 | 3.19
2 | 0.94 | 0.97
3 | 0.50 | 0.60
4 | 0.56 | 0.60
5 | 0.48 | 0.53
6 | 0.49 | 0.52
7 | 0.49 | 0.53
8 | 0.49 | 0.52
9 | 0.48 | 0.52
10 | 0.48 | 0.53
11 | 0.48 | 0.52
12 | 0.48 | 0.53
13 | 0.49 | 0.51
14 | 0.49 | 0.55
15 | 0.48 | 0.52
16 | 0.49 | 0.52
17 | 0.49 | 0.51
18 | 0.48 | 0.52
19 | 0.48 | 0.53
20 | 0.49 | 0.52

So my test results currently show the reverse, i.e. retrieving the data stored farther apart is faster than retrieving the data stored close together. This behavior is hard to understand. Any ideas?


#2

A couple of points:

  1. The performance data for schema 1 & schema 2 don’t look all that different.
  2. The configuration of instances & physical machines is not cited, but that is a very important factor because it determines the amount of data movement and the level of query parallelism.
  3. The data (i.e. chunks) on disk are not guaranteed to be laid out sequentially no matter how the dimensions are organized.
    What I think you should consider is the most common data access pattern. If you want the parallelism to occur “along the T axis” (i.e. multiple instances working on different data regions along the T dimension), T should be the last dimension. On the other hand, if you want the data for all Ts to be stored on the same instance (because the parallelism is along a different dimension(s)), T should be the first dimension.
    Hope this helps.

#3

[quote=“tigor”]A couple of points:

  1. The performance data for schema 1 & schema 2 don’t look all that different.
  2. The configuration of instances & physical machines is not cited, but that is a very important factor because it determines the amount of data movement and the level of query parallelism.
  3. The data (i.e. chunks) on disk are not guaranteed to be laid out sequentially no matter how the dimensions are organized.
    What I think you should consider is the most common data access pattern. If you want the parallelism to occur “along the T axis” (i.e. multiple instances working on different data regions along the T dimension), T should be the last dimension. On the other hand, if you want the data for all Ts to be stored on the same instance (because the parallelism is along a different dimension(s)), T should be the first dimension.
    Hope this helps.[/quote]

W.r.t. your comments:

  1. The difference is indeed small; however, schema 1 is consistently faster than schema 2, and this should imply something.
  2. The configuration is as follows (see also the note on instance count after this list):
[test]
server-0=localhost,0
db_user=scidb
db_passwd=scidb
install_root=/opt/scidb/14.3
pluginsdir=/opt/scidb/14.3/lib/scidb/plugins
logconf=/opt/scidb/14.3/share/scidb/log4cxx.properties
base-path=/home/scidb/data
tmp-path=/tmp
base-port=1239
interface=eth0
enable-catalog-upgrade=true
max-memory-limit=3072
  3. I see that a sequential layout is only one possible pattern, and that sequential storage on disk is not guaranteed.
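For reference, server-0=localhost,0 in the config above means (if I read the config format correctly) a coordinator with no additional worker instances, i.e. a single-instance setup, so these benchmarks involve no intra-query parallelism at all. A hypothetical multi-instance variant of that line, giving a coordinator plus three workers on the same host, would be

server-0=localhost,3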

From your description, I think you are saying my understanding is incorrect. According to your explanation, with the (X,Y,T) example I provided, if I want to store the grids at different time steps on different instances, I should actually use the schema letter:string[Y=0:1,2,0,X=0:1,2,0,T=0:1,2,0]. If that is really the case, then I can understand my 5D benchmark results.
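One thing I could do instead of guessing is to inspect the placement directly. A minimal check (hedged; the exact output columns vary across SciDB versions):

-- Lists every chunk together with the instance that owns it and its starting
-- coordinates, so the chunk-to-instance placement of the two schemas can be compared.
list('chunk map');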


#4

Re 1: The difference may be real, but I believe it is not due to some fundamental SciDB feature. So I would not rely on it as a model to predict any sort of behavior, especially when you transition from the single-instance configuration to a more complicated multi-instance, multi-host configuration.

Re your (X,Y,T) question: precisely.


#5

I reorganized the forecast dataset in SciDB with another two schemas:

Schema 3
GEFS_S3<TMP:float,TMAX:float,TMIN:float,RH:float,APCP:float,TCDC:float,UGRD:float,VGRD:float> [X_idx=0:359,360,0,Y_idx=0:180,181,0,F_idx=0:39,1,0,E_idx=0:19,20,0,M_idx=0:675,1,0]

Schema 4
GEFS_S4<TMP:float,TMAX:float,TMIN:float,RH:float,APCP:float,TCDC:float,UGRD:float,VGRD:float> [E_idx=0:19,20,0,X_idx=0:359,360,0,Y_idx=0:180,181,0,F_idx=0:39,1,0,M_idx=0:675,1,0]

Note that M_idx holds only the single value 675.

According to my earlier description of the data structure, this time the ensemble dimension, i.e. E_idx, is indeed the innermost dimension organizing the data for GEFS_S4. With the sequential pattern I mentioned, GEFS_S4 should therefore perform better when extracting a time series of total precipitation (APCP) at a spot location covering all ensemble values. So I ran the benchmark again:

consume(project(between(GEFS_S3,184,38,0,0,675,184,38,39,19,675),APCP))
consume(project(between(GEFS_S4,0,184,38,0,675,19,184,38,39,675),APCP))

However, my test results once again contradicted my expectation (in seconds):

Run | GEFS_S3 | GEFS_S4
1 | 2.83 | 2.87
2 | 0.64 | 0.74
3 | 0.54 | 0.58
4 | 0.54 | 0.58
5 | 0.64 | 0.66
6 | 0.53 | 0.57
7 | 0.53 | 0.57
8 | 0.53 | 0.57
9 | 0.55 | 0.57
10 | 0.53 | 0.58
11 | 0.57 | 0.57
12 | 0.57 | 0.58
13 | 0.53 | 0.59
14 | 0.54 | 0.57
15 | 0.53 | 0.58
16 | 0.53 | 0.57
17 | 0.54 | 0.59
18 | 0.54 | 0.58
19 | 0.53 | 0.58
20 | 0.54 | 0.58
They show that GEFS_S3 performs better. The specific storage information for the two arrays is as follows:

Array name | Dimension order | Chunk size | Chunk count | Average chunk storage size | Array storage size
GEFS_S3 | X, Y, Forecast, Ensemble, Modelrun | 360 x 181 x 40 x 1 x 1 | 40 | 6.2 MB | 247 MB
GEFS_S4 | Ensemble, X, Y, Forecast, Modelrun | 20 x 360 x 181 x 40 x 1 | 40 | 5.8 MB | 232 MB

Now I am starting to realize that your comments may be right. But I am really wondering: how is each chunk stored on disk, if not as a contiguous 1D array? Is it, for example, split in the middle and scattered across different sectors? Also, what I found from the experiments above is that the larger array of each pair performs better, i.e. GEFS_S1 and GEFS_S3, even though the arrays in each pair have the same logical chunk size.

For chunked storage, we know that chunk size has a great influence on query performance, but we also suspected that the structure inside a chunk might have an impact. However, from my tests above, i.e. changing the order of dimensions to rearrange the data on disk, the internal structure of a single chunk does not seem to matter much, leaving parallelization aside as you mentioned. So does the art of exploiting chunked storage still lie mainly in selecting the right chunk size?
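If chunk size really is the main lever, then repart seems like the natural tool for such experiments, since it changes only the chunking while keeping the dimension order fixed; a hedged sketch, where the target array name and the particular chunk intervals are just illustrative choices:

-- Re-chunk GEFS_S1: whole E and F ranges per chunk, spatial grid split in four.
-- store() creates the target array if it does not yet exist.
store(
  repart(GEFS_S1,
    <TMP:float,TMAX:float,TMIN:float,RH:float,APCP:float,TCDC:float,UGRD:float,VGRD:float>
    [M_idx=0:675,1,0, E_idx=0:19,20,0, F_idx=0:39,40,0, Y_idx=0:180,91,0, X_idx=0:359,180,0]),
  GEFS_S1_rechunked);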


#6

All the dimensions of the data