Importing 5D ensemble forecast data into SciDB


#1

I am trying to import a 5D ensemble forecast dataset, stored in netCDF, into a SciDB array. The array schema looks like this:

<v1: float, v2: float, v3: float>
[Modelrun=0:*,1,0, E_idx=0:19,1,0, F_idx=0:39,1,0, Y=0:180,181,0, X=0:359,360,0]
-- Modelrun: the time at which the forecast model is run, E_idx: ensemble index,
-- F_idx: forecast index, Y: latitude, X: longitude
When reading from netCDF, it is possible to read one spatial grid at a time: one variable, for one forecast, one ensemble and one model run.
The import process then amounts to reading all 2D spatial grids for all variables from netCDF and inserting them into the 5D cube in SciDB.
I will use the picture below to demonstrate the process:

[code]
    \  F -->                                \  X -->
     \  4D array                             \  2D spatial array of one variable
  E   \    00           01     02       Y     \   00   01   02   03   04
  |    +--------------+-----+-----+     |      +----+----+----+----+----+
  | 00 ||v1|,|v2|,|v3|| S2  | S3  |     |  00  | 19 | 14 | 08 | 07 | 27 |
  v    +--------------+-----+-----+     v      +----+----+----+----+----+
    01 |      S4      | S5  | S6  |        01  | 44 | 28 | 43 | 52 | 22 |
       +--------------+-----+-----+            +----+----+----+----+----+
    02 |      S7      | S8  | S9  |        02  | 09 | 12 | 11 | 48 | 31 |
       +--------------+-----+-----+            +----+----+----+----+----+

-- Left array: the 4D array storing the forecast weather data; the 3
--   variables are its attributes, and it has no Modelrun dimension.
-- Right array: a sample 2D spatial array holding the values of one
--   variable, which can represent v1, v2 or v3 in the left array.
-- F: forecast index, E: ensemble index, X: longitude, Y: latitude.
-- Every S* cell uses the structure |v1|,|v2|,|v3|.
[/code]
Please note that only a part of the dataset is shown in the picture above, i.e. with fewer dimension values.
First we load the 2D spatial grid of one variable into the right array; let's assume it corresponds to |v2| at (2,1), i.e. S6 in the left array.
It might seem strange that the first 2D spatial grid is |v2| in S6 instead of |v1| in S1. Although the data are all stored in netCDF, parallel reading and, for example, network transfer delays mean that some variable grids might not be readable at a given moment, so the order in which the spatial grids reach SciDB is uncertain. Suppose, then, that the next grid to import into SciDB is |v3| at S2. The whole problem becomes how to populate these out-of-order 2D spatial arrays into the 4D array
and then redimension the 4D array into the 5D one by adding one dimension.

My current solution is:

  1. When we get the |v2| at S6, we create two other empty arrays for v1 and v3. We then do join(join(|v1|,|v2|), |v3|) to create S6. S6 is
    an array with 2 dimensions, i.e. X and Y, and three attributes; at this point it only contains data for the second variable.
  2. With insert(redimension(apply(apply(S6, F_idx, 2), E_idx, 1), 4D), 4D), the spatial grid is inserted into the 4D array.
  3. With insert(redimension(apply(4D, Modelrun, xxxxx), 5D), 5D), where xxxxx refers to a certain number, the spatial grid of one variable is finally loaded into the 5D cube.
    Steps 2 and 3 can be combined into one, that is insert(redimension(apply(apply(apply(S6, F_idx, 2), E_idx, 1), Modelrun, xxxxx), 5D), 5D); see the sketch after this list.
  4. Repeat the same process for the other spatial grids until the whole 5D cube is populated.
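
For concreteness, here are steps 1-3 written out as a minimal AFL sketch. V1_EMPTY, V2_GRID, V3_EMPTY and MODEL_5D are hypothetical names (MODEL_5D stands for the 5D target the steps call "5D"), F_idx=2 and E_idx=1 are the example coordinates from the text, and Modelrun=0 stands in for the xxxxx above:

[code]
-- Step 1: wrap the one loaded grid (V2_GRID) with two pre-created, cell-free
-- arrays so that S6 carries all three attributes.
store ( join ( join ( V1_EMPTY, V2_GRID ), V3_EMPTY ), S6 );
--
-- Steps 2 and 3 combined: tag the grid with its target coordinates and slot
-- it into the 5D cube.
insert (
  redimension (
    apply ( S6, F_idx, int64(2), E_idx, int64(1), Modelrun, int64(0) ),
    MODEL_5D
  ),
  MODEL_5D
);
[/code]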

Two questions:

  1. I wonder if step 1 can be simplified, i.e. whether the spatial grid of v2 can be imported with the v1 and v3 values set as empty, creating S6 directly. The current implementation of SciDB only allows
    importing v2 into S6 with the other variables' values being null. However, with a later insert, null can overwrite non-null values, which is not the case for empty cells.
  2. Intensive use of apply and redimension is not going to be efficient; can anyone put forward other solutions for importing such a 5D dataset?

#2

Testing shows that step 1 does not work. When joining with empty arrays, the resultant array contains all empty cells (i.e. the originally non-empty attribute is overwritten to empty); a minimal sketch of this behavior follows. I checked the source code of “join”, and there seems to be a way to modify the operator so that it keeps the non-empty attribute. This means a plugin would need to be built.
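
As a quick sketch of the observed behavior (the array names are hypothetical, and a tiny 2x2 grid stands in for the 181x360 one):

[code]
-- V2_GRID is fully populated; V1_EMPTY is created but never written to,
-- so it contains no cells at all.
store ( build ( < v2 : float > [ Y=0:1,2,0, X=0:1,2,0 ], 1.0 ), V2_GRID );
CREATE ARRAY V1_EMPTY < v1 : float > [ Y=0:1,2,0, X=0:1,2,0 ];
--
-- join(...) only produces a cell where BOTH inputs have one, so the result
-- here contains no cells: the non-empty v2 values are effectively lost.
join ( V1_EMPTY, V2_GRID );
[/code]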

It is indeed possible to implement the attribute as a dimension and then build an additional array to store the attribute names. According to my understanding, at the low level, i.e. storage, there is not much difference between a 6D array with a single “val” attribute (the idea of the “unfold” operator in SciDB 14.7) and a 5D array with the 3 attributes v1, v2 and v3, since the chunk implementation is the same. However, at the abstract, or conceptual, level an attribute is not equal to a dimension. I am still wondering if there is a straightforward 5D solution.


#3

[quote=“rencailhc”]Testing shows that step 1 does not work. [...] I am still wondering if there is a straightforward 5D solution.[/quote]
Actually, I see in the user guide that join does not store any result. Have you used the store operator to persist the results?


#4

[quote=“fate_testarossa”]Actually, I see in the user guide that join does not store any result. Have you used the store operator to persist the results?[/quote]
Yes.


#5

Hmmmmm …

  1. So the ultimate target is an array that looks like this:
< v1: float, 
   v2: float, 
   v3: float >
[Modelrun=0:*,1,0, E_idx=0:19,1,0, F_idx=0:39,1,0,Y=0:180,181,0,X=0:359,360,0]
--  Modelrun: the time to run the forecast model, E_idx: ensemble index, F_idx: forecast index, Y: latitude, X:longitude

And you’re pulling the data from a set of NetCDF files, each of which contains the data for one [ Modelrun, E_idx, F_idx ] combination? So the per-combination data looks like this …

< v1: float, 
   v2: float, 
   v3: float >
[ Y=0:180,181,0,X=0:359,360,0 ]

But the feature of the NetCDF files is that you’re getting one variable, v1, v2 or v3, per file. And what complicates matters is that these files arrive in no particular order … so you will have (say) v2, then v1, then v3. So you will have arrays like …

V1_M#_E#_F#
<
   v1 : float
>
[ Y=0:180,181,0,X=0:359,360,0 ];

V2_M#_E#_F#
<
   v2 : float
>
[ Y=0:180,181,0,X=0:359,360,0 ];

V3_M#_E#_F#
<
   v3 : float
>
[ Y=0:180,181,0,X=0:359,360,0 ];

But you want to stitch the contents of the three NetCDF files into a single “slice” of the (ultimately) 5D array. And because the files arrive “out of order”, you can’t use a query form like join ( V1_M#_E#_F#, join ( V2_M#_E#_F#, V3_M#_E#_F# ) ).

OK. Why not create one 5-D array per variable, and insert(…) directly into there? Then once all of the files are loaded, use a join(…) to hook up the three 5-D arrays. Something like this …

--
--  First, let's create the 5D target. Note that these attributes are all 
-- declared NULL-able, but this can be lifted by using the substitute(...)
-- operator. The point here is to get the broad outline. 
--
SET LANG AFL;
remove ( MODEL_5D );
CREATE ARRAY MODEL_5D
<
  v1 : float NULL,
  v2 : float NULL,
  v3 : float NULL
>
[ Modelrun=0:*,1,0, E_idx=0:19,1,0, F_idx=0:39,1,0,
  Y=0:180,181,0, X=0:359,360,0];
--
--  Second, let's create a data set we'll use to test the load. 
--
SET LANG AFL;
SET NO FETCH;
save ( 
  project ( 
    apply (
      build ( < dummy : bool > [ R=0:64799,64800,0], true),
      X, R%360,
      Y, R/360,
      V1, float(random()%1000)/float(100.0)
    ),
    X, Y, V1 
  ),
  '/tmp/temp.bin', 0, '(int64,int64,float)' 
);
--
--  Now. Let's create the per-variable, 5D arrays ... 
SET LANG AFL;
remove ( LOAD_MODEL_V1_5D );
remove ( LOAD_MODEL_V2_5D );
remove ( LOAD_MODEL_V3_5D );
--
CREATE ARRAY LOAD_MODEL_V1_5D
<
  v1 : float NULL
>
[ Modelrun=0:*,1,0, E_idx=0:19,1,0, F_idx=0:39,1,0,
  Y=0:180,181,0, X=0:359,360,0];
--
CREATE ARRAY LOAD_MODEL_V2_5D 
<
  v2 : float NULL
>
[ Modelrun=0:*,1,0, E_idx=0:19,1,0, F_idx=0:39,1,0,
  Y=0:180,181,0, X=0:359,360,0];
--
CREATE ARRAY LOAD_MODEL_V3_5D 
<
  v3 : float NULL
>
[ Modelrun=0:*,1,0, E_idx=0:19,1,0, F_idx=0:39,1,0,
  Y=0:180,181,0, X=0:359,360,0];
--
--   Create the per-variable, 1D load arrays. 
SET LANG AFL;
remove ( Load_V1_1D );
remove ( Load_V2_1D );
remove ( Load_V3_1D );
--
CREATE ARRAY Load_V1_1D
< X : int64, Y : int64, v1 : float >
[ R=0:64799,64800,0];
CREATE ARRAY Load_V2_1D
< X : int64, Y : int64, v2 : float >
[ R=0:64799,64800,0];
CREATE ARRAY Load_V3_1D
< X : int64, Y : int64, v3 : float >
[ R=0:64799,64800,0];
--
--  Load the file's data into the 1D load array (in the case 
-- you're describing, you'd load each of these once from the 
-- external file). 
SET LANG AFL;
SET NO FETCH;
load ( 
  Load_V1_1D,
  '/tmp/temp.bin', 0, '( int64, int64, float )'
);
load (
  Load_V2_1D,
  '/tmp/temp.bin', 0, '( int64, int64, float )'
);
load (
  Load_V3_1D,
  '/tmp/temp.bin', 0, '( int64, int64, float )'
);
--
--   Having loaded the 1D array, we can then transform the 1D 
-- load data into the shape of the 5D target, but use one 
-- 5D target array per variable.
--
--   NOTE: I'm using the Load_V1_1D array twice here, but in 
--         practice you'd simply drop and re-create it once 
--         for each load file. Each of these insert(...) 
--         queries would slot the new [ X, Y ] model into the 
--         per-variable 5D array at the right location. 
--
--   NOTE: You will need to supply the correct values for 
--         the Modelrun, E_idx and F_idx with each external 
--         file. 
--
SET LANG AFL;
insert ( 
  redimension ( 
    apply ( 
      Load_V1_1D,
      Modelrun, int64(1),
      E_idx, int64(1),
      F_idx, int64(1)
    ),
    LOAD_MODEL_V1_5D
  ),
  LOAD_MODEL_V1_5D
);
--
insert (
  redimension (
    apply (
      Load_V1_1D,
      Modelrun, int64(1),
      E_idx, int64(1),
      F_idx, int64(2)
    ),
    LOAD_MODEL_V1_5D
  ),
  LOAD_MODEL_V1_5D
);
--
--  
insert (
  redimension (
    apply (
      Load_V2_1D,
      Modelrun, int64(1),
      E_idx, int64(1),
      F_idx, int64(1)
    ),
    LOAD_MODEL_V2_5D
  ),
  LOAD_MODEL_V2_5D
);
--
insert (
  redimension (
    apply (
      Load_V2_1D,
      Modelrun, int64(1),
      E_idx, int64(1),
      F_idx, int64(2)
    ),
    LOAD_MODEL_V2_5D
  ),
  LOAD_MODEL_V2_5D
);
--
--
insert (
  redimension (
    apply (
      Load_V3_1D,
      Modelrun, int64(1),
      E_idx, int64(1),
      F_idx, int64(1)
    ),
    LOAD_MODEL_V3_5D
  ),
  LOAD_MODEL_V3_5D
);
--
insert (
  redimension (
    apply (
      Load_V3_1D,
      Modelrun, int64(1),
      E_idx, int64(1),
      F_idx, int64(2)
    ),
    LOAD_MODEL_V3_5D
  ),
  LOAD_MODEL_V3_5D
);
--
--   Finally, having loaded the per-variable, 5D arrays, the final thing 
--  to do is to join(...) them, and then to insert(...) the result of 
--  the join into the 5D target. Of course, this will be done once at 
--  the end of a run of file loads, once you're sure all of the data in 
--  a "batch" has been loaded. 
--
SET FETCH;
insert ( 
  join ( 
    LOAD_MODEL_V1_5D,
    join ( 
      LOAD_MODEL_V2_5D,
      LOAD_MODEL_V3_5D
    )
  ),
  MODEL_5D
);
--

#6

Thanks for the answer. The problem, as you have noticed, is that we are not sure when each per-variable 5D array is completely filled. It might be the case that all the data have been populated into the per-variable 5D arrays except, for example, V2 in S5, and maybe one hour later, due to network delay, that grid arrives in SciDB. You cannot simply check whether all data are ready in the arrays, say, every 10 minutes and then manually do the join. I want to automate the import process, which means that when a grid is not ready, it should appear as empty in the final 5D array so that users can still query the partly filled final 5D array.

I am actually doing research concerned with SciDB. According to my current understanding, the multi-attribute 5D array <v1: float, v2: float, v3: float>[Modelrun=0:*,1,0, E_idx=0:19,1,0, F_idx=0:39,1,0, Y=0:180,181,0, X=0:359,360,0] stores each attribute in separate chunks. So the chunk storage should be the same as for the array <v: float>[Modelrun=0:*,1,0, Variable=0:2,1,0, E_idx=0:19,1,0, F_idx=0:39,1,0, Y=0:180,181,0, X=0:359,360,0], which treats Variable as a dimension (a sketch of this approach follows). It is possible to perform my task with the latter schema, so in principle it should work with the first schema as well; I have not found the answer. Besides, I personally feel that, for development, it might be helpful for SciDB to provide some low-level APIs to manipulate chunks directly.
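
For illustration, a minimal sketch of the Variable-as-dimension alternative, reusing the Load_V2_1D array from the script above (the MODEL_6D name and the concrete coordinate values are hypothetical):

[code]
CREATE ARRAY MODEL_6D
< v : float NULL >
[ Modelrun=0:*,1,0, Variable=0:2,1,0, E_idx=0:19,1,0, F_idx=0:39,1,0,
  Y=0:180,181,0, X=0:359,360,0 ];
--
-- Each grid can be slotted in as soon as it arrives, independently of the
-- other variables: here the v2 grid (Variable=1) for Modelrun=0, E_idx=1,
-- F_idx=2. No join against the other variables is needed.
insert (
  redimension (
    apply ( Load_V2_1D,
            v, v2,
            Modelrun, int64(0), Variable, int64(1),
            E_idx, int64(1), F_idx, int64(2) ),
    MODEL_6D
  ),
  MODEL_6D
);
[/code]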


#7

Hmmm …

First, I’m curious to know how you know a missing S# is ever going to get there. I mean, how will you distinguish an error condition … the S# never arrived … from the “S# is currently delayed” condition? Also, is the data even meaningful when you have (as you describe) V1 and V3 for S5, but not V2?

Your note did stimulate one thought, though. The last query in the script above? That will only push data into the MODEL_5D array when all of the variable data has arrived for a given [ Modelrun, E_idx, F_idx ]. If one of the variables is missing for an S#, then it doesn’t make it into MODEL_5D.

Of course, you will still need to figure out which S# blocks in the per-variable arrays are fully populated, and which are missing variables. The following query tells you which [ Modelrun, E_idx, F_idx ] combinations have all variables present . . .

aggregate (
  join (
    LOAD_MODEL_V1_5D,
    join (
      LOAD_MODEL_V2_5D,
      LOAD_MODEL_V3_5D
    )
  ),
  COUNT(*) AS CNT,
  Modelrun, E_idx, F_idx
);

The next step is to figure out which S# have “missing” variables, either because of some error condition, or else because they’re just not there yet. This means figuring out the total range of S# values that should be present. Now, E_idx and F_idx are constrained to the ranges E_idx=0:19 and F_idx=0:39, while Modelrun is unbounded. So it would be useful to compute the largest Modelrun value in the various LOAD_MODEL_Vn_5D arrays. The following query does that.

SET LANG AFL;
SET FETCH;
project (
  apply (
    join (
      aggregate (
        apply ( LOAD_MODEL_V1_5D, MR, Modelrun ),
        MAX ( MR ) AS V1_MAX_MR
      ),
      join (
        aggregate (
          apply ( LOAD_MODEL_V2_5D, MR, Modelrun ),
          MAX ( MR ) AS V2_MAX_MR
        ),
        aggregate (
          apply ( LOAD_MODEL_V3_5D, MR, Modelrun ),
          MAX ( MR ) AS V3_MAX_MR
        )
      )
    ),
    MAX_MR,
    iif (( V1_MAX_MR > V2_MAX_MR ),
      (iif (( V1_MAX_MR > V3_MAX_MR ), V1_MAX_MR ,
        (iif (( V2_MAX_MR > V3_MAX_MR ), V2_MAX_MR, V3_MAX_MR ))
      )),
      (iif (( V2_MAX_MR > V3_MAX_MR ), V2_MAX_MR, V3_MAX_MR ))
    )
  ),
  MAX_MR
);

As a side-note, you can simplify the hell out of this kind of query by using macros. Plonk the following into a file /tmp/macros.txt, and then load it with the load_module() operator.

get_largest_dim ( _ARRAY_, _DIM_NAME_ ) = 
aggregate (
  apply ( _ARRAY_, MX, _DIM_NAME_ ),
  MAX ( MX )
);

max ( _FIRST_, _SECOND_ ) = iif (( _FIRST_ > _SECOND_ ), _FIRST_, _SECOND_ );

get_max_modelrun() = 
project (
  apply (
    join (
      aggregate (
        apply ( LOAD_MODEL_V1_5D, MR, Modelrun ),
        MAX ( MR ) AS V1_MAX_MR
      ),
      join (
        aggregate (
          apply ( LOAD_MODEL_V2_5D, MR, Modelrun ),
          MAX ( MR ) AS V2_MAX_MR
        ),
        aggregate (
          apply ( LOAD_MODEL_V3_5D, MR, Modelrun ),
          MAX ( MR ) AS V3_MAX_MR
        )
      )
    ),
    MAX_MR,
    iif (( V1_MAX_MR > V2_MAX_MR ),
      (iif (( V1_MAX_MR > V3_MAX_MR ), V1_MAX_MR ,
        (iif (( V2_MAX_MR > V3_MAX_MR ), V2_MAX_MR, V3_MAX_MR ))
      )),
      (iif (( V2_MAX_MR > V3_MAX_MR ), V2_MAX_MR, V3_MAX_MR ))
    )
  ),
  MAX_MR
);

This is how that macro file would be loaded …

load_module ( '/tmp/macros.txt' );

… and now the long query above is reduced to …

project (
  apply (
    join (
      get_largest_dim ( LOAD_MODEL_V1_5D, Modelrun ) AS V1,
      join (
        get_largest_dim ( LOAD_MODEL_V2_5D, Modelrun ) AS V2,
        get_largest_dim ( LOAD_MODEL_V3_5D, Modelrun ) AS V3
      )
    ),
    MAX_MR,
    max ( V1.MX_max, max ( V2.MX_max, V3.MX_max ) )
  ),
  MAX_MR
);

…which can be re-used to compute the ranges for other dimensions as follows …

project (
  apply (
    join (
      get_largest_dim ( LOAD_MODEL_V1_5D, F_idx ) AS V1,
      join (
        get_largest_dim ( LOAD_MODEL_V2_5D, F_idx ) AS V2,
        get_largest_dim ( LOAD_MODEL_V3_5D, F_idx ) AS V3
      )
    ),
    MAX_MR,
    max ( V1.MX_max, max ( V2.MX_max, V3.MX_max ) )
  ),
  MAX_MR
);

So. Which S#s … that is, which combinations of [ Modelrun, E_idx, F_idx ] … have incomplete variable information? Well, first: what is the list of possible [ Modelrun, E_idx, F_idx ] combinations? The following query produces all of the possible combinations of [ Modelrun, E_idx, F_idx ] over the desired range (assuming the maximum Modelrun is 3, given the results of the query above).

project (
  apply ( 
    cross_join (
      build ( < dummy : bool > [ Modelrun=0:3,1,0 ], true ) AS M,
      cross_join (
        build ( < dummy : bool > [ E_idx=0:19,1,0 ], true ) AS E,
        build ( < dummy : bool > [ F_idx=0:39,1,0 ], true ) AS F
      )
    ),
    CNT, uint64(0)
  ),
  CNT
);

Now. What you want to do is to “mask” the S# segments that contain values … which will be produced by the aggregate query introduced above … over a “background” of zeroes. Any S# with a non-zero count contains all of the variables for the entire segment. Any S# where C = 0 means that segment is “missing”. The following query … which really returns just the same result as the aggregate … produces the list of the [ Modelrun, E_idx, F_idx ] combinations in the range where all of the variables are present. To reverse the logic … to make the query return all of the [ Modelrun, E_idx, F_idx ] combinations in the range where at least one of the variables is missing … invert the C > 0 predicate to C = 0.

filter (
  merge (
    substitute (
      aggregate (
        join (
          LOAD_MODEL_V1_5D,
          join (
            LOAD_MODEL_V2_5D,
            LOAD_MODEL_V3_5D
          )
        ),
        COUNT(*) AS C,
        Modelrun, E_idx, F_idx
      ),
      build ( < C : uint64 > [ N=0:0,1,0 ], '[(0)]', true )
    ),
    project (
      apply (
        cross_join (
          build ( < dummy : bool > [ Modelrun=0:3,1,0 ], true ) AS M,
          cross_join (
            build ( < dummy : bool > [ E_idx=0:19,1,0 ], true ) AS E,
            build ( < dummy : bool > [ F_idx=0:39,1,0 ], true ) AS F
          )
        ),
        C, uint64(0)
      ),
      C
    )
  ),
  C > 0
);

Hope this helps. The important thought is that there are enormous benefits to “thinking arrays”, rather than “thinking low-level data”. I’ve managed all of the above in about 30 minutes of futzing. Even if we were to hand out lower-level (i.e. imperative/procedural) interfaces to the chunks and their data, developers are going to find it very hard to write C/C++ or Python code as quickly as they can write the corresponding AFL/AQL.

With SciDB, we’re trying to elevate the degree of abstraction at which you reason about your arrays and the data that’s in them.


#8

[quote=“plumber”]Hmmm … [...] With SciDB, we’re trying to elevate the degree of abstraction at which you reason about your arrays and the data that’s in them.[/quote]

I have a question: if the data in A and B is very large, how does join(A, B) work?


#9

The semantics of join(…) are the same as the semantics of a “natural join” in SQL.

Suppose you are doing join ( A, B ), and the arrays A and B look like this:

  A   I -->                  B  I -->
    \                         \
 J   \  00   01   02       J   \  00   01   02
 |    +--------------+     |    +----+----+----+
 | 00 | A1 |    |    |     | 00 | B1 | B2 |    |
 v    +--------------+     v    +----+----+----+
   01 | A4 | A5 |    |       01 |    | B5 | B6 |
      +--------------+          +----+----+----+
   02 |    | A8 | A9 |       02 |    |    | B9 |
      +--------------+          +----+----+----+

The basic rules are:

  1. If one of the two cells is empty, then the join at that cell “fails”, and the result is another “empty” cell.

  2. If both of the input cells contain values, then the result of the join is the concatenation of the two input cells’ attributes.

So, join ( A, B ) looks like this:

[code]
  join(A,B)  I -->
    \
 J   \    00        01        02
 |    +---------+---------+---------+
 | 00 | (A1,B1) |         |         |
 v    +---------+---------+---------+
   01 |         | (A5,B5) |         |
      +---------+---------+---------+
   02 |         |         | (A9,B9) |
      +---------+---------+---------+
[/code]

It doesn’t matter how many dimensions are involved. Them’s the rules!

So, if there’s data in some cells in ‘A’ for which there is no corresponding cell in ‘B’, the join won’t return anything for the “disjoint” cells. So if we have:

  A   I -->                  B  I -->
    \                         \
 J   \  00   01   02       J   \  00   01   02   03 
 |    +--------------+     |    +----+----+----+----+
 | 00 | A1 |    |    |     | 00 | B1 | B2 |    | Bl |
 v    +--------------+     v    +----+----+----+----+
   01 | A4 | A5 |    |       01 |    | B5 | B6 | Bm |
      +--------------+          +----+----+----+----+
   02 |    | A8 | A9 |       02 |    |    | B9 | Bn |
      +--------------+          +----+----+----+----+

join ( A, B ) will produce the same result as the join ( A, B ) shown above. The cells with “Bl”, “Bm” and “Bn” have no matching cell in “A”, so they drop out of the join.
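
To see these rules in action, here is a tiny hedged sketch (the arrays A and B are hypothetical 1D stand-ins, and filter is used only to punch empty cells into otherwise dense arrays):

[code]
-- A has cells at i = 0, 1, 3; B has cells at i = 0, 2, 3.
store ( filter ( build ( < a : int64 > [ i=0:3,4,0 ], i ), i <> 2 ), A );
store ( filter ( build ( < b : int64 > [ i=0:3,4,0 ], i * 10 ), i <> 1 ), B );
--
-- join(A, B) returns cells only at i = 0 and i = 3, where both inputs have
-- a value; each result cell carries the concatenated attributes (a, b).
join ( A, B );
[/code]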