How to find populated index value in a sparse array?


#1

Hello,

I am trying to evaluate the use of scidb in our project. Data will be stored in a sparse 2 dimensional array. I want to slice the array on one of the indexes - lets say the row index and do some statistics but the index value that I select may not have any values stored in that row. So I want to find the nearest row above the row index of the query row that has been populated with actual data. Is there a way of doing this in scidb?

Thank you in advance for your reply.

Ranga.


#2

So, let me ask a question that might sound slightly odd … Why do you care what the index of the “the nearest row above the row index of the query row that has been populated with actual data” is?

I mean, at one extreme, there might not be any semantically interesting relationship between “next” and “previous” along an array’s dimension. The index might be something like stock symbols, or channel IDs, or some other kind of arbitrary allocated “label”. On the other hand though, if the index’s labels are significant … if the “next” and the “previous” rows also have some notion of “distance” … then you might want the labels because they convey something semantically useful.

Looking back at your previous question, one of the arrays I proposed in my answer looked like this:

  CREATE ARRAY AlternSpectroData 
  <
    power_value  : double NULL,
    frequency    : double NULL,
    sensor_type  : string NULL,
    cnt_readings : uint64 NULL
  >
  [ Loc_Ndx=0:*,10000,0, Time=0:*,40000000,0 ]

And included a query that got you, for each “Loc_Ndx” value, the corresponding “Location” string:

SELECT location
  FROM cross_join ( 
         AlternSpectroData,
         Location_Index,
         Loc_Ndx,
         I 
       );

Now, you say you want to “slice the array on one of the indexes … and do some statistics”. Then “find the nearest row above the row index of the query row that has been populated with actual data”. Well, rather than slice out the rows one at a time and do some statistics on each row, why not instead group the cells by row and do the statistics on every group (that is, those groups / rows with data ) and produce as a result a “vector” (or another “array”, if your input has more than 2 dimensions) showing, for each row, the result? Then the “next” and “previous” rows are just the “next” and “previous” in the vector.

The following query takes the contents of the AlternSpectroData array, gets the location string for each cell (that’s the cross_join(…)), and then calculates a statistics (mean power_value) as well as a count for each “group” of data that correspond to a particular location.

SELECT AVG ( power_value ) AS AVG_POWER,
       COUNT ( * )         AS NUM_EVENTS,
       MIN ( location )    AS LOCATION
  FROM cross_join ( 
         AlternSpectroData,
         Location_Index,
         Loc_Ndx,
         I 
       )
GROUP BY AlternSpectroData.Loc_Ndx;

--
-- Or, if you prefer the AFL ...
--

aggregate ( 
  cross_join ( 
         AlternSpectroData,
         Location_Index,
         Loc_Ndx,
         I 
       ),
   AVG ( power_value ) AS AVG_POWER,
   COUNT ( * ) AS NUM_EVENTS,
   MIN ( location ) AS LOCATION,
   AlternSpectroData.Loc_Ndx
);

So. Why use this GROUP BY / aggregate? In many cases, you don’t want the “next” or “previous” row in the overall array (which you can get from the Location_Index). You want the “next” or the “previous” row in some region of the array, where the region is determined by some filtering. The following query illustrates what this looks like. It applies two “filters” … one on the time range in the AlternSpectroData array (that’s the between(…) ) and the other on the values in the “Location_Index”.

SELECT AVG ( power_value ) AS AVG_POWER,
       COUNT ( * )         AS NUM_EVENTS,
       MIN ( location )    AS LOCATION
  FROM cross_join ( 
         between ( 
           AlternSpectroData,
           null, 10000000, null, 20000000
         ),
         filter ( 
           Location_Index,
           location BETWEEN 'LOCATION_9_10' AND 'LOCATION_9_20'
         ),
         Loc_Ndx,
         I 
       )
GROUP BY AlternSpectroData.Loc_Ndx;

Hope this helps. The key thing? It can be useful, when you’re dealing with SciDB, to “think arrays”, rather than thinking in terms of for each row do X, or for each column do Y. The key is the way SciDB presents a query language that lets you do a single operation over a large number of rows or columns at the same time / in the same query.

Paul