Attribute values storage requirements


#1

Hello experts,

I am evaluating SciDB for use in a spectrum database project where we will be storing spectrograms (power spectrum maps) gathered from sensors in SciDB, and querying them later in order to determine spectrum usage patterns.

The attributes for each reading include things like:

power value (dBm) : double
sensor type : string
location : string

My index values are time and frequency.

I.e., for a given time and frequency, I want to record the power and the sensor attributes of the sensor that made the measurement, and later retrieve it for analysis.
Note that sensor type and location are repeated - i.e., many readings will have the same string, except when the sensor is changed. It would be extremely convenient if I could store all of these attributes together, so I have a single sparse array where each entry contains the attributes above.

My question is: if I repeat these strings, will SciDB take care of managing storage so that I don’t blow up memory usage (i.e. will SciDB detect that I am storing the same string again and increment a reference count internally, etc.)? If so, I can rely on SciDB to do storage management for me; if not, I’ll have to manage repeated values myself.

Thank you in advance for your replies.

Ranga


#2

So … it’s really, really, really helpful to take a tilt at problems like this using code. I’ll make a burst of silly assumptions along the way. Correct 'em, please.

Let’s have a go at this with a schema …

CREATE ARRAY Spectrograms 
<
    power_value : double,
    sensor_type  : string,
    location : string
...
>
[ Time=0:*,1000,0, Frequency=0:*,1000,0 ]

Some questions / observations:

  1. What are the native types for Time and Frequency? Asking because, in dealing with the world as arrays, the closer you can get to the underlying 64-bit integers, the better. For things like Time (for example) it’s common practice to write a function that extracts seconds (or fractions thereof) and converts them into seconds from some epoch, and a second function that takes the 64-bit integer and converts it back into the original time. What about “Location”? Is it an identifying string? Or is there (for example) an [ X, Y ] component? (Is this data geographic? Or are you just looking at sensor series in machinery, for example?)

  2. I’m a bit puzzled by something. The dimensions of an array are similar to the key of a SQL table: each combination of values for the dimensions identifies at most one cell in the array. Yet … what happens here when two different sensor types, or two different locations, produce the same frequency at the same time? You also say that “many readings will have the same string except when the sensor is changed.”
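
The time-to-integer mapping suggested in point 1 can be sketched in a few lines. (Python here just to illustrate the idea; the 2013-01-01 epoch is an assumption you’d replace with your own.)

```python
from datetime import datetime, timezone

# Assumed epoch; pick whatever start date suits your data.
EPOCH = datetime(2013, 1, 1, tzinfo=timezone.utc)

def time_to_index(ts: datetime) -> int:
    """Map a timestamp onto a 64-bit integer dimension index:
    whole seconds elapsed since the chosen epoch."""
    return int((ts - EPOCH).total_seconds())

def index_to_time(idx: int) -> datetime:
    """Invert the mapping to recover the original timestamp."""
    return datetime.fromtimestamp(EPOCH.timestamp() + idx, tz=timezone.utc)

t = datetime(2013, 6, 13, 11, 55, 15, tzinfo=timezone.utc)
i = time_to_index(t)
assert index_to_time(i) == t   # the mapping round-trips
```

The same pattern works for any dimension whose native type isn’t an integer: one function into the integer space, one function back out.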

Can I suggest that the independent variables you’re dealing with here are “Location”, and “Time”, with the others being dependent variables? That is …

CREATE ARRAY Spectrograms 
<
    power_value : double,
    sensor_type  : string,
    frequency     : double
...
>
[ Location=0:*,1000,0, Time=0:*,1000,0 ]
  3. Can I also assume that the Sensor Type does not vary (much) by Location? Except when you change the sensor? (See below for more commentary.)

  4. A central consideration in designing this kind of schema is the nature of the workload you’re going to be applying to it. What questions do you want to ask of this data? Variance of power and frequency by time? Distribution of energy ( power * frequency ) over the range of sensor locations? Asking because, once you’ve gotten the schema down, the next thing you’re going to ask is about queries.

OK - on to some hopefully helpful explanations.

  1. SciDB adopts a columnar storage system. Our first act is to break the list of an array’s attributes up into separate data storage, one per attribute. The purpose of this strategy is two-fold. First, it means that for very “wide” data sets (with lots and lots of columns/attributes), queries which address only a small sub-set of the attributes are executed without reading in all of the superfluous attributes’ data. Second, it means that we can better exploit compression and other space-reduction techniques on per-attribute data (less entropy = better compression).

  2. So … take your problematic “sensor_type” string. Given that the sensor_type is pretty much always determined by the Location, and changes only occasionally over Time, SciDB is able to use run-length encoding to reduce the space used to hold this string to next to nothing.

So the answer to your question “if I repeat these strings, will scidb take care of managing storage so that I don’t blow up memory usage” is “yes”.
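
To see why repeated strings cost next to nothing under run-length encoding, here’s a toy illustration in Python. (This shows the principle only, not SciDB’s actual on-disk format.)

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a sequence into [(value, run_length), ...] pairs."""
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]

# A million readings from a sensor whose type changes exactly once:
readings = ["SENSOR_TYPE_1"] * 600_000 + ["SENSOR_TYPE_2"] * 400_000
encoded = rle_encode(readings)
print(encoded)  # [('SENSOR_TYPE_1', 600000), ('SENSOR_TYPE_2', 400000)]
```

A million repeated strings collapse into two (value, count) pairs; the storage cost tracks the number of *changes*, not the number of readings.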

  3. There’s a major question we need to answer about the chunk sizes (per-dimension chunk lengths) to use. I am going out on a limb here to suggest that your data might be very “sparse” (lots of points of time at which no data was submitted for a particular sensor location, or alternatively, lots of “spaces” in the frequency / time space that are empty).

Search these forums for “chunk size” related questions. The high-order bit is that you want the combination of your per-dimension chunk length values to be such that you get about 1,000,000 cells per chunk.
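
As a back-of-envelope sketch (my own helper, not a SciDB tool), you can scale each dimension’s extent so the expected number of *occupied* cells per chunk lands near the 1,000,000 target, assuming roughly uniform density:

```python
def suggest_chunk_lengths(ranges, n_cells, target=1_000_000):
    """Given per-dimension (lo, hi) ranges and the total number of non-empty
    cells, shrink each dimension's extent uniformly so that the expected
    number of occupied cells per chunk is about `target`."""
    extents = [hi - lo + 1 for lo, hi in ranges]
    logical = 1
    for e in extents:
        logical *= e
    density = n_cells / logical            # occupied fraction of logical space
    # Want: density * prod(chunk_lens) ~= target, shrinking every dim equally.
    shrink = (target / (density * logical)) ** (1 / len(extents))
    return [max(1, int(e * shrink)) for e in extents]

# The Time / Frequency ranges reported by the script below:
print(suggest_chunk_lengths([(25, 31_535_931), (0, 9_990)], n_cells=1_000_000))
```

With 1,000,000 cells in total and a 1,000,000-cell target, the helper hands back (roughly) the full extents, i.e. one chunk, which matches the reasoning in the script below.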

  4. Finally, like I said above, I’m not sure your schema design quite reflects what I suspect your data looks like, and how it’s organized. I’ve put together a little script that is my attempt to model your data and illustrate how it might be organized.
#!/bin/sh
#
#   File:   Spectro/Script.sh
#
#  About: 
#
#------------------------------------------------------------------------------
#
#  Useful shell script functions. 
#
exec_afl_query () {
    echo "Query: ${1}"
    /usr/bin/time -f "Elapsed Time: %E" iquery -o dcsv ${2} -aq "${1};"
};
#
exec_aql_query () {
    echo "Query: ${1}"
    /usr/bin/time -f "Elapsed Time: %E" iquery -o dcsv ${2} -q "${1};"
};
#
#------------------------------------------------------------------------------
#
#   Hygiene ... 
CMD_HYGIENE="remove ( SpectroData )"
exec_afl_query "${CMD_HYGIENE}"
CMD_HYGIENE="remove ( SpectroDataRawLoad )"
exec_afl_query "${CMD_HYGIENE}"
#
#
#   Create a 1D array to hold the raw load data ... 
CMD_CREATE_RAW_DATA_ARRAY="
CREATE ARRAY SpectroDataRawLoad 
<
    power_value  : double,
    sensor_type  : string,
    location     : string,
    time         : datetime,
    frequency    : double 
>
[ RowNum ]
"
exec_afl_query "${CMD_CREATE_RAW_DATA_ARRAY}"
#
#   Populate the 1D array to hold the data ... 
CMD_POPULATE_RAW_DATA_ARRAY="
SELECT double(random()%1000)/100.0 AS power_value,
       'SENSOR_TYPE_' + string(random()%10) AS sensor_type,
       'LOCATION_' + string(random()%100) + '_' + string(random()%100) 
                  AS location,
       datetime('2013-1-1 00:00:00') + random()%(365*24*3600) as time,
       double(random()%1000)/1000.0 AS frequency
  INTO SpectroDataRawLoad
  FROM build ( < dummy : int64 > [ RowNum=0:999999,1000000,0], RowNum )
"
exec_aql_query "${CMD_POPULATE_RAW_DATA_ARRAY}" -n 
#
#  What does this data look like? 
CMD_SELECT_FIRST_FEW_RAW_DATA_VALUES="
SELECT * FROM SpectroDataRawLoad WHERE RowNum < 10
"
exec_aql_query "${CMD_SELECT_FIRST_FEW_RAW_DATA_VALUES}"
#
#  NOTE: This is an example only. Because we're using the random() function
#        in the data generation, what you see will differ from this. 
#
#  {RowNum} power_value,sensor_type,location,time,frequency
#  {0} 9.76,'SENSOR_TYPE_0','LOCATION_46_51','2013-10-07 11:46:05',0.67
#  {1} 4.49,'SENSOR_TYPE_7','LOCATION_80_82','2013-06-13 11:55:15',0.883
#  {2} 8.36,'SENSOR_TYPE_6','LOCATION_75_66','2013-01-26 06:03:03',0.704
#  {3} 2.19,'SENSOR_TYPE_0','LOCATION_48_60','2013-09-12 11:28:23',0.792
#  {4} 5.5,'SENSOR_TYPE_5','LOCATION_61_75','2013-09-11 20:28:05',0.498
#  {5} 4.88,'SENSOR_TYPE_3','LOCATION_60_86','2013-01-19 15:11:04',0.525
#  {6} 0.56,'SENSOR_TYPE_3','LOCATION_84_89','2013-11-06 03:52:28',0.899
#  {7} 4.72,'SENSOR_TYPE_4','LOCATION_25_47','2013-10-25 23:05:03',0.26
#  {8} 6.03,'SENSOR_TYPE_2','LOCATION_1_51','2013-11-10 17:18:54',0.056
#  {9} 8.96,'SENSOR_TYPE_7','LOCATION_74_57','2013-07-05 19:40:12',0.459
#
#------------------------------------------------------------------------------
#
#  What does the target look like?
CMD_CREATE_TWO_D_DATA_ARRAY="
CREATE ARRAY SpectroData 
<
  power_value  : double NULL,
  sensor_type  : string NULL,
  location     : string NULL,
  cnt_readings : uint64 NULL
>
???
"
#
#  Several things to note: 
#
#    1. The use of NULL constraints on the attributes is the result of the 
#       fact that we're obliged to use aggregates to compute the 
#       transform here, because there might be more than one "collision"
#       in the Time / Frequency space. For example, what happens if you get,
#       from two different sensor locations, and at the same time, two 
#       different "frequency" and "power_value" readings? 
#
#    2. Setting the per-dimension chunk length for the Time and Frequency 
#       dimensions is an important performance tuning decision. The following 
#       query gives you the kind of information you're going to need. 
#
CMD_CHECK_RANGE_FOR_DIMENSION_CHUNK_LENGTH="
SELECT MIN ( Time )        AS Min_Time,
       MAX ( Time )        AS Max_Time,
       MIN ( Frequency )   AS Min_Freq,
       MAX ( Frequency )   AS Max_Freq,
       COUNT ( * )         AS Cnt
  FROM ( SELECT power_value, sensor_type, location,
                time - datetime('2013-1-1 00:00:00') AS Time,
                floor ( frequency * 10000 ) AS Frequency
           FROM SpectroDataRawLoad
       ) 
"
exec_aql_query "${CMD_CHECK_RANGE_FOR_DIMENSION_CHUNK_LENGTH}"
#
#  {i} Min_Time,Max_Time,Min_Freq,Max_Freq,Cnt
#  {0} 25,31535931,0,9990,1000000
#
#   So: 
#
#   1. We want about 1,000,000 cells per chunk. In this case, there are 
#      1,000,000 cells in total, so we want one chunk.
#
#   2. The range of values for the two dimensions are:
#
#     Time       - 25 -> 31,535,931 
#     Frequency  - 0  -> 9,990
#
#   NOTE: Again, because these values are generated using random(), YMMV. 
#         But the principle's the same. 
#
#   3. So we want each dimension to vary from 0:*, with the Time 
#      chunk length being 40,000,000, and the Frequency chunk length 
#      of 10,000. 
#
#   NOTE: There's help coming here: a script we're preparing to estimate 
#         per-dimension chunk lengths from the kind of data you get in
#         the 1D SpectroDataRawLoad array. It uses the actual data it 
#         finds. 
CMD_CREATE_TWO_D_DATA_ARRAY="
CREATE ARRAY SpectroData 
<
  power_value  : double NULL,
  sensor_type  : string NULL,
  location     : string NULL,
  cnt_readings : uint64 NULL
>
[ Frequency=0:*,10000,0, Time=0:*,40000000,0 ]
"
exec_aql_query "${CMD_CREATE_TWO_D_DATA_ARRAY}"
#
#
#  How to get the 2D load data into the 2D target? 
CMD_REDIMENSION_1D_RAW_TO_2D_TARGET="
SELECT AVG ( power_value ) AS power_value,
       MIN ( sensor_type ) AS sensor_type,
       MIN ( location )    AS location,
       COUNT ( *)          AS cnt_readings
  INTO SpectroData
  FROM ( SELECT power_value, sensor_type, location,
                time - datetime('2013-1-1 00:00:00') AS Time,
                floor ( frequency * 10000 ) AS Frequency
           FROM SpectroDataRawLoad
       ) 
REDIMENSION BY [ Frequency=0:*,10000,0, Time=0:*,40000000,0 ]
"
exec_aql_query "${CMD_REDIMENSION_1D_RAW_TO_2D_TARGET}" -n 
#
#   Remember I mentioned the problem of the Frequency/Time duplicates? The 
#  way to figure out if you're getting any, given the schema above, is 
#  as follows: 
#
CMD_CHECK_PROBLEM_WITH_DUPLICATE_FREQUENCY_TIME="
SELECT * FROM SpectroData WHERE cnt_readings > 1
"
exec_aql_query "${CMD_CHECK_PROBLEM_WITH_DUPLICATE_FREQUENCY_TIME}"
#
#  {Frequency,Time} power_value,sensor_type,location,cnt_readings
#  {1650,20491545} 8.6,'SENSOR_TYPE_1','LOCATION_26_31',2
#  {3459,23716984} 6.02,'SENSOR_TYPE_2','LOCATION_12_96',2
#  {3850,1981583} 7.67,'SENSOR_TYPE_0','LOCATION_31_36',2
#  {3980,22737730} 5.64,'SENSOR_TYPE_2','LOCATION_15_25',2
#  {4430,26361448} 5.875,'SENSOR_TYPE_3','LOCATION_51_74',2
#  {4740,26677599} 9.435,'SENSOR_TYPE_1','LOCATION_21_7',2
#  {5050,20075712} 1.8,'SENSOR_TYPE_0','LOCATION_27_11',2
#  {5860,2873874} 8.38,'SENSOR_TYPE_8','LOCATION_13_9',2
#  {7630,9263130} 5.145,'SENSOR_TYPE_5','LOCATION_56_45',2
#  {7690,4652545} 7.125,'SENSOR_TYPE_1','LOCATION_16_26',2
#  {8450,1467332} 4.395,'SENSOR_TYPE_4','LOCATION_32_80',2
#  {9430,11291805} 4.495,'SENSOR_TYPE_8','LOCATION_25_28',2
# 
#-----------------------------------------------------------------------------
#
#  What else might the target look like? 
#
#   So, instead of Frequency / Time as the dimensions, you might be better 
#  off (mentioned above) modelling your data in such a manner that your 
#  independent variables are dimensions, with the dependent variables making 
#  up the attributes. Now ... assuming that the Location and the Time are 
#  the two things you don't "measure" (but are, instead, exogenous to the 
#  experiment ... just things you have to put up with) they're what you 
#  want to use as dimensions. Another way to think about it? There's no 
#  way that you can have two different readings of frequency, power or 
#  sensor_type from the same location, at the same time. 
#
#   Now, the problem is that "Location" is a string. (You can still manage 
#  Time as we've introduced it above.) How to convert a location 'string' 
#  to a dimension index?  Have a look at the following forum topic for a 
#  more detailed explanation of how this works in SciDB. I'll just give 
#  you the script. 
#
#     http://www.scidb.org/forum/viewtopic.php?f=18&t=1280&hilit=uniq
#
CMD_HYGIENE="DROP ARRAY Location_Index"
exec_aql_query "${CMD_HYGIENE}"
#
CMD_CREATE_LOCATION_INDEX="
CREATE ARRAY Location_Index 
<
  location : string 
>
[ I=0:*,10000,0 ]
"
exec_aql_query "${CMD_CREATE_LOCATION_INDEX}"
#
CMD_POPULATE_LOCATION_DIMENSION_INDEX="
SELECT * 
  INTO Location_Index
  FROM uniq ( sort ( project ( SpectroDataRawLoad, location )))
"
exec_aql_query "${CMD_POPULATE_LOCATION_DIMENSION_INDEX}" -n 
#
#  The alternative, 2D array, the one that organizes the data by 
#  Location / Time, looks like this... 
#
CMD_HYGIENE="DROP ARRAY AlternSpectroData;"
exec_aql_query "${CMD_HYGIENE}"
#
CMD_CREATE_ALTERNATIVE_TWO_D_DATA_ARRAY="
CREATE ARRAY AlternSpectroData 
<
  power_value  : double NULL,
  frequency    : double NULL,
  sensor_type  : string NULL,
  cnt_readings : uint64 NULL
>
[ Loc_Ndx=0:*,10000,0, Time=0:*,40000000,0 ]
"
exec_aql_query "${CMD_CREATE_ALTERNATIVE_TWO_D_DATA_ARRAY}"
#
#   Now ... for each Location, you can get a corresponding "index" 
#  integer, and use this. 
#
CMD_POPULATE_ALTERNATIVE_TWO_D_DATA_ARRAY="
SELECT AVG ( power_value ) AS power_value,
       AVG ( frequency )   AS frequency,
       MIN ( sensor_type ) AS sensor_type,
       COUNT ( *)          AS cnt_readings
  INTO AlternSpectroData
  FROM index_lookup ( 
         apply ( 
            SpectroDataRawLoad,
            Time, 
            time - datetime('2013-1-1 00:00:00')
         )  AS Data, 
         Location_Index,
         Data.location,
         Loc_Ndx
       )
REDIMENSION BY [ Loc_Ndx=0:*,10000,0, Time=0:*,40000000,0 ]
"
exec_aql_query "${CMD_POPULATE_ALTERNATIVE_TWO_D_DATA_ARRAY}" -n 
#
#  What does this look like? 
#
CMD_GIVE_ME_A_SMALL_SAMPLE_OF_THE_DATA="
SELECT * 
  FROM bernoulli ( AlternSpectroData, 0.00001 )
"
exec_aql_query "${CMD_GIVE_ME_A_SMALL_SAMPLE_OF_THE_DATA}"
#
#  {Loc_Ndx,Time} power_value,frequency,sensor_type,cnt_readings
#  {249,20922644} 3.94,0.964,'SENSOR_TYPE_9',1
#  {498,27272601} 0.91,0.362,'SENSOR_TYPE_2',1
#  {2897,17648362} 5.57,0.879,'SENSOR_TYPE_6',1
#  {7105,22664225} 8.47,0.705,'SENSOR_TYPE_9',1
#  {9295,19109432} 7.26,0.154,'SENSOR_TYPE_8',1
#
#   So ... at this point, you have two alternative designs for the array 
#  you'll point your analytic acumen at. 
#
#------------------------------------------------------------------------------
#
#  At this point, you've got the data organized in a way that lets you 
# answer questions like "what's the average power / energy / frequency by 
# location?" or "for time T1 to T2, what was the largest power value?", 
# or "regrid the data to compute the average power per day over all 
# locations for heat map visualization?" or even more complicated things 
# like, "cluster locations using a principal components analysis?"
# 
#  One other question ... how do you get the original data back? The following 
# query illustrates a couple of things about 'thinking in arrays'. The 
# query is a 'join' query that combines data from two input arrays. These
# input arrays are computed by examining data from the indexed array 
# (in this case, AlternSpectroData) and either manipulating it, or else 
# combining it with data from the Index created earlier. 
#
CMD_GET_ORIGINAL_REFERENCE_VALUES_BACK_FROM_INDEX_DATA="
SELECT * 
  FROM ( 
        SELECT datetime('2013-1-1 00:00:00') + Time AS Original_DateTime,
               power_value, frequency, sensor_type
          FROM AlternSpectroData
       ),
       ( 
         SELECT location
           FROM cross_join ( AlternSpectroData,
                             Location_Index,
                             Loc_Ndx,
                             I 
                           )
       )
"
exec_aql_query "${CMD_GET_ORIGINAL_REFERENCE_VALUES_BACK_FROM_INDEX_DATA}" -n 
#
#  
#-----------------------------------------------------------------------------

#3

Thank you very much for such a detailed reply! You got me thinking again. However, I am confused about how to choose between attributes and dimensions. Let’s say I pick location (i.e. latitude and longitude) to be two dimensions. I can convert all dimensions to integers by simple multiplication. So then I would have an array defined as follows.

reading<sensor_type:string, calibration_data:string>[latitude=-90:90,100,0,longitude=-90:90,100,0,power=-100:20,100,0,frequency=3000:4000,100,0, time=0:*,100,0]

Clearly, here power will be repeated. I could have many readings with the same power value. But does that matter? If I wanted to find all readings at a given power, I can select on just the power index and I will get all such readings. So my question really is, in designing the array, what should I use as an attribute and what should I use as the dimension and does that choice affect performance?

BTW, just to introduce myself: I work at NIST in Gaithersburg, MD. We are working on a project that involves spectrum sensing at different locations to characterize spectrum usage. I am still at the stage of evaluating which database to use, but SciDB looks pretty compelling.

Thanks

Ranga


#4

I would think of things in terms of independent and dependent variables.

Your independent variables here are [latitude=-90:90,100,0,longitude=-90:90,100,0, time=0:*,100,0 ]. What data was “gathered” at each of these [ lat, long, time ] positions? < sensor_type:string, calibration_data:string, power : double, frequency : double >?

You certainly can store the data (initially) with power and frequency as dimensions, but what you’d produce would be a very, very sparse array. And one that’s probably highly skewed to boot. SciDB is OK with that. But there might be no conceptual advantage to doing things that way.
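
To put a number on “very, very sparse”: with hypothetical extents for the five dimensions in your schema (made-up magnitudes, purely for scale), the occupied fraction of the logical space is vanishingly small.

```python
# Hypothetical per-dimension extents, loosely mirroring the schema above:
# 181 latitude steps, 181 longitude steps, 121 power steps,
# 1001 frequency steps, and one year of seconds for time.
dims = {"latitude": 181, "longitude": 181, "power": 121,
        "frequency": 1001, "time": 365 * 24 * 3600}

logical_cells = 1
for extent in dims.values():
    logical_cells *= extent

n_readings = 1_000_000  # assume a million actual measurements
print(f"logical cells: {logical_cells:.3e}")
print(f"occupancy:     {n_readings / logical_cells:.3e}")
```

Even a million readings occupy well under a billionth of the logical space once measured quantities like power and frequency become dimensions, which is why keeping them as attributes is usually the more natural design.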

Back to my suggested array … this time slightly modified to include your three lat/long/time dimensions …

CREATE ARRAY AlternSpectroData   
<
  power_value  : double NULL,
  frequency    : double NULL,
  sensor_type  : string NULL,
  cnt_readings : uint64 NULL
>
[ Lat=0:*,LAT_CHUNK_LEN,0, Long=0:*,LONG_CHUNK_LEN,0, Time=0:*,TIME_CHUNK_LEN,0 ]

NOTE: I’ve deliberately kept the chunk lengths out of this declaration. Their precise values depend on the data. Also, you might consider using some overlapping chunks to permit the efficient calculation of things like moving windows.

In SciDB, it’s perfectly possible to “filter” the contents of the array by looking at the value in each cell.

SELECT COUNT( * ) AS EVENT_CNT 
  FROM AlternSpectroData
 WHERE power_value BETWEEN 0 AND 19

aggregate ( 
  filter ( 
    AlternSpectroData,
    power_value BETWEEN 0 AND 19
  ),
  count (*) AS EVENT_CNT
);

When you do this kind of query though, you don’t get any “indexing” on the filter on power_value. But that might not be a problem. The search is conducted entirely in parallel, and so long as your filtering ranges aren’t that large, this is actually a better execution strategy than using any kind of indexing. And you can turn the data from this form into (say) frequency v. power using queries. The following query takes the data from (my) AlternSpectroData array and computes the number of “events” that occur in each (range) of power_value and frequency values.

SELECT COUNT(*) AS CNT
  FROM ( SELECT floor ( power_value * 10 ) AS POWER,
                floor ( frequency * 10 )   AS FREQ
           FROM AlternSpectroData
       ) 
REDIMENSION BY [ POWER=0:*,1000,0, FREQ=0:*,1000,0 ];

You can use queries to re-organize this data any way you want, of course. But if you’re getting started, I would just say to use the “independent” v. “dependent” variable distinction.

Paul