Attribute Locality


#1

Hello,

I’m working on redesigning the schema for MODIS satellite data. The most common tasks for this data are per-pixel operations on reflectance values. Ideally, querying a single pixel (or a group of nearby pixels) for reflectances should hit a single physical machine. Each pixel has the following information:

Latitude
Longitude
Time
Reflectance per band (17 bands)
Sensor Zenith
Sensor Azimuth
Satellite Range

We’ve been discussing three different ways for loading data:

Reflectance bands as separate arrays (current schema):
–Lat, Long, Time as dimensions
–Single Reflectance per cell as attribute
–Zenith, Azimuth, and Range as attributes
–17 total arrays

Reflectance Bands as dimension:
–Lat, Long, Time, and Band number as dimensions
–Single Reflectance per cell as attribute
–Zenith, Azimuth, and Range as attributes
–1 Array

Reflectance Bands as attributes:
–Lat, Long, Time as dimensions
–17 reflectance attributes per cell
–Zenith, Azimuth, and Range as attributes
–1 Array

Case three is ideal; however, each attribute is chunked separately. What I’d like to know is whether these separate chunks reside on the same instance for a given location specified by the array dimensions. If not, can they be forced to? My hope is that within an array, attributes are chunked separately, but that those chunks are stored alongside other chunks referring to the same region of ‘array dimensional space’.

For case two above, we can force all reflectances into the same chunk by making the chunk size of the ‘Band’ dimension equal to the number of bands, at the cost of increased complexity in loading and querying the data.
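To make this concrete, here’s a small Python sketch of the chunk arithmetic (the dimension order and chunk intervals are illustrative, not our actual schema):

```python
# A SciDB chunk is identified by its "anchor": the cell coordinate
# rounded down to a multiple of the chunk interval in each dimension.

def chunk_anchor(coords, intervals):
    """Map a cell coordinate to the anchor coordinate of its chunk."""
    return tuple((c // n) * n for c, n in zip(coords, intervals))

# Case 2: dimensions [lat, long, time, band], with the band interval set
# to the number of bands (17). All 17 bands of a pixel share one chunk.
intervals = (1000, 1000, 100, 17)
pixel = (12345, 6789, 42)
anchors = {chunk_anchor(pixel + (band,), intervals) for band in range(17)}
print(anchors)   # a single anchor: every band lands in the same chunk
```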

Case one requires joining the arrays for any query, and seems to guarantee that a query to a single location in array dimension coordinates will involve multiple machines-- presumably the 17 different arrays are independently distributed across the cluster.

Again, our common use case is to grab either all reflectances, or a large subset, and ‘do something’ with those values-- so we want the reflectances for a given pixel (unique lat/long/time) to be ‘close’ and ‘cheap’ to retrieve.

Thank you in advance,
Shane


#2

Shane -

I’ve attached a script that I hope answers your question about attribute co-locality. The short answer is “Yes”. All of the attribute values for a given cell in an array are always stored on the same instance. Also … if two arrays have identical chunking in identical dimensions, then the attribute values for given logical cells in those two arrays will also be on the same instance.

Let me ask another question … how do you propose to index the (lat / long) in your space? With MODIS data, you care about the poles and the meridian, no? Are you planning to index the < lat, long > as an < X, Y, Z > triple, i.e., something like the approach in http://stackoverflow.com/questions/10473852/convert-latitude-and-longitude-to-point-in-3d-space ?
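For reference, the mapping that StackOverflow answer describes is just the standard spherical-to-Cartesian conversion; a minimal Python sketch (unit sphere, no ellipsoid correction):

```python
import math

def lat_long_to_xyz(lat_deg, lon_deg, r=1.0):
    """Project (latitude, longitude) onto a sphere of radius r."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (r * math.cos(lat) * math.cos(lon),
            r * math.cos(lat) * math.sin(lon),
            r * math.sin(lat))

print(lat_long_to_xyz(90.0, 0.0))  # north pole: approximately (0, 0, 1)
print(lat_long_to_xyz(0.0, 0.0))   # equator at meridian: approximately (1, 0, 0)
```

With this encoding, points near the poles or on either side of the ±180° meridian that are physically close are also close in < X, Y, Z > space, which matters for your "nearby pixels on the same machine" goal.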

#!/bin/sh
#
#   File:   Chunks.sh 
#
#  About: Demonstrates that SciDB stores all attribute chunks for a given
#         logical chunk on the same instance.
#
#------------------------------------------------------------------------------
#
#  Useful shell script functions. 
#
exec_afl_query () {
    echo "Query: ${1}"
    /usr/bin/time -f "Elapsed Time: %E" iquery -o dcsv ${2} -aq "${1}"
};
#
#------------------------------------------------------------------------------
#
CMD_HYGIENE="remove ( Foo )"
exec_afl_query "${CMD_HYGIENE}"
#
CMD_CREATE_ARRAY="
CREATE ARRAY Foo 
<
    attr1 : double,
    attr2 : string
>
[ I=0:9999,1000,0, J=0:9999,1000,0 ]
"
exec_afl_query "${CMD_CREATE_ARRAY};"
#
CMD_POPULATE_ARRAY="
store ( 
  apply ( 
    build ( 
      < attr1 : double > [ I=0:9999,1000,0, J=0:9999,1000,0 ],
      double(random()%100000)/100.0
    ),
    attr2, 'A-' + string(I) + '-' + string(J)
  ),
  Foo
)
"
exec_afl_query "${CMD_POPULATE_ARRAY};" -n 
#
#  The point here is that this array has two attributes, and 100 "logical"
# chunks, each 1/10th x 1/10th of the array's logical space. 
# 
#   In this example, I am going to use four instances. The details about 
#  port# and instance_path don't really concern us here. Only the fact that 
#  there are 4 instance_ids. 
#
exec_afl_query "list('instances');"
#
#  Query: list('instances');
#  {No} name,port,instance_id,online_since,instance_path
#  {0} 'localhost',1239,0,'2013-11-05 22:47:56','/home/plumber/Devel/Data/000/0'
#  {1} 'localhost',1240,1,'2013-11-05 22:47:56','/home/plumber/Devel/Data/000/1'
#  {2} 'localhost',1241,2,'2013-11-05 22:47:56','/home/plumber/Devel/Data/000/2'
#  {3} 'localhost',1242,3,'2013-11-05 22:47:56','/home/plumber/Devel/Data/000/3'
#
#   Now, I've created a single array in this installation, and called it 
#  "Foo". 
exec_afl_query "list('arrays');"
#  {No} name,id,schema,availability
#  {0} 'Foo',1,'Foo<attr1:double,attr2:string> [I=0:9999,1000,0,J=0:9999,1000,0]',true
#
#   Because SciDB is a multi-version system, when you list all of the arrays 
#  in the database, you can also get all of the versions for each array. 
#
exec_afl_query "list('arrays', true);"
#
#  {No} name,id,schema,availability
#  {0} 'Foo',1,'Foo<attr1:double,attr2:string> [I=0:9999,1000,0,J=0:9999,1000,0]',true
#  {1} 'Foo@1',2,'Foo@1<attr1:double,attr2:string> [I=0:9999,1000,0,J=0:9999,1000,0]',true
#
#   In addition to the meta-data about arrays (and attributes, instances,
#  operators, functions, types, etc) you can get a lot of physical details 
#  about what's going on in SciDB by looking at the list of chunks (the 
#  chunk map). 
#
exec_afl_query "list('chunk map');"
#
exec_afl_query "project ( list('chunk map'), instn );"
#
#   This returns a rather long list of data (and just a heads up ... we    
#  consider this an "internal" interface and reserve the right to change it 
#  whenever we feel the need, so please don't write your application code to 
#  depend on it always being there in its current form!) and we only really 
#  care about a few of its attributes. 
#
#  {inst,n}    -- inst => instanceID, n => entry number     
#  instn       -- instn => instanceID (repeat of the value in the dims). 
#  uaid        -- uaid  => unique array id 
#  attid       -- attid => attribute id 
#  coord       -- coord => logical coordinates of the chunk. 
#  nelem       -- nelem => number of elements in the chunk
#
#
CMD_LIST_CHUNKS_FOR_NAMED_ARRAY="
project ( 
  filter ( 
    cross ( 
      list('chunk map') AS CHUNKS,
      filter ( 
        list('arrays'),
        name = 'Foo'
      ) AS ARRAYS
    ),
    CHUNKS.uaid = ARRAYS.id
  ),
  ARRAYS.name, CHUNKS.instn, CHUNKS.uaid, CHUNKS.attid, 
  CHUNKS.coord, CHUNKS.nelem
)
"
exec_afl_query "${CMD_LIST_CHUNKS_FOR_NAMED_ARRAY};"
#
#  Now, the coords are just a string. So we can sort by the coords.
# 
CMD_LIST_CHUNKS_FOR_NAMED_ARRAY="
sort ( 
  project ( 
    filter (
      cross (
        list('chunk map') AS CHUNKS,
        filter (
          list('arrays'),
          name = 'Foo'
        ) AS ARRAYS
      ),
      CHUNKS.uaid = ARRAYS.id
    ),
    ARRAYS.name, CHUNKS.instn, CHUNKS.attid, CHUNKS.coord
  ),
  CHUNKS.coord
)
" 
exec_afl_query "${CMD_LIST_CHUNKS_FOR_NAMED_ARRAY};"
#
#   I'll clip out the top few lines of this result to explain what's going 
#  on ... 
#
# {n} name,instn,attid,coord
# {0} 'Foo',0,2,'{0, 0}'
# {1} 'Foo',0,0,'{0, 0}'
# {2} 'Foo',0,1,'{0, 0}'
# {3} 'Foo',1,2,'{0, 1000}'
# {4} 'Foo',1,0,'{0, 1000}'
# {5} 'Foo',1,1,'{0, 1000}'
# {6} 'Foo',2,2,'{0, 2000}'
# {7} 'Foo',2,1,'{0, 2000}'
# {8} 'Foo',2,0,'{0, 2000}'
# {9} 'Foo',3,2,'{0, 3000}'
# ...
#
#   This array actually has three attributes: the two which are named attr1 and
#  attr2, and a third, "internal" attribute we use to track information about
#  the cells in the array. (Specifically, this internal attribute is a bitmask
#  that records whether or not a cell in the array is 'empty'. This 
#  attribute is highly compressed.) 
#
#   For each chunk, we record the coordinate at one corner. So the three 
#  chunks for the three attributes--attid = { 0, 1, 2 }--all share the same
#  anchor coordinate.
#  The first logical chunk starts at {0, 0}, and the second (sorted simply 
#  by lexical order of the strings) starts at {0, 1000}. Of course, there's 
#  no "natural" order to chunks. You can impose any order that makes sense. 
#
#   Next, have a look at the 'instn' column. That's the instance on which 
#  each chunk resides. You can see, from this little query, that all three 
#  'logical' chunks with their anchor at {0,0} are on instn = 0, and all 
#  three of the 'logical' chunks anchored at {0, 1000} are on instance 1. 
#
#   Hope this answers your question. 
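The placement pattern in that output comes from SciDB assigning each logical chunk to an instance based on its anchor coordinate. The real distribution function is internal, but its key property can be sketched with a toy hash in Python (`instance_for_chunk` is a made-up stand-in, not SciDB's actual function; the instance count and anchors mirror the example above):

```python
# Placement depends on the chunk's anchor coordinate alone, never on the
# attribute id -- so all attribute chunks of one logical chunk land on
# the same instance. This toy hash illustrates only that property.

def instance_for_chunk(anchor, num_instances):
    return hash(anchor) % num_instances

anchors = [(0, j * 1000) for j in range(4)]
for attid in range(3):                      # attr1, attr2, empty bitmask
    placement = [instance_for_chunk(a, 4) for a in anchors]
    print(attid, placement)                 # the same row for every attid
```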

#3

Good morning:

I’ve been using SciDB for a couple of months with MOD09Q1. I’m only storing the reflectance values (sur_refl_b01, sur_refl_b02, sur_refl_qc_250m), first as your case 1 (single reflectance per cell as attribute, 3 arrays) and later as case 3 (3 attributes per cell, 1 array; this case is faster for data loading). After loading a certain number of images for a single tile (approximately 25), I’m getting a chunk-size error message:

Error id: scidb::SCIDB_SE_STORAGE::SCIDB_LE_CHUNK_SIZE_TOO_LARGE
Error description: Storage error. Chunk size 530318496 is larger than segment size 268435456.
Failed query id: 1101715217718

Did you have the same problem? Did you solve it?

Thanks,

Alber


#4

Hi Alber,

The MODIS data is a little tricky in that it’s sparse, but thankfully it’s mostly regular. You should reduce the chunk sizes along your dimensions. You had a chunk with 530 MB in it, which is a little steep. I would aim for about 10x less total area. So if you have [x=0:*,100000,0, y=0:*,100000,0] I’d cut each side by a factor of about 3, to something like [x=0:*,30000,0, y=0:*,30000,0]… Does this make sense?
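A quick back-of-the-envelope in Python, using the numbers from your error message and assuming the chunk size scales roughly with chunk area (the 100000 and 30000 chunk sides are the illustrative values from above, not your actual schema):

```python
# Sanity-check the proposed fix against the reported numbers.
SEGMENT_SIZE = 268435456        # 256 MiB segment limit, from the message
observed     = 530318496        # offending chunk size, from the message
shrink       = 100000 / 30000   # old chunk side / new chunk side

# Cutting each side by ~3 cuts the area (and, for roughly uniform data,
# the chunk size) by about an order of magnitude.
estimated = observed / shrink ** 2
print(round(estimated / 2 ** 20), "MiB")   # comfortably under the limit
assert estimated < SEGMENT_SIZE
```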

This isn’t affected by the number of attributes (unless you are using long string attributes somewhere).