What is this chunk map thingy?


#1

So …

With the release of 14.3, we've substantially changed the way SciDB handles data on disk. We've shifted from a single "data" file to something more robust and easier to work with. But in doing so, we've made only fairly minor changes to the way we handle chunk meta-data. So this seems like as good a time as any to put something out there that describes the chunk meta-data we make available to users, and provides some illustrations of how to use it.

The following script:

  1. Introduces everyone to the list('chunk map') facility and describes the various items of meta-data it contains.

  2. Provides some queries that let you align the list of arrays from list('arrays') with the contents of the chunk map, so you can get a handle on how an individual array's chunks are sized: number of elements, physical data size, and so on.

  3. Provides some queries to answer common questions about how SciDB is managing chunks.

#!/bin/sh
#
#   File:   Chunk_Map/Examples.sh 
#
#  About: 
#
#    The purpose of this script is to illustrate the use of some of the 
#   system catalog features of SciDB. Along the way, we will see how some 
#   of the product's design principles surface in the way we lay out 
#   arrays, attributes, chunks and instances. 
#
#------------------------------------------------------------------------------
#
#  Useful shell script functions. 
#
exec_afl_query () {
    echo "Query: ${1}"
    /usr/bin/time -f "Elapsed Time: %E" iquery -o dcsv ${2} -aq "${1};"
};
#
exec_aql_query () {
    echo "Query: ${1}"
    /usr/bin/time -f "Elapsed Time: %E" iquery -o dcsv ${2} -q "${1};"
};
#
#------------------------------------------------------------------------------
#
#   Although this script is intended to give everyone a tutorial about 
#  how the list('chunk map') works, it includes some partial query results. 
#  To understand what these results mean, it's necessary first to grok the 
#  setup / configuration of the SciDB cluster I'm running these queries 
#  on. The relevant portions of the config file I'm using look like this: 
# 
# [local4]
# server-0=localhost,3
# db_user=mydb
# db_passwd=mydb
# install_root=/home/plumber/Devel/trunk/stage/install
# pluginsdir=/home/plumber/Devel/trunk/stage/install/lib/scidb/plugins
# logconf=/home/plumber/Devel/trunk/stage/install/share/scidb/log4cxx.properties
# base-path=/home/plumber/Devel/Data
# tmp-path=/tmp
# base-port=1239
#  
#   In other words, this is a 4-instance SciDB cluster. And it's all running 
#  on one box. Note that the redundancy is set to default: k = 1. 
#
#------------------------------------------------------------------------------
#
#   1. Create and Populate a Couple of Arrays
#      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
#   First, let's create a couple of test arrays that we can use as examples to 
#  show how the list('chunk map') works. These arrays are tiny, and the way 
#  they're chunked is very sub-optimal from a performance point of view. 
#
#  NOTE: This first array has 16 'logical' chunks (4 x 4). For each logical
#        'chunk' in the array, there will be 3 data chunks, one per 
#        attribute, plus an additional chunk to hold the logical chunk's 
#        isEmpty bitmask (used to ensure we can accommodate both dense 
#        and sparse arrays). 
#        
CMD_HYGIENE="remove ( Test_One )"
exec_afl_query "${CMD_HYGIENE}"
#
CMD_CREATE_ARRAY="
CREATE ARRAY Test_One 
< 
   attr1 : int64 NULL,
   attr2 : double NULL,
   attr3 : string NULL
>
[ I=0:99, 25, 0, J=0:99, 25, 0]
"
exec_afl_query "${CMD_CREATE_ARRAY}"
#
CMD_POPULATE_ARRAY="
INSERT INTO Test_One
SELECT MIN ( a1 ) AS attr1,
       MIN ( a2 ) AS attr2,
       MIN ( a3 ) AS attr3
  FROM ( 
    SELECT ( random()%1000 ) AS a1,
           ( double(random()%1000)/100.0 ) AS a2,
           ( 'XXX-' + string(random()%10000) + '-YYY' ) AS a3,
           ( RowNum * 1021 )%100 AS I,
           ( RowNum * 1031 )%100 AS J
      FROM build ( < num : int32 > [ RowNum = 0:999,1000,0 ], RowNum ) 
   )
   REDIMENSION BY [ I=0:99, 25, 0, J=0:99, 25, 0 ]
"
exec_aql_query "${CMD_POPULATE_ARRAY}" -n 
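#
#   As a quick sanity check (a sketch; it reuses the same join between 
#  list('arrays') and list('chunk map') that the queries later in this 
#  script rely on), we can count the physical chunks the load just created. 
#  With 16 logical chunks and 4 per-attribute chunks each (3 attributes 
#  plus the isEmpty bitmask), we would expect 64 chunks for Test_One. 
#
CMD_COUNT_CHUNKS_FOR_TEST_ONE="
SELECT COUNT(*)
  FROM list('arrays') AS A CROSS JOIN list('chunk map') AS C
 WHERE A.id = C.uaid
   AND A.name = 'Test_One'
"
exec_aql_query "${CMD_COUNT_CHUNKS_FOR_TEST_ONE}"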
#
#   NOTE: This second array, Test_Two, has the same size and shape 
#         as Test_One, but is distinguished from it by the fact that
#         the array's chunks each include a degree of overlap. 
#
CMD_HYGIENE="remove ( Test_Two )"
exec_afl_query "${CMD_HYGIENE}"
#
CMD_CREATE_ARRAY="
CREATE ARRAY Test_Two 
< 
   attr1 : int64 NULL,
   attr2 : double NULL,
   attr3 : string NULL
>
[ I=0:99, 25, 5, J=0:99, 25, 5]
"
exec_afl_query "${CMD_CREATE_ARRAY}"
#
# Populate the array. . . 
#
CMD_POPULATE_ARRAY="
INSERT INTO Test_Two
SELECT MIN ( a1 ) AS attr1,
       MIN ( a2 ) AS attr2,
       MIN ( a3 ) AS attr3
  FROM ( 
    SELECT ( random()%1000 ) AS a1,
           ( double(random()%1000)/100.0 ) AS a2,
           ( 'XXX-' + string(random()%10000) + '-YYY' ) AS a3,
           ( RowNum * 1021 )%100 AS I,
           ( RowNum * 1031 )%100 AS J
      FROM build ( < num : int32 > [ RowNum = 0:999,1000,0 ], RowNum ) 
   )
   REDIMENSION BY [ I=0:99, 25, 5, J=0:99, 25, 5 ]
"
exec_aql_query "${CMD_POPULATE_ARRAY}" -n 
#
#------------------------------------------------------------------------------
#
#   First, some queries to get a snapshot of the configuration of the system. 
#
#   Q1: What are the arrays in the system? 
#
CMD_LIST_ARRAYS="list('arrays')"
exec_afl_query "${CMD_LIST_ARRAYS}"
#
#  Query: list('arrays')
#  {No} name,id,schema,availability
#  {0} 'Test_One',1,'Test_One<attr1:int64 NULL DEFAULT null,attr2:double NULL DEFAULT null,attr3:string NULL DEFAULT null> [I=0:99,25,0,J=0:99,25,0]',true
#  {1} 'Test_Two',3,'Test_Two<attr1:int64 NULL DEFAULT null,attr2:double NULL DEFAULT null,attr3:string NULL DEFAULT null> [I=0:99,25,5,J=0:99,25,5]',true
#
#------------------------------------------------------------------------------
#
#   Q2: What are the (physical) instances in the SciDB cluster? 
#
CMD_LIST_INSTANCES="list('instances')"
exec_afl_query "${CMD_LIST_INSTANCES}"
#
#  {No} name,port,instance_id,online_since,instance_path
#  {0} 'localhost',1239,0,'2014-03-18 17:25:30','/home/plumber/Devel/Data/000/0'
#  {1} 'localhost',1240,1,'2014-03-18 17:25:30','/home/plumber/Devel/Data/000/1'
#  {2} 'localhost',1241,2,'2014-03-18 17:25:30','/home/plumber/Devel/Data/000/2'
#  {3} 'localhost',1242,3,'2014-03-18 17:25:30','/home/plumber/Devel/Data/000/3'
#
#------------------------------------------------------------------------------
#
#  Q3: What is the shape of the list('chunk map') result? 
#
CMD_SHAPE_LIST_CHUNK_MAP="show ( 'list(\'chunk map\')', 'afl')"
exec_afl_query "${CMD_SHAPE_LIST_CHUNK_MAP}"
#
# {i} schema
# {0} 'list<svrsn:uint32,instn:uint32,dseg:uint8,dhdrp:uint64,
#           doffs:uint64,uaid:uint64,arrid:uint64,attid:uint64,
#           coord:string,comp:int8,flags:uint8,nelem:uint32,
#           csize:uint32,usize:uint32,asize:uint32,addrs:uint64,
#           clnof:uint64,clons:string,next:uint64,prev:uint64,
#           data:uint64,accnt:int32,nwrit:int32,tstmp:uint64,
#           raw:bool,waitn:bool,lpos:string,fposo:string,
#           lposo:string,strge:uint64,loadr:uint64> 
#       [inst=0:3,1,0,n=0:*,1000000,0]'
#
#
#  What are the semantics of each of those attributes? 
#
#  svrsn    :   uint32     // STORAGE_VERSION
#                          //  As we develop SciDB, we might have to modify 
#                          // some of the details of our storage management 
#                          // implementation. But we don't want these changes
#                          // to mean users must re-load their data. So we're
#                          // tracking, for every chunk, what version of the 
#                          // storage manager (an internal module) was used 
#                          // in its creation and therefore how to 'read' it.
#  instn    :   uint32     // INSTANCE_ID
#                          //  The instance number within the SciDB cluster 
#                          // on which the chunk is located. Note that this
#                          // is also carried in the 'inst' dimension. 
#  
#  Pre-14.3 {{{ 
#  
#  dseg     :   uint8      // DISK_SEGMENT_NO
#                          //  To avoid disk fragmentation, we allocate space 
#                          // for array data in 'segments' that are typically
#                          // much larger than a single chunk. This 
#                          // value identifies the segment in which the
#                          // chunk is located. 
#  dhdrp    :  uint64      // DISK_HEADER_POS
#                          //  Position of chunk's header within the 
#                          // storage.header file. Internal use. 
#  doffs    :  uint64      // DISK_OFFSET
#                          //  Position of the start of the chunk's data 
#                          // within the segment. Internal use. 
#                          // 
#                          //  NOTE: The combination of INSTANCE_ID, 
#                          // DISK_SEGMENT_NO, DISK_HEADER_POS and 
#                          // DISK_OFFSET give us all the information we need 
#                          // to know about the PHYSICAL location of a chunk 
#                          // and its meta-data.
#  
#  }}}
#  
#  14.3 and after {{{
#  dguid    :  uint64      // DATASTORE_GUID
#                          //  Identifier of the data store within which the
#                          // chunk is stored. 
#  dhdrp    :  uint64      // DATAHEADER_POS
#                          //  Location of the chunk header within the 
#                          // storage.header file. 
#  doffs    :  uint64      // DATA_OFFSET 
#                          //  Location (offset) of the chunk data within the 
#                          // datastore.
#  
#                          //  NOTE: The combination of INSTANCE_ID -> 
#                          // DATASTORE_GUID -> DATA_OFFSET gives you the 
#                          // physical location of the chunk, and the 
#                          // INSTANCE_ID -> DATAHEADER_POS gives you the 
#                          // physical location of the chunk's meta-data. 
#  
#  }}}
#  
#  uaid     :  uint64      // U_ARRAY_ID
#                          //  Every array that is created with a CREATE ARRAY
#                          // statement is given a unique identifier. This 
#                          // entry in the 'chunk map' identifies the logical
#                          // array to which this chunk's data belongs. 
#  arrid    :  uint64      // V_ARRAY_ID
#                          //  Whenever an array's data is modified, SciDB 
#                          // creates a new 'version' of the array to support
#                          // both data provenance and MVCC transactions. 
#                          // This entry in the chunk map gives you the *last* 
#                          // version in this chunk. 
#                          // NOTE: See below for NEXT and PREV. 
#  attid    :  uint64      // ATTRIBUTE_ID
#                          //  The attribute ID within the array identified 
#                          // with the 'uaid' and 'arrid'. The last attid value
#                          // is reserved for chunks that contain a bitmask. 
#                          // Other values--starting at 0--index the array's 
#                          // attributes. 
#  coord    :  string      // COORDINATES
#                          //  A string representation of the coordinates of 
#                          // the cell in the array that's the chunk's origin.
#  comp     :  int8        // COMPRESSION
#                          //  If the array's attribute is compressed, then 
#                          // the chunk is compressed. This field is reserved 
#                          // to record the type of compression used. 
#  flags    :  uint8       // FLAGS 
#                          //  A bitmask of chunk properties. For internal 
#                          // use only. 
#  nelem    :  uint32      // NUM_ELEMENTS
#                          //  The number of (non-empty) elements in the chunk.
#  csize    :  uint32      // COMPRESSED_SIZE
#                          //  The size, in bytes, of the chunk once 
#                          // compressed. That is, the number of bytes of data
#                          // stored on the disk. 
#  usize    :  uint32      // UNCOMPRESSED_SIZE
#                          //  The size, in bytes, of the chunk once un-
#                          // compressed. That is, the number of bytes the 
#                          // chunk's data will take up in memory. 
#  asize    :  uint32      // ALLOCATED_SIZE
#                          //  The amount of space allocated to the chunk. This
#                          // figure includes additional space allocated 
#                          // to hold backwards looking deltas to be applied
#                          // to the current snapshot of the chunk's data 
#                          // to obtain prior versions. 
#  addrs    :  uint64      // ADDRESS
#                          //  Location of chunk header. Internal to SciDB. 
#                          // 
#  clnof    :  uint64      // CLONE_OF  These two are reserved for future 
#  clons    :  string      // CLONES    use. 
#                          // 
#  next     :  uint64      // NEXT
#                          //  Address of "Next" chunk in LRU sequence. 
#                          // Internal to SciDB. 
#  prev     :  uint64      // PREV 
#                          //  Address of "Previous" chunk in LRU sequence. 
#                          // Internal to SciDB. 
#  data     :  uint64      // DATA
#                          //  Address of "this" chunk in the LRU sequence.
#                          // Internal to SciDB. 
#                          // 
#                          // NOTE: If the chunk is not in cache, then the 
#                          //       values of DATA, PREV and NEXT are all 
#                          //       set to 0. 
#  accnt    :  int32       // ACCESS_CNT 
#                          //  Number of current chunk accessors. 
#  nwrit    :  int32       // N_WRITERS 
#                          //  Reserved for future use. 
#  tstmp    :  uint64      // TIMESTAMP
#                          //  An incremental 'clock' value reflecting the 
#                          // order in which the chunks were last 
#                          // touched. 
#  raw      :  bool        // RAW 
#  waitn    :  bool        // WAITING
#                          //  'true' if some thread is waiting for the chunk 
#                          // to be loaded from the disk. 
#  lpos     :  string      // LAST_POS
#                          //  A string representation of the coordinates of
#                          // the cell in the array that's the chunk's 
#                          // last position. See also: coord. The chunk's 
#                          // cells are the ones in a rectangle implied by 
#                          // the diagonal vector from coord to lpos. 
#  fposo    :  string      // FIRST_POS_OVERLAP
#                          //  If the array's chunking has been organized to 
#                          // use overlaps, then the data in each chunk 
#                          // extends beyond just the 'inner' coord and lpos.
#                          // This string contains the first position in the 
#                          // chunk including any overlapping data. 
#  lposo    :  string      // LAST_POS_OVERLAP
#                          // This string contains the last position in the 
#                          // chunk including any overlapping data. See also
#                          // fposo. 
#  strge    :  uint64      // STORAGE 
#                          //  Handle to the storage manager that's managing
#                          // the chunk. Internal use only. 
#  loadr    :  uint64      // LOADER
#                          //  Handle to the thread being used to load the 
#                          // chunk. (?)
#                          // 
#
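#   A quick way to eyeball a few of these fields directly is to project 
#  them out of the chunk map (a sketch using the AFL project(...) operator; 
#  the attribute names are taken from the show() output above): 
#
CMD_PEEK_AT_CHUNK_MAP="
project ( list('chunk map'), uaid, attid, coord, nelem, usize, asize )
"
exec_afl_query "${CMD_PEEK_AT_CHUNK_MAP}"
#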
#------------------------------------------------------------------------------
#
#   Getting Information about a Named Array from the Chunk Map
#   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
#  Because the chunk map, like all of the other outputs of the list(...)
# operator, is just an array, we can use the other features of the SciDB 
# query language to examine them, or combine them in interesting ways. For
# example, we know the array names (and array IDs) from list('arrays'). Let's
# use this information to give the name for the array associated with each 
# chunk. 
#
#  Q4: What chunks exist in SciDB for each of the arrays in 
#      the system?
#
CMD_WHAT_CHUNKS_EXIST_FOR_EACH_ARRAY="
SELECT A.name,
       C.attid,
       C.coord
  FROM list('arrays') AS A CROSS JOIN list('chunk map') AS C
 WHERE A.id = C.uaid
"
exec_aql_query "${CMD_WHAT_CHUNKS_EXIST_FOR_EACH_ARRAY}"
#
#  No - Array Number. 
#  inst - Instance ID
#  n - Chunk ID ( on that Instance ) 
#
# {No,inst,n} name,      attid,coord
# {0,0,16}    'Test_One',0,    '{0, 0}'
# {0,0,17}    'Test_One',0,    '{25, 25}'
# {0,0,18}    'Test_One',0,    '{50, 50}'
# {0,0,19}    'Test_One',0,    '{75, 75}'
#  ... 
# {1,0,0}     'Test_Two',0,    '{0, 0}'
# {1,0,1}     'Test_Two',0,    '{25, 25}'
# {1,0,2}     'Test_Two',0,    '{50, 50}'
# {1,0,3}     'Test_Two',0,    '{75, 75}'
# {1,0,4}     'Test_Two',1,    '{0, 0}'
#  ... 
#
#  etc. 
#
#------------------------------------------------------------------------------
#
#   Basic Chunk Identifying Information from the Chunk Map
#   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
#   Let's focus on the chunks for a particular array. 
#
#  Q5: What chunks exist in SciDB for a particular array?
#
CMD_WHAT_CHUNKS_EXIST_FOR_ONE_ARRAY="
SELECT A.name,
       C.attid,
       C.instn,
       C.coord
  FROM list('arrays') AS A CROSS JOIN list('chunk map') AS C
 WHERE A.id = C.uaid
   AND A.name = 'Test_One'
"
exec_aql_query "${CMD_WHAT_CHUNKS_EXIST_FOR_ONE_ARRAY}"
#
#  {No,inst,n} name,attid,instn,coord
#  {0,0,16} 'Test_One',0,0,'{0, 0}'
#  {0,0,17} 'Test_One',0,0,'{25, 25}'
#  {0,0,18} 'Test_One',0,0,'{50, 50}'
#  {0,0,19} 'Test_One',0,0,'{75, 75}'
#  {0,0,20} 'Test_One',1,0,'{0, 0}'
#  ...
#  {0,3,26} 'Test_One',2,3,'{50, 25}'
#  {0,3,27} 'Test_One',2,3,'{75, 0}'
#  {0,3,28} 'Test_One',3,3,'{0, 75}'
#  {0,3,29} 'Test_One',3,3,'{25, 50}'
#  {0,3,30} 'Test_One',3,3,'{50, 25}'
#  {0,3,31} 'Test_One',3,3,'{75, 0}'
#
#  Q5.1: Can we get some basic information about the distribution of 
#        those chunks? 
#
CMD_HOW_MANY_CHUNKS_FOR_THAT_ARRAY_ON_EACH_INSTANCE="
SELECT COUNT(*) AS CNT_PER_INSTANCE,
       SUM ( C.nelem ) AS NUM_CELLS,
       MIN ( C.nelem ) AS MIN_CELLS_PER_CHUNK,
       MAX ( C.nelem ) AS MAX_CELLS_PER_CHUNK,
       AVG ( C.nelem ) AS AVG_CELLS_PER_CHUNK
  FROM list('arrays') AS A CROSS JOIN list('chunk map') AS C
 WHERE A.id = C.uaid
   AND A.name = 'Test_One'
 GROUP BY C.inst
"
exec_aql_query "${CMD_HOW_MANY_CHUNKS_FOR_THAT_ARRAY_ON_EACH_INSTANCE}"
#
#  {inst} CNT_PER_INSTANCE
#  {0} 16
#  {1} 16
#  {2} 16
#  {3} 16
#
#  So, for the Test_One array, there are 16 chunks on each instance. Early
# on, we noted that there are four 'attributes'--three explicit, and one 
# system maintained--in the Test_One array. So there are 4 'logical chunks'
# from Test_One on each of the instances. 
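#
#   The same join pattern works with other aggregates too. For example 
#  (a sketch, reusing the GROUP BY C.inst form from Q5.1), here is how 
#  many bytes of Test_One's chunk data sit on each instance: 
#
CMD_HOW_MANY_BYTES_FOR_THAT_ARRAY_ON_EACH_INSTANCE="
SELECT SUM ( C.usize ) AS BYTES_PER_INSTANCE
  FROM list('arrays') AS A CROSS JOIN list('chunk map') AS C
 WHERE A.id = C.uaid
   AND A.name = 'Test_One'
 GROUP BY C.inst
"
exec_aql_query "${CMD_HOW_MANY_BYTES_FOR_THAT_ARRAY_ON_EACH_INSTANCE}"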
#  
#------------------------------------------------------------------------------
#
#   SciDB Chunk Map and Array Updating
#   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
#  Let's examine what happens to the information in the chunk map when we
# update a chunk. 
#
#   NOTE: This section reveals something about how SciDB works at the 
#         moment that you might not want to know! We're actively working
#         to improve this. But so far, we don't have very many (any?) 
#         customers with significant UPDATE ... workloads. ( More about 
#         INSERT ... below! ) 
#
#  Q6.1 Let's have a look at the data in a particular cell. 
#
CMD_SHOW_ATTRIBUTES_IN_CELL_98_78="
SELECT * FROM Test_One WHERE I = 98 AND J = 78
"
exec_aql_query "${CMD_SHOW_ATTRIBUTES_IN_CELL_98_78}"
#
#  {I,J} attr1,attr2,attr3
#  {98,78} 93,1.26,'XXX-1195-YYY'
#
#
#   Now, we know that the cell Test_One[98,78] is in the chunk that's 
#  anchored at '{75,75}'. What do these chunks look like? 
#
#  Q6.2 What are the details for the chunks that hold the data for 
#       Test_One[ 98, 78 ]? 
CMD_CHUNK_DETAILS_FOR_75_75="
SELECT A.name,
       C.attid,
       C.instn,
       C.coord
  FROM list('arrays') AS A CROSS JOIN list('chunk map') AS C
 WHERE A.id = C.uaid
   AND A.name = 'Test_One'
   AND C.coord = '{75, 75}'
"
exec_aql_query "${CMD_CHUNK_DETAILS_FOR_75_75}"
#
#  {No,inst,n} name,      attid,instn,coord
#  {0,0,19}    'Test_One',0,    0,    '{75, 75}'
#  {0,0,23}    'Test_One',1,    0,    '{75, 75}'
#  {0,0,27}    'Test_One',2,    0,    '{75, 75}'
#  {0,0,31}    'Test_One',3,    0,    '{75, 75}'
# 
#  Q6.3 Update the value of attr1 in Test_One [ 98, 78 ]. 
#
CMD_UPDATE_ATTR1_IN_TEST_ONE="
UPDATE Test_One SET attr1 = attr1 + 1 WHERE I = 98 AND J = 78
"
exec_aql_query "${CMD_UPDATE_ATTR1_IN_TEST_ONE}" -n 
# 
#  Q6.3.1 What does that cell's data look like now? 
#
exec_aql_query "${CMD_SHOW_ATTRIBUTES_IN_CELL_98_78}"
#
#  {I,J} attr1,attr2,attr3
#  {98,78} 94,1.26,'XXX-1195-YYY'
#
#  Q6.4 And what do the cell's chunks look like now? 
exec_aql_query "${CMD_CHUNK_DETAILS_FOR_75_75}"
#
#  {No,inst,n} name,      attid,instn,coord
#  {0,0,22}    'Test_One',0,    0,    '{75, 75}'
#  {0,0,23}    'Test_One',0,    0,    '{75, 75}'
#  {0,0,27}    'Test_One',1,    0,    '{75, 75}'
#  {0,0,31}    'Test_One',2,    0,    '{75, 75}'
#  {0,0,35}    'Test_One',3,    0,    '{75, 75}'
#
#  NOTES: 
#
#   1. In this case, the update to the cell's attribute caused SciDB
#      to create a new chunk (n = 22) to hold the new data. 
#
#  Q6.5 Can we get some more information about those chunks? 
CMD_MORE_CHUNK_DETAILS_FOR_75_75="
SELECT C.attid,
       C.arrid,
       C.tstmp,
       C.coord
  FROM list('arrays') AS A CROSS JOIN list('chunk map') AS C
 WHERE A.id = C.uaid
   AND A.name = 'Test_One'
   AND C.coord = '{75, 75}'
"
exec_aql_query "${CMD_MORE_CHUNK_DETAILS_FOR_75_75}"
#
#  {No,inst,n} attid,arrid,tstmp,coord
#  {0,0,22}    0,    5,    37,   '{75, 75}'
#  {0,0,23}    0,    2,    14,   '{75, 75}'
#  {0,0,27}    1,    2,    15,   '{75, 75}'
#  {0,0,31}    2,    2,    16,   '{75, 75}'
#  {0,0,35}    3,    2,    17,   '{75, 75}'
#
#
#  NOTES: 
#
#     1. Although the uaid (the basic array id, which is found here in the 
#        'No' dimension from the list('arrays') source) is the same for the 
#        new/changed chunk, the change to the cell's value means SciDB created 
#        a new 'version' of the array with arrid = 5. 
#
#    2.  Note the tstmp. This reflects the order in which the chunks were 
#        created. It isn't a clock-time. But for the purposes of figuring 
#        out what happened in your SciDB database in the past, it's very 
#        useful. 
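#
#    3.  If you want to cross-check the arrid values against the catalog's 
#        own record of the array's history, the versions(...) operator 
#        lists the version ids and their creation times (a sketch; the 
#        exact output columns vary a little between releases). 
#
CMD_LIST_VERSIONS_OF_TEST_ONE="versions ( Test_One )"
exec_afl_query "${CMD_LIST_VERSIONS_OF_TEST_ONE}"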
#
#------------------------------------------------------------------------------
#
#  Chunk Element Counts and Size Information from Chunk Map
#  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   
#   One especially useful piece of information that can be gleaned from the 
#  chunk map concerns the sizes of chunks. One of the performance tuning 
#  objectives that we try to maintain is to keep chunk sizes between 
#  1MB and 100MB, which means (if we assume 8 bytes per data value) 
#  somewhere between about 125,000 and about 12,500,000 values per 
#  chunk. 
#
#  Q7: Find the min, max and average number of cells per chunk for the 
#      Test_One array. 
#    
CMD_FIND_RANGE_OF_PER_CHUNK_ELEMENTS="
SELECT MIN ( C.nelem ) AS MIN_CELLS_PER_CHUNK,
       MAX ( C.nelem ) AS MAX_CELLS_PER_CHUNK,
       AVG ( C.nelem ) AS AVG_CELLS_PER_CHUNK
  FROM list('arrays') AS A CROSS JOIN list('chunk map') AS C
 WHERE A.id = C.uaid
   AND A.name = 'Test_One'
"
exec_aql_query "${CMD_FIND_RANGE_OF_PER_CHUNK_ELEMENTS}"
#
# {i} MIN_CELLS_PER_CHUNK,MAX_CELLS_PER_CHUNK,AVG_CELLS_PER_CHUNK
# {0} 6,                  7,                  6.25
#
#  NOTE: In the examples we're using, the arrays and chunks are tiny.
#
#  Q7.1 A more sophisticated idea might be to compute a quantile(...) 
#       distribution of the number of cells per chunk. 
#
CMD_COMPUTE_QUANTILE_OF_CHUNKS_PER_ELEMENT="
SELECT * 
  FROM quantile ( 
         ( SELECT C.nelem 
             FROM list('arrays') AS A CROSS JOIN list('chunk map') AS C
            WHERE A.id = C.uaid AND A.name = 'Test_One' 
         ),5
       )
"
exec_aql_query "${CMD_COMPUTE_QUANTILE_OF_CHUNKS_PER_ELEMENT}"
#
#  {quantile} percentage,nelem_quantile
#  {0}        0,         6
#  {1}        0.2,       6
#  {2}        0.4,       6
#  {3}        0.6,       6
#  {4}        0.8,       7
#  {5}        1,         7
#
#   Similarly, you can get details about chunk sizes. 
#
#  Q7.2 Find the range of sizes of the chunks for the Test_One array. 
#
CMD_FIND_RANGE_OF_PER_CHUNK_SIZES="
SELECT MIN ( C.asize ) AS MIN_ALLOC,
       MAX ( C.asize ) AS MAX_ALLOC,
       AVG ( C.asize ) AS AVG_ALLOC,
       MIN ( C.usize ) AS MIN_SIZE,
       MAX ( C.usize ) AS MAX_SIZE,
       AVG ( C.usize ) AS AVG_SIZE
  FROM list('arrays') AS A CROSS JOIN list('chunk map') AS C
 WHERE A.id = C.uaid
   AND A.name = 'Test_One'
"
exec_aql_query "${CMD_FIND_RANGE_OF_PER_CHUNK_SIZES}"
#
# {i} MIN_ALLOC,MAX_ALLOC,AVG_ALLOC,MIN_SIZE,MAX_SIZE,AVG_SIZE
# {0} 120,      198,      144.787,  120,     198,     144.787
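#
#  Q7.3 Summing the same fields gives a rough total on-disk footprint for 
#       one array (a sketch; asize is the space allocated on disk, usize 
#       the uncompressed in-memory size). 
#
CMD_TOTAL_SIZE_FOR_ONE_ARRAY="
SELECT SUM ( C.asize ) AS TOTAL_ALLOCATED,
       SUM ( C.usize ) AS TOTAL_UNCOMPRESSED
  FROM list('arrays') AS A CROSS JOIN list('chunk map') AS C
 WHERE A.id = C.uaid
   AND A.name = 'Test_One'
"
exec_aql_query "${CMD_TOTAL_SIZE_FOR_ONE_ARRAY}"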
#  
#------------------------------------------------------------------------------
#
#   Getting the Per-Chunk Boundary Information
#   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
#    We can also get information about a chunk's extent. 
#
#   Q8.1: Show the outline--start to end coordinate--of a particular chunk
#         in the Test_One array. 
# 
CMD_CHUNK_SIZE_DETAILS="
SELECT C.attid,
       C.arrid,
       C.coord,
       C.lpos
  FROM list('arrays') AS A CROSS JOIN list('chunk map') AS C
 WHERE A.id = C.uaid
   AND A.name = 'Test_One'
   AND C.coord = '{75, 75}'
"
exec_aql_query "${CMD_CHUNK_SIZE_DETAILS}"
#
#  {No,inst,n} attid,arrid,coord,     lpos
#  {0,0,22}    0,    5,    '{75, 75}','{99, 99}'
#  {0,0,23}    0,    2,    '{75, 75}','{99, 99}'
#  {0,0,27}    1,    2,    '{75, 75}','{99, 99}'
#  {0,0,31}    2,    2,    '{75, 75}','{99, 99}'
#  {0,0,35}    3,    2,    '{75, 75}','{99, 99}'
#  
#   Note that, for the Test_Two array, the same 'logical' chunk is 
# located on the same instance as it is for Test_One. 
#
#   Q8.2: Show the outline--start to end coordinate--of a particular chunk
#         in the Test_Two array. 
#
CMD_CHUNK_OUTLINE_DETAILS_TEST_TWO="
SELECT C.attid,
       C.arrid,
       C.coord,
       C.lpos
  FROM list('arrays') AS A CROSS JOIN list('chunk map') AS C
 WHERE A.id = C.uaid
   AND A.name = 'Test_Two'
   AND C.coord = '{75, 75}'
"
exec_aql_query "${CMD_CHUNK_OUTLINE_DETAILS_TEST_TWO}"
#
#  {No,inst,n} attid,arrid,coord,     lpos
#  {1,0,3}     0,    4,    '{75, 75}','{99, 99}'
#  {1,0,7}     1,    4,    '{75, 75}','{99, 99}'
#  {1,0,11}    2,    4,    '{75, 75}','{99, 99}'
#  {1,0,15}    3,    4,    '{75, 75}','{99, 99}'
#  
#    Note that the same "logical" chunk ... {75,75}->{99,99} ... from 
#  both arrays is found on the same instance (inst = 0). This is fundamental 
#  to the way SciDB parallelizes operations as it scales. 
#
#   Q8.3: Show the outline--start to end coordinate including overlaps--of a 
#         particular chunk in the Test_Two array. ( Recall that the Test_Two 
#         array was created with overlapping chunks. ) 
# 
CMD_MORE_CHUNK_OUTLINE_DETAILS_TEST_TWO="
SELECT C.attid,
       C.arrid,
       C.coord,
       C.lpos,
       C.fposo,
       C.lposo
  FROM list('arrays') AS A CROSS JOIN list('chunk map') AS C
 WHERE A.id = C.uaid
   AND A.name = 'Test_Two'
   AND C.coord = '{75, 75}'
"
exec_aql_query "${CMD_MORE_CHUNK_OUTLINE_DETAILS_TEST_TWO}"
#
# {No,inst,n} attid,arrid,coord,      lpos,     fposo,     lposo
# {1,0,3}     0,    4,    '{75, 75}','{99, 99}','{70, 70}','{99, 99}'
# {1,0,7}     1,    4,    '{75, 75}','{99, 99}','{70, 70}','{99, 99}'
# {1,0,11}    2,    4,    '{75, 75}','{99, 99}','{70, 70}','{99, 99}'
# {1,0,15}    3,    4,    '{75, 75}','{99, 99}','{70, 70}','{99, 99}'
#
#   The fposo reports the starting cell of the chunk, including the 
#  overlaps. The lposo is the same as lpos, in this case, because 
#  the chunk runs up against the edge of the array. 
#
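#   For an interior chunk the overlap shows up on both ends. As a sketch, 
#  picking the Test_Two chunk anchored at '{25, 25}' (inner box {25, 25} 
#  to {49, 49}), with an overlap of 5 we would expect fposo = '{20, 20}' 
#  and lposo = '{54, 54}'. 
#
CMD_OVERLAP_DETAILS_FOR_INTERIOR_CHUNK="
SELECT C.attid,
       C.coord,
       C.lpos,
       C.fposo,
       C.lposo
  FROM list('arrays') AS A CROSS JOIN list('chunk map') AS C
 WHERE A.id = C.uaid
   AND A.name = 'Test_Two'
   AND C.coord = '{25, 25}'
"
exec_aql_query "${CMD_OVERLAP_DETAILS_FOR_INTERIOR_CHUNK}"
#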
#------------------------------------------------------------------------------
#
#   Chunk Map and Caching
#   ~~~~~~~~~~~~~~~~~~~~~
#
#   The chunk map is not only a reflection of the state of the persistent 
#  chunks. It also contains information about the current state of things like
#  the chunks in the SciDB cache. If a chunk is in cache, then the 'data' 
#  attribute of the chunk map is non-zero. 
#
#   For the purposes of this section of the script, it's necessary that we 
#  re-start SciDB to empty the in-memory cache. 
#
#  scidb.py stopall local4; scidb.py startall local4
#
#  Q9.1: How many chunks are there in the map? 
CMD_HOW_MANY_CHUNKS_IN_TOTAL="SELECT COUNT(*) FROM list('chunk map')"
exec_aql_query "${CMD_HOW_MANY_CHUNKS_IN_TOTAL}"
#
#  {i} COUNT
#  {0} 144
#
#  Q9.2: How many chunks are there in cache? 
CMD_HOW_MANY_CHUNKS_IN_CACHE="
SELECT COUNT(*) 
  FROM list('chunk map') 
 WHERE data != 0
"
exec_aql_query "${CMD_HOW_MANY_CHUNKS_IN_CACHE}"
#
#  {i} COUNT
#  {0} 144
#
#   Now, suppose we run a query that pulls just one attribute from 
#  the Test_One array. 
#
#  Q9.3: Query to pull the data chunks for attr1, and the bitmask chunks, 
#        into cache. 
CMD_ONE_ATTRIBUTE="
SELECT T.attr1 FROM Test_One AS T
"
exec_aql_query "${CMD_ONE_ATTRIBUTE}" 
#
#  Now how many chunks are in cache? 
#
#  Q9.4: How many chunks in cache? 
#
exec_aql_query "${CMD_HOW_MANY_CHUNKS_IN_CACHE}"
#
# {i} COUNT
# {0} 32
#   
CMD_DETAILS_OF_CHUNKS_IN_CACHE="
SELECT A.name, 
       C.attid,
       C.coord
  FROM list('arrays') AS A CROSS JOIN list('chunk map') AS C
 WHERE A.id = C.uaid
   AND C.data != 0
"
exec_aql_query "${CMD_DETAILS_OF_CHUNKS_IN_CACHE}"
#
#  {No,inst,n} name,attid,coord
#  {0,0,16} 'Test_One',0,'{0, 0}'
#  {0,0,18} 'Test_One',0,'{25, 25}'
#  {0,0,20} 'Test_One',0,'{50, 50}'
#  {0,0,22} 'Test_One',0,'{75, 75}'
#  {0,0,32} 'Test_One',3,'{0, 0}'
#  ... 
#  {0,3,22} 'Test_One',0,'{75, 0}'
#  {0,3,32} 'Test_One',3,'{0, 75}'
#  {0,3,33} 'Test_One',3,'{25, 50}'
#  {0,3,34} 'Test_One',3,'{50, 25}'
#  {0,3,35} 'Test_One',3,'{75, 0}'
#
#  NOTE: Only chunks for Test_One, and only chunks for attr1 ( attid = 0 ) 
#        and the isEmpty bitMask ( attid = 3 ) have been read into cache. 
# 
#------------------------------------------------------------------------------

#2

Hi,
I want to know whether a segment includes chunks in SciDB 14.12.
And what's the relationship between a segment and a chunk?

Thank you. :smile:


#3

Hi—
With 14.12 we’ve changed the way we organize chunks on disk, and we no longer use the concept of a “segment”.
–Steve F
Paradigm4


#4

[quote=“stevef”]Hi—
With 14.12 we’ve changed the way we organize chunks on disk, and we no longer use the concept of a “segment”.
–Steve F
Paradigm4[/quote]

Hi,
Ok, thank you for your reply.


#5

The code looks like a great idea, except it does not work.
I am getting errors like this:
Query:
SELECT A.name,
       C.attid,
       C.coord
  FROM list('arrays') AS A CROSS JOIN list('chunk map') AS C
 WHERE A.id = C.uaid

UserQueryException in file: src/query/parser/Translator.cpp function: checkLogicalExpression line: 2033
Error id: scidb::SCIDB_SE_SYNTAX::SCIDB_LE_UNKNOWN_ATTRIBUTE_OR_DIMENSION
Error description: Query syntax error. Attribute or dimension reference ‘A.id’ does not exist.


#6

Sorry about the confusion.
As of 14.12, the field “id” in the list(‘arrays’) output is now split into two fields - uaid and arrid.
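So the joining query from the previous post should run on 14.12 if you replace A.id with A.uaid (a sketch, not tested here):

SELECT A.name,
       C.attid,
       C.coord
  FROM list('arrays') AS A CROSS JOIN list('chunk map') AS C
 WHERE A.uaid = C.uaid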

Now…
You should definitely use the AFL form because the AQL might not give you the best performance at the moment.
For an AFL example, see
viewtopic.php?f=18&t=1091
That should absolutely work for 14.12

I saw your other post about “read-only chunk”. Definitely something we’ll need to look at.
The chunk map probably won’t help with that - I’ll ping Igor and Steve to see if they can take a look at it.


#7

Hi, I’ve come to this post from

and I still wonder: is it possible to find out the volume in bytes an array occupies on disk (using a single command)?

Thanks