Empty vs null values


#1

Hello,

I have a couple of questions about the difference between empty cell values and null values, in the context of disk storage requirements, chunk size selection, and query performance.

When deciding chunk sizes, we only consider the number of non-empty cells. Do “null” values count the same as “empty” values here?

For example, consider a 100-million-row array.

I can choose my dimension chunk size as 100M if I know that only 1 out of 100 rows will be non-empty. Does the same apply if, instead of ‘empty’ cells, I have ‘null’ cells?
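
For concreteness, something along these lines (a purely hypothetical schema; the array and attribute names are made up):

#  Hypothetical schema only: 100M logical rows, two made-up nullable attributes
#  (a1, a2), and the chunk interval set to the full 100,000,000.
iquery -q "CREATE ARRAY example_array < a1 : int64 null, a2 : int64 null > [ I=0:99999999,100000000,0 ]"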

In other words, if my original CSV file was:

100,1,1
200,1,1
300,1,1
400,1,1
500,1,1

What are the implications for (1) query performance and (2) disk utilization if my CSV file is instead:

1,null,null
2,null,null
3,null,null
4,null,null
.
.
100,1,1
101,null,null
102,null,null
.
.
200,1,1
201,null,null
.
.

Thanks
mike


#2

Mike!

The short answer to your question is that our best advice is to model your data as well as possible when you create your array, and not to worry too much about the physical details (for reasons I’ll go into in a minute). The rule of thumb is that ‘cells can be empty, and attributes can be missing’. In your case, there’s an important semantic difference between the two arrays you’ve proposed: they have very different shapes. One is a sparse array, the other is dense, and this difference propagates through your queries and analysis. At the most basic level, the number of cells in the ‘dense’ array (the one with missing codes) is very large, while the number of cells in the ‘sparse’ array (the one with empty cells) is 1/100th of that count.
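
To make that count difference concrete at a toy scale, here is a quick sketch using the same build_sparse(…) and merge(…) calls as the script below, just over 100 cells instead of 100 million:

#  Dense case: all 100 cells exist, 90 of them holding a missing code.
#  Expect count = 100.
iquery -aq "count ( merge ( build_sparse ( < v : int64 null > [I=0:99,10,0], I, I%10=0 ), build_sparse ( < v : int64 null > [I=0:99,10,0], null, I%10!=0 ) ) )"
#
#  Sparse case: only the 10 cells that actually hold a value exist; the rest
#  are empty.  Expect count = 10.
iquery -aq "count ( build_sparse ( < v : int64 null > [I=0:99,10,0], I, I%10=0 ) )"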

But the physical size required to hold these two arrays is about the same. Take a look at the script I’ve included below; feel free to run it yourself.

What I’ve done here is reproduce your arrays using merge(…) and build_sparse(…) rather than loading the data from a .csv, but I think it’s a fair representation of what you had. The script produces two arrays. The first, ‘Merged_Result’, is the dense one: it contains 100,000,000 cells, and in 1/100th of these (1,000,000 cells) the attribute contains a value, while in all the other cells the attribute contains a missing code. The second, ‘Sparse_Result’, still has logically 100,000,000 cells (according to its definition in the build_sparse(…)), and the range of the two arrays is the same (same start point along the I dimension, same end point). Note, though, that when you count the number of cells in ‘Sparse_Result’, you get only 1,000,000.

A note about what’s going on ‘under the covers’. In SciDB, we store data for different attributes in different chunks; we’re a ‘column store’. But to accommodate both sparse and dense chunk data, we use an additional ‘hidden’ attribute that you can think of as a bit mask: if a cell is empty, its bit is set; otherwise it’s not. This bitmask is shared by all of the chunk’s attributes.
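
Once you’ve run the script below, one way to glimpse this hidden attribute is to count the chunk-map entries for ‘Merged_Result’ (a sketch built from the same list('chunk map') query the script uses; with one declared attribute plus the bitmask, you’d expect two entries per chunk position):

iquery -aq "
aggregate (
  filter (
    cross (
      list ('chunk map') AS C,
      filter ( list ('arrays', true ), name = 'Merged_Result' ) AS A
    ),
    C.uaid = A.id
  ),
  count ( * ) AS chunk_map_entries
)"
#  Expect 200: 100 chunk positions x (1 attribute + 1 isEmpty bitmask).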

In addition, we make aggressive use of techniques like RLE (run-length encoding) to compress our representation of the data. We do this both for the isEmpty bitmask, which means that for a ‘dense’ array the bitmask is trivial in size, and for the data itself; in this case, because the dense data contains long runs of ‘missing’, we’re able to compress that as well.

As a result, in your example, the storage size reflects the information content of the arrays, and because both arrays contain roughly the same amount of information, they compress down to about the same size.

Anyway, I hope this helps. The rules of thumb are: (1) model your array according to your data, and don’t worry too much about the physical details; (2) because of our underlying representation, we do a pretty good job of compressing data; and (3) ‘cells can be empty, and attributes can be missing’.

#!/bin/sh
#
#  Script on tools and techniques for sizing nulls and missing codes.
#
#------------------------------------------------------------------------------
#
#  Useful shell script functions.
#
exec_afl_query () {
    echo "Query: ${1}"
    /usr/bin/time -f "Elapsed Time: %E" iquery -o dcsv ${2} -aq "${1}"
};
#
exec_aql_query () {
    echo "Query: ${1}"
    /usr/bin/time -f "Elapsed Time: %E" iquery -o dcsv ${2} -q "${1}"
};
#
#
#------------------------------------------------------------------------------
#
#  Start with a small scale to get a handle on what we're up to. 
#
CMD_CREATE_ARRAY_WITH_VALUES="
build_sparse ( < v : int64 null > [I=0:99,10,0 ], I, I%10=0 )
"
exec_afl_query "${CMD_CREATE_ARRAY_WITH_VALUES}"
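#  Expect 10 non-empty cells, at I = 0, 10, ..., 90, each with v = I; the
#  remaining 90 cells are simply empty.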
#
CMD_CREATE_ARRAY_WITH_MISSING_VALUES="
build_sparse ( < v : int64 null > [I=0:99,10,0 ], null, I%10!=0 )
"
exec_afl_query "${CMD_CREATE_ARRAY_WITH_MISSING_VALUES}"
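#  Expect 90 cells (every I that is not a multiple of 10), each holding a
#  missing (null) code; the other 10 cells are empty.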
#
CMD_MERGED_VALUES_AND_MISSING_CODES="
merge ( 
  ${CMD_CREATE_ARRAY_WITH_VALUES},
  ${CMD_CREATE_ARRAY_WITH_MISSING_VALUES}
)
"
exec_afl_query "${CMD_MERGED_VALUES_AND_MISSING_CODES}"
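#  The merge fills the array out: all 100 cells now exist, 10 with values
#  and 90 with missing codes.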
#
#------------------------------------------------------------------------------
#
#   Increase the scale. 
#
CMD_CREATE_ARRAY_WITH_VALUES="
build_sparse ( < v : int64 null > [I=0:99999999,1000000,0 ], I, I%100=0 )
"
#
CMD_CREATE_ARRAY_WITH_MISSING_VALUES="
build_sparse ( < v : int64 null > [I=0:99999999,1000000,0 ], null, I%100!=0 )
"
#
#   Store the dense result: the merge of values and missing codes. 
CMD_STORE_MERGED_VALUES_AND_MISSING_CODES="
store ( 
  merge ( 
    ${CMD_CREATE_ARRAY_WITH_VALUES},
    ${CMD_CREATE_ARRAY_WITH_MISSING_VALUES}
  ),
  Merged_Result
)
"
exec_afl_query "${CMD_STORE_MERGED_VALUES_AND_MISSING_CODES};" -n 
#
exec_afl_query "count ( Merged_Result );"
#
#  {i} count
#  {0} 100000000
#  Elapsed Time: 0:00.25
#
#   Store the sparse result: just values.
CMD_STORE_SPARSE_VALUES="
store ( 
  ${CMD_CREATE_ARRAY_WITH_VALUES},
  Sparse_Result
)
"
exec_afl_query "${CMD_STORE_SPARSE_VALUES}" -n 
#
exec_afl_query "count ( Sparse_Result );"
#
# Query: count ( Sparse_Result );
# {i} count
# {0} 1000000
# Elapsed Time: 0:00.10
#
#------------------------------------------------------------------------------
#
#   Let's have a look at the result of these two approaches to storing the 
#  data. 
#
CMD_ANALYSIS_OF_CHUNKS_WITH_MISSING_CODES="
aggregate (
  project (
   filter (
     cross (
       list ('chunk map') AS C,
       filter (
         list ('arrays', true ),
         name = 'Merged_Result'
       )  AS A
     ),
     C.uaid = A.id
   ),
   C.nelem,
   C.asize
 ),
 count ( * ) AS number_of_chunks,
 sum ( C.nelem ) AS number_of_cells,
 sum ( C.asize ) AS storage_allocated
)
"
#
#  {i} number_of_chunks,number_of_cells,storage_allocated
#  {0} 200,200000000,32010800
#  Elapsed Time: 0:00.25
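#  (These figures presumably cover the hidden isEmpty bitmask as well as the
#  declared attribute: 100 chunk positions x 2 = 200 chunks, and likewise
#  2 x 100,000,000 = 200,000,000 elements.)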
#
exec_afl_query "${CMD_ANALYSIS_OF_CHUNKS_WITH_MISSING_CODES};"
#
#
CMD_ANALYSIS_OF_SPARSE_CHUNKS="
aggregate (
  project (
   filter (
     cross (
       list ('chunk map') AS C,
       filter (
         list ('arrays', true ),
         name = 'Sparse_Result'
       )  AS A
     ),
     C.uaid = A.id
   ),
   C.nelem,
   C.asize
 ),
 count ( * ) AS number_of_chunks,
 sum ( C.nelem ) AS number_of_cells,
 sum ( C.asize ) AS storage_allocated
)
"
#
exec_afl_query "${CMD_ANALYSIS_OF_SPARSE_CHUNKS};"
#
#  {i} number_of_chunks,number_of_cells,storage_allocated
#  {0} 200,2000000,32009600
#  Elapsed Time: 0:00.24
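#  Again the figures cover both the data and the hidden bitmask.  The key
#  comparison is storage_allocated: 32,009,600 bytes here versus 32,010,800
#  for Merged_Result, i.e. the two representations take up essentially the
#  same space on disk.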
#
#------------------------------------------------------------------------------

#3

Thanks for the detailed info. The script helped a lot with “visualizing” the two scenarios.