Find out the size of a compressed array


#1

Hello,

I am trying to compress an array with the zlib and bzlib options, but according to the summarize plugin I get the same byte count for both compressed arrays and for the uncompressed array:

scidb@SciDBVM0001:~$ iquery -aq "summarize(zlib_t, 'per_attribute=1')"
{inst,attid} att,count,bytes,chunks,min_count,avg_count,max_count,min_bytes,avg_bytes,max_bytes
{0,0} 'band1',569935135,792660666,156,118047,3.65343e+06,4194304,74,5.08116e+06,8356054
{0,1} 'EmptyTag',569935135,7488,156,118047,3.65343e+06,4194304,48,48,48

scidb@SciDBVM0001:~$ iquery -aq "summarize(bzlib_t2, 'per_attribute=1')"
{inst,attid} att,count,bytes,chunks,min_count,avg_count,max_count,min_bytes,avg_bytes,max_bytes
{0,0} 'band1',569935135,792660666,156,118047,3.65343e+06,4194304,74,5.08116e+06,8356054
{0,1} 'EmptyTag',569935135,7488,156,118047,3.65343e+06,4194304,48,48,48

scidb@SciDBVM0001:~$ iquery -aq "summarize(uncompressed_t3, 'per_attribute=1')"
{inst,attid} att,count,bytes,chunks,min_count,avg_count,max_count,min_bytes,avg_bytes,max_bytes
{0,0} 'band1',569935135,792660666,156,118047,3.65343e+06,4194304,74,5.08116e+06,8356054
{0,1} 'EmptyTag',569935135,7488,156,118047,3.65343e+06,4194304,48,48,48

How do I know whether compression was actually applied to an array?
What is the correct way to find out the size of a compressed array?


#2

Hi @raro, this is expected behavior. Summarize only returns the uncompressed size. To get the compressed, on-disk size, use list('chunk map'). There’s an example in this thread: Method to get array storage size


#3

Summarize always sums the “uncompressed size” of the chunks to obtain the bytes column.

There is an undocumented internal list option, list('chunk map') (its output and availability are subject to change), which currently provides:

{inst,n} svrsn,instn,dsid,doffs,uaid,aid,attid,coord,comp,flags,nelem,csize,usize,asize

where:

  • svrsn: Storage Version
  • instn: instance id
  • dsid: datastore (disk file) id
  • doffs: datastore (disk file) offset
  • uaid: un-versioned array id
  • aid: array id
  • attid: attribute id
  • coord: starting coordinates of the chunk
  • comp: compression type
  • flags: chunk flags (indicates whether the chunk is a “tombstone”)
  • nelem: number of elements/cells in the chunk
  • csize: the compressed size (bytes) of the chunk
  • usize: the uncompressed size (bytes) of the chunk
  • asize: the allocated size (inside the backing datastore file) for the chunk

Depending on the scope of your needs, you might want to keep this in mind:

  • These values are reported per chunk, per attribute id, per array id, and per instance, so any sum has to be aggregated over the right combination of those columns.
  • The chunks of the current array (identified by the un-versioned array id, uaid) may not all share the same versioned array id (aid), so handling versions takes some extra work. For example, if you delete cells that lie in only one chunk, the other chunks are not copied, and no new aid rows are created for the unaffected chunks.
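
To make the aggregation concrete, here is a minimal Python sketch that sums csize, usize and asize per attribute for one uaid. It assumes the chunk map rows have already been loaded into dictionaries keyed by the field names above. The sample rows and the helper name storage_by_attribute are made up for illustration and are not real SciDB output or API. As a simplification, it adds up every row for the uaid across all aid values and does not look at the flags column, so superseded or tombstone chunks are counted too.

```python
from collections import defaultdict

# Hypothetical rows shaped like the 'chunk map' fields listed above.
# The numbers are invented for illustration; they are not real SciDB output.
rows = [
    {"uaid": 10, "aid": 11, "attid": 0, "instn": 0, "csize": 400, "usize": 1000, "asize": 512},
    {"uaid": 10, "aid": 11, "attid": 0, "instn": 1, "csize": 300, "usize": 1000, "asize": 512},
    {"uaid": 10, "aid": 12, "attid": 0, "instn": 0, "csize": 100, "usize": 500, "asize": 128},
    {"uaid": 10, "aid": 11, "attid": 1, "instn": 0, "csize": 48, "usize": 48, "asize": 64},
    {"uaid": 20, "aid": 21, "attid": 0, "instn": 0, "csize": 999, "usize": 999, "asize": 1024},
]


def storage_by_attribute(rows, uaid):
    """Sum compressed, uncompressed and allocated bytes per attid for one uaid.

    Simplification: rows from every array version (aid) and every instance
    are added together, and the flags column is ignored.
    """
    totals = defaultdict(lambda: {"csize": 0, "usize": 0, "asize": 0, "chunks": 0})
    for row in rows:
        if row["uaid"] != uaid:
            continue  # chunk belongs to a different array
        entry = totals[row["attid"]]
        for field in ("csize", "usize", "asize"):
            entry[field] += row[field]
        entry["chunks"] += 1
    return dict(totals)


totals = storage_by_attribute(rows, uaid=10)
for attid, t in sorted(totals.items()):
    # A ratio close to 1 suggests the chunks were stored without effective compression.
    ratio = t["usize"] / t["csize"] if t["csize"] else float("nan")
    print(f"attid={attid} chunks={t['chunks']} csize={t['csize']} "
          f"usize={t['usize']} asize={t['asize']} ratio={ratio:.2f}")
```

Comparing the summed csize with the summed usize for an attribute is one way to check whether compression had any effect. To restrict the totals to a single array version, the rows would additionally need to be filtered by aid, keeping in mind the caveat above that unaffected chunks keep their older aid.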

I assume you are using a pre-18.1 release, since you are using the summarize plugin. The summarize operator was added to core SciDB in the 18.1 release.


#4

Thank you! This helps!