Method to get array storage size


#1

Hello, I want to know if there is a command to get the storage size of an array, similar to pg_total_relation_size() in PostgreSQL. I did not find one, but I know that arrays are stored in SciDB’s binary format under the folder data/000/0/datastore, so with “ls -lh” and “du -sh” I can get the storage size of an array. However, the figures, i.e. the sizes returned by the different commands, differ. So is there a reasonably accurate way to assess the storage size of a specific array?

PS: Actually, for a single array two files are stored: one ends with .data while the other has .data.fl as its file name extension. If *.data stores the real data, does *.data.fl contain metadata such as the array version information?
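
For reference, here is a minimal sketch of the kind of per-file check I mean, just summing the sizes of the .data and .data.fl files; the datastore path is the one from my single-instance install and will likely differ on other setups:

    import os

    # Path from my install: <SciDB base dir>/data/000/0/datastore -- adjust as needed.
    DATASTORE = 'data/000/0/datastore'

    def datastore_file_sizes(path=DATASTORE):
        # Map each .data / .data.fl file in the datastore directory to its size in bytes.
        sizes = {}
        for name in os.listdir(path):
            if name.endswith('.data') or name.endswith('.data.fl'):
                sizes[name] = os.path.getsize(os.path.join(path, name))
        return sizes

    for name, size in sorted(datastore_file_sizes().items()):
        print('%-24s %12d bytes' % (name, size))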


#2

Have a look at this post, “What is this chunk map thingy?”:

viewtopic.php?f=11&t=1330&p=2815&hilit=chunk+map&sid=891f4c74cd7f7ec84e459d8bc6ac2d4a#p2815

It details how to figure out how big SciDB believes arrays to be.


#3

Hi, I’ve checked the post you provided, and I still wonder: is it possible to find out the volume in bytes an array occupies on disk with a single command, without delving into the details of individual chunks?

Thanks


#4

There has also been a need for this where I work. I implemented a Python solution similar to what you describe (sum the file sizes across all arrays on all instances), based on details supplied by my supervisor, who has a very good understanding of SciDB:

    import subprocess as sp

    # get_instances() returns pairs where instance[0] is the host name and
    # instance[1] is that instance's data directory.
    instances = self.get_instances()
    total_sizes = []
    for array_id in array_ids:
        self.log.debug('checking usage for array: %s', array_id)
        tally = 0
        for instance in instances:
            dir_path = instance[1] + '/datastores/'
            file_name = str(array_id) + '.data'

            # find with -exec avoids a CalledProcessError if the file
            # is not present on this instance
            cmd = 'ssh %s find %s -maxdepth 1 -type f -name %s -exec "du -k {} \\;"' % (
                instance[0], dir_path, file_name)
            self.log.debug('calling: %s', cmd)
            file_size = sp.check_output(cmd, shell=True, stderr=sp.STDOUT,
                                        universal_newlines=True).strip()

            if not file_size:
                continue
            # du -k prints "<kilobytes>\t<path>"
            size, _ = file_size.split('\t')
            self.log.debug('size: %s', size)
            tally += float(size)
            self.log.debug('tally: %s', tally)

        # tally is in kilobytes; convert() and humanize() are local helpers
        if units is not None:
            total_sizes.append('%.2f%s' % (convert(tally, units), units))
        else:
            total_sizes.append(humanize(tally))

#5

Hey everyone,

If it helps, there are a few other options.

  1. Interrogate the chunk map. You can do something like
    aggregate(filter(list('chunk map'), uaid = [UAID]), sum(usize), sum(csize), sum(asize), inst)
    where [UAID] is the unversioned array ID seen in list('arrays'), and usize, csize, and asize are the sizes in bytes of the uncompressed, compressed (if per-attribute compression is used), and allocated (rounded up to a power of two) chunk data. In the chunk map these values are reported per chunk and per attribute, but the query above aggregates them, grouped by instance; you can of course compute a grand total as well. If you have EE and replication turned on, this method will double-count the replica chunks. A minimal Python sketch of running this query through iquery is shown after this list.

  2. Use the summarize plugin. See: https://github.com/paradigm4/summarize . Compared to (1), this gives you only the “usize” values, but the advantage is that you can run it on TEMP arrays and on mid-query results. So you can say summarize(redimension(filter(....))) to get size and chunking estimates before you store something.
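
For anyone who wants to script option (1), here is a minimal sketch that drives iquery from Python. It assumes iquery is on the PATH and connects to SciDB with its default host and port; the afl() helper, the CSV parsing, and the my_array name are illustrative, not part of SciDB:

    import subprocess

    def afl(query):
        # Run an AFL query through the iquery client and return its CSV output.
        # Assumes iquery is on the PATH and uses the default host/port.
        return subprocess.check_output(
            ['iquery', '-a', '-o', 'csv', '-q', query],
            universal_newlines=True)

    def array_size_bytes(uaid):
        # Sum usize/csize/asize over the chunk map for one unversioned array ID.
        query = ("aggregate(filter(list('chunk map'), uaid = %d), "
                 "sum(usize), sum(csize), sum(asize))" % uaid)
        lines = afl(query).strip().splitlines()
        # Some iquery versions print a CSV header line; the sums are on the
        # last line either way.
        usize, csize, asize = (float(v) for v in lines[-1].split(','))
        return usize, csize, asize

    # Option (2) can be driven the same way, e.g.:
    #   print(afl("summarize(my_array)"))
    # where my_array is a placeholder for your own array.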

Hope it helps. Cheers!


#6

Thanks Alex, I’m gonna write something using summarize.