Questions about SciDB settings and storage


#1

After several months of research on SciDB 14.3, I am now in the final stage of writing my thesis, but a few questions still confuse me:

  1. Although the number of instances can be set in the config.ini file, SciDB seems to detect the number of CPU cores available and determine the maximum number of instances itself. For example, my laptop has two cores; when I configure four instances, scidb.py status shows only 2. Likewise, on a 2-core server hosting two virtual machines (Windows and Linux), each with 2 vCPUs, when I configure 2 instances for SciDB on Linux, status shows only one. Why?

  2. From checking the source code and the chunk map of a specific array, I deduce that SciDB applies run-length encoding (RLE) automatically. The effect is significant: after importing NetCDF files (which contain many zero values) into SciDB, the array storage size is one third of that of the original files. Is RLE really applied automatically?

  3. For a chunked storage structure, an unlimited dimension should make no difference compared with a limited one. In SciDB, if an array is created with an unlimited dimension, populated with some data, and then copied with “store” into another array (one that did not exist before), all dimensions of the new array become limited. In my benchmarks I did not observe any apparent difference between arrays with unlimited dimensions and arrays with limited dimensions. So whether a dimension is unlimited or not does not influence query performance. Is this true?


#2

Hi,

The number of instances is always user-controlled, because the right number depends on

  • CPU configuration
  • storage configuration, speed, number of devices / drives
  • network configuration
  • what queries you want to run (region image lookup or matrix multiply?)

Therefore it is very hard to automatically pick the right number of instances for all cases!

Furthermore, some queries will not use all instances (depending on how the data is distributed and what the query is). Some queries will not use all instances all the time.

Furthermore, some operators are multi-threaded, meaning many threads run on the same instance. For example, store(apply()) or store(filter()), as well as some linear algebra operators, will run multiple threads on each instance. Some linear algebra ops have a piece of code that auto-detects the number of threads to run; that behavior can be overridden. The number of threads per instance is configurable. Many ops redistribute data, and redistribution tends to use two threads (send + receive). So… the real picture is quite complex and hopefully we will simplify it soon.

On question 2: yes, that is true. RLE is always on by default.
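
To illustrate why run-length encoding helps so much with data containing long runs of zeros, here is a minimal Python sketch. It is only a generic illustration of the technique, not SciDB's actual chunk format; the function names `rle_encode` and `rle_decode` and the sample `cells` list are made up for this example.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of consecutive equal values into (value, run_length)."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# Hypothetical cell values with many zeros, as in a sparse NetCDF variable.
cells = [0, 0, 0, 0, 0, 0, 7, 7, 0, 0, 0, 3]
encoded = rle_encode(cells)

print(encoded)                       # 4 runs instead of 12 stored values
print(rle_decode(encoded) == cells)  # the encoding is lossless
```

The longer the runs of identical values, the fewer pairs are needed, which is consistent with the size reduction reported for zero-heavy files.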

On question 3: yes, that is true. Array boundaries don’t have an effect on storage.
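
One way to see why the upper boundary does not matter for chunked storage: locating the chunk that holds a cell only needs the dimension's start and the chunk length. The Python sketch below is an assumption-laden illustration of that idea, not SciDB's implementation; `chunk_origin` and its parameters are hypothetical names.

```python
def chunk_origin(coord, dim_start, chunk_length):
    """Return the first coordinate of the chunk containing `coord`.

    Note that the dimension's upper bound is not an input at all, so a
    bounded and an unbounded dimension map cells to chunks identically.
    """
    if chunk_length <= 0:
        raise ValueError("chunk_length must be positive")
    if coord < dim_start:
        raise ValueError("coord lies before the start of the dimension")
    return dim_start + ((coord - dim_start) // chunk_length) * chunk_length


# Same result whether the dimension is declared as [0:9999] or [0:*].
print(chunk_origin(2500, 0, 1000))
print(chunk_origin(17, 1, 10))
```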