Error while loading array


#1

Hi,

I have a SciDB cluster running on 8 nodes and I have a problem loading data from a previously saved array.
When I call:

create array it_data <value:uint16> [time=0:9,10,0, lat=0:7970,7971,0, lon=0:7940,7941,0];
load(it_data, '/home/scidbx/italy_data.out', -2, '(uint16 null)');

I get this message after a couple of minutes:

SystemException in file: src/network/BaseConnection.h function: receive line: 426
Error id: scidb::SCIDB_SE_NETWORK::SCIDB_LE_CANT_SEND_RECEIVE
Error description: Network error. Cannot receive network message: Read failed: End of file (asio.misc:2). An instance may be offline…

The problem is that when I check the nodes, all of them seem to be running (I call "list('instances');" on each node and get a correct response). However, I am not sure this is the correct way to check whether an instance is dead or not - the command from "Detecting Dead Instances or Entire Dead Servers" in the documentation - $ iquery -aq "list_instances()" - is not working.

And the strangest thing is that when I create the array from another template, it works:

create array it_aut_data <value:uint16> [time=0:*; lat=0:*; lon=0:*];
load(it_aut_data, '/home/scidbx/italy_data.out', -2, '(uint16 null)');

This works correctly. So my questions are:

  1. Why is this happening?
  2. How can I check for dead instances?
  3. Is there a way to load the array with the initial template?

Thanks!


#2

Hello Yevgeniy.

I am not sure why it's happening, but I suspect chunking and memory. Your initial schema of

create array it_data <value:uint16> [time=0:9,10,0, lat=0:7970,7971,0, lon=0:7940,7941,0];

That tells SciDB to organize the data in chunks of 10x7971x7941 cells. Essentially, you are telling SciDB to place all the data in a single block. That product, 10 x 7971 x 7941 = 632,977,110 cells, times 2 bytes per value, is about 1.2 GB. Add some overhead for the coordinates - at times they need to be materialized before the data is RLE-packed into the block - and we might be using too much memory. I can't say for sure, because I don't know what machine you are using.
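The arithmetic above can be checked with a short Python sketch (the helper function is mine for illustration, not a SciDB API):

```python
# Back-of-the-envelope footprint of one chunk: the product of the
# per-dimension chunk sizes, times the attribute width in bytes.
def chunk_footprint_bytes(chunk_sizes, bytes_per_value):
    cells = 1
    for n in chunk_sizes:
        cells *= n
    return cells, cells * bytes_per_value

# Chunk sizes from the original schema; uint16 is 2 bytes per value.
cells, size = chunk_footprint_bytes([10, 7971, 7941], 2)
print(cells)            # 632,977,110 cells in a single chunk
print(size / 2**30)     # roughly 1.2 GiB, before any materialization overhead
```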

If you are running out of memory you could try some of the tips from this post:

For more info on chunking, see this paper:

One way to diagnose the problem is to look at the scidb.log files. Each instance produces a separate log. The path would be something like BASE_PATH/0/0/scidb.log for instance 0 - where BASE_PATH is specified in your config.ini.

In the log, search for "Start SciDB". Every time an instance starts, it prints that line to the log, and every time an instance dies unexpectedly, the watchdog restarts it. So the few lines just before a "Start SciDB" entry are usually the most important for diagnosing where the problem happened. If it is an OOM kill or (hopefully not) a segfault, it will usually also show up in the output of the dmesg command, which prints the Linux kernel messages.
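That search can be automated with a small Python sketch. The log content below is made up for illustration; the actual format of your scidb.log lines will differ.

```python
def restart_contexts(log_text, before=3):
    """Return the lines preceding each 'Start SciDB' marker, skipping
    the very first start (which is a normal startup, not a crash)."""
    lines = log_text.splitlines()
    contexts = []
    for i, line in enumerate(lines):
        if "Start SciDB" in line and i > 0:
            contexts.append(lines[max(0, i - before):i])
    return contexts

# Fabricated sample log, just to show the shape of the output.
sample_log = """\
2016-01-01 10:00:00 [INFO] Start SciDB
2016-01-01 10:05:00 [INFO] query started
2016-01-01 10:07:12 [ERROR] std::bad_alloc
2016-01-01 10:07:15 [INFO] Start SciDB
"""

for context in restart_contexts(sample_log):
    print("--- lines before a restart ---")
    for line in context:
        print(line)
```

In real use you would read BASE_PATH/0/0/scidb.log (and the other instances' logs) instead of the sample string.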

I suppose it doesn't happen when you use the auto-chunked template because in that case the system picks a different chunk size.