SciDB not starting up


#1

Dear SciDB team,

I’m trying to set up a 3-node SciDB cluster on EC2.
Specs (identical for all EC2 instances):
Ubuntu 14.04
PostgreSQL 9.3
SciDB 14.12

My config.ini:

[single_server]
server-0=127.0.0.1,0
server-1='ec2-instance2-ip',1
server-2='other-ec2-instance3-ip',1
install_root=/opt/scidb/14.12
pluginsdir=/opt/scidb/14.12/lib/scidb/plugins
logconf=/opt/scidb/14.12/share/scidb/log4cxx.properties
db_user=scidb_pg_user
db_passwd=scidb_pg_user_pasw
base-port=1239
base-path=/home/scidb/scidb_data
redundancy=0
execution-threads=4
result-prefetch-threads=2
result-prefetch-queue-size=1
operator-threads=1

Of course ‘single_server’ is a bad name, but that’s what the supplied config.ini had,
and when I tried to use another name, such as ‘cluster1’, it failed with the error:

Cannot connect to PostgreSQL catalog: 'FATAL:  database "cluster1" does not exist

That’s why I stuck to ‘single_server’.
However, when I run

scidb.py start_all single_server

on the monitor instance, the startup never finishes (as far as I can tell from stdout), so I had to Ctrl+C it after some time.
When I inspect the log on the monitor instance, i.e. 000/0/scidb.log, this is what I get:

...
2015-07-06 00:19:25,784 [0x7f8bc3023700] [DEBUG]: Prepare physical plan was sent out
2015-07-06 00:19:25,784 [0x7f8bc3023700] [DEBUG]: Waiting confirmation about preparing physical plan in queryID from 3 instances
2015-07-06 00:19:25,784 [0x7f8bc3023700] [INFO ]: Executing query(1100947785991): list('queries'); from program: 127.0.0.1:42582/opt/scidb/14.12/bin/iquery -c 127.0.0.1 -p 1239 -naq list('queries') ;
2015-07-06 00:19:25,784 [0x7f8bc3023700] [DEBUG]: Waiting notification in queryID from 3 instances
2015-07-06 00:19:25,785 [0x7f8bce5d5800] [ERROR]: Could not get the remote IP from connected socket to/frominstance 1. Error:107('Transport endpoint is not connected')
2015-07-06 00:19:25,785 [0x7f8bce5d5800] [DEBUG]: Connected to instance 1, localhost:1240
2015-07-06 00:19:25,785 [0x7f8bce5d5800] [ERROR]: Network error in handleSendMessage #32('Broken pipe'), instance 1
2015-07-06 00:19:25,785 [0x7f8bce5d5800] [ERROR]: NetworkManager::handleConnectionError: Conection error in query 1100947785991
2015-07-06 00:19:25,785 [0x7f8bce5d5800] [DEBUG]: Recovering connection to instance 1
2015-07-06 00:19:25,785 [0x7f8bce5d5800] [ERROR]: Could not get the remote IP from connected socket to/frominstance 2. Error:107('Transport endpoint is not connected')
2015-07-06 00:19:25,785 [0x7f8bce5d5800] [DEBUG]: Connected to instance 2, localhost:1241
2015-07-06 00:19:25,785 [0x7f8bce5d5800] [ERROR]: Network error in handleSendMessage #32('Broken pipe'), instance 2
2015-07-06 00:19:25,785 [0x7f8bce5d5800] [ERROR]: NetworkManager::handleConnectionError: Conection error in query 1100947785991
2015-07-06 00:19:25,785 [0x7f8bce5d5800] [DEBUG]: Recovering connection to instance 2
2015-07-06 00:19:25,785 [0x7f8bce5d5800] [ERROR]: Could not get the remote IP from connected socket to/frominstance 3. Error:107('Transport endpoint is not connected')
2015-07-06 00:19:25,785 [0x7f8bce5d5800] [DEBUG]: Connected to instance 3, localhost:1242
2015-07-06 00:19:25,785 [0x7f8bce5d5800] [ERROR]: Network error in handleSendMessage #32('Broken pipe'), instance 3
2015-07-06 00:19:25,785 [0x7f8bce5d5800] [ERROR]: NetworkManager::handleConnectionError: Conection error in query 1100947785991
2015-07-06 00:19:25,785 [0x7f8bce5d5800] [DEBUG]: Recovering connection to instance 3
2015-07-06 00:19:25,785 [0x7f8bce5d5800] [ERROR]: Could not get the remote IP from connected socket to/frominstance 1. Error:107('Transport endpoint is not connected')
...

Password-less SSH login from the monitor to itself and to the workers is enabled.
Moreover, if I run ps aux | grep scidb on all EC2 instances, I get:

Monitor:

...
scidb     7679  0.0  1.0 186608 10568 ?        S    00:19   0:00 /home/scidb/scidb_data/000/0/SciDB-000-0-single_server -i 127.0.0.1 -p 1239 -k -l /opt/scidb/14.12/share/scidb/log4cxx.properties -s /home/scidb/scidb_data/000/0/storage.cfg --install_root=/opt/scidb/14.12 --result-prefetch-queue-size=1 --redundancy=0 --operator-threads=1 --merge-sort-buffer=75 --mem-array-threshold=150 --pluginsdir=/opt/scidb/14.12/lib/scidb/plugins --execution-threads=4 --sg-send-queue-size=15 --sg-receive-queue-size=15 --result-prefetch-threads=2 --smgr-cache-size=150 -c host=127.0.0.1 port=5432 dbname=single_server user=scidb_pg_user password=scidb_pg_user_pasw
scidb     7682  0.0  1.0 537308 10220 ?        Sl   00:19   0:00 /home/scidb/scidb_data/000/0/SciDB-000-0-single_server -i 127.0.0.1 -p 1239 -k -l /opt/scidb/14.12/share/scidb/log4cxx.properties -s /home/scidb/scidb_data/000/0/storage.cfg --install_root=/opt/scidb/14.12 --result-prefetch-queue-size=1 --redundancy=0 --operator-threads=1 --merge-sort-buffer=75 --mem-array-threshold=150 --pluginsdir=/opt/scidb/14.12/lib/scidb/plugins --execution-threads=4 --sg-send-queue-size=15 --sg-receive-queue-size=15 --result-prefetch-threads=2 --smgr-cache-size=150 -c host=127.0.0.1 port=5432 dbname=single_server user=scidb_pg_user password=scidb_pg_user_pasw
postgres  7683  0.0  0.9 250916  9284 ?        Ss   00:19   0:00 postgres: scidb_pg_user single_server 127.0.0.1(39787) idle 
...

Worker1:

...
scidb     6904  0.0  1.0 186608 10616 ?        S    00:19   0:00 /home/scidb/scidb_data/001/1/SciDB-001-1-single_server -i 172.31.24.79 -p 1240 -k -l /opt/scidb/14.12/share/scidb/log4cxx.properties -s /home/scidb/scidb_data/001/1/storage.cfg --install_root=/opt/scidb/14.12 --result-prefetch-queue-size=1 --redundancy=0 --operator-threads=1 --merge-sort-buffer=75 --mem-array-threshold=150 --pluginsdir=/opt/scidb/14.12/lib/scidb/plugins --execution-threads=4 --sg-send-queue-size=15 --sg-receive-queue-size=15 --result-prefetch-threads=2 --smgr-cache-size=150 -c host=127.0.0.1 port=5432 dbname=single_server user=scidb_pg_user password=scidb_pg_user_pasw
...

Worker2:

scidb     6874  0.0  1.0 186608 10612 ?        S    00:19   0:00 /home/scidb/scidb_data/002/1/SciDB-002-1-single_server -i 172.31.24.78 -p 1240 -k -l /opt/scidb/14.12/share/scidb/log4cxx.properties -s /home/scidb/scidb_data/002/1/storage.cfg --install_root=/opt/scidb/14.12 --result-prefetch-queue-size=1 --redundancy=0 --operator-threads=1 --merge-sort-buffer=75 --mem-array-threshold=150 --pluginsdir=/opt/scidb/14.12/lib/scidb/plugins --execution-threads=4 --sg-send-queue-size=15 --sg-receive-queue-size=15 --result-prefetch-threads=2 --smgr-cache-size=150 -c host=127.0.0.1 port=5432 dbname=single_server user=scidb_pg_user password=scidb_pg_user_pasw

which means the SciDB daemons are up.
However, when I do a

netstat -tulpn | grep :1239

on all EC2 instances I get:

Monitor:

tcp        0      0 0.0.0.0:1239            0.0.0.0:*               LISTEN      7682/SciDB-000-0-si

Worker1&2:

nothing

which means the daemons are up but only the monitor is listening on the (default) port 1239.

Thanks for your help.


#2

Hi. Do you recall when your SciDB 14.12 packages were downloaded? The 14.12 packages were rebuilt on June 5th to address a problem similar to the one you describe:

paradigm4.com/forum/viewtopi … =14&t=1526

If your packages are older, download again and reinstall.


#3

I downloaded 14.12 on the first of this month.


#4

In the config file(s), try specifying server-0 with its real IP address rather than 127.0.0.1. I think the other servers are trying to talk to themselves rather than to server-0.

(Also, for historical reasons the SciDB ports on servers other than server-0 begin numbering at 1240 rather than 1239. So when looking to see that they are up and listening, that’s the number to look for.)
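A small sketch of that port numbering, derived from the note above and the ps output in the original post (base-port=1239 as in the config):

```shell
# Per the note above: on server-0, instance ports start at base-port (1239);
# on every other server, they start at base-port + 1 (1240).
base_port=1239
echo "server-0 first instance -> port $base_port"
for server in 1 2; do
  echo "server-$server first instance -> port $((base_port + 1))"
done
```

So on the workers the check would be netstat -tulpn | grep :1240 rather than :1239.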


#5

Ok.
So here are my modified config.ini’s

master:

[single_server]
server-0=127.0.0.1,0
server-1=IPWorker1,1
server-2=IPWorker2,1
install_root=/opt/scidb/14.12
pluginsdir=/opt/scidb/14.12/lib/scidb/plugins
logconf=/opt/scidb/14.12/share/scidb/log4cxx.properties
db_user=scidb_pg_user
db_passwd=scidb_pg_user_pasw
base-port=1239
base-path=/home/scidb/scidb_data
redundancy=0
...

worker1:

[single_server]
server-0=IPMaster,0
server-1=127.0.0.1,1
server-2=IPWorker2,1
install_root=/opt/scidb/14.12
pluginsdir=/opt/scidb/14.12/lib/scidb/plugins
logconf=/opt/scidb/14.12/share/scidb/log4cxx.properties
db_user=scidb_pg_user
db_passwd=scidb_pg_user_pasw
base-port=1239
base-path=/home/scidb/scidb_data
redundancy=0
...

worker2:

[single_server]
server-0=IPMaster,0
server-1=IPWorker1,1
server-2=127.0.0.1,1
install_root=/opt/scidb/14.12
pluginsdir=/opt/scidb/14.12/lib/scidb/plugins
logconf=/opt/scidb/14.12/share/scidb/log4cxx.properties
db_user=scidb_pg_user
db_passwd=scidb_pg_user_pasw
base-port=1239
base-path=/home/scidb/scidb_data
redundancy=0

However, when I start SciDB up again

scidb.py start_all single_server

it hangs again, and when I Ctrl+C and inspect the scidb.log files I get

master:

...
2015-07-07 15:38:34,595 [0x7fb73662c700] [DEBUG]: Waiting notification in queryID from 3 instances
2015-07-07 15:38:34,595 [0x7fb741bde800] [ERROR]: Could not get the remote IP from connected socket to/frominstance 1. Error:107('Transport endpoint is not connected')
2015-07-07 15:38:34,595 [0x7fb741bde800] [DEBUG]: Connected to instance 1, localhost:1240
2015-07-07 15:38:34,595 [0x7fb741bde800] [ERROR]: Network error in handleSendMessage #32('Broken pipe'), instance 1
2015-07-07 15:38:34,595 [0x7fb741bde800] [ERROR]: NetworkManager::handleConnectionError: Conection error in query 1100947927180
2015-07-07 15:38:34,595 [0x7fb741bde800] [DEBUG]: Recovering connection to instance 1
2015-07-07 15:38:34,595 [0x7fb741bde800] [ERROR]: Could not get the remote IP from connected socket to/frominstance 2. Error:107('Transport endpoint is not connected')
2015-07-07 15:38:34,595 [0x7fb741bde800] [DEBUG]: Connected to instance 2, localhost:1241
2015-07-07 15:38:34,595 [0x7fb741bde800] [ERROR]: Network error in handleSendMessage #32('Broken pipe'), instance 2
2015-07-07 15:38:34,595 [0x7fb741bde800] [ERROR]: NetworkManager::handleConnectionError: Conection error in query 1100947927180
2015-07-07 15:38:34,595 [0x7fb741bde800] [DEBUG]: Recovering connection to instance 2
2015-07-07 15:38:34,595 [0x7fb741bde800] [ERROR]: Could not get the remote IP from connected socket to/frominstance 3. Error:107('Transport endpoint is not connected')
2015-07-07 15:38:34,595 [0x7fb741bde800] [DEBUG]: Connected to instance 3, localhost:1242
2015-07-07 15:38:34,595 [0x7fb741bde800] [ERROR]: Network error in handleSendMessage #32('Broken pipe'), instance 3
2015-07-07 15:38:34,595 [0x7fb741bde800] [ERROR]: NetworkManager::handleConnectionError: Conection error in query 1100947927180
2015-07-07 15:38:34,596 [0x7fb741bde800] [DEBUG]: Recovering connection to instance 3
2015-07-07 15:38:34,596 [0x7fb741bde800] [ERROR]: Could not get the remote IP from connected socket to/frominstance 1. Error:107('Transport endpoint is not connected')
...

worker1:

...
2015-07-07 15:38:34,323 [0x7fd8109f1800] [DEBUG]: Network manager is intialized
2015-07-07 15:38:34,323 [0x7fd8109f1800] [DEBUG]: NetworkManager::run()
2015-07-07 15:38:34,323 [0x7fd8109f1800] [ERROR]: Error during SciDB execution: UserException in file: src/network/NetworkManager.cpp function: run line: 142
Error id: scidb::SCIDB_SE_STORAGE::SCIDB_LE_STORAGE_NOT_REGISTERED
Error description: Storage error. Storage is not registered in system catalog.
2015-07-07 15:38:34,323 [0x7fd8109f1800] [INFO ]: SciDB instance. SciDB Version: 14.12.9095. Build Type: RelWithDebInfo. Copyright (C) 2008-2014 SciDB, Inc. is exiting.
...

worker2:

...
2015-07-07 15:38:35,488 [0x7fd8c72d6800] [DEBUG]: Network manager is intialized
2015-07-07 15:38:35,488 [0x7fd8c72d6800] [DEBUG]: NetworkManager::run()
2015-07-07 15:38:35,489 [0x7fd8c72d6800] [ERROR]: Error during SciDB execution: UserException in file: src/network/NetworkManager.cpp function: run line: 142
Error id: scidb::SCIDB_SE_STORAGE::SCIDB_LE_STORAGE_NOT_REGISTERED
Error description: Storage error. Storage is not registered in system catalog.
2015-07-07 15:38:35,489 [0x7fd8c72d6800] [INFO ]: SciDB instance. SciDB Version: 14.12.9095. Build Type: RelWithDebInfo. Copyright (C) 2008-2014 SciDB, Inc. is exiting.
...

Why would the master try to connect to

localhost:1240
localhost:1241
localhost:1242

when my config.ini files clearly say that there are two remote workers, not three local ones?
It seems to me that SciDB is somehow still using the default single_server settings, because, if I remember correctly,
that configuration has one master and 3 local workers (it’s a standalone setup).

Also,

netstat -tulpn | grep :1239

on the monitor gives

tcp        0      0 0.0.0.0:1239            0.0.0.0:*               LISTEN      10549/SciDB-000-0-s

whereas on the workers

netstat -tulpn | grep :1240

gives nothing (because the workers obviously couldn’t start up properly, due to the

Error id: scidb::SCIDB_SE_STORAGE::SCIDB_LE_STORAGE_NOT_REGISTERED

error). So, how would I get rid of that error? Obviously I have to ‘register the storage’, but here’s my general problem:
I’m somehow stuck with using single_server instead of my own cluster name, e.g. ‘cluster1’. Maybe SciDB is still trying to use the old single_server settings, and maybe that requires me to register a new cluster instance in the database, similar to what I would have to do to ‘register storage’. But how can I register a new instance name and/or register storage if I can’t start up the DB in the first place?


#6

(Apologies for the delayed response, the forum server was having trouble!)

There are a couple of things going on here. I think what’s giving you the most trouble is that, rather than trying to reuse the single_server configuration, you should add a new cluster to the config.ini files. You can keep the original [single_server] part of the file as-is, but add a new [cluster1] part with your desired configuration.

It’s best practice to always use actual IP addresses rather than localhost or 127.0.0.1, so that the same config.ini file can be pushed out to all server nodes.
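For example, a new section along these lines could be appended to the same config.ini on every node (the server-0 address here is a made-up placeholder; the worker addresses are the private IPs from your ps output, and the base-path is just an example):

```ini
[cluster1]
server-0=172.31.24.77,0
server-1=172.31.24.79,1
server-2=172.31.24.78,1
install_root=/opt/scidb/14.12
pluginsdir=/opt/scidb/14.12/lib/scidb/plugins
logconf=/opt/scidb/14.12/share/scidb/log4cxx.properties
db_user=scidb_pg_user
db_passwd=scidb_pg_user_pasw
base-port=1239
base-path=/home/scidb/cluster1_data
redundancy=0
```

Because only real IPs appear, the identical file works on all three nodes.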

You will also need to choose where to keep the cluster’s data stores. You do this using the base-path=/some/path/to/a/directory configuration directive. Choose a different path name than the one being used by the single_server cluster. The named directory should exist on all servers and be read/write for the scidb user.

Once you have the config.ini files set up and pushed (and presuming you have called your cluster “cluster1”), you will need to run

scidb.py initall cluster1

This is the step that creates and initializes the data directories (below the base-path) and also sets up the system catalog. The system catalog is kept inside Postgres, with one set of tables for each initialized cluster. It’s where SciDB keeps various metadata.

Give that a try and let me know how it goes!

Mike