Cannot start SciDB after installation


#1

Hello

I’m trying to install SciDB on two Amazon AWS EC2 servers. On the step to install postgres (https://paradigm4.atlassian.net/wiki/display/ESD/Pre-Installation+Tasks#Pre-InstallationTasks-InstallingPostgres) on host0, I ran into a problem. I had been referring to host0 as “matDB1” and had defined it in /etc/hosts, but when I attempted to use it in the postgres installation command as part of the netmask, the command failed. Here is the command I tried initially:
deployment/deploy.sh prepare_postgresql postgres postgres matDB1/24 matDB1

(based on the instructions and ifconfig results on my system, I set the netmask parameter to matDB1/24)

I tried running it again using the IP address for matDB1 instead of the name, and it failed again. I also tried uninstalling postgres and deleting any leftover postgres files, without luck.

I continued the installation using my second host (matDB2) as host0, and the installation steps worked; however, when I try to start the system it fails. I’m planning to redo the installation from scratch (starting from a fresh OS install), but I’d like to understand the right way to do the postgres install before doing that.

Thank you in advance
Dave


#2

P.S. The short version of my question is: what should I use for the netmask during the postgres installation step? It appears it has to contain the IP address and not the alias / host name; will that cause conflicts with other parts of the installation where host0 is referred to by the alias / host name?
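
For concreteness, here’s the same command with a numeric CIDR network address in place of the hostname - this is the form I’ll try on the re-install (172.31.20.0/24 is a placeholder for whatever subnet ifconfig actually reports):

# usage: deploy.sh prepare_postgresql <pg-user> <pg-password> <network/mask> <host0>
deployment/deploy.sh prepare_postgresql postgres postgres 172.31.20.0/24 matDB1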


#3

Hi,

Your google doc isn’t showing up for me, alas.

Note that EC2 offers multiple layers of security. There’s the postgres setup in postgresql.conf and pg_hba.conf. Then there’s the EC2 Security Group feature, which lets you open certain ports between machines. Sometimes there’s also “iptables” - the OS-level firewall, common on RHEL/CentOS.

So, one “very easy” way is to do this:
First, set postgres on node0 to listen for connections from everywhere. postgresql.conf has

listen_addresses = '*'

And pg_hba.conf has:

local   all         all                          trust
host    all         all         0.0.0.0/0        trust
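
After changing those two files, restart postgres so it re-reads them - on Ubuntu, something like this (a sketch; the service name can vary by version):

# restart so the postgresql.conf / pg_hba.conf changes take effect
sudo service postgresql restart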

Then edit the “security group” in EC2 to allow specific IP addresses to access that node on port 5432. There’s a nice UI for this, but you may need to create multiple entries - one for each IP. Turn off iptables or anything like that, if it’s running. The effect is that, underneath, postgres listens for connections from everywhere, but the EC2 firewall only lets certain connections through.
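
If you’d rather script it than click through the console, the same rule can be added with the AWS CLI (the group id and peer IP below are placeholders for your own):

# allow matDB1's private IP to reach node0 on the postgres port
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 5432 \
    --cidr 172.31.20.6/32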

That help?


#4

Thank you, Alex! I did stumble over the ports initially, but I’m confident that port 5432 is open on node0 - I tested with telnet and nmap, and both confirmed it is open, so that doesn’t seem to be the problem.
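
For the record, these are the checks I ran from node1 against node0:

telnet matDB2 5432        # connects if the port is open
nmap -p 5432 matDB2       # reports 5432/tcp open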

Based on another post in this forum, I also opened ports 1239-1242, but that did not help either.

I should have stated this initially: I’m using Ubuntu 14 (no firewall enabled by default).

Another piece of information: there do not appear to be any SciDB processes starting up on node1 at any point while the startup script runs. NB: SciDB processes do start on node0, and they do not terminate, even though the startup script reports the failure to start.


#5

Ah - all right, let’s see your config.ini file. And, as you said, SciDB processes start on node 0, so let’s look at the scidb.log from process 0 - it’ll be in the data directory.


#6

Thank you - here’s my config.ini (from /opt/scidb/15.12/etc/):

[mydb]
server-0=matDB2,1
server-1=matDB1,1
db_user=mydb
redundancy=1
install_root=/opt/scidb/15.12
pluginsdir=/opt/scidb/15.12/lib/scidb/plugins
logconf=/opt/scidb/15.12/share/scidb/log4cxx.properties
base-path=/louis/matrixDB/mydb-DB
base-port=1239
interface=eth0
security=trust

I’ve attached a tgz containing the scidb log and other logs from the process 0 data directory.
scidb_logs.tgz (3.8 KB)


#7

Yep - that config looks OK, though we recommend setting redundancy=0 without SciDB EE, and you may want to be explicit about some performance settings: see Running Out of Memory! and some other posts about that.
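
For example, a sketch of the kind of explicit settings I mean, in the same [mydb] section - the values below are placeholders to illustrate, not tuned recommendations; size them for your instances per the linked post:

redundancy=0
execution-threads=4
result-prefetch-threads=4
mem-array-threshold=1024
smgr-cache-size=1024
merge-sort-buffer=128
max-memory-limit=4096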

The log is unremarkable. Likely we’ll need to see a log from matDB1.

One possible cause: the initall script creates the .pgpass file, in a place like “/home/user” or similar. That file needs to be copied to the other nodes, at the same location. If that’s the problem, you should see a “can’t authenticate for postgres” type error on matDB1.
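
That copy is something like this, run from matDB2 (assuming .pgpass sits in the scidb user’s home directory):

# replicate the postgres credentials file and keep its required 0600 mode
scp ~/.pgpass matDB1:~/.pgpass
ssh matDB1 chmod 600 ~/.pgpass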


#8

Thank you Alex, will definitely adjust the config / performance settings and read the other thread!

I actually think I have it up and running. The short version: I misinterpreted the start instructions - I didn’t realize I needed to run the start on each node. I did that now and it appears to be working!!

Quick sanity check - I see these processes running on node0 (matDB2):

SciDB-0-0-mydb
SciDB-0-1-mydb
SciDB-0-2-mydb
SciDB-0-3-mydb
SciDB-0-0-mydb
SciDB-0-1-mydb
SciDB-0-2-mydb
SciDB-0-3-mydb

and I see these processes running on node1 (matDB1):

SciDB-1-0-mydb
SciDB-1-1-mydb
SciDB-1-0-mydb
SciDB-1-1-mydb

Does that seem reasonable / make sense?

Long version: the datetime stamps on the logs on node1 (matDB1) were not consistent with my latest attempt to start SciDB. There was no postgres error in them, and the .pgpass file was present and contained the correct info (same as on node0 (matDB2)). That caused me to try starting SciDB on both nodes in quick succession, which now appears to be working - I’m tooling around with some basic iquery, which did not work before.


#9

No, you only run initall and startall on one node. Moreover, you set up the config.ini on one node (on CE). You may have just set up two independent clusters: list('instances') should tell you for sure.
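
That is, run this against the coordinator and check whether all four instances show up in a single list:

iquery -aq "list('instances')"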


#10

Here’s the result of list(‘instances’):

{No} name,port,instance_id,online_since,instance_path
{0} 'matDB2',1239,0,'2017-04-22 19:52:37','/louis/matrixDB/mydb-DB/0/0'
{1} 'matDB2',1240,1,'2017-04-22 19:52:37','/louis/matrixDB/mydb-DB/0/1'
{2} 'matDB1',1239,4294967298,'2017-04-22 19:52:38','/louis/matrixDB/mydb-DB/1/0'
{3} 'matDB1',1240,4294967299,'2017-04-22 19:52:38','/louis/matrixDB/mydb-DB/1/1'


#11

Yep, that looks good. Except somehow there are 8 processes on matDB2; there should only be 4.


#12

Hmm. I ran the stopall command on node0 (matDB2) and it stopped all the processes on node0, but not on node1 (matDB1). I manually killed the processes on node1 and then tried running the startall command from node0 - no luck. I’ve attached the log files as they are now from both nodes, although note that the node1 (matDB1) scidb.log is from the previous run (its timestamp does not match the latest on e.g. node0 and/or node1/storage.header).

Should I re-install from scratch? Or, alternatively, should I email about getting access to an AMI to use? Thanks again for your help! I’m looking forward to using SciDB.

edit: attached the latest log files
2017-05-10-11AM_scidb_logs_matDB1.tgz (12.2 KB)
2017-05-10_11AM_scidb_logs_matDB2.tgz (3.9 KB)


#13

You shouldn’t need to re-install. Looks like all the binaries are there. It should just be a matter of getting the config right.

In scidb-stderr.log on matDB2 I do see things like this:

2017-05-10 07:22:40 (ppid=2064): Started.
2017-05-10 07:22:50 (ppid=2064): bind: Address already in use. Exiting.

That happens when we try to start a process but there’s already a process listening on that port. That’s likely a “double start” kind of scenario, where someone tries to launch a process that’s already running.
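
A quick way to see what’s already holding the port before the next startall (1239 is your base-port):

# show which process is listening on the SciDB base port
sudo lsof -i :1239
# or: sudo netstat -tlnp | grep 1239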

To recap, what you should have is as follows:

  1. postgres running on matDB2
  2. config.ini at matDB2 (as above)
  3. execute initall, then startall on matDB2 (see the sketch after this list)
  4. after startall you should see 4 processes on matDB2 and 4 processes on matDB1:
    a. two “watchdog” processes
    b. two actual instance processes
  5. the two instance processes (4.b.) should write logs, so you should see two sets of logs on each node:
    a. /louis/matrixDB/mydb-DB/0/0 on matDB2
    b. /louis/matrixDB/mydb-DB/0/1 on matDB2
    c. /louis/matrixDB/mydb-DB/1/0 on matDB1
    d. /louis/matrixDB/mydb-DB/1/1 on matDB1
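
For step 3, the commands look like this, run as the scidb user on matDB2 (‘mydb’ matches the section name in your config.ini):

# initall wipes and re-initializes the catalog and data directories;
# startall then launches instances on every server listed in config.ini
scidb.py initall mydb
scidb.py startall mydb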

To avoid confusion, I recommend stopping / disabling postgres on matDB1, removing any config.ini on matDB1, and making sure there are no SciDB processes running there. Note the “watchdogs” will restart processes that are killed, so stop instances with stopall rather than kill.
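
Something along these lines on matDB1, after a stopall from matDB2 so the watchdogs don’t respawn anything (a sketch assuming Ubuntu service names and your install paths):

# stop and disable postgres on the non-coordinator node
sudo service postgresql stop
sudo update-rc.d postgresql disable
# park the stray config and verify nothing is left running
sudo mv /opt/scidb/15.12/etc/config.ini /opt/scidb/15.12/etc/config.ini.bak
ps aux | grep -i scidb    # should show nothing but this grep itself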

I will email you with the AMI doc.