MPI init issue


#1

On SciDB 15.12, I am having trouble getting mpi_init to succeed:

# iquery -aq "load_library('dense_linear_algebra')"
Query was executed successfully
# iquery -aq "mpi_init()"
SystemException in file: src/mpi/MPISlaveProxy.cpp function: checkLauncher line: 59
Error id: scidb::SCIDB_SE_INTERNAL::SCIDB_LE_OPERATION_FAILED
Error description: Internal SciDB error. Operation 'MPI launcher process already terminated' failed.

mpi_init appears to run for a few seconds before failing. Here is the relevant snippet from scidb.log:

2017-01-27 21:48:28,627 [0x7f6a10445700] [DEBUG]: Next query ID: 0.1485553708621453161
2017-01-27 21:48:28,627 [0x7f6a10445700] [DEBUG]: MpiErrorHandler::cleanAll: next SHM object: .
2017-01-27 21:48:28,627 [0x7f6a10647700] [DEBUG]: 
[pPlan]:
>[pNode] MPIInitPhysical ddl 0 tile 0 children 0
  schema not empty public.mpi_init_array<mpi_init_attribute:string NOT NULL> [mpi_init_dimension=0:0 (4611686018427387903:-4611686018427387903),0,0] ArrayId: 0 UnversionedArrayId: 0 Version: 0 Flags: 0 Distro: dist: hash ps: 1 ctx:  redun: 0 off: {} shift: 0 res: [0, 1] <mpi_init_attribute:string NOT NULL>
  output full chunks: yes
  changes dstribution: no
  props sgm 1 sgo 1
  diout dist: hash ps: 1 ctx:  redun: 0 off: {} shift: 0 res: [0, 1]
  bound start {0} end {0} density 1 cells 1 chunks 0 est_bytes 22

2017-01-27 21:48:28,627 [0x7f6a10647700] [DEBUG]: (Pre)Single executing queryID: 0.1485553708621453161
2017-01-27 21:48:28,627 [0x7f6a10445700] [DEBUG]: MpiErrorHandler::cleanAll: next SHM object: ..
2017-01-27 21:48:28,627 [0x7f6a10445700] [DEBUG]: MpiErrorHandler::cleanAll: next pid object: .
2017-01-27 21:48:28,627 [0x7f6a10445700] [DEBUG]: MpiErrorHandler::cleanAll: next pid object: ..
2017-01-27 21:48:28,628 [0x7f6a10647700] [DEBUG]: The query plan is: [pPlan]:
>[pNode] MPIInitPhysical ddl 0 tile 0 children 0
  schema not empty public.mpi_init_array<mpi_init_attribute:string NOT NULL> [mpi_init_dimension=0:0 (4611686018427387903:-4611686018427387903),0,0] ArrayId: 0 UnversionedArrayId: 0 Version: 0 Flags: 0 Distro: dist: hash ps: 1 ctx:  redun: 0 off: {} shift: 0 res: [0, 1] <mpi_init_attribute:string NOT NULL>
  output full chunks: yes
  changes dstribution: no
  props sgm 1 sgo 1
  diout dist: hash ps: 1 ctx:  redun: 0 off: {} shift: 0 res: [0, 1]
  bound start {0} end {0} density 1 cells 1 chunks 0 est_bytes 22

2017-01-27 21:48:28,628 [0x7f6a10647700] [DEBUG]: The serialized form of the physical plan: queryID=0.1485553708621453161, physicalPlan='22 serialization::archive 10 8 1 0
0 0 0 0 1 1 8 mpi_init 15 MPIInitPhysical 0  0 0 0 0 0 0 6 public 14 mpi_init_array 0 0 1 0 0 0 0 18 mpi_init_attribute 0 0 0 0 6 string 0 0 0 0 0 0 -1 0 0 0  0 0 1 0 0 0 0 0 18 mpi_init_dimension 0 0 1 0 0 0 18 mpi_init_dimension 0 0 0 4611686018427387903 -4611686018427387903 0 0 0 0 1 0  0 0 0 0 2 0 0 1'
2017-01-27 21:48:28,628 [0x7f6a10647700] [DEBUG]: Prepare physical plan was sent out
2017-01-27 21:48:28,628 [0x7f6a10647700] [DEBUG]: Waiting confirmation about preparing physical plan in queryID from 1 instances
2017-01-27 21:48:28,628 [0x7f6a10647700] [INFO ]: Executing query(0.1485553708621453161): mpi_init(); from program: 127.0.0.1:51882/usr/src/scidbtrunk/stage/install/bin/iquery -aq mpi_init() ;
2017-01-27 21:48:28,628 [0x7f6a10647700] [DEBUG]: Waiting notification in queryID from 1 instances
2017-01-27 21:48:28,630 [0x7f6a1bbcd700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 0.1485553708621453161 from instance 1
2017-01-27 21:48:28,630 [0x7f6a10647700] [DEBUG]: Send message from coordinator for waiting instances in queryID: 0.1485553708621453161
2017-01-27 21:48:28,631 [0x7f6a10647700] [DEBUG]: MPIPhysical::launchMPISlaves(query, maxSlaves: 2) called.
2017-01-27 21:48:28,631 [0x7f6a10647700] [DEBUG]: syncBarrier: barrierId = 0
2017-01-27 21:48:28,631 [0x7f6a10647700] [DEBUG]: Sending barrier to every one and waiting for 1 barrier messages
2017-01-27 21:48:28,633 [0x7f6a10647700] [DEBUG]: All barrier messages received - continuing
2017-01-27 21:48:28,633 [0x7f6a10647700] [DEBUG]: syncBarrier: returning
2017-01-27 21:48:28,633 [0x7f6a10647700] [DEBUG]: syncBarrier: barrierId = 1
2017-01-27 21:48:28,633 [0x7f6a10647700] [DEBUG]: Sending barrier to every one and waiting for 1 barrier messages
2017-01-27 21:48:28,633 [0x7f6a10647700] [DEBUG]: All barrier messages received - continuing
2017-01-27 21:48:28,633 [0x7f6a10647700] [DEBUG]: syncBarrier: returning
2017-01-27 21:48:28,635 [0x7f6a10647700] [DEBUG]: MPI launcher process spawned, pid=792
2017-01-27 21:48:28,635 [0x7f6a10647700] [DEBUG]: MPIPhysical::launchMPISlaves(): slave->waitForHandshake() 1 called.
2017-01-27 21:48:28,635 [0x7f6a10647700] [DEBUG]: MpiSlaveProxy::waitForHandshake: launchId=1
2017-01-27 21:48:33,632 [0x7f6a10445700] [DEBUG]: Next query ID: 0.1485553708621453161
2017-01-27 21:48:33,632 [0x7f6a10445700] [DEBUG]: MpiErrorHandler::cleanAll: next SHM object: .
2017-01-27 21:48:33,632 [0x7f6a10445700] [DEBUG]: MpiErrorHandler::cleanAll: next SHM object: ..
2017-01-27 21:48:33,632 [0x7f6a10445700] [DEBUG]: MpiErrorHandler::cleanAll: next SHM object: SciDB-e1cb1772-e4d8-11e6-90ba-0242ac110002-0.1485553708621453161-0-1.launch_exec
2017-01-27 21:48:33,632 [0x7f6a10445700] [DEBUG]: MpiErrorHandler::cleanAll: next SHM object: SciDB-e1cb1772-e4d8-11e6-90ba-0242ac110002-0.1485553708621453161-0-1.launch_hosts
2017-01-27 21:48:33,632 [0x7f6a10445700] [DEBUG]: MpiErrorHandler::cleanAll: next pid object: .
2017-01-27 21:48:33,633 [0x7f6a10445700] [DEBUG]: MpiErrorHandler::cleanAll: next pid object: ..
2017-01-27 21:48:33,633 [0x7f6a10445700] [DEBUG]: MpiErrorHandler::cleanAll: next pid object: 0.1485553708621453161.1.mpirun
2017-01-27 21:48:38,636 [0x7f6a10647700] [DEBUG]: Query::done: queryID=0.1485553708621453161, _commitState=0, errorCode=33
2017-01-27 21:48:38,636 [0x7f6a10647700] [ERROR]: executeClientQuery failed to complete: SystemException in file: src/mpi/MPISlaveProxy.cpp function: checkLauncher line: 59
Error id: scidb::SCIDB_SE_INTERNAL::SCIDB_LE_OPERATION_FAILED
Error description: Internal SciDB error. Operation 'MPI launcher process already terminated' failed.
2017-01-27 21:48:38,636 [0x7f6a10647700] [DEBUG]: Query (0.1485553708621453161) is being aborted
2017-01-27 21:48:38,636 [0x7f6a10647700] [ERROR]: Query (0.1485553708621453161) error handlers (2) are being executed
2017-01-27 21:48:38,637 [0x7f6a10647700] [DEBUG]: MpiManager::removeCtx: queryID=0.1485553708621453161
2017-01-27 21:48:38,637 [0x7f6a10647700] [WARN ]: MpiErrorHandler::cleanupSlavePidFile: cannot read pids for: /usr/src/scidbtrunk/stage/DB-mydb/0/0/mpi_pid/0.1485553708621453161.1.mpi_slave_scidb
2017-01-27 21:48:38,637 [0x7f6a10647700] [DEBUG]: MpiErrorHandler::cleanupLauncherPidFile: killing launcher pid group = 792
2017-01-27 21:48:38,637 [0x7f6a10647700] [DEBUG]: MpiErrorHandler::killProc: killing process group pid = -792
2017-01-27 21:48:38,637 [0x7f6a10647700] [DEBUG]: MpiErrorHandler::killProc: failed to kill pid = -792 err = 3
2017-01-27 21:48:38,637 [0x7f6a10647700] [DEBUG]: MpiErrorHandler::cleanupLauncherPidFile: removing file: /usr/src/scidbtrunk/stage/DB-mydb/0/0/mpi_pid/0.1485553708621453161.1.mpirun
2017-01-27 21:48:38,637 [0x7f6a10647700] [ERROR]: SciDB MPI launcher (pid=-792) exited with status = 248
2017-01-27 21:48:38,638 [0x7f6a10647700] [ERROR]: Failed to destroy launcher for launch = 1 (0.1485553708621453161) because: SystemException in file: src/mpi/MPILauncher.cpp function: completeLaunch line: 296
Error id: scidb::SCIDB_SE_INTERNAL::SCIDB_LE_OPERATION_FAILED
Error description: Internal SciDB error. Operation 'MPI launcher process' failed.
2017-01-27 21:48:38,638 [0x7f6a10647700] [DEBUG]: Broadcast ABORT message to all instances for query 0.1485553708621453161
2017-01-27 21:48:38,638 [0x7f6a10647700] [DEBUG]: MpiManager::removeCtx: queryID=0.1485553708621453161
2017-01-27 21:48:38,638 [0x7f6a10647700] [DEBUG]: Releasing locks for query 0.1485553708621453161

I followed the troubleshooting steps in the documentation and verified SSH connectivity. I am running on a single server (localhost/127.0.0.1) with two SciDB instances. Here are the /dev/shm, /etc/hosts, and config.ini settings:

# df -h /dev/shm
Filesystem      Size  Used Avail Use% Mounted on
shm             8.0G     0  8.0G   0% /dev/shm
# cat /etc/hosts
127.0.0.1	localhost
fe00::0	ip6-localnet
ff00::0	ip6-mcastprefix
ff02::1	ip6-allnodes
ff02::2	ip6-allrouters
172.17.0.2	ac937743b965
# cat stage/install/etc/config.ini 
[mydb]
server-0=localhost,1
db_user=mydb
...

I tried using both localhost and 127.0.0.1 for server-0.
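For reference, the SSH connectivity check amounts to something like the following (a sketch, not a verbatim transcript; BatchMode=yes just makes ssh fail instead of prompting for a password):

# ssh -o BatchMode=yes localhost hostname
# ssh -o BatchMode=yes 127.0.0.1 hostname

Both return the hostname without prompting, so passwordless SSH does not appear to be the problem.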

Is there anything else I could check or try? Thanks!


#2

On closer inspection, I located the MPI launcher log file:

# cat /usr/src/scidbtrunk/stage/DB-mydb/0/0/mpi_log/0.1485815279601335452.1.mpirun.log 
LAUNCHER pid=633, pgid=633, ppid=316
LAUNCHER: maxfd = 65536
LAUNCHER pid=633 args for /opt/scidb/15.12/3rdparty/mpich2/bin/mpiexec.hydra are ready
LAUNCHER arg[0] = /opt/scidb/15.12/3rdparty/mpich2/bin/mpiexec.hydra
LAUNCHER arg[1] = -bootstrap-exec
LAUNCHER arg[2] = /dev/shm/SciDB-c7c7b6b2-e728-11e6-bae8-0242ac110002-0.1485815279601335452-0-1.launch_exec
LAUNCHER arg[3] = -f
LAUNCHER arg[4] = /dev/shm/SciDB-c7c7b6b2-e728-11e6-bae8-0242ac110002-0.1485815279601335452-0-1.launch_hosts
LAUNCHER arg[5] = -wdir
LAUNCHER arg[6] = /tmp
LAUNCHER arg[7] = -n
LAUNCHER arg[8] = 1
LAUNCHER arg[9] = /usr/src/scidbtrunk/stage/DB-mydb/0/0/mpi_slave_scidb
LAUNCHER arg[10] = 0
LAUNCHER arg[11] = 1239
LAUNCHER arg[12] = :
LAUNCHER arg[13] = -n
LAUNCHER arg[14] = 1
LAUNCHER arg[15] = /usr/src/scidbtrunk/stage/DB-mydb/0/1/mpi_slave_scidb
LAUNCHER arg[16] = 1
LAUNCHER arg[17] = 1240
[mpiexec@2c33fde3b24c] HYDU_create_process (./utils/launch/launch.c:72): execvp error on file /dev/shm/SciDB-c7c7b6b2-e728-11e6-bae8-0242ac110002-0.1485815279601335452-0-1.launch_exec (Permission denied)
Unknown signal 120 (signal 120)

This indicates an exec permission problem on /dev/shm. It turns out /dev/shm was mounted with the noexec flag:

# mount | grep shm
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,seclabel,size=8388608k)
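
A quick way to confirm the noexec effect and to clear it is sketched below (assuming /dev/shm is a plain tmpfs mount; exec_test is just a throwaway file name):

# cp /bin/true /dev/shm/exec_test
# /dev/shm/exec_test    # fails with "Permission denied" while noexec is set
# rm /dev/shm/exec_test
# mount -o remount,exec /dev/shm

To keep the change across reboots, the noexec option also needs to be dropped wherever /dev/shm is mounted at boot (e.g. in /etc/fstab).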

Once I remounted /dev/shm without the noexec flag, mpi_init worked:

# iquery -aq "mpi_init()"
{mpi_init_dimension} mpi_init_attribute