scidb.py initall hangs indefinitely


#1

We have a 32-node Amazon EC2 cluster; each node has 61430MB of RAM (as reported by free -m). I am configuring 3 instances per node. When I run scidb.py initall aesdb, it hangs. I left it running for an hour while I was in a meeting, and it was still hanging when I got back.
There was no new information in /data/scidb/0/000/.

The scidb.log was not regenerated, and there were no new files.
/opt/scidb is owned by user scidb.
Passwordless ssh works to localhost and 127.0.0.1, as well as between all hosts.

What can I check to see why it is hanging?

Here is the config.ini file:

[code]
[aesdb]
server-0=i32dft-master,2
server-1=i32dft-node001,3
server-2=i32dft-node002,3
server-3=i32dft-node003,3
server-4=i32dft-node004,3
server-5=i32dft-node005,3
server-6=i32dft-node006,3
server-7=i32dft-node007,3
server-8=i32dft-node008,3
server-9=i32dft-node009,3
server-10=i32dft-node010,3
server-11=i32dft-node011,3
server-12=i32dft-node012,3
server-13=i32dft-node013,3
server-14=i32dft-node014,3
server-15=i32dft-node015,3
server-16=i32dft-node016,3
server-17=i32dft-node017,3
server-18=i32dft-node018,3
server-19=i32dft-node019,3
server-20=i32dft-node020,3
server-21=i32dft-node021,3
server-22=i32dft-node022,3
server-23=i32dft-node023,3
server-24=i32dft-node024,3
server-25=i32dft-node025,3
server-26=i32dft-node026,3
server-27=i32dft-node027,3
server-28=i32dft-node028,3
server-29=i32dft-node029,3
server-30=i32dft-node030,3
server-31=i32dft-node031,3

install_root=/opt/scidb/14.12
pluginsdir=/opt/scidb/14.12/lib/scidb/plugins
logconf=/opt/scidb/14.12/share/scidb/log4cxx.properties
db_user=alt_user
db_passwd=alt_passwd
base-port=1239
base-path=/data/scidb
redundancy=0

# Threading: max_concurrent_queries=2, threads_per_query=2
# max_concurrent_queries + 2:
execution-threads=4
# max_concurrent_queries * threads_per_query:
result-prefetch-threads=4
# threads_per_query:
result-prefetch-queue-size=2
operator-threads=2

# Memory: 20476MB per instance, 15357MB reserved
# network: 6143MB per instance assuming 5MB average chunks
# in units of chunks per query:
sg-send-queue-size=307
sg-receive-queue-size=307
# caches: 6143MB per instance
smgr-cache-size=3071
mem-array-threshold=3071
# sort: 3071MB per instance (specified per thread)
merge-sort-buffer=768
# NOTE: Uncomment the following line to set a hard memory limit;
# NOTE: queries exceeding this cap will fail:
max-memory-limit=20476[/code]
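
For anyone double-checking a config like this, the thread settings and the per-instance memory cap can be recomputed from the formulas given in the config comments. The following is only a rough sketch: the function name `derive_settings` is made up for illustration, and the use of integer division for the memory split is an assumption that happens to match the 20476 figure above.

```python
# Inputs copied from the post above.
TOTAL_RAM_MB = 61430          # per node, as reported by `free -m`
INSTANCES_PER_NODE = 3
MAX_CONCURRENT_QUERIES = 2
THREADS_PER_QUERY = 2


def derive_settings(total_ram_mb, instances, max_queries, threads_per_query):
    """Recompute the settings whose formulas are stated in the config comments.

    Hypothetical helper, not part of scidb.py. Integer division for the
    per-instance memory is an assumption.
    """
    return {
        # max_concurrent_queries + 2
        "execution-threads": max_queries + 2,
        # max_concurrent_queries * threads_per_query
        "result-prefetch-threads": max_queries * threads_per_query,
        # threads_per_query
        "result-prefetch-queue-size": threads_per_query,
        "operator-threads": threads_per_query,
        # node RAM split evenly across the instances on that node
        "max-memory-limit": total_ram_mb // instances,
    }


settings = derive_settings(
    TOTAL_RAM_MB, INSTANCES_PER_NODE, MAX_CONCURRENT_QUERIES, THREADS_PER_QUERY
)
for key, value in settings.items():
    print(f"{key}={value}")
```

The recomputed values agree with the posted config, so the settings themselves look internally consistent.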

I can’t figure this out. I have done this a million times, but now it is hanging …
George


#2

In case anyone is reading this thread, let me post an update. We (P4) have been working on this problem offline with George’s help, and we have determined that the root cause of the reported issue is a bug in the scidb.py logic for spawning remote commands on cluster nodes. The logic for detecting when all commands have finished is faulty and can lead to a hang, especially if the number of nodes is large. We are preparing a fix for the issue, and after running it through QA we plan to cut a new 14.12 release and update the packages in our repositories. Stay tuned.
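
To illustrate the general class of problem (this is not the scidb.py code or the actual fix), here is a minimal sketch of one way to start many commands and wait for all of them without risking an indefinite hang. The function name `run_all` and the shared-deadline approach are assumptions made for the example; in a real cluster each command would presumably be an ssh invocation per host.

```python
import subprocess
import sys
import time


def run_all(commands, timeout_s=30.0):
    """Start every command, then wait for all of them against one shared deadline.

    Returns a list of (command, exit_code) pairs; exit_code is None for any
    command that was still running at the deadline and had to be killed.
    Hypothetical helper for illustration only.
    """
    # Output is discarded so a chatty child cannot block on a full pipe.
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        for cmd in commands
    ]
    deadline = time.monotonic() + timeout_s
    results = []
    for cmd, proc in zip(commands, procs):
        remaining = max(0.0, deadline - time.monotonic())
        try:
            code = proc.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            proc.kill()   # give up on this command instead of waiting forever
            proc.wait()   # reap the killed process
            code = None
        results.append((cmd, code))
    return results
```

With a bounded wait like this, a command that never reports completion shows up as a `None` exit code for a specific host rather than as a silent hang of the whole run.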

Steve F
P4


#3

For completeness: the 14.12 packages have now been updated. See viewtopic.php?f=14&t=1526