SciDB in Distributed Docker Network


#1

Hi SciDB Folks,

I’m trying to set up a SciDB Community Edition cluster on several physical hosts. On each host, I’m running a SciDB instance using the Docker container here. The Docker containers are assigned static IPs and communicate over an attachable overlay network created with Docker swarm. At the end of this message is a short script that shows how I’m setting up the cluster. My config.ini file looks like the following:

[scidb]
base-path=/opt/scidb/18.1/DB-scidb
base-port=1239
db_user=scidb
install_root=/opt/scidb/18.1
logconf=/opt/scidb/18.1/share/scidb/log4cxx.properties
pluginsdir=/opt/scidb/18.1/lib/scidb/plugins
server-0=127.0.0.1,23
server-1=10.0.0.31,23
server-2=10.0.0.32,23
server-3=10.0.0.33,23
server-4=10.0.0.34,23
server-5=10.0.0.35,23
server-6=10.0.0.36,23
server-7=10.0.0.37,23

I have confirmed that I can ssh and ping the container running each server from the master container. When I run scidb.py initall scidb, the log messages indicate that each server has been initialized. However, when I run iquery -aq "list('instances')", I only see the 23 instances on server-0. Furthermore, if I run a query on server-0 and then SSH to any other server, I see a bunch of SciDB processes in top, but none of them are using any CPU. I’m unsure how to proceed with debugging this issue, so any advice would be greatly appreciated. Thanks!
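
For reference, the checks I’ve been running from the master container look roughly like this (the IPs and database name match the setup script below):

# from inside scidb-master
ping -c 3 10.0.0.31
ssh 10.0.0.31 hostname

scidb.py initall scidb
scidb.py startall scidb
iquery -aq "list('instances')"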

Docker Setup Script:

# create a swarm and a shared network that containers can communicate over

docker swarm init
docker network create --driver overlay --attachable scidb-network

for w in 1 2 3 4 5 6 7; do
    ssh mycluster-slave-${w} "docker swarm leave"
    ssh mycluster-slave-${w} "docker swarm join --token SWMTKN-1-2kshzuxd5sr8pgfe5tkqshvqv79hizpuj7pxzq34nlyjq38g0h-eka3knvj4emmd575p4pvw5nph 10.11.10.22:2377";
done

# now we annoyingly need to create a useless service to expose the network to
# all workers in the swarm

docker service create --replicas 8 --network=scidb-network --name nginx nginx
docker service ps nginx

# now we can launch our SciDB containers and connect them to this overlay network

docker login --username athomas9t --password yyyy
docker pull athomas9t/scidb:v2

docker run --tty -d --name scidb-master -v /dev/shm \
    --net scidb-network --ip 10.0.0.30                     \
    --tmpfs /dev/shm:rw,nosuid,nodev,exec,size=90g         \
    --volume postgres1:/var/lib/postgresql/9.3/main        \
    --volume scidb1:/opt/scidb/18.1/DB-scidb               \
    -p 8080:8080 athomas9t/scidb:v2          

for w in 1 2 3 4 5 6 7; do
    ssh mycluster-slave-${w} "docker run --tty -d --name scidb-worker-${w} -v /dev/shm --net scidb-network --ip 10.0.0.3${w} --tmpfs /dev/shm:rw,nosuid,nodev,exec,size=90g --volume postgres1:/var/lib/postgresql/9.3/main --volume scidb1:/opt/scidb/18.1/DB-scidb -p 8080:8080 athomas9t/scidb:v2";
done

# now connect to the running master container and setup as usual
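# (a rough sketch - the exact steps depend on the image)
docker exec -it scidb-master /bin/bash
# inside the container: place the config.ini from above in /opt/scidb/18.1/etc/,
# then initialize and start the cluster
scidb.py initall scidb
scidb.py startall scidb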

#2

Hey Anthony. I’m not sure if this is the root cause, but it could be. In this line:

server-0=127.0.0.1,23

Replace 127.0.0.1 with the host’s real IP.
The problem scenario is that, say, node 2 sees that node 0 is at "127.0.0.1" and so it sends messages to "127.0.0.1", which end up going back to itself! So don’t use "localhost", "0.0.0.0", or "127.0.0.1" in multi-node configs.
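
In your case that would look something like this, using the static IP you gave the master container in your setup script:

server-0=10.0.0.30,23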

Another helpful check is

iquery -aq "list('instances')"

which gives you a sense of who is actually part of the cluster.

Hope this helps!


#3

Thanks for pointing that out! Unfortunately, it looks like that hasn’t solved the problem, though I still suspect this is a communication issue. Also, when I run scidb.py initall scidb there is a message: scidb.py: ERROR: (REMOTE) Remote command exceptions: Abnormal return code: 1 stderr:. Do you know if there’s a log file somewhere with a more detailed error message?


#4

Yeah - each instance keeps a data directory. For your config, look in:
/opt/scidb/18.1/DB-scidb/0
/opt/scidb/18.1/DB-scidb/0/0
/opt/scidb/18.1/DB-scidb/0/1
And so on…

Inside each instance’s subdirectory there should be a scidb.log, and usually that’s informative. There’s also a scidb-stderr.log that captures the more critical errors.
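
For example, something along these lines would pull the tail of the first instance’s scidb.log on each worker (this reuses the host/container names from your setup script; adjust the server/instance indices to match your config):

for w in 1 2 3 4 5 6 7; do
    ssh mycluster-slave-${w} \
        "docker exec scidb-worker-${w} tail -n 50 /opt/scidb/18.1/DB-scidb/${w}/0/scidb.log"
done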


#5

It turns out this was because Postgres on the worker nodes could not connect to the master: they were expecting an entry in .pgpass with the scidb password. Once I added this entry to .pgpass on all the hosts, things seem to be working.
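
For reference, the entry looks roughly like this (the catalog host’s IP and the scidb database/user from my config, Postgres’s default port, and xxxx standing in for the actual password); the file also needs to be chmod 600:

10.0.0.30:5432:scidb:scidb:xxxx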


#6

A related question: each node in the cluster I am running (8 nodes in total) has 24 cores and 200 GB of RAM. In the single-node setting I found that using 24 SciDB instances gave the best performance. However, running 24 instances per node seems to introduce large overheads when scaling up to 8 nodes. For example, running iquery -aq "list('instances')" takes over 10 minutes to complete. By comparison, when I reduce the number of instances per node to 8, the same query executes almost immediately. Do you have any sense of what a typical number of instances per node would be, or guidance on how to tune this?


#7

Hi, @ahthomas - yes, it is common to run 1 instance per core. Typically we would run fewer (1 instance per 2 or more cores) if we expect a constant multi-user load. A single query is expected to use up to 1 core per instance in most cases.

The overhead you are seeing (10 minutes) is larger than we typically see. Is that just regular iquery from the command line? Anything peculiar about your network setup?


#8

One thing to try: set these like so in your config.ini and restart SciDB:

sg-send-queue-size=192   
sg-receive-queue-size=192

#9

Thanks @apoliakov - yep, just running standard iquery from the command line. I’m running the SciDB instances inside Docker containers that communicate over a swarm overlay network. Transfer speeds between the Docker containers are almost exactly the same as between the host machines (about 7 s to transfer 1 GB via SCP), so I don’t think there’s serious overhead coming from Docker.

I’ve been using gemm(X, X, Z, transa: true) as a simple benchmark, with X a 10,000,000 x 100 matrix. If I ssh into one of the SciDB worker nodes and run top, CPU and memory use seem suspiciously low (i.e., the instances are basically idle). Additionally, I only see mpi_slave processes coming up on the master node.
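
For reference, the benchmark is roughly the following (dense_linear_algebra has to be loaded first; Z is just a conforming 100 x 100 result array, chunked the same way as X):

iquery -aq "load_library('dense_linear_algebra')"
iquery -n -aq "store(build(<val:double>[row=0:9999999:0:1000; col=0:99:0:1000], (RANDOM() % 100) / 10.0), X)"
iquery -n -aq "store(build(<val:double>[i=0:99:0:1000; j=0:99:0:1000], 0), Z)"
iquery -aq "consume(gemm(X, X, Z, transa: true))"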


#10

Hi,

Hmm. My first guess is that the problem may have to do with our broadcast algorithm, which is known to need some improvement. The list('instances') query is a "coordinator-only" query, but the executor does not know that, so it still broadcasts the query plan to the other 191 instances, and each one hands back an empty MemArray. The situation is possibly exacerbated by low-level issues in the overlay network.

I wonder if subsequent list(‘instances’) queries also take as long? Instance-to-instance TCP connections are set up lazily, so perhaps the first one takes a long time (10 minutes?! Yikes!!!) and subsequent queries are faster?

This is a shot in the dark, but I wonder if list(‘arrays’) also takes as long? Like list(‘instances’) it is a coordinator-only query that basically returns metadata from the Postgres catalog. If they don’t perform similarly that would be… interesting.

That the MPI slaves are not coming up is puzzling, but it’s likely that a 10,000,000 x 100 array is not big enough to have chunks resident on every node. Use summarize(X, by_instance: true) to see where most of the chunks live; that would tell you a better node to ssh into. It may be that only instances holding chunks spawn the MPI slaves; I have to go look at the gemm() code.
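
For example, assuming the summarize plugin is installed, something like:

iquery -aq "load_library('summarize')"
iquery -aq "summarize(X, by_instance: true)"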


#11

Hi @mjl - thanks for the feedback! Somewhat surprisingly, the first call to list('instances') is actually faster:

time iquery -aq "list('instances')"
real 1m49.213s

time iquery -aq "list('instances')"
real 5m29.789s

list('arrays') takes about the same amount of time:

time iquery -aq "list('arrays')"
real 7m30.883s

Also, maybe the following error will elucidate things (this is with 24 instances per host, i.e., 192 instances in total):

iquery -n -aq "store(build(<val:double>[row=0:9999999:0:1000;col=0:99:0:1000], (RANDOM() % 100) / 10.0), X)"

SystemException in file: include/util/WorkQueue.h function: reserve line: 218
Error id: scidb::SCIDB_E_NO_MEMORY::SCIDB_E_RESOURCE_BUSY
Error description: Not enough memory. Not enough resources: too many requests. Try again..
Failed query id: 0.1528230020846934951

The same query succeeds in less than one minute, though, when using 64 instances in total (8 per host).