Fatal Network Failure on EC2


#1

I was loading data into SciDB on EC2, but a network error occurred repeatedly during the load. I simply used the community AMI with the default dual-instance configuration. It seems that if any one of the 4 processes dies, SciDB never works again. Neither restarting SciDB nor rebooting the EC2 instance fixed it.

I can show what I did step by step. Assume Array1 and Array2 are two 8-GB or 4-GB double-precision arrays with the same schema; to simplify the data loading, I just populate Array1 with random integers.

  1. ./iquery -aq "store(build(array1, random()%100/1.0), array2);" -r /dev/null

SystemException in file: src/network/BaseConnection.h function: receive line: 294
Error id: scidb::SCIDB_SE_NETWORK::SCIDB_LE_CANT_SEND_RECEIVE
Error description: Network error. Cannot send or receive network messages.

  2. Then I ran “scidb.py check-pids mydb”:

checking (server 0 (localhost) local instance 0) 1354…
checking (server 0 (localhost) local instance 1) 1357…
Found 2 scidb processes

Clearly, two processes were dead at this time.

  3. I tried to restart all the SciDB instances, so first I ran “scidb.py stopall mydb”:

stop(server 0 (localhost) local instance 1)
stop(server 0 (localhost) local instance 0)
checking (server 0 (localhost) local instance 0) 1695…
checking (server 0 (localhost) local instance 1) 1694…
Found 2 scidb processes
checking (server 0 (localhost) local instance 0) …
checking (server 0 (localhost) local instance 1) 1694…
Found 1 scidb processes
checking (server 0 (localhost) local instance 0) …
checking (server 0 (localhost) local instance 1) …
Found 0 scidb processes

  4. Finally, I ran “scidb.py startall mydb”:

checking (server 0 (localhost) local instance 0) …
checking (server 0 (localhost) local instance 1) …
Found 0 scidb processes
start(server 0 (localhost) local instance 0)
Starting SciDB server.
start(server 0 (localhost) local instance 1)
Starting SciDB server.
Failed to start SciDB!

Note that at this point, even rebooting the EC2 instance did not bring SciDB back.

  5. Sometimes, if I was lucky, SciDB restarted with all 4 processes. However, as soon as I issued my first query, e.g., to check whether data was stored in Array1 (select count(*) from Array1;), the error from step 1 appeared again:

SystemException in file: src/network/BaseConnection.h function: receive line: 294
Error id: scidb::SCIDB_SE_NETWORK::SCIDB_LE_CANT_SEND_RECEIVE
Error description: Network error. Cannot send or receive network messages.

  6. At this point my iquery session was forced to exit, and when I tried to run “./iquery” again:

3 [0x7f0bd4c04780] FATAL scidb.services.network null - Error #system:111 when connecting to localhost:1239
3 [0x7f0bd4c04780] FATAL scidb.services.network null - Error #system:111 when connecting to localhost:1239
./iquery SystemException in file: src/network/BaseConnection.cpp function: connect line: 262
Error id: scidb::SCIDB_SE_NETWORK::SCIDB_LE_CONNECTION_ERROR
Error description: Network error. Error #system:111 when connecting to localhost:1239.

  7. Then I ran “scidb.py check-pids mydb”:

checking (server 0 (localhost) local instance 0) 2496…
checking (server 0 (localhost) local instance 1) 2499 2504…
Found 3 scidb processes

Clearly, 1 process was dead…

  8. If I ran stopall and then startall, I was back at step 4, and SciDB eventually failed to start.
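The "Error #system:111" in step 6 is worth a quick check on its own. On Linux, system error 111 is normally ECONNREFUSED, meaning nothing is listening on the port. Here is a minimal Python sketch, not part of SciDB, that probes a TCP port and reports the OS error code. The function names probe_port and describe are made up for illustration; localhost:1239 is simply the address from the error message above.

```python
import errno
import socket


def probe_port(host, port, timeout=2.0):
    """Return 0 if a TCP connection succeeds, otherwise the OS error number."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        # connect_ex returns an errno value instead of raising on failure.
        return sock.connect_ex((host, port))
    finally:
        sock.close()


def describe(code):
    """Turn the result of probe_port into a short human-readable string."""
    if code == 0:
        return "listening"
    if code == errno.ECONNREFUSED:
        return "connection refused (nothing is listening)"
    return errno.errorcode.get(code, "unknown error %d" % code)
```

Calling describe(probe_port("localhost", 1239)) right after startall would show whether the process that iquery connects to is actually accepting connections, which separates "the process is gone" from other network problems.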

Note that I once managed to load the same data into a SciDB installation on my own machine with the single-instance configuration. It seems that SciDB with a multi-instance configuration is more prone to network failures.
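Since the symptom in steps 2 and 7 is a process count below 4, the check can be automated. The following Python sketch counts the PIDs in check-pids output and reports how many processes are missing. It assumes the output looks exactly like the lines pasted above, which may not hold for other SciDB versions; count_pids and missing_processes are names invented here, and the expected total of 4 comes from this particular setup.

```python
import re

# Matches lines such as:
#   checking (server 0 (localhost) local instance 1) 2499 2504...
CHECK_LINE = re.compile(
    r"^checking \(server (\d+) \((.*?)\) local instance (\d+)\)(.*)$"
)


def count_pids(check_pids_output):
    """Map (server, local instance) to the list of PIDs on its 'checking' line."""
    found = {}
    for line in check_pids_output.splitlines():
        match = CHECK_LINE.match(line.strip())
        if match is None:
            continue  # skip "Found N scidb processes" and anything else
        server, _host, instance, rest = match.groups()
        pids = [int(p) for p in re.findall(r"\d+", rest)]
        found[(int(server), int(instance))] = pids
    return found


def missing_processes(check_pids_output, expected_total=4):
    """Return how many processes are missing relative to expected_total."""
    total = sum(len(pids) for pids in count_pids(check_pids_output).values())
    return max(expected_total - total, 0)
```

Fed the output from step 7, missing_processes reports one missing process, matching the "Found 3 scidb processes" line.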

-Yi


#2

Hi!

Man … no good deed goes unpunished.

We put out the AMI to give folks the simplest possible mechanism for getting started with SciDB. We deliberately configured the whole thing to be very small, simple, and self-contained. It’s really not intended for anything more than playing around. I suspect what you’re looking at is that the AMI’s physical instance is too small to run the default SciDB configuration (cache memory in particular) with the workload you have in mind. When the OS sees SciDB, under its default settings, asking for more memory than the OS thinks it has any right to, the OS will kill SciDB.
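To make that suspicion concrete, here is a rough back-of-the-envelope check in Python. It only compares a per-process memory budget, multiplied by the number of SciDB processes, against the machine's RAM. All the numbers and names below are hypothetical placeholders, not actual SciDB defaults or the real size of the AMI instance; plug in your own configuration values.

```python
def memory_demand_mb(num_processes, per_process_mb):
    """Total memory the SciDB processes may ask for, in MB."""
    return num_processes * per_process_mb


def fits_in_ram(num_processes, per_process_mb, physical_mb,
                os_reserve_fraction=0.2):
    """True if the combined demand fits in RAM after leaving some for the OS.

    os_reserve_fraction is an illustrative guess, not a measured figure.
    """
    usable_mb = physical_mb * (1.0 - os_reserve_fraction)
    return memory_demand_mb(num_processes, per_process_mb) <= usable_mb
```

For example, fits_in_ram(4, 1024, 1700) is False: four processes at 1 GB each cannot fit in a 1.7 GB machine, and that is the kind of situation in which the OS starts killing processes.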

If you want to use the Amazon EC2 infrastructure to test / evaluate / work with SciDB, here’s what I suggest you do:

  1. Provision as many (virtual) EC2 instances as you think you need. SciDB is designed to be a “shared nothing” system, but that doesn’t mean you have to run it on a bunch of little physical nodes. We’re quite happy, for example, running a 4-instance SciDB installation on one of Amazon’s mid-sized instances. Have a look at http://aws.amazon.com/ec2/instance-types/ for a list of EC2 instance types. Depending on what you want to do, any of the medium or large instance sizes is a good place to start.

  2. Grab a copy of the manual, which includes install and configuration instructions for SciDB, and install SciDB on that machine (or cluster of 'em). Our basic recommendation is “one instance for 4 cores”. Now … Amazon obscures some of the details about what they mean by “ECU” (EC2 Compute Unit) in this context. But as an ECU is supposed to deliver the “CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor”, and as those are quad-core chips, we figure 4 ECUs is about 16 cores, which is enough to run 4 SciDB instances.
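The sizing arithmetic in item 2 can be written down as a tiny Python sketch. The "4 cores per ECU" figure is only our estimate from above, not anything Amazon publishes, and the function names are made up for illustration.

```python
def cores_from_ecus(ecus, cores_per_ecu=4):
    """Estimate core count from ECUs; cores_per_ecu=4 is the guess made above."""
    return ecus * cores_per_ecu


def scidb_instances_for(cores, cores_per_instance=4):
    """Apply the 'one instance for 4 cores' rule, with a minimum of one instance."""
    return max(cores // cores_per_instance, 1)
```

With these assumptions, scidb_instances_for(cores_from_ecus(4)) gives 4, which is the 4-instance installation mentioned above. If the cores-per-ECU estimate is too generous for your instance type, lower cores_per_ecu and the recommended instance count drops accordingly.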

Hope this helps!

Paul