Persistenting data to remote nodes


#1

I wonder if any one has tried scidbpy for persisting data to remote nodes. This should be a common workload and I/O pattern for parallel algorithms. Here’s my use case:

I have a 4 GB numpy array and I want to store it to 4 instances on 2 physical nodes (i.e., 2 instances per node). I split the data on the 4th dimension ranging [0,288). The code looks like the following (‘0’ and ‘1’ are local instance IDs, ‘4294967298’ and ‘4294967299’ are remote instance IDs)

data_sdb_0 = sdb.from_array(data[:,:,:,0:72],0,name=‘mri_4_0’)
data_sdb_1 = sdb.from_array(data[:,:,:,72:144],1,name=‘mri_4_1’)
data_sdb_2 = sdb.from_array(data[:,:,:,144:216],4294967298,name=‘mri_4_2’)
data_sdb_3 = sdb.from_array(data[:,:,:,216:288],4294967299,name=‘mri_4_3’)

The first two lines for local instances completed pretty quick (roughly 80 seconds each), and yet the 3rd line seemed running forever… Any ideas or suggestions?

PS: I could store an array to remote nodes in AQL without any issues.

-Don


#2

Some updates:

I wasn’t able to get the “persistent” option working on remote instances so I switched to using the scidbpy wrapper of AQL “save”, something like the following:

data_sdb_0 = sdb.from_array(data[:,:,:,0:72])
sdb.query(“save({A},‘mri_0.file’,0,‘csv’)”, A=data_sdb_0)

It roughly took 15 minutes to save a 1 GB array to the disk, which implied roughly a 1.x MB/s throughput. Is this normal? (I tested it on EC2 r3.2xlarge instances with SSDs)

-DFZ


#3

Ok, more issues:

I was trying to load the persisted files back to scidb. Local loading seemed fine but remote loading took forever. Here’s my code:

#DFZ: local read, completed in 90 seconds
sdb.query(“create array mri_4_0 <x:double> [i]”)
sdb.query(“load(mri_4_0,‘mri_0.file’, 0, ‘csv’)”)

#DFZ: remote read, never completed (well, at least not in 20 minutes…)
sdb.query(“create array mri_4_3 <x:double> [i]”)
sdb.query(“load(mri_4_3,‘mri_3.file’, 4294967299, ‘csv’)”)

This is like a show stopper since I couldn’t do anything parallel. Please help.

PS: I also wonder how exactly a parallel application works; in the above example would any subsequent queries over mri_4_3 be running on the remote node? (assuming the loading part succeeds)

-DFZ


#4

Hey Don,

Sorry about the trouble. Consider trying the accelerated_io_tools package. Note there are two operators available: aio_input and aio_save. They also let you read and write to various instances. The transfer rates should be better. There are some tutorials available as well: http://rvernica.github.io/2016/08/load-data-table


#5

Thanks for the pointer! I tried the aio_save and aio_input functions, they were much faster than the native save/load APIs. Here are some numbers I had on EC2 r3.2xlarge:

  • aio_save writes 1.1 GB to disk in roughly 4 minutes (v.s. ~15 minutes with native save);
  • aio_input reads 1.1 GB from disk in 45 seconds (v.s. reading from remote node with native load took 20+ minutes)

The numbers are about the same for both local and remote instances; so network is not the bottleneck here.


#6

:confused: that’s still a lot slower than usual.
We should check your config.ini. In particular:

sg-send-queue-size=4
sg-receive-queue-size=[NUM_INSTANCES]

Try that. But to be safe, copy your whole config here.


#7

[update] I add the following two lines to my config file, re-init and re-start SciDB, and the performance is pretty much the same as before.
sg-send-queue-size=4
sg-receive-queue-size=4


Interestingly I didn’t see the options you mentioned in my config file; here it is:
ubuntu@ip-172-31-38-168:~/scidb_mri$ cat /opt/scidb/15.12/etc/config.ini
[mydb]
server-0=ip-172-31-38-168,1
server-1=ip-172-31-38-169,1
db_user=mydb
redundancy=1
install_root=/opt/scidb/15.12
pluginsdir=/opt/scidb/15.12/lib/scidb/plugins
logconf=/opt/scidb/15.12/share/scidb/log4cxx.properties
base-path=/home/ubuntu/mydb-DB
base-port=1239
interface=eth0
security=trust

And BTW, I was using the Python API to issue the query, something like the following (I assume the Python wrapper didn’t affect the performance much):
sdb = connect(‘http://localhost:8080’)
sdb.query(“store(aio_input(‘paths=/tmp/mri_3.out’, ‘instances=4294967299’, ‘num_attributes=1’), mri_4_3)”)


#8

I see - you are running only two SciDB instances on each machine (that 1 after each IP address means 2). We generally run one instance per 1-2 CPU cores. I don’t know how many CPU cores you have here.

You should unset redundancy, i.e. “redundancy=0” or just get rid of it. It will slow you down and you can’t use it without enterprise edition. That change will require a DB reinit.