A question about the performance of a computation

Hello,

I am trying to run some linear algebra computations on SciDB 18.1, and the performance is not what I expected, so I would like to get some advice from you.

My input data are two matrices, x and m. Matrix x is 10^5 by 10, and matrix m is 10 by 10. Both matrices are stored as one-dimensional arrays in two CSV files.

I use the following queries to load the matrices x and m and reshape them into two-dimensional arrays:

# [1] 
create array xflat <val: double> [i=0:(100000*10-1), 100000, 0];

# [2] 
load(xflat,'/data/x.csv', -2, 'CSV');

# [3] 
select * into x from reshape(xflat,<val: double>[i=0:99999,1000,0,j=0:9,1000,0]);

# [4] 
create array mflat <val: double> [i=0:(10*10-1), 1000, 0];

# [5] 
load(mflat,'/data/m.csv', -2, 'CSV');

# [6] 
select * into m from reshape(mflat,<val: double>[i=0:9,1000,0,j=0:9,1000,0]);
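
As an aside, my understanding is that reshape fills the target array in row-major order, so each consecutive group of 10 values in x.csv becomes one row of x. A tiny throwaway check of that (array name made up):

# hypothetical sanity check: reshape lays the flat data out row by row
iquery -aq "store(build(<val: double>[i=0:5,6,0], i), tiny_flat)"
iquery -aq "scan(reshape(tiny_flat, <val: double>[r=0:1,2,0, c=0:2,3,0]))"
# I would expect row 0 to hold 0,1,2 and row 1 to hold 3,4,5
iquery -aq "remove(tiny_flat)"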

# Then, I compute a matrix-multiplication chain with the following query:
# [7] 
select * into mxt from gemm(m, transpose(x), build(<val: double>[t1=0:9,1000,0,t2=0:99999,1000,0], 0));

# [8] 
select * into all_distance from filter(gemm(x, mxt, build(<val: double>[t1=0:99999,1000,0,t2=0:99999,1000,0], 0)), t1<>t2);
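
For scale, my rough estimate of the array sizes involved (assuming 8-byte doubles; please correct me if this is off):

# x            : 100000 x 10     =  10^6 cells  ~  8 MB
# mxt          : 10 x 100000     =  10^6 cells  ~  8 MB
# gemm(x, mxt) : 100000 x 100000 = 10^10 cells  ~ 80 GB dense, before the filter in [8]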

I ran those queries on a cluster with one master node and ten worker nodes. Each node has 8 cores and 64 GB of memory. I launched 1 server instance on the master node and 8 server instances on each worker node. In the config file I set max-memory-limit = 7000 and kept all other settings at their defaults.

So here is my question. I found that query [3] takes 16 hours to finish and query [8] takes 32 minutes to finish. I think these running times might be too long for SciDB. I tested the same queries with the same settings on SciDB 14.1, and each query took only several minutes to finish. So I am wondering why the newer SciDB performs worse.

Any suggestions will be appreciated!

@Seraph

Sorry for the delay in responding, but can you confirm a few things first?

1.
You initially describe x as having 10^6 * 10 elements, but in your create statement [1] the created array has 10^5 * 10 elements. Can you confirm that your array x has 10^6 * 10 elements? We will work on a repro accordingly.

2.
Also, can you provide some guidance as to why you chose a 10-node (80-instance) SciDB cluster? Your current array sizes are small, but perhaps your final target array sizes are much larger?

Hi,

Thank you for the reply!

Sorry, I had a typo in my previous post. The number of elements is 10^5 * 10.

We will test this computation on a larger dataset later, but we want to start with a relatively small dataset first to see its performance.

Thanks!

@Seraph

Sorry for the delay.

I was able to complete the computation quickly on a much smaller machine.

# [1, 2]
iquery -naq "store(build(<val: double> [i=0:(100000*10-1), 100000, 0], random()), xflat)" 
# [3]
time iquery -naq "store(reshape(xflat,<val: double>[i=0:99999,1000,0,j=0:9,1000,0]), x)"
# real	0m8.716s

# [4, 5]
iquery -naq "store(build(<val: double> [i=0:(10*10-1), 1000, 0], random()), mflat)"
# [6]
iquery -naq "store(reshape(mflat,<val: double>[i=0:9,1000,0,j=0:9,1000,0]), m)"

# Then, I compute a matrix-multiplication chain with the following query:

# [7] 
iquery -aq "load_library('dense_linear_algebra')"
time iquery -naq "store(gemm(m, transpose(x), build(<val: double>[t1=0:9,1000,0,t2=0:99999,1000,0], 0)), mxt);"
# real    0m0.652s

# Cleanup
iquery -aq "remove(x); remove(xflat)"
iquery -aq "remove(m); remove(mflat)"

Machine specs

16 core machine; cat /proc/cpuinfo gives

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
stepping        : 4
microcode       : 0x200003a
cpu MHz         : 2499.994
cache size      : 33792 KB
physical id     : 0
siblings        : 16

Roughly 64 GB memory; cat /proc/meminfo gives

MemTotal:       64300196 kB

In fact, we are using only 4 of the cores in this case, but I noted similar timings on an 8-core and a 16-core SciDB as well.

Hi,

Thank you for the reply!

I notice that one difference between your queries and mine is how the data is loaded, which I think is the reason for the runtime difference. Could you try loading the data the same way I do? That is, create the data files x.csv and m.csv first, and then load the data from those files (instead of generating it by calling random()). The data format is like 0.1, 0.2, 0.3, …, and the length is 100000*10 (basically a one-dimensional array).
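
In case it helps, here is a sketch of one way to generate files like that (the exact values do not matter, only the size and the format):

# one value per line: 0.1, 0.2, 0.3, ...; 100000*10 values for x, 10*10 for m
seq 1 1000000 | awk '{printf "%.1f\n", $1/10}' > /data/x.csv
seq 1 100 | awk '{printf "%.1f\n", $1/10}' > /data/m.csv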

Thank you!

Hi Seraph. You say you are comparing SciDB 18.1

with the same settings on SciDB 14.1

AFAIK, Paradigm4 did not release a “14.1”. Can you clarify which release you are comparing 18.1 with?

Hi,

Sorry, previously I ran the same computations on version 14.8.

@seraph’s original post had some typos (e.g. 10^5 was noted as 10^6; discussed later in the thread). So I updated the original post to reflect the actual values.

I just saw @seraph’s note about seeing different performance with the CSV load method. I have reverted my edit to the original post, so effectively you can ignore my previous comment.

Hi Seraph.

Unfortunately I don’t have a 14.8 cluster to compare with. However, SciDB 19.3 was just released (see SciDB Release 19.3), and I made a script of my attempt to reproduce your issue on it. I only have a desktop available to me at the moment, and it does not have enough memory to run your query [8]. However, 19.3 runs your case many times faster than you report, so I wonder whether your issue is configuration-specific.

(For example, I do not put my coordinator on a separate machine; I always use clusters that distribute the load symmetrically across all machines, and I use no more than one instance per physical core [not hyperthread]. So putting your coordinator alone on 1 machine and then 8 instances on each of 10 other machines makes me wonder whether your hardware configuration is asymmetric as well.)
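
Purely as a sketch (hostnames made up, and assuming the same config.ini syntax you already use, where ipN,7 yields 8 instances on that host), a symmetric layout for your hardware might look like:

server-0=ip0,7
server-1=ip1,7
# ... and so on through server-10

i.e. every host, including the one running the coordinator, carries 8 instances, one per physical core.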

What follows is my repro script, and the results I get from it when run on 19.3, on a desktop machine with 4 instances.

For me, query [3] takes 6.3 s and query [7] takes 0.8 s.

#!/bin/bash
#
set -x

# [re]create x.csv

cat <<EOF > genx.c

#include <stdio.h>
int main() {
    size_t i;
    for(i=0; i<100000*10; i++) {
        printf("%.1f\n", i/10.0); 
    }
    return 0;
}

EOF

gcc genx.c -o genx
./genx > /tmp/x.csv

# check that we have 1M lines
wc /tmp/x.csv

# [re]create m.csv
head -n100 /tmp/x.csv > /tmp/m.csv
# check that we have 100 lines
wc /tmp/m.csv

# path to IQUERY
PATH=stage/install/bin:$PATH

# [0]
# what are we working with?
scidb -V
# remove arrays from prior runs
iquery -aq "remove(xflat)" 2>/dev/null
iquery -aq "remove(x)" 2>/dev/null
iquery -aq "remove(mflat)" 2>/dev/null
iquery -aq "remove(m)" 2>/dev/null
iquery -aq "remove(mxt)" 2>/dev/null
iquery -aq "remove(all_distance)" 2>/dev/null
iquery -aq "remove(zeros_a)" 2>/dev/null
iquery -aq "remove(xt)" 2>/dev/null
iquery -aq "remove(zeros_b)" 2>/dev/null
# and load the library for gemm
iquery -aq "load_library('dense_linear_algebra')"
# and just show the elapsed time, for brevity
TIMEFORMAT='%3R s'

echo "[1]"
time iquery -aq "create array xflat <val: double> [i=0:(100000*10-1), 100000, 0]"

echo "[2]"
time iquery -naq "load(xflat,'/tmp/x.csv', -2, 'CSV')"

echo "[3]"
# dim(x) = (100K,10)
time iquery -nq "select * into x from reshape(xflat,<val: double>[i=0:99999,1000,0,j=0:9,1000,0])"

echo "[4]"
time iquery -aq "create array mflat <val: double> [i=0:(10*10-1), 1000, 0]"

echo "[5]"
time iquery -naq "load(mflat,'/tmp/m.csv', -2, 'CSV')"

echo "[6]"
# dim(m) = (10,10)
time iquery -nq "select * into m from reshape(mflat,<val: double>[i=0:9,1000,0,j=0:9,1000,0])"

echo "[7]"
# dim(mxt) = (10,100K)
# gemm flops = 2 * 10 * 10 * 100K = 20M flop
time iquery -nq "select * into mxt from gemm(m, transpose(x), build(<val: double>[t1=0:9,1000,0,t2=0:99999,1000,0], 0))"
echo "[7 alt]"
# let's estimate the relative costs of that query
# note that one need not populate zeros in an array given to gemm.
# this is handy for the C matrix -- it only needs to be created, not filled
time iquery -aq "create TEMP array zeros_a <val: double>[t1=0:9,1000,0,t2=0:99999,1000,0]"
time iquery -aq "create TEMP array xt  <val: double>[a=0:9,1000,0, b=0:99999,1000,0]"
time iquery -naq "store(transpose(x), xt)"
time iquery -nq "select * into mxt from gemm(m, xt, zeros_a)"

echo "[8]"
# dim(gemm()) = (100Kx100K).  Requires 10G doubles (80G byte)
#                             gemm flops = 2 * 100K * 10 * 100K = 200G flop
if false ;
then
    #  I don't have that much memory available.
    time iquery -nq "select * into all_distance from filter(gemm(x, mxt, build(<val: double>[t1=0:99999,1000,0,t2=0:99999,1000,0], 0)), t1<>t2)"
else
    echo "I don't have 80Gbyte of memory, I can't do [8] at this time"
fi

and my results are:

+ cat
+ gcc genx.c -o genx
+ ./genx
+ wc /tmp/x.csv
1000000 1000000 7888900 /tmp/x.csv
+ head -n100 /tmp/x.csv
+ wc /tmp/m.csv
100 100 400 /tmp/m.csv
+ PATH=stage/install/bin:/home/james/bin:/usr/lib64/ccache:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:/usr/lib64/openmpi/bin:/home/james/.local/bin:/usr/lib64/openmpi/bin:stage/install/bin
+ scidb -V
SciDB Version: 19.3.0
Build Type: RelWithDebInfo
Commit: fae26e9f
Copyright (C) 2008-2019 SciDB, Inc.
+ iquery -aq 'remove(xflat)'
Query was executed successfully
+ iquery -aq 'remove(x)'
Query was executed successfully
+ iquery -aq 'remove(mflat)'
Query was executed successfully
+ iquery -aq 'remove(m)'
Query was executed successfully
+ iquery -aq 'remove(mxt)'
Query was executed successfully
+ iquery -aq 'remove(all_distance)'
+ iquery -aq 'remove(zeros_a)'
Query was executed successfully
+ iquery -aq 'remove(xt)'
Query was executed successfully
+ iquery -aq 'remove(zeros_b)'
+ iquery -aq 'load_library('\''dense_linear_algebra'\'')'
Query was executed successfully
+ TIMEFORMAT='%3R s'
+ echo '[1]'
[1]
+ iquery -aq 'create array xflat <val: double> [i=0:(100000*10-1), 100000, 0]'
Query was executed successfully
0.016 s
+ echo '[2]'
[2]
+ iquery -naq 'load(xflat,'\''/tmp/x.csv'\'', -2, '\''CSV'\'')'
Query was executed successfully
0.423 s
+ echo '[3]'
[3]
+ iquery -nq 'select * into x from reshape(xflat,<val: double>[i=0:99999,1000,0,j=0:9,1000,0])'
Query was executed successfully
6.305 s
+ echo '[4]'
[4]
+ iquery -aq 'create array mflat <val: double> [i=0:(10*10-1), 1000, 0]'
Query was executed successfully
0.017 s
+ echo '[5]'
[5]
+ iquery -naq 'load(mflat,'\''/tmp/m.csv'\'', -2, '\''CSV'\'')'
Query was executed successfully
0.035 s
+ echo '[6]'
[6]
+ iquery -nq 'select * into m from reshape(mflat,<val: double>[i=0:9,1000,0,j=0:9,1000,0])'
Query was executed successfully
0.047 s
+ echo '[7]'
[7]
+ iquery -nq 'select * into mxt from gemm(m, transpose(x), build(<val: double>[t1=0:9,1000,0,t2=0:99999,1000,0], 0))'
Query was executed successfully
0.802 s
+ echo '[7 alt]'
[7 alt]
+ iquery -aq 'create TEMP array zeros_a <val: double>[t1=0:9,1000,0,t2=0:99999,1000,0]'
Query was executed successfully
0.020 s
+ iquery -aq 'create TEMP array xt  <val: double>[a=0:9,1000,0, b=0:99999,1000,0]'
Query was executed successfully
0.018 s
+ iquery -naq 'store(transpose(x), xt)'
Query was executed successfully
0.371 s
+ iquery -nq 'select * into mxt from gemm(m, xt, zeros_a)'
Query was executed successfully
0.549 s
+ echo '[8]'
[8]
+ false
+ echo 'I don'\''t have 80Gbyte of memory, I can'\''t do [8] at this time'
I don't have 80Gbyte of memory, I can't do [8] at this time

@jmcq

Hi,

Thank you very much for the help!

I have several questions.

  1. What’s your running time for query [2]?

  2. Can you share your config.ini file?

Here are the settings from my config.ini file:

[mydb]
server-0=ip0,0
server-1=ip1,7
server-2=ip2,7
server-3=ip3,7
server-4=ip4,7
server-5=ip5,7
server-6=ip6,7
server-7=ip7,7
server-8=ip8,7
server-9=ip9,7
server-10=ip10,7
install_root=/opt/scidb/18.1
pluginsdir=/opt/scidb/18.1/lib/scidb/plugins
logconf=/opt/scidb/18.1/share/scidb/log4cxx.properties
base-port=1239
base-path=/data1/scidb_data
redundancy=0
security=trust
smgr-cache-size=4096
mem-array-threshold=1024
merge-sort-buffer=512
network-buffer=4096
replication-send-queue-size=1024
replication-receive-queue-size=1024
max-memory-limit=7000
result-prefetch-threads=3
result-prefetch-queue-size=1
operator-threads=1
sg-send-queue-size=4
sg-receive-queue-size=4
max-arena-page-size=8

Sorry, I may have to keep the coordinator (server-0) and the worker machines arranged this way. We may need to run those queries on a much larger dataset, so we cannot run them on only one machine. And those settings are the same ones I used for version 14.8.

So, is it possible for you to run those queries in this “one coordinator + ten worker machines” manner as well?

Thank you!

  1. The times for all the queries follow them in the results that I included in my response. If you look there, you can see that [2] took 0.423 s.
  2. I can share a version of my config.ini file where I have redacted local information. It follows:
[mydb]
server-0=localhost,1
server-1=REDACTED,2-3
redundancy=1
db-user=mydb
install-root=/REDACTED/install
pluginsdir=/REDACTED/install/lib/scidb/plugins
logconf=/REDACTED/install/share/scidb/log1.properties
base-path=/REDACTED/DB-mydb
base-port=1239
interface=eth0
io-paths-list=/tmp:/dev/shm:/public/data
no-watchdog=true
security=trust
datastore-punch-holes=1
ccm-use-tls=false
ccm-session-time-out=30
ccm-read-time-out=3

Seraph wrote: So, is it possible for you to run those queries in this “one coordinator + ten worker machines” manner as well?

No cluster of that size is available to me for that, sorry.
