Not able to execute matrix multiplication


#1

Hi,
I am trying to execute [color=#0000BF]gemm[/color] to multiply two matrices, but I am getting an error. I set up the arrays as specified in the user guide. The output is as follows.

[color=#0000FF]AFL% gemm(m2x3,m3x2,z);
SystemException in file: src/mpi/MPISlaveProxy.cpp function: checkLauncher line: 59
Error id: scidb::SCIDB_SE_INTERNAL::SCIDB_LE_OPERATION_FAILED
Error description: Internal SciDB error. Operation ‘MPI launcher process already terminated’ failed.[/color]

I then executed the [color=#0000FF]mpi_init()[/color] command and got the following output:

[color=#0000FF]AFL% mpi_init();
SystemException in file: src/mpi/MPISlaveProxy.cpp function: checkLauncher line: 59
Error id: scidb::SCIDB_SE_INTERNAL::SCIDB_LE_OPERATION_FAILED
Error description: Internal SciDB error. Operation ‘MPI launcher process already terminated’ failed.[/color]

I then changed the IP address in [color=#0000BF]/etc/hosts[/color], but the same error still occurs.

Please help…


#2

Hi,
Please see: viewtopic.php?f=11&t=1460#p3339
Thanks.


#3

Hi,
Even after trying every suggestion on the http://www.paradigm4.com/HTMLmanual/14. … pbs01.html page, the error still appears when I execute the gemm function.

AFL% gemm(m2x3,m3x2,z);
SystemException in file: src/mpi/MPISlaveProxy.cpp function: checkLauncher line: 59
Error id: scidb::SCIDB_SE_INTERNAL::SCIDB_LE_OPERATION_FAILED
Error description: Internal SciDB error. Operation 'MPI launcher process already terminated' failed.

Is there any way to get past this error?

I also have a question about this statement on the linked page: "If it does not help, please provide ALL <base-path>/*/*/scidb.log, <base-path>/*/*/mpi_log/*, and /etc/hosts." I did not understand what <base-path> refers to.

Looking forward to your reply.
Regards
Subu


#4

Hi Subu,

<base-path> is the configuration option specified in your SciDB config.ini file.
We need details from the log files under the <base-path> directory tree.
Please provide the logs from the specified paths.

Thanks,
Sunny


#5

The config.ini file content is listed below

[cluster]
server-0=subu-desktop,3
install_root=/opt/scidb/14.8
metadata=/opt/scidb/14.8/share/scidb/meta.sql
pluginsdir=/opt/scidb/14.8/lib/scidb/plugins
logconf=/opt/scidb/14.8/share/scidb/log4cxx.properties
db_user=pguser
db_passwd=pguserpwd
base-port=1239
base-path=/home/scidb/scidb_data
redundancy=0
execution-threads=1
result-prefetch-threads=1
result-prefetch-queue-size=1
operator-threads=1
pg-port=5433

#6

Hi Subu,

Please execute the following command in a terminal on subu-desktop:

find /home/scidb/scidb_data -name '*scidb.log' | xargs tail -n 20 > /tmp/my_debug_log.txt

Then please attach the file /tmp/my_debug_log.txt to your reply so we can better understand the issue.

Thanks,
Sunny


#7

Hi Sunny,
Thank you for your reply. As you suggested, I executed the command [color=#FF0000]find /home/scidb/scidb_data -name '*scidb.log' | xargs tail -n 20 > /tmp/my_debug_log.txt[/color]
and the content of /tmp/my_debug_log.txt is listed below.

[color=#0000FF]==> /home/scidb/scidb_data/000/0/scidb.log <==
2014-11-01 11:42:45,383 [0x7fed0e1d47c0] [ERROR]: Exception in message handler: source instance ID = instance 3
2014-11-01 11:42:45,383 [0x7fed0e1d47c0] [ERROR]: Exception in message handler: SystemException in file: src/query/Query.cpp function: getQueryByID line: 812
Error id: scidb::SCIDB_SE_QPROC::SCIDB_LE_QUERY_NOT_FOUND
Error description: Query processor error. Query 1100926610124 not found.
2014-11-01 11:42:45,383 [0x7fed0e1d47c0] [DEBUG]: Query 1100926610124 is not found
2014-11-01 11:42:45,384 [0x7fed0e1d47c0] [ERROR]: Exception in message handler: messageType = 12
2014-11-01 11:42:45,384 [0x7fed0e1d47c0] [ERROR]: Exception in message handler: source instance ID = instance 1
2014-11-01 11:42:45,384 [0x7fed0e1d47c0] [ERROR]: Exception in message handler: SystemException in file: src/query/Query.cpp function: getQueryByID line: 812
Error id: scidb::SCIDB_SE_QPROC::SCIDB_LE_QUERY_NOT_FOUND
Error description: Query processor error. Query 1100926610124 not found.
2014-11-01 12:16:20,543 [0x7fed0e1d47c0] [DEBUG]: Disconnected
2014-11-01 12:16:37,332 [0x7fed0e1d47c0] [INFO ]: Got std input event. Terminating myself.
2014-11-01 12:16:40,185 [0x7fed0e1d47c0] [DEBUG]: Disconnected
2014-11-01 12:16:40,185 [0x7fed0e1d47c0] [DEBUG]: Disconnected
2014-11-01 12:16:40,203 [0x7fed0e1d47c0] [DEBUG]: Disconnected
2014-11-01 12:16:40,245 [0x7fed0e1d47c0] [INFO ]: SciDB is going down …
2014-11-01 12:16:40,245 [0x7fed0e1d47c0] [DEBUG]: Disconnected
2014-11-01 12:16:40,245 [0x7fed0e1d47c0] [DEBUG]: Disconnected
2014-11-01 12:16:40,245 [0x7fed0e1d47c0] [DEBUG]: Disconnected
2014-11-01 12:16:40,251 [0x7fed0e1d47c0] [INFO ]: SciDB instance. SciDB Version: 14.8.7978. Build Type: RelWithDebInfo. Copyright © 2008-2014 SciDB, Inc. is exiting.

==> /home/scidb/scidb_data/000/3/scidb.log <==
Error id: scidb::SCIDB_SE_QPROC::SCIDB_LE_QUERY_CANCELLED
Error description: Query processor error. Query 1100926610124 was cancelled.
2014-11-01 11:42:45,383 [0x7fc2e436c700] [ERROR]: ServerMessageHandleJob::run: Error occurred in message handler: SystemException in file: src/query/Query.cpp function: handleAbort line: 549
Error id: scidb::SCIDB_SE_QPROC::SCIDB_LE_QUERY_CANCELLED
Error description: Query processor error. Query 1100926610124 was cancelled., messageType = 2, sourceInstance = 0, queryID=1100926610124
2014-11-01 11:42:45,383 [0x7fc2e436c700] [DEBUG]: ServerMessageHandleJob::run: Execution of query 1100926610124 is aborted on worker
2014-11-01 11:42:45,383 [0x7fc2e436c700] [DEBUG]: Query::done: queryID=1100926610124, _commitState=2, errorCode=62
2014-11-01 11:42:45,383 [0x7fc2e436c700] [DEBUG]: Query (1100926610124) is being aborted
2014-11-01 11:42:45,383 [0x7fc2e436c700] [ERROR]: Query (1100926610124) error handlers (1) are being executed
2014-11-01 11:42:45,383 [0x7fc2e436c700] [DEBUG]: MpiManager::removeCtx: queryID=1100926610124
2014-11-01 11:42:45,383 [0x7fc2e436c700] [DEBUG]: Deallocating query (1100926610124)
2014-11-01 11:42:45,383 [0x7fc2e436c700] [DEBUG]: MpiManager::removeCtx: queryID=1100926610124
2014-11-01 12:16:37,331 [0x7fc2ef8587c0] [INFO ]: Got std input event. Terminating myself.
2014-11-01 12:16:40,185 [0x7fc2ef8587c0] [DEBUG]: Disconnected
2014-11-01 12:16:40,185 [0x7fc2ef8587c0] [DEBUG]: Disconnected
2014-11-01 12:16:40,203 [0x7fc2ef8587c0] [INFO ]: SciDB is going down …
2014-11-01 12:16:40,203 [0x7fc2ef8587c0] [DEBUG]: Disconnected
2014-11-01 12:16:40,203 [0x7fc2ef8587c0] [DEBUG]: Disconnected
2014-11-01 12:16:40,203 [0x7fc2ef8587c0] [DEBUG]: Disconnected
2014-11-01 12:16:40,217 [0x7fc2ef8587c0] [INFO ]: SciDB instance. SciDB Version: 14.8.7978. Build Type: RelWithDebInfo. Copyright © 2008-2014 SciDB, Inc. is exiting.

==> /home/scidb/scidb_data/000/2/scidb.log <==
2014-11-01 11:42:35,384 [0x7f8cd887e700] [DEBUG]: Query (1100926610124) is still in progress
2014-11-01 11:42:45,382 [0x7f8ce3d4a700] [ERROR]: ServerMessageHandleJob::handleExecutePhysicalPlan: QueryID = 1100926610124 encountered the error: SystemException in file: src/query/Query.cpp function: handleAbort line: 549
Error id: scidb::SCIDB_SE_QPROC::SCIDB_LE_QUERY_CANCELLED
Error description: Query processor error. Query 1100926610124 was cancelled.
2014-11-01 11:42:45,382 [0x7f8ce3d4a700] [ERROR]: ServerMessageHandleJob::run: Error occurred in message handler: SystemException in file: src/query/Query.cpp function: handleAbort line: 549
Error id: scidb::SCIDB_SE_QPROC::SCIDB_LE_QUERY_CANCELLED
Error description: Query processor error. Query 1100926610124 was cancelled., messageType = 2, sourceInstance = 0, queryID=1100926610124
2014-11-01 11:42:45,383 [0x7f8ce3d4a700] [DEBUG]: ServerMessageHandleJob::run: Execution of query 1100926610124 is aborted on worker
2014-11-01 11:42:45,383 [0x7f8ce3d4a700] [DEBUG]: Query::done: queryID=1100926610124, _commitState=2, errorCode=62
2014-11-01 11:42:45,383 [0x7f8ce3d4a700] [DEBUG]: Query (1100926610124) is being aborted
2014-11-01 11:42:45,383 [0x7f8ce3d4a700] [ERROR]: Query (1100926610124) error handlers (1) are being executed
2014-11-01 11:42:45,383 [0x7f8ce3d4a700] [DEBUG]: MpiManager::removeCtx: queryID=1100926610124
2014-11-01 11:42:45,383 [0x7f8ce3d4a700] [DEBUG]: Deallocating query (1100926610124)
2014-11-01 11:42:45,383 [0x7f8ce3d4a700] [DEBUG]: MpiManager::removeCtx: queryID=1100926610124
2014-11-01 12:16:37,332 [0x7f8ce3d6a7c0] [INFO ]: Got std input event. Terminating myself.
2014-11-01 12:16:40,185 [0x7f8ce3d6a7c0] [INFO ]: SciDB is going down …
2014-11-01 12:16:40,185 [0x7f8ce3d6a7c0] [DEBUG]: Disconnected
2014-11-01 12:16:40,185 [0x7f8ce3d6a7c0] [DEBUG]: Disconnected
2014-11-01 12:16:40,185 [0x7f8ce3d6a7c0] [DEBUG]: Disconnected
2014-11-01 12:16:40,200 [0x7f8ce3d6a7c0] [INFO ]: SciDB instance. SciDB Version: 14.8.7978. Build Type: RelWithDebInfo. Copyright © 2008-2014 SciDB, Inc. is exiting.

==> /home/scidb/scidb_data/000/1/scidb.log <==
2014-11-01 11:42:35,384 [0x7f19c5bd4700] [DEBUG]: Query (1100926610124) is still in progress
2014-11-01 11:42:45,382 [0x7f19ba708700] [ERROR]: ServerMessageHandleJob::handleExecutePhysicalPlan: QueryID = 1100926610124 encountered the error: SystemException in file: src/query/Query.cpp function: handleAbort line: 549
Error id: scidb::SCIDB_SE_QPROC::SCIDB_LE_QUERY_CANCELLED
Error description: Query processor error. Query 1100926610124 was cancelled.
2014-11-01 11:42:45,383 [0x7f19ba708700] [ERROR]: ServerMessageHandleJob::run: Error occurred in message handler: SystemException in file: src/query/Query.cpp function: handleAbort line: 549
Error id: scidb::SCIDB_SE_QPROC::SCIDB_LE_QUERY_CANCELLED
Error description: Query processor error. Query 1100926610124 was cancelled., messageType = 2, sourceInstance = 0, queryID=1100926610124
2014-11-01 11:42:45,383 [0x7f19ba708700] [DEBUG]: ServerMessageHandleJob::run: Execution of query 1100926610124 is aborted on worker
2014-11-01 11:42:45,383 [0x7f19ba708700] [DEBUG]: Query::done: queryID=1100926610124, _commitState=2, errorCode=62
2014-11-01 11:42:45,383 [0x7f19ba708700] [DEBUG]: Query (1100926610124) is being aborted
2014-11-01 11:42:45,383 [0x7f19ba708700] [ERROR]: Query (1100926610124) error handlers (1) are being executed
2014-11-01 11:42:45,383 [0x7f19ba708700] [DEBUG]: MpiManager::removeCtx: queryID=1100926610124
2014-11-01 11:42:45,383 [0x7f19ba708700] [DEBUG]: Deallocating query (1100926610124)
2014-11-01 11:42:45,383 [0x7f19ba708700] [DEBUG]: MpiManager::removeCtx: queryID=1100926610124
2014-11-01 12:16:37,332 [0x7f19c5bf47c0] [INFO ]: Got std input event. Terminating myself.
2014-11-01 12:16:40,185 [0x7f19c5bf47c0] [INFO ]: SciDB is going down …
2014-11-01 12:16:40,185 [0x7f19c5bf47c0] [DEBUG]: Disconnected
2014-11-01 12:16:40,185 [0x7f19c5bf47c0] [DEBUG]: Disconnected
2014-11-01 12:16:40,185 [0x7f19c5bf47c0] [DEBUG]: Disconnected
2014-11-01 12:16:40,209 [0x7f19c5bf47c0] [INFO ]: SciDB instance. SciDB Version: 14.8.7978. Build Type: RelWithDebInfo. Copyright © 2008-2014 SciDB, Inc. is exiting.[/color]


#8

First, increase the values for the following config.ini params:
execution-threads=4
result-prefetch-threads=4
result-prefetch-queue-size=4

Then try running the query again. If it still fails, you can do either of the following:

Run scidb.py dbginfo-lt cluster and collect the tar file /home/scidb/scidb_data/000/0/all-*.tar

AND/OR

Collect the following files:
/home/scidb/scidb_data/00*/*/mpi_log/*.log

Next, if you have an HTTP/FTP server with public access, I can grab the files from there. If not, do the following:
sftp ftp_guest@ftp.paradigm4.com
enter the password: paradigm4
cd ftp
put <THE_FILES_TO_UPLOAD>
This is a shared public account, so make sure not to upload any sensitive information.


#9

Hi Tigor,
As you suggested, I made the changes in the [color=#0000FF]config.ini[/color] file, but the error persists when I execute the [color=#0000FF]gemm[/color] command.

I then collected the log files by executing the command [color=#0040FF]scidb.py dbginfo-lt cluster[/color]. The tar file I uploaded using ftp is named [color=#0000FF]all-20141104-121532.tar[/color].

Please help me resolve this error.

Best regards
Subu surendran


#10

Hi Subu,
Here is what I got from the logs:

tigor@tigor-devbox:~/Downloads/gemm_problem_forum$ cat 0/mpi_log/1100926499899.1.mpirun.log
LAUNCHER: maxfd = 1024
Host key verification failed.

This means that the coordinator instance (0) cannot ssh to the other instances, apparently because the (local) host key has changed.
To run the ScaLAPACK queries (gemm, gesvd), you need to make sure that passwordless ssh (from the coordinator instance to the others) is set up and that the non-coordinator instances can DNS-resolve the coordinator hostname, subu-desktop. In your case, DNS is less of an issue because you have only one box, but keep it in mind for future reference.
In short, make sure you can 'ssh scidb_user@subu-desktop' without entering a password, and acknowledge the ssh question about adding a new host key (if it is asked). scidb_user is the OS username under which you run scidb. Take a look at askubuntu.com/questions/45679/ss … iled-error for your particular problem.
Assuming you are in full control of your box, removing ~scidb_user/.ssh/known_hosts and repeating the steps in the SSH section of the user manual should do the trick. (There might be security implications; please learn more about ssh configuration before removing known_hosts if you are concerned about ssh security.)
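A quick way to confirm this diagnosis on your own installation is to search the launcher logs for the same message. The sketch below assumes the logs live under <base-path>/*/*/mpi_log/ and end in .mpirun.log, as in the file shown above; the function name find_hostkey_failures is hypothetical and not part of SciDB.

```shell
# List mpirun launcher logs under BASE_PATH that report an ssh host-key failure.
# Usage: find_hostkey_failures BASE_PATH
find_hostkey_failures() {
    # grep -l prints only the names of the files that contain the message.
    find "$1" -path '*/mpi_log/*' -name '*.mpirun.log' -type f \
        -exec grep -l 'Host key verification failed' {} +
}
```

For example, find_hostkey_failures /home/scidb/scidb_data prints one path per affected log. If only the entry for one host is stale, ssh-keygen -R subu-desktop removes just that host's keys from known_hosts, which is narrower than deleting the whole file.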
Hope this helps,
–Igor


#11

Hi Igor,
I have already executed the following commands for ssh key generation and copying.
[color=#0000FF]ssh-keygen
ssh-copy-id -i ~/.ssh/id_rsa.pub scidb@localhost
ssh-copy-id -i ~/.ssh/id_rsa.pub scidb@0.0.0.0
ssh-copy-id -i ~/.ssh/id_rsa.pub scidb@127.0.0.1[/color]

and I was able to establish connections to the following hosts as well.

[color=#0040FF]ssh scidb@localhost
ssh scidb@0.0.0.0
ssh scidb@127.0.0.1[/color]

But I have a question about the following command.

[color=#FF0000]ssh-copy-id -i ~/.ssh/id_rsa.pub scidb@worker[/color]

In my context, what does [color=#FF0000]worker[/color] mean? How many [color=#FF0000]workers[/color] are there in my installation? How is each one named?
Can you help me clarify these doubts based on the log files I already uploaded?

regards
Subu Surendran


#12

We do have a somewhat odd naming scheme, but here it is:
server-0=subu-desktop,3

tells scidb to create instances 0-3 on server-0, which happens to be the same physical box, subu-desktop. Instance 0 is assumed to be the default coordinator; instances 1-3 are the workers. Because the workers are on the same physical box, the ssh-copy-id step does not need to be repeated for the workers.
A few things to note:

  1. If you had a second 'server-' line in your config.ini, like
    server-1=XXX-desktop,3
    it would tell scidb to create 3 more instances on XXX-desktop, and you would have 7 instances in total (0-6). So server-0=… is special because the 0th instance is implied. Again, instance 0 is the default coordinator and instances 1-6 are the workers. Because instances 4-6 are on a different box, the ssh-copy-id step would need to be performed from subu-desktop to XXX-desktop.
  2. The naming of the data directories for each instance is not based on the instance ID, but rather on the server ID and the per-server instance count, so for the config
    server-0=subu-desktop,3
    server-1=XXX-desktop,3
    base-path=<base-path>

    this is the data directory layout:
    instance 0 : subu-desktop:<base-path>/000/0
    instance 1 : subu-desktop:<base-path>/000/1
    instance 2 : subu-desktop:<base-path>/000/2
    instance 3 : subu-desktop:<base-path>/000/3
    instance 4 : XXX-desktop:<base-path>/001/1
    instance 5 : XXX-desktop:<base-path>/001/2
    instance 6 : XXX-desktop:<base-path>/001/3
  3. As I said, instance 0 is the default coordinator. The client (e.g. iquery) can submit queries to other scidb instances; whichever instance gets the query request is the coordinator for that query. If you plan to use instances other than 0 (you don't have to), you need to perform the ssh configuration steps for each coordinator instance you plan to use, usually all of them, to make the system completely symmetric.
    Hope this makes sense …
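
The numbering rule described above can be sketched as a small shell function. This is only an illustration of the scheme as explained in this post, not code from SciDB; the function name scidb_layout and its two arguments are made up for the example.

```shell
# Print the instance-to-directory layout implied by the server-N lines of a config.ini.
# Usage: scidb_layout CONFIG_FILE BASE_PATH
scidb_layout() {
    # Turn each "server-N=host,count" line into "N host count", ordered by N.
    sed -n 's/^server-\([0-9][0-9]*\)=\(.*\),\([0-9][0-9]*\)[[:space:]]*$/\1 \2 \3/p' "$1" |
    sort -n |
    awk -v base="$2" '{
        server = $1; host = $2; count = $3
        # server-0 also hosts the implied coordinator, so its numbering starts at 0.
        first = (server == 0) ? 0 : 1
        for (k = first; k <= count; k++)
            printf "instance %d : %s:%s/%03d/%d\n", id++, host, base, server, k
    }'
}
```

Running scidb_layout config.ini /home/scidb/scidb_data on a file containing the two server- lines above should list seven instances, 0 through 6, matching the layout in point 2.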

#13

Hi Igor,
Thank you for your reply. From it I understood that the [color=#FF0000]ssh-copy-id -i ~/.ssh/id_rsa.pub scidb@worker[/color] command needs to be executed only if I have two physical boxes connected to each other, and that there is no need to perform ssh-copy-id for the remaining instances (1-3) on a single box. In my current scenario, I have installed scidb on a single machine.

I have a few more questions:

  1. Can I run [color=#FF0000]gemm[/color] or [color=#FF0000]gesvd[/color] in a single-box installation of scidb?

  2. If so, which configuration steps (ssh-copy-id) need to be performed?

If not, can you suggest any other solution? My objective is to perform SVD on a large matrix.

Best Regards
Subu Surendran


#14

Yes, gemm(), gesvd(), and ALL other SciDB AFL queries work, on one machine or more machines.

If your config.ini says “server-0=subu-desktop,3”, you should check to make sure the following command works WITHOUT asking you for a password:

ssh subu-desktop

Would you check?


#15

Hi,
When I execute [color=#0000FF]ssh subu-desktop[/color], it asks for a password.
Best regards
Subu


#16

If it is asking for a password, then your passwordless ssh to subu-desktop is not set up. SciDB needs passwordless ssh access to ALL machines in the cluster, including the one you are running on. Here is the link to the documentation on one way to configure your machine for passwordless ssh:
http://www.paradigm4.com/HTMLmanual/14.8/scidb_ug/apas01s01.html#d0e31875

Please set up your passwordless ssh to your machine (subu-desktop) and test it again as suggested by donghui in the previous post.
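
To see which machines that requirement covers, you can list the hosts named in config.ini. This is a minimal sketch, assuming the server-N=host,count line format shown earlier in this thread; scidb_hosts is a hypothetical helper, not a SciDB tool.

```shell
# Print each distinct host named in the server-N lines of a SciDB config.ini.
# Usage: scidb_hosts CONFIG_FILE
scidb_hosts() {
    # Keep only the host part of "server-N=host,count" and drop duplicates.
    sed -n 's/^server-[0-9][0-9]*=\(.*\),[0-9][0-9]*[[:space:]]*$/\1/p' "$1" | sort -u
}
```

With the config.ini posted earlier, this prints only subu-desktop, so that is the name to test with ssh.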


#17

As you suggested, I established passwordless ssh connections to [color=#0000FF]subu-desktop, scidb@localhost, scidb@127.0.0.1[/color]. Even so, the same error appears when I try to execute the [color=#0000FF]gemm[/color] command. The error is as follows:

[color=#FF0000]AFL% gemm(m2x3,m3x2,z);
SystemException in file: src/mpi/MPISlaveProxy.cpp function: checkLauncher line: 59
Error id: scidb::SCIDB_SE_INTERNAL::SCIDB_LE_OPERATION_FAILED
Error description: Internal SciDB error. Operation ‘MPI launcher process already terminated’ failed.[/color]


#18

Just to be absolutely clear: are you able to log in to your machine via ssh without a password?
The error seems to be pointing to a problem with MPI. Here is the doc entry for debugging the MPI issues:
http://www.paradigm4.com/HTMLmanual/14.8/scidb_ug/apbs01.html

Take a look at the article above and see if these suggestions fix your problem. Please let us know how it goes.


#19

I had already tried all the suggestions explained in MPI Issues except the following:

[color=#0000BF]Log in once: ssh scidb@. At the following prompt, answer yes.

Are you sure you want to continue connecting (yes/no)?[/color]

Here I can't locate the worker.

One more question: can you explain how to log in to my machine via ssh without a password? I don't have any LAN set up here.


#20

I think your configuration is for a single machine (subu-desktop). If this is true, this is the only machine you need passwordless ssh for. “Worker” typically refers to another machine that is different from the one with the default coordinator, but still part of the cluster.
To test passwordless access, you need to log in as follows:
ssh "your username"@subu-desktop (or ssh "your username"@127.0.0.1)

If you are prompted for a password, then passwordless ssh is not set up properly (be sure to omit the double quotes in the user name - I included them only to make my point).
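
The same test can be made non-interactive, which is presumably closer to what a launcher process sees, since it cannot answer prompts. This is a minimal sketch, assuming OpenSSH; BatchMode and ConnectTimeout are standard ssh options, while the function name check_passwordless_ssh is hypothetical.

```shell
# Return 0 only if ssh to TARGET succeeds without any interactive prompt.
# Usage: check_passwordless_ssh user@host
check_passwordless_ssh() {
    # BatchMode=yes makes ssh fail instead of asking for a password or passphrase.
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}
```

For example: check_passwordless_ssh scidb@subu-desktop && echo "passwordless ssh works". A non-zero exit status means ssh would have needed a prompt or could not connect.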