Drop array hangs?


#1

I ran the XLDB 2012 lab axercise Lab01, and ran into a bit of a snag:

AQL% drop array Signal_Timeseries;

… never returns. So I ^C out of iquery, and go back to iquery:

AQL% select * from list(‘arrays’)
CON> ;
[]
AQL%


so it worked apparently, but why did it not return from the drop? I waited about 2 minutes before I ^C-d it.
I am running 12.3 on Ubuntu 12.04
George


#2

… apparently I can not ssh localhost passwordlessly, which is weird, since I have the same id_rsa and id_rsa.pub in my ~/.ssh/ in all machines: coordinateor and servers where the instances are running.

I switched to a different user one that can passwordlessly ssh to all servers and localhost (coorcinator) and the hang is gone away.

I think I am volunteering myself to write a script called “whatnow.py” that will look in my own database of plausible causes (that is, stuff I already did that messed me up :smile:) and check for those first.

So, I think I got the solution, not waiting for a reply

( I should learn to think first and then ask the forum :smile:)

BTW: this is really great stuff! Can’t wait for 12.10 release!

George


#3

Passwordless ssh: Instead of cloning the id_dsa/rsa files within the cluster, you’ll have to use ssh-copy-id which adds entries to the ~/.ssh/authorized_keys file. The user guide describes this in some details. Take a look a chapter 2 if this note doesn’t make complete sense.
p4-releases-12-7.s3.amazonaws.c … e-12.7.pdf

Regarding the “hung” drop array command. Most likely there was another query in progress that was holding the write lock to the same array you were trying to drop. If that was not the case, could you tar up and send us the scidb.log (base-path/000/0/scidb.log). Also, passwordless ssh logins within the cluster should not interfere with query execution, so it is very likely that something else (restarting the server) probably cleared things up on your cluster.

We’re working to bring you an even better 12.10 release very soon! Stay tuned…


#4

Happened again. I tried to AFL% remove (MerraMonthly);
I had a long query that wasn’t going the way I wanted it, so I killed the shell containing the query.
The shell was in an emacs buffer and I simply killed the buffer, which killed all processes in it.
I am not sure what signal emacs sends… could be -9, I don’t know.

Seems like an interrupted query doesn’t always clean up and release locks? Especially
those that are killed with “extreme prejudice” :smile:

Here is the log:

[code]2012-10-09 16:43:26,645 [0x7f642ac4f7c0] [DEBUG]: Waiting for the first message
2012-10-09 16:43:26,645 [0x7f642ac4f7c0] [DEBUG]: Connection started from CLIENT (127.0.0.1)
2012-10-09 16:43:40,893 [0x7f642ac2f700] [DEBUG]: Generated queryID: instanceID=0, time=1349815420, clock=1992590000, nextID=1427, queryID=1102854034623
2012-10-09 16:43:40,893 [0x7f642ac2f700] [DEBUG]: Allocating query (1102854034623)
2012-10-09 16:43:40,893 [0x7f642ac2f700] [DEBUG]: Number of allocated queries = 1
2012-10-09 16:43:40,893 [0x7f642ac2f700] [DEBUG]: Initialized query (1102854034623)
2012-10-09 16:43:40,893 [0x7f642ac2f700] [DEBUG]: Parsing query(1102854034623): remove (MerraMonthly)
2012-10-09 16:43:40,896 [0x7f642ac2f700] [DEBUG]: Acquiring 1 array locks for query 1102854034623
2012-10-09 16:43:40,896 [0x7f642ac2f700] [DEBUG]: Acquiring lock: Lock: arrayName=MerraMonthly, arrayId=0, queryId=1102854034623, instanceId=0, instanceRole=COORD, lockMode=4, arrayVersion=0, arrayVersionId=0, timestamp=1850
2012-10-09 16:43:40,896 [0x7f642ac2f700] [DEBUG]: SystemCatalog::lockArray: Lock: arrayName=MerraMonthly, arrayId=0, queryId=1102854034623, instanceId=0, instanceRole=COORD, lockMode=4, arrayVersion=0, arrayVersionId=0, timestamp=1850

[/code]

Then I did a “scidb.py stopall merra” followed by a “scidb.py startall merra” and tried again.
I have one coordinator running on a private network on the ib interface on 10.0.0.15
and 9 servers on 10.0.0.1 - 10.0.0.9 inclusive.

After this, the afl remove command worked, and here is the log after the stop/start and tracking the remove command:

[code]2012-10-09 16:50:03,437 [0x7fd3ef68a7c0] [DEBUG]: Waiting for the first message
2012-10-09 16:50:03,438 [0x7fd3ef68a7c0] [DEBUG]: Connection started from CLIENT (127.0.0.1)
2012-10-09 16:50:23,536 [0x7fd3ef66a700] [DEBUG]: Generated queryID: instanceID=0, time=1349815823, clock=40000, nextID=0, queryID=1100861483599
2012-10-09 16:50:23,537 [0x7fd3ef66a700] [DEBUG]: Allocating query (1100861483599)
2012-10-09 16:50:23,537 [0x7fd3ef66a700] [DEBUG]: Number of allocated queries = 1
2012-10-09 16:50:23,602 [0x7fd3ef66a700] [DEBUG]: Initialized query (1100861483599)
2012-10-09 16:50:23,602 [0x7fd3ef66a700] [DEBUG]: Parsing query(1100861483599): remove (MerraMonthly)
2012-10-09 16:50:23,607 [0x7fd3ef66a700] [DEBUG]: Acquiring 1 array locks for query 1100861483599
2012-10-09 16:50:23,607 [0x7fd3ef66a700] [DEBUG]: Acquiring lock: Lock: arrayName=MerraMonthly, arrayId=0, queryId=1100861483599, instanceId=0, instanceRole=COORD, lockMode=4, arrayVersion=0, arrayVersionId=0, timestamp=1
2012-10-09 16:50:23,607 [0x7fd3ef66a700] [DEBUG]: SystemCatalog::lockArray: Lock: arrayName=MerraMonthly, arrayId=0, queryId=1100861483599, instanceId=0, instanceRole=COORD, lockMode=4, arrayVersion=0, arrayVersionId=0, timestamp=1
2012-10-09 16:50:23,626 [0x7fd3ef66a700] [DEBUG]: Inferred schema for operator remove: MerraMonthly<H:float,T:float,U:float,V:float,QV:float,O3:float,Cov_U_V:float,Cov_U_T:float,Cov_V_T:float,Cov_U_QV:float,Cov_V_QV:float,vsts:float,Var_H:float,Var_T:float,Var_U:float,Var_V:float,Var_QV:float,Var_O3:float> [day=197701:*,100,0,height=0:41,100,0,xdim=0:539,100,0,ydim=0:360,100,0]
2012-10-09 16:50:23,626 [0x7fd3ef66a700] [DEBUG]: The query is prepared
2012-10-09 16:50:23,626 [0x7fd3ef66a700] [DEBUG]: The result preparation of query is sent to the client
2012-10-09 16:50:23,626 [0x7fd3e5876700] [DEBUG]: Executing query(1100861483599): remove (MerraMonthly)
2012-10-09 16:50:23,626 [0x7fd3e5876700] [DEBUG]: Creating Habilis optimizer instance
2012-10-09 16:50:23,627 [0x7fd3e5876700] [DEBUG]: Query is optimized
2012-10-09 16:50:23,627 [0x7fd3e5876700] [DEBUG]: The physical plan is detected as DML
2012-10-09 16:50:23,627 [0x7fd3e5876700] [DEBUG]:
[pPlan]:

[pInstance] physicalRemove agg 0 ddl 0 tile 0 children 0
schema MerraMonthly<H:float,T:float,U:float,V:float,QV:float,O3:float,Cov_U_V:float,Cov_U_T:float,Cov_V_T:float,Cov_U_QV:float,Cov_V_QV:float,vsts:float,Var_H:float,Var_T:float,Var_U:float,Var_V:float,Var_QV:float,Var_O3:float> [day=197701:*,100,0,height=0:41,100,0,xdim=0:539,100,0,ydim=0:360,100,0]
props sgm 1 sgo 1
distr roro
bound start [197701, 0, 0, 0] end [4611686018427387903, 41, 539, 360] density 1 cells 4611686018427387903 chunks 4611686018427387903 est_bytes 4.19202e+21

2012-10-09 16:50:23,627 [0x7fd3e5876700] [DEBUG]: (Pre)Single executing queryID: 1100861483599
2012-10-09 16:50:23,628 [0x7fd3e5876700] [DEBUG]: Query is serialized: [pPlan]:

[pInstance] physicalRemove agg 0 ddl 0 tile 0 children 0
schema MerraMonthly<H:float,T:float,U:float,V:float,QV:float,O3:float,Cov_U_V:float,Cov_U_T:float,Cov_V_T:float,Cov_U_QV:float,Cov_V_QV:float,vsts:float,Var_H:float,Var_T:float,Var_U:float,Var_V:float,Var_QV:float,Var_O3:float> [day=197701:*,100,0,height=0:41,100,0,xdim=0:539,100,0,ydim=0:360,100,0]
props sgm 1 sgo 1
distr roro
bound start [197701, 0, 0, 0] end [4611686018427387903, 41, 539, 360] density 1 cells 4611686018427387903 chunks 4611686018427387903 est_bytes 4.19202e+21

2012-10-09 16:50:23,628 [0x7fd3e5876700] [DEBUG]: Prepare physical plan was sent out
2012-10-09 16:50:23,628 [0x7fd3e5876700] [DEBUG]: Waiting confirmation about preparing physical plan in queryID from 9 instances
2012-10-09 16:50:23,629 [0x7fd3ef68a7c0] [DEBUG]: Connected to instance 1 (10.0.0.1), 10.0.0.1:1240
2012-10-09 16:50:23,629 [0x7fd3ef68a7c0] [DEBUG]: Connected to instance 5 (10.0.0.5), 10.0.0.5:1240
2012-10-09 16:50:23,629 [0x7fd3ef68a7c0] [DEBUG]: Connected to instance 2 (10.0.0.2), 10.0.0.2:1240
2012-10-09 16:50:23,629 [0x7fd3ef68a7c0] [DEBUG]: Connected to instance 6 (10.0.0.6), 10.0.0.6:1240
2012-10-09 16:50:23,629 [0x7fd3ef68a7c0] [DEBUG]: Connected to instance 3 (10.0.0.3), 10.0.0.3:1240
2012-10-09 16:50:23,629 [0x7fd3ef68a7c0] [DEBUG]: Connected to instance 7 (10.0.0.7), 10.0.0.7:1240
2012-10-09 16:50:23,629 [0x7fd3ef68a7c0] [DEBUG]: Connected to instance 4 (10.0.0.4), 10.0.0.4:1240
2012-10-09 16:50:23,629 [0x7fd3ef68a7c0] [DEBUG]: Connected to instance 8 (10.0.0.8), 10.0.0.8:1240
2012-10-09 16:50:23,629 [0x7fd3ef68a7c0] [DEBUG]: Connected to instance 9 (10.0.0.9), 10.0.0.9:1240
2012-10-09 16:50:23,636 [0x7fd3ef68a7c0] [DEBUG]: Waiting for the first message
2012-10-09 16:50:23,636 [0x7fd3ef68a7c0] [DEBUG]: Connection started from CLIENT (10.0.0.1)
2012-10-09 16:50:23,636 [0x7fd3ef68a7c0] [DEBUG]: Waiting for the first message
2012-10-09 16:50:23,636 [0x7fd3ef68a7c0] [DEBUG]: Connection started from CLIENT (10.0.0.3)
2012-10-09 16:50:23,638 [0x7fd3ef68a7c0] [DEBUG]: Waiting for the first message
2012-10-09 16:50:23,638 [0x7fd3ef68a7c0] [DEBUG]: Connection started from CLIENT (10.0.0.9)
2012-10-09 16:50:23,638 [0x7fd3e5775700] [DEBUG]: Notify on processing query 1100861483599 from instance 1
2012-10-09 16:50:23,638 [0x7fd3ef68a7c0] [DEBUG]: Waiting for the first message
2012-10-09 16:50:23,638 [0x7fd3ef68a7c0] [DEBUG]: Connection started from CLIENT (10.0.0.4)
2012-10-09 16:50:23,638 [0x7fd3e5674700] [DEBUG]: Notify on processing query 1100861483599 from instance 3
2012-10-09 16:50:23,638 [0x7fd3ef68a7c0] [DEBUG]: Waiting for the first message
2012-10-09 16:50:23,638 [0x7fd3ef68a7c0] [DEBUG]: Connection started from CLIENT (10.0.0.2)
2012-10-09 16:50:23,638 [0x7fd3ef68a7c0] [DEBUG]: Waiting for the first message
2012-10-09 16:50:23,638 [0x7fd3ef68a7c0] [DEBUG]: Connection started from CLIENT (10.0.0.5)
2012-10-09 16:50:23,638 [0x7fd3e5573700] [DEBUG]: Notify on processing query 1100861483599 from instance 9
2012-10-09 16:50:23,638 [0x7fd3e5573700] [DEBUG]: Notify on processing query 1100861483599 from instance 4
2012-10-09 16:50:23,638 [0x7fd3e5573700] [DEBUG]: Notify on processing query 1100861483599 from instance 2
2012-10-09 16:50:23,638 [0x7fd3e5573700] [DEBUG]: Notify on processing query 1100861483599 from instance 5
2012-10-09 16:50:23,639 [0x7fd3ef68a7c0] [DEBUG]: Waiting for the first message
2012-10-09 16:50:23,639 [0x7fd3ef68a7c0] [DEBUG]: Connection started from CLIENT (10.0.0.6)
2012-10-09 16:50:23,639 [0x7fd3ef66a700] [DEBUG]: Notify on processing query 1100861483599 from instance 6
2012-10-09 16:50:23,641 [0x7fd3ef68a7c0] [DEBUG]: Waiting for the first message
2012-10-09 16:50:23,641 [0x7fd3ef68a7c0] [DEBUG]: Connection started from CLIENT (10.0.0.7)
2012-10-09 16:50:23,641 [0x7fd3e5573700] [DEBUG]: Notify on processing query 1100861483599 from instance 7
2012-10-09 16:50:23,641 [0x7fd3ef68a7c0] [DEBUG]: Waiting for the first message
2012-10-09 16:50:23,641 [0x7fd3ef68a7c0] [DEBUG]: Connection started from CLIENT (10.0.0.8)
2012-10-09 16:50:23,642 [0x7fd3e5674700] [DEBUG]: Notify on processing query 1100861483599 from instance 8
2012-10-09 16:50:23,642 [0x7fd3e5876700] [DEBUG]: Execute physical plan was sent out
2012-10-09 16:50:23,642 [0x7fd3e5876700] [INFO ]: Executing query(1100861483599): remove (MerraMonthly)
2012-10-09 16:50:23,642 [0x7fd3e5876700] [DEBUG]: Request exclusive lock of array MerraMonthly for query 1100861483599
2012-10-09 16:50:23,642 [0x7fd3e5876700] [DEBUG]: Granted exclusive lock of array MerraMonthly for query 1100861483599
2012-10-09 16:50:23,643 [0x7fd3e5876700] [DEBUG]: Query is executed locally
2012-10-09 16:50:23,650 [0x7fd3e5775700] [DEBUG]: Received query result from instance#2, queryID=1100861483599, arrayName=
2012-10-09 16:50:23,650 [0x7fd3ef66a700] [DEBUG]: Received query result from instance#6, queryID=1100861483599, arrayName=
2012-10-09 16:50:23,650 [0x7fd3e5573700] [DEBUG]: Received query result from instance#9, queryID=1100861483599, arrayName=
2012-10-09 16:50:23,650 [0x7fd3e5674700] [DEBUG]: Received query result from instance#1, queryID=1100861483599, arrayName=
2012-10-09 16:50:23,653 [0x7fd3e5775700] [DEBUG]: Received query result from instance#4, queryID=1100861483599, arrayName=
2012-10-09 16:50:23,653 [0x7fd3ef66a700] [DEBUG]: Received query result from instance#3, queryID=1100861483599, arrayName=
2012-10-09 16:50:23,654 [0x7fd3e5573700] [DEBUG]: Received query result from instance#5, queryID=1100861483599, arrayName=
2012-10-09 16:50:23,654 [0x7fd3e5573700] [DEBUG]: Received query result from instance#7, queryID=1100861483599, arrayName=
2012-10-09 16:50:23,654 [0x7fd3e5775700] [DEBUG]: Received query result from instance#8, queryID=1100861483599, arrayName=
2012-10-09 16:50:23,654 [0x7fd3e5876700] [DEBUG]: The responses are received
2012-10-09 16:50:23,654 [0x7fd3e5876700] [DEBUG]: (Post)Single executing queryID: 1100861483599
2012-10-09 16:50:23,666 [0x7fd3e5876700] [DEBUG]: The result of query is returned
2012-10-09 16:50:23,666 [0x7fd3e5876700] [DEBUG]: The result of query is sent to the client
2012-10-09 16:50:23,666 [0x7fd3ef66a700] [DEBUG]: Query (1100861483599) is committed
2012-10-09 16:50:23,667 [0x7fd3ef66a700] [DEBUG]: Deallocating query (1100861483599)
2012-10-09 16:50:23,667 [0x7fd3ef66a700] [DEBUG]: Releasing locks for query 1100861483599
2012-10-09 16:50:23,667 [0x7fd3ef66a700] [DEBUG]: SystemCatalog::deleteArrayLocks instanceId = 0 queryId = 1100861483599
2012-10-09 16:50:23,674 [0x7fd3ef66a700] [DEBUG]: Release lock of array MerraMonthly for query 1100861483599

[/code]

Hope this is good info…

Cheers, George


#5

All the above was committed with 12.3.
I axed my painfully build partial array and switching over to 12.7 for a fresh (re)start.
Should anything be different?
I did notice, that 12.7 does support binary load (WooHoo) chapter 5.3
Loading the monthly arrays for MERRA took unacceptably long: about 20 minutes per dataset.
A dataset is 42x361x540 array of 22 independent floats. The 20 minutes included the text generation from the HDF-4 files piped through csv2scidb into a named pipe, loading the 1D array from the named pipe. This is followed by a redimension into a temp array which was then merged with the master MerraMonthly array.
Each incremental load went do a different “day” dimension (even though it’s month… I named it day. Saw it in an example on the forum and it stuck).

This way it would take me almost 130 hours to load everything–if it worked (See previous post about fsync errors)

Cheers, George