Redimension doesn't finish


#1

Hi,

I want to load datasets of 350 GB, 450 GB, and larger, so I first loaded the data into a flat (1-D) array.
Then I ran a redimension operation to convert the 1-D array into a 3-D array, and I saw the following log output.

2014-11-20 22:32:44,129 [0x2b85c8e38700] [DEBUG]: Prepare physical plan was sent out
2014-11-20 22:32:44,129 [0x2b85c8e38700] [DEBUG]: Waiting confirmation about preparing physical plan in queryID from 21 instances
2014-11-20 22:32:44,129 [0x2b85c8e38700] [INFO ]: Executing query(1099765780851): store(redimension(between(NSST_1D,1,5000000000), NSST),NSST); from program: 127.0.0.1:53443/opt/scidb/14.8/bin/iquery ;
2014-11-20 22:32:44,129 [0x2b85c8e38700] [DEBUG]: Waiting notification in queryID from 21 instances
2014-11-20 22:32:44,129 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 1
2014-11-20 22:32:44,130 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 3
2014-11-20 22:32:44,130 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 4
2014-11-20 22:32:44,130 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 7
2014-11-20 22:32:44,130 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 8
2014-11-20 22:32:44,130 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 14
2014-11-20 22:32:44,130 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 2
2014-11-20 22:32:44,130 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 9
2014-11-20 22:32:44,130 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 10
2014-11-20 22:32:44,130 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 5
2014-11-20 22:32:44,130 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 11
2014-11-20 22:32:44,130 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 18
2014-11-20 22:32:44,130 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 6
2014-11-20 22:32:44,130 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 13
2014-11-20 22:32:44,130 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 16
2014-11-20 22:32:44,130 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 12
2014-11-20 22:32:44,130 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 17
2014-11-20 22:32:44,130 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 15
2014-11-20 22:32:44,130 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 20
2014-11-20 22:32:44,130 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 21
2014-11-20 22:32:44,130 [0x2b85c8d37700] [DEBUG]: ServerMessageHandleJob::handleNotify: Notify on processing query 1099765780851 from instance 19
2014-11-20 22:32:44,130 [0x2b85c8e38700] [DEBUG]: Send message from coordinator for waiting instances in queryID: 1099765780851
2014-11-20 22:35:40,041 [0x2b85c8e38700] [DEBUG]: [RedimStore] inputArray --> redimensioned took 175902 ms, or 2 minutes 55 seconds 902 milliseconds
2014-11-20 22:35:40,156 [0x2b85c8e38700] [DEBUG]: [SortArray] Sort for array  begins
2014-11-20 22:35:46,492 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 1
2014-11-20 22:35:48,151 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 2
2014-11-20 22:35:54,013 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 3
2014-11-20 22:35:55,629 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 4
2014-11-20 22:36:01,456 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 5
2014-11-20 22:36:03,209 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 6
2014-11-20 22:36:09,158 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 7
2014-11-20 22:36:10,950 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 8
2014-11-20 22:36:16,632 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 9
2014-11-20 22:36:18,997 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 10
2014-11-20 22:36:23,988 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 11
2014-11-20 22:36:26,964 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 12
2014-11-20 22:36:31,051 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 13
2014-11-20 22:36:34,462 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 14
2014-11-20 22:36:38,290 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 15
2014-11-20 22:36:42,062 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 16
2014-11-20 22:36:45,922 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 17
2014-11-20 22:36:49,591 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 18
2014-11-20 22:36:53,766 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 19
2014-11-20 22:36:56,785 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 20
2014-11-20 22:37:01,484 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 21
2014-11-20 22:37:10,545 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 22
2014-11-20 22:37:14,300 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 23
2014-11-20 22:37:18,293 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 24
2014-11-20 22:37:21,671 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 25
2014-11-20 22:37:26,019 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 26
2014-11-20 22:37:29,394 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 27
2014-11-20 22:37:34,103 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 28
2014-11-20 22:37:37,905 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 29
2014-11-20 22:37:42,299 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 30
2014-11-20 22:37:45,757 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 31
2014-11-20 22:37:50,405 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 32
2014-11-20 22:37:53,394 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 33
2014-11-20 22:37:53,804 [0x2b85e0100700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-11-20 22:37:57,963 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 34
2014-11-20 22:38:06,689 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 35
2014-11-20 22:38:16,274 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 36
2014-11-20 22:38:18,932 [0x2b85e0201700] [DEBUG]: [SortArray] Produced sorted run # 37
2014-11-20 22:38:23,409 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 38
2014-11-20 22:38:25,485 [0x2b85e0201700] [DEBUG]: [SortArray] Produced sorted run # 39
2014-11-20 22:38:30,586 [0x2b85e0302700] [DEBUG]: [SortArray] Produced sorted run # 40
2014-11-20 22:38:30,965 [0x2b85e0403700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-11-20 22:38:32,603 [0x2b85e0201700] [DEBUG]: [SortArray] Produced sorted run # 41
2014-11-20 22:38:39,712 [0x2b85e0201700] [DEBUG]: [SortArray] Produced sorted run # 42
2014-11-20 22:38:47,586 [0x2b85e0201700] [DEBUG]: [SortArray] Produced sorted run # 43
2014-11-20 22:38:55,397 [0x2b85e0201700] [DEBUG]: [SortArray] Produced sorted run # 44
2014-11-20 22:38:58,204 [0x2b85e0100700] [DEBUG]: [SortArray] Produced sorted run # 45
2014-11-20 22:39:02,884 [0x2b85e0201700] [DEBUG]: [SortArray] Produced sorted run # 46
2014-11-20 22:39:04,936 [0x2b85e0100700] [DEBUG]: [SortArray] Produced sorted run # 47
2014-11-20 22:39:05,375 [0x2b85e0302700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-11-20 22:39:06,839 [0x2b85c8a089e0] [DEBUG]: Disconnected
2014-11-20 22:39:12,032 [0x2b85e0201700] [DEBUG]: [SortArray] Produced sorted run # 48
2014-11-20 22:39:13,657 [0x2b85c8a089e0] [ERROR]: Network error in handleSendMessage #32('Broken pipe'), instance 20 (10.150.20.69)
2014-11-20 22:39:13,657 [0x2b85c8a089e0] [DEBUG]: Recovering connection to instance 20
2014-11-20 22:39:13,657 [0x2b85c8a089e0] [DEBUG]: Connected to instance 20 (10.150.20.69), jupiter10:1240
2014-11-20 22:39:20,254 [0x2b85e0201700] [DEBUG]: [SortArray] Produced sorted run # 49
2014-11-20 22:39:20,471 [0x2b85c8a089e0] [DEBUG]: Disconnected
2014-11-20 22:39:28,657 [0x2b85c8a089e0] [ERROR]: Network error in handleSendMessage #32('Broken pipe'), instance 7 (10.150.20.62)
2014-11-20 22:39:28,657 [0x2b85c8a089e0] [DEBUG]: Recovering connection to instance 7
2014-11-20 22:39:28,657 [0x2b85c8a089e0] [DEBUG]: Connected to instance 7 (10.150.20.62), jupiter03:1241
2014-11-20 22:39:31,881 [0x2b85e0201700] [DEBUG]: [SortArray] Produced sorted run # 50
2014-11-20 22:39:38,396 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 51
2014-11-20 22:39:39,710 [0x2b85e0201700] [DEBUG]: [SortArray] Produced sorted run # 52
2014-11-20 22:39:42,435 [0x2b85c8a089e0] [DEBUG]: Disconnected
2014-11-20 22:39:45,479 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 53
2014-11-20 22:39:46,429 [0x2b85e0201700] [DEBUG]: [SortArray] Produced sorted run # 54
2014-11-20 22:39:46,714 [0x2b85e0100700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-11-20 22:39:48,657 [0x2b85c8a089e0] [ERROR]: Network error in handleSendMessage #32('Broken pipe'), instance 17 (10.150.20.67)
2014-11-20 22:39:48,657 [0x2b85c8a089e0] [DEBUG]: Recovering connection to instance 17
2014-11-20 22:39:48,658 [0x2b85c8a089e0] [DEBUG]: Connected to instance 17 (10.150.20.67), jupiter08:1241
2014-11-20 22:39:53,573 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 55
2014-11-20 22:40:00,684 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 56
2014-11-20 22:40:09,445 [0x2b85e0302700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-11-20 22:40:10,513 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 57
2014-11-20 22:40:10,988 [0x2b85c8a089e0] [DEBUG]: Disconnected
2014-11-20 22:40:16,156 [0x2b85e0403700] [DEBUG]: [SortArray] Produced sorted run # 58
2014-11-20 22:40:16,396 [0x2b85e0201700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-11-20 22:40:18,657 [0x2b85c8a089e0] [ERROR]: Network error in handleSendMessage #32('Broken pipe'), instance 3 (10.150.20.60)
2014-11-20 22:40:18,657 [0x2b85c8a089e0] [DEBUG]: Recovering connection to instance 3
2014-11-20 22:40:18,657 [0x2b85c8a089e0] [DEBUG]: Connected to instance 3 (10.150.20.60), jupiter01:1241
2014-11-20 22:40:30,769 [0x2b85c8a089e0] [DEBUG]: Disconnected
2014-11-20 22:40:38,657 [0x2b85c8a089e0] [ERROR]: Network error in handleSendMessage #32('Broken pipe'), instance 8 (10.150.20.63)
2014-11-20 22:40:38,657 [0x2b85c8a089e0] [DEBUG]: Recovering connection to instance 8
2014-11-20 22:40:38,658 [0x2b85c8a089e0] [DEBUG]: Connected to instance 8 (10.150.20.63), jupiter04:1240
2014-11-20 22:40:39,598 [0x2b85c8a089e0] [DEBUG]: Disconnected
2014-11-20 22:40:48,657 [0x2b85c8a089e0] [ERROR]: Network error in handleSendMessage #32('Broken pipe'), instance 15 (10.150.20.66)
2014-11-20 22:40:48,657 [0x2b85c8a089e0] [DEBUG]: Recovering connection to instance 15
2014-11-20 22:40:48,658 [0x2b85c8a089e0] [DEBUG]: Connected to instance 15 (10.150.20.66), jupiter07:1241
2014-11-20 22:40:51,912 [0x2b85e0100700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-11-20 22:41:06,493 [0x2b85c8a089e0] [DEBUG]: Disconnected
2014-11-20 22:41:13,658 [0x2b85c8a089e0] [ERROR]: Network error in handleSendMessage #32('Broken pipe'), instance 12 (10.150.20.65)
2014-11-20 22:41:13,658 [0x2b85c8a089e0] [DEBUG]: Recovering connection to instance 12
2014-11-20 22:41:13,658 [0x2b85c8a089e0] [DEBUG]: Connected to instance 12 (10.150.20.65), jupiter06:1240
2014-11-20 22:41:25,939 [0x2b85e0403700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-11-20 22:41:31,020 [0x2b85c8a089e0] [DEBUG]: Disconnected
2014-11-20 22:41:38,658 [0x2b85c8a089e0] [ERROR]: Network error in handleSendMessage #32('Broken pipe'), instance 13 (10.150.20.65)
2014-11-20 22:41:38,658 [0x2b85c8a089e0] [DEBUG]: Recovering connection to instance 13
2014-11-20 22:41:38,658 [0x2b85c8a089e0] [DEBUG]: Connected to instance 13 (10.150.20.65), jupiter06:1241
2014-11-20 22:41:59,394 [0x2b85c8a089e0] [DEBUG]: Disconnected
2014-11-20 22:42:08,658 [0x2b85c8a089e0] [ERROR]: Network error in handleSendMessage #32('Broken pipe'), instance 9 (10.150.20.63)
2014-11-20 22:42:08,658 [0x2b85c8a089e0] [DEBUG]: Recovering connection to instance 9
2014-11-20 22:42:08,659 [0x2b85c8a089e0] [DEBUG]: Connected to instance 9 (10.150.20.63), jupiter04:1241
2014-11-20 22:42:48,780 [0x2b85e0302700] [DEBUG]: [SortArray] Found 2 runs to merge
2014-11-20 22:44:17,152 [0x2b85c8e38700] [DEBUG]: [SortArray] merge sorted chunks complete took 516936 ms, or 8 minutes 36 seconds 936 milliseconds
2014-11-20 22:44:17,322 [0x2b85c8e38700] [DEBUG]: [RedimStore] redimensioned sorted took 517231 ms, or 8 minutes 37 seconds 231 milliseconds
2014-11-20 22:44:17,324 [0x2b85c8e38700] [DEBUG]: [RedimStore] synthetic dimension populated took 0 ms, or 0 millisecond
2014-11-20 22:47:48,257 [0x2b85c8e38700] [DEBUG]: [RedimStore] redimensioned --> beforeRedistribution took 210933 ms, or 3 minutes 30 seconds 933 milliseconds
2014-11-20 22:47:49,745 [0x2b85c8e38700] [DEBUG]: SG started with partitioning schema = 1, instanceID = 18446744073709551615
2014-11-20 22:47:49,748 [0x2b85c8e38700] [DEBUG]: Temporary array was opened
2014-11-20 22:47:49,748 [0x2b85c8e38700] [DEBUG]: Sending barrier to every one and waiting for 21 barrier messages

After this last log line, no further log entries appeared,
and my query never completes.

My arrays are defined as follows:

{1} 'NSST',22,'NSST<value:int64> [col=0:8639,102,0,row=0:4319,102,0,time=2002185:2010145,2048,0]',true,false
{2} 'NSST_1D',9,'NSST_1D<row:int64,col:int64,time:int64,value:int64> [i=0:*,100000,0]',true,false

The chunk sizes of NSST were computed by calculate_chunk_length.py.
The data set contains 13,623,600,000 values.
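
As a rough sanity check of the chunk geometry, the sketch below works out how many cells one NSST chunk can hold and how many chunk positions the schema defines. The bounds and chunk lengths are copied from the schema above; the helper name chunk_stats is made up for illustration. The per-chunk figure is only a logical upper bound, since the real cell count per chunk depends on how sparse the data is.

```python
import math

# Dimension bounds and chunk lengths copied from the NSST schema in this post:
# name -> (low, high, chunk_length)
dims = {
    "col": (0, 8639, 102),
    "row": (0, 4319, 102),
    "time": (2002185, 2010145, 2048),
}


def chunk_stats(dims):
    """Return (cells per chunk, number of chunk positions, logical cell count)."""
    cells_per_chunk = 1
    chunk_count = 1
    logical_cells = 1
    for low, high, chunk_len in dims.values():
        length = high - low + 1
        cells_per_chunk *= chunk_len
        chunk_count *= math.ceil(length / chunk_len)
        logical_cells *= length
    return cells_per_chunk, chunk_count, logical_cells


cells_per_chunk, chunk_count, logical_cells = chunk_stats(dims)
print("logical cells per chunk:", cells_per_chunk)
print("chunk positions:", chunk_count)
print("logical cells in the array:", logical_cells)
```

Comparing the loaded value count (13,623,600,000) with the logical cell count gives a feel for how densely the chunks are filled.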

My cluster consists of 11 nodes, each with 32 GB of RAM and 4 CPU cores.
My SciDB configuration is:

[mydb]
server-0=jupiter00,1
server-1=jupiter01,2
server-2=jupiter02,2
server-3=jupiter03,2
server-4=jupiter04,2
server-5=jupiter05,2
server-6=jupiter06,2
server-7=jupiter07,2
server-8=jupiter08,2
server-9=jupiter09,2
server-10=jupiter10,2
db_user=mydb
db_passwd=mydb
install_root=/opt/scidb/14.8
metadata=/opt/scidb/14.8/share/scidb/meta.sql
pluginsdir=/opt/scidb/14.8/lib/scidb/plugins
logconf=/opt/scidb/14.8/share/scidb/log4cxx.properties
base-path=/work/scidb/data
base-port=1239
smgr-cache-size=2048
mem-array-threshold=2048
merge-sort-buffer=512
network-buffer=1024
replication-send-queue-size=1000
replication-receive-queue-size=1000
max-memory-limit=10000

execution-threads=2
operator-threads=2
result-prefetch-threads=4
result-prefetch-queue-size=2

Please help me.
Thanks.


#2

The ‘Broken pipe’ errors suggest that several instances crashed (or somehow failed).
One possibility is that redimension() ran out of memory (a known problem), so you should check the /X/Y/scidb.log files on those SciDB instances.
One clue indicating an out-of-memory condition would be a line like this:
2014-11-18 12:54:9 (ppid=2482): SciDB child (pid=2498) terminated by signal = 9
in /X/Y/scidb-stderr.log.
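
A minimal sketch of how such lines could be searched for is shown below. The regular expression is modeled only on the example line above, so the exact wording may differ on your version, and the function name find_killed_children is made up for illustration. Signal 9 is SIGKILL, which is what the Linux out-of-memory killer sends.

```python
import re

# Pattern modeled on the example line quoted above (an assumption about the
# log format; adjust it if your scidb-stderr.log is worded differently).
KILLED = re.compile(r"SciDB child \(pid=(\d+)\) terminated by signal = (\d+)")


def find_killed_children(lines):
    """Return a list of (pid, signal) pairs, one per matching log line."""
    hits = []
    for line in lines:
        match = KILLED.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits


sample = [
    "2014-11-18 12:54:9 (ppid=2482): SciDB child (pid=2498) terminated by signal = 9",
    "some unrelated line",
]
print(find_killed_children(sample))  # prints [(2498, 9)]
```

To use it on a real file, open the scidb-stderr.log of each instance and pass the file object to find_killed_children().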
If the problem is indeed redimension() running out of memory, you should try to load your data into the target array incrementally with insert(redimension()…).
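
One way to do that, sketched here under assumptions, is to slice the flat array by its index i with between() and run one insert(redimension(...)) per slice. The script only generates the query strings. The batch size of 500,000,000 is a made-up value to be tuned to your memory, and the index range 0 to 13,623,599,999 assumes the flat array was loaded densely starting at i=0.

```python
def batch_queries(source, target, first, last, batch_size):
    """Yield one AFL insert(redimension(between(...))) query per index range."""
    lo = first
    while lo <= last:
        hi = min(lo + batch_size - 1, last)
        yield f"insert(redimension(between({source},{lo},{hi}),{target}),{target});"
        lo = hi + 1


# Hypothetical values: index range and batch size are assumptions, not
# numbers confirmed anywhere in this thread.
queries = list(batch_queries("NSST_1D", "NSST", 0, 13_623_599_999, 500_000_000))

for query in queries[:3]:
    print(query)
print("total batches:", len(queries))
```

Each generated string can then be run one at a time through iquery, waiting for a batch to finish before starting the next, so that only one slice is redimensioned in memory at once.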