Redimension fails


#1

Hi,

I’ve got a problem with redimensioning a large flat array in SciDB.
I imported 56 GB of binary data into a flat array and tried to redimension it into a 2-dimensional array. After running for ~50 minutes, the query aborts with this message:

SystemException in file: src/network/BaseConnection.h function: receive line: 310
Error id: scidb::SCIDB_SE_NETWORK::SCIDB_LE_CANT_SEND_RECEIVE
Error description: Network error. Cannot send or receive network messages.

I checked the scidb.log file of the coordinator instance around the time the query aborted:

...
2014-10-29 11:41:43,089 [0x7fd3a6789700] [DEBUG]: Send message from coordinator for waiting instances in queryID: 1103631937128
2014-10-29 12:32:00,358 [0x7fd3a6789700] [DEBUG]: [RedimStore] inputArray --> redimensioned took 3017175 ms, or 50 minutes 17 seconds 175 milliseconds
2014-10-29 12:32:00,611 [0x7fd3a6789700] [DEBUG]: [SortArray] Sort for array  begins
2014-10-29 12:32:31,735 [0x7fd38fdfd700] [ERROR]: Job::execute: unhandled exception
type: N5scidb15SystemExceptionE
jobType: N5scidb9SortArray7SortJobE
mesg: SystemException in file: src/util/DataStore.cpp function: readData line: 268
Error id: scidb::SCIDB_SE_STORAGE::SCIDB_LE_DATASTORE_CHUNK_CORRUPTED
Error description: Storage error. Chunk header corrupted in DataStore with guid '/srv/scidbdata/14.8/tmp/memarray/0.data' offset '97001472'.
queryID = 1103631937128
2014-10-29 12:32:38,916 [0x7fd38fcfc700] [ERROR]: Job::execute: unhandled exception
type: N5scidb15SystemExceptionE
jobType: N5scidb9SortArray7SortJobE
mesg: SystemException in file: src/util/DataStore.cpp function: readData line: 268
Error id: scidb::SCIDB_SE_STORAGE::SCIDB_LE_DATASTORE_CHUNK_CORRUPTED
Error description: Storage error. Chunk header corrupted in DataStore with guid '/srv/scidbdata/14.8/tmp/memarray/0.data' offset '92700672'.
queryID = 1103631937128
2014-10-29 12:32:38,950 [0x7fd38fbfb700] [ERROR]: Job::execute: unhandled exception
type: N5scidb15SystemExceptionE
jobType: N5scidb9SortArray7SortJobE
mesg: SystemException in file: src/util/DataStore.cpp function: readData line: 268
Error id: scidb::SCIDB_SE_STORAGE::SCIDB_LE_DATASTORE_CHUNK_CORRUPTED
Error description: Storage error. Chunk header corrupted in DataStore with guid '/srv/scidbdata/14.8/tmp/memarray/0.data' offset '93093888'.
queryID = 1103631937128
2014-10-29 12:32:43,576 [0x7fd3b1e837c0] [INFO ]: Start SciDB instance (pid=28020). SciDB Version: 14.8.7978. Build Type: RelWithDebInfo. Copyright (C) 2008-2014 SciDB, Inc.
2014-10-29 12:32:43,585 [0x7fd3b1e837c0] [INFO ]: Configuration:
precision : 6
...

Does “2014-10-29 12:32:00,358 [0x7fd3a6789700] [DEBUG]: [RedimStore] inputArray --> redimensioned took 3017175 ms” mean that the redim actually finished and something in the post-processing went wrong?

Also, I get several “SCIDB_LE_DATASTORE_CHUNK_CORRUPTED” exceptions. Is there anything I can do about that, and what needs to be corrected to fix it?


#2

Hi Bonsai—
You have revealed a bug in the way SciDB refers to files in temporary space. In the default SciDB configuration, each instance has its own temp directory. Unfortunately, if you use the “tmp-path” option in the config.ini file, this results in all of the SciDB instances on a server using the same temp directory. When this happens we can get name conflicts for some temp files and see the errors that you experienced.

I have filed a bug in our tracker for this issue.

In the meantime, you will need to ensure that each of your instances is using a separate temp directory. The simplest way to do this is to remove the “tmp-path” entry from your config.ini and restart SciDB. If you want your instances to share a pool of temp space, then you will need to use symlinks to redirect each single instance temp directory to a separate subdirectory within the shared location. Let us know whether you continue to have issues.
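
For example, a minimal sketch of that symlink approach might look like the following (the shared pool location and the per-instance tmp paths are placeholders; verify where your instances actually create their temp files under base-path, and do this with SciDB stopped):

# Give each of the two instances on this host its own subdirectory in a
# shared temp pool, then point each instance's tmp directory at it.
for i in 0 1; do
  mkdir -p /srv/scidbtmp/instance_${i}
  rm -rf /srv/scidbdata/14.8/data/000/${i}/tmp     # remove the old per-instance tmp dir
  ln -s /srv/scidbtmp/instance_${i} /srv/scidbdata/14.8/data/000/${i}/tmp
done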
–Steve F
Paradigm4


#3

Hi Steve,

thanks for your help :smile:

[quote=“stevef”]
The simplest way to do this is to remove the “tmp-path” entry from your config.ini and restart SciDB. [/quote]
OK, I removed the tmp-path and reinitialised the cluster. Now I get a different error in the SortArray stage :wink:
Again, my query runs for ~45 min and then aborts.

2014-10-30 14:21:05,186 [0x7fc2eab9a700] [DEBUG]: Send message from coordinator for waiting instances in queryID: 1103733643066
2014-10-30 15:04:29,374 [0x7fc2f6adf7c0] [DEBUG]: Disconnected
2014-10-30 15:04:32,006 [0x7fc2f6adf7c0] [DEBUG]: Disconnected
2014-10-30 15:04:34,513 [0x7fc2f6adf7c0] [ERROR]: Network error in handleSendMessage #104('Connection reset by peer'), instance 2 (10.10.10.136)
2014-10-30 15:04:34,606 [0x7fc2f6adf7c0] [DEBUG]: Recovering connection to instance 2
2014-10-30 15:04:34,900 [0x7fc2f6adf7c0] [DEBUG]: Connected to instance 2 (10.10.10.136), vm-scidb-02:1240
2014-10-30 15:04:39,162 [0x7fc2f6adf7c0] [ERROR]: Network error in handleSendMessage #32('Broken pipe'), instance 3 (10.10.10.136)
2014-10-30 15:04:39,162 [0x7fc2f6adf7c0] [DEBUG]: Recovering connection to instance 3
2014-10-30 15:04:39,263 [0x7fc2f6adf7c0] [DEBUG]: Connected to instance 3 (10.10.10.136), vm-scidb-02:1241
2014-10-30 15:07:06,166 [0x7fc2eab9a700] [DEBUG]: [RedimStore] inputArray --> redimensioned took 2760846 ms, or 46 minutes 846 milliseconds
2014-10-30 15:07:06,305 [0x7fc2eab9a700] [DEBUG]: [SortArray] Sort for array  begins
2014-10-30 15:07:42,862 [0x7fc2f6adf7c0] [DEBUG]: Disconnected
2014-10-30 15:07:46,320 [0x7fc2f6adf7c0] [ERROR]: Network error in handleSendMessage #104('Connection reset by peer'), instance 1 (127.0.1.1)
2014-10-30 15:07:46,320 [0x7fc2f6adf7c0] [DEBUG]: Recovering connection to instance 1
2014-10-30 15:07:46,333 [0x7fc2f6adf7c0] [ERROR]: Could not get the remote IP from connected socket to/frominstance 1. Error:107('Transport endpoint is not connected')
2014-10-30 15:07:46,333 [0x7fc2f6adf7c0] [DEBUG]: Connected to instance 1, vm-scidb-01:1240
2014-10-30 15:07:51,308 [0x7fc2f6adf7c0] [ERROR]: Network error in handleSendMessage #32('Broken pipe'), instance 1
2014-10-30 15:07:51,309 [0x7fc2f6adf7c0] [DEBUG]: Recovering connection to instance 1
2014-10-30 15:07:51,310 [0x7fc2f6adf7c0] [DEBUG]: Connected to instance 1 (127.0.1.1), vm-scidb-01:1240
2014-10-30 15:08:24,353 [0x7fc2e8fb9700] [ERROR]: Job::execute: unhandled exception
type: St9bad_alloc
jobType: N5scidb9SortArray7SortJobE
mesg: std::bad_alloc
queryID = 1103733643066
2014-10-30 15:08:24,360 [0x7fc2e8fb9700] [ERROR]: Thread::threadFunction: unhandled exception: std::bad_alloc, tid = 140475109185280, pool = 0x7fc2dc002bd0
2014-10-30 15:08:26,132 [0x7fc2f6adf7c0] [INFO ]: Start SciDB instance (pid=6062). SciDB Version: 14.8.7978. Build Type: RelWithDebInfo. Copyright (C) 2008-2014 SciDB, Inc.
2014-10-30 15:08:26,133 [0x7fc2f6adf7c0] [INFO ]: Configuration:
precision : 6
...

As a side note: I didn’t notice any network problems during that time…

From the log of instance 1 for the same time:

2014-10-30 14:21:05,186 [0x7fe68dd10700] [DEBUG]: ServerMessageHandleJob::handleWait: Wait on processing query 1103733643066
2014-10-30 15:04:29,344 [0x7fe69a15a7c0] [DEBUG]: Disconnected
2014-10-30 15:04:32,006 [0x7fe69a15a7c0] [DEBUG]: Disconnected
2014-10-30 15:04:34,727 [0x7fe69a15a7c0] [ERROR]: Network error in handleSendMessage #104('Connection reset by peer'), instance 2 (10.10.10.136)
2014-10-30 15:04:34,727 [0x7fe69a15a7c0] [DEBUG]: Recovering connection to instance 2
2014-10-30 15:04:34,872 [0x7fe69a15a7c0] [DEBUG]: Connected to instance 2 (10.10.10.136), vm-scidb-02:1240
2014-10-30 15:04:39,160 [0x7fe69a15a7c0] [ERROR]: Network error in handleSendMessage #32('Broken pipe'), instance 3 (10.10.10.136)
2014-10-30 15:04:39,160 [0x7fe69a15a7c0] [DEBUG]: Recovering connection to instance 3
2014-10-30 15:04:39,162 [0x7fe69a15a7c0] [DEBUG]: Connected to instance 3 (10.10.10.136), vm-scidb-02:1241
2014-10-30 15:06:30,183 [0x7fe68eb1e700] [DEBUG]: [RedimStore] inputArray --> redimensioned took 2724931 ms, or 45 minutes 24 seconds 931 milliseconds
2014-10-30 15:06:30,310 [0x7fe68eb1e700] [DEBUG]: [SortArray] Sort for array  begins
2014-10-30 15:07:36,852 [0x7fe68ca3e700] [ERROR]: Job::execute: unhandled exception
type: St9bad_alloc
jobType: N5scidb9SortArray7SortJobE
mesg: std::bad_alloc
queryID = 1103733643066
2014-10-30 15:07:36,894 [0x7fe68ca3e700] [ERROR]: Thread::threadFunction: unhandled exception: std::bad_alloc, tid = 140628178757376, pool = 0x7fe6880019a0
2014-10-30 15:07:43,915 [0x7fe69a15a7c0] [INFO ]: Start SciDB instance (pid=6033). SciDB Version: 14.8.7978. Build Type: RelWithDebInfo. Copyright (C) 2008-2014 SciDB, Inc.

From the log of instance 2 for the same time:

2014-10-30 14:21:12,454 [0x7fc059f44700] [DEBUG]: ServerMessageHandleJob::handleWait: Wait on processing query 1103733643066
2014-10-30 15:03:08,142 [0x7fc059e43700] [DEBUG]: [RedimStore] inputArray --> redimensioned took 2515531 ms, or 41 minutes 55 seconds 531 milliseconds
2014-10-30 15:03:08,311 [0x7fc059e43700] [DEBUG]: [SortArray] Sort for array  begins
2014-10-30 15:04:33,006 [0x7fc058665700] [ERROR]: Job::execute: unhandled exception
type: St9bad_alloc
jobType: N5scidb9SortArray7SortJobE
mesg: std::bad_alloc
queryID = 1103733643066
2014-10-30 15:04:33,439 [0x7fc058665700] [ERROR]: Thread::threadFunction: unhandled exception: std::bad_alloc, tid = 140464093550336, pool = 0x7fc0540038b0
2014-10-30 15:04:37,037 [0x7fc0652867c0] [INFO ]: Start SciDB instance (pid=8080). SciDB Version: 14.8.7978. Build Type: RelWithDebInfo. Copyright (C) 2008-2014 SciDB, Inc.

From the log of instance 3 for the same time:

2014-10-30 14:21:12,454 [0x7f1b07aa4700] [DEBUG]: ServerMessageHandleJob::handleWait: Wait on processing query 1103733643066
2014-10-30 15:03:17,764 [0x7f1b078a2700] [DEBUG]: [RedimStore] inputArray --> redimensioned took 2525291 ms, or 42 minutes 5 seconds 291 milliseconds
2014-10-30 15:03:17,820 [0x7f1b078a2700] [DEBUG]: [SortArray] Sort for array  begins
2014-10-30 15:04:36,479 [0x7f1b13ff87c0] [DEBUG]: Disconnected
2014-10-30 15:04:38,206 [0x7f1b060cb700] [ERROR]: Job::execute: unhandled exception
type: St9bad_alloc
jobType: N5scidb9SortArray7SortJobE
mesg: std::bad_alloc
queryID = 1103733643066
2014-10-30 15:04:38,210 [0x7f1b060cb700] [ERROR]: Thread::threadFunction: unhandled exception: std::bad_alloc, tid = 139754042341120, pool = 0x7f1b000038b0
2014-10-30 15:04:39,303 [0x7f1b13ff87c0] [INFO ]: Start SciDB instance (pid=8082). SciDB Version: 14.8.7978. Build Type: RelWithDebInfo. Copyright (C) 2008-2014 SciDB, Inc.

Btw: the directories for the instances on my 2 hosts are named in a way that doesn’t seem very straightforward to me:
On the first host (with instances 0 and 1), the folders are:
…/data/000/0 and
…/data/000/1
On the second host (with instances 2 and 3), the folders are:
…/data/001/1 and
…/data/001/2
Is that intentional/correct? I’m just surprised that there are two folders “1” for two different instances.


#4

I played around with the memory configuration to try and get rid of these exceptions myself.
I’ve now got two VMs with 16 GB of RAM and the following config:

[Datacube]
server-0=vm-scidb-01,1
server-1=vm-scidb-02,2
install_root=/opt/scidb/14.8
metadata=/opt/scidb/14.8/share/scidb/meta.sql
pluginsdir=/opt/scidb/14.8/lib/scidb/plugins
logconf=/opt/scidb/14.8/share/scidb/log4cxx.properties
base-path=/srv/scidbdata/14.8/data
base-port=1239
smgr-cache-size=1024
mem-array-threshold=1024
merge-sort-buffer=512
network-buffer=512
max-memory-limit=5000
execution-threads=1
result-prefetch-queue-size=1
result-prefetch-threads=1
operator-threads=1

Is that an appropriate config?

In any case, I now get yet another exception. This time my query runs for ~190 minutes and then fails with:

SystemException in file: src/network/NetworkManager.cpp function: handleConnectionError line: 853
Error id: scidb::SCIDB_SE_NETWORK::SCIDB_LE_CONNECTION_ERROR2
Error description: Network error. Connection error while sending.

From the coordinator log:

2014-10-31 16:37:07,670 [0x7f4c2477e700] [DEBUG]: Send message from coordinator for waiting instances in queryID: 1100926547606
2014-10-31 16:59:51,048 [0x7f4c2477e700] [DEBUG]: [RedimStore] inputArray --> redimensioned took 1363377 ms, or 22 minutes 43 seconds 377 milliseconds
2014-10-31 16:59:51,048 [0x7f4c2477e700] [DEBUG]: [SortArray] Sort for array  begins
2014-10-31 17:00:00,575 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 1
2014-10-31 17:00:10,813 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 2
2014-10-31 17:00:24,748 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 3
2014-10-31 17:00:35,820 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 4
2014-10-31 17:00:47,108 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 5
2014-10-31 17:00:58,236 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 6
2014-10-31 17:01:09,869 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 7
2014-10-31 17:01:20,575 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 8
2014-10-31 17:01:31,157 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 9
2014-10-31 17:01:43,450 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 10
2014-10-31 17:01:53,849 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 11
2014-10-31 17:02:05,596 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 12
2014-10-31 17:02:17,881 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 13
2014-10-31 17:02:28,049 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 14
2014-10-31 17:02:39,135 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 15
2014-10-31 17:02:51,818 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 16
2014-10-31 17:03:04,658 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 17
2014-10-31 17:03:15,307 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 18
2014-10-31 17:03:25,098 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 19
2014-10-31 17:03:35,341 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 20
2014-10-31 17:03:47,308 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 21
2014-10-31 17:03:59,553 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 22
2014-10-31 17:04:09,761 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 23
2014-10-31 17:04:21,540 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 24
2014-10-31 17:04:32,514 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 25
2014-10-31 17:04:44,914 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 26
2014-10-31 17:04:56,002 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 27
2014-10-31 17:05:09,815 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 28
2014-10-31 17:05:21,244 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 29
2014-10-31 17:05:31,726 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 30
2014-10-31 17:05:41,303 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 31
2014-10-31 17:05:51,169 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 32
2014-10-31 17:06:03,036 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 33
2014-10-31 17:06:03,588 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 17:06:41,801 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 34
2014-10-31 17:06:52,636 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 35
2014-10-31 17:07:03,874 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 36
2014-10-31 17:07:13,789 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 37
2014-10-31 17:07:25,006 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 38
2014-10-31 17:07:35,567 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 39
2014-10-31 17:07:48,368 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 40
2014-10-31 17:07:48,872 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 17:08:26,292 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 41
2014-10-31 17:08:35,861 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 42
2014-10-31 17:08:46,067 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 43
2014-10-31 17:08:56,601 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 44
2014-10-31 17:09:06,462 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 45
2014-10-31 17:09:15,957 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 46
2014-10-31 17:09:27,537 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 47
2014-10-31 17:09:28,021 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 17:10:18,325 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 48
2014-10-31 17:10:28,131 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 49
2014-10-31 17:10:38,527 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 50
2014-10-31 17:10:48,180 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 51
2014-10-31 17:10:57,613 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 52
2014-10-31 17:11:07,239 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 53
2014-10-31 17:11:17,220 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 54
2014-10-31 17:11:17,733 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 17:12:07,523 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 55
2014-10-31 17:12:17,634 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 56
2014-10-31 17:12:30,845 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 57
2014-10-31 17:12:40,854 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 58
2014-10-31 17:12:50,874 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 59
2014-10-31 17:13:01,459 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 60
2014-10-31 17:13:13,575 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 61
2014-10-31 17:13:14,092 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 17:14:42,521 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 62
2014-10-31 17:14:53,051 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 63
2014-10-31 17:15:02,946 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 64
2014-10-31 17:15:13,430 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 65
2014-10-31 17:15:23,231 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 66
2014-10-31 17:15:34,005 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 67
2014-10-31 17:15:43,906 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 68
2014-10-31 17:15:44,418 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 17:17:05,596 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 69
2014-10-31 17:17:16,346 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 70
2014-10-31 17:17:28,072 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 71
2014-10-31 17:17:38,155 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 72
2014-10-31 17:17:48,106 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 73
2014-10-31 17:17:58,099 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 74
2014-10-31 17:18:09,301 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 75
2014-10-31 17:18:09,928 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 17:19:28,196 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 76
2014-10-31 17:19:38,784 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 77
2014-10-31 17:19:48,466 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 78
2014-10-31 17:19:57,591 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 79
2014-10-31 17:20:06,601 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 80
2014-10-31 17:20:16,365 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 81
2014-10-31 17:20:26,472 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 82
2014-10-31 17:20:26,985 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 17:21:46,067 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 83
2014-10-31 17:21:56,373 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 84
2014-10-31 17:22:09,060 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 85
2014-10-31 17:22:20,360 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 86
2014-10-31 17:22:30,507 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 87
2014-10-31 17:22:40,005 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 88
2014-10-31 17:22:50,668 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 89
2014-10-31 17:22:51,214 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 17:24:51,360 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 90
2014-10-31 17:25:00,821 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 91
2014-10-31 17:25:12,176 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 92
2014-10-31 17:25:22,700 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 93
2014-10-31 17:25:32,190 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 94
2014-10-31 17:25:42,968 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 95
2014-10-31 17:25:53,771 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 96
2014-10-31 17:25:54,341 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 17:27:56,671 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 97
2014-10-31 17:28:06,598 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 98
2014-10-31 17:28:15,768 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 99
2014-10-31 17:28:25,424 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 100
2014-10-31 17:28:35,721 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 101
2014-10-31 17:28:49,115 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 102
2014-10-31 17:28:58,759 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 103
2014-10-31 17:28:59,269 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 17:31:10,996 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 104
2014-10-31 17:31:22,129 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 105
2014-10-31 17:31:33,287 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 106
2014-10-31 17:31:44,920 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 107
2014-10-31 17:31:57,600 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 108
2014-10-31 17:32:09,344 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 109
2014-10-31 17:32:20,628 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 110
2014-10-31 17:32:21,224 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 17:34:39,599 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 111
2014-10-31 17:34:50,230 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 112
2014-10-31 17:35:00,519 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 113
2014-10-31 17:35:11,735 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 114
2014-10-31 17:35:25,856 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 115
2014-10-31 17:35:38,505 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 116
2014-10-31 17:35:49,432 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 117
2014-10-31 17:35:49,945 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 17:38:57,745 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 118
2014-10-31 17:39:09,525 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 119
2014-10-31 17:39:26,642 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 120
2014-10-31 17:39:38,455 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 121
2014-10-31 17:39:49,751 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 122
2014-10-31 17:40:02,236 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 123
2014-10-31 17:40:14,157 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 124
2014-10-31 17:40:14,764 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 17:43:23,838 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 125
2014-10-31 17:43:35,164 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 126
2014-10-31 17:43:46,133 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 127
2014-10-31 17:43:56,787 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 128
2014-10-31 17:44:07,997 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 129
2014-10-31 17:44:18,596 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 130
2014-10-31 17:44:32,822 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 131
2014-10-31 17:44:33,366 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 17:47:38,589 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 132
2014-10-31 17:47:50,704 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 133
2014-10-31 17:48:00,803 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 134
2014-10-31 17:48:14,246 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 135
2014-10-31 17:48:26,395 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 136
2014-10-31 17:48:38,257 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 137
2014-10-31 17:48:50,450 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 138
2014-10-31 17:48:51,164 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 17:52:01,709 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 139
2014-10-31 17:52:12,330 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 140
2014-10-31 17:52:23,348 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 141
2014-10-31 17:52:33,989 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 142
2014-10-31 17:52:49,393 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 143
2014-10-31 17:53:00,254 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 144
2014-10-31 17:53:11,775 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 145
2014-10-31 17:53:12,290 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 17:57:02,016 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 146
2014-10-31 17:57:13,872 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 147
2014-10-31 17:57:30,758 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 148
2014-10-31 17:57:42,896 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 149
2014-10-31 17:57:54,202 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 150
2014-10-31 17:58:05,687 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 151
2014-10-31 17:58:17,015 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 152
2014-10-31 17:58:17,601 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 18:02:21,862 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 153
2014-10-31 18:02:32,828 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 154
2014-10-31 18:02:44,492 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 155
2014-10-31 18:02:55,794 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 156
2014-10-31 18:03:08,022 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 157
2014-10-31 18:03:19,018 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 158
2014-10-31 18:03:31,097 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 159
2014-10-31 18:03:31,730 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 18:07:29,899 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 160
2014-10-31 18:07:41,303 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 161
2014-10-31 18:07:52,599 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 162
2014-10-31 18:08:06,229 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 163
2014-10-31 18:08:17,687 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 164
2014-10-31 18:08:28,963 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 165
2014-10-31 18:08:41,886 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 166
2014-10-31 18:08:42,467 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 18:12:36,592 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 167
2014-10-31 18:12:48,400 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 168
2014-10-31 18:13:02,540 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 169
2014-10-31 18:13:14,101 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 170
2014-10-31 18:13:29,421 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 171
2014-10-31 18:13:40,680 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 172
2014-10-31 18:13:52,451 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 173
2014-10-31 18:13:53,064 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 18:18:26,167 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 174
2014-10-31 18:18:37,578 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 175
2014-10-31 18:18:49,766 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 176
2014-10-31 18:18:59,875 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 177
2014-10-31 18:19:10,003 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 178
2014-10-31 18:19:26,757 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 179
2014-10-31 18:19:40,190 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 180
2014-10-31 18:19:40,743 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 18:24:17,226 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 181
2014-10-31 18:24:29,169 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 182
2014-10-31 18:24:45,713 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 183
2014-10-31 18:24:56,959 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 184
2014-10-31 18:25:08,132 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 185
2014-10-31 18:25:26,435 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 186
2014-10-31 18:25:39,718 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 187
2014-10-31 18:25:40,276 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 18:30:25,237 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 188
2014-10-31 18:30:39,833 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 189
2014-10-31 18:30:58,231 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 190
2014-10-31 18:31:10,349 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 191
2014-10-31 18:31:22,644 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 192
2014-10-31 18:31:34,418 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 193
2014-10-31 18:31:46,156 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 194
2014-10-31 18:31:46,809 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 18:36:37,483 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 195
2014-10-31 18:36:53,565 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 196
2014-10-31 18:37:05,273 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 197
2014-10-31 18:37:16,754 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 198
2014-10-31 18:37:34,863 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 199
2014-10-31 18:37:46,951 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 200
2014-10-31 18:37:59,745 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 201
2014-10-31 18:38:00,434 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 18:43:39,198 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 202
2014-10-31 18:43:50,067 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 203
2014-10-31 18:44:02,210 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 204
2014-10-31 18:44:14,944 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 205
2014-10-31 18:44:25,980 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 206
2014-10-31 18:44:38,795 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 207
2014-10-31 18:44:51,575 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 208
2014-10-31 18:44:52,122 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 18:50:27,333 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 209
2014-10-31 18:50:37,809 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 210
2014-10-31 18:50:53,939 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 211
2014-10-31 18:51:07,533 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 212
2014-10-31 18:51:21,070 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 213
2014-10-31 18:51:34,877 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 214
2014-10-31 18:51:46,588 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 215
2014-10-31 18:51:47,123 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 18:57:33,029 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 216
2014-10-31 18:57:51,197 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 217
2014-10-31 18:58:03,659 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 218
2014-10-31 18:58:17,800 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 219
2014-10-31 18:58:31,057 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 220
2014-10-31 18:58:42,467 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 221
2014-10-31 18:58:53,052 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 222
2014-10-31 18:58:53,578 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 19:05:22,744 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 223
2014-10-31 19:05:36,791 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 224
2014-10-31 19:05:55,501 [0x7f4bb745b700] [DEBUG]: [SortArray] Produced sorted run # 225
2014-10-31 19:05:56,066 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 19:13:08,569 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 19:19:46,516 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 19:25:38,712 [0x7f4bb745b700] [DEBUG]: [SortArray] Found 8 runs to merge
2014-10-31 19:49:11,057 [0x7f4bb745b700] [DEBUG]: pwritev failed, fd=21 size=3374042 offs=10628366336 rc=1011712 errno=2
2014-10-31 19:49:14,026 [0x7f4c247a87c0] [DEBUG]: Disconnected
2014-10-31 19:49:16,934 [0x7f4bb745b700] [ERROR]: Job::execute: unhandled exception
type: N5scidb15SystemExceptionE
jobType: N5scidb9SortArray8MergeJobE
mesg: SystemException in file: src/util/FileIO.cpp function: writeAllv line: 304
Error id: scidb::SCIDB_SE_IO::SCIDB_LE_PWRITE_ERROR
Error description: I/O error. pwrite failed to write 3374042 byte(s) to the position 10628366336 with error 2.
queryID = 1100926547606
2014-10-31 19:49:17,003 [0x7f4c2477e700] [DEBUG]: [SortArray] merge sorted chunks complete took 10165887 ms, or 2 hours 49 minutes 25 seconds 887 milliseconds
2014-10-31 19:49:17,063 [0x7f4c2477e700] [DEBUG]: [RedimStore] redimensioned sorted took 10166015 ms, or 2 hours 49 minutes 26 seconds 15 milliseconds
2014-10-31 19:49:17,063 [0x7f4c2477e700] [DEBUG]: [RedimStore] synthetic dimension populated took 0 ms, or 0 millisecond
2014-10-31 19:49:17,224 [0x7f4c2477e700] [DEBUG]: [RedimStore] redimensioned --> beforeRedistribution took 161 ms, or 161 milliseconds
2014-10-31 19:49:17,751 [0x7f4c247a87c0] [ERROR]: Network error in handleSendMessage #104('Connection reset by peer'), instance 1 (127.0.1.1)
2014-10-31 19:49:17,754 [0x7f4c247a87c0] [DEBUG]: Recovering connection to instance 1
2014-10-31 19:49:17,870 [0x7f4c247a87c0] [ERROR]: Could not get the remote IP from connected socket to/frominstance 1. Error:107('Transport endpoint is not connected')
2014-10-31 19:49:17,870 [0x7f4c247a87c0] [DEBUG]: Connected to instance 1, vm-scidb-01:1240
2014-10-31 19:49:22,738 [0x7f4c247a87c0] [ERROR]: Network error in handleSendMessage #32('Broken pipe'), instance 1
2014-10-31 19:49:23,002 [0x7f4c247a87c0] [DEBUG]: Recovering connection to instance 1
2014-10-31 19:49:23,003 [0x7f4c247a87c0] [ERROR]: Could not get the remote IP from connected socket to/frominstance 1. Error:107('Transport endpoint is not connected')
2014-10-31 19:49:23,003 [0x7f4c247a87c0] [DEBUG]: Connected to instance 1, vm-scidb-01:1240
2014-10-31 19:49:23,190 [0x7f4c2477e700] [DEBUG]: SG started with partitioning schema = 1, instanceID = 18446744073709551615
2014-10-31 19:49:23,199 [0x7f4c2477e700] [DEBUG]: Temporary array was opened
2014-10-31 19:49:23,199 [0x7f4c2477e700] [DEBUG]: Sending barrier to every one and waiting for 3 barrier messages
2014-10-31 19:49:23,209 [0x7f4c247a87c0] [ERROR]: Network error in handleSendMessage #32('Broken pipe'), instance 1
2014-10-31 19:49:23,209 [0x7f4c247a87c0] [ERROR]: NetworkManager::handleConnectionError: Conection error in query 1100926547606
2014-10-31 19:49:23,216 [0x7f4c247a87c0] [DEBUG]: Recovering connection to instance 1
2014-10-31 19:49:26,217 [0x7f4c247a87c0] [DEBUG]: Connected to instance 1 (127.0.1.1), vm-scidb-01:1240
2014-10-31 19:49:33,230 [0x7f4c2477e700] [DEBUG]: Query::done: queryID=1100926547606, _commitState=0, errorCode=120
2014-10-31 19:49:33,230 [0x7f4c2477e700] [ERROR]: executeClientQuery failed to complete: SystemException in file: src/network/NetworkManager.cpp function: handleConnectionError line: 853
Error id: scidb::SCIDB_SE_NETWORK::SCIDB_LE_CONNECTION_ERROR2
Error description: Network error. Connection error while sending.
2014-10-31 19:49:33,230 [0x7f4c2477e700] [DEBUG]: Query (1100926547606) is being aborted
2014-10-31 19:49:33,230 [0x7f4c2477e700] [ERROR]: Query (1100926547606) error handlers (2) are being executed
2014-10-31 19:49:33,231 [0x7f4c2477e700] [DEBUG]: Update error handler is invoked for query (1100926547606)
2014-10-31 19:49:33,477 [0x7f4c2477e700] [DEBUG]: UpdateErrorHandler::handleErrorOnCoordinator: the new version 1 of array ascat_soil_moisture_all is being rolled back for query (1100926547606)
2014-10-31 19:49:33,477 [0x7f4c2477e700] [DEBUG]: Performing rollback
2014-10-31 19:49:33,537 [0x7f4c2477e700] [DEBUG]: End of log at position 708560 rc=136
2014-10-31 19:49:33,537 [0x7f4c2477e700] [DEBUG]: End of log at position 0 rc=0
2014-10-31 19:49:33,538 [0x7f4c2477e700] [DEBUG]: Rollback complete
2014-10-31 19:49:33,569 [0x7f4c2477e700] [DEBUG]: Broadcast ABORT message to all instances for query 1100926547606

I’m surprised about the network connection errors to instance 1 because it resides on the same machine as the coordinator.
As for “pwrite failed to write 3374042 byte(s) to the position 10628366336 with error 2.”: does this mean there was no space left on the HDD?


#5

Hi Bonsai—
Your config looks reasonable, and the naming of the directories also looks correct. For some esoteric reasons, instance numbers on the coordinator node start at “0”, while on the other nodes they start at “1”.

As for your errors, it does appear that you are running out of temporary space. In 14.8, the error path for out-of-space conditions is not correct, which results in the wrong error number (“2”) being reported, as well as the exception not being handled properly. That, in turn, produces the “network connection” errors that you see. In the upcoming release we have fixed this error handling so that when space on the file system is exhausted, SciDB reports the proper error and fails more gracefully.

Of course the real question now is: why are you running out of space? Redimension ends up sorting the data in your input array. Given that you have 16 GB of RAM but 45 GB of data, much of the array data will spill to disk into temporary storage. We need to figure out how much space you have on your hard drive and in which directories. Can you re-run the query and periodically capture the output of the “df” command during the run?
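
For example, something as simple as the loop below, run in a second shell while the query is active, would capture that (a rough sketch; the path is the base-path from this thread, and the interval and log file name are arbitrary placeholders):

# Snapshot free disk space once a minute while the redimension runs.
# /srv/scidbdata/14.8/data is the base-path from this thread; adjust as needed,
# and add your temp location if it lives on a different file system.
while true; do
  date >> /tmp/df_during_redim.log
  df -h /srv/scidbdata/14.8/data >> /tmp/df_during_redim.log
  sleep 60
done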
–Steve


#6

Hello,

I am also experiencing big trouble with the redimension operator and the version system. It doesn’t exactly fail, but it slows down dramatically. I have many “small” flat arrays (~2500 of them) and I am redimensioning them into one big 2-dim array. The performance at the beginning is somewhat OK (still way too slow, but the job could finish in finite time), but after some time the speed of the redimension degrades to an unacceptable level. Please explain why this happens and whether it is possible to work around this issue (and not by adding more hardware; in my opinion a 4-CPU node with 8 GB should easily handle some 20-50 GB of data).

The problem with the version system is that it consumes way too much hard disk space. In my case I am just inserting data with almost no overlaps, and disk usage explodes (easily filling up the 300 GB with 10 GB of net data). It almost looks as if you snapshot whole chunks.

I am using a virtual machine with Ubuntu 12.04, 4 CPUs, 8 GB RAM, a 300 GB hard disk, and SciDB 14.8. Setup was done with the cluster installer.

The configuration is

[ascat]
server-0=localhost,3
install_root=/opt/scidb/14.8
metadata=/opt/scidb/14.8/share/scidb/meta.sql
pluginsdir=/opt/scidb/14.8/lib/scidb/plugins
logconf=/opt/scidb/14.8/share/scidb/log4cxx.properties
db_user=pguser
db_passwd=pgpass
base-port=1239
base-path=/home/scidb/scidb_data
redundancy=0
mem-array-threshold=338
smgr-cache-size=338
merge-sort-buffer=169
network-buffer=169
execution-threads=1
result-prefetch-threads=1
result-prefetch-queue-size=1
operator-threads=1

####################
# WARNING: Make sure all of the below directories are created
# before you execute 'initall'

# Instance storage directories on localhost
data-dir-prefix-0-0=/datadisk0/ascat.0.0
data-dir-prefix-0-1=/datadisk1/ascat.0.1
data-dir-prefix-0-2=/datadisk0/ascat.0.2
data-dir-prefix-0-3=/datadisk1/ascat.0.3

The source arrays look like this:

create array ascat_soil_moisture_flat_"$CELLID" \
      <gpi: int64, \
       lat: float, \
       lon: float, \
       time: int64, \
       sm: uint8, \
       sm_noise: uint8, \
       ssf: uint8, \
       proc_flag: uint8, \
       orbit_dir: uint8> \
      [i=0:"$INDEX",1000000,0]; \
    LOAD(ascat_soil_moisture_flat_"$CELLID",'"$filename"',0,'(int64,float,float,int64,uint8,uint8,uint8,uint8,uint8)',0);"

The redimension shell script:

#!/bin/bash

echo "start redimension"
ARRAYNAME="ascat_soil_moisture_all"

iquery --ignore-errors -naq "remove($ARRAYNAME)" 2>/dev/null

iquery -naq "\
  create array $ARRAYNAME < \
      sm:uint8, \
      sm_noise:uint8, \
      ssf:uint8, \
      proc_flag:uint8, \
      orbit_dir:uint8> \
  [ gpi=0:3264390,437,0, \
    time=55000000:65000000,10000,0 ];"

for filename in /home/scidb/testdata/ASCAT_netcdf/*.bin; do
  # File names look like <CELLID>_<INDEX>.bin; split on '_' to recover both parts.
  TMP=$(basename ${filename} .bin)
  IFS='_' read -ra DATA <<< "$TMP"
  echo "CELLID ${DATA[0]}, INDEX ${DATA[1]}"
  CELLID=${DATA[0]}
  # Redimension the per-cell flat array and insert it into the big 2-dim array.
  time iquery --ignore-errors -naq "insert(redimension(ascat_soil_moisture_flat_$CELLID, $ARRAYNAME), $ARRAYNAME)"
  # Keep only the latest version of the target array to limit disk usage.
  VERSIONNUM=$(iquery -aq "aggregate(versions($ARRAYNAME), max(version_id))" | sed -n 's|^{0} \([0-9]\+\)$|\1|p')
  iquery -naq "remove_versions($ARRAYNAME, $VERSIONNUM);"
done
echo "done redimension"

Attached you find the performance log. As you can see in the redimension shell script, I am only logging the duration of the redimension/insert operation.
A line like “CELLID 0028, INDEX 13107” indicates flat array no. 0028 with 13107 entries.

Ahh… the redimension ran for approx. 28 h, and the runtime information of the virtual machine is:


Runtime Attributes

Screen Resolution 1440x1050x32 @0,0
Clipboard Mode Disabled
Drag'n'Drop Mode Disabled
VT-x/AMD-V Enabled
Nested Paging Enabled
Unrestricted Execution Disabled
Guest Additions 4.3.10 r93012
Guest OS Type Linux 2.6 / 3.x (64 bit)
Remote Desktop Server Port Not Available


Storage Statistics 
Controller: IDE

  IDE Secondary Master:
    DMA Transfers 1
    PIO Transfers 52.886
    Data Read 0 B
    Data Written 0 B


Controller: SATA
  SATA Port 0:
    DMA Transfers 66.593.428
    Data Read 446.050.414.592 B
    Data Written 603.569.939.456 B

Network Statistics

Adapter 1
  Data Transmitted 216.442 B
  Data Received 1.463.208 B

Any help is appreciated.

Best regards!
redimension.log (30.4 KB)


#7

Hello,

[quote]In my case i am just inserting data with almost no overlaps and disk usage explodes (filling up the 300GB easily with 10GB of net data).[/quote]
Could you elaborate on your example? We do store complete chunk copies; however, if you have 10 GB of raw data, you should get at most ~10 GB of redimensioned data. Each redimension() may use up to 5 times the input in temp space, but that has no effect on the stored redimensioned array. Also, remove_versions() can be used to remove overlapping data from previous versions.
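
For example (the array name is the one used later in this thread; replace the version number with whatever versions() reports as the newest on your system):

# List the versions that currently exist for the target array.
iquery -aq "versions(ascat_soil_moisture_all)"
# Drop every version older than version 5; version 5 and later are kept.
iquery -naq "remove_versions(ascat_soil_moisture_all, 5)"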

[quote]Ahh… the redimension ran for appr. 28h and the Runtime information of the virtual machine is …[/quote]
Most likely, SciDB is running out of memory. It is a known problem with redimension(), which causes memory bloat; we are investigating it. The only 100% effective workaround is to restart SciDB after a few redimensions. Reducing some config.ini parameters may also help: smgr-cache-size=128, mem-array-threshold=256, merge-sort-pipeline-limit=16.
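
In config.ini terms, those suggestions amount to something like the snippet below (only these three lines change; everything else stays as in the [ascat] section posted earlier, and the new values take effect after a SciDB restart):

[ascat]
# ... other entries unchanged ...
smgr-cache-size=128
mem-array-threshold=256
merge-sort-pipeline-limit=16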


#8

Hello tigor,

That’s not exactly true… Let’s say I have 100 small flat arrays (loaded from binary files), and I execute redimension on each of these flat arrays and insert it into one multidimensional array; then every insert can potentially double the physical size of the multidimensional array (if each chunk is touched). You are aiming at big data with your technology, so you have to keep the storage overhead as small as possible.

I will try these settings, but it doesn’t look like a memory problem caused by memory leaks: when I restart the SciDB instances and continue my redimension/insert, already the first redimension/insert takes very long. I also tried to issue an analyze command on the multidimensional array, but the operation did not finish; I canceled it after 2 hours. To me it looks like an issue with your internal data dictionary when an array is modified (inserted into) several times.


#9

So … I’m looking over the shoulder of this thread. I think we’re looking at a couple of problems with SciDB (we’re working on these). But I have a couple of thoughts / suggestions that are more intended to help you at the application development level. (Although, the script I’m pasting into this reply is one of the tools we’re using to pick apart the problems you’re running into, so feel free to comment on how close we are to your use-case, especially the data generator…)

1. Rather than loading-and-redimensioning one file at a time, it can be more efficient (overall) to load data in a more “batched” style.

Rather than creating a single 1D load array for each input binary file, create a 2D array, load a set of binary files into it (one “slice” at a time), and then transform the entire batch into the target. Something like this:

#
#  Starting with some variables we'll use throughout. Change
#  the names here to whatever you want. This script is also
#  intended to measure some things like storage utilization, so
#  to get it to work on your setup, change the DATA_DIR 
#  directory name appropriately.
#
LOAD_ARRAY_1D_NAME="ascat_soil_moisture_load_1D"
LOAD_ARRAY_2D_NAME="ascat_soil_moisture_load_2D"
ARRAY_2D_NAME="ascat_soil_moisture"
DATA_DIR=~/Devel/Data/000/0/datastores
#
#  Hygiene. Make the script re-runnable.
#
iquery --ignore-errors -aq "remove ( ${LOAD_ARRAY_1D_NAME} )"
iquery --ignore-errors -aq "remove ( ${LOAD_ARRAY_2D_NAME} )"
iquery --ignore-errors -aq "remove ( ${ARRAY_2D_NAME} )"
#

So. Rather than a series of arrays with names like my_load_array_CELLNUM, bring the CELLNUM into a dimension. Something like this …

#
#   So. The first improvement idea is to organize the load away from a "one
#  array per file" model, to a "one array for a block of files" model. In the
#  following array, the CELLID that used to name each distinct array becomes a dimension (cellnum).
#
iquery -aq "
CREATE ARRAY ${LOAD_ARRAY_2D_NAME}
<   gpi      : int64,
    lat      : float,
    lon      : float,
    time     : int64,
    sm       : uint8,
    sm_noise : uint8,
    ssf      : uint8,
    proc_flag: uint8,
    orbit_dir: uint8
>
[ cellnum=0:*,1,0, i=0:*,1000000,0];
"

I created a couple of queries and a loop to generate a number of data files that ( I hope ) look and feel something like the ones you’re using.

#
#  One of the things we do in this script is to generate some data to use
#  in our mock up.
#
#  (Definition of the remove command used in the generation loop below.)
CMD_REMOVE_1D_ARRAY="remove ( ${LOAD_ARRAY_1D_NAME} )"
#
CMD_CREATE_1D_ARRAY="CREATE ARRAY ${LOAD_ARRAY_1D_NAME}
<   gpi      : int64,
    lat      : float,
    lon      : float,
    time     : int64,
    sm       : uint8,
    sm_noise : uint8,
    ssf      : uint8,
    proc_flag: uint8,
    orbit_dir: uint8
>
[ i=0:*,1000000,0];"
#
CMD_POPULATE_1D_ARRAY="SELECT random()%100000 AS gpi,
       float(random()%20000)/float(100.0) AS lat,
       float(random()%40000)/float(100.0) AS lon,
       55000000 + random()%1000000 AS time,
       uint8(random()%256) AS sm,
       uint8(random()%256) AS sm_noise,
       uint8(random()%256) AS ssf,
       uint8(random()%256) AS proc_flag,
       uint8(random()%256) AS orbit_dir
  INTO ${LOAD_ARRAY_1D_NAME}
  FROM build ( < dummy : bool > [ i=0:999999,1000000,0 ], true );
"
#

Now I can use those queries in a loop to generate (say) 10 load files.

#
for i in `seq 10`; do
#
echo "+=================+"
echo "||     ${i}     ||"
echo "+=================+"
#
iquery --ignore-errors -aq "${CMD_REMOVE_1D_ARRAY}"
iquery --ignore-errors -aq "${CMD_CREATE_1D_ARRAY}"
iquery --ignore-errors -nq "${CMD_POPULATE_1D_ARRAY}"
#
CMD_SAVE_1D_ARRAY="save (
  ascat_soil_moisture_load_1D,
  '/tmp/ASM_${i}D.bin',
  0,
  '(int64,float,float,int64,uint8,uint8,uint8,uint8,uint8)'
);"
#
iquery -naq "${CMD_SAVE_1D_ARRAY}"
#
done
#
iquery --ignore-errors -aq "${CMD_REMOVE_1D_ARRAY}"

At this point, it’s worth doing a quick check on the sizes of these files, and getting a baseline for the size of the datastore the SciDB installation has created. The key thing is that we’re using the input(…) operator (which reads data from an external file and hands the rest of the system an object that looks like an array), pumping what it produces through a redimension(…) (to get it into the same “shape” as the 2D load array), and then an insert(…) (to place the data in the array). Note the way that the CELLNUM dimension has a per-dimension chunk length of 1? That means that all of the data from each load file goes into new, distinct chunks. There won’t be any overlapping cells.

#
#  Just a quick check to show that the datastore is empty at this point.
#  Each remove(...) cleans up all of the files used to hold the array data.
#
du -sh ${DATA_DIR}
du -sh /tmp/ASM_*D.bin

Recall the 2D load array we created earlier? The following script illustrates how to take the collection of 10 files in /tmp, and load them all (one at a time) into it. The point is that you can purge / remove this “temporary” 2D load array at the end of each block. Or you can load (say) 100 files into it.

for i in `seq 10`; do

LOAD_CMD="insert (
  redimension (
    apply (
      input (
        < gpi      : int64,
          lat      : float,
          lon      : float,
          time     : int64,
          sm       : uint8,
          sm_noise : uint8,
          ssf      : uint8,
          proc_flag: uint8,
          orbit_dir: uint8
        >
        [ i=0:*,1000000,0 ],
        '/tmp/ASM_${i}D.bin',
        0,
        '(int64,float,float,int64,uint8,uint8,uint8,uint8,uint8)'
      ),
      cellnum, ${i}
    ),
    ${LOAD_ARRAY_2D_NAME}
  ),
  ${LOAD_ARRAY_2D_NAME}
);"
#
#  Log the time for each load.
#
iquery -aq "op_now();"
iquery -naq "${LOAD_CMD}"
#
#  Note how, on each load, the number of cells in the 2D load array increases
#  by the number of cells in the load file.
#
iquery -q "SELECT COUNT(*) FROM ${LOAD_ARRAY_2D_NAME};"
#
#  How much data in the data directory?
#  Note that SciDB allocates storage data in segments that are larger than
#  the size of a collection of chunks.
#
du -sh ${DATA_DIR}

done

The final step is to take the contents of the 2D load array and insert it into the 2D target array. Now … a little bit of “look ahead” into what we’re doing with SciDB here. You used the calculate_chunk_length.py script to set your per-dimension chunk lengths (more on this below). For 14.12, we’re planning to make that script obsolete and to replace it with functionality that works in the CREATE ARRAY … statement. Same basic idea, though: you need an array that already holds the data you’ll be organizing into the new array. The difference is that, with the approach below, you don’t need to edit the CREATE ARRAY … by hand.

CMD_CREATE_ARRAY_2D="
CREATE ARRAY ${ARRAY_2D_NAME}
<
   sm        : uint8,
   sm_noise  : uint8,
   ssf       : uint8,
   proc_flag : uint8,
   orbit_dir : uint8
>
[ gpi=0:*,?,0, time=55000000:65000000,?,0 ] AS ${LOAD_ARRAY_2D_NAME};
"

If you want to see what SciDB chose, use the show('array_name') operator as follows:

$ iquery -aq "show ( ${ARRAY_2D_NAME} );"
{i} schema
{0} 'ascat_soil_moisture<sm:uint8,sm_noise:uint8,ssf:uint8,proc_flag:uint8,orbit_dir:uint8> [gpi=0:*,32768,0,time=55000000:65000000,262144,0]'

So in my case, given the data I generated, SciDB chose the per-dimension chunk lengths shown here.

Converting from the 2D load array into the 2D target array can be accomplished using the redimension(…) operator in exactly the way you were doing it, only each step now involves a batch of (say) 10 data files rather than a single file at a time.

iquery -aq "op_now();"
#
iquery -naq "
insert (
  redimension (
    ${LOAD_ARRAY_2D_NAME},
    ${ARRAY_2D_NAME}
  ),
  ${ARRAY_2D_NAME}
);
"
#
iquery -aq "op_now();"

Overall? It’s best practice to “batch” a set of load files, and re-organize them all together.

2. Give the calculate_chunk_length.py script as much data to work on as you can.

Looking at your per-dimension chunk length values makes me a bit suspicious. 437 x 10000 yields a logical size of 4,370,000 cells. In general, we try for about 1,000,000 physical cells per chunk. So the calculate_chunk_length.py script thinks you’re dealing with sparse data. But because it can work with only one of your 1D load arrays at a time, I suspect it’s not producing ideal values. Each of your netcdf files (I assume) focusses itself on one spatio-temporal region of some larger space. So what you’re getting here is ideal chunking for just that single input, not the size / shape of the space you will get when you combine all of the files together.

I would (at least) try loading a selection of your input data into the 2D load array I’ve created above, and re-run your calculate_chunk_length.py using that collection of input file data. I suspect you will get better chunk organization. If your chunks are too (logically) small, then you’ll be creating a lot more of them, and that will incur (in turn) a lot more meta-data to manage them.

Another way to get a handle on what’s going on is to examine the chunk maps directly. SciDB keeps a lot of meta-data about chunk organization. The following query will tell you, per array, how many chunks you have, how big they are on average (both in terms of physical size and cells per chunk), etc.

# The purpose of this query is to
#  get a handle on how the target array's chunking is actually working: the
#  number of chunks, the total size of the array's storage, and details
#  about the number of cells per chunk.
#
CHECK_CHUNKING_2D="aggregate (
  filter (
    cross_join (
      filter (
        list ('arrays'),
        name = '${LOAD_ARRAY_2D_NAME}'
      ) AS A,
      list('chunk map') AS C
    ),
    A.uaid = C.uaid
  ),
  count(*) AS num_chunks,
  sum ( C.nelem ) AS num_cells,
  min ( C.nelem ) AS min_cells_per_chunk,
  max ( C.nelem) AS max_cells_per_chunk,
  avg ( C.nelem ) AS avg_cells_per_chunk,
  sum ( C.usize ) AS total_array_size_in_bytes,
  sum ( C.csize ) AS total_compressed_array_size_in_bytes,
  sum ( C.asize ) AS total_allocated_size_in_bytes
);"
#
#  Run the query.
#
iquery -aq "${CHECK_CHUNKING_2D}"

The information this query gives you should tell you how close you are to the ideal chunk size, which is (as I mentioned) about 1,000,000 cells-per-chunk (on average).

Anyway … I hope those two suggestions are helpful.


#10

As a side note: When I run calculate_chunk_length.py on the whole dataset in one flat array, I get this:
gpi=0:3264390,262144,0, time_scaled=135140000:416400000,156584,0


#11

[quote]As a side note: When I run calculate_chunk_length.py on the whole dataset in one flat array, I get this:
gpi=0:3264390,262144,0, time_scaled=135140000:416400000,156584,0[/quote]

This doesn’t surprise me. See my point #2 in the response above. I suspect SciDB looked at a single “slice” of data, and made some very poor choices by extrapolating that out to the whole thing.

I suspect that your chunk sizes are very small. Did you try the “chunk sizes from list(‘chunk map’)” query that was the last suggestion above?


#12

[quote=“plumber”][quote]As a side note: When I run calculate_chunk_length.py on the whole dataset in one flat array, I get this:
gpi=0:3264390,262144,0, time_scaled=135140000:416400000,156584,0[/quote]

This doesn’t surprise me. See my point #2 in the response above. I suspect SciDB looked at a single “slice” of data, and made some very poor choices by extrapolating that out to the whole thing. [/quote]

But why would SciDB only look at a slice if I pass the whole dataset into calculate_chunk_length.py?


#13

Sorry Bonsai!

We’ve got multiple people on this thread and I got mixed up!

You’re quite right. Passing calculate_chunk_length.py the whole dataset is the ideal way of figuring out chunk sizing.

Did you change your chunking to accommodate the script’s advice? I’m curious to know if it improved things (or not).


#14

I did change my chunking to (something similar to) the script’s advice and I think it improved matters (well, made loading possible at all, to tell the truth).
But I didn’t get to switch my loading process to an iterative one yet (a suggestion from here to limit the disk usage of the loading process).