How to improve loading performance?


#1

Hi!

I’m running experiments on an astronomy application with SciDB, and I’ve found that its search performance is excellent.

But I have a question: how can I improve the loading performance?

I’m dealing with 2,000,000,000 records, which occupy about 300GB, but I only have one PC.

I created an array with two dimensions: [obs_id=1204300800:1217520000,1,0, id=0:20000,1000,0]. I loaded only about 10,000,000 records in 40 hours, and the load keeps getting slower as time passes, so slow that I can’t wait for it to finish.
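For reference, the create statement looks roughly like this (the array name and attribute names here are just placeholders, not my real schema):

  CREATE ARRAY obs
  <ra:double, dec:double, mag:float>
  [obs_id=1204300800:1217520000,1,0, id=0:20000,1000,0];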

Then I split the data along the obs_id dimension into 800 arrays, and the situation was better: I loaded 1,300,000,000 records in 40 hours. But the result was still unacceptable to me, because the load was again getting slower and slower.

I wonder whether this situation is normal, or whether I’m just doing something wrong?


#2

Would you be so kind as to post:

  1. Your CREATE ARRAY statement.
  2. A rough idea of what your source data looks like. It’s a lot of data. Is that the USNOB catalog, or another one?
  3. The script you’re using to load your source data into SciDB (for example, something along the lines of the sketch below).
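To be concrete, a typical single-instance load often looks something like this; the array name and file path are just placeholders, so please post whatever you are actually running:

  # placeholder array name ('obs') and file path; adjust to your setup
  # (raw CSV is usually converted to SciDB's load format first, e.g. with csv2scidb)
  iquery -naq "load(obs, '/tmp/catalog.scidb')"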

We will try to repro your situation and have a look.

Questions:

i. SciDB does best when it’s able to parallelize this kind of operation. How many cores does your PC have? And how many SciDB instances are you running? (A couple of quick ways to check are shown after this list.)
ii. When you look at the box, are you CPU bound? (top says idle% = 0?)
iii. The fact that the load slows over time worries me a bit. It suggests we have a serial data structure somewhere that needs to be turned into something else.
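For example (assuming a standard Linux install and that iquery is on your path):

  # how many SciDB instances are configured and running
  iquery -aq "list('instances')"

  # number of CPU cores on the box
  grep -c processor /proc/cpuinfo

  # watch CPU utilization while the load is running
  top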

KR

Pb

#3

Hello,

I can provide some loading data we’ve collected in the past, for comparison purposes.

Internally, we’ve run a benchmark on a particular data set that had:

  • about 20 attributes of types int64, int32, float, and int8
  • a single dimension [x=0:*,10000,0]

We had a total count of 470,829,834 cells and a pre-load on-disk size of 54GB (text).
We loaded this dataset in 16,080 seconds, which is under 5 hours. We were using a 4-node cluster, loading the data in from a single node, so we assume the network was an additional bottleneck in that case.

You have 300GB, roughly 6x the size we dealt with, so at worst we would expect your entire load to finish in about 30 hours. That is clearly not the case!
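Back-of-the-envelope, assuming the load scales roughly linearly with input size (and ignoring that we had 4 nodes while you have a single PC):

  16,080 s x (300 GB / 54 GB) ≈ 89,000 s ≈ 25 hours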

One glaring finding: your chunk size may be too small. I am not sure how sparse your data is, but a chunk size of 1x1000 may result in too many chunks being created and too many chunk headers being written out. Ideally, we would like each chunk to hold a few megabytes of data. What are the attributes, and how dense is the data in this case?
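To illustrate the kind of change we’d suggest, keep your dimensions but enlarge the chunk intervals (the new numbers below are only illustrative; the right values depend on your attribute sizes and on how densely populated the array is):

  old: [obs_id=1204300800:1217520000,1,0, id=0:20000,1000,0]
  new: [obs_id=1204300800:1217520000,3600,0, id=0:20000,20001,0]

With the new intervals each chunk covers 3600 x 20001 logical cells rather than 1 x 1000, so far fewer chunks (and chunk headers) get written.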


#4

Thank you for your advice! I’ll try it!