Daily data load into the same array


#1

Hello,

I have a few questions about the loading process in SciDB.
I have a large amount of data that I want to load in multiple batches from different nodes.
Let’s say, for example, that I have a cluster of 10 nodes, and every day a new data set of ~300 GB arrives on one of these nodes.
I have done an initial load of the first data set into an array (CSV load into a 1D array, then redimension into 3D).

  1. What is the best way to insert new data the following day into the same array? I want to append a new slice of data (previous values remain untouched; a different part of the array is involved) without uselessly copying the previous data into a new array.
    For example, my first array would be temp:float with dimensions [x=0:100, y=0:10, z=0:3], and the new data set would have the same structure with y=11:20.

  2. Does SciDB provide a load-balancing system so that each node’s disk space usage is about the same?

  3. Can the LOAD command be invoked by nodes other than the coordinator node?

  4. How can I add another node to the system later on if I want to?

Thank you!


#2

Hi, thanks for your interest.

  1. What is the best way to insert new data the following day into the same array? I want to append a new slice of data (previous values remain untouched; a different part of the array is involved) without uselessly copying the previous data into a new array. For example, my first array would be temp:float with dimensions [x=0:100, y=0:10, z=0:3], and the new data set would have the same structure with y=11:20.

You would do something like this:
create array data <…> [x=0:*, y=0:*, z=0:*]
create array data_day_1 …  -- create a 1D copy of the first batch
load ( data_day_1, … )  -- load the first batch
insert ( redimension ( data_day_1, data ), data )  -- make the first batch 3D and insert it into data

create array data_day_2 …  -- second batch
load ( data_day_2, … )  -- load the second batch
insert ( redimension ( data_day_2, data ), data )  -- insert the second batch into data
-- or, if you need to “move” the coordinates:
insert ( redimension ( project ( attribute_rename ( apply ( data_day_2, newX, x+… ), newX, … ), newX, x ), data ), data )  -- insert the second batch into data

… and so on.
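
To make the coordinate-shifting variant concrete, here is one possible expansion of that last query (a sketch only; it assumes data_day_2 is a 1D load array with attributes x, y, z and temp, and that the second day’s file numbers y from 0 again, so y is shifted by 11 to land in y=11:20):

insert (
  redimension (
    attribute_rename (
      project ( apply ( data_day_2, y_new, y + 11 ), x, y_new, z, temp ),  -- compute the shifted coordinate, keep everything except the old y
      y_new, y ),                                                          -- rename the shifted attribute to y
    data ),
  data )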

In order for this to work, all dimensions must be integer; insert over non-integer dimensions doesn’t work yet. This performs best if your chunks are split on the load boundary (i.e. a new load doesn’t touch the chunks from the old load). For example, if x is always between 1 and 1000, y is always between 1 and 1000, and every day you add 10 new values of z, then make the z chunk size equal to 10.
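
As an illustration of that last point, a matching schema might look like this (a sketch; the attribute name, bounds, and chunk sizes are made up, using the [dim=low:high,chunk_size,overlap] dimension syntax):

-- x and y each fit in a single chunk; z grows by 10 values per day,
-- so each daily insert writes only brand-new z chunks
create array data <temp:float> [x=1:1000,1000,0, y=1:1000,1000,0, z=0:*,10,0]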

  2. Does SciDB provide a load-balancing system so that each node’s disk space usage is about the same?

Yes, we smear the data across instances. We take the top-left coordinate of each chunk and hash it, then compute the hash modulo the number of instances, and send the chunk to that instance. Take a look at this query for a way to examine the distribution: viewtopic.php?f=18&t=1091

  3. Can the LOAD command be invoked by nodes other than the coordinator node?

The query always has to be sent to the coordinator node, but you can load from a particular instance, or you can split the file into pieces, send one to each instance, and load from all instances simultaneously. See the documentation for the load command and the loadcsv.py script.
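
For illustration, a single-instance or parallel load might look roughly like this (a sketch; the file path is made up, and the exact meaning of the instance-id argument varies between releases, so check the load documentation for your version):

load ( data_day_2, '/data/day_2.scidb', -1 )  -- each instance loads its own copy of the named file in parallel
load ( data_day_2, '/data/day_2.scidb', 2 )   -- a non-negative id loads from that single instance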

  4. How can I add another node to the system later on if I want to?

With the current system, you have to do an “opaque save” - that is, export all the data outside SciDB in SciDB’s opaque format. Then create a new cluster that includes the extra node. Then perform an opaque load to put the data back. The good news is that opaque save and load are very quick. Adding new instances on the fly is definitely on the future roadmap.
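
A rough sketch of that round trip (the file path is made up, and the exact save/load parameters depend on your release, so check the documentation):

save ( data, '/backup/data.opaque', -2, 'OPAQUE' )  -- export the array from the coordinator in opaque format
-- ... reinitialize the cluster with the new instance and recreate the array schema ...
load ( data, '/backup/data.opaque', -2, 'OPAQUE' )  -- load it back into the new cluster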

Hope it helps.

  • Alex Poliakov

#3

I was going to ask almost the same question, but for the concrete case of time series. Each day (or, more generally, each period) I get a new set of data for that period that I’d like to insert/append to my existing time series. So it would involve non-integer dimensions (datetime).

From your answer it sounds like that isn’t possible, correct?

The overhead of creating a new full time series each day/period seems rather high, and operationally it sounds quite complicated. That would probably disqualify it for many financial-markets use cases.

Is adding to time series something that you’ve got in your plans?

Thanks


#4

@apoliakov:

Thanks, that helps a lot :wink:


#5

Just to add a caveat - a lot of the folks we are working with simply elect to use an integer dimension for time. They encode time as the number of milliseconds, or microseconds, since some known start date. Modulo time zones, that’s how “datetime” and “timestamp” datatypes are usually implemented to begin with. Works for a few folks we’ve talked to - may or may not work for you.
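
For what it’s worth, a schema along those lines might look something like this (a sketch only; the array name, attribute, and epoch are made up):

-- t_ms = milliseconds since some fixed epoch; one chunk per day (86,400,000 ms)
create array ticks <price:double> [series=0:*,1,0, t_ms=0:*,86400000,0]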


#6

Yeah, that was the obvious workaround popping up in my head as well.

But more than that, I think I’m looking for a good way to insert, say, just one observation without a CSV import, redimension, etc.
I already have the data in structured form, and I’m doing this programmatically, so going via import formats seems like unwanted complexity.

So I’ll rephrase and ask: are there any plans for a “structured insert” without intermediary steps? For bulk loads the current way makes sense, but less so for small incremental updates.


#7

To make the question more pointed: let’s say I have the data for a single entry/datapoint available in memory in my application and I want to add that single datapoint. Can I do that directly via an API, without involving any intermediary CSV file?

UPDATE:
I found this thread, which seems helpful: viewtopic.php?f=6&t=1077
I learned about the array literal there; I had missed it on first reading.

I’m interested in doing this from a Java/JVM language (Clojure), so I’ll look into how/if the JDBC driver allows me to do this.
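
For reference, the array-literal route might look something like this in AFL (a sketch only; the coordinates and value are made up, and it reuses the 3D data array from post #2):

insert (
  redimension (
    apply ( build ( <temp:float> [i=0:0,1,0], '[(21.5)]', true ),  -- one-cell array literal holding the new value
            x, 50, y, 15, z, 2 ),                                  -- attach the target coordinates as attributes
    data ),
  data )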


#8

Understood. But keep in mind we are not (yet) optimized for single-point inserts. Doing a point-insert or a point-correction every once in a while is OK, but doing it all the time will explode your disk usage. FWIW.

That’s because every new insert creates a new array version, which carries some overhead. It’s a feature designed to let you travel back in time in your array.
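
Those versions are what let you look at past states of the array, e.g. with something like this (assuming the versions operator and the @-version syntax available in your release):

versions ( A )  -- list the stored versions of A
scan ( A@1 )    -- read A as it was at version 1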

To work around this, if you must do a bunch of point-inserts into array A, then periodically do the following:

store(A, B)    -- materialize the current contents of A as a brand-new array B
remove(A)      -- drop A along with all of its versions
rename(B, A)   -- give B the old name A

This removes all of the history associated with A and frees up a lot of disk space.