Load numpy array into SciDB array


#1

I have some arrays with the same size (180*360*20) and I need to load them into SciDB.

Example:

from netCDF4 import Dataset

archivo = 'uv20150130rt.nc'  # I get the data from a NetCDF file
ncfile = Dataset(archivo, 'r')  # open the NetCDF file

wind_total = ncfile['w']  # get the variable I'm interested in

name = "winds"
sdb.query("create array " + name + " <m_s:double>[time=0:" + str(wind_total.shape[0] - 1) + ",1,0,lat=0:" + str(wind_total.shape[1] - 1) + ",1,0,lon=0:" + str(wind_total.shape[2] - 1) + ",1,0]")

Array = sdb.wrap_array(name)
print Array.toarray()

So I have created the SciDB array, but I have not managed to load the data into it.


#2

I am trying to do this:

sdb.query("create array " + name + " <m_s:double>[time=0:" + str(wind_total.shape[0] - 1) + ",1,0,lat=0:" + str(wind_total.shape[1] - 1) + ",1,0,lon=0:" + str(wind_total.shape[2] - 1) + ",1,0]")

Array = sdb.wrap_array(name)

temp = sdb.from_array(wind_total)

sdb.afl.store(temp.name, Array.name)

but it does not work…


#3

What are your attributes?

The structure of an array is:

CREATE ARRAY Array_Name
< attribute_name : type_name { , attribute_name : type_name } >
[ dimension_name = min_value : max_value, per-dimension-chunk-length, overlap
  { , dimension_name = min_value : max_value, per-dimension-chunk-length, overlap } ];
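
For instance (the names here are purely illustrative), a minimal one-attribute, two-dimensional array following that grammar would be:

    CREATE ARRAY example
    < value : double >
    [ row = 0:99,100,0, col = 0:99,100,0 ];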

I could be wrong, but it looks to me that your CREATE ARRAY … statement doesn't have any attributes defined. What does the string you're creating look like?


#4

Yes, I made a mistake when I copied it; the correct query that I am trying is:

sdb.query("create array " + name + " < m_s:double >[time=0:" + str(wind_total.shape[0] - 1) + ",1,0,lat=0:" + str(wind_total.shape[1] - 1) + ",1,0,lon=0:" + str(wind_total.shape[2] - 1) + ",1,0]")

so, in this example, the query is:

create array winds <m_s:double>[time=0:1,1,0, lat=0:179,1,0, lon=0:359,1,0]

and wind_total has the following shape: (2, 180, 360)


#5

Oh boy.

  1. You need to specify the attributes that will appear in the array. Have a look at the CREATE ARRAY … statement in the documentation here. What is it that appears at each time x lat x long cell? The variable you're interested in is called "wind". If its type is, say, float, then you need to specify that:

    CREATE ARRAY Winds
    < wind : float >
    [ time, lat, long ];

  2. It's a really, really, really bad idea to set your per-dimension-chunk-length to '1'. As a rule, we set chunk sizes so that there are about 1,000,000 cells per chunk (see the sketch after this list). Of course, in your case the sizes can't be that big. But …

    CREATE ARRAY Winds
    < wind : float >
    [ time=0:1,1,0, lat=0:179,180,0, long=0:359,360,0 ];

  3. So … um … this is a really small array. My advice would be to use one of the HDF5 loaders that you will find referenced if you search the forum. You probably don't want to use python as your vehicle for bulk loading data. Have a look at this write-up on the best way to rapidly bulk load data.
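
To make point 2 concrete, here is a rough sketch of that rule of thumb. The helper below is my own illustration, not part of scidb-py; the ~1,000,000-cells-per-chunk target is just the advice above:

    def chunk_lengths(shape, target_cells=1000000):
        """Pick per-dimension chunk lengths whose product stays at or
        below target_cells (the ~1M-cells-per-chunk rule of thumb)."""
        chunks = []
        remaining = target_cells
        for length in shape:
            c = min(length, remaining)  # take the whole dimension if we can
            chunks.append(c)
            remaining = max(1, remaining // c)
        return chunks

    # the 2 x 180 x 360 winds array in this thread fits in a single chunk:
    print chunk_lengths((2, 180, 360))   # -> [2, 180, 360]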

As a rule, use python to exchange queries (and the data that makes up query results) with SciDB, to display results, and so on. Python is the dashboard, steering wheel, and so on. SciDB is the engine.

Hope this helps!


#6

Sorry, but I did not separate the <> characters in my earlier reply, so the attributes did not appear :frowning:

About the size: yes, all the arrays that I have to load are about 10 MB (because of this, I chose a chunk size of 1); but I need python to load the data into SciDB, because I have a program that dynamically creates the arrays from netcdf4/geotiff files, and it is necessary to load them into SciDB. I have to create about 2000 arrays of the same size (360*180*20 cells * 8 bytes per float, about 10 MB, more or less).

About the chunk size: I assumed that if the array is small it does not matter, but probably I was wrong.

About python: as I said, I need to use it. Another idea I have for loading the numpy array is through a text file:

1- dump the numpy array to a text file
2- use the load utility of SciDB
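
If you go this route, here is a minimal sketch of step 1, plus the shape the SciDB side usually takes. I'm assuming the standard one-dimensional "load array then redimension" recipe and the csv2scidb utility that ships with SciDB; the paths, flags, and array names are illustrative, so check the bulk-loading write-up @plumber linked for the authoritative recipe:

    import numpy as np

    wind_total = np.random.rand(2, 180, 360)   # stand-in for the real data

    # 1- dump one cell per line, with explicit coordinates
    with open('/tmp/winds.csv', 'w') as f:
        for (t, la, lo), v in np.ndenumerate(wind_total):
            f.write("%d,%d,%d,%f\n" % (t, la, lo, v))

    # 2- outside python, convert and load (illustrative):
    #      csv2scidb < /tmp/winds.csv > /tmp/winds.scidb
    #    then in SciDB:
    #      CREATE ARRAY winds_load
    #        <time:int64, lat:int64, lon:int64, m_s:double>[i=0:*,500000,0];
    #      LOAD winds_load FROM '/tmp/winds.scidb';
    #    and redimension into the final 3-D array with AFL:
    #      insert(redimension(winds_load, winds), winds)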


#8

OK …

  1. "I have to create about 2000 arrays of the same size (360*180*20 cells * 8 bytes per float, about 10 MB, more or less)"

    I am a little confused. But if you're proposing to create 2,000 SciDB arrays, each of which will hold only 1 of these geotiff files, all I can say is please do not do this! Have a look at that loading tutorial I included. You want to create a single SciDB array with a dimension that separates each 3D "slice" of your data set (see the sketch after this list). That way, you will be able to write queries that analyze the data by space and time. With 2,000 arrays you will be creating massive pain for yourself.

  2. Well … at a chunk size of 1 x 1 x 1, you will be creating 360 x 180 x 20 = 1,296,000 chunks per attribute. Now … we track internally, for each chunk, its coordinates, physical location on disk, and so on. So the per-chunk meta-data overhead is likely to be about 100 bytes per chunk; that is roughly 130 MB of book-keeping for your 10 MB of data. What you're basically doing is taking your single 10MB file and breaking it up into pieces where each piece contains 8 bytes of data. Doing so is very slow, and very wasteful. We designed SciDB to break larger data objects up into smaller ones. But not that small!

  3. Your general idea? numpy -> file -> SciDB load? Given how much data you're proposing (2,000 arrays, each of 10MB) I think that's a very good idea. Quick question, though. Why use numpy to pull the data out of a NetCDF file? There are perfectly good alternatives very suitable for dumping the contents of a NetCDF (or HDF) file without going through the numpy interfaces.
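
To make point 1 concrete, here is one possible shape for that single array. This is a sketch under assumptions of mine (the names, the shim URL, and the per-file insert recipe are illustrative): add a dimension that indexes the source file, and insert each file's 3D slice at its own index.

    from scidbpy import connect

    sdb = connect('http://localhost:8080')   # your shim URL here

    # one array for ALL ~2000 files: 'src' indexes the source geotiff/netcdf file
    sdb.query("create array all_winds <m_s:double>"
              "[src=0:*,1,0, time=0:1,2,0, lat=0:179,180,0, lon=0:359,360,0]")

    # then, for each file k, upload its slice, tag it with src=k, and copy it
    # in -- roughly, with AFL along the lines of:
    #   insert(redimension(apply(slice_k, src, int64(k)), all_winds), all_winds)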


#9

I have a PostgreSQL database to find each array, and it is necessary to load the arrays separately because each array is an independent thing. So, in this project, the arrays have to be separated (because of this, I need 2000 arrays of 10 MB each).

That said, I use python to open the geotiff and netcdf files and process the meteorological data to obtain processed values, using some patterns that I have defined; so it is necessary too. As you said, I could create NetCDF files and load them directly into SciDB, but I don't know how to do that.

Finally, yes, my idea is to use numpy to process all the meteorological files, then apply the patterns to create new arrays, and load these arrays into SciDB. I want to use SciDB just as array storage (for now; in the future I will use the real potential of SciDB). SciDB was my first option, but maybe, for now, there are other options, I do not know. A friend told me that I could use MongoDB (I think) for now, and then change to SciDB.

So, what do you think?


#10

@David_02

A minor code formatting suggestion first

Possibly the < and > characters are getting left out when you publish the post.

Might I suggest using the ` character while writing inline code:
```
BLOCK CODE
```
or ` INLINE CODE `

This might be causing some confusion as to what code you actually tried.

Next, to answer the numpy-to-SciDB load question

I made up an IPython Notebook demo to answer your question. See here:

Basically, you were doing almost all things correctly. But when you wrote:

sdb.afl.store(temp.name, Array.name)

It did not actually evaluate the store query; it just keeps it around for future evaluation (see http://scidb-py.readthedocs.org/en/stable/operations.html#lazy-evaluation).

Instead, what I show in my example is:

sdb.query("store(" + temp.name + ", " + Array.name + ")")

This actually runs the copy command.
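
Putting it all together, a minimal end-to-end version of the original post might look like this (the connection URL is an assumption on my part; the file name, array name, and schema are the ones from this thread):

    import numpy as np
    from netCDF4 import Dataset
    from scidbpy import connect

    sdb = connect('http://localhost:8080')        # your shim URL here

    ncfile = Dataset('uv20150130rt.nc', 'r')
    wind_total = np.asarray(ncfile.variables['w'][:])   # shape (2, 180, 360)

    name = "winds"
    sdb.query("create array " + name + " <m_s:double>"
              "[time=0:%d,2,0, lat=0:%d,180,0, lon=0:%d,360,0]"
              % (wind_total.shape[0] - 1, wind_total.shape[1] - 1,
                 wind_total.shape[2] - 1))

    temp = sdb.from_array(wind_total)             # upload the numpy data
    sdb.query("store(" + temp.name + ", " + name + ")")  # eager copy into winds

    print sdb.wrap_array(name).toarray()          # verify the round trip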

There is one obvious inefficiency here. We store in a temp array, and then copy it to the final desired array. It would be much better to just rename the attributes of the dynamically generated array to your desired names. I will show that example in an updated version of the notebook.

Other discussions on this thread

@plumber brings up other points about best ways to load data into SciDB. He is absolutely right. I just wanted to solve the initial question of loading numpy arrays into SciDB.