Multi-attribute array from numpy data


#1

Hello. In machine learning, datasets often have a numpy shape like (50000, 32, 32, 4), corresponding to (batch size, image width, image height, RGBA values).

In that situation, I’d prefer a schema like <r:int64,g:int64,b:int64,a:int64>[obs,x,y]. I’d like to use scidbpy’s upload_data=a keyword parameter with a numpy array a that has the right shape, and I don’t mind using .reshape in numpy to accommodate scidbpy.
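To make the reshaping concrete, here is a minimal numpy-only sketch of what I have in mind. It assumes the channel values are already int64 and that one record per cell, in row-major order, is a usable layout; the array sizes and the names images, cells and records are made up for illustration, and nothing here touches scidbpy.

```python
import numpy as np

# Hypothetical stand-in for a real dataset: 5 RGBA images of 3x2 pixels.
images = np.arange(5 * 3 * 2 * 4, dtype=np.int64).reshape(5, 3, 2, 4)

# One int64 field per channel, mirroring <r:int64,g:int64,b:int64,a:int64>.
rgba = np.dtype([('r', '<i8'), ('g', '<i8'), ('b', '<i8'), ('a', '<i8')])

# Reinterpret each run of four int64 values as one record. The view has
# shape (5, 3, 2, 1), so reshape to drop the trailing axis: (obs, x, y).
cells = np.ascontiguousarray(images).view(rgba).reshape(images.shape[:-1])

# Flatten to one record per cell in row-major (C) order.
records = cells.ravel()

print(cells.shape, records.shape)
```

The view shares memory with images, so no pixel data is copied until the array is serialized.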

In the obsolete documentation for scidbpy at http://scidb-py.readthedocs.io/en/stable/creation.html, I see examples of multi-attribute scidb arrays created from numpy objects, but in the new documentation, I don’t see that. How can I use a numpy ndarray as the upload_data for a scidb array with a multi-attribute schema (preferably with multiple dimensions as well)?

To simplify the question, I tried a single dimension and two attributes. I can use CSV data:

In [71]: scidbpy.__version__
Out[71]: '18.1.3'

In [72]: with open('df.csv', 'rb') as f: print(f.read().decode('utf-8'))
1,2
3,4


In [73]: with open('df.csv', 'rb') as f: db.iquery("store(input(<x:int64,y:int64>[i], '{fn}', 0, 'CSV'), df4)", upload_data=f)

In [74]: db.arrays.df4[:]

Out[74]: 
   i    x    y
0  0  1.0  2.0
1  1  3.0  4.0

In [75]: 

But adding data from a numpy object doesn’t work as I’d expect, whether I use a reshaped np.arange, a structured array, or a recarray as below.

In [125]: a = np.rec.array([(4,5), (6,7)], dtype=[('x', 'i8'), ('y', 'i8')])

In [126]: a
Out[126]: 
rec.array([(4, 5), (6, 7)],
          dtype=[('x', '<i8'), ('y', '<i8')])

In [127]: db.iquery("insert(input({sch}, '{fn}', 0, '(int64)'), df4)", upload_data=a)
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-127-d92c78f48093> in <module>()
----> 1 db.iquery("insert(input({sch}, '{fn}', 0, '(int64)'), df4)", upload_data=a)

~/environments/probtorch/lib/python3.6/site-packages/scidbpy/db.py in iquery(self, query, fetch, use_arrow, atts_only, as_dataframe, dataframe_promo, schema, upload_data, upload_schema)
    374 
    375         else:                   # fetch=False
--> 376             self._shim(Shim.execute_query, query=query)
    377 
    378             # Special case: -- - load_library - --

~/environments/probtorch/lib/python3.6/site-packages/scidbpy/db.py in _shim(self, endpoint, **kwargs)
    465 
    466         req.reason = req.content
--> 467         req.raise_for_status()
    468         return req
    469 

~/environments/probtorch/lib/python3.6/site-packages/requests/models.py in raise_for_status(self)
    937 
    938         if http_error_msg:
--> 939             raise HTTPError(http_error_msg, response=self)
    940 
    941     def close(self):

HTTPError: 406 Client Error: UserException in file: src/smgr/io/TemplateParser.cpp function: parse line: 114
Error id: scidb::SCIDB_SE_EXECUTION::SCIDB_LE_ATTRIBUTES_MISMATCH
Error description: Error during query execution. Attributes mismatch. for url: http://localhost:8080/execute_query?query=insert%28input%28%3Cx%3Aint64+NOT+NULL%2Cy%3Aint64+NOT+NULL%3E+%5Bi%5D%2C+%27%2Ftmp%2Fshim_input_buf_gCXE5z%27%2C+0%2C+%27%28int64%29%27%29%2C+df4%29&id=iu531gvyx99bwdekbutaj889ee4dx9wf

In [128]: 

On the scidb server there’s more information:

shim: /execute_query
shim: execute_query[iu531g]: execute, scidb[0] 0x7f06bc043e20, scidb[1] 0x7f06bc043e50, query insert(input(<x:int64 NOT NULL,y:int64 NOT NULL> [i], '/tmp/shim_input_buf_gCXE5z', 0, '(int64)'), df4)
shim: execute_query[iu531g]: execute, qid 0.1533923112498421677
shim: execute_query: ERROR execute, iu531g: UserException in file: src/smgr/io/TemplateParser.cpp function: parse line: 114
Error id: scidb::SCIDB_SE_EXECUTION::SCIDB_LE_ATTRIBUTES_MISMATCH
Error description: Error during query execution. Attributes mismatch.

I don’t see where the attributes mismatch. In fact, if I use an ‘f8’ value instead of an ‘i8’ value, the error says specifically that the types don’t match.
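One guess, which I have not confirmed, is that the format string '(int64)' describes a single attribute while the schema has two. Under that assumption, the sketch below builds a format string with one entry per field of the numpy dtype. The helper binary_format and the SCIDB_TYPES mapping are hypothetical and are not part of scidbpy.

```python
import numpy as np

# Assumed mapping from numpy type codes to SciDB type names; extend as needed.
SCIDB_TYPES = {'i8': 'int64', 'i4': 'int32', 'f8': 'double', 'f4': 'float'}


def binary_format(dtype):
    """Hypothetical helper: build '(type,type,...)' for a structured dtype."""
    names = []
    for field in dtype.names:
        base = dtype.fields[field][0]
        # base.str looks like '<i8'; strip the byte-order character.
        names.append(SCIDB_TYPES[base.str.lstrip('<>=|')])
    return '(' + ','.join(names) + ')'


a = np.rec.array([(4, 5), (6, 7)], dtype=[('x', 'i8'), ('y', 'i8')])
fmt = binary_format(a.dtype)
print(fmt)  # prints (int64,int64)
```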

Can someone please provide an example that shows how to load data from a numpy object into multi-attribute (and preferably multi-dimension) scidb array?


#2

I’ve learned some things about this situation.

You might think that if you get a numpy array from a SciDB query, its dtype would be the one you need for uploading data to that array, but I think it is not.

For example, I create an array in iquery below. It has multiple dimensions and multiple attributes.

AFL% create array image<r:double not null,g:double not null,b:double not null>[x=0:1;y=0:2];
Query was executed successfully
AFL% 

… and I can see that the dtype of the numpy array I get from fetching in scidb-py includes the dimension indices:

In [134]: db.iquery('scan(image)', fetch=True, as_dataframe=False)
Out[134]: 
array([],
      dtype=[('x', '<i8'), ('y', '<i8'), ('r', '<f8'), ('g', '<f8'), ('b', '<f8')])

In [135]: 

But using a numpy array with that dtype as upload data produces an attribute mismatch. I think that is because the uploaded data is supposed to contain only the attributes, not the dimension indices.

In [26]: rgb = np.array(np.random.normal(size=2*3), np.dtype([('r', '<f8'), ('g', '<f8'), ('b', '<f8')]))

In [27]: rgb
Out[27]: 
array([(-0.94326614, -0.94326614, -0.94326614),
       (-0.10770737, -0.10770737, -0.10770737),
       ( 1.12254221,  1.12254221,  1.12254221),
       (-0.59358083, -0.59358083, -0.59358083),
       ( 0.70709724,  0.70709724,  0.70709724),
       ( 0.00274978,  0.00274978,  0.00274978)],
      dtype=[('r', '<f8'), ('g', '<f8'), ('b', '<f8')])

In [28]: 

That array can be used as upload data, even though its dtype doesn’t match the fetched array.

In [29]: db.iquery("load(image, '{fn}', 0, '{fmt}')", upload_data=rgb)

In [30]: db.arrays.image[:]
Out[30]: 
   x  y         r         g         b
0  0  0 -0.943266 -0.943266 -0.943266
1  0  1 -0.107707 -0.107707 -0.107707
2  0  2  1.122542  1.122542  1.122542
3  1  0 -0.593581 -0.593581 -0.593581
4  1  1  0.707097  0.707097  0.707097
5  1  2  0.002750  0.002750  0.002750

In [31]: 
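Two things stand out in that result. The records appear to fill the cells in row-major order, with y varying fastest, and every record has identical r, g and b values because np.array repeats each scalar across all fields of the structured dtype. The numpy-only sketch below fills each field separately and lists the (x, y) cell each record would land in, assuming the same row-major order holds; the values are made up for illustration.

```python
import numpy as np

shape = (2, 3)  # x=0:1, y=0:2, as in the image array
rgb_dtype = np.dtype([('r', '<f8'), ('g', '<f8'), ('b', '<f8')])

# Fill each field from its own column so r, g and b differ within a record.
values = np.arange(18, dtype=np.float64).reshape(6, 3)
rgb = np.empty(6, dtype=rgb_dtype)
for i, name in enumerate(rgb_dtype.names):
    rgb[name] = values[:, i]

# Map each flat record index to its (x, y) cell in row-major order.
xs, ys = np.unravel_index(np.arange(rgb.size), shape)
for k in range(rgb.size):
    print(k, xs[k], ys[k], rgb[k])
```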

When fetching data, you have to pass the atts_only keyword argument to omit the dimension indices, so it may be surprising that uploaded data must contain attributes only.
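If you already have a fetched array and want to upload it again, one way to get the attribute-only dtype is to drop the dimension fields in numpy. This is a sketch that assumes you know the dimension names and have numpy 1.16 or later; the array fetched is built by hand here rather than returned by a query.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hand-built stand-in for a fetched result that includes dimension indices.
fetched_dtype = np.dtype([('x', '<i8'), ('y', '<i8'),
                          ('r', '<f8'), ('g', '<f8'), ('b', '<f8')])
fetched = np.zeros(6, dtype=fetched_dtype)
fetched['r'] = np.linspace(0.0, 1.0, 6)

dims = ('x', 'y')  # assumed known from the array schema
atts = [name for name in fetched.dtype.names if name not in dims]

# Multi-field indexing returns a view that keeps the original 40-byte
# record layout; repack_fields copies it into packed 24-byte records.
upload = rfn.repack_fields(fetched[atts])

print(upload.dtype)
```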