ValueError: array is too big


#1

I am getting this error when trying to retrieve the results of a SciDB query via the Python scidbpy interface.
I am using the between operator to get a smaller subset of a big array. I persist the output of between and then use the wrap_array and toarray functions of scidbpy. The result is well under the 1M-cell limit that shim is constrained by, but it seems that the between operator keeps the original (large) dimensions, and scidbpy uses those to estimate the size of the query's result.
Here are the details. As you can see, the array has only about 100k cells. (In fact the number of cells does not matter: even if the number of cells in the query result is 0, I still get the "array is too big" error.)

Here is the query which produced the q_result array (the first four values after the array name are the low corner and the last four the high corner, in dimension order lat, lon, level, time; NULL leaves a bound open):
store(between(current_speed,NULL,NULL,-30,1979010200,NULL,NULL,-6,1979010200),q_result)

AFL% analyze(q_result);
{attribute_number} attribute_name,min,max,distinct_count,non_null_count
{0} 'speed','-99','1.588',1276,118098

AFL% dimensions(q_result);
{No} name,start,length,chunk_interval,chunk_overlap,low,high,type
{0} 'lat',-2000000,8000001,500000,0,-1025000,5025000,'int64'
{1} 'lon',8000000,8000001,2000000,0,8975000,15025000,'int64'
{2} 'level',-10000,10001,5,0,-25,-15,'int64'
{3} 'time',1979010100,46113001,1,0,1979010200,1979010200,'int64'
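
To illustrate the point about size estimation (my own back-of-the-envelope sketch, not from the original post): a client that sizes the download from the full dimension lengths reported above, rather than from the non-empty cell count, arrives at an astronomically large nominal array.

# Nominal cell count implied by the dimension lengths in dimensions(q_result):
# lat, lon, level, time. This is what a dense allocation would have to cover,
# no matter how few cells are actually non-empty.
lengths = [8000001, 8000001, 10001, 46113001]
nominal_cells = 1
for n in lengths:
    nominal_cells *= n
print nominal_cells  # roughly 3e25 cells -- far beyond anything numpy can allocate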

This is the corresponding scidbpy code:

# bring the result back as a numpy array
q_result = sdb.wrap_array("q_result")
q_result_as_numpy = q_result.toarray()  # <-- "array is too big" error, even though the array has only ~100k cells


#2

I tried switching to the subarray function, which returns an array whose dimensions are shrunk to fit the query result, but now I am getting a different error:
sdb.query("remove(q_result)")

res = sdb.query("store(subarray(current_speed,NULL,NULL,-30,1979010200,NULL,NULL,-6,1979010200),q_result)")

# bring the result back as a numpy array
q_result = sdb.wrap_array("q_result")
q_result_as_numpy = q_result.toarray()  # <-- MemoryError

  File "test_scidb_query.py", line 40, in <module>
    q_result_as_numpy = q_result.toarray()
  File "/home/scidb/anaconda/lib/python2.7/site-packages/scidb_py-14.12.0.dev-py2.7.egg/scidbpy/scidbarray.py", line 999, in toarray
  File "/home/scidb/anaconda/lib/python2.7/site-packages/scidb_py-14.12.0.dev-py2.7.egg/scidbpy/parse.py", line 357, in toarray
  File "/home/scidb/anaconda/lib/python2.7/site-packages/scidb_py-14.12.0.dev-py2.7.egg/scidbpy/parse.py", line 299, in toarray_sparse
MemoryError


#3

scidbpy's toarray is definitely broken, alas.
The scidbpy todataframe method works, but for some reason the SciDB subarray function converts the actual dimension values into zero-based indexes…

# bring the result back as a numpy array
q_result = sdb.wrap_array("q_result")

# broken, need to test more
# q_result_as_numpy = q_result.toarray()

# workaround: download as a dataframe instead
dataframe = q_result.todataframe()
dataframe.reset_index(inplace=True)

arr = dataframe.values
column_names = dataframe.columns
print column_names
print arr
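
A possible follow-up in case a dense 2-D array is still wanted downstream (my own sketch, not part of the original workaround; the column names 'lat', 'lon' and 'speed' are assumed from the schema shown in post #1):

# pivot the long-form dataframe into a lat x lon grid for the single
# level/time slice selected by the query; empty cells become NaN
grid = dataframe.pivot_table(index='lat', columns='lon', values='speed')
dense = grid.values  # plain 2-D numpy array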


#4

Right. Subarray's behavior is to recenter the selected region at 0. So if you have A <…>[0:20,0:30] and you do subarray(A, 5,6,7,8), the result is <…>[0:2,0:2].
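
A small illustration of that recentering arithmetic (my own sketch of the example above, not SciDB code): subarray(A, 5, 6, 7, 8) selects rows 5..7 and columns 6..8 and shifts both ranges to start at 0.

low = (5, 6)    # low corner passed to subarray
high = (7, 8)   # high corner passed to subarray
new_dims = [(0, hi - lo) for lo, hi in zip(low, high)]
print new_dims  # [(0, 2), (0, 2)]  -> schema <...>[0:2,0:2]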

What if you run a between query and then download the result with a todataframe() call? Is that what you want?

Yes, the python package is a little behind the R package in terms of functionality. In R we do things like automatically detect when to use a sparse matrix instead of dense:

> count(m)
[1] 1060942
> str(m)
SciDB expression:  R_array7f6a161f71cd1101374275336
SciDB schema:  <v:double> [sample_id=0:*,10000,0,gene_id=0:*,10000,0]

Attributes:
  attribute   type nullable
1         v double    FALSE
Dimensions: 
  dimension start end chunk
1 sample_id     0   * 10000
2   gene_id     0   * 10000

> mlocal=m[]
Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
  Dimensions too big for the R sparse Matrix package! Returning data in unpacked data.frame form.

> mlocal=m[0:10000, 0:40000][]
> class(mlocal)
[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"
> mlocal
10001 x 40001 sparse Matrix of class "dgCMatrix"
   [[ suppressing 49 column names '0', '1', '2' ... ]]
                                                                                                            
0   . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
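
A rough Python analogue of that behavior (my own sketch; this is not part of SciDB-Py 14.12): decide between a dense numpy array and a scipy.sparse matrix based on the nominal extent of the result rather than the non-empty count. The helper to_local and its dense_limit threshold are hypothetical names for illustration.

import numpy as np
import scipy.sparse as sp

def to_local(rows, cols, values, shape, dense_limit=10**8):
    # rows/cols/values: parallel arrays describing the non-empty cells;
    # shape: the nominal extent of the result
    if shape[0] * shape[1] <= dense_limit:
        out = np.zeros(shape)
        out[rows, cols] = values
        return out
    # too large to densify: keep it sparse instead of raising MemoryError
    return sp.coo_matrix((values, (rows, cols)), shape=shape)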

#5

Alex, I need to run a between query and then use toarray.


#6

Yes, I understand; we need to address this in the Python package. We'll get to it when we have resources available, and the community is encouraged to contribute as well.

But does between / todataframe work well enough in the interim?


#7

Yes, todataframe works for me for the time being.
I think I can fix the Python bug in toarray by comparing it to the R implementation.

A couple of questions, then:
a. Do you know the list of known gaps between SciDB-R and SciDB-Py? Can you publish it?
b. When you do release testing, will you incorporate unit tests of both the SciDB-R and SciDB-Py interfaces for regression?

Stanislav


#8

SciDB-Py and SciDB-R are disconnected, separate projects. All the issues for SciDB-Py and SciDB-R are public and listed on the respective GitHub pages, for example github.com/Paradigm4/SciDB-Py/issues. Test scripts and testing are also independent.

The packages and plugins tend to follow SciDB with a lag. A plugin that works with SciDB 14.12 may or may not work with SciDB 15.6. It is up to the plugin to “keep up”, sometimes sooner, sometimes later.