Aggregation memory usage


#1

I am having trouble understanding aggregation performance; it is not clear to me why so much memory is needed. Below is an example.

  1. Build an array with three attributes per cell

m = scidb("build(<vid:int32>[i=0:100000,1000,0], random()%10000)")
m = bind(m, "x", "bool(random()%2)")
m = bind(m, "y", "bool(random()%2)")
m = scidbeval(m)

  2. Aggregation fails due to memory shortage
  3. Aggregation succeeds

The x and y attributes are booleans, so they would increase the table size 4-fold. However, there is a huge difference in execution.

Thanks,
Ohad.


#2

Although I can’t reproduce the problem on a machine with modest memory (16GB), I can say that the query generated by that aggregation is fairly complex. You can see it yourself by setting:

options(scidb.debug=TRUE)

and then re-running the aggregation. The upshot is that each non-int64 attribute is placed in an auxiliary dimension array using index_lookup, and then those are cross-joined into the main array, redimensioned, and then aggregated.
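As a rough local illustration of the index_lookup step described above, the base-R sketch below maps a boolean attribute to zero-based positions in a sorted table of its unique values, which is the kind of integer coordinate a dimension needs. This is only an analogue in plain R, not the AFL that the package generates, and the variable names are made up for the example.

```r
# Local analogue only: plain R vectors, no SciDB connection involved.
x <- c(TRUE, FALSE, FALSE, TRUE, TRUE)   # a non-int64 (boolean) attribute

# Auxiliary "index" of sorted unique values, playing the role of a lookup array.
x_levels <- sort(unique(x))

# Zero-based position of each value in the index; usable as a coordinate.
x_index <- match(x, x_levels) - 1L

print(data.frame(x = x, x_index = x_index))
```

Doing this once per non-int64 attribute, and then joining the results back to the main array, is what makes the generated query heavier than a plain redimension over int64 attributes.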

In this example it would be much more efficient to just use int64 attributes all around and then just redimension.

Something like:

m = scidb("build(<vid:int64>[i=0:100000,1000,0], random()%10000)")
m = bind(m, "x", "int64(random()%2)")
m = bind(m, "y", "int64(random()%2)")
m = scidbeval(m)
redimension(bind(m,"one",1),dim=c("vid","x","y"),FUN="sum(one) as sum")

(Note: this uses syntax available in the latest package on github, not yet released on CRAN. In general, I am going to start recommending R users try the redimension function instead of aggregate even though the latter is familiar. The implementation of the redimension function in the latest package is better than aggregate.)
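To make clear what that redimension computes, here is a small local analogue in base R. It assumes a data frame (the hypothetical `m_local`) standing in for the SciDB array, with far fewer cells than the real example, and it uses base R's `aggregate` rather than anything from the scidb package. The result is simply a count of cells for each (vid, x, y) combination.

```r
# Local analogue only: a small data frame stands in for the SciDB array.
set.seed(1)
n <- 1000
m_local <- data.frame(
  vid = sample(0:99, n, replace = TRUE),
  x   = sample(0:1, n, replace = TRUE),
  y   = sample(0:1, n, replace = TRUE)
)
m_local$one <- 1   # same role as bind(m, "one", 1)

# Sum of `one` within each (vid, x, y) group, i.e. a count per group.
counts <- aggregate(one ~ vid + x + y, data = m_local, FUN = sum)

# Every input row lands in exactly one group, and the two binary
# attributes allow at most 4 (x, y) combinations per vid value.
stopifnot(sum(counts$one) == n, nrow(counts) <= 100 * 2 * 2)
head(counts)
```

The same bound applies to the SciDB version: with vid in 0:9999 and two binary dimensions, the redimensioned array has at most 10000 * 2 * 2 = 40000 cells.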


#3

R package version difference?


#4

Also, Ohad, are you guys using SciDB 14.8 or 14.12?


#5

We are using the latest, which is currently SciDB 14.12.

   Ohad.

#6

I updated to the latest scidbR package from github. I now get this:

> m = scidb("build(<vid:int64>[i=0:100000,1000,0], random()%10000)")
> m = bind(m, "x", "int64(random()%2)")
> m = bind(m, "y", "int64(random()%2)")
> m = scidbeval(m)
> redimension(bind(m,"one",1),dim=c("vid","x","y"),FUN="sum(one) as sum")
Error in scidbquery(query, afl, async = FALSE, save = "lcsv+", release = 0,  : 
  UserQueryException in file: src/query/parser/Driver.cpp function: fail line: 146
Error id: scidb::SCIDB_SE_PARSER::SCIDB_LE_QUERY_PARSING_ERROR
Error description: Error during query parsing. Query parser failed with error 'syntax error'.
filter(redimension(apply(R_array157f033d44f8b1102424095338, one,1),<one:int64>[vid=0:9999,100,0,x=0:1,100,0,y=0:1,100,0],sum(one) as sum(one) as one),true)
                                                                                                                                        ^

Here is the R version information:

platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 0.2
year 2013
month 09
day 25
svn rev 63987
language R
version.string R version 3.0.2 (2013-09-25)
nickname Frisbee Sailing

Correction: after reinstalling (again) from github, I get the correct result.