Scidbpy mean() function on large arrays


#1

I was using scidbpy's mean() function on some 2-D arrays of different sizes, and here are some interesting results. NumPy and SciDB on [100,100] are pretty much the same; [1000,1000] starts showing some discrepancy; [4094,2046] is completely off (see below). Any ideas?

a.shape = (100, 100)
Numpy: a.mean = 286.45
SciDB: mean = 286.451162834

a.shape = (1000, 1000)
Numpy: a.mean = 287.57
Numpy: a_new.mean = 287.57
SciDB: mean = 287.490968998

a.shape = (4094, 2046)
Numpy: a.mean = 273.766
SciDB: mean = 291.174422006
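
Roughly, the comparison looks like this; a minimal sketch with random stand-in data, assuming SciDB-Py's connect()/from_array() interface (the shim URL is a placeholder, and the exact calls may differ by SciDB-Py version):

import numpy as np
from scidbpy import connect

# Placeholder shim URL; adjust for your installation.
sdb = connect('http://localhost:8080')

# Random stand-in for the real data.
a = np.random.uniform(0.0, 600.0, size=(4094, 2046))

# Upload the NumPy array to SciDB and compare the two means.
A = sdb.from_array(a)
print('NumPy:', a.mean())
print('SciDB:', A.mean().toarray())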


#2

Wow, this is a big deal. Can you share how you generated the array and how you uploaded it to SciDB? What happens if you run an AFL average on the data with iquery? Does the count(*) match?

I wonder if we have a precision issue (data converted to text and then back to doubles during upload) or maybe an issue in the Python package.
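
For the iquery check, something along these lines should work (assuming the uploaded array is named A and its attribute is val; substitute your actual names):

$ iquery -aq "aggregate(A, avg(val), count(*))"

If the count disagrees with a.size in NumPy, the upload dropped or duplicated cells; if the count matches but the average doesn't, the values themselves were altered in transit.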

I just ran a quick R test on 20 million values and my result was consistent:

> foo = rnorm(20000000)
> mean(foo)
[1] 0.0002026191
> foo_scidb = as.scidb(foo)
> aggregate(foo_scidb, FUN="avg(val)")[]
  i      val_avg
1 0 0.0002026191

#3

Thanks for the reply! Sorry I didn't get a chance to post the data I tested (they are FITS files stored in AWS S3). I looked at the data again today and found that the raw NumPy data contains some NaNs. Could that be the cause? -Dongfang
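
For reference, NaN propagates through NumPy's mean() while nanmean() skips it (nanmean requires NumPy >= 1.8), so counting the NaNs and comparing the two is a quick first diagnostic:

import numpy as np

# Toy array with a single NaN cell.
a = np.array([[1.0, 2.0],
              [np.nan, 4.0]])

print(np.isnan(a).sum())  # number of NaN cells -> 1
print(a.mean())           # nan: NaN propagates through mean()
print(np.nanmean(a))      # 2.333...: NaN cells are skipped

So if the upload path turns those NaNs into nulls (which SciDB aggregates skip) or otherwise mangles them, the two means could legitimately diverge.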