Possible bug in regrid operator


#1

I’m seeing behavior that might be a bug in the regrid operator. I’m trying to regrid a 2D array in order to display the array as a down-sampled image. For some grid sizes this works perfectly. But when the grid size goes below 9x9, the results come out scrambled.

For example, using the Python bindings, a 10x10 regrid window looks great:

query = f.regrid('gimms_jan', 10, 10, 'avg(ndvi) as ndvi')
plt.imshow(query.toarray())


But a 5x5 window gives the wrong result:

query = f.regrid('gimms_jan', 5, 5, 'avg(ndvi) as ndvi')
plt.imshow(query.toarray())


You can see the results of these queries and others in a capture of my IPython notebook: http://pabercrombie.com/scidb/regrid/Regrid.html. The data set that I used for this experiment is available in SciDB text format at: http://pabercrombie.com/scidb/regrid/gimms_jan_2010.txt.gz. The array schema that I used is:

gimms_jan<ndvi:int16 NULL DEFAULT null,flag:uint8> [y=0:2159,500,0, x=0:4319,500,0]

Anyone know what’s going on here?


#2

Hi Parker,

Thanks for the cool image and sorry for the trouble. I repeated the experiment using R:


I’m getting something that’s quite sensible, albeit upside down. I am using a newer (unreleased) SciDB version, but there shouldn’t have been any substantial regrid changes. I’ll try it on 14.3 when some resources free up.
Is it possible that the image function is screwing with us – due to the presence of missing codes perhaps?

By the way, love the blog and the tips. Also:

  • remember that AQL doesn’t always translate to the most efficient AFL. There is a tension here between making AQL smarter and just focusing on R/Python. I try to avoid AQL at the moment.
  • don’t use the SciDB text format. Your file is small, but that format is generally the most inefficient ever. I recommend csv/tsv, binary or opaque instead.

Hope it helps. I’ll let you know when I get a chance to run this in R on 14.3 or maybe you can give that a shot.


#3

I tried the regrid with 14.3 and R and got the same result as you: upside down but otherwise reasonable. So it must be something in either the Python bindings or the image function. I already filled missing data with zero, so I doubt it would be a missing data problem. I’ll play with it a little more and report on what I find. Thanks!


#4

Looks like the problem might be related to the toarray() function in the SciDB Python binding. I tried running the regrid from Python, then dumping the result to a file and loading into R. I produced the plot in R to take the Python plotting functions out of the loop. For a regrid window less than 10x10 I get the same scrambled image from R as I see in Python, but it works for larger windows.

Specifically, I ran this code in Python:

a = f.regrid('gimms_jan', 5, 5, 'avg(ndvi) as ndvi')
np.savetxt('gimms_jan_5x5.dat', a.toarray())

And then this code in R:

a <- read.table('gimms_jan_5x5.dat')
image(as.matrix(a))

Here’s the result:



#5

Hi Parker,

Thanks for the detailed bug report. I can reproduce this, and it’s definitely a problem with toarray() (or, less likely, shim). I’ll let you know when I know more.

Cheers,
Chris Beaumont


#6

Hi Parker,

Ok, I think the root cause of the problem is that the chunk size is smaller than the array size. When this happens, SciDB-Py’s toarray() method makes incorrect assumptions about the order in which the bytes are downloaded. That’s why the array looks scrambled.
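
To make the failure mode concrete, here is a small NumPy-only sketch of how this kind of scrambling can arise. It assumes the bytes arrive one chunk at a time, with chunks in row-major order and cells in row-major order inside each chunk. That ordering, and the helper names chunk_major_stream, naive_decode and chunk_aware_decode, are illustrative assumptions, not SciDB-Py's actual parsing code.

```python
import numpy as np

def chunk_major_stream(a, chunk_rows, chunk_cols):
    """Flatten `a` chunk by chunk (assumed order: chunks row-major,
    cells row-major within each chunk)."""
    parts = []
    for r in range(0, a.shape[0], chunk_rows):
        for c in range(0, a.shape[1], chunk_cols):
            parts.append(a[r:r + chunk_rows, c:c + chunk_cols].ravel())
    return np.concatenate(parts)

def naive_decode(stream, shape):
    """Treat the stream as one row-major array, ignoring chunk boundaries."""
    return stream.reshape(shape)

def chunk_aware_decode(stream, shape, chunk_rows, chunk_cols):
    """Place each chunk's cells back at their position in the full array."""
    out = np.empty(shape, dtype=stream.dtype)
    pos = 0
    for r in range(0, shape[0], chunk_rows):
        for c in range(0, shape[1], chunk_cols):
            h = min(chunk_rows, shape[0] - r)   # edge chunks may be smaller
            w = min(chunk_cols, shape[1] - c)
            out[r:r + h, c:c + w] = stream[pos:pos + h * w].reshape(h, w)
            pos += h * w
    return out

original = np.arange(24).reshape(4, 6)

# Two chunks along x: the naive reshape interleaves them incorrectly.
stream = chunk_major_stream(original, 4, 3)
scrambled = naive_decode(stream, original.shape)
restored = chunk_aware_decode(stream, original.shape, 4, 3)

# One chunk covering the whole array: the naive reshape happens to be right.
single = chunk_major_stream(original, 4, 6)
```

With a single chunk the two decoders agree, which would explain why larger regrid windows (smaller results) looked fine.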

I’ll fix this on SciDB-Py’s end. In the meantime, you can work around the bug by redimensioning the array into a single chunk. This helper function should do the trick:

from scidbpy.schema_utils import change_axis_schema

def toarray(x):
    """ A version of toarray that works around the bug """
    
    # change chunk size as needed so full array fits in single chunk
    ds = x.datashape
    for i, (c, s) in enumerate(zip(ds.chunk_size, x.shape)):
        if c < s:
            ds = change_axis_schema(ds, i, chunk=s)
            
    schema = x.sdbtype.schema + ds.dim_schema
    return x.afl.redimension(x, schema).toarray()
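
As a rough check on which regrid windows need the workaround, the sketch below computes the result shape for a few window sizes and compares it with the chunk length. It assumes the regrid result keeps the 500-cell chunk length from the gimms_jan schema, which is a guess rather than something confirmed in this thread. The names regrid_shape and fits_in_one_chunk are hypothetical helpers.

```python
import math

ARRAY_SHAPE = (2160, 4320)  # y=0:2159, x=0:4319, from the gimms_jan schema
CHUNK = 500                 # schema chunk length; ASSUMED to carry over to the regrid result

def regrid_shape(shape, window):
    """Shape after regridding with a square window (last block may be partial)."""
    return tuple(math.ceil(n / window) for n in shape)

def fits_in_one_chunk(shape, chunk=CHUNK):
    """True when every dimension is no longer than the chunk length."""
    return all(n <= chunk for n in shape)

for window in (10, 9, 8, 5):
    out = regrid_shape(ARRAY_SHAPE, window)
    print(window, out, fits_in_one_chunk(out))
```

Under that assumption, the 9x9 result (240 x 480) still fits in one chunk while the 8x8 result (270 x 540) does not, which lines up with the scrambling first appearing below 9x9.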

Cheers,
Chris


#7

Thanks, Chris. I’ll give the workaround a try.


#8

Just an update on this: The parsing logic in SciDB-Py has been modified to fix this bug. I believe this change will be reflected in the 14.7 release of SciDB-Py in the next few days (you can also grab the latest version of the code from GitHub if you want to try this out in the meantime: github.com/paradigm4/SciDB-py).