Chunking and fetched results


#1

I’m working on a research project that involves testing SciDB with MODIS data.

Firstly, I’m using:

SciDB version: 18.1
scidb-py version: 18.1.4

From a source NetCDF file, I extracted a numpy.ndarray with shape (435, 433, 433), corresponding to the time, y, and x dimensions. I then used scidb-py to create an array and load the data, letting SciDB determine the chunk length for each dimension.

The resulting schema looks like this, where each dimension reads name=start:end:overlap:chunk_length, so SciDB chose a chunk length of 99 for every dimension:

test_array<val:int16> [time=0:434:0:99; ydim=0:432:0:99; xdim=0:432:0:99]

Where I'm running into a problem is with the dataframe returned when fetching a between query. The values and dimension indices are not what I expected and do not line up with what is in the source NetCDF file.

For instance, the dataframe from this query:

df = sdb.iquery('between(test_array, 3, 94, 424, 3, 94, 424)', fetch=True, use_arrow=True)

… looks like this:

time	ydim	xdim	val
   3	  94	 424	-3000

… where I expected it to look like this:

time	ydim	xdim	val
   3	  94	 424	5137

If I instead create an array whose chunk length equals the full length of each dimension:

test_array_2<val:int16> [time=0:434:0:435; ydim=0:432:0:433; xdim=0:432:0:433]

and run the same between operation, I get the expected results.

However, because the chunk size of that array ends up being huge - 81,557,715 cells (435 * 433 * 433) - query performance is naturally much worse than with the "auto-chunked" array, whose chunk size is 970,299 cells (99 * 99 * 99).

How can I keep a good chunk size and still retrieve results via scidb-py that have the correct dimension indices and attribute values?


#2

Hi @jckoch,
Thank you for reaching out! I'm curious whether the problem may be in scidb-py; what happens if you set use_arrow=False? Locally, I synthesized data for the two array schemas you provided and received the same result from both for the given between query in the iquery client.
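For reference, synthesizing data for schemas like these can be done with build(); a minimal sketch via scidb-py, with an arbitrary value expression and sdb being your connection object:

# Fill the auto-chunked schema with deterministic synthetic values.
# The value expression is arbitrary; any int16-valued formula works.
sdb.iquery("""
    store(
      build(<val:int16> [time=0:434:0:99; ydim=0:432:0:99; xdim=0:432:0:99],
            int16((time * 433 * 433 + ydim * 433 + xdim) % 32000)),
      test_array)
""")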
Thanks,
Dave


#3

Hello @dgosselin,

I appreciate your help. I've tried using use_arrow=False in the iquery calls for the auto-chunked array, but unfortunately I still get the same result:

time	ydim	xdim	val
   3	  94	 424	-3000.0

I also tried running the between query using the iquery client on the command line, and got the same result.


#4

Hi @jckoch,
I see that you're running SciDB 18.1; is that specifically 18.1.0? What does scidb --version show?


#5

It shows:

# scidb --version
SciDB Version: 18.1.13
Build Type: RelWithDebInfo
Commit: 11667d8
Copyright (C) 2008-2017 SciDB, Inc.

#6

If you used a custom script to upload your data, do you mind sharing it?


#7

Hi @dgosselin,

Sorry for the delay. From my testing app, I’ve extracted and “paraphrased” how the array is created and loaded:

import netCDF4
from scidbpy import connect

sdb = connect()  # connection details omitted

nc_file = 'path/to/file.nc'
array_name = 'test_array'
attr = 'val'

# load the file into a SciDB array
with netCDF4.Dataset(nc_file, mode='r') as ds:
    ds.set_auto_maskandscale(False)
    # extract the attribute variable
    layer_data = ds.variables[attr]

    # extract the data as a numpy array, and calculate the end indexes of the three dimensions
    layer_data_arr = layer_data[:]
    time_end_idx = layer_data_arr.shape[0] - 1
    lat_end_idx = layer_data_arr.shape[1] - 1
    lon_end_idx = layer_data_arr.shape[2] - 1

    # construct the 3D array schema (no chunk lengths given, so SciDB auto-chunks) and create the array
    schema = f'<{attr}:{layer_data.dtype}>[time=0:{time_end_idx}; ydim=0:{lat_end_idx}; xdim=0:{lon_end_idx}]'

    sdb.create_array(array_name, schema)
    sdb.load(sdb.arrays[array_name], upload_data=layer_data_arr)

#8

Hi @jckoch,
I'm working through a response for you; sorry for the delay.


#9

Hi @jckoch,
It looks like you’ll need to redimension the input according to the chunk sizes and layout of the destination arrays.
For example, if your array dimensions are [x=0:*:0:1, y=0:*:0:10], then the first 10 values get assigned to x=0, y=0,1,2,...,9. If instead your array dimensions are [x=0:*:0:5, y=0:*:0:2], then the first 10 values are assigned to x=0,y=0, x=1,y=0, ..., x=4,y=0, x=0,y=1, ....
Please try to redimension the loaded data into the destination schema.
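One way that could look with scidb-py, as a sketch rather than a drop-in fix (flat_load is a made-up staging array name): stage the raw values in a 1-D array first, since in one dimension upload order and cell order always coincide regardless of chunking, then compute the 3-D coordinates from the linear index and redimension into the chunked destination schema:

# stage the flat values in a 1-D array; in one dimension the load
# order and the logical cell order are identical, so no reordering
# is needed ('flat_load' is a hypothetical staging array)
sdb.create_array('flat_load', '<val:int16> [i=0:81557714]')
sdb.load(sdb.arrays['flat_load'], upload_data=layer_data_arr.ravel())

# derive time/ydim/xdim from the linear index i, then redimension
# into the chunked destination schema
sdb.iquery("""
    store(
      redimension(
        apply(flat_load,
              time, i / (433 * 433),
              ydim, i % (433 * 433) / 433,
              xdim, i % 433),
        <val:int16> [time=0:434:0:99; ydim=0:432:0:99; xdim=0:432:0:99]),
      test_array)
""")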


#10

Thanks @dgosselin. I’ll work on reshaping the input ndarray to fit the destination array chunking.


#11

Hi @dgosselin. Earlier this week, I was able to rework my input array to properly fit the destination array schema. Now the results from between and filter operations report the correct dimension indices and attribute values.
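For anyone who lands here later, a minimal sketch of one way to do that reordering, assuming SciDB consumes a dense upload in chunk-major order (chunks in row-major order over the chunk grid, cells in row-major order within each chunk); to_chunk_order is a made-up helper, and layer_data_arr is the ndarray from post #7:

import itertools
import numpy as np

def to_chunk_order(arr, chunk_lengths):
    # Flatten arr in chunk-major order: visit chunks in row-major
    # order over the chunk grid, and emit each chunk's cells in
    # row-major order.
    starts_per_dim = [range(0, size, clen)
                      for size, clen in zip(arr.shape, chunk_lengths)]
    pieces = []
    for starts in itertools.product(*starts_per_dim):
        region = tuple(slice(start, min(start + clen, size))
                       for start, clen, size
                       in zip(starts, chunk_lengths, arr.shape))
        pieces.append(arr[region].ravel())
    return np.concatenate(pieces)

# reorder the (435, 433, 433) cube for 99-length chunks before upload
reordered = to_chunk_order(layer_data_arr, (99, 99, 99))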

Thanks again.