Problem with dimensions indexes in SciDB 19.11

Hello!
I use SciDB from the Docker image rvernica/scidb:19.11-xenial.

scidb --version
SciDB Version: 19.11.3
Build Type: RelWithDebInfo
Commit: e75cd16
Copyright (C) 2008-2019 SciDB, Inc.

I have a problem with dimension indexes when the chunk size is smaller than the dimension size.

For example, when I use this script:

import numpy as np
from scidbpy import connect

database = connect()  # connection to SciDB via Shim

scheme = "<data_value:float> [time = 0:*; lattitude = 0:144:0:?; longitude = 0:184:0:?]"
database.create_array('data', scheme)
database.load(database.arrays.data,
              upload_data=np.arange(8 * 144 * 184).reshape([8, 144, 184]))
print(database.arrays.data.schema())
print(database.arrays.data[:])

I receive this output:

data<data_value:float> [time=0:*:0:102; lattitude=0:144:0:99; longitude=0:184:0:99]
        time  lattitude  longitude  data_value
0          0          0          0         0.0
1          0          0          1         1.0
2          0          0          2         2.0
3          0          0          3         3.0
4          0          0          4         4.0
     ...        ...        ...         ...
211963    21         62          4    211963.0
211964    21         62          5    211964.0
211965    21         62          6    211965.0
211966    21         62          7    211966.0
211967    21         62          8    211967.0

[211968 rows x 4 columns]

but when I change the schema to <data_value:float> [time = 0:*; lattitude = 0:144:0:144; longitude = 0:184:0:184]
I receive:

data<data_value:float> [time=0:*:0:37; lattitude=0:144:0:144; longitude=0:184:0:184]
        time  lattitude  longitude  data_value
0          0          0          0         0.0
1          0          0          1         1.0
2          0          0          2         2.0
3          0          0          3         3.0
4          0          0          4         4.0
     ...        ...        ...         ...
211963     7        143        179    211963.0
211964     7        143        180    211964.0
211965     7        143        181    211965.0
211966     7        143        182    211966.0
211967     7        143        183    211967.0

[211968 rows x 4 columns]

As we can see, the array size is the same in both cases, but the dimension indexes are strange in the first case.
Why does this happen? Does the chunk size affect the dimension size, or do I have a mistake in my code?

Hi @segodk, just a quick note to let you know I am looking at your problem. I hope to get back to you with an answer soon.

Hello @segodk, here’s the explanation for the behavior you are seeing.

TL;DR: The load() and input() operators keep data local in the
expectation that a redimension() will be done soon. Data is placed in
the cluster based on a hash of the “upper left” corner of a chunk. In
a schema, different chunk lengths will result in a different set of
“local” upper left corners. Therefore the uploaded data is placed in
chunks at a different set of locations, and so the cells have
different coordinates. But all the cells are present.

To load this array for real, your load schema would be something like

<data_value:float, time:int64, lattitude:int64, longitude:int64>

And then you would redimension to your target schema. That will give
consistent results no matter what the chunk lengths of the load schema
are.
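
If it helps, here is a rough, untested SciDB-Py sketch of that load-then-redimension pattern. The '{fn}'/'{fmt}' upload placeholders are the ones iquery() substitutes for uploaded data (as I recall from the SciDB-Py docs), and the flat load schema, the coordinate construction, and the target array name 'data' are just illustrative, so adapt them to your setup.

import numpy as np
from scidbpy import connect

database = connect()  # same kind of connection as in your snippet

# The values, plus one explicit coordinate attribute per target dimension.
values = np.arange(8 * 144 * 184, dtype=np.float32).reshape(8, 144, 184)
t, la, lo = np.meshgrid(*(np.arange(s, dtype=np.int64) for s in values.shape),
                        indexing='ij')
records = np.rec.fromarrays(
    [values.ravel(), t.ravel(), la.ravel(), lo.ravel()],
    names=['data_value', 'time', 'lattitude', 'longitude'])

load_schema = "<data_value:float, time:int64, lattitude:int64, longitude:int64>[i]"
target_schema = "<data_value:float> [time=0:*; lattitude=0:144:0:144; longitude=0:184:0:184]"

# Upload the flat records, then redimension into the target schema and store.
database.iquery(
    "store(redimension(input(" + load_schema + ", '{fn}', 0, '{fmt}'), "
    + target_schema + "), data)",
    upload_data=records)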

The gory details….

SciDB-Py uploads data using SciDB binary format. This is one of
several formats (like ‘csv’ and ‘tsv’) that doesn’t explicitly encode
cell positions. When loading these formats, SciDB assumes that the
data is eventually going to be redimensioned.

Redimensioning involves spreading the data chunks around the cluster
according to the redimension() operator’s target schema. The
placement of a particular chunk of data is based on hashing the
coordinates of the upper left cell position in the chunk.

Spreading around the data chunks (internally called “scatter gather”
or SG) is expensive. Since they assume that a redimension() must be
done in any case, the load() and input() operators do not spread the
data around, but keep the chunks on the local instance, so that in the
typical case there’ll be just one SG instead of two.

Now we come to your examples! In the first example, your schema is

<data_value:float> [time = 0:*; lattitude = 0:144:0:?; longitude = 0:184:0:?]

Using ? for a chunk length means that the system will statically
compute these values. Leaving a chunk length unspecified (as you did
for time) means that the eventual chunk length will be determined
based on the first data stored there. When I uploaded a similar data
file on my system, I got:

AFL% show(fig1);
{i} schema,distribution,etcomp
{0} 'fig1<data_value:float> [time=0:*:0:102; lattitude=0:144:0:99; \
  longitude=0:184:0:99]','hashed','none'

Your second example uses explicit chunk lengths of 144 and 184 for
your lat/lon dimensions. (Note that 0…144 is 145 elements; likewise
0…184 is 185.) With different chunk lengths, the upper left corners hash
differently, resulting in the different coordinates that you see.
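
Just to make the upper-left-corner arithmetic concrete, here is a toy Python illustration of the chunk-origin computation only (this is not SciDB's actual hash function): for a dimension starting at 0 with chunk length L, the cell at coordinate c sits in the chunk whose corner is (c // L) * L.

def chunk_corner(coord, chunk_length, dim_start=0):
    # Origin of the chunk that holds the given coordinate.
    return dim_start + ((coord - dim_start) // chunk_length) * chunk_length

# lattitude chunk corners under the two schemas (0..144 is 145 cells):
print(sorted({chunk_corner(c, 99) for c in range(145)}))   # [0, 99]
print(sorted({chunk_corner(c, 144) for c in range(145)}))  # [0, 144]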

I hope this helps!

Hello @mjl, thank you very much for your explanation!
I am trying to understand the load principle step by step.
You say the load schema should be something like <data_value:float, time:int64, lattitude:int64, longitude:int64>. But in that case I must upload 4 attribute values per cell instead of the 1 attribute value that my numpy array holds. So I necessarily have to upload extra, unused values? Is there no way to upload a pure N-dimensional numpy array without any headers?

If I’m not mistaken, the reshape that you are doing on the numpy array before uploading has no effect. The load operator treats the array as one-dimensional and uploads it as such. As the data is uploaded to SciDB, the values start populating the array and the dimensions get filled, but you have no control over which value goes where, as you already noticed.

So, if you want a specific value to go to specific dimension coordinates, you would have to specify that. Just using reshape does not work.

This post on uploading DataFrames might help.
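
As a rough sketch of that idea (assuming pandas; the column names just mirror the target dimensions from earlier in the thread), you can flatten the numpy array into a table that carries explicit coordinates for every value, and then load and redimension it into your target schema as @mjl described above:

import numpy as np
import pandas as pd

data = np.arange(8 * 144 * 184, dtype=np.float32).reshape(8, 144, 184)

# One row per cell: the value plus the coordinates it should land on.
index = pd.MultiIndex.from_product(
    [range(s) for s in data.shape],
    names=['time', 'lattitude', 'longitude'])
df = pd.DataFrame({'data_value': data.ravel()}, index=index).reset_index()
df = df[['data_value', 'time', 'lattitude', 'longitude']]  # match the load schema order

# A structured array suitable for uploading; redimension it into the
# target schema afterwards.
records = df.to_records(index=False)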