Query results change with different chunk size for same data


#1

Hi all,

I have time-series data about air quality; it is regular hourly data. I have considered four dimensions: station, date, hour, and parameter, and two attributes: measured_value and quality_flag. Since station and date are not integers, I have created integer indexes for them (using uniq and index_lookup) for subsequent use in redimensioning the array.

Then I am trying different chunk sizes to check for any possible improvement in query performance. But whenever I redimension with a different chunk size, my query results differ. I am not able to figure out why this is happening.


#2

You don’t mention the precise queries you are running, but I assume you are probably averaging measured_value, which is a floating-point number?

Floating-point arithmetic is sensitive to data order, and changing the chunk sizes will definitely change the order in which the data are processed.

You’ll need to figure out whether the observed changes are small enough to ignore, or whether you need more precise numerical methods. (E.g. always sum floats from smallest to largest!)
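The order sensitivity is easy to reproduce outside SciDB. A minimal Python sketch (toy values, nothing to do with the air-quality data): the same four doubles summed in two different orders give two different results, because each intermediate sum is rounded to the nearest representable double.

```python
import math

# The same multiset of doubles, summed in two different orders.
values = [1e16, 1.0, 1.0, -1e16]

order_a = sum(values)                      # 1e16 + 1.0 rounds back to 1e16: the 1.0s are lost
order_b = sum([1.0, 1.0, 1e16, -1e16])    # small values first: the 1.0s survive
exact   = math.fsum(values)               # exactly rounded sum, independent of order

print(order_a, order_b, exact)
```

Here `order_a` is 0.0 while `order_b` and `exact` are 2.0. An average built on such sums will likewise drift slightly when the summation order changes, which is exactly what a different chunking does.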


#3

Yes, I am averaging measured_value (type: double), and also counting the number of records returned by the query (since I use slicing/filtering). There is a slight difference in the average value, and the count of records returned differs as well. This would not happen in a relational database.

Sorry, I did not get what you meant here. Is your suggestion that I sort the data before computing the average?


#4

I was suggesting that you should understand your data and know whether a little rounding error is acceptable. However, the fact that your record count also differs is worrisome, as that does not happen in SciDB under normal circumstances either. If you describe the problem with the exact schema, query, and results, we might better understand what is going on.

BTW, here is a nice primer on floating point for the curious: https://ece.uwaterloo.ca/~dwharder/NumericalAnalysis/02Numerics/Double/paper.pdf


#5

Thank you for addressing the issue. Below is the detailed information.

The 1D array schema is:

  create array hourdata
    <values:double NULL,
     flag:int64 NULL,
     station_code:string NULL,
     component_code:int64 NULL,
     date:string NULL,
     hour:int64 NULL>
    [i=1:*,100000,0]

Summary statistics:

  attribute_number attribute_name        min        max distinct_count non_NA_count
1                0         values      -9999     3999.9         380075    141261230
2                1           flag       -100        100             12    153639456
3                2   station_code    DEBB001    NL00938            378    153639456
4                3 component_code          1       6007             31    153639456
5                4           date 1976-01-02 2012-12-31          13601    153639456
6                5           hour          0         23             24    153639456

Creating integer indexes for date and station_code:

  store(uniq(sort(project(hourdata, station_code))), station_idx)
  store(index_lookup(hourdata, station_idx, hourdata.station_code, stn_index), index1)

  store(uniq(sort(project(hourdata, date))), Date_idx)
  store(index_lookup(index1, Date_idx, index1.date, date_index), index2)
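For readers unfamiliar with this pattern, a toy Python analogue (not SciDB) of what the uniq/sort/index_lookup pipeline does: build a sorted list of the distinct string values, then map each record's string to its position in that list. The station codes below are made up for illustration.

```python
# Toy analogue of store(uniq(sort(project(hourdata, station_code))), station_idx)
stations = ["NL00938", "DEBB001", "NL00938", "DEBB001"]
station_idx = sorted(set(stations))            # distinct values in sorted order

# Toy analogue of index_lookup(...): each string becomes its integer position
lookup = {s: i for i, s in enumerate(station_idx)}
stn_index = [lookup[s] for s in stations]

print(station_idx)   # ['DEBB001', 'NL00938']
print(stn_index)     # [1, 0, 1, 0]
```

The resulting integer attribute (stn_index, date_index) is what makes the strings usable as dimensions in the redimension below.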

The redimensioned array schema is:

  create array Array1
    <values:double NULL,
     flag:int64 NULL,
     date:string NULL,
     station_code:string NULL>
    [component_code=0:*,24,0,
     stn_index=0:*,24,0,
     hour=0:*,24,0,
     date_index=0:*,24,0]

Applying the redimension:

  store(redimension(index2, Array1), Array1)

The query is:

  aggregate(between(Array1, 5, 0, 0, 8765, 5, 378, 23, 12782), avg(values), count(values))

Result:

   values_avg values_count
   25.39707      8107168
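As a side note for readers, a toy Python sketch (not SciDB, with made-up coordinates) of what between(...) followed by aggregate(avg(values), count(values)) computes: keep only the cells whose coordinates fall inside the low/high box, then average and count the surviving values.

```python
# Cells keyed by (component_code, stn_index, hour, date_index); values are doubles.
cells = {
    (5, 0, 0, 9000): 10.0,
    (5, 10, 12, 10000): 30.0,
    (4, 0, 0, 9000): 99.0,   # component_code 4 lies outside the box below
}

low  = (5, 0, 0, 8765)       # low corner, as in the query above
high = (5, 378, 23, 12782)   # high corner

# between(...) keeps cells inside the box; aggregate(...) averages and counts them.
box = [v for coord, v in cells.items()
       if all(lo <= c <= hi for c, lo, hi in zip(coord, low, high))]

values_avg, values_count = sum(box) / len(box), len(box)
print(values_avg, values_count)
```

Chunking does not change which cells the box contains, so in principle only the rounding of the average should vary, not the count.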

Changing only the chunk sizes in the schema:

  create array Array2
    <values:double NULL,
     flag:int64 NULL,
     date:string NULL,
     station_code:string NULL>
    [component_code=0:*,1,0,
     stn_index=0:*,100,0,
     hour=0:*,24,0,
     date_index=0:*,1000,0]

Applying the redimension:

  store(redimension(index2, Array2), Array2)

The same query:

  aggregate(between(Array2, 5, 0, 0, 8765, 5, 378, 23, 12782), avg(values), count(values))

Result:

   values_avg values_count
   25.39749      8107223

#6

Which version of SciDB are you using? I ask because, if you are using a version before 15.7, it may be that your initial redimension() has collisions and you do not know it.
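To make the collision idea concrete, a toy Python sketch (not SciDB, with made-up records): when two source records map to the same target cell, only one survives, and which one survives can depend on the order in which records are processed, e.g. on the chunking. That would change both the stored values (hence the average) and how many source records are silently lost (hence the count).

```python
records = [
    # (component_code, stn_index, hour, date_index, value)
    (5, 0, 0, 0, 10.0),
    (5, 0, 0, 0, 99.0),   # same coordinates as the record above: a collision
    (5, 1, 0, 0, 20.0),
]

def redim(recs):
    """Toy redimension: each coordinate tuple becomes one cell;
    a later colliding record silently overwrites an earlier one."""
    cells = {}
    for comp, stn, hr, dt, val in recs:
        cells[(comp, stn, hr, dt)] = val
    return cells

a = redim(records)             # one processing order
b = redim(reversed(records))   # the other order: the other colliding value wins

print(sorted(a.values()), sorted(b.values()))
```

Both runs keep only 2 of the 3 records, but they keep different values (99.0 versus 10.0 for the colliding cell), so an aggregate over the result is order-dependent.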