Question about the cause of data skew in window aggregation


Hi experts,

I am wondering whether, in certain applications, window aggregation could avoid the random distribution of chunks in order to further improve performance. Consider the two general window aggregation scenarios below:

  1. The window slides along one or more spatial dimensions: Gaussian smoothing of images;
  2. The window slides along the temporal dimension: time series calibration.

In the first scenario, if we only consider dense arrays, is it really possible to encounter any data skew during the smoothing? So far I have never seen a smoothing operation that applies a predicate based on the values of array elements, since smoothing involves evaluating similarities, which should be done over all the queried elements (even the missing values).
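
To make this concrete, here is a minimal sketch of what I mean. The function name, array shape, and chunk shape are all made up for illustration, and I am assuming regular chunking of a dense 2-D array where every cell is the centre of exactly one window. Under those assumptions the work per chunk depends only on the chunk shape, never on the cell values:

```python
from itertools import product

def window_evals_per_chunk(shape, chunk_shape):
    """Count window evaluations per chunk for a dense 2-D array.

    Hypothetical helper: every cell is the centre of exactly one window,
    so the count for a chunk is just the number of cells it covers.
    """
    rows, cols = shape
    chunk_rows, chunk_cols = chunk_shape
    counts = {}
    for r0, c0 in product(range(0, rows, chunk_rows),
                          range(0, cols, chunk_cols)):
        # Edge chunks may be smaller when the shape is not a multiple
        # of the chunk shape.
        height = min(chunk_rows, rows - r0)
        width = min(chunk_cols, cols - c0)
        counts[(r0 // chunk_rows, c0 // chunk_cols)] = height * width
    return counts

# Illustrative sizes only: an 8x8 array split into 4x4 chunks.
counts = window_evals_per_chunk((8, 8), (4, 4))
print(counts)
```

If that reasoning holds, a plain contiguous assignment of chunks to workers would already be balanced in this case, and only the smaller edge chunks could introduce any imbalance.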

In the second scenario, randomly distributing time slices (or groups of several time slices) may already be skew-tolerant enough. Since the time dimension is usually the highest dimension of the array (without redimensioning), the slicing can be very efficient.
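
As a rough sketch of why I expect the slicing to be cheap: I am assuming here that "highest dimension" means the slowest-varying dimension of a row-major flat file, so that a run of time slices is one contiguous byte range. The function name, the `halo` parameter (extra slices on each side to cover the window overlap), and the sizes are hypothetical:

```python
def time_slab_byte_range(shape, itemsize, t_start, t_stop, halo=0):
    """Return (offset, length) in bytes for time slices [t_start, t_stop).

    Assumes a flat row-major array whose first dimension is time.
    `halo` adds that many slices on each side, clipped to the array
    bounds, so a worker can evaluate windows near the slab boundary.
    """
    n_time = shape[0]
    cells_per_slice = 1
    for extent in shape[1:]:
        cells_per_slice *= extent
    lo = max(0, t_start - halo)
    hi = min(n_time, t_stop + halo)
    slice_bytes = cells_per_slice * itemsize
    return lo * slice_bytes, (hi - lo) * slice_bytes

# Illustrative sizes only: 100 time steps of 64x64 cells, 4 bytes each.
offset, length = time_slab_byte_range((100, 64, 64), 4, 10, 20, halo=2)
print(offset, length)
```

Each worker would then need a single seek and a single sequential read per slab, which is the efficiency I am hoping to exploit.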

To be honest, I am implementing window aggregation directly over raw data files, which may be stored as flat (unchunked) arrays, and I think giving up some unnecessary skew tolerance would help me improve performance.