Dimensions vs Attributes



It would be good to have some sort of concise guide to the tradeoffs between dimensions and attributes - for example, I can use either in a WHERE clause, but are there reasons to prefer one or the other?



The presence of dimensions is really what separates SciDB from other engines. Dimensions unite what other databases call “indexing” and “clustering” in one feature. They impact the following things:

  • Filtering over dimensions is much faster (when you use specialized AFL command between(); there is also a neat cross_join() filtering trick)
  • Grouped aggregates are done by dimensions
  • Operators slice(), subarray(), etc use dimensions to quickly select subregions
  • Joins of arrays are performed with dimensions; if two arrays have the same exact coordinate system, then join/merge commands are effortless; the cross_join command can also be used to join arrays where some of the dimensions are common
  • Based on what dimensions and chunking you choose, you have a guarantee that data that is spatially located in the same chunk shall be stored together on the same instance

Hope this helps!


That’s very helpful, many thanks!



So what is the impact/performance of querying across dimensions?


Very curious to know what you mean by “query across dimensions”?

Suppose I have an array that looks like this (let’s ignore the physical organization details for a moment - trying to understand the semantics of what you want).

    a : type_one,
    b : type_two,
    c : type_three >
[ I = 0:*, CLd1, OLd1, J = 0:1000, CLd2, OLd2,  J = 0:1000, CLd3, OLd3 ];

This being a three dimensional array, with three attributes. As Alex says, the data model’s ideal for slicing and dicing through dimensions.

aggregate ( slice ( Example, I, 5 ), sum ( a ), max ( b ), J );

What’s an example of a “query across dimensions”? If it’s just a query where the predicate doesn’t mention a dimension–such as a filter or apply–then we proceed to apply the query to all qualifying chunks–which would be all of them in the simplest case–in parallel. For example:

project ( apply ( filter ( Example, c > 10 ), d, a + 5 ), d );

The query above filters all the values in Example where Example.c > 10, and then adds 5 to a, and projects out only d.

The decision about whether something should be an attribute or a dimension is similar to the problem is deciding on your keys in SQL systems. If the predicates in your application are highly selective, then it might make sense to include the “column” in the list of dimensions. On the other hand, if the predicates aren’t very selective, indexing won’t help much, and you might as well scan (in parallel) for optimal results.