Let's sketch some answers. Several of your questions can be combined.
- How do we 'index' data in SciDB? How do you retrieve data quickly?
First, SciDB doesn't use any "traditional" indexing. No B-trees, R-trees, GiST, etc. We might put them in eventually, but because we're geared for decision support queries and analytic algorithms that grind through bulk data, it isn't clear that attribute indices will help us much.
Now, in lieu of traditional indexing, what we do is pretty much exactly what folks who implement archives of big images, or who perform linear algebra at scale, do. If you have an array A, we divide the cells of A up using a technique known as 'block partitioning'. Basically, we chop the array up into many smaller (2–16 MB) blocks (we call them 'chunks') using a simple-minded algorithm, and then we distribute each chunk over the instances that make up the SciDB installation using a hashing method.
This is how we achieve fast query results when a query pulls out a 'region' of the array: say a box from within it, or a 'slice' of it. From the predicate's size and shape, we can calculate the list of chunks that contain the data the query wants, and we can figure out which instances hold those chunks. For example, suppose I had an array A that was 3,000 x 3,000 in size, and we have four instances: I1, I2, I3 and I4. Assuming array A is dense, we would probably divide it up into 9 chunks, each of which was 1,000 x 1,000. Then we would map C1 (which contains A[0,0 -> 1000,1000]) to I1, C2 (which contains A[0,1000 -> 1000,2000]) to I2, and so on.
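As a sketch of that partitioning and placement (the round-robin assignment below just stands in for SciDB's hashing method, whose actual function is an internal detail; chunk ids and sizes are the ones from the example):

```python
# Sketch: block-partition a dense 3,000 x 3,000 array into 1,000 x 1,000
# chunks, then assign each chunk to one of four instances. Round-robin
# placement is an illustrative stand-in for SciDB's internal hash.

CHUNK = 1000        # chunk edge length
ARRAY = 3000        # array edge length (square, dense)
INSTANCES = 4       # I1..I4

def chunks(array=ARRAY, chunk=CHUNK):
    """Yield (chunk_id, (row_start, col_start)) in row-major order."""
    per_side = array // chunk
    for cid in range(per_side * per_side):
        r, c = divmod(cid, per_side)
        yield cid + 1, (r * chunk, c * chunk)   # 1-based ids: C1..C9

def instance_for(chunk_id, instances=INSTANCES):
    """Map a chunk id to an instance number (I1..I4), round-robin."""
    return (chunk_id - 1) % instances + 1

mapping = {f"C{cid}": f"I{instance_for(cid)}" for cid, _ in chunks()}
print(mapping)   # C1 -> I1, C2 -> I2, C3 -> I3, C4 -> I4, C5 -> I1, ...
```

The point is that placement is pure arithmetic over the chunk id: no lookup structure needs to be consulted or maintained.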
Now suppose your query is 'aggregate ( between ( A, 900, 900, 1100, 1100 ), count (*) )'. In other words, you want to count the cells in the region A[900,900 -> 1100,1100]. We know, by looking at our block partitioning algorithm, that out of the 9 chunks that store data for A, this region involves C1, C2, C4, and C5. And we can tell, from our chunk -> instance mapping, that these are to be found on I1, I2, and I4.
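That chunk lookup is just interval arithmetic over the box predicate. A self-contained sketch (again, the chunk size, 1-based chunk ids, and round-robin placement are illustrative assumptions, not SciDB internals):

```python
# Sketch: given a between() box on a 3,000 x 3,000 array block-partitioned
# into 1,000 x 1,000 chunks, compute which chunks overlap the box and which
# instances hold them. Round-robin stands in for SciDB's internal hash.

CHUNK, PER_SIDE, INSTANCES = 1000, 3, 4

def chunks_for_region(r0, c0, r1, c1):
    """Return the 1-based ids of chunks overlapping A[r0,c0 -> r1,c1]."""
    ids = []
    for r in range(r0 // CHUNK, r1 // CHUNK + 1):
        for c in range(c0 // CHUNK, c1 // CHUNK + 1):
            ids.append(r * PER_SIDE + c + 1)
    return ids

def instances_for(chunk_ids):
    """Round-robin chunk -> instance mapping, 1-based (I1..I4)."""
    return sorted({(cid - 1) % INSTANCES + 1 for cid in chunk_ids})

cids = chunks_for_region(900, 900, 1100, 1100)
print([f"C{c}" for c in cids])                  # ['C1', 'C2', 'C4', 'C5']
print([f"I{i}" for i in instances_for(cids)])   # ['I1', 'I2', 'I4']
```

So the query only ever touches the four chunks that can contain matching cells, and only the three instances that hold them.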
Does that help? We don't use any kind of indexing, distributed or not. Rather ... we exploit the mathematical properties of arrays. All of our "indexing" is algorithmic.
- How can you extend SciDB?
Have a look in the examples directory that we ship with the source. You'll find examples of Postgres-style user-defined types, functions and aggregates, and you'll find examples of how to create your own 'operators'. The idea is to support things like this:
CREATE ARRAY My_Array < attribute : My_Type >;

SELECT My_Aggregate ( attribute )
FROM My_Operator ( My_Array )
WHERE My_Function ( My_Array.attribute ) > 5;
Does that help?