The short answer is, “it depends”.
It’s important to understand at the outset that, inside SciDB, many of the “intermediate arrays” don’t really exist, except as a kind of logical “edge” in the query plan. Suppose you have a query like this: foo ( bar ( mug1 ( inputA ), mug2 ( inputB ), … ), … ). If you were designing some other system, you could execute this query in the following fashion:
- Compute mug2 ( inputB ) and store to Temp_1.
- Compute mug1 ( inputA ) and store to Temp_2.
- Compute bar ( Temp_2, Temp_1) and write out to Temp_3, deleting Temp_1 and Temp_2 when you’re done.
- Compute foo ( Temp_3 ), write it out to Temp_4, deleting Temp_3 when you’re done.
- When the client program asks for a piece of the result, find that piece in Temp_4 and return it to the client.
- When the client program ends the query, delete Temp_4.
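To make the cost concrete, here is a toy Python sketch of the operator-at-a-time scheme above. The mug1/mug2/bar/foo operators are arbitrary stand-ins (nothing to do with real SciDB operators), and pickle files on disk stand in for the Temp_N arrays:

```python
# Hypothetical sketch of "operator-at-a-time" execution: every step
# materializes its full result to storage before the next step runs.
import tempfile, pickle, os

def materialize(data):
    """Write an intermediate result to disk and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        pickle.dump(data, f)
    return path

def load(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy stand-ins for the operators in the example query above.
mug1 = lambda a: [x + 1 for x in a]
mug2 = lambda b: [x * 2 for x in b]
bar  = lambda a, b: a + b            # list concatenation, for illustration
foo  = lambda c: [x for x in c if x > 2]

inputA, inputB = [1, 2, 3], [1, 2, 3]

t1 = materialize(mug2(inputB))                # step 1: write Temp_1
t2 = materialize(mug1(inputA))                # step 2: write Temp_2
t3 = materialize(bar(load(t2), load(t1)))     # step 3: read both, write Temp_3
os.remove(t1); os.remove(t2)
t4 = materialize(foo(load(t3)))               # step 4: read Temp_3, write Temp_4
os.remove(t3)

result = load(t4)                             # client reads pieces from Temp_4
os.remove(t4)                                 # query ends: delete Temp_4
```

Every arrow in the plan becomes a full write plus a full read, which is exactly the "go slow" button described below.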
Now, the approach above would have the (not inconsiderable) advantage that queries could be restarted, at any point in the plan, using the inputs stored by the previous step. This “operation-at-a-time” approach is in fact the way Hadoop’s Map/Reduce mechanism works, and it helps to explain why Hadoop has those very nice failover properties. We think we can get something similar using a similar idea (checkpointing state at critical points in the plan). But we’re not there yet. And we are more ambitious about our performance goals.
The approach outlined above has a huge, obvious drawback: the result of every step hits the storage. If your last step merely involves counting how many cells were output by the query, you would still need to write and then read the entire intermediate result. This is called a “go slow” button. Mind you, if you step back to 10,000 feet you can see why this mechanism lets you come up with very impressive scalability curves. It’s very, very slow, but as you add more and more resources you can cope (equally slowly) with more and more data. We think we can get the same scalability and similar reliability without compromising on speed. (What we lose is the ability to program literally anything into mappers and reducers.)
In SciDB, we try to avoid materializing intermediate arrays wherever possible. And what we don’t materialize doesn’t have a “distribution” associated with it. What we’re able to do inside SciDB is recognize when one operator can just ask the operators “below” it (its inputs) for the “next” bits of the array (base array, logical intermediate array, or a physically materialized intermediate array) that it needs, and then pass what it computes on up the plan tree to the operators “above” it. Think of the whole thing as a kind of “data pump”, with filters and valves in the operators along the way, each of which is processing a piece of the array at a time. Of course, things don’t always work out this neatly. Some operators require that we materialize an intermediate result (or at least, large bits of it) before the “next” operator can do its thing. And as your colleague has figured out, those intermediate results can have very strange shapes indeed.
Here are some more details. For some operations–usually the shape-preserving ones like filter(…), apply(…), between(…), join(…), merge(…) and so on–the size and shape of the output array is identical to the size and shape of the input array(s). We know this because each operator tells us. So when we see a chain or sequence of these operators, for example:
join ( filter ( inputA, ... ),
       filter ( inputB, ... ) ),
we know that we can pump data through them without any materialization of intermediate data. In fact we go one step further. We try to pump blocks-at-a-time through the operators for maximal CPU efficiency and minimal Memory <-> CPU cache bandwidth usage.
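The “data pump” can be sketched with Python generators, where each operator pulls the next chunk from the operator below it. The chunk size and the scan/filter/apply functions here are illustrative stand-ins, not SciDB internals:

```python
# Hedged sketch of pull-based, chunk-at-a-time execution: each operator
# is a generator that asks the operator "below" it for the next chunk.
CHUNK = 4  # toy chunk length, not a SciDB setting

def scan(array):
    """Leaf operator: yield the stored array one chunk at a time."""
    for i in range(0, len(array), CHUNK):
        yield array[i:i + CHUNK]

def apply_op(chunks, fn):
    """Shape-preserving apply: computes a new value per cell."""
    for chunk in chunks:
        yield [fn(v) for v in chunk]

def filter_op(chunks, pred):
    """Shape-preserving filter: processes a chunk, passes it up."""
    for chunk in chunks:
        yield [v for v in chunk if pred(v)]

# filter ( apply ( scan(inputA), ... ), ... ) as a pipeline:
inputA = list(range(10))
plan = filter_op(apply_op(scan(inputA), lambda v: v * v),
                 lambda v: v % 2 == 0)

# The client's pull at the top of the plan drives the whole pump;
# no intermediate array is ever materialized in full.
result = [v for chunk in plan for v in chunk]
```

Nothing runs until the top of the plan asks for data, and at no point does a complete intermediate array exist in memory or on disk.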
With some operators–cross(…), cross_join(…)–we look at the inputs and make decisions about how best to perform the operator depending on the sizes of the inputs. For cross(…), we pick the smaller of the two inputs, and we materialize / replicate it on all instances. This means that the actual cross(…) operation can then proceed in parallel. Of course, the time and space involved in the replication step can be significant. And we’re looking at ways to minimize that.
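The replicate-the-smaller-input strategy for cross(…) can be sketched like this; the instance count and the round-robin partitioning are assumptions for illustration, not SciDB’s actual layout:

```python
# Sketch of the replicate-the-smaller-side strategy for cross(...):
# each instance holds a partition of the larger input; the smaller
# input is copied to every instance, so the cross product then runs
# in parallel with no further data movement.
from itertools import product

def cross_replicated(large, small, n_instances=3):
    # Partition the larger input across instances (round-robin here).
    partitions = [large[i::n_instances] for i in range(n_instances)]
    # Replicate the smaller input to every instance; each instance
    # computes its local cross product independently.
    per_instance = [list(product(part, small)) for part in partitions]
    # The union of the local results is the full cross product.
    return [pair for local in per_instance for pair in local]

pairs = cross_replicated([1, 2, 3, 4], ["a", "b"])
```

The replication cost is proportional to (size of the smaller input) × (number of instances), which is why picking the smaller side matters.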
Other operators–aggregate(…), window(…), regrid(…) for example–are even weirder. What we’re able to do here is figure out what the size, shape, and distribution of the output array will be based on the distribution of the inputs. Consequently we can pull whatever data the client wants from the instance on which it is produced. But if you store(…) the output of one of these “shape changing” operators, then we go through the whole process of figuring out, for each chunk (which might now be much larger or smaller or stranger in shape), where that chunk should go (the instance it belongs on). So if you’re planning to store(…) the result of a query to the database on a temporary basis and then subject this result to lots of intense read queries, it might be helpful to use repart(…) to set the chunking to what you want.
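A toy sketch of the chunk-routing step that store(…) triggers after a shape-changing operator; the hash-based placement function below is an assumption for illustration, not SciDB’s actual distribution scheme:

```python
# Why store(...) after regrid(...) etc. costs extra: the output
# chunking no longer matches the input chunking, so each output
# chunk must be routed to the instance that owns its coordinates.

def instance_for_chunk(chunk_coords, n_instances):
    """Map a chunk's coordinates to the instance that owns it
    (illustrative hash placement, not SciDB's real function)."""
    return hash(chunk_coords) % n_instances

n_instances = 4
# Output chunk grid of a hypothetical shape-changing operator:
output_chunks = [(r, c) for r in range(3) for c in range(3)]
placement = {coords: instance_for_chunk(coords, n_instances)
             for coords in output_chunks}
# Every chunk whose owner differs from where it was produced
# has to move across the network before the store completes.
```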
Worth noting that what makes all of this possible is the way that SciDB was designed around the array data model. Arrays have very handy properties when you’re doing this kind of manipulation. We can reason about array properties like sizes, shapes, boundaries, locations and distributions based on the metadata we maintain about the array schemas, the behavior of the operators, and the configuration of the cluster. It’s really hard to do that with just a file-system and no abstract data model.
If you’re just pumping query results to a client, don’t worry too much about intermediate results. There isn’t much you can do, short of injecting lots of repart(…) operators by hand. And we don’t yet feel like we have enough of an understanding of our own system to do this sensibly (we’re working on adding smarts to help).
The best thing you can do is to pay attention to the physical design of your permanent arrays - dimensions, dimension order, chunk lengths, and so on. That basic physical layout is what drives everything else. And there the rules are few and simple (mostly because we don’t yet know any better). Try to keep your chunks to 2 to 8 MB in size (about 1,000,000 values per chunk), and order the dimensions so that the one(s) you want to group by are innermost, and the ones you want to slice and dice by (slice(…), between(…) etc.) are outermost.
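Quick arithmetic behind the rule of thumb (the attribute byte widths here are assumed for illustration):

```python
# Back-of-the-envelope chunk sizing: roughly 1,000,000 values per
# chunk lands in the 2-8 MB range for common attribute widths.
values_per_chunk = 1_000_000
bytes_per_value = {"int32": 4, "double": 8}   # assumed widths

chunk_bytes = {t: values_per_chunk * w for t, w in bytes_per_value.items()}
chunk_mb = {t: b / 2**20 for t, b in chunk_bytes.items()}
# A 1000 x 1000 chunk of doubles is ~7.6 MB; of int32s, ~3.8 MB.
```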
If you’re debugging a plan, look at the physical plan we generate and write to scidb.log.
Long post. Sorry. Hope it clears things up.