- What happens when I lose a server during a long query execution?
With the SciDB core alone, at the moment, bad things will happen. You should get an error message, but exactly what happens depends on how and why you lost the instance: network partitioning, a machine crash, a software problem, and so on. So on the Paradigm4 side we have created a set of operators that plug into the SciDB core to manage its instances. If you use this plugin, you will have a much better grip on the SciDB installation’s state: we actively instrument the SciDB instances so that they constantly exchange heartbeats carrying command/control information.
Here is what happens today if you have those plugins. Suppose you have 16 SciDB instances in your installation, you’re running a big query, and one instance crashes (that is, it stops responding to or sending heartbeats). SciDB will terminate all running queries and return a NO_QUORUM error message to the client. Unfortunately, we have to terminate the query because, by design, we wanted to avoid (at this time) materializing or checkpointing intermediate results as they are produced. We’ll get around to doing that once we start seeing people run very large numbers of instances, or when enough people tell us they need it. Materializing intermediate state isn’t that hard: we’re just more focused on getting the feature list right before we invest too much in quality-of-service functionality.
(NOTE: “Instances” is the term we use to distinguish SciDB engine processes from computer servers, because you might have, for example, 16 SciDB instances sharing 4 physical computers.)
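The heartbeat-and-quorum behavior described above can be sketched roughly as follows. This is an illustrative model, not SciDB’s actual implementation: the class name, the timeout value, and the injectable clock are all assumptions made for the example.

```python
import time

class QuorumMonitor:
    """Hypothetical sketch: track the last heartbeat from each instance and
    declare quorum lost when any instance goes silent too long."""

    def __init__(self, instance_ids, timeout=5.0, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence before an instance is "down"
        self.clock = clock          # injectable for testing; real code uses monotonic time
        now = clock()
        self.last_seen = {i: now for i in instance_ids}

    def record_heartbeat(self, instance_id):
        self.last_seen[instance_id] = self.clock()

    def dead_instances(self):
        now = self.clock()
        return [i for i, t in self.last_seen.items() if now - t > self.timeout]

    def has_quorum(self):
        # One silent instance is enough to abort all running queries
        # (the NO_QUORUM condition described above).
        return not self.dead_instances()
```

In this model, a single missed-heartbeat window on any instance flips `has_quorum()` to false, which is when the engine would terminate running queries and return NO_QUORUM.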
- Will my query be rescheduled?
Not automatically. You will need to re-submit the query. We felt that, at this time, we’d rather tell people the truth about what’s going on than do something cheap like silently resubmitting the query at the level of the client API.
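If you want automatic re-submission, you can build it yourself on the client side. Here is a minimal sketch of such a wrapper; `run_query`, the `NoQuorumError` type, and the backoff values are assumptions for illustration, so substitute your client library’s actual call and error handling.

```python
import time

class NoQuorumError(Exception):
    """Stand-in for the client-side error raised on a NO_QUORUM response."""

def submit_with_retry(run_query, query, retries=3, backoff=2.0):
    """Re-submit `query` up to `retries` extra times after quorum-loss failures.

    `run_query` is any callable that executes the query and raises
    NoQuorumError when the installation has lost an instance.
    """
    for attempt in range(retries + 1):
        try:
            return run_query(query)
        except NoQuorumError:
            if attempt == retries:
                raise  # give up and surface the error to the caller
            # Wait a bit before retrying, to give instances time to rejoin.
            time.sleep(backoff * (attempt + 1))
```

Since a NO_QUORUM failure aborts the whole query rather than leaving partial state, re-running it from scratch like this is safe for read-only queries.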
- If I have turned on replication (redundancy) can I still have access to all my arrays?
Yes. The initial design goal of the storage manager was to support data availability when you’re not running with all of the instances available. Of course, if you lose more than “k” instances (where k is the redundancy count) then there’s not much we can do but call a halt to proceedings until the instances are back online.
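The availability rule above is simple enough to state as code. This is a sketch of the rule as described, with function names of my own choosing:

```python
def arrays_available(failed_instances, redundancy_k):
    """Arrays stay fully readable as long as at most k instances are down."""
    return failed_instances <= redundancy_k

def failures_still_tolerated(failed_instances, redundancy_k):
    """How many more instances can fail before we must halt proceedings."""
    return max(0, redundancy_k - failed_instances)
```

For example, with redundancy k = 2, losing two instances still leaves every array accessible, but losing a third halts things until instances come back online.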
- If I add new servers/nodes and therefore change the configuration, is there a way to rebalance? Is it manual or automatic?
We don’t (yet) support the kind of elasticity you’re describing, but once again, it’s in the plans. The infrastructure we’ve built for the heartbeats and instance-management metadata supports it; we just haven’t yet had users who make this a priority for us.
Hope this helps!