Constant iterator reset in 16.9


#1

I have a SciDB operator for 15.12 which needs to iterate over the input multiple times. I have been successfully calling reset() on a constant iterator before every new read iteration:

vector<shared_ptr<ConstArrayIterator> > saiters(nSrcAttrs);
saiters[attrId] = srcArray->getConstIterator(attrId);
...
saiters[attrId]->reset();  // rewind before the next read iteration

In 16.9, it seems that reset() is no longer available:

error: 'class scidb::ConstArrayIterator' has no member named 'reset'
           saiters[attrId]->reset();

What is the recommended way to move forward? Should I reinitialize the iterator before every new read iteration, like this?

saiters[attrId] = srcArray->getConstIterator(attrId);

#2

Hey Rares.

There is a ConstArrayIterator::restart() now.
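
In your snippet, the rewind before each read iteration simply becomes:

saiters[attrId]->restart();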

FWIW, iterating over the input multiple times is a bit of an anti-pattern: if the input is the result of a filter/apply/between (non-materializing operators), then the underlying result may be recomputed as you re-iterate. Sometimes that's unavoidable.

Also, not all arrays are guaranteed to support multiple passes. See Array::getSupportedAccess in Array.h. For an example of such a check, see https://github.com/Paradigm4/equi_join/blob/master/PhysicalEquiJoin.cpp#L146
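
A hedged sketch of that kind of guard (the helper name safeToReiterate is my own, and Array::SINGLE_PASS is assumed from Array.h; only MULTI_PASS and RANDOM appear in the code later in this thread):

bool safeToReiterate(std::shared_ptr<Array> const& input)
{
    /*MULTI_PASS and RANDOM both allow re-scanning; SINGLE_PASS does not.
      If this returns false, materialize the input first or reject the
      query, which is roughly what the equi_join check linked above does.*/
    return input->getSupportedAccess() != Array::SINGLE_PASS;
}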

A realistic case of an array that only supports a single pass is an InputArray (from the input() operator). These are rare.


#3

Sorry, I first wrote “reset” but it’s “restart” - see edit above.


#4

Thanks, Alex! This is very useful.

Just to verify: is ensureRandomAccess() the recommended way to buffer an input array inside the operator, so that it isn't recomputed when re-iterating over it?


#5

Actually, not quite!

The current implementation of ensureRandomAccess() is simple:

std::shared_ptr<Array>
PhysicalOperator::ensureRandomAccess(std::shared_ptr<Array>& input,
                                     std::shared_ptr<Query> const& query)
{
    if (input->getSupportedAccess() == Array::RANDOM)
    {
        return input; /*no change to input; quick exit*/
    }
    LOG4CXX_DEBUG(logger, "Query "<<query->getQueryID()<<
                  " materializing input "<<input->getArrayDesc());
    bool vertical = (input->getSupportedAccess() == Array::MULTI_PASS);
    std::shared_ptr<MemArray> memCopy(new MemArray(input->getArrayDesc(), query));
    memCopy->append(input, vertical); /*copy the whole input into the MemArray*/
    input.reset();                    /*drop the reference to the original input*/
    return memCopy;
}

The problem is that supporting random access does not mean supporting cheap random access. For example, a FilterArray (the output of AFL filter()) is said to support random access! In that case, the ensureRandomAccess() call just goes into that early return and does nothing. This whole framework only ensures the capability; it doesn't factor in performance.

Do you want to force the materialization? Then do what the function does, just without that first if.
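
A sketch of that, with a helper name of my own (materializeInput is not a SciDB API):

std::shared_ptr<Array>
materializeInput(std::shared_ptr<Array>& input,
                 std::shared_ptr<Query> const& query)
{
    /*same body as ensureRandomAccess(), minus the Array::RANDOM early return*/
    bool vertical = (input->getSupportedAccess() == Array::MULTI_PASS);
    std::shared_ptr<MemArray> memCopy(new MemArray(input->getArrayDesc(), query));
    memCopy->append(input, vertical);
    input.reset();
    return memCopy;
}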

Another call you could check is Array::isMaterialized(). That returns true for DBArrays (read off primary storage) and MemArrays (mid-query materialized results), and false for everything else. So if isMaterialized() returns true, you know random access will be cheap. Otherwise, it may or may not be cheap.
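
Combining the two, one possible (hedged) policy, reusing the hypothetical materializeInput() helper from above:

/*only pay for a copy when the input isn't already backed by a materialized array*/
if (!input->isMaterialized())
{
    input = materializeInput(input, query);
}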

In general the problem is one of cost optimization. Consider an example query:

your_op(filter(Array, Expression))

There are several variables at play:

1. your total cache sizes and occupancy (mem-array-threshold)
2. number of iterations your_op will perform
3. the complexity of the filter Expression

When the expression is expensive to evaluate, you have ample cache room (or the result of filter is small), and your_op makes many iterations, it's cheaper to materialize the output of filter. But if the expression is cheap, you don't have much cache room to spare (which means caching would hit the disk multiple times), and you're only doing two iterations, it might take less time to just iterate twice…

You can also force a materialization in AFL by adding an _sg call:

your_op( _sg(filter(Array, Expression), 1) )

It’s hard to give a heuristic that will be optimal in all cases. This is something the future optimizer will need to handle.