Input operator


#1

Hello,

To get familiar with the SciDB code, I’ve been rewriting the input operator. While this is mostly an exercise, there are a couple of things I’d like to add. The first is support for multiple file formats, which I’ve partially done (and in that process, I rewrote the current parser to address some edge cases). The second feature is support for queries such as: iquery -aq "filter(input(Two_Dim, '/home/miguel/array2d.txt'), a>1)"

… that is, be able to filter the input directly from its location, without actually having to load it first into the database. (Yes, not totally realistic for a wide variety of scenarios, but this is mostly an exercise and perhaps the beginning of bigger things :smiley: )

Now, I’ve been trying to understand how operators are implemented and executed, in particular the filter operator. I don’t fully understand this yet, so if one of the devs has a bit of time to write down or point me to a high-level overview, I’d be most grateful!

For instance, in InputArray::getConstIterator, the code instantiates iterators only once and reuses them on later calls. (Q: when do they need to be reused through the getConstIterator call? A reused iterator could be in the "wrong" position if getConstIterator returns one that was initialized and advanced (++) earlier.) The code as-is interacts badly with filter; if it is changed to return a new iterator on every call, then the iquery command above works fine.

Having said that, while doing a step-by-step execution of a filter operation, I was rather surprised by the number of times FilterArray::createChunk and FilterArray::createArrayIterator are called during execution… but I guess that’s just because I don’t quite understand the implementation yet!

Thanks,
Miguel


#2

Miguel,

This is a really cool project!

Let me try and mention a few things that might help:

  1. As you can see from PhysicalFilter.cpp, all filter() does is create a filter-array wrapper on top of the given input and return it. No “filtration” is executed until either
  • another operator upstream, or
  • the data return pathway
    starts iterating over the array that filter returns.

In this sense our execution model is “lazy”. Many ops take an input array, wrap another array around it and return that. Some more complex ops do iterate over the input, create some intermediary, and then return a “new array” based on that.

You said you saw far too many calls to createChunk and createArrayIterator. My suspicion is that this is the return pathway being overzealous.
See the method QueryProcessorImpl::execute(boost::shared_ptr<Query> query) in QueryProcessor.cpp.
It creates a “parallel accumulator array” whose job is to prefetch chunks, and I think it might be performing multiple passes over the output of filter.

One thing to try:
iquery -anq "store(filter(input(…)))"

The -n will disable result fetching. The store will iterate over the result in a different fashion.

Does this help at all?


#3

Hi,

Many thanks for the prompt reply! Things are starting to make sense: as you expected, if I use -n and store, the behavior is different. Thanks for pointing this out; now I’ll know where to look!

I’m still not sure why the original InputArray code reuses iterators instead of re-instantiating them on every getConstIterator call, but I’m assuming it’s only to save memory, and because in the original use case there would be no issue with two iterators for the same attribute in different positions?


#4

I didn’t write this guy, so I’m not 100% sure.

But looks like we have

  • one scanner object that sits off of the input array
  • K lookahead objects in the input array
  • K iterators (one iterator per lookahead object)

All of SciDB execution is “columnar”, which means that every array attribute is stored in a separate chunk. But in the file, all attributes appear close together. So the “scanner” moves through the entire file and fills each of the “lookahead” objects with data. It looks like it keeps two chunks in memory at a time per lookahead.

So if I were to reset one of the iterators back to the original position, all other iterators would get reset as a side effect…

Could we allow for multiple “scanners”? Possibly. In practice this would be discouraged because the disk-seek overhead would kill performance.

Hmm… does this help at all?