I’m wondering whether SciDB lets a user get a partial result of a query (especially one involving a UDF)?
Do you wish to see a partial result as a given query continues to execute? Or do you wish to cancel a query at some point during its execution to see a partial result? Does your use of ‘UDF’ here mean user-defined function?
Yes, UDF is user-defined function.
Let’s say I want to implement an incremental algorithm and I’m interested in the partial result after a certain number of rows have been processed. If I achieve a desirable result after 50% of the data has been processed, I can decide to cancel the query instead of letting it keep running. So I want to see the result of a query while it is being calculated. Is that possible?
Hi @hoangthaihuy, @dgosselin and I have been discussing your question. There is no pre-built solution, but there are a couple of possible approaches: writing a user-defined operator (UDO) or using the stream operator from https://github.com/Paradigm4/stream . I wouldn’t go with a user-defined function approach, because UDFs only get to see the scidb::Value objects they’re applied to during query execution, and wouldn’t be able to really alter or stop the flow of execution.
An advantage of going with your own UDO is that you could pass parameters to it that would set your algorithm’s threshold for declaring “good enough”, which might be useful.
UDOs also have access to the Query object and so can cancel the query. But if you cancel, you’ll have to build some custom way of communicating the partial result, because just cancelling the query would not store any result array. You need not cancel the query, though; an operator in the middle of a larger query could decide instead to simply stop providing chunks to the operator that consumes its output. Something like:
store(outer_query(your_udo(inner_query(...), threshold), outer_args), RESULT)
If you can watch the inner_query() results flying past and decide that they’ve met the threshold criteria, then you can cease providing chunks to outer_query(), and store() will store whatever you’ve got.
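To make the "stop providing chunks" idea concrete, here is a small Python model of it. This is not SciDB code and uses no SciDB API: the Chunk type, the good_enough callback, and the running-sum criterion are all made up for illustration, and the function names simply mirror the query above. It only shows the control flow, assuming chunks are pulled lazily one at a time.

```python
from typing import Callable, Iterable, Iterator, List

Chunk = List[float]  # hypothetical stand-in for a chunk of cell values


def inner_query(n_chunks: int, chunk_size: int) -> Iterator[Chunk]:
    """Hypothetical input: lazily yields chunks of consecutive numbers."""
    for c in range(n_chunks):
        start = c * chunk_size
        yield [float(v) for v in range(start, start + chunk_size)]


def your_udo(chunks: Iterable[Chunk],
             good_enough: Callable[[float], bool]) -> Iterator[Chunk]:
    """Pass chunks through while tracking a running statistic.

    Once good_enough(statistic) holds, stop producing chunks. Nothing is
    cancelled: the consumer just sees the output end early.
    """
    running_sum = 0.0  # illustrative statistic; a real operator would track its own
    for chunk in chunks:
        running_sum += sum(chunk)
        yield chunk
        if good_enough(running_sum):
            return  # no further chunks are pulled from the input


def store(chunks: Iterable[Chunk]) -> List[float]:
    """Hypothetical sink: materializes whatever chunks it receives."""
    return [value for chunk in chunks for value in chunk]


if __name__ == "__main__":
    partial = store(your_udo(inner_query(10, 4), lambda total: total >= 20.0))
    print(len(partial), "values stored")
```

Because the middle stage returns instead of raising, the sink keeps everything that was produced up to that point, which is the behaviour described for store() above.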
In the course of deciding whether the threshold is met or not, you might find the AggregateLibrary helpful. In addition to UDOs and UDFs, SciDB also allows user-defined aggregates (UDAs). See include/query/Aggregate.h in the source tree.
Regarding the stream operator from GitHub, the best introduction is this excellent blog post by @rares : https://rvernica.github.io/2017/10/streaming-machine-learning . I have not used the stream operator myself, but you should definitely check it out, since debugging R or Python is much easier than debugging a C++ SciDB operator.
Without knowing more about your application I can’t really say much more, but I hope this gives you some good directions to investigate!
Thank you for the answer.
The idea is to avoid a full scan of the table once a desirable result has been obtained.
An example is running k-means incrementally: after processing a certain number of rows (40% of the dataset, for example), if we don’t see a significant change in the cluster centroids, i.e. the model has stabilized, we can stop there. There is no need to process further. That’s why a partial result is needed.
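For reference, the stopping rule described here can be sketched in plain Python, independent of SciDB. This is a minimal sequential (running-mean) k-means, not a tuned implementation; the parameter names check_every and tol, and the rule "stop when no centroid moved more than tol since the last check", are assumptions chosen for illustration.

```python
from typing import Iterable, List, Sequence, Tuple


def _dist2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def incremental_kmeans(
    rows: Iterable[Sequence[float]],
    init: Sequence[Sequence[float]],
    check_every: int = 100,  # hypothetical: rows between stability checks
    tol: float = 1e-3,       # hypothetical: largest shift still counted as stable
) -> Tuple[List[List[float]], int, bool]:
    """Return (centroids, rows_seen, stopped_early)."""
    centroids = [list(c) for c in init]
    counts = [0] * len(centroids)
    snapshot = [c[:] for c in centroids]  # centroids at the previous check
    seen = 0
    for row in rows:
        seen += 1
        # Assign the row to its nearest centroid.
        j = min(range(len(centroids)), key=lambda i: _dist2(row, centroids[i]))
        counts[j] += 1
        # Running-mean update; the first assigned row replaces the initial guess.
        step = 1.0 / counts[j]
        centroids[j] = [c + step * (x - c) for x, c in zip(row, centroids[j])]
        if seen % check_every == 0:
            shift = max(_dist2(c, s) ** 0.5 for c, s in zip(centroids, snapshot))
            if shift < tol:
                return centroids, seen, True  # stable: remaining rows are never read
            snapshot = [c[:] for c in centroids]
    return centroids, seen, False
```

Since rows is consumed lazily, returning early means the rest of the input is never touched, which is the "no full scan" behaviour I am after. Whether centroid shift is the right stability test depends on the data.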
I’m not sure if the stream() operator will help, but I will give it a try.