Vectorized array expressions


#1

Hi,
Today I received the latest newsletter and the following enhacenment catched my attention:

“Vectorized expression evaluation”

What is this enhacement about?
is it about memory optimization while executing array operations?
that would be great!!

Violet.


#2

Hi, Violet,

What we’re essentially saying is we’ve created a new storage format based on run-length encoding (RLE). If you’re familiar with compression schemes, it’s not strictly RLE, but RLE-based. By default, all of the data is stored in compressed form. A lot of the operations (like grand sum, or filtering) also happen on the compressed form directly. We found that, in many cases, this new data format improves performance. We also found that this format deals with data sparsity in a much nicer fashion.

For example, one use case I’m working on has very sparse, very skewed two-dimensional data where there are small regions of very high density, scattered in a lot of empty space. Using SciDB 11.6, I had no choice but to break this array into many small chunks. The array took up 2+TB of space. Now with 12.3 I was able to increase the logical chunk sizes by a factor of 1,000 and the physical chunk sizes stayed the same. Now this same array occupies only 250GB of space. And queries against this array perform better.

So, that being said, we haven’t yet eradicated all memory problems. We know that some of our ops do have memory usage issues, and working on them. If you are hitting problems and can give me a particular scenario or query - I can definitely try and help, suggest a workaround, etc.

Cheers,
-Alex Poliakov


#3

I have basic knowdledge about RLE compression algorithms.

I do agree that operations on the compressed array data will perform much faster. Are all operations supported? or only sum and filtering operations can benefit from this feature?
Specifically, can other operations such as slicing be applied on the compressed array? If so, is there any index mechanism on the compressed array for this purpose?


#4

If you’re very curious, you could take a look at RLE.cpp in the source. Array data gets encoded into a “RLE payload”, and “indexing” is a binary search operation. There’s a second level of encoding in order to represent the empty cells in the array.

We’ve changed the core format to RLE and we rewrote several operators to work on RLE directly - namely these ops are apply(), filter(), between() and aggregate() when used without group by dimensions. This covers quite a bit of ground, and, over time we will continue converting other operators to RLE mode as appropriate. Obviously, when an op doesn’t support “RLE mode”, it still works. The data just gets decomposed into cells and processed individually. This happens transparently.

Does this help answer your question?


#5

Excellent!
this looks promising.

I’ll take a look at the code as you suggested.
Thank you.