Questions/ideas


#1

I read through the documentation of AIL. I understand that the current functionality and documentation are limited and will be extended. I am posting a few (maybe too detailed) questions and ideas; I don’t know whether some of them have an impact on the design.

  1. The documentation does not say which data types are natively supported by SciDB. Will complex numbers and date/time also be supported?

  2. Nested arrays were planned, but the documentation does not say anything about them. Can you, for example, take a subset of the nested arrays held in the elements of the main array, presumably returned as a single higher-dimensional array?

  3. I assume that chunks are distributed across nodes (perhaps in a round-robin way). Will there be a way to control how they are distributed, in case you know that the operations you want to do can benefit from keeping certain chunks together?
    Similarly, are there ways to control how attributes are stored? I can imagine that lesser-used attributes could be stored separately from more frequently used attributes (a la columnar storage in an RDBMS).

  4. I find it quite limiting that the chunk size must divide the array size evenly. Will that also be the case in later versions?

  5. I hope that in the future functions like standard deviation, median, etc. will be added. I understand that computing a median on a distributed array is not easy, but we need such robust statistics. Functions like any, all, and ntrue are also useful.
    Currently only array-to-scalar reduction functions are present, but I hope that array-to-array reduction functions will also become available (e.g. take the minimum along one axis and then the maximum of the result).
    Furthermore, a function like a running median is also very useful (e.g. to subtract the background from pixels in a (radio-)astronomical image).

  6. I very much dislike the notion of an array origin (I guess that start in CREATE ARRAY means an origin), because it is a source of errors. We once had a C++ Array class where one could specify an origin, but people always forgot about the origin when indexing it. I see something similar happening for SciDB arrays with origins. What is the rationale behind the support of origins?

  7. I assume that the use of ‘-A’ in the sort function is an internal thing; otherwise a sort on an attribute expression cannot be done.

  8. I like the Python way of indexing sequences, so I hope that in AQL one can use something like ARRAY[startx:endx:stridex, starty:endy:stridey, …] where indices can be omitted at will. Negative indices (counting from the end) are also useful. Does AIL have support for strides?
    Other applications might need the average of every K pixels instead of a stride of K, for example when displaying an image of [10000,10000] values.

  9. I hope that in AQL you can address an attribute in an array with the . operator (like array.A).

  10. Why can’t you sort a multi-dimensional array? The result could be a 1-dim linearized array.

Cheers,
Ger


#2

Good suggestions and comments – thanks! I heard you will be coming to XLDB. We can talk further next week.


#3

Hi Gervandiepen!

First - thanks very much for your interest in SciDB! I’ll do my best to answer your questions in order.

  1. Date & Datetime as a data type. We’re absolutely going to support this. The reason we didn’t get to it already is that temporal types are a bit of a can of worms. If you’re going to do it “right” (say, implement the SQL, TSQL2, or XQuery specification) then you need to implement not only the basic types, but also all the operations involving date/datetime arithmetic, time zones, etc. None of it is hard, but doing it involves a lot of work.

Our current thinking is to a) get the user-defined type support right, and then b) use the UDT/UDF mechanism to implement a complete set of temporal types. This would include range types ({ BEGIN DATE, END DATE }) as well as the functions you would expect (BEFORE, OVERLAP, WITHIN, CONTAINS, etc.).
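To make that concrete, here is a minimal C++ sketch of what such a range type and a couple of the expected predicates might look like once built on the UDT/UDF mechanism; the names date_range, overlaps and contains are illustrative assumptions, not part of any current SciDB API:

// Hypothetical temporal range UDT plus two of the expected predicates.
// Dates are held as days since an epoch to keep the sketch simple.
#include <algorithm>

struct date_range {      // { BEGIN DATE, END DATE }, both inclusive
    long begin_day;
    long end_day;
};

// OVERLAP: the two ranges share at least one day.
bool overlaps(const date_range& a, const date_range& b) {
    return std::max(a.begin_day, b.begin_day) <= std::min(a.end_day, b.end_day);
}

// CONTAINS: range a fully encloses range b.
bool contains(const date_range& a, const date_range& b) {
    return a.begin_day <= b.begin_day && b.end_day <= a.end_day;
}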

  2. Nested Arrays. Sorry. We simply ran out of cycles on Nested Arrays. We’re currently trying to see where they fit into the priorities.

  3. a) Chunk distribution. Yes. In 0.75, we will provide a mechanism that supports hash, range, and block-cyclic distribution. b) Vertical partitioning. We here at SciDB are unashamed about stealing good ideas. The existing chunks contain data for only one attribute; that is, the storage manager is a column store.
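For illustration, each of the three schemes can be written as a tiny mapping from a linearized chunk number to a node; this C++ sketch shows only the idea, not the mechanism 0.75 will actually use:

// Three ways to place chunk number 'chunk' on one of 'nodes' instances.
#include <cstddef>
#include <functional>

std::size_t hash_placement(std::size_t chunk, std::size_t nodes) {
    return std::hash<std::size_t>{}(chunk) % nodes;           // hash distribution
}

std::size_t range_placement(std::size_t chunk, std::size_t total_chunks,
                            std::size_t nodes) {
    std::size_t per_node = (total_chunks + nodes - 1) / nodes;
    return chunk / per_node;                                   // contiguous ranges
}

std::size_t block_cyclic_placement(std::size_t chunk, std::size_t block,
                                   std::size_t nodes) {
    return (chunk / block) % nodes;        // deal blocks of chunks out round-robin
}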

  4. Chunk Size and Array Size: We will be relaxing a lot of these restrictions in 0.75. In the design of 0.5, we were slightly hampered by a lack of evidence about a couple of key questions. Are variable chunk sizes a good idea? How big should chunks be (compression vs. memory-usage trade-offs)? We have answers to some of these questions now (thanks to some of the CS academics we work with) and we’ll be changing this design in 0.75.

  5. Future Functions: It’s important to understand that we envision SciDB to be an extensible system, much as Postgres is (http://developer.postgresql.org/pgdocs/postgres/xtypes.html). One of the things we’ve become aware of in talking to our early adopters is the range of mathematical tools they want to employ: robust analytic methods, types/domains like “random variable”, etc. Figuring that we can’t possibly do all of these ourselves, the idea is to make it possible for users with basic C/C++ skills to implement precisely what they want. Thus you will be able, in 0.75, to run queries like this:

SELECT MRL_Approx_Median ( A.value ) FROM Array A WHERE ... GROUP BY A.Index;

where the implementation of a user-defined aggregate like MRL_Approx_Median() might be cribbed from a source like this: http://portal.acm.org/citation.cfm?id=276342
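For a rough idea of the shape such an extension might take, here is a C++ sketch of a user-defined aggregate with the usual accumulate/merge/finalize steps; the interface is an assumption (the real extension API may look different), and for brevity it keeps a bounded random sample rather than implementing the full MRL buffer scheme:

// Sketch of a user-defined approximate-median aggregate.
#include <algorithm>
#include <cstdlib>
#include <vector>

struct ApproxMedian {
    static constexpr std::size_t kMaxSample = 1024;
    std::vector<double> sample;
    std::size_t seen = 0;

    void accumulate(double v) {                    // called once per cell
        ++seen;
        if (sample.size() < kMaxSample)
            sample.push_back(v);
        else if (std::size_t(std::rand()) % seen < kMaxSample)
            sample[std::rand() % kMaxSample] = v;  // reservoir-style replacement
    }
    void merge(const ApproxMedian& other) {        // combine per-node partials
        for (double v : other.sample) accumulate(v);
    }
    double finalize() {
        if (sample.empty()) return 0.0;
        std::nth_element(sample.begin(),
                         sample.begin() + sample.size() / 2, sample.end());
        return sample[sample.size() / 2];          // median of the sample
    }
};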

One of our most ambitious goals is to provide support for your flavor of “array reducing functions”. For example, an algorithm that takes as input a “chunk”/“array” of some digital image, and returns an array containing the centroids or extents of the “objects” it finds in that image; something like feature detection (http://en.wikipedia.org/wiki/Feature_detection_(computer_vision)), for example. The way we’re currently thinking about these kinds of operations is as extensions of the array algebra you can see today. For example:

SELECT COUNT(*) FROM Extract_Features ( Array ) F WHERE size ( F.bounding_box ) > 100;

  6. “What is the rationale behind the support of origins?” Mostly convenience for the chunk manipulators and the external loader. For example, if we don’t know the “minimum” value for an array, it’s impossible to answer some data-distribution questions. So the “origin” here is pretty much a physical concern. But we’re also conscious of trying to prevent errors; that is, to bring to the management of scientific data some of the value-based integrity that SQL engines support.

Let me sketch the current design for 0.75.

Array dimensions will each have: minStart, currStart, currEnd, maxEnd. When you load data into the array, you specify the absolute location of the data in a Cartesian coordinate space; that is, you include with the load the “local origin” of each new “block” of data. Subsequent operations that manipulate data can reposition the array within the n-dimensional space as they like. All we will do is check that the values aren’t “out of bounds” with respect to the minStart and maxEnd limits.
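As a minimal C++ sketch (using the field names above, not actual SciDB code), the check on a newly loaded block might look like this:

// Verify a loaded block stays inside [minStart, maxEnd] on every dimension,
// then grow the current extent to cover it.
#include <algorithm>
#include <stdexcept>
#include <vector>

struct Dimension {
    long minStart, currStart, currEnd, maxEnd;
};

void check_and_extend(std::vector<Dimension>& dims,
                      const std::vector<long>& origin,   // local origin of the block
                      const std::vector<long>& extent) { // block size per dimension
    for (std::size_t d = 0; d < dims.size(); ++d) {
        long lo = origin[d];
        long hi = origin[d] + extent[d] - 1;
        if (lo < dims[d].minStart || hi > dims[d].maxEnd)
            throw std::out_of_range("block outside [minStart, maxEnd]");
        dims[d].currStart = std::min(dims[d].currStart, lo);
        dims[d].currEnd   = std::max(dims[d].currEnd, hi);
    }
}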

  7 & 10. Sorts: We confess. Sorts are broken and need to be rewritten. This is because a) we need to parallelize the sort, and b) we need to improve our single-node sort.

  8. Strides: We’re looking at this for 0.75 or 1.0. In general, the problems of “data reduction” for display and query processing haven’t been a priority, but we’ve heard similar requirements from other people. One problem: there are a lot of possibilities out there, for example smoothing, fractal approaches, random sampling, and non-random sampling (stride elimination). Some folks would like their results to appear (for visualization) like a JPEG being opened. Others want the query to run fast and yield an approximate answer. Others want to allow the query to run until the error bars on the result contract enough.
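As one data point on the “average of every K pixels” variant raised in question 8, the reduction itself is simple; here is a standalone C++ sketch (not tied to any SciDB operator) that shrinks an image by a factor K per axis:

// Average each KxK tile of an nrows x ncols image stored row-major.
#include <cstddef>
#include <vector>

std::vector<double> downsample_mean(const std::vector<double>& img,
                                    std::size_t nrows, std::size_t ncols,
                                    std::size_t K) {
    std::size_t out_rows = nrows / K, out_cols = ncols / K;
    std::vector<double> out(out_rows * out_cols, 0.0);
    for (std::size_t r = 0; r < out_rows * K; ++r)
        for (std::size_t c = 0; c < out_cols * K; ++c)
            out[(r / K) * out_cols + (c / K)] += img[r * ncols + c];
    for (double& v : out) v /= double(K * K);               // sum -> mean
    return out;
}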

  9. Absolutely. The AQL will look, feel and smell a lot like SQL. Only instead of a relational algebra, we’ll be supporting a more set-based algebra.

Hope this helps!


#4

You definitely meant that AQL will feel like an array query language. Sets don’t do us any good in our use cases.


#5

#include <blush.h>

We diverge on idiom. But Pavel’s absolutely correct. It’s arrays all the way down. And that’s what AQL will operate on.


#6

I have been implementing a parallel computable database for my own project and just stumbled onto yours.
I like very much what you are attempting, and hope to influence you towards meeting the special needs of my project.
I haven’t spent enough time reviewing your documentation or code to know whether my comments are redundant.
I apologize for the enthusiastic “jumping in” that I do in this posting.


My project:
Solve perceptual paradoxes using model neurons and small (at first) neuron aggregates.
I use only a few general operations to model a neuron:

  1. Transduce difference signals (usually an operator like tanh) in both time and space
  2. Pulse when sufficient difference is detected (variable threshold)
  3. Collect pulses along converging high speed delay lines (linear sequence of associated voxels)
  4. Detect coincidences where delay lines converge (variable threshold)
  5. Project detected coincidences over diverging high speed delay lines to transduction-points
  6. Move transduction-points to neighboring voxels under certain conditions
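To show what I mean by operations 3 and 4, here is a toy C++ sketch of coincidence detection at a convergence point; the representation (per-line delays and emission times) is just an illustration of my own model, not a proposal for SciDB itself:

// A coincidence fires when enough pulses arrive at the convergence point
// in the same time step, each delayed by its own delay line.
#include <cstddef>
#include <vector>

struct DelayLine { long delay; };   // steps from source to convergence point

bool coincidence_at(const std::vector<DelayLine>& lines,
                    const std::vector<std::vector<long>>& pulse_times, // emissions per line
                    long t, std::size_t threshold) {         // variable threshold
    std::size_t arrivals = 0;
    for (std::size_t i = 0; i < lines.size(); ++i)
        for (long emitted : pulse_times[i])
            if (emitted + lines[i].delay == t)
                ++arrivals;                                   // pulse lands exactly at t
    return arrivals >= threshold;
}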

As you can see, my modeling needs are not particularly demanding.
However, when one attempts to produce aggregates of neurons,
the interdigitation of the collection and projection trees causes me
to consider very carefully the locality of data, avoidance of race conditions,
and the co-occupation of delay lines within a given voxel.
I have solutions for these, but they are too much to describe here.

To implement my model, I chose to use a discrete Cartesian grid of voxels, where
each voxel can be fragmented into a voxel-filling discrete Cartesian grid of subvoxels,
and this division can be performed recursively ad infinitum.
In addition, given a grid, I can build a grid of supervoxels around it recursively.
I chose to use a grid of 3x3x3 subvoxels with a center subvoxel having local coordinates (0,0,0).
Each voxel contains a property list which could be construed as implementing higher dimensions.
Each voxel is considered an autonomous computation element
taking its information exclusively from its property lists and immediate neighbors.
Computation is a two-step process of

  1. collecting properties from neighboring voxel states
  2. modifying local properties from the collected neighbor properties
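In code, the cycle amounts to double buffering: read only the current generation while writing the next one, so no voxel ever sees a half-updated neighbor. A one-dimensional C++ sketch (the update rule is just a placeholder):

// Step 1 collects from the neighbors' current state; step 2 commits all
// new local properties at once by swapping in the next generation.
#include <cstddef>
#include <vector>

struct Voxel { double potential = 0.0; };

void step(std::vector<Voxel>& grid) {
    std::vector<Voxel> next = grid;                  // writes go to the copy
    for (std::size_t i = 1; i + 1 < grid.size(); ++i) {
        double collected = grid[i - 1].potential + grid[i + 1].potential;
        next[i].potential = 0.5 * grid[i].potential + 0.25 * collected;
    }
    grid.swap(next);                                 // commit the new generation
}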

I chose these criteria due to opportunities for massive simplifications of
standard classical physics calculations like Maxwell’s equations and Hertz’s speed-of-light calculations.
Also, these criteria are intuitively familiar to a broader group of users
since they operate on the familiar 3-space,
yet enable support for single-voxel internal array properties
to implement dimensions deemed valuable for a project.

In addition, I chose this design because it has an intrinsic structure that is easy to distribute
over a cluster of machines, with redundancy sufficient to tolerate the failure of nodes.

My model is largely insensitive to the failure of a given voxel
just as most neural aggregates are insensitive to failure of a given neuron.

But, as I said, this is how I am approaching my own project.

I have some implementation ideas to contribute to your efforts (possibly redundant):

  1. Symmetric odd-length arrays with index 0 in the center eliminate persistent expensive opposing offsets between coordinates and indices.
  2. Symmetric even-length arrays straddling 0 in the center make discrete Maxwell and Hertz style calculations simpler.
  3. Sparse arrays implemented as (not necessarily binary) trees enable local compact representations and computational space reduction.
  4. Avoid floating point altogether by enabling fragmentation within limits (maximum fragmentation depth).
  5. Implement optimized coordinate-index-coordinate transforms (I have good implementations for my own project; see the sketch after this list).
  6. Implement equivalents to C++ mask_array, index_array, etc… to enable nested blittable subsetting.
  7. Implement a scientific data visualizer (like ITK/VTK) (I have developed one using OpenGL for my own project).
  8. Implement a “sculpture” tool and language to enable hand and automated construction of objects and subsets.

Avoiding floating point makes nested blittable subsetting possible, so this is not a lightly given recommendation.
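Here is a small C++ sketch of ideas 1 and 5: an odd-length axis of length 2*H+1 with coordinate 0 at the center, and the coordinate/index transforms written in exactly one place so the offset never leaks into the rest of the code. The class name and layout are mine, purely for illustration:

// A symmetric axis covering coordinates [-H, +H]; index 0 of the underlying
// storage corresponds to coordinate -H, and coordinate 0 sits at the center.
#include <cstddef>

struct CenteredAxis {
    std::size_t half;   // H

    std::size_t to_index(long coord) const {   // coordinate -> storage index
        return static_cast<std::size_t>(coord + static_cast<long>(half));
    }
    long to_coord(std::size_t index) const {   // storage index -> coordinate
        return static_cast<long>(index) - static_cast<long>(half);
    }
    std::size_t length() const { return 2 * half + 1; }
};

// A 3x3x3 subvoxel grid uses CenteredAxis{1} on each axis: coordinates
// -1, 0, +1 map to storage indices 0, 1, 2.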

One last recommendation.
Do not implement just one interface language for your database.
Make, instead, a set of languages that achieve different efficiencies
and make it possible for people to write their own languages
to implement their own solution spaces in the “best” language for the purpose.
After all, underneath LISP, FORTRAN, C++, Python, MATLAB and all that
is the very same processor.
Each language constrains the user in some ways while augmenting in other ways.
Ideally, there ought to be some very primitive “machine language” for your database
on top of which is implemented the developer language on top of which is implemented the application.

I could add value to this project if the SciDB team is interested.
I am an experienced cross-platform C++ and Intel assembly developer with deep thoughts about parallel algorithms.

I welcome both comments and invitations to further explore cross-fertilization of ideas.
Again, I apologize that I just jumped in without seeing how far your team has already gone.