Optional string names for integer dimensions


#1

I am an R user and have been playing with the R package scidb for a month or so. It is great to have a database back end to manage data for analysis in R or any other software.

In scidb, dimension indices are either integers (like i = 1…n) or strings (non-integer dimensions). An R matrix or data.frame, on the other hand, can be indexed by integer and, optionally, by string. Since scidb can already map a string index to an integer (although no longer in order), the mechanism for integer and string indices to coexist already exists. In fact, there is no need for non-integer dimensions at all (they are all integers anyway); instead, users could assign a string name to each index. Another advantage is that selecting a range would then also work for non-integer dimensions (if their order is preserved).

My plan is to use scidb as a data warehouse for various formats of high-throughput data, such as genetics, mass spectrometry, transcriptomics, etc. These are usually two-dimensional arrays with hundreds to millions of variables, ordered (by mass, retention time, etc.). I want to be able to select a range of variables and also to see their names easily.

Another question: is it possible to share the 1000 Genomes Project scidb use case? I would also like to have a web interface to some of the data in scidb.

Many thanks.


#2

All I can say, at this point, is “watch this space”.

Implementing labelled dimensions is proving to be harder than we anticipated, partly because implementing updatable labels is hard, but also because it’s not clear what the semantics of these should be in the general case.

For example … regrid ( input, X, Y, aggregate( attribute ) ) … is pretty clean. But what happens when X and Y aren’t integers, but are strings? What does regrid(…) (or window(…), or even between(…) / subarray (…)) “mean” in that case?
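To make the difficulty concrete, here is a minimal pure-Python sketch of what a regrid-style block aggregation does on integer coordinates. It is not SciDB code: the function name `regrid_sum`, the zero-based coordinates, and the sample data are assumptions made up for illustration. The point to notice is that the target cell is found by integer division (`i // x`), arithmetic that has no obvious counterpart when the coordinates are strings.

```python
def regrid_sum(grid, x, y):
    """Sum non-overlapping x-by-y blocks of a 2D list (illustrative only)."""
    rows, cols = len(grid), len(grid[0])
    out_rows = (rows + x - 1) // x  # ceiling division: partial blocks still count
    out_cols = (cols + y - 1) // y
    out = [[0] * out_cols for _ in range(out_rows)]
    for i in range(rows):
        for j in range(cols):
            # An integer coordinate maps to its block by division.
            out[i // x][j // y] += grid[i][j]
    return out


grid = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
blocks = regrid_sum(grid, 2, 2)
print(blocks)
```

With string coordinates there is no natural "divide by 2" to decide which label lands in which block, which is roughly the semantic question raised above.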

We’re working on a couple of ideas for “how”. Until then, the best way to implement labelled dimensions is to use 1D mapping arrays, populate 'em with redimension_store(…), and use lots of cross_join(…) queries to turn the integers along the edges of the array back into labels.
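As a rough picture of that workaround, the following pure-Python sketch emulates the idea with dictionaries: one data array keyed by integer coordinates, one 1D mapping per dimension, and a join that turns the integers back into labels. It is not AFL and does not call SciDB; the names `data`, `sample_labels`, `variable_labels`, and `label_cells`, and all the values, are hypothetical.

```python
# Data array keyed by integer coordinates (sample, variable) -> value.
data = {
    (0, 0): 0.12, (0, 1): 3.40,
    (1, 0): 0.15, (1, 1): 2.90,
}

# One 1D mapping "array" per dimension: integer index -> label.
sample_labels = {0: "patient_A", 1: "patient_B"}
variable_labels = {0: "mz_100.05", 1: "mz_250.10"}


def label_cells(cells, row_map, col_map):
    """Join each cell with both mapping arrays on the matching coordinate."""
    return [
        (row_map[i], col_map[j], value)
        for (i, j), value in sorted(cells.items())
    ]


labelled = label_cells(data, sample_labels, variable_labels)
for row in labelled:
    print(row)
```

The data array itself stays purely integer-indexed, so range selection keeps working on it; the labels are only attached at the end, one join per dimension.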

Bear with us. Lots to do. Addressing labelled dimensions is a high priority.


#3

Thanks very much for your reply.
So far, all the array data I have worked with can be modeled as integer-dimension arrays, optionally with a label attached to each integer index. Since selecting a range along a dimension is often necessary, I would model all my data in scidb as integer-dimension arrays. Since they are essentially still integer arrays, all the functions would remain the same. I would imagine only a few simple operations are needed, such as getNames() and setNames(). Sorry for my naive, R-style thinking. I have no idea how this would affect the implementation in scidb.
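Roughly, what I have in mind looks like the Python sketch below. Everything in it is hypothetical (the class `LabeledDimension` and its methods are not part of scidb or of the R package); it only illustrates an integer dimension that carries optional names and still supports range selection.

```python
class LabeledDimension:
    """Integer dimension 0..size-1 with optional string names (hypothetical)."""

    def __init__(self, size):
        self.size = size
        self._names = None   # no names until set_names() is called
        self._index = {}     # name -> integer index

    def set_names(self, names):
        names = list(names)
        if len(names) != self.size:
            raise ValueError("need exactly one name per index")
        if len(set(names)) != len(names):
            raise ValueError("names must be unique")
        self._names = names
        self._index = {name: i for i, name in enumerate(names)}

    def get_names(self):
        return None if self._names is None else list(self._names)

    def index_of(self, name):
        return self._index[name]

    def between(self, first, last):
        """Inclusive integer range between two named positions."""
        return range(self.index_of(first), self.index_of(last) + 1)


variables = LabeledDimension(4)
variables.set_names(["mz_100", "mz_150", "mz_200", "mz_250"])
selected = list(variables.between("mz_150", "mz_250"))
print(selected)
```

The names are only a lookup layer; the range itself is still a plain integer range, so the existing integer-based operators would not need to change.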
Using 1D mapping arrays and cross_join is exactly what I was planning. But I am quite bothered by having to name a separate mapping array for every dimension of the real array.