String dimensions?


#1

Hi Folks,

Just wondering: can I use a string as a dimension?

I’m not sure who invented the gene naming scheme. The names look like numbers, but they are too sparse (ranging from 10K to 200 billion) and very unevenly distributed. So chunking becomes really hard: it either runs out of memory because of 1-2 huge chunks, or runs very slowly because there are too many chunks.

So if SciDB internally built a map of string->int64 and we could use a string as a dimension, that would be great.

And a follow-up question: if SciDB allows a string dimension, does it support joins on string dimensions?

Thanks a lot!


#2

There is limited support for string (and other non-int64) dimensions. See section 4.3.3 of the 13.3 manual.

For arrays with arbitrary string dimensions, you can join small data sets by filtering a cross product.
For two arrays that have a matching string dimension (all the same values), you can perform large joins and cross_joins by first cast()-ing the dimensions into their integer ordinals.

Another option is to create a mapping first (via a reduced redimension_store command or outside the system) and then convert the mapping back to dense integers and use those.
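To make the “outside the system” variant concrete, here is a minimal Python sketch of building a dense mapping before loading. The gene identifiers and the helper name build_dense_index are made up for illustration, and assigning ordinals in sorted order is just one possible choice; nothing here is SciDB code.

```python
def build_dense_index(labels):
    """Map each distinct label to a dense integer 0..n-1 (sorted order)."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


# Hypothetical sparse gene identifiers, stored as strings (made-up values).
gene_ids = ["10023", "200000000017", "10023", "58211", "7000000001"]

# Build the mapping once, then replace every identifier by its dense ordinal.
index = build_dense_index(gene_ids)
dense = [index[g] for g in gene_ids]

print(index)
print(dense)
```

The dense ordinals can then serve as an ordinary int64 dimension with evenly filled chunks, and the mapping itself can be kept as a small lookup array or file to translate back to the original names.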

Hope it helps.


#3

Thanks Alex,

I understand the “outside the system” option. But could you elaborate on the “reduced redimension_store” option?

Thanks
-Yushu


#4

Another approach is to use a string dimension to store the array and ask SciDB for the string-to-integer mapping. Then, if you later need to get rid of the string dimension in order to use an operator that is not compatible with it (cross_join and merge, for example), you can use cast to do that, which is basically free since it only works on array metadata.

Here is an example:

  1. Create an array to store some data. This example has one string dimension (x) and one integer dimension (j).
  2. Populate the array with data somehow:
  3. The mapping from the string dimension to an integer dimension is available with the special “:” syntax:

which in this example returns something like:

{no} value
{0} 'x1'
{1} 'x10'
{2} 'x2'
{3} 'x3'
{4} 'x4'
{5} 'x5'
{6} 'x6'
{7} 'x7'
{8} 'x8'
{9} 'x9'

  4. Some operators don’t work with non-integer dimensions, for example cross_join and merge. But the ‘cast’ operator is a quick and computationally cheap way to replace a non-integer dimension with its integer mapping:
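The original AFL snippets are not reproduced here. As a conceptual stand-in only, the Python sketch below shows what replacing a string dimension by its ordinals amounts to. It assumes the ordinals follow lexicographic order, which is consistent with the sample mapping above ('x10' sorting before 'x2') but is not confirmed as SciDB's rule; the cell values are invented.

```python
# Labels matching the example output: x1 .. x10.
labels = ["x%d" % i for i in range(1, 11)]

# Assumption: ordinals are positions in lexicographically sorted order.
ordered = sorted(labels)
ordinal = {label: i for i, label in enumerate(ordered)}

# Hypothetical cells keyed by (string coordinate x, integer coordinate j).
cells = {("x10", 0): 1.5, ("x2", 0): 2.5, ("x1", 1): 3.5}

# Conceptual equivalent of the cast: same cells, integer coordinates.
relabeled = {(ordinal[x], j): v for (x, j), v in cells.items()}

print(ordered)
print(relabeled)
```

Only the coordinate labels change and the cell values are untouched, which matches the point that the cast is cheap.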

Best,

Bryan


#5

Thanks Bryan and Alex,

One quick follow-up question on how string dimensions are implemented during redimension_store:

I know SciDB is building a map internally. Just how is the map stored/distributed in memory during the redimension process?

I was redimensioning a 1D array with 340M different strings in the dimension, and it ran out of memory and crashed.
I wonder what I can do to make this run.

Thanks

-Yushu


#6

It would be great to also be able to index a string dimension by number, so that a SciDB array could behave more like an R data.frame.

I noticed that for an array with the following schema:
A <value:string NULL DEFAULT null> [ID(string)=*,1000,0,var(string)=*,1000,0]
in R:

tmp <- scidb("A")
dim(tmp)
[1] 4.611686e+18 4.611686e+18
while the actual dimensions are about 500 x 1200.
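
For what it is worth, the reported value equals 2^62 to the printed precision. That would be consistent with dim() returning the largest length an unbounded dimension can have rather than the populated extent, but that reading is an assumption and is not confirmed in this thread. A quick arithmetic check in Python:

```python
# Value printed by dim(tmp) in the post above.
reported = "4.611686e+18"

# 2**62 formatted with the same number of significant digits.
formatted = format(float(2**62), ".6e")

print(2**62)
print(formatted, formatted == reported)
```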


#7

Hi folks,

I have some prototype code on the way that may help with this. I’ll let you know when I have something you can look at or use.


#8

Please see viewtopic.php?f=18&t=1172