Heterogeneous array


#1

Is it possible to have arrays of different data types in SciDB, like arrays of classes? If yes, please give me directions on how it could be done.


#2

Puzzled about precisely what you have in mind …

  1. There’s no reason you can’t stuff a string into a SciDB attribute, and within the string embed something like a JSON or XML data object. Of course, natively SciDB doesn’t have facilities like XPath expressions or inverted indices … you would need to add those.

  2. There’s no practical limit on the number of attributes you can have in a SciDB array. And the engine supports the ability to add new types of your own invention. In recent times, we’ve done things like add data types that model the active sites on chemical compounds, implement probability distribution functions and even model complex financial objects.

  3. We don’t (yet) support structured types along the lines of recent SQL. Is this what you have in mind?

Perhaps it would be helpful if you were to drop a reply with some kind of “made up” syntax to show what you have in mind?


#3

Thank you for the reply. I am looking at adding my own data type. Could you please let me know how to do it?
If I have, say, three attributes and set null for the ones that are not present in a cell, there are a lot of nulls. So I wanted to have something like an array of unions: say I have a float, an integer and a string, and each cell could hold one of these without the others having to be set to null.


#4

So …

First … the only way you can do this currently is by implementing your own user-defined type. It would need some kind of type id, functions to pull out type-specific data, etc. This is about a week’s worth of ‘C/C++’ so long as you have a limited set of types in mind.

Second … I strongly urge you not to do this. It would complicate your queries, slow them down, and increase your storage footprint. Instead, go with an attribute per type. Let me go into a little more detail:

  1. SciDB’s query language is strongly typed. We use the strong typing to check query syntax (can we find all of the expressions you’ve asked for based on their name, argument types, etc), and semantics (can we get this type from the named attribute and put its result into that attribute). What you’re proposing is a late-typed model, where you don’t know the type of a data object until you interrogate it. If you choose to go with the monster you’re proposing, then every time you touch your new type you’re going to have to wrap it in a layer of iif(…) statements to make explicit what type you expect to have. This is going to get very old, very fast.

  2. All of those iif(…) statements will enormously impact your run-time. And … if you only want a particular type (say, only the strings?) then you’re going to pay the overhead of pulling everything else out of the storage when you write a query. What SciDB does is first to strip the array’s attributes into columns, and to store each column’s data separately. Doing this is a big win, especially when there are multiple columns.

  3. When we pull the data apart into columns, we can compress the hell out of the per-attribute data sets. And because the types are all homogeneous (same type) we can apply run-length encoding to reduce the storage size. We even run-length encode the missing values (nulls). If you have a long list of null values, we will compress them into a single little token in the storage layer.
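To make the run-length point concrete, here is a minimal C++ sketch of the idea. It is not SciDB's storage code: the `Run` struct and `runLengthEncode` function are made-up names for illustration, and `std::nullopt` stands in for a null cell in a single-type column.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// One run: a value (or null) repeated `count` times.
// Hypothetical layout, for illustration only.
struct Run {
    std::optional<std::int64_t> value;  // std::nullopt stands for a null cell
    std::size_t count;
};

// Collapse consecutive equal cells (including consecutive nulls) into runs.
std::vector<Run> runLengthEncode(
    const std::vector<std::optional<std::int64_t>>& column) {
    std::vector<Run> runs;
    for (const auto& cell : column) {
        if (!runs.empty() && runs.back().value == cell) {
            ++runs.back().count;           // extend the current run
        } else {
            runs.push_back(Run{cell, 1});  // start a new run
        }
    }
    return runs;
}
```

Feeding this a column of 1000 nulls followed by two cells holding 7 yields just two runs: one null run of length 1000 and one run of 7 with length 2. That only works because every cell in the column has the same type, so equal neighbours can be compared and merged directly.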

Now … what you’re going to have is a type that embeds the typeid into the type’s data. This makes it very hard (impossible?) to really take advantage of encodings at the storage layer. AND you’re adding the per-data object typeid overhead. This might not seem like much (say it’s only one byte) but you have to keep in mind how data is aligned in memory. If you add a one byte typeid to an 8 byte data value (say, an INT64 or DOUBLE) then the compiler will pad that out to 2 words. So your 8 byte INT64 or DOUBLE has now suddenly become a 16 byte object! Worse … it’s an object we really can’t do much to encode / compress.
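The padding effect is easy to check for yourself. The sketch below is plain C++, not SciDB's user-defined type API; `TaggedValue` and its type id codes are invented for illustration. The exact numbers depend on your compiler and platform ABI, but on a typical 64-bit platform the tagged struct comes out at 16 bytes, with 7 bytes of padding after the one-byte id.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical "union" cell: a one-byte type id followed by the payload.
struct TaggedValue {
    std::uint8_t typeId;  // made-up codes, e.g. 0 = INT64, 1 = DOUBLE
    union {
        std::int64_t asInt64;
        double asDouble;
    } payload;            // must start on an aligned boundary
};

// Size of a plain, untagged value versus the tagged one.
constexpr std::size_t plainSize = sizeof(std::int64_t);
constexpr std::size_t taggedSize = sizeof(TaggedValue);

// Bytes that hold neither the type id nor the payload.
constexpr std::size_t paddingBytes =
    taggedSize - sizeof(std::uint8_t) - plainSize;
```

Printing `plainSize`, `taggedSize` and `paddingBytes` shows how much each cell grows before any compression is even attempted.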

In summary, I really think you should re-consider your data model strategy. Just go with the three attributes! We will compact the living carp out of them–because we will know the type, and how to handle the missing codes–and make your life writing queries much easier.


#5

Thanks a lot for the reply.
I have decided to go with an attribute per type and set null for the values that are not present.

Thanks again