How to structure my data? Problem #1



Hi Experts,

I’m working with some science projects to port their data to SciDB. Since we are very new to SciDB and its array data structure, we would like some input from the experts on how best to structure our data. I’ll describe the data we have and the operations that need to be optimized. Any help or comments on how to organize the data would be greatly appreciated.

Here is case #1:

This use case is about the relationship between two sets of objects. (Is this the same as a many-to-many relationship between two tables in a relational database?)

Set A: on the order of a billion objects, each with an ID and a list of properties
Set B: on the order of a million objects, each with an ID and a list of properties

“Matches” (the relation between A and B): each object in A “matches” zero or more objects in B, and each object in B “matches” zero or more objects in A. In total there should be fewer than 100 billion matches.

Each “match” has its own list of properties, e.g. how well the two objects matched.
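
To make the layout concrete, here is a toy sketch in plain Python (not SciDB code) of how I picture the three data sets. The property names (`mag`, `flux`, `score`), the IDs, and the helper functions are all made up for illustration; the real data has many more properties. The matches are stored sparsely, keyed by the (A ID, B ID) pair, so only pairs that actually match take up space.

```python
# Toy in-memory model of the data described above (plain Python, not SciDB).
# All IDs, property names, and values are hypothetical.

# Set A and Set B: object ID -> properties of that object.
set_a = {1: {"mag": 18.2}, 2: {"mag": 19.5}, 3: {"mag": 17.1}}
set_b = {10: {"flux": 0.8}, 11: {"flux": 1.3}}

# Matches: sparse relation keyed by (a_id, b_id) -> properties of the match.
# Pairs that do not match are simply absent.
matches = {
    (1, 10): {"score": 0.95},
    (1, 11): {"score": 0.40},
    (3, 10): {"score": 0.70},
}


def matches_for_a(a_id):
    """Return {b_id: match properties} for one object in A."""
    return {b: props for (a, b), props in matches.items() if a == a_id}


def matches_for_b(b_id):
    """Return {a_id: match properties} for one object in B."""
    return {a: props for (a, b), props in matches.items() if b == b_id}
```

In this picture an object with no matches (ID 2 in A) just has no entries in `matches`, which is the "0 or more" case.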

Operations: we would like to ask questions like:

  • What percentage of A has a “match”?
  • What percentage of B has a “match”?
  • For a certain subset of B, compute statistics on the properties of the matching objects in A.
  • For a certain subset of A, compute statistics on the properties of the matching objects in B.
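
In case it helps to pin down what these questions mean, here is a small Python sketch of the four operations on toy data. This is only my illustration of the intended semantics, not SciDB code: the property names (`size`, `weight`, `score`), the function names, and the choice of the mean as the "statistic" are assumptions. Each matching object is counted once, even if it matches several members of the subset.

```python
from statistics import mean

# Hypothetical toy data: object ID -> properties, (a_id, b_id) -> match properties.
set_a = {1: {"size": 2.0}, 2: {"size": 4.0}, 3: {"size": 6.0}, 4: {"size": 8.0}}
set_b = {10: {"weight": 1.0}, 11: {"weight": 3.0}}
matches = {(1, 10): {"score": 0.9}, (3, 10): {"score": 0.5}, (3, 11): {"score": 0.7}}


def percent_matched(objects, side):
    """Percentage of `objects` with at least one match.

    side=0 means the objects are from A, side=1 means they are from B.
    """
    matched_ids = {pair[side] for pair in matches}
    return 100.0 * len(matched_ids & set(objects)) / len(objects)


def mean_property_of_matching_a(b_subset, prop):
    """Mean of `prop` over the distinct A objects matching any B in `b_subset`."""
    a_ids = {a for (a, b) in matches if b in b_subset}
    return mean(set_a[a][prop] for a in a_ids)


def mean_property_of_matching_b(a_subset, prop):
    """Mean of `prop` over the distinct B objects matching any A in `a_subset`."""
    b_ids = {b for (a, b) in matches if a in a_subset}
    return mean(set_b[b][prop] for b in b_ids)


print(percent_matched(set_a, 0))                    # share of A with a match
print(percent_matched(set_b, 1))                    # share of B with a match
print(mean_property_of_matching_a({10}, "size"))    # stats on As matching a B subset
print(mean_property_of_matching_b({3}, "weight"))   # stats on Bs matching an A subset
```

Note that `mean` raises `StatisticsError` when the subset has no matches at all, so a real version would need to handle that case.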

Thanks a lot! I will post problems #2, #3, and #4 soon.

-Yushu