Wanted to know something about the storage of SciDB


#1

From scidb_overview, I have learned something about the storage, but I want to understand it in more depth. I also want to know the query strategy, such as which indexes SciDB uses (B-trees, etc.).
Could anyone tell me? Papers about this are also OK.
My email is asdf08010801@sina.com
:smile:

Thanks.

#2

Start with this:

escience.washington.edu/sites/de … -brown.pdf

The details of how we achieve some of these features are quite involved. We don’t (yet) document or write them up for external consumption because some of these details are subject to change, and we don’t want to mislead anyone with a premature description.

Start with the paper. Read the code. And ask any (specific) questions you might have.


#3

I have read this paper before, but there are still some things I don’t know:
1. Does SciDB use an indexing technique, and if so, which one?
2. When we execute a query statement, how does SciDB retrieve exactly what you want, and quickly? Can you explain it in detail?
3. In the paper you mention the extensibility of SciDB, but I cannot find evidence that proves it. Or perhaps the paper doesn’t state it clearly?
4. The last thing I want to know is whether SciDB uses any distributed index strategy, and if so, which one and what the extra cost is.

Sorry to trouble you.

Thanks ! :smile:


#4

Let’s sketch some answers. Several of your questions can be combined.

  1. How do we ‘index’ data in SciDB? How do you retrieve data quickly?

First, SciDB doesn’t use any “traditional” indexing. No B-trees, R-trees, GiST, etc. We might put them in eventually, but because we’re geared for decision support queries and analytic algorithms that grind through bulk data, it isn’t clear that attribute indices will help us much.

Now, in lieu of traditional indexing, what we do is pretty much exactly the same thing that folk who implement archives of big images or perform linear algebra at scale do. If you have an array A, we divide the cells of A up using a technique known as ‘block partitioning’. Basically, we chop the array up into many smaller (2-16 MB) blocks (we call them ‘chunks’) using a simple-minded algorithm, and then we distribute each ‘chunk’ over the instances that make up the SciDB installation using a hashing method.
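The block-partitioning idea above can be sketched in a few lines of Python. This is an illustrative toy model only: the function names, chunk sizes, and hash scheme are my own assumptions, not SciDB’s actual code.

```python
# Toy sketch of block partitioning: every cell belongs to exactly one
# chunk, identified by the chunk's origin coordinates, and each chunk
# is assigned to an instance by hashing its origin.

def chunk_origin(cell, chunk_sizes):
    """Map a cell's coordinates to the origin of the chunk holding it."""
    return tuple((c // s) * s for c, s in zip(cell, chunk_sizes))

def chunk_to_instance(origin, num_instances):
    """Assign a chunk to an instance by hashing its origin (illustrative)."""
    return hash(origin) % num_instances

# A cell at (1500, 250) in an array chunked 1000 x 1000 lives in the
# chunk whose origin is (1000, 0).
print(chunk_origin((1500, 250), (1000, 1000)))  # (1000, 0)
```

Because the cell-to-chunk mapping is pure arithmetic, no lookup structure is needed to find where a cell lives.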

This is how we achieve fast query results when what your query wants to do is to pull out a ‘region’ of the array: say a box from within it, or a ‘slice’ of it. From the predicate size and shape, we can calculate the list of ‘chunks’ that contain the data the query wants, and we can figure out which instances contain those chunks. For example, suppose I had an array A that was 3,000 x 3,000 in size. Let’s say we have four instances: I1, I2, I3 and I4. Assuming array A is dense, we would probably divide it up into 9 ‘chunks’, each of which was 1,000 x 1,000. Then we would map C1 (which contains A[0,0 -> 1000,1000]) to I1, C2 (which contains A[0,1000 -> 1000,2000]) to I2, and so on.

Now suppose your query is ‘aggregate ( between ( A, 900, 900, 1100, 1100 ), count (*) )’. In other words, you want to count the cells in the region A[900,900 -> 1100,1100]. We know, by looking at our block partitioning algorithm, that out of the 9 chunks that store data for A, this region involves C1, C2, C4, and C5. And we can tell, from our chunk -> instance mapping, that these are to be found in I1, I2, and I4.

Does that help? We don’t use any kind of indexing, distributed or not. Rather … we exploit the mathematical properties of arrays. All of our “indexing” is algorithmic.
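The worked example above can be reproduced with a short sketch: given a query box, arithmetic alone tells us which chunk origins overlap it. Again, this is a toy model under my own assumptions, not SciDB’s implementation.

```python
from itertools import product

def chunks_for_region(low, high, chunk_sizes):
    """Return the origins of all chunks overlapping the box [low, high],
    inclusive on both ends, for a regularly chunked array."""
    ranges = [range((lo // s) * s, hi + 1, s)
              for lo, hi, s in zip(low, high, chunk_sizes)]
    return list(product(*ranges))

# The region A[900,900 -> 1100,1100] in a 3,000 x 3,000 array with
# 1,000 x 1,000 chunks touches exactly four chunk origins, matching
# C1, C2, C4 and C5 from the example above.
print(chunks_for_region((900, 900), (1100, 1100), (1000, 1000)))
# [(0, 0), (0, 1000), (1000, 0), (1000, 1000)]
```

Each returned origin could then be pushed through the chunk-to-instance hash to find which instances must participate in the query.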

  2. How can you extend SciDB?

Have a look in the examples directory that we ship with the source. You’ll find examples of postgres-style user-defined types, functions and aggregates. And you’ll find examples of how to create your own ‘operators’. The idea is to support things like this:

[code]CREATE ARRAY My_Array
<
  attribute : My_Type
>
[ dim=0:*,1000000,0 ];

SELECT My_Aggregate ( attribute )
  FROM My_Array
 WHERE My_Function ( My_Array.attribute ) > 5;

SELECT COUNT(*)
  FROM My_Operator ( My_Array );
[/code]

Does that help?


#5

[quote=“plumber”]Does that help?[/quote]

Thanks! It really helped me a lot, especially your answer to the first question.


#6

Hello, I get a “404 Not Found” when trying to read the paper you suggested. Could you provide a working link please? Thank you.


#7

I think I’ve found the paper here.

This is from 2011; do you guys plan to update this paper soon?


#8

So … you ask! We deliver!

We have just finished putting together a (long, detailed) Architectural White Paper describing the SciDB platform. You can get it here.