I want to read the real data from the datastore


#1

Hello, everybody.

I have a question and would appreciate your answers.
I want to read the real data from datastore/*.data,
but as you know, a .data file is made up of chunks,
and each chunk has a header and data, but no chunk number (that is, no ordering).

It only gives me
{chunk_header} and {chunk_data}, without any ordering information (the chunks are mixed).

Actually, I know one thing: the instance is one way the data is organized,
but the same problem remains because of the replicas…

Whoa… at this point, let me tell you what I want to do:

SciDB data -> (without converting) -> save to HDFS (Hadoop)
and then build an array application with Hadoop

Not like SciHadoop, because that paper used netCDF, which limits the programs you can write.

Then, after that, the reverse:

HDFS data (array type) -> (without converting) -> save to SciDB


OMG, I’ve been studying SciDB with my senior for 5 months now.
I need your various comments! Please help me, dear!!


#2

What you want to do (from what I can tell) is to take the data in your SciDB instance and save it (in parallel) to an HDFS file system? Correct?

Have a look at the parallel save(…) options. You can take the data in a SciDB array, and save it out to a number of files, one per instance. Point the target of your save(…) to a directory in the HDFS you’ve mounted on each instance.
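
For example, something along these lines (just a sketch; ‘my_array’ and the mount point ‘/mnt/hdfs/scidb_export’ are placeholders, the format string depends on your schema and version, and an instance id of -1 asks every instance to save its own portion of the array locally):

    # run from the coordinator; each instance writes its own file under the mounted HDFS directory
    iquery -anq "save(my_array, '/mnt/hdfs/scidb_export/my_array.csv', -1, 'csv+')"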

Curious … what do you want to do in Hadoop that you can’t do in SciDB? Or is this for comparison purposes?


#3

[quote=“plumber”]What you want to do (from what I can tell) is to take the data in your SciDB instance and save it (in parallel) to an HDFS file system? Correct?

Have a look at the parallel save(…) options. You can take the data in a SciDB array, and save it out to a number of files, one per instance. Point the target of your save(…) to a directory in the HDFS you’ve mounted on each instance.

Curious … what do you want to do in Hadoop that you can’t do in SciDB? Or is this for comparison purposes?[/quote]

Actually, our lab uses Hadoop to access big data (such as huge SeaDAS data),
so I keep trying to “save SciDB chunk data to HDFS without any converting, and process it with array functions in Hadoop”.

Anyway, I have made some progress.
I analyzed the storage.header file,
and there are two types of records in it: “storageheader” and “chunkdescriptor”.
I could figure out what the “storageheader” is, but what are the “coordinates” in the “chunkdescriptor”?

Does anybody know about this?


#4

Nooo! Down that path … trying to reverse engineer our storage model … lies madness!

The contents of the storage.header file are a list of all the chunks on the local instance of the overall system. The idea is to be able to map from an array_name and a cell’s ‘coordinates’ (the vector of int64 numbers that define a cell’s “location” within the space defined by the product of the dimensions) to the physical location of the cell’s data … the data file, the offset within the data file, and so on.

You can get exactly this information from the storage.header simply by using list ( ‘chunk_map’ ). And also … we’re almost certain to change the physical organization of the storage.header file because it’s got its share of problems.
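
For example (the exact keyword has varied a little between releases, so check what list() accepts in your version):

    # dump the chunk map for all arrays stored on this installation
    iquery -aq "list('chunk_map')"

Because list() returns an ordinary SciDB array, you can filter and project its output with normal queries instead of parsing storage.header yourself.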

Now … we’ve spent a lot of time writing tools to save / load data in parallel. I would look at them. They will let you save data from SciDB into a range of formats, using a range of modalities (single-instance save, parallel save, save as text / binary), and you can use the SciDB query language to organize the data being saved into any form you want (for example, unpack(…) to turn an n-D array into a single stream of cells, with each cell’s coordinates prepended to its attribute values).
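
A sketch of that kind of pipeline (again, ‘my_array’, the new dimension name ‘cell_no’, and the output path are placeholders):

    # flatten the n-D array into a 1-D stream (the old coordinates become leading attributes),
    # then save one file per instance in parallel
    iquery -anq "save(unpack(my_array, cell_no), '/mnt/hdfs/scidb_export/my_array_flat.csv', -1, 'csv')"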

Stepping back from the details … what is it that you’re trying to achieve? I suspect the Hadoop (HDFS) related questions might be a bit of a red herring. SciDB really doesn’t care where the data you save or load from is stored. It can be a local file-system, a pipe, a network attached file-system, HDFS, etc.


#5

[quote=“plumber”]Nooo! Down that path … trying to reverse engineer our storage model … lies madness!

The contents of the storage.header file are a list of all the chunks on the local instance of the overall system. The idea is to be able to map from an array_name and a cell’s ‘coordinates’ (the vector of int64 numbers that define a cell’s “location” within the space defined by the product of the dimensions) to the physical location of the cell’s data … the data file, the offset within the data file, and so on.

You can get exactly this information from the storage.header simply by using list ( ‘chunk_map’ ). And also … we’re almost certain to change the physical organization of the storage.header file because it’s got its share of problems.

Now … we’ve spent a lot of time writing tools to save / load data in parallel. I would look at them. They will let you save data from SciDB into a range of formats, using a range of modalities (single-instance save, parallel save, save as text / binary), and you can use the SciDB query language to organize the data being saved into any form you want (for example, unpack(…) to turn an n-D array into a single stream of cells, with each cell’s coordinates prepended to its attribute values).

Stepping back from the details … what is it that you’re trying to achieve? I suspect the Hadoop (HDFS) related questions might be a bit of a red herring. SciDB really doesn’t care where the data you save or load from is stored. It can be a local file-system, a pipe, a network attached file-system, HDFS, etc.[/quote]
Does SciDB have the list(‘chunk_map’) operation?