High-density time-series data


#1

Can anyone comment on the use of SciDB to store high-density time-series data (i.e. has anyone tried it, is it even an idea worth trying, should it be able to handle it)? We’re looking at data ranging from every few minutes to 2kHz. The idea is to be able to store all of this data within a single database for archival purposes while also making the data available for visualization and processing/analysis.

Thanks!


#2

Hi Andrew!

I’m quite confident that SciDB will be able to handle the scale of data you’re proposing, but the current design point isn’t “stream ingest”. So if what you want to visualize is (for example) moving average bars as they change “in real time”, I don’t think SciDB is the solution.

If, on the other hand, your starting point is a historical log of time-series data and what you want to do is project it into other forms for visualization (say, window averages or covariances over time), then I really think SciDB can help. Can you give a broad outline of what you’re looking at? Number of attributes? The other, non-temporal dimensions? The kinds of queries you’re thinking of running?
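To make that concrete: if the history were stored in a 2-D array with a time dimension and a sensor dimension, a trailing moving average could be expressed with the window() operator. This is only a sketch (the array name ts, the attribute val, and the window widths are assumptions, and the exact window() argument list has changed between SciDB releases; recent ones take a preceding/following pair per dimension):

window(ts, 999, 0, 0, 0, avg(val))

That computes, for every cell, the average of val over the current sample and the 999 preceding samples along time, separately per sensor.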

KR

Pb

#3

Hi Andrew,

I am researching this problem right now for my PhD thesis. I would be glad to discuss it further with you if you like. SciDB is one of the options we are studying.

Best Regards

Bruno Grieco


#4

I am building a very similar database: 8 sensors, each outputting data at 2 kHz. I set up a 2-D array with time on the x axis and sensor number on the y axis. I am having some difficulty importing the data into the database, or rather the imports are taking a long time. I would like to speed up the imports, especially since we have more than two years’ worth of data to load.
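For concreteness, the schema for such an array might look roughly like the statement below; the array and attribute names and the value type are assumptions, and the chunk sizes are the (62000,8) I mention further down:

CREATE ARRAY sensorData <val:double> [time=0:*,62000,0, sensor=0:7,8,0];

That is, an unbounded time dimension chunked 62000 samples at a time, and a sensor dimension covering the 8 sensors in a single chunk with no overlap.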

Currently I am creating a data file of about 1 GB containing the data in this 2-D format; 1 GB is almost 30 minutes of data. Loading the whole file takes about 10 minutes, which is much longer than I would expect. Does anyone have ideas about what I may be doing wrong, or alternatives I should try?

I am using this command: iquery -aq "load(dbName, 'filename')" > /dev/null

The reason for piping the output to /dev/null is that an import seems to have three phases: first it parses the file, then it loads the data, and finally it prints the result. Printing 1 GB of data to stdout takes a very long time, so if I could disable the printing altogether, that might save some time as well.

I am also loading the data in chunks of size (62000,8) with 0 overlap. I imagine there is an optimal chunk size for the imports. Might it be faster to use chunks of (62000,1) or (124000,8)?
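For what it’s worth, one way to test an alternative chunk shape without re-importing from scratch might be SciDB’s repart() operator, which rewrites an array into a new schema with different chunking. The line below is only a sketch; the array names match the hypothetical schema above, and the (124000,8) chunking is just one of the candidates mentioned:

store(repart(sensorData, <val:double> [time=0:*,124000,0, sensor=0:7,8,0]), sensorData_repart)

Timing the same queries and loads against each version would give a rough comparison of the chunkings.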

Please let me know what you think!
Thanks,
Alan


#5

Alan,
Quick suggestion. Try

iquery -anq "load(dbName, 'filename')"

The “n” means “do not fetch the results of the query”. Using -n should be faster than piping to /dev/null.
Also, are you using a single-node system or multiple nodes?