I recently installed scidb and start playing with it. After going through the example tables and queries in the documentation I want to to try something more complicated and can not find the info in the docs. Some questions are bellow, any answer is highly appreciated.
A sample problem i want to solve is the following:
I have a building with temperature and humidity sensors. Each room in the building has a sensor which provides the temperature and humidity readings every half hour.
There are X csv files, one for each sensor, for each day. For example for sensor1.txt
The table that holds the info is one dimension and looks like:
sensor<name:string, time:datetime, temp:int16, hum:int16> [i=0:*, 100,0]
Question 1: how do i get the average temperature for sensor1 for 3 days between 09am and 11am? I have figured out how to reduce my sample to either a specific range of days OR a specific timeframe, but not how to AND them.
Question 2: once i load the above csv file for sensor1 on day1, loading another csv file on the array replaces the previous contents of the array. I have found a kludgy way around it, but whats the recommended way of loading multiple files on an array?
Question 3: for the above problem wouldn’t a 3 dimensional array work best? Something like:
create array sensor<temp:int16, hum:int16>[name(string)=0:,10,0, date(datetime)=0:,10,0, time(datetime)=1:48,48,0]
In that case, the 2 dimensions -name and date- will grow as more dates and more sensors are added, while the third dimension will support 48 intraday readings.
Since this schema is not possible, whats the most efficient schema that can be implemented, assuming all queries will be using the name, date and time at the WHERE clause and not any attributes?
Question 4: a more generic question from the previous point. Given a choice in schema design, does it make any difference in performance if the query WHERE clause is filtering on dimensions or attributes?
Question 5: Are there any plans of adding ‘date’ and ‘time’ datatypes? Or allow datetime to be used as a date or time only? Is there any documentation regarding building user custom datatypes that can be used as dimensions?
With the above schema you can do something like this:
time > 'SOME_START_DATE' and
time < 'SOME_END_DATE' and
hour_of_day(time) >= 9 and hour_of_day(time)<11
So, we have functions “hour_of_day” and “day of week” for exactly this sort of thing. But this computation is not something that SciDB is built for, right. The whole point is to use dimensions to your advantage, which we get into with your next questions.
Also as I mentioned in a few other posts, you want larger chunk sizes. Shoot for about 1 million elements in the chunk.
Right now there are several things you can do:
As you may have noticed, when you load twice, the first load is not overridden exactly. What the system does is create a new version. So you have two versions of the array “sensor”: “sensor@1” and “sensor@2”. And you can refer to these specific versions in one query, i.e. “aggregate(sensor@1, count(*))”. And if you want access to all the data at once you can use something like “join” or “allversions(sensor)”. This mode works for some applications.
For other kinds of loads you can, instead of using “load(array, ‘file’)” say “store(merge(input(array,'file), array), array)”. This first loads the new stuff from the file, does a “merge” with the old stuff and then stores everything. This has the effect of making the last array version contain all of the data.
The most efficient way to do this is a hack right now. You have to create an IMMUTABLE array and then make sure that when you load things you are loading whole chunks at a time. The IMMUTABLE array has a semantic where new chunks can be added to it, but chunks that have already been written cannot be changed. And there are no versions to worry about. That would be absolute best, but it doesn’t work with non-integer dimensions.
Question 3/ Here’s what I would do if I were you:
Our non-integer dimension support is still lacking. So I would convert all dimensions to integer. Looks like you have only about 48 data points per file, which is way too small for a scidb chunk. So I would combine a bunch of files together and do bulk loads.
Where date is simply the number of days that have elapsed since some well known “day 0”. So this gives you comfortable support for 1000 sensors for 10 days in one chunk. And if you have more than 1000 sensors - no problem, it just creates an extra chunk in that space. So this is built to perform a load every 10 days as new data arrives.
How do you actually perform such a load?
Well you could at first create a temporary 1D array that looks like the above:
It would be nice if you could concat the files and load all the files for all the sensors for 10 days into this guy at once, but if that’s not doable you could do tricks like allversions. Now also I am making a convenient assumption that all your sensors are named "sensor1, sensor2,… " so we can just strip away the word sensor and cast the rest into an int64. So you’re looking at something like:
apply ( sensor_staging, sensor_id, int64(substr(val,6,100000)),
date, int64(time)/86400 - int64(datetime('6/1/2012'))/86400, --here I am assuming that 6/1/2012 is your start date
dtime, (int64(time)%86400) / 1800 + 1, --and here we convert your reading time to a number between 1 and 48
And then you can remove sensor_staging, rinse, repeat. Right now, this kind of approach gives you the best load performance and the best performance on subsequent queries. Naturally, this is kludgy. One of our priorities for the next release is to support inserts into multidimensional arrays seamlessly (with and without non-integer dimensions). We are definitely working on this problem.
Question 4: Yes filtering on dimensions can be done a lot faster. And dimensions allow you to perform other things like group-by aggregates, windows and joins. The “filter” operator will work about the same on attributes versus dimensions, but when you are restricting on a dimension, you can use “between” and “cross_join” operators to really speed things up! So your queries can compute much more interesting things easily. For example, average temperature, for the days between 15 and and 25, for time points between 10 and 20, for each sensor:
Question 5: datetimes really represent very well as integers so no we don’t currently have any plans to change or expand datetime support in the near future. Yes you can make custom datatypes and our source tree contains example datatypes “point” and “rational” that you can play around with. Any datatype (custom or not) can serve as a non-integer dimension, BUT you are still missing the ability to cleanly insert data into an array with non-integer dimensions. That will come in the next release.