Understanding schema design


#1

Hello, I’m trying to wrap my head around schema design. I have large, somewhat dense, data files which I want to store in some type of database. It seems like scidb may be what I am after, but I don’t know how to design the schema. Any ideas are appreciated. Although my scenario is not exactly this, this is the best example that I can give that mimics the complexity of my real example.

So, I am tracking racing horses. I have a csv file for each horse. Each row is a timepoint (day to day monitoring), Each column is a measured variable. Each csv file can has a different number of timepoints based on how long the horse has been racing. Each csv file might have different variables(columns) of data, but there is some overlap with the other files. For example

File 1 (JumpingJacks,csv):

Day, RaceTrack, Weight (kg), Grains (kg),Warmup(min)
1, NULL,400,6,20
2, Middleton,403,4,40
3, NULL,399,6,20
4,NULL,402,6,20

File 2 (Castaway,csv):

Day, Weight (kg), Water (L), RaceTrack, UrineOutput (L)
1, 430,32,NULL,16
2, 430,32,NULL, 12
3, 429,36,Cantabery,20
4,428,36,NULL,20

Because of the inconsistency in measured variables, I’m struggling to figure out the right schema. The types of queries that I will want to ask are:

  1. Find me all of the horses who raced at Middleton
  2. Find me all of the horses where we are tracking both weight and UrineOutput.
  3. Randomly sample N from all of the horses where the average grain intake is > 7kg.

Would anyone be able to shed some light on this? As I said my biggest struggles are (a) the variable number/type of columns (with partial overlay between horses) and (b) variable number of rows.

Thanks

David