Load multiple CSVs into same array


#1

Hi
If I run multiple

db.load('<arrayname>', "'</pathto>.csv'", 0, "'CSV'") into the same <arrayname>

does it append or overwrite the existing <arrayname>?

If it overwrites what is the best method for loading multiple csv files into the same array?

Thanks

Kevin


#2

Hi @townheadbluesboy. Yes the default is to overwrite. Assuming your array is 1-dimensional and you want to keep appending data to the end of the array, in a dense fashion. We create a dimension “x” that is merely the row of the data appended. We can then append by joining incoming data with the total count of elements in the existing array.

For example:

$ cat /tmp/a.csv 
alice,1
bob,2
clark,3

$ cat /tmp/b.csv 
dave,4
eric,5

$ iquery -aq "create array foo <a:string, b:double>[x=0:*:0:100000]"
Query was executed successfully

$ iquery -anq "
 insert(
  redimension( 
   apply(
    cross_join(
     input(<a:string, b:double>[j=0:*], '/tmp/a.csv', format:'csv'), 
     op_count(foo)
    ), 
    x, j+count
   ), 
   foo
  ), 
  foo
 )"

iquery -anq "
 insert(
  redimension( 
   apply(
    cross_join(
     input(<a:string, b:double>[j=0:*], '/tmp/b.csv', format:'csv'), 
     op_count(foo)
    ), 
    x, j+count
   ), 
   foo
  ), 
  foo
 )"

$ iquery -aq "scan(foo)"
{x} a,b
{0} 'alice',1
{1} 'bob',2
{2} 'clark',3
{3} 'dave',4
{4} 'eric',5

So, in this way j corresponds to the line number in the incoming CSV file and x is the position in foo computed as j plus the number of elements already in the array. In this way you have fine-grained control over where each element goes and, if we were loading into, say a matrix, it would be a matter of computing another coordinate y.

Note also that each insertion generates a timestamped version and you can “go back in time”:

$ iquery -aq "versions(foo)"
{VersionNo} version_id,timestamp
{1} 1,'2018-06-26 15:53:43'
{2} 2,'2018-06-26 15:54:10'

$ iquery -aq "op_count(foo@1)"
{i} count
{0} 3

See more notes and examples here: SciDB's MAC(tm) Storage Explained