Importing sparse array using loadcsv.py


#1

Hi,

I’m importing quite a large sparse array (10M non-zero elements) using loadcsv.py.

It’s been more than 4 hours but it is still showing “loading…”.

How long do you think it will take?

Or is there a better way to do this?

Alternatively, I might generate a random sparse array (around 10M nnz) using “build_sparse”. Do you think that would be faster to create?

Please help me out!

Thanks in advance!
MJ


#2

Not sure what you’re importing, but it shouldn’t take that long. Here’s an R script that builds a sparse 100M x 100M array with 10M random non-empties. It takes 35 seconds on my laptop:

library("scidb")
scidbconnect()

build_sparse_array <- function(num_cells               = 10000000,
                               x_start                 = 1,
                               x_end                   = 100000000,
                               y_start                 = 1,
                               y_end                   = 100000000,
                               desired_cells_per_chunk = 1000000,
                               chunk_size              = 0)
{
  dx <- x_end - x_start + 1
  dy <- y_end - y_start + 1
  if (chunk_size <= 0)
  {
    density <- num_cells / (dx * dy)
    # Relationship: chunk_size * chunk_size * density = desired_cells_per_chunk,
    # assuming a 2D array with a uniform distribution of data
    chunk_size <- round(sqrt(desired_cells_per_chunk / density), digits = 0)
    if (chunk_size * chunk_size > 2^62)
    {
      print("Calculated chunk size too large; set chunk size manually")
      return(0)
    }
  }
  # If dx or dy is very large, check the maximum return value of random()
  # Build a 1D array of num_cells random values, then attach random x/y coordinates
  query <- sprintf(
  "apply(
    build(
     <value:double>
     [n=1:%i,1000000,0],
     random()
    ),
    x, random() %% %i + %i,
    y, random() %% %i + %i
   )", num_cells, dx, x_start, dy, y_start)

  # Redimension the 1D result into the sparse 2D array
  query <- sprintf(
  "redimension(
    %s,
    <value:double> [x=%i:%i,%i,0, y=%i:%i,%i,0]
   )", query, x_start, x_end, chunk_size, y_start, y_end, chunk_size
  )

  query <- sprintf("store( %s, target)", query)
  scidbremove("target", error = invisible)
  t1 <- proc.time()
  iquery(query)
  print(proc.time() - t1)
}
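
To see what the chunk-size formula in the function works out to, here is the arithmetic for the default parameters as a plain-R sketch (no SciDB connection needed; the variable names just mirror the function’s arguments):

```r
# Defaults from build_sparse_array:
num_cells <- 1e7                  # 10M non-empty cells
dx <- 1e8                         # 100M cells along x
dy <- 1e8                         # 100M cells along y
desired_cells_per_chunk <- 1e6

density <- num_cells / (dx * dy)  # fraction of cells that are non-empty: 1e-9
chunk_size <- round(sqrt(desired_cells_per_chunk / density))  # 31622777

# Sanity check: expected non-empty cells per chunk is chunk_size^2 * density,
# which should come back to roughly desired_cells_per_chunk (~1e6)
chunk_size^2 * density
```

So each chunk spans about 31.6M cells per side but is expected to hold only about a million non-empty cells.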

Do not use build_sparse if the density is less than 10% or so. build_sparse evaluates the expression for every possible cell, so if the array is very sparse it takes forever. For that reason it is deprecated. Use the kind of pattern you see above instead.
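
For the array in question the density check is quick arithmetic (plain R, just plugging in the numbers from post #1):

```r
nnz   <- 1e7          # ~10M non-zero elements
cells <- 1e8 * 1e8    # 100M x 100M possible cells

nnz / cells           # density = 1e-9, i.e. 0.0000001%
```

That is many orders of magnitude below the ~10% threshold, so build_sparse would have to evaluate 1e16 cells to produce 1e7 values.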


#3

This will be great! I’ll try it.

By the way, is there a fast way to import existing sparse data in CSV or any other format?
As I mentioned, “loadcsv.py” was taking too long (I stopped it after more than 5~6 hours, so I don’t know how long it would have taken).

Thank you!!
MJ