Loading data for a sparse array


#1

Could you please help me with how to load data into a sparse array?


#2

An easy approach is to specify the data as a 1-D array of (coordinate, coordinate, …, value) tuples, load it, and then re-dimension. Here is an example that does this in one step using loadcsv.py:

Consider the example file named example.csv that contains:

row,col,value
1,2,5.5
1,1,3.1

Let’s say we want to load these two values into a 2x2 sparse array called “A”:

loadcsv.py -n 1 -x -a "Araw" -s "<row:int64,col:int64,value:double>[i=0:*,1000,0]" -A "A" -X -S "<value:double>[row=1:2,2,0,col=1:2,2,0]" < example.csv

See loadcsv.py -h for detailed help on the options.
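As a quick check (this is just a sketch — the exact output formatting depends on how you run iquery), scanning the target array after the load should show only the two non-empty cells of the 2x2 array:

scan(A)

{1,1} 3.1
{1,2} 5.5

Here the braces hold the row and col coordinates and the number after them is the value attribute; the other two cells of the array stay empty.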

You can achieve something similar using a binary load, if your data is in binary form. In that case, let's say your data is in a binary file that looks like:

int64 int64 double int64 int64 double …

so that the int64s are the coordinates and the doubles are the values (just like the CSV example above, but in binary).

Then you could do something like (assuming the binary file is called /tmp/example.bin):

create_array(Araw, <row:int64,col:int64,value:double>[i=0:*,1000,0])
load(Araw, '/tmp/example.bin', 0, '(int64,int64,double)')
create_array(A, <value:double>[row=1:2,2,0,col=1:2,2,0])
redimension_store(Araw, A)

The general pattern in either case is to load into a 1D array, then redimension into whatever shape your sparse array should be.
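If you want to see the intermediate step, you can scan the flat load array before redimensioning (again a sketch; output shown in coordinate form). For the example data it holds one cell per (row, col, value) tuple along the synthetic dimension i:

scan(Araw)

{0} 1,2,5.5
{1} 1,1,3.1

redimension_store then folds the row and col attributes back into dimensions, producing the same 2x2 result shown earlier.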

Hope this helps…


#3

Thanks!

I have a sparse array (size: 60001700, #nonzero: about 900K).
I found that it is really slow when this array is multiplied by its transpose (A A').
It seems that whenever "getConstIterator" is called, it creates a MultiplySparseArrayIterator and the left and right sparse row/column matrices are initialized. I don't know why we need to do this. In my case, the sparse array contains only one chunk, but MultiplySparseArrayIterator is created about 5 times. Please correct me if I misunderstand this mechanism.

Thanks!
-MJ


#4

MJ,

Yes indeed, multiply is not well-optimized for sparse matrices right now. This is a known issue that we are working on. I will update this post as soon as I can assess an ETA for a better-optimized multiply operator.

Best,

Bryan


#5

Yes, what Bryan said.

This multiply code is a little old and we haven't touched it for some time, so it's very possible that there are some optimization opportunities. Maybe the left and right row/column matrices ought to be initialized once in the array object and not in the iterator.

Note also that the current multiply has at least three different code paths, and you can pick the one you want with a parameter passed to multiply. Some of the execution is also multithreaded: if your query is simply "multiply(A,AT)" with data returned to the client, then the ParallelAccumulatorArray will create several threads to collect the data. This threading is controlled by the config setting "result-prefetch-queue-size", and it may be why you are seeing multiple iterators created.

You can try queries like "store(multiply(A, AT), B)" or "sg(multiply(A,AT), 1, -1)" to force the system to use different threading.
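For example (a sketch only, assuming A is the array from above and that you materialize the transpose yourself, e.g. with the transpose() operator, before multiplying; adjust array names to your setup), you could compare:

multiply(A, AT)

against

store(multiply(A, AT), B)

and

sg(multiply(A, AT), 1, -1)

where AT was built once with something like store(transpose(A), AT). The first query returns the result to the client and so goes through the ParallelAccumulatorArray and its prefetch threads; the other two keep the result on the server side, which should change how many iterators get created. Lowering result-prefetch-queue-size in your config is another way to reduce the number of prefetch threads.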