Outputting large query results to a file


#1

I'm trying to multiply two very large sparse matrices that I've accumulated in SciDB.

Their dimensions, approximately:
220,000 x 43 * 43 x 40,000,000

I want to output the query result to a file (which is obviously going to be HUGE) because I later need to push it into a MapReduce algorithm.

So I'm using iquery -r to write to a file, but it outputs the multiplication in 1,000x1,000-element sparse chunks, each holding about 700,000 elements on average (~30% unfilled).
It is taking quite a while, and the output is approaching 1 TB.
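
For reference, the command I'm running is roughly this (the output path and the array names A and B stand in for my real arrays):

iquery -r /data/result.out -aq "multiply(A, B)"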

1st question: it seems like most of the time in this query is I/O-bound? (I could be wrong.) I'm running in SingleInstance mode on a machine with ~150 GB of RAM and 7 TB of free HDD space for this query. Any idea how I could speed this multiplication up?

If not, no problem. I know it's a huge job, but I have to analyze the output data on a row-by-row basis.
The chunking is useful inside the database, but parsing 1,000 chunks horizontally and vertically will not be fun.
2nd question: Is there any way to output without chunking? Something like a simple csv+ format? The problem is that my chunks are so big that normal parsing techniques aren't going to work.

Any help/insight is appreciated. Thanks!


#2

Yes, we support

iquery -o csv+ -aq "..."

This will give you the CSV output. In our formats:
csv+ means: include the coordinates.
csv means: just the attributes.
lcsv+ means: include the coordinates and translate them into non-integer dimension labels (if applicable).
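
For example, for a small two-dimensional array with one attribute (the array name, header, and values below are made up for illustration),

iquery -o csv+ -aq "scan(small_array)"

emits one line per non-empty cell, coordinates first, something like:

i,j,val
0,0,1.5
0,2,3.7
1,1,0.25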

You can also try storing the multiply result first:

iquery -anq "store(multiply(...), result)"
iquery -o csv+ -aq "scan(result)" > outfile
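
Since csv+ puts the coordinates in the leading columns, you can also group the flat output by row with standard tools before handing it to MapReduce, and drop the stored copy once the export finishes. A sketch, assuming the row coordinate is the first CSV field (strip any header line first):

sort -t, -k1,1n outfile > outfile.by_row   # numeric sort on the first column (the row coordinate)
iquery -aq "remove(result)"                # free the space held by the stored copy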

Storing the result first might also go faster. Let us know how it goes!
–Alex Poliakov