Parsing long strings


#1

Hi folks,

I recently played around with the “stream()” interface. The output is usually one long string coming from the UDF’s standard output, which then needs to be parsed and split into cells. In my case, a single output chunk consists of 8M strings, each of which is a float in string form.

My first attempt was to cross-join this chunk with an index array and extract the right substring with nth_csv(). I noticed that this process took forever, presumably because the cross-join makes many duplicate copies of the 8M-float string.
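
Roughly speaking, the first attempt looked like this (array and attribute names below are simplified placeholders, not my exact query):

# 'chunk_out' holds the single long CSV string in an attribute called 'response'
$ iquery -anq "store(project(apply(cross_join(chunk_out, build(<n:int64>[k=0:7999999,1000000,0], k)), val, nth_csv(response, n)), val), parsed_vals)"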

My second attempt was to use the parse() function, which is deprecated but still accessible. This was more interesting: it worked pretty fast for ~10K or fewer strings. For 100K to 1M strings it got slow; I didn’t wait for it to finish, and maybe it never would have. When I tested it at full scale, i.e., 8M strings, it took about a second to give me this error:

Has anyone seen this before?

-Dongfang


#2

Hey Dongfang.

This is interesting. What is the shape of your output? You mentioned 8M strings total, but is that a single row, or is there a breakdown into rows and columns? And are those 8 million values in one chunk, or spread across many chunks? It may be slow if you have a single large row of data and try to convert it into multiple attributes. Also, the error likely means you need to change your chunk_size setting.

Here’s a little experiment. First, create an array with 8 million floats:

$ iquery -anq "store(build(<val:float>[i=1:8000000,1000000,0], (random()%10000)/10000.0), random_floats_8m)"
Query was executed successfully

$ iquery -aq "limit(random_floats_8m, 5)"
{i} val
{1} 0.4131
{2} 0.8689
{3} 0.1653
{4} 0.2763
{5} 0.554

Let’s actually export it out and convert it to a large blob of text. We’ll save it as TSV and then delimit it using ‘|’ instead of newline:

$ iquery -anq "save(random_floats_8m, '/tmp/random_floats', 0, 'tsv')"
Query was executed successfully

$ head -n 5 /tmp/random_floats
0.4131
0.8689
0.1653
0.2763
0.554

$ cat /tmp/random_floats | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/|/g' > /tmp/random_floats_single_line
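
(As an aside, paste does the same newline-to-'|' join a little more readably, and should produce an identical file:)

$ paste -sd'|' /tmp/random_floats > /tmp/random_floats_single_line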

And now let’s load that one big blob back into an array. From the summarize output you can see it becomes a single value that is over 50MB in size:

$ iquery -anq "store(input(<val:string not null>[i=0:0,1,0], '/tmp/random_floats_single_line', 0, 'tsv'), big_blob)"
Query was executed successfully

$ iquery -aq "summarize(big_blob)"
{inst,attid} att,count,bytes,chunks,min_count,avg_count,max_count,min_bytes,avg_bytes,max_bytes
{0,0} 'all',1,55110443,2,1,1,1,96,2.75552e+07,55110347

A scan on it will dump a big blob of the values separated by |. So how can we parse this?

There are a few options. First, we can treat | as the “line delimiter” and parse everything into a single attribute. Note also that parse requires its input to be two-dimensional, so we use reshape to add a second, meaningless dimension:

$ time iquery -anq "store(parse(reshape(big_blob, <val:string>[i=0:0,1,0, j=0:0,1,0]), 'num_attributes=1', 'line_delimiter=|'), first_result)"
UserException in file: src/array/MemChunk.cpp function: writeItem line: 911
Error id: scidb::SCIDB_SE_EXECUTION::SCIDB_LE_NO_CURRENT_ELEMENT
Error description: Error during query execution. No current element.

real	0m1.384s
user	0m0.017s
sys	0m0.006s

Here we run into the error you saw! The default chunk size of parse is 1 million, i.e., it makes room for that many rows. But we can easily increase it with the chunk_size option:

$ time iquery -anq "store(parse(reshape(big_blob, <val:string>[i=0:0,1,0, j=0:0,1,0]), 'num_attributes=1', 'line_delimiter=|', 'chunk_size=8000000'), first_result)"
Query was executed successfully

real	0m8.346s
user	0m0.014s
sys	0m0.009s

$ iquery -aq "summarize(first_result)"
{inst,attid} att,count,bytes,chunks,min_count,avg_count,max_count,min_bytes,avg_bytes,max_bytes
{0,0} 'all',8000000,95120769,3,8000000,8e+06,8000000,48,3.17069e+07,95120649

$ iquery -aq "limit(first_result, 5)"
{source_instance_id,chunk_no,line_no} a0,error
{0,0,0} '0.4131',null
{0,0,1} '0.8689',null
{0,0,2} '0.1653',null
{0,0,3} '0.2763',null
{0,0,4} '0.554',null
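
As an aside, notice the error attribute that parse emits alongside each line; it is null when the line splits cleanly. A quick sanity check (count(error) counts only non-null values) should report zero errors for this input:

$ iquery -aq "aggregate(first_result, count(error) as parse_errors, count(*) as total_lines)"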

So now we have separated the data into 8 million distinct values. Note that the stored size has increased, because each string value must be prepended with its length. So to save space we really want to cast the strings to floats:

$ time iquery -anq "store(project(apply(parse(reshape(big_blob, <val:string>[i=0:0,1,0, j=0:0,1,0]), 'num_attributes=1', 'line_delimiter=|', 'chunk_size=8000000'), val_float, dcast(a0, float(null))), val_float), second_result)"
Query was executed successfully

real	0m14.904s
user	0m0.015s
sys	0m0.007s

$ iquery -aq "limit(second_result, 5)"
{source_instance_id,chunk_no,line_no} val_float
{0,0,0} 0.4131
{0,0,1} 0.8689
{0,0,2} 0.1653
{0,0,3} 0.2763
{0,0,4} 0.554

$ iquery -aq "summarize(second_result)"
{inst,attid} att,count,bytes,chunks,min_count,avg_count,max_count,min_bytes,avg_bytes,max_bytes
{0,0} 'all',8000000,32017156,2,8000000,8e+06,8000000,48,1.60086e+07,32017108

So this takes a bit longer now because of the casting. Dcast is definitely faster than the float(...) form, but the semantics are different: dcast returns the supplied default (here null) when a cast fails, while float() throws an error.
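
If you want to see the difference on a value that isn’t a number, here is a quick one-cell sketch (the attribute names are just placeholders):

$ iquery -aq "apply(build(<v:int64>[i=0:0,1,0], 0), bad, 'xyz', as_float, dcast(bad, float(null)))"

The dcast version returns null for as_float; swapping in float(bad) instead would abort the query with a conversion error.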

Let me know if this is illustrative. You may also be having trouble if you are trying to interpret your data as a single large row instead of a column. It’d be good if you could provide an example, or the dimensions of what your data looks like.

And have a good weekend!


#3

Thinking about this further, it occurs to me that at some point we should add a new streaming interface for working with Python, just as our interface for R is much more efficient than plain TSV.

There’s been talk about adding a Feather / Arrow interface, for example. As a Python / Stream user, is that something you would find interesting?