SciDB count number of occurrences of a particular character in a string


#1

There is an array DataI with string attribute VCF_AD. I want to count all the possible number of comma-s in the VCF_AD column (using AFL).

I am writing sample column per row in the output below:

{Dims  }  VCF_AD       count_of_commas
{0,31,0}  '129'        <= 0
{0,36,0}  '75,82'      <= 1
{0,101,0} '99,96'      <= 1
{0,117,0} '92,69,44'   <= 2
{0,353,0} '78,75'      <= 1

The answer I calculated on R is 0 through 6, but it takes a bunch of time to download into R first (the download is ~9 GB!).

The R way is given below:

xx = as.R(db$project(DataI, 'VCF_AD'))
object.size(xx)
# 9405736192 bytes 

library(stringr)
sort(unique(str_count(xx$VCF_AD, ",")))
# [1] 0 1 2 3 4 5 6

#2

Here is a solution using char_count function built into SciDB:

iquery -aq "uniq(
   sort(
    project(
      apply(DataI, number_of_comma, char_count(VCF_AD, ',')), 
    number_of_comma)
   )
)"

Turns out that for a really big DataI array (293318219 elements), the above query takes too long (especially the sort). I found the following to be more optimized, as it reduces the results significantly before the most time-consuming sort operation (using the first uniq(unpack(.., i)) pair).

time iquery -aq "uniq(
  sort(
    uniq(
      project(
        unpack(
          project(
            apply(DataI, number_of_comma, char_count(VCF_AD, ',')), 
            number_of_comma), 
          i), 
        number_of_comma)
      )
  )
)
# {i} number_of_comma
# {0} 0
# {1} 1
# {2} 2
# {3} 3
# {4} 4
# {5} 5
# {6} 6
# 
# real	0m45.724s
# user	0m0.011s
# sys	0m0.000s