Feature vector normalization


#1

Hi all,

I’m new to SciDB and I have a question.
I have an array with 2 dimensions (patch_id and feature) and one double attribute named ‘value’:

fv<value:double> [patch_id=0:*,1000000,0,feature=0:*,1000000,0]

There will be about 300,000 patch_ids (rows) and 1024 features (columns), similar to this:

[
[0.2, 0.7, 0.4, ...],
[0.8, 0.6, 0.1, ...],
[0.3, 0.5, 0.9, ...],
...
]

What I would like to do:
Get the sum of all values in each row (along the patch_id dimension), then create a new array where each value is replaced by value/sum:

[
[0.2/1.3, 0.7/1.3, 0.4/1.3], => sum = 1.3
[0.8/1.5, 0.6/1.5, 0.1/1.5], => sum = 1.5
[0.3/1.7, 0.5/1.7, 0.9/1.7], => sum = 1.7
]

I figured out how to get the sum for the rows:

aggregate(fv,sum(value),patch_id);

However, I’m more or less stuck here.
Also, what would be the easiest way to do this for all rows (patch_ids)? Is there some kind of loop?

Thanks a lot for your answers,
-Ivan


#2

I believe what you want is to have an array where each element is divided by the row-sum.

Let me show this with an example array (similar to yours, just smaller):

iquery -aq "create array fv<value:double> [patch_id=0:3,1000000,0,feature=0:2,1000000,0]"
# Query was executed successfully

iquery -aq "store(build(fv, (patch_id+1)*(feature+1)), fv)"
# {patch_id,feature} value
# {0,0} 1
# {0,1} 2
# {0,2} 3
# {1,0} 2
# {1,1} 4
# {1,2} 6
# {2,0} 3
# {2,1} 6
# {2,2} 9
# {3,0} 4
# {3,1} 8
# {3,2} 12

# Calculate the row-sum
iquery -aq "store(aggregate(fv,sum(value),patch_id), temp)"
# {patch_id} value_sum
# {0} 6
# {1} 12
# {2} 18
# {3} 24

# This is the important step
# Cross-join the row-sum with the original array
iquery -aq "store(cross_join(fv as A, temp as B, A.patch_id, B.patch_id), temp2)"
# {patch_id,feature} value,value_sum
# {0,0} 1,6
# {0,1} 2,6
# {0,2} 3,6
# ... (rows for patch_id 1 to 3 omitted)

# Now apply the formula (value/value_sum)
iquery -aq "project(apply(temp2, value_by_sum, value/value_sum), value_by_sum)"
# {patch_id,feature} value_by_sum
# {0,0} 0.166667
# {0,1} 0.333333
# {0,2} 0.5
# ... (rows for patch_id 1 to 3 omitted)

# You do not need to store intermediate results. You can write one big query to do everything together
ROWSUM="aggregate(fv,sum(value),patch_id)"
CROSSJOIN="cross_join(fv as A, $ROWSUM as B, A.patch_id, B.patch_id)"
NORMALIZE="project(apply($CROSSJOIN, value_by_sum, value/value_sum), value_by_sum)"
iquery -aq "$NORMALIZE"
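
A possible follow-up, sketched under assumptions: since the original question asks for a new array, the combined query can be wrapped in store(). The target name fv_norm is hypothetical, and -n is used here on the understanding that it tells iquery not to fetch the result. The second query is a sanity check: the values in each row of the normalized array should add up to 1, up to floating-point rounding. The script only prints the queries when iquery is not on the PATH.

```shell
#!/bin/sh
# Build the normalization query from the same pieces as above.
ROWSUM="aggregate(fv,sum(value),patch_id)"
CROSSJOIN="cross_join(fv as A, $ROWSUM as B, A.patch_id, B.patch_id)"
NORMALIZE="project(apply($CROSSJOIN, value_by_sum, value/value_sum), value_by_sum)"

# fv_norm is a hypothetical name for the new, normalized array.
STORE="store($NORMALIZE, fv_norm)"
# Sanity check: the values in each row of fv_norm should add up to 1.
CHECK="aggregate(fv_norm, sum(value_by_sum), patch_id)"

if command -v iquery >/dev/null 2>&1; then
    iquery -naq "$STORE"   # -n: run the query without fetching the result
    iquery -aq "$CHECK"
else
    # No SciDB client available: just show the queries that would run.
    printf '%s\n' "$STORE" "$CHECK"
fi
```

Rows whose sum is 0 would divide by zero in value/value_sum, so they may need separate handling before the result is stored.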

#3

Thanks a lot for the detailed answer. It works.
I wasn’t aware that the cross_join would be the solution in this case.
-Ivan