SciDB float datatype same as 32bit float


#1

Hello,

I have an array that was originally a 32-bit float dataset, and I built the SciDB array using float. However, I am running into an issue when attempting to reclassify. It seems the precision displayed in the output is not exactly what is stored in the database?

AFL% apply(between(asian_2010, 10452, 9567, 10452, 9568), newvalue, iif(value = 0.000212804, 1, -999));
{y,x} value,newvalue
{10452,9567} 0.000212804,-999
{10452,9568} 0.000212804,-999
AFL% apply(between(asian_2010, 10452, 9567, 10452, 9568), newvalue, iif(value = 0.000212804268812761, 1, -999));
{y,x} value,newvalue
{10452,9567} 0.000212804,-999
{10452,9568} 0.000212804,-999
AFL% apply(between(asian_2010, 10452, 9567, 10452, 9568), newvalue, iif(value = 0.000212804, 1, 3));
{y,x} value,newvalue
{10452,9567} 0.000212804,3
{10452,9568} 0.000212804,3
AFL% apply(between(asian_2010, 10452, 9567, 10452, 9568), newvalue, iif(value > 0.000212803, 1, 3));
{y,x} value,newvalue
{10452,9567} 0.000212804,1
{10452,9568} 0.000212804,1

#2

You can use the -w max-digits option with iquery.

Changing String Output Length of Floating Point Numbers

$ iquery -a -w 15 
AFL% build(<val:float>[i=1:10], 1.0/i);
{i} val
{1} 1
{2} 0.5
{3} 0.333333
{4} 0.25
{5} 0.2
{6} 0.166667
{7} 0.142857
{8} 0.125
{9} 0.111111
{10} 0.1
AFL% build(<val:double>[i=1:10], 1.0/i);
{i} val
{1} 1
{2} 0.5
{3} 0.333333333333333
{4} 0.25
{5} 0.2
{6} 0.166666666666667
{7} 0.142857142857143
{8} 0.125
{9} 0.111111111111111
{10} 0.1

Note: Single precision (float type) can only accurately store a base-10 number with up to 6 significant digits. A double can accurately store a base-10 number with up to 15 significant digits. This is a limitation of how floating point numbers are represented according to the IEEE 754 specification.

Note: The output from iquery may change in future releases. To accurately represent the “binary value stored in the database” a larger number of digits is really needed. The maximum output digits currently correspond to FLT_DIG and DBL_DIG.

Name       Significand bits  Max digits (decimal->binary->decimal)  Min digits (binary->decimal->binary)
Half       11                3                                      5
Single     24                6                                      9
Double     53                15                                     17
Extended   64                18                                     21
Quadruple  113               33                                     36
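
To make the two round-trip columns concrete, here is a small Python/NumPy sketch. It is not SciDB code, and the helper name `round_trips` is made up for illustration. It prints a binary value with a given number of significant digits, parses the text back into the same type, and checks whether the original value comes back.

```python
import numpy as np

def round_trips(value, digits):
    """True if value survives binary -> decimal text -> binary
    when printed with `digits` significant digits."""
    dtype = type(value)                       # np.float32 or np.float64
    text = "%.*g" % (digits, float(value))    # decimal text, `digits` significant digits
    return bool(dtype(float(text)) == value)  # parse back into the same type and compare

# A single-precision value like the one in the question
single = np.float32(0.000212804268812761)
# A double that is the result of arithmetic, not of parsing a short literal
double = np.float64(0.1) + np.float64(0.2)

print(round_trips(single, 6), round_trips(single, 9))    # False True
print(round_trips(double, 15), round_trips(double, 17))  # False True
```

Printing with 6 (or 15) digits loses the trailing bits, so the text no longer identifies the stored value; 9 (or 17) digits are enough to get it back.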

#3

Hmm,

That’s not it. There is definitely an issue on my end when I loaded the data. But what I can’t determine is why SciDB doesn’t seem to be able to give me enough information in the output. The -w flag you mentioned didn’t provide any more decimals unfortunately.

Oddly I get the result when I cast this to text.

scidb@scidb-vm:~$ iquery -a -w 18 -q "apply(between(asian_2010, 10452, 9567, 10452, 9568), newvalue, iif(value = 0.000212804, 1, 3));"
{y,x} value,newvalue
{10452,9567} 0.000212804,3
{10452,9568} 0.000212804,3
scidb@scidb-vm:~$ iquery -a -w 18 -q "apply(between(asian_2010, 10452, 9567, 10452, 9568), newvalue, iif(string(value) = '0.000212804', 1, 3));"
{y,x} value,newvalue
{10452,9567} 0.000212804,1
{10452,9568} 0.000212804,1
scidb@scidb-vm:~$ iquery -aq "show(asian_2010)"
{i} schema
{0} 'asian_2010<value:float> [y=0:21737:0:1000; x=0:19370:0:1000]'
scidb@scidb-vm:~$ iquery -a -w 18 -q "apply(between(asian_2010, 10452, 9567, 10452, 9568), newvalue, iif(value = float(0.000212804), 1, 3));"
{y,x} value,newvalue
{10452,9567} 0.000212804,3
{10452,9568} 0.000212804,3

#4

David, what version are you running on?


#5

I’m running SciDB 16.9


#6

You probably want to use a double and not a float.

If you are using float you will never get more than 6 significant decimal digits.

The -w is the maximum possible number of digits that can be output. The number of actual decimals output can be seen as:
min( MaximumSignificantDigits_for_the_type, Value_specified_by_the_w_argument)

So if you specify -w 10203030303 but you are using floats:
min(6, 10203030303) ==> 6
For doubles:
min(15, 10203030303) ==> 15

There is no way to give more resolution (significant digits) than the underlying type allows.

(Again: the output from iquery is not ideal if you are attempting a “base_2 -> base_10” round trip. DBL_DIG and FLT_DIG are common C defines that are used for the “base_10 -> base_2” round trip.)
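
As a rough illustration of the min() rule above, here is a Python sketch. It only mirrors the rule as described in this post; it is not how iquery is implemented, and `displayed_digits` is a hypothetical helper. NumPy's `finfo(...).precision` gives the same 6 and 15 as FLT_DIG and DBL_DIG.

```python
import numpy as np

def displayed_digits(dtype, w):
    """Digits shown under the rule above:
    min(max significant digits for the type, value given to -w)."""
    return min(np.finfo(dtype).precision, w)

print(displayed_digits(np.float32, 10203030303))  # 6
print(displayed_digits(np.float64, 10203030303))  # 15
print(displayed_digits(np.float64, 10))           # 10
```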


#7

I reloaded the dataset as double and found something interesting…

scidb@scidb-vm:~$ iquery -aq "show(asian_2010)"
{i} schema
{0} 'asian_2010<value:double> [y=0:21737:0:1000; x=0:19370:0:1000]'

I was able to get the outcome I wanted by casting. But it is strange.

scidb@scidb-vm:~$ iquery -a -q "apply(between(asian_2010, 10452, 9567, 10452, 9568), newvalue, iif(float(value) = float(0.000212804268812761), 1, -99 ))";
{y,x} value,newvalue
{10452,9567} 0.000212804,1
{10452,9568} 0.000212804,1
scidb@scidb-vm:~$ iquery -a -q "apply(between(asian_2010, 10452, 9567, 10452, 9568), newvalue, iif(double(value) = double(0.000212804268812761), 1, -99 ))";
{y,x} value,newvalue
{10452,9567} 0.000212804,-99
{10452,9568} 0.000212804,-99

I am assuming my method for loading the data is incorrect, but I can’t determine why SciDB won’t evaluate this as equal. I’ve never had problems with integer values.


#8

A couple of things about the queries you provided.

  1. A decimal (base-10) value with more than 6 significant digits cannot be faithfully stored in a float, so 0.000212804268812761 can never be faithfully represented as a float; the best you can do is 0.000212804. Floating point numbers are stored as $d_{-p}2^{-p} + d_{-p+1}2^{-p+1} + \dots + d_{0}2^{0} + d_{1}2^{1} + \dots$ but you are trying to represent $D_{-p}10^{-p} + D_{-p+1}10^{-p+1} + \dots + D_{0}10^{0} + D_{1}10^{1} + \dots$. Not all base-10 numbers can be exactly represented in base 2 in the finite number of bits available. A double has more bits, so it can represent more significant decimal digits (15).
  2. As a result of rounding error in the different bases, an equality comparison with a decimal value may not evaluate as expected.
    Since you are testing for equality, it’s usually better to use
    abs(value - val_what_we_want_to_equal) <= ε. The value of ε will depend upon your application.
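
Here is a minimal Python/NumPy sketch of the ε comparison used for a reclassification. The function name `reclassify` and the choice of ε = 1e-9 are assumptions for illustration only. In AFL the same idea would presumably look like iif(abs(value - 0.000212804) <= 1e-9, 1, -999), but I have not run that query.

```python
import numpy as np

def reclassify(values, target, eps, match=1, other=-999):
    """Return `match` where |value - target| <= eps, otherwise `other`."""
    values = np.asarray(values, dtype=np.float64)  # widen so the subtraction is done in double
    return np.where(np.abs(values - target) <= eps, match, other)

# Two cells holding the value from the question, plus an unrelated value
stored = np.array([0.000212804268812761, 0.000212804268812761, 0.0005],
                  dtype=np.float32)

result = reclassify(stored, 0.000212804, eps=1e-9)
print(result)  # [   1    1 -999]
```

The stored values differ from 0.000212804 by roughly 2.7e-10, so an exact comparison fails while the ε comparison matches.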

This appears to have everything to do with floating point arithmetic. If you want to see this in other places, Python is a simple way:

>>> a = 0.1
>>> b = 100 + a
>>> a == b - 100
False
>>> a
0.1
>>> b - 100
0.09999999999999432

Integers don’t have this problem. Every integer within the range of the integral type can be represented exactly (for a 64-bit signed integer the range is [-(2^63), 2^63 - 1]). The number of integers in this range is finite.

What Every Computer Scientist Should Know About Floating-Point Arithmetic, by David Goldberg, provides a much more thorough explanation of floating point. (PDFs of his paper are available in lots of places on the web.)

If I’m still mis-interpreting your question, I apologize.

I hope this helps,
;-b


#9

Thanks for the overview. I’ll try to briefly explain the mess I’m working with.
A researcher provided me with a dataset (GeoTIFF), and the metadata on the dataset indicated it was 32-bit precision. Mistake 1. I loaded it accordingly as a SciDB float array. I then rewrote the metadata so it could be picked up as 64-bit float and loaded it as a SciDB double.

The issue is that I don’t seem to be able to find the exact value that is being stored in the database, even with the extra precision flag you provided. I ran iquery -w 33 for the following. I think that is the only thing that is concerning me.

scidb@scidb-vm:~$ iquery -a -w 33
AFL% between(asian_2010_double, 10452, 9567, 10452, 9568);
{y,x} value
{10452,9567} 0.000212804268812761
{10452,9568} 0.000212804268812761
AFL% between(asian_2010_single, 10452, 9567, 10452, 9568);
{y,x} value
{10452,9567} 0.000212804
{10452,9568} 0.000212804
AFL% between(asian_2010, 10452, 9567, 10452, 9568);
{y,x} value
{10452,9567} 0.000212804268812761
{10452,9568} 0.000212804268812761

If you use the exact output from iquery it fails. There must be some precision or rounding effect happening when the information is output. But I find the same values when saving to csv.

AFL% apply(between(asian_2010_single, 10452, 9567, 10452, 9568), newvalue, iif(value = 0.000212804, 1, 3));
{y,x} value,newvalue
{10452,9567} 0.000212804,3
{10452,9568} 0.000212804,3
AFL% apply(between(asian_2010_single, 10452, 9567, 10452, 9568), newvalue, iif(float(value) = float(0.000212804268812761), 1, 3));
{y,x} value,newvalue
{10452,9567} 0.000212804,1
{10452,9568} 0.000212804,1
AFL% apply(between(asian_2010_double, 10452, 9567, 10452, 9568), newvalue, iif(float(value) = float(0.000212804268812761), 1, 3));
{y,x} value,newvalue
{10452,9567} 0.000212804268812761,1
{10452,9568} 0.000212804268812761,1
AFL% apply(between(asian_2010_double, 10452, 9567, 10452, 9568), newvalue, iif(double(value) = double(0.000212804268812761), 1, 3));
{y,x} value,newvalue
{10452,9567} 0.000212804268812761,3
{10452,9568} 0.000212804268812761,3

#10

This is the python function I am using to write the array

def ArrayToBinary(theArray, binaryFilePath, attributeName='value', yOffSet=0):
    """
    Append a 2D numpy array to a binary file as (y, x, value) records

    input: Numpy 2D array
    output: none; the records are appended to binaryFilePath
    """
    import numpy as np

    # if theArray.dtype == 'float32': 
    #     print("Changing array dataype")
    #     theArray = theArray.astype('float64')
    # print("Writing out file: %s" % (binaryFilePath))

    # shape is (number of y values, number of x values)
    numY, numX = theArray.shape

    # y index: each y value repeated once for every x position
    y_index = np.repeat(np.arange(yOffSet, numY + yOffSet, dtype='int64'), numX)

    # x index: the full run of x values repeated once for every y value
    x_index = np.tile(np.arange(numX, dtype='int64'), numY)

    # The value field keeps theArray.dtype, so float32 input is written as 4-byte floats
    records = np.rec.fromarrays([y_index, x_index, theArray.ravel()], dtype=[('y','int64'),('x','int64'),(attributeName,theArray.dtype)])

    # 'ab' appends, so repeated calls with different yOffSet values build up one file
    with open(binaryFilePath, 'ab') as fileout:
        records.tofile(fileout)
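
One way to rule out the writer as the source of the problem is to read the file back and compare. The sketch below is standalone (it does not call the function above) and uses a made-up 2x2 array and a temporary file. It checks that each record is 8 + 8 + 4 = 20 bytes and that the float32 values come back unchanged. I am assuming the matching SciDB binary format string would be along the lines of '(int64, int64, float)', which is not verified here.

```python
import os
import tempfile
import numpy as np

# Made-up sample data standing in for one block of the raster
data = np.array([[2.421, 3.4242], [5.442, 1.132]], dtype=np.float32)
n_y, n_x = data.shape

# Same record layout as the writer: y index, x index, value
record_dtype = np.dtype([('y', 'int64'), ('x', 'int64'), ('value', data.dtype)])
records = np.empty(data.size, dtype=record_dtype)
records['y'] = np.repeat(np.arange(n_y), n_x)   # 0, 0, 1, 1
records['x'] = np.tile(np.arange(n_x), n_y)     # 0, 1, 0, 1
records['value'] = data.ravel()

path = os.path.join(tempfile.mkdtemp(), "chunk.bin")
records.tofile(path)

# Read the bytes back with the same layout and compare
readback = np.fromfile(path, dtype=record_dtype)
print(record_dtype.itemsize)                             # 20
print(np.array_equal(readback['value'], data.ravel()))   # True
```

If this check passes for the real data, the values in the file are the same float32 values as in the array, and any mismatch comes from how the comparison literal is interpreted.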

#11

I should have been more clear… the “guard digits” of the float (the 9 digits for float, or the 17 digits for double, in the “binary->decimal->binary” round trip) cannot currently be output by iquery.

So suppose you have some float/double in the database with a particular “bit pattern”, and you are trying to do an equality check using the output from iquery (val = 0.123456). You will have problems (which I think is what you are saying). The number 0.123456 is the maximum ‘precision’ you can put in, but getting the string that reproduces the exact same float that’s in SciDB (bit for bit, in all 64 bits of a double or all 32 bits of a single) requires that those last few digits be made available.

That is the “Min digits for binary->decimal->binary round-trip” column in the table. Before, I was focusing on the “Max digits for decimal->binary->decimal round-trip” column.

So you’ve entered some data into SciDB (and calculated and whatnot) and now you want to find a way to do an equality comparison with one of those doubles (or floats; in either case you’ve got the same issue)?

We actually have an open bug to allow the user to output the digits needed for the binary->decimal->binary round trip (which would give you those digits with -w 17; the output would be 9 digits long for float and 17 for double in iquery).

Correct me if I’m wrong, but what you are trying to do is:

  1. query scidb
  2. Take the decimal representation of that float/double value given by iquery (in decimal form) – (cut-n-paste)
  3. use that output decimal form to compare to other doubles in scidb.
  4. Your iif(value = cut_n_pasted_value) isn’t matching what you thought it should?

This is because iquery isn’t giving you the correct “maximum number of printed decimal digits” so that you can go binary->decimal->binary. The number of digits should be 9/17, not 6/15.


#12

Yep that is exactly what I am trying to do.


#13

Unfortunately, you’re going to have to resort to the abs(V - V_wanted_for_comparison) < ε approach (even this may not be the best solution, depending upon the difference in magnitude between V and V_wanted_for_comparison).
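
To illustrate the magnitude caveat, here is a short Python sketch comparing a fixed ε with a relative tolerance. It uses the standard library's `math.isclose`; the wrapper name `nearly_equal` and the tolerances are arbitrary choices for illustration, not recommendations.

```python
import math

def nearly_equal(v, wanted, rel_tol=1e-5):
    """Relative comparison: the tolerance scales with the larger magnitude."""
    return math.isclose(v, wanted, rel_tol=rel_tol)

# The value from this thread against the 6-digit literal
print(nearly_equal(0.000212804268812761, 0.000212804))   # True

# Two tiny values that differ by a factor of two
tiny_a, tiny_b = 1.0e-12, 2.0e-12
print(abs(tiny_a - tiny_b) < 1e-9)     # True: a fixed epsilon calls them equal
print(nearly_equal(tiny_a, tiny_b))    # False: the relative test does not
```

A fixed ε that is reasonable for values near 1e-4 is far too loose for values near 1e-12, which is why the right tolerance depends on the data.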

Comparing floating point numbers is, unfortunately, a rather complex computing issue.

(I’m not vouching for the quality of the algorithms but here’s one blog post I found online talking about it… https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/)


#14

I just want to make sure I have this correct.

If I do
float --> binary --> float I will be unable to determine the exact value.

However, if I just go from
float --> float then I will be able to determine the exact value.
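
For what it's worth, a quick Python/NumPy check suggests the binary step itself is lossless and that the loss comes from going through short decimal text. This is only a sketch with made-up values, and it says nothing about what SciDB does internally.

```python
import numpy as np

floatArray = np.array([2.421, 3.4242, 1.0 / 3.0], dtype=np.float32)

# float -> binary -> float: the same 4 bytes per value come back
raw = floatArray.tobytes()
via_binary = np.frombuffer(raw, dtype=np.float32)
print(np.array_equal(via_binary, floatArray))   # True

# float -> 6-digit decimal text -> float: the third value changes
text = ["%.6g" % float(v) for v in floatArray]
via_text = np.array([float(t) for t in text], dtype=np.float32)
print(np.array_equal(via_text, floatArray))     # False
```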


#15

I spent some time with Python and numpy trying to figure out a workaround.

floatList = [ [2.421,3.4242,5.442,1.132,4.12,4.1212],
[3.3131,4.131,1.1221,1.1212,3.122,1.555],
[4.525,2.645,1.74,4.446,1.46,5.4646],
[6.646,4.646,7.876,2.868,6.686,1.68]
]

integerArray = np.array(integerList, dtype=np.dtype('int8'))
floatArray = np.array(floatList, dtype=np.dtype('float32'))
floatStringArray = np.chararray(floatList, dtype=np.dtype('|S'))
floatStringArray.tostring()
b'2.421\x003.42425.442\x001.132\x004.12\x00\x004.12123.31314.131\x001.12211.12123.122\x001.555\x004.525\x002.645\x001.74\x00\x004.446\x001.46\x00\x005.46466.646\x004.646\x007.876\x002.868\x006.686\x001.68\x00\x00'

I can’t get this to load in SciDB though.
Can you provide me an example of how to load float data into SciDB?


#16

Hey @dahaynes here are some examples, setting dimensions aside and just dealing with one strip of data:

floatList = [2.421,3.4242,5.442,1.132,4.12,4.1212]
import numpy as np
floatArray = np.array(floatList, dtype=np.dtype('float32'))
db.iquery("store(input(<x:float not null>[i], '{fn}', 0, '{fmt}'), foo)", upload_data=floatArray)
refetch = db.iquery("scan(foo)", fetch=True, as_dataframe=True)

#Important to keep in mind: refetch is a data frame, so we need to convert its column back to float32.
#This should return True across the board:
refetch['x'].astype('float32') == floatArray

Or if you want to fetch back into a numpy array:

refetch2 =  db.iquery("scan(foo)", fetch=True, as_dataframe=False)

#This should also say True across the board:
refetch2['x']['val'] == floatArray

Does this help?


#17

@apoliakov

This makes sense, but I don’t want to go through SHIM.
I want to write out a file, potentially in parallel and load it.


#18

Gotcha. So - something like this:

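from scidbpy import connect
import numpy as np

# Connect to SciDB; connect() with no arguments uses scidb-py's default connection settings
db = connect()

floatList = [2.421,3.4242,5.442,1.132,4.12,4.1212]
floatArray = np.array(floatList, dtype=np.dtype('float32'))

# Write the raw 4-byte floats to a file
with open("/tmp/foo.bin", "wb") as f:
    f.write(floatArray.tobytes())

# Load the file using the binary '(float)' format
db.iquery("store(input(<x:float not null>[i], '/tmp/foo.bin', 0, '(float)'), foo)")

# Fetch the array back and compare; this should be True across the board
refetch = db.iquery("scan(foo)", fetch=True, as_dataframe=True)
refetch['x'].astype('float32') == floatArray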

Essentially we follow this code path, but replacing shim with a file write:


#19

Thanks Alex,

Unfortunately, that doesn’t solve my original issue of doing an equality comparison within SciDB. I want to be able to do a reclassification on floating point datasets. I thought the problem was originally the float --> binary --> float issue mentioned by @bjc.

Basically, I need a process for loading floating point data in a way that lets me do an equality comparison.

iif(array = 2.421, 1, 3);


#20

Hey @dahaynes

It’s important to keep in mind some double-to-float conversions that can trip us up. For example, in both Python and SciDB, a numeric literal like 2.34 is interpreted as a double by default, and a double may not have an exact float representation. Here’s a pure Python example that illustrates the weirdness you may run into.

In Python3:

floatList = [2.421,3.4242,5.442,1.132,4.12,4.1212]
floatList[0] == 2.421
#True

import numpy as np
floatArray = np.array(floatList, dtype=np.dtype('float32'))
floatArray[0]
#2.421

floatArray[0] == 2.421
#False !!

#But this works:
floatArray[0] == np.array(2.421, dtype=np.dtype('float32'))
#True

So in the same way in SciDB:

file = open("/tmp/foo.bin", "wb")
file.write(floatArray.tobytes())
file.close()

db.iquery("store(input(<x:float not null>[i], '/tmp/foo.bin', 0, '(float)'), foo)")

#This returns no hits
db.iquery("filter(foo, x=2.421)", fetch=True)

#This works
db.iquery("filter(foo, x=float(2.421))", fetch=True)

Does that make sense?
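
To see why the cast matters, here is a small Python/NumPy sketch that widens the float32 value back to double and prints both with 17 significant digits. The types are spelled out explicitly on both sides of each comparison so the result does not depend on how a given NumPy version promotes a bare Python literal. The variable names are just for illustration.

```python
import numpy as np

as_single = np.float32(2.421)     # what is stored in a float attribute
as_double = np.float64(2.421)     # what a bare literal 2.421 means
widened = np.float64(as_single)   # the stored float, promoted to double for comparison

# The two doubles differ after roughly the 7th significant digit
print("%.17g" % as_double)
print("%.17g" % widened)

print(bool(widened == as_double))                # False: like x = 2.421
print(bool(as_single == np.float32(as_double)))  # True: like x = float(2.421)
```

Casting the literal down to float applies the same rounding that happened when the data was stored, so both sides end up with the same 32 bits.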