docs/notes/floating_point.txt


Floating point math in LRL
--------------------------

TODO remove "float" and "double" types since they are inefficient to use
     in LRL which does not use type promotion. Instead there could be a
     "sysfloat" which is the best/fastest float on the system and the
     existing floatXX types (e.g. float64, float80, etc)

Everything works as in the IEEE 754 standard without traps by default, but
with the following additions:

1) operations on non-number values, i.e. NaN and +/- infinity, can never
   produce a number again. The operations are defined as follows:
   
   n means "num", i.e. not NaN or inf
   x means number or ±inf (but not NaN)
   
    0/0      = NaN
    n/0      = inf, if n != 0 (same sign as n)
    n/inf    = 0
    inf/0    = NaN
    inf/inf  = NaN
    
    0*inf    = NaN
    inf*inf  = inf (sign is changed as in ±1 * ±1)
    inf*n    = inf as above, if |n| >= 1
    inf*n    = NaN, if |n| < 1
    
    inf+0    = inf
    inf+n    = inf, if n has the same sign as the inf
    inf+n    = NaN, if n and inf have different signs
    inf+inf  = inf, if the infs have the same sign
    inf+inf  = NaN, if the infs have different signs
    
    (+0)**+x     = +0 (x is a number or inf)
    0**-x        = NaN
    inf**0       = +1
    inf**+n      = inf, if n >= 1
    inf**+n      = NaN, if n < 1
    (+inf)**+inf = +inf
    (+inf)**-inf = 0
    (+x)**+inf   = inf, if x > 1
    (+x)**+inf   = +0, if x < 1
    1**inf       = 1
    (-x)**inf    = NaN (impossible to determine sign)
    more to do...
    
   Any operation involving a NaN operand produces a NaN, except for
   comparisons, which return false.
    
   When operating on a NaN or inf, a function call to the system library
   or a bundled function may be made. Note that an expression may
   contain multiple operations.
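
   As a rough illustration of the multiplication and addition rows above,
   the following Python sketch models them on ordinary floats. The names
   lrl_mul and lrl_add are made up for this note and are not part of LRL,
   and the sketch assumes the rules are symmetric in their operands, which
   the table does not state explicitly. For comparison, plain IEEE 754
   arithmetic gives inf for inf*0.5 and for inf+(-5), where these rules
   give NaN.

```python
import math

NAN = float("nan")
INF = float("inf")


def lrl_mul(a: float, b: float) -> float:
    """Multiply following the inf/NaN rows of item 1 (illustrative only)."""
    if math.isnan(a) or math.isnan(b):
        return NAN
    if math.isinf(a) or math.isinf(b):
        # Put the infinity in `a` so only one orientation needs handling.
        if not math.isinf(a):
            a, b = b, a
        # inf*inf = inf, and inf*n = inf only if |n| >= 1.
        # Everything else, including 0*inf, is NaN.
        if math.isinf(b) or abs(b) >= 1:
            return math.copysign(INF, a) * math.copysign(1.0, b)
        return NAN
    return a * b


def lrl_add(a: float, b: float) -> float:
    """Add following the inf/NaN rows of item 1 (illustrative only)."""
    if math.isnan(a) or math.isnan(b):
        return NAN
    if math.isinf(a) or math.isinf(b):
        if not math.isinf(a):
            a, b = b, a
        # inf+0 = inf; otherwise the other operand must share the sign.
        if b == 0 or math.copysign(1.0, a) == math.copysign(1.0, b):
            return a
        return NAN
    return a + b
```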

2) the "validfp" type qualifier on a floating point type means that the value
   may not be NaN or inf. Performing an operation with such a target type, or
   assigning a NaN or inf value to such a type, is an error.
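
   The text above does not say whether this error is reported at compile
   time or at run time. The Python sketch below assumes a run-time check;
   check_validfp is a hypothetical helper name, not LRL syntax.

```python
import math


def check_validfp(value: float) -> float:
    """Return value unchanged, or raise if it is NaN or infinite."""
    if not math.isfinite(value):
        raise ValueError(f"validfp target cannot hold {value!r}")
    return value
```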

3) the "exactfp" type qualifier means that operations that are inexact
   (as defined by the IEEE 754 standard) are an error. This applies to
   compile-time calculations also. If "validfp" is NOT used, then inexact
   results from runtime calculations become NaN.
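
   One way to picture inexactness is to compare the rounded float result
   with the exact rational result. The Python sketch below does this with
   the standard fractions module. It assumes finite operands and a finite
   result, and the function names are hypothetical. exactfp_mul models
   only the run-time case without "validfp", where an inexact result
   becomes NaN.

```python
from fractions import Fraction


def add_is_exact(a: float, b: float) -> bool:
    """True if the float sum a + b equals the exact mathematical sum."""
    return Fraction(a) + Fraction(b) == Fraction(a + b)


def mul_is_exact(a: float, b: float) -> bool:
    """True if the float product a * b equals the exact product."""
    return Fraction(a) * Fraction(b) == Fraction(a * b)


def exactfp_mul(a: float, b: float) -> float:
    """Multiply, turning an inexact result into NaN."""
    return a * b if mul_is_exact(a, b) else float("nan")
```

   For example, 0.5 + 0.25 is exact, while 0.1 + 0.2 and 0.1 * 0.1 are not,
   because their exact results need more significand bits than a double has.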

4) the "stdfp" type qualifier causes 1) to be ignored and IEEE standard
   handling of NaNs and infs to be used.

5) the "rzfp", "rpifp" and "rnifp" type qualifiers specify rounding modes
   (towards zero, positive infinity and negative infinity).
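
   To show what the three directed modes mean for a single result, the
   Python sketch below rounds an exact rational value to a double in the
   requested direction. It is only a model: round_directed is a hypothetical
   name, the qualifier names are used as plain strings, and math.nextafter
   requires Python 3.9 or later.

```python
import math
from fractions import Fraction


def round_directed(exact: Fraction, mode: str) -> float:
    """Round an exact value to a float using a directed rounding mode."""
    if mode not in ("rzfp", "rpifp", "rnifp"):
        raise ValueError(f"unknown rounding qualifier: {mode}")
    nearest = float(exact)  # Fraction -> float rounds to nearest
    got = Fraction(nearest)
    if got == exact:
        return nearest  # representable, so every mode agrees
    if mode == "rzfp":  # towards zero
        wrong_side = abs(got) > abs(exact)
        toward = 0.0
    elif mode == "rpifp":  # towards positive infinity
        wrong_side = got < exact
        toward = math.inf
    else:  # "rnifp": towards negative infinity
        wrong_side = got > exact
        toward = -math.inf
    # If round-to-nearest landed on the wrong side, step one float over.
    return math.nextafter(nearest, toward) if wrong_side else nearest
```

   Rounding the exact value 1/10 with "rnifp" and "rpifp" gives two adjacent
   doubles that bracket 1/10; "rzfp" agrees with "rnifp" for positive values
   and with "rpifp" for negative ones.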

6) the token sequences for infinity, NaN, and +/-0 are as follows
   (case sensitive):

    NaN
    +inf
    -inf
    +0
    -0
    
    The +/- are the unary operators. Writing "0" is equivalent to "+0",
    and so is assigning the integer 0 to a floating point type.

7) calculations with mixed floating point types are handled as follows:

     - if there is a target type, then the target type is used for
       the operation. E.g. "singlevalue = doublevalue*doublevalue;"
       is calculated with single precision.
     - otherwise the target type must be specified with the "as" operation.
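
   The choice of operation type can change the result. The Python sketch
   below emulates single precision by round-tripping through the standard
   struct module. It assumes that "the target type is used for the
   operation" means the operands are converted to the target type before
   the operation; the helper names are hypothetical.

```python
import struct


def to_single(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mul_single(a: float, b: float) -> float:
    """Multiply in single precision: convert the operands, then multiply.

    The product of two binary32 values is exact in binary64, so a single
    final rounding yields the correctly rounded single-precision product.
    """
    return to_single(to_single(a) * to_single(b))


a = 1.0 + 2.0 ** -24            # representable in double, not in single
in_single = mul_single(a, a)    # operands round to 1.0 first
in_double = to_single(a * a)    # multiply in double, then store as single
```

   Here in_single is 1.0 while in_double is 1 + 2**-23, so computing in the
   target type and computing wide before storing are not interchangeable.
   The second form is the pattern described in item 10.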

8) floating point operations must be executed as they appear in the source
   code, or in a way that is guaranteed to give the same result. This applies to
   "constexpr" (compile-time evaluated expressions) as well.

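   The reason reordering is not allowed in general is that floating point
   addition is not associative. A small Python illustration, with
   hypothetical helper names:

```python
def sum_left_to_right(values):
    """Add the values in the order given."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_right_to_left(values):
    """Add the same values in the opposite order."""
    total = 0.0
    for v in reversed(values):
        total += v
    return total


values = [0.1, 0.2, 0.3]
forward = sum_left_to_right(values)
backward = sum_right_to_left(values)
```

   The two sums differ in the last bit, so a compiler that reassociated the
   additions would change the observable result.
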
9) floating point mode handling:
     - when LRL changes the mode, it also restores the mode before entering
       non-LRL code. If the non-LRL code changes the mode, and LRL has to
       restore the mode again, it will restore to the new changed mode.
     - LRL uses the default or restored mode when no qualifiers are present,
       i.e. with round-to-nearest rounding and no traps (validfp or exactfp).
       When only some of the qualifiers are used, the defaults are used
       for the other modes.
       
       E.g. with only "exactfp", LRL will use the default trap modes for all
       errors other than inexactness and will use the default rounding mode.
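
   The save-and-restore part of this can be pictured with a scoped mode
   change. The Python sketch below is only an analogy: it uses the standard
   decimal module's context rather than the hardware floating point mode,
   and divide_rounded is a hypothetical name. It does not model the case
   where non-LRL code changes the mode.

```python
import decimal


def divide_rounded(a: int, b: int, rounding: str) -> decimal.Decimal:
    """Divide with a temporary rounding mode, restored on exit."""
    with decimal.localcontext() as ctx:
        ctx.prec = 5             # small precision so rounding is visible
        ctx.rounding = rounding  # mode applies only inside this block
        return decimal.Decimal(a) / decimal.Decimal(b)


before = decimal.getcontext().rounding
low = divide_rounded(2, 3, decimal.ROUND_FLOOR)
high = divide_rounded(2, 3, decimal.ROUND_CEILING)
after = decimal.getcontext().rounding
```

   After both calls the caller's rounding mode is unchanged, which is the
   behaviour described for code that runs outside a qualified operation.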

10) The C way of doing calculations in extended precision but storing the
    result in single or double precision can be achieved in LRL with this syntax:
    
        cfloat a = 1.123;
        cfloat b = a*a as sysfloat;