Floating point math in LRL
--------------------------
TODO remove "float" and "double" types since they are inefficient to use
in LRL which does not use type promotion. Instead there could be a
"sysfloat" which is the best/fastest float on the system and the
existing floatXX types (e.g. float64, float80).
Everything works as in the IEEE 754 standard without traps by default, but
with the following additions:
1) operations on non-number values, i.e. NaN and +/- infinity, can never
produce a number again. The operations are defined as follows:
n means "num", i.e. not NaN or inf
x means number or ±inf (but not NaN)
0/0 = NaN
n/0 = inf, if n != 0 (sign is same as of n)
n/inf = 0
inf/0 = NaN
inf/inf = NaN
0*inf = NaN
inf*inf = inf (sign is changed as in ±1 * ±1)
inf*n = inf as above, if |n| >= 1
inf*n = NaN, if |n| < 1
inf+0 = inf
inf+n = inf, if n has the same sign as the inf
inf+n = NaN, if n and inf have different signs
inf+inf = inf, if the infs have the same sign
inf+inf = NaN, if the infs have different signs
(+0)**+x = +0 (x is a number or inf)
0**-x = NaN
inf**0 = +1
inf**+n = inf, if n >= 1
inf**+n = NaN, if n < 1
(+inf)**+inf = +inf
(+inf)**-inf = 0
(+x)**+inf = inf, if x > 1
(+x)**+inf = +0, if x < 1
1**inf = 1
(-x)**inf = NaN (impossible to determine sign)
more to do...
Any operation involving a NaN operand produces a NaN, except for
comparisons which return false.
When operating on a NaN or inf, a function call to the system library
or a bundled function may be made. Note that an expression may
contain multiple operations.
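The multiplication and addition rows of the table above can be emulated
as a rough sketch in Python (whose floats are IEEE 754 binary64). The
names lrl_mul and lrl_add are hypothetical helpers for illustration only,
not part of LRL:

```python
import math

def lrl_mul(a: float, b: float) -> float:
    """Multiply following the table in 1): inf*n is NaN when |n| < 1."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    if math.isinf(a) and math.isinf(b):
        return a * b                  # inf*inf = inf, sign as in ±1 * ±1
    if math.isinf(a) or math.isinf(b):
        inf, n = (a, b) if math.isinf(a) else (b, a)
        if abs(n) >= 1:
            return inf * n            # usual sign rule
        return math.nan               # covers 0*inf and |n| < 1
    return a * b                      # both finite: plain IEEE 754

def lrl_add(a: float, b: float) -> float:
    """Add following the table in 1): inf+n is NaN when the signs differ."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    if math.isinf(a) and math.isinf(b):
        return a if a == b else math.nan
    if math.isinf(a) or math.isinf(b):
        inf, n = (a, b) if math.isinf(a) else (b, a)
        if n == 0 or (n > 0) == (inf > 0):
            return inf                # inf+0, or n has the same sign
        return math.nan               # n has the opposite sign
    return a + b                      # both finite: plain IEEE 754
```

For comparison, plain IEEE 754 gives inf for both inf*0.5 and inf+(-5),
while the sketch returns NaN in both cases.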
2) the "validfp" type qualifier on a floating point type means that the value
may not be NaN or inf. Performing an operation with such a target type, or
assigning a NaN or inf value to such a type, is an error.
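As a minimal sketch, the "validfp" check could be modelled in Python as
below. Whether LRL reports the error at compile time or at runtime is not
stated here, so the sketch simply raises. InvalidFloatError and
store_validfp are hypothetical names:

```python
import math

class InvalidFloatError(ArithmeticError):
    """Hypothetical error for a NaN or inf reaching a "validfp" target."""

def store_validfp(value: float) -> float:
    """Return value unchanged, or raise if it is NaN or +/- inf."""
    if not math.isfinite(value):
        raise InvalidFloatError(f"{value!r} is not allowed in a validfp type")
    return value
```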
3) the "exactfp" type qualifier means that operations that are inexact
(as defined by the IEEE 754 standard) are an error. This applies to
compile-time calculations also. If "validfp" is NOT used, then inexact
results from runtime calculations become NaN.
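One way to picture the "exactfp" rule without "validfp" is the Python
sketch below. It detects an inexact addition by comparing against exact
rational arithmetic and returns NaN in that case. exact_add is a
hypothetical name, and a real implementation would more likely read the
IEEE 754 inexact flag:

```python
import math
from fractions import Fraction

def exact_add(a: float, b: float) -> float:
    """Return a+b if the sum is exactly representable, otherwise NaN."""
    result = a + b
    if not (math.isfinite(a) and math.isfinite(b) and math.isfinite(result)):
        raise ValueError("sketch only covers finite operands and results")
    # Fraction(float) converts exactly, so this compares the rounded
    # result with the mathematically exact sum.
    if Fraction(result) != Fraction(a) + Fraction(b):
        return math.nan
    return result
```

Here exact_add(0.5, 0.25) gives 0.75, while exact_add(0.1, 0.2) gives NaN
because the sum has to be rounded.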
4) the "stdfp" type qualifier causes 1) to be ignored and IEEE standard
handling of NaNs and infs to be used.
5) the "rzfp", "rpifp" and "rnifp" type qualifiers specify rounding modes
(towards zero, positive infinity and negative infinity).
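The effect of the three rounding directions can be illustrated with
Python's decimal module. This is only an analogy: it rounds decimal
digits, not binary ones, and the mapping from the LRL qualifiers to the
decimal constants is an assumption of this sketch:

```python
from decimal import Decimal, localcontext
from decimal import ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR

# Assumed mapping, for illustration only.
MODES = {
    "rzfp": ROUND_DOWN,       # towards zero
    "rpifp": ROUND_CEILING,   # towards positive infinity
    "rnifp": ROUND_FLOOR,     # towards negative infinity
}

def divide(qualifier: str, a: str, b: str) -> Decimal:
    """Divide a by b with 3 significant digits in the given rounding mode."""
    with localcontext() as ctx:
        ctx.prec = 3
        ctx.rounding = MODES[qualifier]
        return Decimal(a) / Decimal(b)
```

With these settings 1/3 gives 0.333, 0.334 and 0.333 for "rzfp", "rpifp"
and "rnifp", and -1/3 gives -0.333, -0.333 and -0.334.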
6) the token sequences for infinity, NaN, and +/-0 are as follows
(case sensitive):
NaN
+inf
-inf
+0
-0
The +/- are the unary operators. Writing "0" is equivalent to "+0",
and so is assigning the integer 0 to a floating point type.
7) calculations with mixed floating point types are handled as follows:
- if there is a target type, then the target type is used for
the operation. E.g. "singlevalue = doublevalue*doublevalue;"
is calculated with single precision.
- otherwise the target type must be specified with the "as" operation.
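The difference the target type makes can be sketched in Python, with
binary32 emulated through the struct module. The sketch assumes that
"the target type is used for the operation" means the operands are
converted to the target type first. to_single and mul_single are
hypothetical helper names:

```python
import struct

def to_single(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def mul_single(a: float, b: float) -> float:
    """Multiply with single precision as the target type."""
    # Assumption: operands are converted to the target type first.
    # The product of two binary32 values is exact in binary64, so the
    # final to_single performs the only rounding of the product.
    return to_single(to_single(a) * to_single(b))

a = 1 + 2**-24 + 2**-30            # not representable in binary32
in_single = mul_single(a, a)       # 1 + 2**-22
in_double = to_single(a * a)       # 1 + 2**-23
```

The two results differ, so it matters that the type used for the
calculation is stated rather than chosen by promotion.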
8) floating point operations must be executed as they appear in the source
code, or in a way that is guaranteed to give the same result. This applies to
"constexpr" (compile-time evaluated expressions) as well.
9) floating point mode handling:
- when LRL changes the mode, it also restores the mode before entering
non-LRL code. If the non-LRL code changes the mode, and LRL has to
restore the mode again, it will restore to the new changed mode.
- LRL uses the default or restored mode when no qualifiers are present.
I.e. when round-to-nearest is used and no traps (validfp or exactfp) are enabled.
When only some of the qualifiers are used, then the defaults are used
for the other modes.
E.g. with only "exactfp", LRL will use the default trap modes for all
errors other than inexactness and will use the default rounding mode.
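The save/restore behaviour around non-LRL code can be sketched as
follows, using the rounding mode of Python's decimal context as a
stand-in for the hardware floating point mode. run_with_rzfp is a
hypothetical name and the sketch is one reading of the rules above, not
a description of an actual implementation:

```python
import decimal

def run_with_rzfp(foreign_call):
    """Run code needing round-towards-zero around a call to foreign code."""
    ctx = decimal.getcontext()
    saved = ctx.rounding                   # mode in effect on entry
    ctx.rounding = decimal.ROUND_DOWN      # stand-in for "rzfp"
    try:
        before = decimal.Decimal(2) / decimal.Decimal(3)
        ctx.rounding = saved               # restore before non-LRL code
        foreign_call()
        saved = ctx.rounding               # foreign code may have changed it
        ctx.rounding = decimal.ROUND_DOWN  # re-apply the mode we need
        after = decimal.Decimal(2) / decimal.Decimal(3)
        return before, after
    finally:
        ctx.rounding = saved               # restore to the (possibly new) mode
```

If foreign_call switches the context to another rounding mode, that mode
is the one left in effect when run_with_rzfp returns.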
10) The C way of doing calculations in extended precision but storing the
result in single or double precision can be achieved in LRL with this syntax:
cfloat a = 1.123;
cfloat b = a*a as sysfloat;
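The effect of calculating in a wider type and rounding only on storage
can be sketched in Python, with binary64 standing in for the wider type
and binary32 for the storage type (both stand-ins are assumptions of this
sketch, as are the helper names):

```python
import struct

def to_single(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sum_narrow(values):
    """Sum with every intermediate result rounded to binary32."""
    total = 0.0
    for v in values:
        total = to_single(total + to_single(v))
    return total

def sum_wide(values):
    """Sum in binary64 and round to binary32 once, on storage."""
    total = 0.0
    for v in values:
        total += to_single(v)
    return to_single(total)

values = [1.0, 2**-24, 2**-24]
narrow = sum_narrow(values)   # 1.0: each small term is rounded away
wide = sum_wide(values)       # 1 + 2**-23: the small terms accumulate
```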