Floating point math in LRL
--------------------------

TODO remove "float" and "double" types since they are inefficient to use in
LRL, which does not use type promotion. Instead there could be a "sysfloat",
which is the best/fastest float on the system, and the existing floatXX types
(e.g. float64, float80, etc.)

Everything works as in the IEEE 754 standard without traps by default, but
with the following additions:

1) operations on non-number values, i.e. NaN and +/- infinity, can never
   produce a number again. The operations are defined as follows:

     n means "number", i.e. not NaN or inf
     x means number or ±inf (but not NaN)

     0/0          = NaN
     n/0          = inf, if n != 0 (same sign as n)
     n/inf        = 0
     inf/0        = NaN
     inf/inf      = NaN

     0*inf        = NaN
     inf*inf      = inf (sign is determined as in ±1 * ±1)
     inf*n        = inf (sign as above), if |n| >= 1
     inf*n        = NaN, if |n| < 1

     inf+0        = inf
     inf+n        = inf, if n has the same sign as the inf
     inf+n        = NaN, if n and inf have different signs
     inf+inf      = inf, if the infs have the same sign
     inf+inf      = NaN, if the infs have different signs

     (+0)**+x     = +0 (x is a number or inf)
     0**-x        = NaN
     inf**0       = +1
     inf**+n      = inf, if n >= 1
     inf**+n      = NaN, if n < 1
     (+inf)**+inf = +inf
     (+inf)**-inf = 0
     (+x)**+inf   = inf, if x > 1
     (+x)**+inf   = +0, if x < 1
     1**inf       = 1
     (-x)**inf    = NaN (impossible to determine sign)

     more to do...

   Any operation involving a NaN operand produces a NaN, except for
   comparisons, which return false.

   When operating on a NaN or inf, a function call to the system library or
   a bundled function may be made. Note that an expression may contain
   multiple operations.

2) the "validfp" type qualifier on a floating point type means that the
   value may not be NaN or inf. Performing an operation with such a target
   type, or assigning a NaN or inf value to such a type, is an error.

3) the "exactfp" type qualifier means that operations that are inexact (as
   defined by the IEEE 754 standard) are an error. This applies to
   compile-time calculations also.
   If "validfp" is NOT used, then inexact results from runtime calculations
   become NaN.

4) the "stdfp" type qualifier causes 1) to be ignored and IEEE standard
   handling of NaNs and infs to be used.

5) the "rzfp", "rpifp" and "rnifp" type qualifiers specify rounding modes
   (towards zero, positive infinity and negative infinity)

6) the token sequences for infinity, NaN, and +/-0 are as follows (case
   sensitive):

     NaN
     +inf
     -inf
     +0
     -0

   The +/- are the unary operators. Writing "0" is equivalent to "+0", and
   so is assigning the integer 0 to a floating point type.

7) calculations with mixed floating point types are handled as follows:

   - if there is a target type, then the target type is used for the
     operation. E.g. "singlevalue = doublevalue*doublevalue;" is calculated
     with single precision.
   - otherwise the target type must be specified with the "as" operation.

8) floating point operations must be executed as they appear in the source
   code, or in a way that is guaranteed to give the same result. This applies
   to "constexpr" (compile-time evaluated expressions) as well.

9) floating point mode handling:

   - when LRL changes the mode, it also restores the mode before entering
     non-LRL code. If the non-LRL code changes the mode, and LRL has to
     restore the mode again, it will restore to the new changed mode.
   - LRL uses the default or restored mode when no qualifiers are present,
     i.e. when round-to-nearest is used and no traps (validfp or exactfp)
     are used. When only some of the qualifiers are used, then the defaults
     are used for the other modes. E.g. with only "exactfp", LRL will use
     the default trap modes for all errors other than inexactness and will
     use the default rounding mode.

10) The C way of doing calculations in extended precision but storing the
    result in single or double precision can be achieved in LRL with this
    syntax:

      cfloat a = 1.123;
      cfloat b = a*a as sysfloat;
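
Illustration of the table in 1): the division, multiplication and addition
rows can be sketched in Python as below. This is only a rough model of the
table, not LRL code or an LRL implementation. The names lrl_div, lrl_mul and
lrl_add are hypothetical. Two points are assumptions, because the table does
not specify them: cases that are not listed (e.g. inf/n) fall through to
ordinary IEEE arithmetic, and n/inf returns +0.

```python
import math


def _sign(v):
    """Return +1.0 or -1.0 according to the sign bit of v."""
    return math.copysign(1.0, v)


def lrl_div(a, b):
    """Division following the table in 1)."""
    if math.isnan(a) or math.isnan(b):
        return math.nan                      # NaN operands always give NaN
    if b == 0:
        if a == 0 or math.isinf(a):
            return math.nan                  # 0/0 = NaN, inf/0 = NaN
        return math.copysign(math.inf, a)    # n/0 = inf, same sign as n
    if math.isinf(b):
        # inf/inf = NaN, n/inf = 0 (returning +0 is an assumption)
        return math.nan if math.isinf(a) else 0.0
    return a / b                             # unlisted cases: assumed IEEE


def lrl_mul(a, b):
    """Multiplication following the table in 1)."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    if math.isinf(a) or math.isinf(b):
        other = b if math.isinf(a) else a
        # inf*inf = inf, and inf*n = inf only if |n| >= 1.
        # |n| < 1 (which includes 0*inf) gives NaN.
        if math.isinf(other) or abs(other) >= 1:
            return math.copysign(math.inf, _sign(a) * _sign(b))
        return math.nan
    return a * b


def lrl_add(a, b):
    """Addition following the table in 1)."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    if math.isinf(a) or math.isinf(b):
        inf, other = (a, b) if math.isinf(a) else (b, a)
        # inf+0 = inf; otherwise the signs of both operands must agree.
        if other == 0 or _sign(other) == _sign(inf):
            return inf
        return math.nan
    return a + b
```

For comparison, plain IEEE 754 arithmetic (what "stdfp" in 4) selects) gives
inf for both inf*0.5 and inf+(-1), whereas lrl_mul(math.inf, 0.5) and
lrl_add(math.inf, -1.0) both return NaN.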