Warning | |
---|---|

This chapter, section, or subsection is “in statu nascendi”. Therefore, it may be incomplete, or may appear in a different place in the final version, or may vanish completely. |

*Learning objectives*

Floating point numbers |

The dangers of C++'s type `double` |

The class `bigfloat` |

C++ provides the types `float`

and
`double`

for representing *floating point
numbers*. In addition, LEDA offers the type * bigfloat*.

Roughly speaking, a floating point number represents a real number with (a finite number of) digits after the decimal point, such as 17.5, 3.1415, and 2.181. It is the number type primarily used for representing non-integer values in modern computer architectures. We will shortly see the exact structure of these types and what it means that the decimal point “flows”.

Let us start our exploration of floating point types with an example in which numbers with many digits after the decimal points are used: We want to compute √2 with an accurary of n digits after the decimal point, where n is an input parameter. We use the following strategy to compute this root up to n decimal digits:

(TODO Give here a short overview over IEEE 754 representation to understand why we use the base 2 in the following.) [17]

Basically, we compute the positive root of the polynomial
x^{2} - q (with q = 2) with the well-known
and easy to implement *Bisection
method*. This methods works as follows: It
starts with an interval [a,b] that is known to
“bracket” a root, that is, f(a) and f(b) have
opposite signs (so that, provided that the function is continuous,
there must be a root of f inside [a,b] according to the
intermediate value theorem). In our implementation, we take [1,2]
as the starting interval. In each iteration, the function value
of the midpoint of the current interval is computed. Then the
interval is halved by making the midpoint the new left or right
endpoint, so that the new interval still brackets the root.

When do we know that we have reached √2 up to n decimal digits? We use the following estimation to test whether we have reached the desired accuracy: Our goal is to have

|root - √2| < 10^{-n}

Therefore, we try to make

|root - √2| < 2^{-binprec} =:eps

where we choose binprec large enough such that eps < 10^{-n}. To choose the necessary binary precision, note that
2^{-binprec} < 2^{-n}
is satisfied if binprec is larger than n·log_{2}10. Choosing it as an integer larger than 3.33·n suffices.

A purely analytical estimation gives (note the invariant 1 < root < 2 in the Bisection method):

|root - √2| = |root^{2} - 2| / |root + √2| < |root^{2} - 2| / 2

So our goal becomes to make

|root^{2} - 2| < 2 eps

This root will then be accurate up to n decimal digits after the decimal point.

Here is a straight forward implementation using doubles. In our implementation, the interval endpoints are implicitly given as the current root candidate and the current interval width dx. In each iteration, dx is halved and the root candidate is shifted by the new dx if this is necessary to keep the root bracketed in the new interval. All these shifts only go into one direction, which is defined before the first iteration (depending on whether f(a) is less than or greater 0).

#include <iostream> #include <iomanip> using std::cout; using std::cin; using std::endl; // Bisection method for computing sqrt(q) known to be in [a,b]: // Find a root of f(x) = x^2 - q with <decprec> correct decimal // digits after the decimal point int main() { double q = 2.0; // we want to compute sqrt(q) cout << "Computing sqrt(" << q << ").\n"; int binprec, decprec; cout << "Enter desired decimal accuracy (#digits after the decimal point): "; cin >> decprec; binprec = (int) (3.33 * (double) decprec + 1); cout << "Reaching a binary accuracy of " << binprec << " digits (after the decimal point) will suffice." << endl; double eps = pow( 2, -binprec ); // accuracy to reach cout << "Computing root such that |root^2 - " << q << "| (=:distance) < eps " << endl; cout << " with eps:=" << eps << "." << endl; double a = 1.0, b = 2.0; // interval boundaries [a,b] double root; // root candidate // we orient the search such that initially f(root + dx) > 0 // and then we move root only into direction sgn(dx) double f_a = a * a - q; double dx; // width of current interval if( f_a < 0 ) { dx = b - a; root = a; } // walk from left to right else { dx = a - b; root = b; } // walk from right to left double distance = fabs( root * root - q ); double xmid, fmid; while( distance >= 2 * eps ) { dx = dx / 2; xmid = root + dx; fmid = xmid * xmid - q; if(fmid <= 0) { root = xmid; } distance = fabs( root * root - q ); } // here we have: |root - sqrt(q)| = |root^2 - q| / |root + sqrt(q)| // <= distance / 2 < eps // i.e., |root - sqrt(q)| < 2^-binprec <= 10^-decprec cout << "Final distance = " << std::setprecision( 7 ) << distance << endl; cout << "Final root = " << std::setprecision( 1 + decprec ) << root << endl; }

Let us try 10 decimal digits:

Computing sqrt(2). Enter desired decimal accuracy (#digits after the decimal point):Reaching a binary accuracy of 34 digits (after the decimal point) will suffice. Computing root such that |root^2 - 2| (=:distance) < eps with eps:=5.82077e-11. Final distance = 5.077338e-11 Final root = 1.4142135624`10`

Nice! Seems to work. So let us give it a harder job, n = 100:

Computing sqrt(2). Enter desired decimal accuracy (#digits after the decimal point):Reaching a binary accuracy of 334 digits (after the decimal point) will suffice. Computing root such that |root^2 - 2| (=:distance) < eps with eps:=2.85747e-101.`100`

Oops! What's that? The program does not return! An infinite loop?

Maybe we are stuck, that is, dx becomes 0 and we do not move on any more, never fulfilling the stop criterion. To have more insight into the loop, let us add some debugging output and a guard test:

while( distance >= 2 * eps ) { cout << "root=" << std::setprecision( 1 + decprec ) << root << " "; cout << "dx=" << std::setprecision( 7 ) << dx << endl; dx = dx / 2; xmid = root + dx; fmid = xmid * xmid - q; if(fmid <= 0) root = xmid; if( dx == 0.0 ) { cout << "WARNING! dx == 0! I am stuck! Stopping iteration.\n"; break; } distance = fabs( root * root - q ); }

Here we go again on our own. (The output is truncated; only the output of the first and the last iterations are shown.)

Computing sqrt(2). Enter desired decimal accuracy (#digits after the decimal point):Reaching a binary accuracy of 334 digits (after the decimal point) will suffice. Computing root such that |root^2 - 2| (=:distance) < eps with eps:=2.85747e-101. root=1 dx=1 root=1 dx=0.5 root=1.25 dx=0.25 root=1.375 dx=0.125 root=1.375 dx=0.0625 root=1.40625 dx=0.03125 root=1.40625 dx=0.015625 ... root=1.41421356237309492343001693370752036571502685546875 dx=6.32404e-322 root=1.41421356237309492343001693370752036571502685546875 dx=3.16202e-322 root=1.41421356237309492343001693370752036571502685546875 dx=1.58101e-322 root=1.41421356237309492343001693370752036571502685546875 dx=7.90505e-323 root=1.41421356237309492343001693370752036571502685546875 dx=3.952525e-323 root=1.41421356237309492343001693370752036571502685546875 dx=1.976263e-323 root=1.41421356237309492343001693370752036571502685546875 dx=9.881313e-324 root=1.41421356237309492343001693370752036571502685546875 dx=4.940656e-324 WARNING! dx == 0! I am stuck! Stopping iteration. Final distance = 3.546425e-16 Final root = 1.41421356237309492343001693370752036571502685546875`100`

Aha! It was indeed an infinite loop!

Now there is a lot to say here:

Explain the structure of doubles (sign, significand, exponent). |

Explain arithmetics on doubles. |

Explain why we could
NEVER reach an accuracy of 100 decimal digits with
doubles, no matter what strategy we use (because p_{max}=53). |

Explain why the above dx becomes 0. |

Explain the last values of dx. (Why is there an exponent of -324 before it goes to 0?) |

Explain the subtlety involved in the loop test above
(where do we know from that root^{2}
- q is computed and compared with eps correctly? Wo do not know without an numerical analysis, making our job even harder!) |

Say that bigfloats have basically the same structure and arithmetcs as doubles, but with arbitrarily large significands and exponents.

Define the rounding modes.

Then give the program again, but with bigfloats. Note that the program is almost the same; the reader shall see: bigfloats can be used like doubles (this holds for most operations).

Concerning the program, especially explain the fiddling
and switching between rounding modes EXACT and TO_NEAREST, and
why we set the precision to binprec *plus
one*. Explain that it is crucial that the loop test is
evaluated exactly (because we do not want to make forward analysis).

#include <LEDA/bigfloat.h> using leda::bigfloat; using std::cout; using std::cin; using std::endl; // Bisection method for computing sqrt(q) known to be in [a,b]: // Find a root of f(x) = x^2 - q with <decprec> correct decimal // digits after the decimal point int main() { bigfloat q = 2.0; // we want to compute sqrt(q) cout << "Computing sqrt(" << q << ").\n"; int binprec, decprec; cout << "Enter desired decimal accuracy (#digits after the decimal point): "; cin >> decprec; binprec = (int) (3.33 * (double) decprec + 1); cout << "Reaching a binary accuracy of " << binprec << " digits (after the decimal point) will suffice." << endl; bigfloat::set_precision( binprec + 1 ); bigfloat eps = leda::ipow2( -binprec ); // accuracy to reach cout << "Computing root such that |root^2 - " << q << "| (=:distance) < eps " << endl; cout << "with eps:=" << eps << "." << endl; bigfloat a = 1.0, b = 2.0; // interval boundaries [a,b] bigfloat root; // root candidate // we orient the search such that initially f(root + dx) > 0 // and then we move root only into direction sgn(dx) bigfloat f_a = a * a - q; bigfloat dx; // width of current interval if( f_a < 0 ) { dx = b - a; root = a; } // walk from left to right else { dx = a - b; root = b; } // walk from right to left bigfloat::set_rounding_mode( leda::EXACT ); bigfloat distance = leda::abs( root * root - q ); bigfloat two_eps = 2 * eps; bigfloat::set_rounding_mode( leda::TO_NEAREST ); bigfloat xmid, fmid; while( distance >= two_eps ) { //cout << "root=" << root << " "; //cout << "dx=" << dx << endl; dx = dx / 2; xmid = root + dx; fmid = xmid * xmid - q; if(fmid <= 0) root = xmid; if( dx == 0.0 ) { cout << "WARNING! dx == 0! I am stuck! Stopping iteration.\n"; break; } bigfloat::set_rounding_mode( leda::EXACT ); distance = leda::abs( root * root - q ); //cout << "distance=" << distance << endl; bigfloat::set_rounding_mode( leda::TO_NEAREST ); } // here we have: |root - sqrt(q)| = |root^2 - q| / |root + sqrt(q)| // <= distance / 2 < eps // i.e., |root - sqrt(q)| < 2^-binprec <= 10^-decprec bigfloat::set_output_precision( 7 ); cout << "Final distance=" << distance << endl; bigfloat::set_output_precision( 1 + decprec ); cout << "root=" << root << endl; }

Sample run:

Computing sqrt(2). Enter desired decimal accuracy (#digits behind the comma): 100 Reaching a binary accuracy of 334 digits (behind the comma) will suffice. Computing root such that |root^2 - 2| (=:distance) < eps with eps:=2.857468478205687e-101. root=1.41421356237309504880168872420969807856967187537694807317667973799073247846210703885038753432764157278 distance=2.467328e-103

Works!

Now explain the interface of bigfloat in greater detail.

Give the essence of bigfloat (and double) arithmetic: If z is the exact real result of an arithmetic operation, and z' is the computed value, then

|z - z'| <= 2^{-prec} |z'|

where prec is the binary length of the signifinand in use.

List common dangers, mistakes, and fallacies in using bigfloats and doubles (extinction, error propagation,...), leading to nonsense results.

Say that all warnings that we have given for doubles also hold for bigfloats.