6.3. Floating point numbers of arbitrary size and precision (class `bigfloat`)

Warning This chapter, section, or subsection is “in statu nascendi”. Therefore, it may be incomplete, or may appear in a different place in the final version, or may vanish completely.

Learning objectives

 Floating point numbers The dangers of C++'s type `double` The class `bigfloat`

C++ provides the types `float` and `double` for representing floating point numbers. In addition, LEDA offers the type `bigfloat`.

Roughly speaking, a floating point number represents a real number with (a finite number of) digits after the decimal point, such as 17.5, 3.1415, and 2.181. It is the number type primarily used for representing non-integer values in modern computer architectures. We will shortly see the exact structure of these types and what it means that the decimal point “flows”.

Let us start our exploration of floating point types with an example in which numbers with many digits after the decimal points are used: We want to compute √2 with an accurary of n digits after the decimal point, where n is an input parameter. We use the following strategy to compute this root up to n decimal digits:

(TODO Give here a short overview over IEEE 754 representation to understand why we use the base 2 in the following.) [17]

Basically, we compute the positive root of the polynomial x2 - q (with q = 2) with the well-known and easy to implement Bisection method. This methods works as follows: It starts with an interval [a,b] that is known to “bracket” a root, that is, f(a) and f(b) have opposite signs (so that, provided that the function is continuous, there must be a root of f inside [a,b] according to the intermediate value theorem). In our implementation, we take [1,2] as the starting interval. In each iteration, the function value of the midpoint of the current interval is computed. Then the interval is halved by making the midpoint the new left or right endpoint, so that the new interval still brackets the root.

When do we know that we have reached √2 up to n decimal digits? We use the following estimation to test whether we have reached the desired accuracy: Our goal is to have

|root - √2| < 10-n

Therefore, we try to make

|root - √2| < 2-binprec =:eps

where we choose binprec large enough such that eps < 10-n. To choose the necessary binary precision, note that 2-binprec < 2-n is satisfied if binprec is larger than n·log210. Choosing it as an integer larger than 3.33·n suffices.

A purely analytical estimation gives (note the invariant 1 < root < 2 in the Bisection method):

|root - √2| = |root2 - 2| / |root + √2| < |root2 - 2| / 2

So our goal becomes to make

|root2 - 2| < 2 eps

This root will then be accurate up to n decimal digits after the decimal point.

Here is a straight forward implementation using doubles. In our implementation, the interval endpoints are implicitly given as the current root candidate and the current interval width dx. In each iteration, dx is halved and the root candidate is shifted by the new dx if this is necessary to keep the root bracketed in the new interval. All these shifts only go into one direction, which is defined before the first iteration (depending on whether f(a) is less than or greater 0).

Filename: ComputingRootsWithDoubles.C
LEDA users as of version 5.0: Note that header files are now partitioned into modules (subfolders)!
```#include <iostream>
#include <iomanip>

using std::cout;
using std::cin;
using std::endl;

// Bisection method for computing sqrt(q) known to be in [a,b]:
// Find a root of f(x) = x^2 - q with <decprec> correct decimal
// digits after the decimal point

int main()
{
double q = 2.0; // we want to compute sqrt(q)
cout << "Computing sqrt(" << q << ").\n";

int binprec, decprec;

cout << "Enter desired decimal accuracy (#digits after the decimal point): ";
cin  >> decprec;

binprec = (int) (3.33 * (double) decprec + 1);
cout << "Reaching a binary accuracy of " << binprec
<< " digits (after the decimal point) will suffice." << endl;

double eps = pow( 2, -binprec ); // accuracy to reach
cout << "Computing root such that |root^2 - " << q
<< "| (=:distance) < eps " << endl;
cout << "  with eps:=" << eps << "." << endl;

double a = 1.0, b = 2.0; // interval boundaries [a,b]

double root; // root candidate

// we orient the search such that initially f(root + dx) > 0
// and then we move root only into direction sgn(dx)
double f_a  = a * a - q;
double dx;   // width of current interval
if( f_a < 0 ) { dx = b - a; root = a; } // walk from left to right
else          { dx = a - b; root = b; } // walk from right to left

double distance = fabs( root * root - q );

double xmid, fmid;

while( distance >= 2 * eps ) {
dx = dx / 2;
xmid = root + dx;
fmid = xmid * xmid - q;
if(fmid <= 0) {
root = xmid;
}
distance = fabs( root * root - q );
}

// here we have: |root - sqrt(q)| = |root^2 - q| / |root + sqrt(q)|
//                                <= distance / 2 < eps
// i.e., |root - sqrt(q)| < 2^-binprec <= 10^-decprec

cout << "Final distance = " << std::setprecision( 7 ) << distance << endl;
cout << "Final root = " << std::setprecision( 1 + decprec ) << root << endl;
}
```

Let us try 10 decimal digits:

```Computing sqrt(2).
Enter desired decimal accuracy (#digits after the decimal point): `10`
Reaching a binary accuracy of 34 digits (after the decimal point) will suffice.
Computing root such that |root^2 - 2| (=:distance) < eps
with eps:=5.82077e-11.
Final distance = 5.077338e-11
Final root = 1.4142135624```

Nice! Seems to work. So let us give it a harder job, n = 100:

```Computing sqrt(2).
Enter desired decimal accuracy (#digits after the decimal point): `100`
Reaching a binary accuracy of 334 digits (after the decimal point) will suffice.
Computing root such that |root^2 - 2| (=:distance) < eps
with eps:=2.85747e-101.```

Oops! What's that? The program does not return! An infinite loop?

Maybe we are stuck, that is, dx becomes 0 and we do not move on any more, never fulfilling the stop criterion. To have more insight into the loop, let us add some debugging output and a guard test:

```  while( distance >= 2 * eps ) {
cout << "root=" << std::setprecision( 1 + decprec ) << root << " ";
cout << "dx=" <<   std::setprecision( 7 )           << dx << endl;
dx = dx / 2;
xmid = root + dx;
fmid = xmid * xmid - q;
if(fmid <= 0)
root = xmid;
if( dx == 0.0 ) {
cout << "WARNING! dx == 0! I am stuck! Stopping iteration.\n";
break;
}
distance = fabs( root * root - q );
}
```

Here we go again on our own. (The output is truncated; only the output of the first and the last iterations are shown.)

```Computing sqrt(2).
Enter desired decimal accuracy (#digits after the decimal point): `100`
Reaching a binary accuracy of 334 digits (after the decimal point) will suffice.
Computing root such that |root^2 - 2| (=:distance) < eps
with eps:=2.85747e-101.
root=1 dx=1
root=1 dx=0.5
root=1.25 dx=0.25
root=1.375 dx=0.125
root=1.375 dx=0.0625
root=1.40625 dx=0.03125
root=1.40625 dx=0.015625
...
root=1.41421356237309492343001693370752036571502685546875 dx=6.32404e-322
root=1.41421356237309492343001693370752036571502685546875 dx=3.16202e-322
root=1.41421356237309492343001693370752036571502685546875 dx=1.58101e-322
root=1.41421356237309492343001693370752036571502685546875 dx=7.90505e-323
root=1.41421356237309492343001693370752036571502685546875 dx=3.952525e-323
root=1.41421356237309492343001693370752036571502685546875 dx=1.976263e-323
root=1.41421356237309492343001693370752036571502685546875 dx=9.881313e-324
root=1.41421356237309492343001693370752036571502685546875 dx=4.940656e-324
WARNING! dx == 0! I am stuck! Stopping iteration.
Final distance = 3.546425e-16
Final root = 1.41421356237309492343001693370752036571502685546875```

Aha! It was indeed an infinite loop!

Doubles and their limitations

Now there is a lot to say here:

 Explain the structure of doubles (sign, significand, exponent). Explain arithmetics on doubles. Explain why we could NEVER reach an accuracy of 100 decimal digits with doubles, no matter what strategy we use (because pmax=53). Explain why the above dx becomes 0. Explain the last values of dx. (Why is there an exponent of -324 before it goes to 0?) Explain the subtlety involved in the loop test above (where do we know from that root2 - q is computed and compared with eps correctly? Wo do not know without an numerical analysis, making our job even harder!)

LEDA's bigfloat

Say that bigfloats have basically the same structure and arithmetcs as doubles, but with arbitrarily large significands and exponents.

Define the rounding modes.

Then give the program again, but with bigfloats. Note that the program is almost the same; the reader shall see: bigfloats can be used like doubles (this holds for most operations).

Concerning the program, especially explain the fiddling and switching between rounding modes EXACT and TO_NEAREST, and why we set the precision to binprec plus one. Explain that it is crucial that the loop test is evaluated exactly (because we do not want to make forward analysis).

Filename: ComputingRootsWithBigfloats.C
LEDA users as of version 5.0: Note that header files are now partitioned into modules (subfolders)!
```#include <LEDA/bigfloat.h>

using leda::bigfloat;

using std::cout;
using std::cin;
using std::endl;

// Bisection method for computing sqrt(q) known to be in [a,b]:
// Find a root of f(x) = x^2 - q with <decprec> correct decimal
// digits after the decimal point

int main()
{
bigfloat q = 2.0; // we want to compute sqrt(q)
cout << "Computing sqrt(" << q << ").\n";

int binprec, decprec;

cout << "Enter desired decimal accuracy (#digits after the decimal point): ";
cin  >> decprec;

binprec = (int) (3.33 * (double) decprec + 1);
cout << "Reaching a binary accuracy of " << binprec
<< " digits (after the decimal point) will suffice." << endl;

bigfloat::set_precision( binprec + 1 );

bigfloat eps = leda::ipow2( -binprec ); // accuracy to reach
cout << "Computing root such that |root^2 - " << q
<< "| (=:distance) < eps " << endl;
cout << "with eps:=" << eps << "." << endl;

bigfloat a = 1.0, b = 2.0; // interval boundaries [a,b]

bigfloat root; // root candidate

// we orient the search such that initially f(root + dx) > 0
// and then we move root only into direction sgn(dx)
bigfloat f_a  = a * a - q;
bigfloat dx;   // width of current interval
if( f_a < 0 ) { dx = b - a; root = a; } // walk from left to right
else          { dx = a - b; root = b; } // walk from right to left

bigfloat::set_rounding_mode( leda::EXACT );
bigfloat distance = leda::abs( root * root - q );
bigfloat two_eps = 2 * eps;
bigfloat::set_rounding_mode( leda::TO_NEAREST );

bigfloat xmid, fmid;

while( distance >= two_eps ) {
//cout << "root=" << root << " ";
//cout << "dx="   << dx   << endl;
dx = dx / 2;
xmid = root + dx;
fmid = xmid * xmid - q;
if(fmid <= 0)
root = xmid;
if( dx == 0.0 ) {
cout << "WARNING! dx == 0! I am stuck! Stopping iteration.\n";
break;
}
bigfloat::set_rounding_mode( leda::EXACT );
distance = leda::abs( root * root - q );
//cout << "distance=" << distance << endl;
bigfloat::set_rounding_mode( leda::TO_NEAREST );

}

// here we have: |root - sqrt(q)| = |root^2 - q| / |root + sqrt(q)|
//                                <= distance / 2 < eps
// i.e., |root - sqrt(q)| < 2^-binprec <= 10^-decprec

bigfloat::set_output_precision( 7 );
cout << "Final distance=" << distance << endl;
bigfloat::set_output_precision( 1 + decprec );
cout << "root="     << root     << endl;
}
```

Sample run:

```Computing sqrt(2).
Enter desired decimal accuracy (#digits behind the comma): 100
Reaching a binary accuracy of 334 digits (behind the comma) will suffice.
Computing root such that |root^2 - 2| (=:distance) < eps
with eps:=2.857468478205687e-101.
root=1.41421356237309504880168872420969807856967187537694807317667973799073247846210703885038753432764157278
distance=2.467328e-103
```

Works!

Now explain the interface of bigfloat in greater detail.

Give the essence of bigfloat (and double) arithmetic: If z is the exact real result of an arithmetic operation, and z' is the computed value, then

|z - z'| <= 2-prec |z'|

where prec is the binary length of the signifinand in use.

bigfloats, doubles, and their dangers

List common dangers, mistakes, and fallacies in using bigfloats and doubles (extinction, error propagation,...), leading to nonsense results.

Say that all warnings that we have given for doubles also hold for bigfloats.

Exercises

 Exercise 81. Use bigfloats to implement Newton's iteration method for root finding.