Floating point data types in C

05 Feb 2023

Floating point data types, or ‘floats’ for short, can be used to store numbers with a decimal point. But there is much more to it! This is what I recently learned about floats.

Prior knowledge

binary numbers
exponential functions
scientific notation
(signed) integer data types

Definitions

To understand floats, it is important to distinguish between:

number types – different types of numbers in mathematical sense
number representations – how we write them down
numerical data types – how we store them in computers

Note that ‘floats’ are not a number type, but rather a number representation or a data type.

Number types – mathematical definitions

Different types of numbers:

Natural numbers: non-negative integers, a.k.a. counting numbers. \([0, 1, 2, 3, 4, … ] \)
Integers: all whole numbers, including negatives. \([ …, −3, −2, −1, 0, 1, 2, 3 … ] \)
Rational numbers: fractions; \(a / b\), where \(a\) and \(b\) are integers and \(b\) is not zero, e.g. \(5 / 2\). Note that integers are also considered to be rational numbers as they could be written as a fraction as well.
Real numbers: all numbers on the continuous number line. This includes the rational numbers but also the numbers that can not be expressed as a fraction, like \(\pi\) and \(\sqrt{2}\).

Note that these number types are in order, so each type also includes all previous types.

Number representations

Examples of number representations are:

Positional numeral systems, like binary or base-10, optionally with a decimal separator. A limitation of positional number systems is that they can not perfectly represent all fractions. If you represent \(1/3\) as \(0.333\), for example, you introduce rounding errors. Likewise, numbers like \(\pi\) and \(\sqrt{2}\) can not perfectly be represented in positional number representations, but only be approximated.
Scientific notation, in form: \(significand*10^{exponent}\). For example \(4.300 * 10 ^ 3\), which is a representation of \(4300\) in base-10. In standard scientific notation the significand is a number in range \([1, 10)\) and the exponent is an integer. Sometimes the term ‘mantissa’ is used instead of ‘significand’, but these terms can be considered synonyms.
Floating point numbers. Similar to scientific notation, but can have any base. It consists of three components: significand, base, and exponent, and has the following form. \(significand*base^{exponent}\). Note that there are multiple possible floating point number representations for the same number. The following are all representations of 4.5:
- \(4.5 * 10 ^ 0\)
- \(1.125 * 2 ^ 2\)
- \(0.5625 * 2 ^ 3\)

Binary numbers with points

Base-10 numbers might have a decimal point. Similarly, binary numbers can have a binary point. The base neutral term for these separator points is ‘radix point’. In binary notation, the digits before the binary point represent \(1\), \(\:2\), \(\:4\), \(\:8\), etc. The digits after the radix point represent \(\frac{1}{2}\), \(\frac{1}{4}\), \(\frac{1}{8}\), etc. Decimal number \(5.75\), for example, equals \(101.11\) in binary. This can be verified by converting back to base-10: \((1*4) + (0*2) + (1*1) + (1*1/2) + (1 * 1/4) = 5 \frac{3} {4}\), or \(5.75\). Just like base-10 numbers can not be used to exactly represent fractions like \(1/3\), binary representations often introduce rounding errors. Only numbers that are a finite sum of powers of 2 (and that fit in the available significand bits) can be stored exactly. This means that many ordinary decimal numbers (like \(0.1\)) can not be exactly stored in floats with base 2.

Float data types in C

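In C, there are three data types for floating point numbers. The sizes listed here are typical (for example with GCC on x86-64 Linux), but they are not prescribed by the C standard: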

float – 4 byte single precision float
double – 8 byte double precision float
long double – 16 byte extended precision float

Normalization

In practice, the float data types in C always use base 2. This can be verified by checking the value of macro FLT_RADIX in the <float.h> library. Furthermore, the significand is always in range \([0.5, 1)\). This means that for every number (other than zero), there is a single, normalized combination of a significand and an exponent to represent it.

Library function ‘frexp()’ from the <math.h> library can be used to retrieve the normalized significand and exponent for a specific number. Number \(4.5\), for example, has normalized floating point representation \(0.5625 * 2 ^ 3\).

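#include <math.h>
#include <stdio.h>

int main(void)
{
    float f = 4.5f;
    int exponent;  /* not named 'exp', to avoid shadowing exp() from <math.h> */

    /* frexp() returns the significand and stores the exponent via the pointer */
    double significand = frexp(f, &exponent);

    printf("significand: %f\n", significand);
    printf("exponent: %d\n", exponent);
}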

Output:

significand: 0.562500
exponent: 3

Bit distribution

The IEEE 754 standard prescribes how the available memory should be allocated to store the significand and the exponent. The base does not need to be stored since it is a constant.

Data type	Sign bit	Exponent bits	Significand bits	Total bits
float	1	8	23	32
double	1	11	52	64
long double	1	15	64	80

The range of values that can be stored in these data types can be retrieved via a number of macros in the <float.h> library. The table below lists a number of these values.

Data Type	size (bytes)	largest value	smallest positive value (normalized)	smallest positive value (denormalized)	number of significant decimal digits
float	4	3.402823 E+38	1.175494 E‑38	1.401298 E‑45	6
double	8	1.797693 E+308	2.225074 E‑308	4.940656 E‑324	15
long double	16	1.189731 E+4932	3.362103 E‑4932	3.645200 E‑4951	18
macros		FLT_MAX, DBL_MAX, LDBL_MAX	FLT_MIN, DBL_MIN, LDBL_MIN	FLT_TRUE_MIN, DBL_TRUE_MIN, LDBL_TRUE_MIN	FLT_DIG, DBL_DIG, LDBL_DIG

Checking the numbers

To understand where these numbers come from, let’s have a closer look at the single precision float. The following macros from the <float.h> library show the minimum and maximum values of the exponent and the maximum number of binary digits of the significand.

FLT_MIN_EXP   // minimum value of exponent: -125
FLT_MAX_EXP   // maximum value of exponent: 128
FLT_MANT_DIG  // number of binary digits in the significand: 24

The maximum value of the significand is close to 1, namely: \(\frac{1}{2^1}+\frac{1}{2^2}+\frac{1}{2^3}+…+\frac{1}{2^{24}} = 1 - \frac{1}{2^{24}} ≈ 0.999999940\). The maximum value that can be stored in the float therefore is \((1 - \frac{1}{2^{24}}) * 2 ^{128}\), which is indeed the value from FLT_MAX. FLT_MIN equals \(0.5 *2^{-125}\) because the minimum value of the normalized significand is 0.5 and the minimum value of the exponent is -125.

Even lower numbers can be stored by dropping the normalization of the significand. The smallest denormalized number is \(2^{-149}\), which is the value in FLT_TRUE_MIN. Denormalized numbers are discussed in more detail further down.

FLT_DIG indicates the number of significant decimal digits that can be stored in a float without rounding issues. It is calculated as \((FLT\_MANT\_DIG - 1) * log_{10}(2)\), rounded down. For the float this gives \((24 - 1) * 0.30103 ≈ 6.92\), so FLT_DIG is \(6\).

Note that for data types float and double, the maximum number of binary digits of the significand is one more than the number of bits reserved to store it. Furthermore, looking at the minimum and maximum values of the exponent of the float, we see that 254 unique values are covered, while the allocated 8 bits allow for 256 different values. These peculiarities can be explained by diving a bit deeper into the binary encoding.

Binary encoding

As seen in the example above, value \(4.5\) can be stored as a combination of a sign bit (\(0\)), an exponent of \(3\), and a significand of \(0.5625\).

The 8 exponent bits cover range \([0, 255]\). To store the range specified by FLT_MIN_EXP and FLT_MAX_EXP, a bias value of \(126\) is applied. So to store exponent \(3\), value \(129\) is stored in binary, which is \(10000001\). Note that the minimum and maximum encodings (\(00000000\) and \(11111111\)) are used for special purposes: \(NaN\), \(\pm {infinity}\) and denormalized values (see below).

The significand of \(0.5625\) can be written in binary as \(0.1001\); \((1 * 1/2) + (0 * 1/4) + (0 * 1/8) + (1 * 1/16)\). The significand is normalized between \(0.5\) (inclusive) and \(1\) (exclusive). This means that in binary it will always start with \(0.1\). Because it always starts with this part, it doesn’t need to be stored in memory. Therefore, the significand will be encoded as \(001\). Because the first bit of the significand is not stored, the value provided by FLT_MANT_DIG is one higher than the actual number of reserved bits. The remaining bits of the significand can be filled with zeros, which is equivalent to adding extra zeros behind a decimal separator: it won’t impact the number. The significand bits, in total, are \(00100000000000000000000\).

The total bit-string is a concatenation of the sign bit, the exponent part, and the significand part, so \(01000000100100000000000000000000\).

Note that the C library (‘frexp()’ and the macros in <float.h>) describes floats with significands normalized in range \([0.5, 1)\), and an exponent bias value of \(126\). Other descriptions, including the IEEE-754 specification itself, use significands in range \([1, 2)\) and a bias value of \(127\). This is just a different way to describe the same encoding process, and it leads to exactly the same IEEE-754 compliant binary representation: \(0.5625 * 2^3\) and \(1.125 * 2^2\) are both stored with exponent bits \(10000001\), because \(3 + 126 = 2 + 127 = 129\). Useful tools like the IEEE-754 Floating Point Converter (see References) can therefore be a bit confusing, while in fact producing correct results.

Peculiarities of the long double

Data type ‘long double’ is a bit different from the ‘float’ and the ‘double’. Differences are:

On x86-64 systems, only 10 bytes (80 bits) of the available 16 bytes of the long double are actually used to store the number. The remaining 6 bytes are just padding.
Unlike floats and doubles, the long double explicitly stores the first digit of the significand (which is 1 for normalized numbers). This means the value in LDBL_MANT_DIG, which is 64, corresponds with the actual number of bits reserved to store the significand. A significand of \(0.5625\) will therefore be stored as \(1001\).
The long double is typically implemented as the 80-bit extended precision float described above, padded to 16 bytes. This is not guaranteed, however. It might also be implemented as an IEEE-754 quadruple precision float, which has no padding and therefore has room for larger exponents and significands, or simply as a double. The reason that most C implementations (currently) favor extended precision floats over quadruple floats is that most CPUs are optimized for the single, double and extended precision floats and not so much for the quadruple float. As a consequence, using the long double might introduce compatibility issues.

Special numbers

The 8 bits used to store the exponent, in combination with the bias value of \(126\), imply that the exponent can be in range \([-126, 129]\). The values FLT_MIN_EXP and FLT_MAX_EXP, however, show range \([-125, 128]\). The reason for this difference is that the minimum and maximum value that can be stored in the exponent bits are reserved for special numbers.

Denormalized numbers and zero

Regular (normalized) floats have exponents in range \([-125, 128]\) and have normalized significands in range \([0.5, 1)\). On top of these normalized float representations, there is a set of denormalized floats that all have exponent \(-126\). These floats, however, apply a different encoding for their significands. Instead of the normalized significand range of \([0.5, 1)\), the denormalized floats have significands in range \([0, 1)\).

Consider again the decimal value \(0.5625\) which can be written in binary as \(0.1001\). The normalized encoding of this significand is \(001\) (the leading \(0.1\) is implied). The denormalized encoding of the same number is \(1001\) (the leading \(0.\) is implied). The denormalized numbers allow for significands smaller than \(0.5\), including \(0\). This means the standard floating point expression used for zero is \(0*2^{-126}\). The purpose of denormalized floats is to store numbers that are even smaller than FLT_MIN (see the table above). Denormalized numbers smaller than the smallest normalized number are called subnormal numbers. The smallest subnormal number a single precision float can hold is \(2^{-23}*2^{-126}=2^{-149}\), which is about \(1.401298 * 10^{-45}\).

Infinity and NaN

The maximum exponent is also not used for regular numbers. Instead it is used to represent infinity or NaN values. NaN means “Not a number” and is used to represent undefined values.

	sign bit	exponent bits	significand bits
zero	0 or 1	all 0	all 0
+ infinity	0	all 1	all 0
– infinity	1	all 1	all 0
NaN	0 or 1	all 1	anything except all 0

Literals and printf

To distinguish between floats, doubles and long doubles, literal suffixes are needed:

f or F defines a float
(no suffix) defines a double
l or L defines a long double

3.14f  // this is a float
3.14   // this is a double
3.14L  // this is a long double

Exponents can also be used, with e or E:

6.022E23L  // this is a long double

There are several options to print literals or variables of a floating point type with ‘printf()’:

specifier for float or double	specifier for long double	result
`%f` or `%F`	`%Lf` or `%LF`	always in decimal form
`%e` or `%E`	`%Le` or `%LE`	always in exponential form
`%g` or `%G`	`%Lg` or `%LG`	either in decimal or exponential form, depending on the size of the exponent; trailing zeros are removed
`%a` or `%A`	`%La` or `%LA`	combination of hexadecimal significand and a decimal power of 2

Note: upper case specifiers will result in uppercase INF, NAN, E and hexadecimal characters

Limitations

Since floats hold values in the form \((sign) * significand * 2 ^ {exponent}\), they can store a range of values from very large to very small. Numbers with an absolute value below \(0.5\) will have a negative exponent. There are, however, also some important limitations:

Many base-10 numbers can not be exactly stored, which means that the value stored is often slightly different than intended. \(0.1\), for example, can not be stored exactly, but instead value \(0.100000001490…\) will be stored. Because of these small deviations, only the first \(6\) significant decimal digits of a float can be relied on (see FLT_DIG in the table above), so it makes sense to round to that precision when printing results.
Similarly, the fact that large numbers can be stored does not mean that all large values can be stored. The possible numbers that can be stored in a float are not distributed evenly. For example, value \(1,000,000,000\) can be exactly stored, and \(1,000,000,064\) as well. All values in between, however, can not be precisely stored and will be rounded to the closest value that can be stored.
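Since rounding errors are so common, exact comparisons should be avoided. A famous example is \(0.1 + 0.2 == 0.3\), which evaluates to false for doubles. A common workaround is to check whether the difference between two values is smaller than a small tolerance.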

Alternatives

The distribution of bits over the significand and the exponent is fixed. CPUs are optimized for these standard float types. It is, therefore, not possible to adjust the float to reduce the number of bits used for the exponent in favor of the significand or vice versa.

Some alternatives to consider:

Fixed-point arithmetic – This means using integer types, but interpreting them as fixed point numbers. A common use-case is to store monetary values in integers and assume 2 decimal places. So instead of 1.25, the integer 125 is stored.
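Decimal Floating Types – In common C implementations, FLT_RADIX is set to 2, so the standard float types are binary. Decimal floating types (_Decimal32, _Decimal64 and _Decimal128) are an optional part of the C23 standard and are not available in every compiler. They also exist in some other languages.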
External libraries – for example arbitrary precision arithmetic libraries (see Wikipedia’s ‘List of arbitrary-precision arithmetic software’)

Useful libraries and functions

<float.h> – contains macros with the limitations of the different float types
<math.h> – frexpf(), frexp(), frexpl()
<stdlib.h> – strtof(), strtod(), strtold()

References

David Goldberg, “What Every Computer Scientist Should Know About Floating-Point Arithmetic”; https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
Chris Hecker (www.chrishecker.com), “Let’s Get to the (Floating) Point”; http://www.chrishecker.com/images/f/fb/Gdmfp.pdf
V Rajaraman, “IEEE Standard for Floating Point Numbers”; https://www.ias.ac.in/article/fulltext/reso/021/01/0011-0030
H. Schmid, “IEEE-754 Floating Point Converter”; https://www.h-schmidt.net/FloatConverter/IEEE754.html