Comparing Floating-Point Values in C++

Here is a C++ implementation of the AlmostEqual2sComplement algorithm from "Comparing Floating Point Numbers" by Bruce Dawson. The implementation comes from the Google C++ Test Framework (Copyright 2005, Google Inc. All rights reserved, authors Zhanyong Wan and Sean Mcafee), so nothing here is original. I like the way it uses uniquely C++ idioms to produce a subtly elegant result.

The AlmostEqual2sComplement algorithm exploits the format of an IEEE floating-point variable. The bits of floating-point variables are converted to integers with the same lexicographical ordering, and then the integers are checked for approximate equality. The integer difference corresponds to the difference in "units in the last place" (ULP) of the floating-point values. The C code for doing this is (copied verbatim from "Comparing Floating Point Numbers"):

bool AlmostEqual2sComplement(float A, float B, int maxUlps)
{
    // Make sure maxUlps is non-negative and small enough that the
    // default NAN won't compare as equal to anything.
    assert(maxUlps > 0 && maxUlps < 4 * 1024 * 1024);
    int aInt = *(int*)&A;
    // Make aInt lexicographically ordered as a twos-complement int
    if (aInt < 0)
        aInt = 0x80000000 - aInt;
    // Make bInt lexicographically ordered as a twos-complement int
    int bInt = *(int*)&B;
    if (bInt < 0)
        bInt = 0x80000000 - bInt;
    int intDiff = abs(aInt - bInt);
    if (intDiff <= maxUlps)
        return true;
   return false;
}

Buried in the Google C++ Testing Framework, however, is a nice implementation of this algorithm that uses C++ templates to produce a clean interface for both float and double types. Check the internals/gtest-internal.h and internals/gtest-port.h header files to find it. Since I need the ability to compare floating-point values in my application logic, not just in test code, I have copied this implementation. And since the approach uses a couple powerful techniques less frequently seen in C++ code, and it took me a few minutes to process it the first time, I thought I would describe the code, from the ground up.

Floating-Point Bits

The IEEE 754 standard specifies that the bits of a floating-point value are arranged as follows: the left-most (most-signficant) bit is a sign bit, the next bits are an exponent, and the remaining bits are the fractional part of the significand. The exponent bits are an Excess-N, or "biased", representation. The fractional-part bits are a magnitude. This means that the bits of the binary floating-point value can be treated as a sign-and-magnitude integer value and the ordering of the values will be preserved. We can convert a floating point value to a signed integer as the AlmostEqual2sComplement code does, or we can convert to an unsigned integer (a biased/Excess-N representation itself) by taking the two's-complement of the negative values, the lower half of the range, and shifting all the positive values to the upper half range.

//
// Here the Bits type is an unsigned integer representation of the
// floating-point type, defined below.
//
Bits SignAndMagnitudeToBiased(const Bits& sam)
{
    if (kSignBitMask & sam) {
        return ~sam + 1;  // two's complement
    } else {
        return kSignBitMask | sam;  // + half-range
    }
}

The biased representation can then be used to get the difference in units of the last place of floating-point values (avoiding a call to abs()):

Bits ULP_diff(const Bits& sam1, const Bits& sam2)
{
    const Bits biased1 = SignAndMagnitudeToBiased(sam1);
    const Bits biased2 = SignAndMagnitudeToBiased(sam2);
 
    return (biased1 >= biased2) ? (biased1 - biased2) : (biased2 - biased1);
}

C++ templates provide a nice mechanism for defining the Bits type.

Using Partial Template Specialization to Map Types

In "Modern C++ Design: Generic Programming and Design Patterns Applied", Andrei Alexandrescu describes how partial template specialization in C++ can be used to map integral constants to types, or to select a type using a (boolean) constant. The Google C++ Test Framework uses this technique to select an appropriately sized integer type for the float and double floating-point types. The code looks something like this:

//
// The default mapping to void makes sure that TypeWithSize<N> will only
// work with the N values for which the template has been specialized.
//
template <size_t bytes>
struct TypeWithSize
{
    typedef void UInt;
};
 
template <>
struct TypeWithSize<4>
{
    typedef uint32_t UInt;
};
 
template <>
struct TypeWithSize<8>
{
    typedef uint64_t UInt;
};

Actually, that's not the entire story. The uint32_t and uint64_t types have to be defined somewhere. When using GCC (and others?) they come from the stdint.h header; when using Visual C++ they come from the built-in __int32 and __int64 types (at least for Visual C++ 8.0 and older). Thus the code above needs the following:

#if defined _MSC_VER  // may need to check your VC++ version
  typedef unsigned __int32 uint32_t;
  typedef unsigned __int64 uint64_t;
#else
  #include <stdint.h>
#endif

Armed with the partial template specializations above, we can define the Bits type based on the size of our floating-point type.

A Generic Interface for Floating-Point Comparison

We can now use the above partial specializations with another template construct to write a generic interface for comparing either float or double data.

template <typename RawType>
class FloatingPoint {
public:
    //
    // Using the TypeWithSize specializations above:
    //
    typedef typename TypeWithSize<sizeof(RawType)>::Uint Bits;
 
    static const size_t kBitCount = 8*sizeof(RawType);
    static const Bits kSignBitMask = static_cast<Bits>(1) << (kBitCount-1);
    static const size_t kMaxUlps = 4;
 
    explicit FloatingPoint(const RawType& x) : value_(x) {}
 
    //
    // Using the ULP_diff() implementation above:
    //
    bool AlmostEquals(const FloatingPoint& rhs) const {
        return ULP_diff(bits_, rhs.bits_) <= kMaxUlps;
    }
 
private:
    //
    // The partial template specializations above assure that the
    // size of the Bits integer type matches the size of the RawType.
    //
    union {
        RawType value_;  // The floating-point number, float or double.
        Bits bits_;      // The bits that represent the number
    };
};

We're almost there, but there is one more issue to deal with. We still need to test for "not a number" (NaN) values.

Testing for NaN, Generically and Portably

So that the AlmostEquals() comparison behaves the same as the == operator when comparing exactly identical values we need to handle the special NaN quantity separately (the ULP_diff() already works for infinities). The simplest way of doing this portably and robustly (a compiler may optimize (x != x) to false) is to add an is_nan() member function to the FloatingPoint class. NaN is defined by all exponent bits set to 1 and at least one fractional bit being 1. The C++ numeric_limits class lets us write this method generically for either float or double types.

#include <limits>
 
    //...in addition to the above constants...
    static const size_t kFracBitCount = std::numeric_limits<RawType>::digits - 1;
    static const size_t kExpBitCount = kBitCount - 1 - kFracBitCount;
 
    static const Bits kSignBitMask = static_cast<Bits>(1) << (kBitCount - 1);
    static const Bits kFracBitMask = ~static_cast<Bits>(0) >> (kExpBitCount + 1);
    static const Bits kExpBitMask = ~(kSignBitMask | kFractionBitMask);
 
 
    bool is_nan() const {
        return ((kExpBitMask & bits_) == kExpBitMask) &&
	        ((kFracBitMask & bits_) != 0);
    }

Putting it all together we have:

// -*- c++ -*-
//
// Copyright 2005, Google Inc.
// All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
//     * Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//     * Redistributions in binary form must reproduce the above
// copyright notice, this list of conditions and the following disclaimer
// in the documentation and/or other materials provided with the
// distribution.
//     * Neither the name of Google Inc. nor the names of its
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
// "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
// LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
// A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
// OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
// LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
// DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
// THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Authors: Zhanyong Wan, Sean Mcafee
//
// Taken from The Google C++ Testing Framework (Google Test).
// Modified for this discussion by Fred Richards.
//
 
#ifndef FLOATINGPOINT_H
#define FLOATINGPOINT_H
 
#include <cstddef>
#if defined _MSC_VER  // may need to check your VC++ version
  typedef unsigned __int32 uint32_t;
  typedef unsigned __int64 uint64_t;
#else
  #include <stdint.h>
#endif
#include <limits>
 
 
namespace
{
    template <size_t bytes>
    struct TypeWithSize
    {
        typedef void UInt;
    };
 
    template <>
    struct TypeWithSize<4>
    {
        typedef uint32_t UInt;
    };
 
    template <>
    struct TypeWithSize<8>
    {
        typedef uint64_t UInt;
    };
}
 
 
template <typename RawType>
class FloatingPoint
{
public:
    typedef typename TypeWithSize<sizeof(RawType)>::UInt Bits;
 
    static const size_t kBitCount = 8*sizeof(RawType);
    static const size_t kFracBitCount = std::numeric_limits<RawType>::digits - 1;
    static const size_t kExpBitCount = kBitCount - 1 - kFracBitCount;
 
    static const Bits kSignBitMask = static_cast<Bits>(1) << (kBitCount - 1);
    static const Bits kFracBitMask = ~static_cast<Bits>(0) >> (kExpBitCount + 1);
    static const Bits kExpBitMask = ~(kSignBitMask | kFracBitMask);
 
    static const size_t kMaxUlps = 4;
 
 
    explicit FloatingPoint(const RawType& x) : value_(x) {}
 
    //
    // Now checking for NaN to match == behavior.
    //
    bool AlmostEquals(const FloatingPoint& rhs) const {
        if (is_nan() || rhs.is_nan()) return false;
        return ULP_diff(bits_, rhs.bits_) <= kMaxUlps;
    }
 
private:
    bool is_nan() const {
        return ((kExpBitMask & bits_) == kExpBitMask) &&
	        ((kFracBitMask & bits_) != 0);
    }
 
    Bits SignAndMagnitudeToBiased(const Bits& sam) const {
        if (kSignBitMask & sam) {
            return ~sam + 1;  // two's complement
        } else {
            return kSignBitMask | sam;  // * 2
        }
    }
 
    Bits ULP_diff(const Bits& sam1, const Bits& sam2) const
    {
        const Bits biased1 = SignAndMagnitudeToBiased(sam1);
        const Bits biased2 = SignAndMagnitudeToBiased(sam2);
 
        return (biased1 >= biased2) ? (biased1 - biased2) : (biased2 - biased1);
    }
 
    union {
        RawType value_;
        Bits bits_;
    };
};
 
#endif // FLOATINGPOINT_H

And here's what it looks like in action, assuming the above code is in FloatingPoint.h:

#include "FloatingPoint.h"
#include <iostream>
 
int main()
{
    typedef float RawType;      // float -> double for different output
 
    RawType x = static_cast<RawType>(3.1415962);
    RawType y = static_cast<RawType>(3.14159625);
 
    std::cout << x;
 
    if (FloatingPoint<RawType>(x).AlmostEquals(FloatingPoint<RawType>(y))) {
        std::cout <<  " is ";
    } else {
        std::cout << " is not ";
    }
    std::cout << "nearly equal to " << y << std::endl;
 
    return 0;
}