For 50 years, from the time of Kernighan, Ritchie, and their 1st version of the C Language ebook, it was identified {that a} single-precision “float” sort has a 32-bit dimension and a double-precision sort has 64 bits. There was additionally an 80-bit “lengthy double” sort with prolonged precision, and all these varieties coated nearly all of the wants for floating-point knowledge processing. Nevertheless, throughout the previous few years, the appearance of huge neural community fashions required builders to maneuver into one other a part of the spectrum and to shrink floating level varieties as a lot as attainable.
Actually, I used to be stunned once I found that the 4-bit floating-point format exists. How on Earth can or not it’s attainable? One of the simplest ways to know is to check it on our personal. On this article, we’ll uncover the preferred floating level codecs, make a easy neural community, and see the way it works.
Let’s get began.
A “Normal” 32-bit Floating level
Earlier than going into “excessive” codecs, let’s recall a normal one. An IEEE 754 commonplace for floating-point arithmetic was established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). A typical quantity in a 32-float sort appears like this:
Right here, the primary bit is an indication, the subsequent 8 bits symbolize an exponent, and the final bits symbolize the mantissa. The ultimate worth is calculated utilizing the method:
This straightforward helper perform permits us to print a floating level worth in binary kind:
import structdef print_float32(val: float):
""" Print Float32 in a binary kind """
m = struct.unpack('I', struct.pack('f', val))[0]
return format(m, 'b').zfill(32)
print_float32(0.15625)
# > 00111110001000000000000000000000
Let’s additionally make one other helper for backward conversion, which will likely be helpful later:
def ieee_754_conversion(signal, exponent_raw, mantissa, exp_len=8, mant_len=23):
""" Convert binary knowledge into the floating level worth """
sign_mult = -1 if signal == 1 else 1
exponent = exponent_raw - (2 ** (exp_len - 1) - 1)
mant_mult = 1
for b in vary(mant_len - 1, -1, -1):
if mantissa & (2 **…