Why does an AI (LLM) model’s quality degrade when converting from bfloat16 to float16
As you may have noticed, most AI (LLM) models are trained in bfloat16 instead of float16.
Why bfloat16?
In bfloat16, the 16 bits are allocated as follows:
1 bit - sign bit
8 bits - exponent bits
7 bits - fraction bits
and in float16, the 16 bits are allocated like this:
1 bit - sign bit
5 bits - exponent bits
10 bits - fraction bits
As we can see, bfloat16 has 8 exponent bits compared to 5 in float16. Now compare this with float32, whose bits are allocated like this:
1 bit - sign bit
8 bits - exponent bits
23 bits - fraction bits
float32 also has 8 exponent bits, the same as bfloat16.
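To see the trade-off numerically, here is a minimal sketch using PyTorch’s torch.finfo (assuming torch is installed); it prints the largest representable value and the machine epsilon for each format. The numbers in the comments are approximate.

```python
import torch

for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16s} max = {info.max:.3e}   eps = {info.eps:.3e}")

# torch.float32    max = 3.403e+38   eps = 1.192e-07
# torch.bfloat16   max = 3.390e+38   eps = 7.812e-03
# torch.float16    max = 6.550e+04   eps = 9.766e-04
```

bfloat16 trades fraction bits for exponent bits: its precision (eps) is coarser than float16’s, but its dynamic range is essentially the same as float32’s.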
So bfloat16 can represent the same exponent range as float32, while still occupying only 16 bits.
This is why most AI models are now trained in bfloat16 instead of float16.
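Another way to see the relationship: the bfloat16 bit pattern is essentially the top 16 bits of the corresponding float32 pattern (up to rounding in the last fraction bit). A small sketch, again assuming PyTorch:

```python
import torch

x  = torch.tensor([3.14159265], dtype=torch.float32)
bf = x.to(torch.bfloat16)

# Reinterpret the raw bits of each tensor as integers of the same size.
f32_bits  = x.view(torch.int32).item() & 0xFFFFFFFF
bf16_bits = bf.view(torch.int16).item() & 0xFFFF

print(f"float32 bits : {f32_bits:032b}")
print(f"bfloat16 bits: {bf16_bits:016b}")
# The bfloat16 pattern matches the leading 16 bits of the float32 pattern,
# give or take one unit in the last place due to rounding.
```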
Ok, now coming back to the main point: as mentioned above, float16 has only 5 exponent bits, so when we convert bfloat16 to float16 we lose 3 exponent bits. That shrinks the representable range from roughly ±3.4e38 down to roughly ±65504, so large values overflow to infinity and very small values underflow to zero, which throws away a lot of valuable information.
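Here is a quick illustration of that information loss (illustrative values, PyTorch assumed): numbers that are perfectly representable in bfloat16 fall outside float16’s range and get crushed to infinity or zero.

```python
import torch

big   = torch.tensor([3.0e38],  dtype=torch.bfloat16)  # fine in bfloat16
small = torch.tensor([1.0e-30], dtype=torch.bfloat16)  # fine in bfloat16

print(big.to(torch.float16))    # tensor([inf], dtype=torch.float16) -> overflow
print(small.to(torch.float16))  # tensor([0.], dtype=torch.float16) -> underflow
```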
How to solve it
We can overcome this issue by converting bfloat16 to float32 instead: since both formats have 8 exponent bits, we just pad the 7 fraction bits with zeros to get 23, and no information is lost. This doubles the model size in memory, but that is the price of keeping accuracy.
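As a quick sanity check (PyTorch assumed), the bfloat16 to float32 conversion is exact: casting up and back down returns the original values unchanged.

```python
import torch

x   = torch.randn(4, dtype=torch.bfloat16)
x32 = x.to(torch.float32)                      # fraction bits padded with zeros, no loss
print(torch.equal(x32.to(torch.bfloat16), x))  # True: the round trip is exact
```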