State-of-the-art large language models (LLMs) are pre-trained with billions of parameters. While pre-trained LLMs can perform many tasks, they can become much better once fine-tuned.
Thanks to LoRA, fine-tuning costs can be dramatically reduced. LoRA adds low-rank tensors, i.e., a small number of parameters (millions), on top of the frozen original parameters. Only the parameters in the added tensors are trained during fine-tuning.
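To make this concrete, here is a minimal NumPy sketch of the LoRA idea (the layer sizes and rank are illustrative, and this is not the `peft` library): the pretrained weight stays frozen, and only two small low-rank matrices are trained.

```python
import numpy as np

# Minimal LoRA sketch: the frozen pretrained weight W is untouched;
# only the two small low-rank matrices A and B are trained.
d_out, d_in, r = 4096, 4096, 8           # hypothetical layer sizes, rank r
W = np.random.randn(d_out, d_in)         # frozen pretrained weight
A = np.random.randn(r, d_in) * 0.01      # trainable, initialized small
B = np.zeros((d_out, r))                 # trainable, initialized to zero

def lora_forward(x, alpha=16):
    # y = W x + (alpha / r) * B (A x); since B starts at zero,
    # the adapter is a no-op at the beginning of fine-tuning
    return W @ x + (alpha / r) * (B @ (A @ x))

# Trainable parameters: r*(d_in + d_out) instead of d_in*d_out
print(A.size + B.size, "vs", W.size)     # 65536 vs 16777216
```

With a rank of 8 on a 4096×4096 layer, the adapter holds roughly 0.4% of the parameters of the original weight matrix, which is where the cost reduction comes from.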
LoRA still requires the model to be loaded in memory. To reduce the memory cost and speed up fine-tuning, a new approach proposes quantization-aware LoRA (QA-LoRA) fine-tuning.
In this article, I explain QA-LoRA and review its performance compared with previous work (especially QLoRA). I also show how to use QA-LoRA to fine-tune your own quantization-aware LoRA for Llama 2.
Fine-tuning LoRA on top of a quantized LLM is something that can already be done with QLoRA. In my previous articles, I used it many times to fine-tune LLMs, for instance, Llama 2 and GPT-NeoX, on my desktop computer or using the free instance of Google Colab.
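For reference, a typical QLoRA setup with the Hugging Face stack looks like the following sketch. The checkpoint name and LoRA hyperparameters are placeholders, not recommendations:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization config (QLoRA-style), bfloat16 for compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the attention projections (illustrative hyperparameters)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```

The base model is loaded in 4-bit NF4 while the LoRA adapters are trained in higher precision, which is exactly the combination whose limits QA-LoRA addresses.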
Before delving into QA-LoRA, it is useful to understand the current limits of QLoRA.
The NormalFloat4 (NF4) Quantization
LLM quantization algorithms usually quantize parameters to 4-bit precision using the INT4 data type. Computation with this data type is increasingly well optimized on recent GPUs.
QLoRA doesn't use INT4 by default but another data type called NormalFloat4 (NF4). You can see it as a compressed float number. According to the authors of QLoRA, NF4 is superior to INT4: LLMs quantized with NF4 achieve a lower perplexity.
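The intuition behind NF4 can be sketched in a few lines of NumPy: values are normalized per block by their absolute maximum, then snapped to the nearest of 16 fixed levels spaced according to the quantiles of a normal distribution (the level values below are rounded approximations, and the block size is illustrative):

```python
import numpy as np

# Approximate NF4 levels: 16 values placed at quantiles of a standard
# normal distribution, so they are denser where weights are more likely.
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
     0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0])

def nf4_quantize(w, block_size=64):
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)  # per-block absmax
    normed = blocks / scales                            # now in [-1, 1]
    # 4-bit code: index of the nearest NF4 level for each value
    idx = np.abs(normed[..., None] - NF4_LEVELS).argmin(axis=-1)
    return idx.astype(np.uint8), scales

def nf4_dequantize(idx, scales):
    # Look up each 4-bit code and rescale by the block's absmax
    return (NF4_LEVELS[idx] * scales).reshape(-1)

w = np.random.randn(256)
idx, scales = nf4_quantize(w)
w_hat = nf4_dequantize(idx, scales)
```

Note that dequantization requires a table lookup and a rescale rather than native integer arithmetic, which hints at why NF4 is awkward for fast inference kernels.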
However, NF4 computation isn't optimal for fast inference. This is one of the reasons why…