Lately, quite a few open-source large language models (LLMs) have been released. These powerful models hold great potential for a wide range of applications. However, one major challenge that arises is the limitation of resources when it comes to testing these models. While platforms like Google Colab Pro offer the ability to test up to 7B models, what options do we have when we want to experiment with even larger models, such as 13B?
In this blog post, we'll see how we can run the Llama 13B and OpenChat 13B models on a single GPU. Here we're using Google Colab Pro's GPU, a T4 with 25 GB of system RAM. Let's check how to run it step by step.
Step 1:
Install the requirements: you need to install accelerate and transformers from source, and make sure you have installed the latest version of the bitsandbytes library (0.39.0).
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install sentencepiece
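As an optional sanity check, you can confirm the installed versions before moving on (bitsandbytes should report at least 0.39.0 for 4-bit support); a minimal snippet:
# Optional sanity check of the installed library versions.
# bitsandbytes >= 0.39.0 is needed for 4-bit (NF4) quantization.
import bitsandbytes
import transformers
import accelerate
print("bitsandbytes:", bitsandbytes.__version__)
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)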
Step 2:
We're using a quantization technique in our approach, leveraging the BitsAndBytes functionality from the transformers library. This technique allows us to perform quantization using various 4-bit variants, such as NF4 (normalized float 4, which is the default) or pure FP4 quantization. With 4-bit bitsandbytes, weights are stored in 4 bits, while the computation can still take place in 16 or 32 bits. Different combinations, including float16, bfloat16, and float32, can be chosen for computation.
To improve the efficiency of matrix multiplication and training, we recommend using a 16-bit compute dtype (the default is torch.float32). The recent introduction of BitsAndBytesConfig in transformers provides the flexibility to adjust these parameters according to specific requirements.
import torch
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16
)
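If you want to experiment with the other knobs mentioned above, the same config also exposes the 4-bit variant and nested (double) quantization. The values below are just an illustrative sketch, not required settings:
# Illustrative variant of the config above: explicitly pick NF4 and
# enable nested quantization to shave off a little more memory.
quantization_config_nf4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # "nf4" (default) or "fp4"
    bnb_4bit_use_double_quant=True,       # nested quantization
    bnb_4bit_compute_dtype=torch.bfloat16
)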
Step 3:
Once we've added the configuration, in this step we'll load the tokenizer and the model. Here we're using the OpenChat model, but you can use any 13B model available on the Hugging Face Hub.
If you want to use the Llama 13B model, just change the model_id to "openlm-research/open_llama_13b" and run the steps below again.
model_id = "openchat/openchat_8192"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model_bf16 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
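If you're curious how much GPU memory the quantized model actually takes, transformers exposes get_memory_footprint() on the loaded model; a rough check:
# Rough memory check: the 4-bit 13B model should fit on a single T4.
print(f"Model footprint: {model_bf16.get_memory_footprint() / 1e9:.2f} GB")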
Step 4:
Once we've loaded the model, it's time to test it. You can provide any input of your choice, and also increase the "max_new_tokens" parameter to the number of tokens you wish to generate.
text = "Q: What is the largest animal?\nA:"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model_bf16.generate(**inputs, max_new_tokens=35)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Output:
You can use any 13B model with this quantization technique on a single GPU or Google Colab Pro.
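As a small bonus, here's a minimal helper that wraps the tokenize, generate, and decode steps above into one call. The function name and sampling settings are our own choices, so adjust them to taste:
# Hypothetical convenience wrapper around the steps above (not part of any library).
def generate_response(prompt, max_new_tokens=64):
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model_bf16.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,      # sample instead of greedy decoding (illustrative)
        temperature=0.7,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_response("Q: What is the capital of France?\nA:"))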