Large Language Models (LLMs) continue to soar in popularity, with a new one released almost every week. As the number of these models grows, so do the options for hosting them. In my previous article we explored how to utilize DJL Serving within Amazon SageMaker to efficiently host LLMs. In this article we explore another optimized model server and solution: HuggingFace Text Generation Inference (TGI).
NOTE: For those of you new to AWS, make sure you create an account at the following link if you want to follow along. The article also assumes an intermediate understanding of SageMaker Deployment; I'd suggest following this article to understand Deployment/Inference more in depth.
DISCLAIMER: I am a Machine Learning Architect at AWS and my opinions are my own.
Why HuggingFace Text Generation Inference? How Does It Work With Amazon SageMaker?
TGI is a Rust, Python, gRPC model server created by HuggingFace that can be used to host specific large language models. HuggingFace has long been the central hub for NLP, and TGI incorporates a large set of optimizations aimed specifically at LLMs; a few are listed below, and the documentation has the detailed list.
- Tensor Parallelism for efficient hosting across multiple GPUs
- Token Streaming with SSE
- Quantization with bitsandbytes
- Logits warper (different parameters such as temperature, top-k, top-n, etc.)
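To make the logits warper concrete: when you invoke a TGI server's `/generate` endpoint, these sampling knobs travel in the `parameters` field of the request body. The sketch below builds such a payload; the prompt and parameter values are illustrative placeholders, not from the original text.

```python
import json

# Sketch of a request body for TGI's /generate endpoint.
# The entries under "parameters" drive the logits warper;
# all values here are illustrative placeholders.
payload = {
    "inputs": "What is the capital of France?",
    "parameters": {
        "do_sample": True,     # sample instead of greedy decoding
        "temperature": 0.7,    # sharpen/soften the token distribution
        "top_k": 50,           # keep only the 50 most likely tokens
        "max_new_tokens": 64,  # cap on generated tokens
    },
}

# This JSON body would be POSTed to the server's /generate route,
# or to /generate_stream to receive tokens incrementally via SSE.
body = json.dumps(payload)
print(body)
```

Posting the same body to the streaming route is what enables the token streaming with SSE mentioned above: the client receives tokens one at a time instead of waiting for the full completion.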
A big positive of this solution that I noted is its ease of use. TGI at this moment supports the following optimized model architectures, which you can directly deploy using the TGI containers.
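That ease of use comes from the fact that, on SageMaker, deploying a TGI container mostly reduces to a handful of environment variables. The sketch below shows what such a configuration could look like, assuming the variable names used by the HuggingFace LLM inference container (`HF_MODEL_ID`, `SM_NUM_GPUS`, `HF_MODEL_QUANTIZE`); the model ID and instance sizing are placeholders I've chosen for illustration.

```python
# Environment configuration that would be passed to a
# sagemaker.huggingface.HuggingFaceModel wrapping the TGI container.
# Variable names follow the HuggingFace LLM inference container
# convention; model ID and sizing below are placeholders.
tgi_env = {
    "HF_MODEL_ID": "tiiuae/falcon-7b-instruct",  # model to pull from the HF Hub
    "SM_NUM_GPUS": "1",                          # shards for tensor parallelism
    "HF_MODEL_QUANTIZE": "bitsandbytes",         # optional quantization
}

# Endpoint sizing handed to .deploy(); a single-GPU instance here.
deploy_config = {
    "initial_instance_count": 1,
    "instance_type": "ml.g5.2xlarge",
}

# e.g. HuggingFaceModel(env=tgi_env, role=role, image_uri=llm_image)
#          .deploy(**deploy_config)
print(tgi_env["HF_MODEL_ID"])
```

Bumping `SM_NUM_GPUS` on a multi-GPU instance is how you would engage the tensor parallelism described earlier, since the container shards the model across that many devices.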