International Journal of All Research Education & Scientific Methods

An ISO Certified Peer-Reviewed Journal

ISSN: 2455-6211

LLM Quantization for Cheaper and Faster Inference

Author: Ashish Bansal

ABSTRACT: While neural networks have advanced the frontiers in many applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is key if we want to integrate modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings, but the additional noise it induces can lead to accuracy degradation. The prominence of and growing reliance on LLMs in industries ranging from tech to finance underscore their importance. However, the "bigger is better" mantra does not always apply, especially when precision, efficiency, and real-world deployment are paramount; putting such technology into practice requires many trade-offs. In this paper, we describe several state-of-the-art algorithms for mitigating the impact of quantization noise on network performance while maintaining low-bit weights and activations. For quantizing LLMs there are two main classes of algorithms: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). PTQ requires no re-training or labeled data and is thus a lightweight, push-button approach to quantization. In most cases, PTQ is sufficient for achieving 8-bit quantization with accuracy close to floating point. QAT requires fine-tuning and access to labeled training data, but enables lower-bit quantization with competitive results. We show which quantization approach works for a variety of scenarios and which can be leveraged for inference.
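To make the PTQ setting in the abstract concrete, the sketch below shows one common scheme: per-tensor affine quantization of a weight matrix to int8, followed by dequantization to measure the quantization noise. This is an illustrative example under assumed choices (the function names, the affine per-tensor scheme, and the random stand-in weight matrix are hypothetical), not code from the paper.

```python
# Minimal PTQ sketch: per-tensor affine (asymmetric) int8 weight quantization.
# Illustrative only; the scheme and names below are assumptions, not the paper's method.
import numpy as np

def quantize_int8(w: np.ndarray):
    """Quantize a float tensor to int8 with a single scale and zero-point."""
    qmin, qmax = -128, 127
    w_min, w_max = float(w.min()), float(w.max())
    # Scale maps the observed float range onto the 256 int8 levels.
    scale = (w_max - w_min) / (qmax - qmin)
    zero_point = int(round(qmin - w_min / scale))
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map int8 codes back to floats, e.g. to measure reconstruction error."""
    return (q.astype(np.float32) - zero_point) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for one linear-layer weight matrix of an LLM.
    w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)
    q, scale, zp = quantize_int8(w)
    w_hat = dequantize(q, scale, zp)
    # The quantization noise the abstract refers to: error from the 8-bit representation.
    print("mean abs error:", np.abs(w - w_hat).mean())
```

This push-button step needs no labels or re-training, which is what makes PTQ lightweight; QAT would instead simulate this rounding during fine-tuning so the network can adapt to it at lower bit widths.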