International Journal of All Research Education & Scientific Methods

An ISO Certified Peer-Reviewed Journal

ISSN: 2455-6211

LLM Quantization for Cheaper and Faster Inference

Author: Ashish Bansal

ABSTRACT: While neural networks have advanced the frontiers in many applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is key if we want to integrate modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings, but the additional noise it induces can lead to accuracy degradation. The prominence of and growing reliance on LLMs in industries ranging from tech to finance underscore their importance. However, the "bigger is better" mantra does not always apply, especially when precision, efficiency, and real-world deployment are paramount; putting such technology into practice requires many trade-offs. In this paper, we describe several state-of-the-art algorithms for mitigating the impact of quantization noise on network performance while maintaining low-bit weights and activations. For quantizing LLMs there are two main classes of algorithms: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). PTQ requires no re-training or labeled data and is thus a lightweight, push-button approach to quantization. In most cases, PTQ is sufficient for achieving 8-bit quantization with accuracy close to floating point. QAT requires fine-tuning and access to labeled training data, but enables lower-bit quantization with competitive results. We show which quantization approach works for a variety of scenarios and which can be leveraged for inference.
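To make the PTQ setting in the abstract concrete, the sketch below shows one common scheme: per-tensor affine quantization of a weight matrix to int8, followed by dequantization to measure the quantization noise. This is an illustrative example under assumed choices (the function names, the affine per-tensor scheme, and the random stand-in weight matrix are hypothetical), not code from the paper.

```python
# Minimal PTQ sketch: per-tensor affine (asymmetric) int8 weight quantization.
# Illustrative only; the scheme and names below are assumptions, not the paper's method.
import numpy as np

def quantize_int8(w: np.ndarray):
    """Quantize a float tensor to int8 with a single scale and zero-point."""
    qmin, qmax = -128, 127
    w_min, w_max = float(w.min()), float(w.max())
    # Scale maps the observed float range onto the 256 int8 levels.
    scale = (w_max - w_min) / (qmax - qmin)
    zero_point = int(round(qmin - w_min / scale))
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map int8 codes back to floats, e.g. to measure reconstruction error."""
    return (q.astype(np.float32) - zero_point) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for one linear-layer weight matrix of an LLM.
    w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)
    q, scale, zp = quantize_int8(w)
    w_hat = dequantize(q, scale, zp)
    # The quantization noise the abstract refers to: error from the 8-bit representation.
    print("mean abs error:", np.abs(w - w_hat).mean())
```

This push-button step needs no labels or re-training, which is what makes PTQ lightweight; QAT would instead simulate this rounding during fine-tuning so the network can adapt to it at lower bit widths.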