IJARESM

Volume 13, Issue 2, February 2025

Transforming Images into High-Quality Speech Using Vision Transformers, Large Language Models, and Wave Net

Authors: Gedela Vinay Visesh, Chippada Yashwanth Sai, Vakada Goutham Yadav, Dadi Sai Hanisha, Polinati Mounika

ABSTRACT

Combining Vision Transformers (ViT), Large Language Models (LLMs), and advanced Text-to-Speech (TTS) models such as WaveNet and Tacotron2 has led to significant improvements in generating natural-sounding speech from image descriptions. This paper explores a novel pipeline in which ViT or CLIP models extract features from an image and an LLM translates those features into a descriptive caption. State-of-the-art TTS models then convert the caption into high-quality, human-like speech. The approach supports multiple languages, improves speech fidelity, and runs efficiently in real time. Extensive experimental evaluations show enhanced caption accuracy, speech quality, and system efficiency, with a mean opinion score (MOS) of up to 4.75/5. The proposed model offers a scalable solution for applications ranging from assistive technologies to digital media enhancement.
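
The three-stage flow described in the abstract (feature extraction, captioning, speech synthesis) can be illustrated with a minimal sketch. This is not the authors' implementation: the class `ImageToSpeechPipeline`, the stage type aliases, and the three `stub_*` functions are hypothetical names introduced here, and the stubs only stand in for a real ViT/CLIP encoder, an LLM captioner, and a WaveNet or Tacotron2 synthesizer. The helper `mean_opinion_score` assumes the usual definition of MOS as the arithmetic mean of listener ratings on a 1 to 5 scale; the ratings shown are made up for illustration and are not data from the paper.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical stage signatures (assumptions, not from the paper).
FeatureExtractor = Callable[[bytes], List[float]]  # image bytes -> feature vector (ViT/CLIP role)
Captioner = Callable[[List[float]], str]           # feature vector -> caption (LLM role)
Synthesizer = Callable[[str], bytes]               # caption -> audio bytes (TTS role)


@dataclass
class ImageToSpeechPipeline:
    """Chains the three stages in the order given in the abstract."""
    extract: FeatureExtractor
    caption: Captioner
    synthesize: Synthesizer

    def run(self, image: bytes) -> Tuple[str, bytes]:
        features = self.extract(image)   # stage 1: visual features
        text = self.caption(features)    # stage 2: descriptive caption
        audio = self.synthesize(text)    # stage 3: speech audio
        return text, audio


def mean_opinion_score(ratings: Sequence[int]) -> float:
    """Arithmetic mean of listener ratings, each on a 1-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return float(mean(ratings))


# Placeholder stages so the sketch runs without any model weights.
def stub_extract(image: bytes) -> List[float]:
    # Stand-in for an image encoder: scale the first four bytes to [0, 1].
    return [b / 255 for b in image[:4]]


def stub_caption(features: List[float]) -> str:
    # Stand-in for an LLM: report how many features were received.
    return f"placeholder caption for {len(features)} features"


def stub_synthesize(text: str) -> bytes:
    # Stand-in for a TTS model: return the caption's UTF-8 bytes as "audio".
    return text.encode("utf-8")


if __name__ == "__main__":
    pipeline = ImageToSpeechPipeline(stub_extract, stub_caption, stub_synthesize)
    caption, audio = pipeline.run(bytes([0, 255, 128, 64, 32]))
    print(caption, len(audio))
    print(mean_opinion_score([4, 5, 3, 4]))  # illustrative ratings only
```

Keeping each stage behind a plain callable means any one of them could be swapped (for example, a ViT encoder for a CLIP encoder) without touching the other two, which is one plausible way to organize the kind of pipeline the abstract describes.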