International Journal of All Research Education & Scientific Methods

An ISO Certified Peer-Reviewed Journal

ISSN: 2455-6211

Transforming Images into High-Quality Speech Using Vision Transformers, Large Language Models, and WaveNet

Author Name : Gedela Vinay Visesh, Chippada Yashwanth Sai, Vakada Goutham Yadav, Dadi Sai Hanisha, Polinati Mounika

ABSTRACT The fusion of Vision Transformers (ViT), Large Language Models (LLMs), and advanced Text-to-Speech (TTS) models such as WaveNet and Tacotron2 has significantly improved the generation of natural-sounding speech from image descriptions. This paper explores a novel pipeline in which ViT or CLIP models extract features from an image, an LLM translates those features into a descriptive caption, and a state-of-the-art TTS model converts the caption into high-quality, human-like speech. The approach supports multiple languages, improves speech fidelity, and runs efficiently in real time. Extensive experimental evaluations show enhanced caption accuracy, speech quality, and system efficiency, with a mean opinion score (MOS) of up to 4.75 out of 5. The proposed model provides a scalable solution for applications ranging from assistive technologies to digital media enhancement.
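
The three-stage flow described in the abstract (feature extraction, captioning, speech synthesis) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `ImageToSpeechPipeline`, the three `stub_*` functions, and `mean_opinion_score` are hypothetical names introduced here, and the stubs only stand in for a ViT/CLIP encoder, an LLM captioner, and a WaveNet or Tacotron2 synthesizer. The MOS helper assumes the usual convention of averaging listener ratings on a 1 to 5 scale.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical stage signatures; real models would replace the stubs below.
FeatureExtractor = Callable[[bytes], List[float]]  # image bytes -> features
Captioner = Callable[[List[float]], str]           # features -> caption text
Synthesizer = Callable[[str], List[float]]         # caption -> audio samples


@dataclass
class ImageToSpeechPipeline:
    """Chains the three stages in the order given in the abstract."""
    extract: FeatureExtractor
    caption: Captioner
    synthesize: Synthesizer

    def run(self, image: bytes) -> Tuple[str, List[float]]:
        features = self.extract(image)   # stage 1: ViT/CLIP stand-in
        text = self.caption(features)    # stage 2: LLM stand-in
        audio = self.synthesize(text)    # stage 3: TTS stand-in
        return text, audio


def stub_extract(image: bytes) -> List[float]:
    # Placeholder only: scales the first four bytes to [0, 1].
    return [b / 255.0 for b in image[:4]]


def stub_caption(features: List[float]) -> str:
    # Placeholder only: a real captioner would condition an LLM on features.
    return f"An image summarised by {len(features)} feature values."


def stub_synthesize(text: str) -> List[float]:
    # Placeholder only: one silent sample per character of the caption.
    return [0.0] * len(text)


def mean_opinion_score(ratings: Sequence[int]) -> float:
    """Arithmetic mean of listener ratings, each on a 1-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return float(mean(ratings))


if __name__ == "__main__":
    pipeline = ImageToSpeechPipeline(stub_extract, stub_caption, stub_synthesize)
    text, audio = pipeline.run(bytes([0, 255, 128, 64, 32]))
    print(text, len(audio))
```

Keeping each stage behind a plain callable means an encoder, captioner, or synthesizer can be swapped independently, which matches the abstract's mention of ViT or CLIP on the vision side and WaveNet or Tacotron2 on the speech side.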