Transforming Images into High-Quality Speech Using Vision Transformers, Large Language Models, and WaveNet
Authors: Gedela Vinay Visesh, Chippada Yashwanth Sai, Vakada Goutham Yadav, Dadi Sai Hanisha, Polinati Mounika
ABSTRACT: The fusion of Vision Transformers (ViT), Large Language Models (LLMs), and advanced Text-to-Speech (TTS) models such as WaveNet and Tacotron2 has led to significant improvements in generating natural-sounding speech from image descriptions. This paper explores a novel pipeline in which images are processed through ViT or CLIP models to extract visual features, which are then translated into descriptive captions using LLMs. The generated textual descriptions are subsequently converted into high-quality, human-like speech using state-of-the-art TTS models. Our approach provides multilingual support, improved speech fidelity, and efficient real-time performance. We present extensive experimental evaluations demonstrating improved caption accuracy, speech quality, and system efficiency, with a mean opinion score (MOS) of up to 4.75/5. The proposed model offers a scalable solution for applications ranging from assistive technologies to digital media enhancement.
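To make the described pipeline concrete, the sketch below chains an image-captioning stage and a neural TTS stage using the Hugging Face `transformers` pipeline API. It is a minimal illustration only: the checkpoints named here (a BLIP-based captioner and the Bark TTS model) and the file names are assumptions standing in for the ViT/CLIP, LLM, and WaveNet/Tacotron2 components the authors describe, not the models they trained or evaluated.

```python
# Minimal sketch: image -> caption -> speech, assuming Hugging Face checkpoints
# as stand-ins for the paper's ViT/LLM and WaveNet/Tacotron2 components.
from transformers import pipeline
from PIL import Image
import soundfile as sf

# Stage 1-2: a vision encoder plus language model produce a descriptive caption.
# "Salesforce/blip-image-captioning-base" is an illustrative captioner, not the
# authors' model.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner(Image.open("example.jpg"))[0]["generated_text"]  # hypothetical input image
print("Caption:", caption)

# Stage 3: a neural TTS model converts the caption into a waveform.
# "suno/bark-small" stands in for WaveNet/Tacotron2 here.
tts = pipeline("text-to-speech", model="suno/bark-small")
speech = tts(caption)
sf.write("caption_speech.wav", speech["audio"].squeeze(), speech["sampling_rate"])
```

In a production setting, each stage could be swapped for the paper's specific components (e.g., a CLIP encoder feeding an LLM captioner, and a WaveNet vocoder), since the stages communicate only through the intermediate caption text.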