International Journal of All Research Education & Scientific Methods

An ISO Certified Peer-Reviewed Journal

ISSN: 2455-6211

OCR and Image Captioning Using Florence-2 Vision Transformer Model

Author Name : Sushant Kumar Mahato, Janaki Kandasamy, Deepshik Sharma

ABSTRACT This research project explores Optical Character Recognition (OCR) and image caption generation using Microsoft's Florence-2 Vision Transformer model, a deep learning model for computer vision. Florence-2 is pretrained at large scale on multimodal datasets, which supports high accuracy in both text recognition and visual understanding tasks. The project aims to develop an integrated system for OCR and automatic captioning in which Florence-2 is adapted to recognize complex text layouts and to generate descriptive captions across varied visual contexts. This dual functionality matters for applications in assistive technology, document analysis, and autonomous systems that must interpret text and image content together. Fine-tuning Florence-2 on diverse datasets, including documents, scenes with embedded text, and web images, enhances the model's capacity for nuanced captioning and accurate OCR. We evaluate the system's performance using BLEU and ROUGE for captions and character accuracy rate for OCR. Our results indicate that Florence-2's multimodal learning approach improves OCR accuracy and the contextual richness of captions, suggesting its potential for applications in content accessibility and automated document processing.
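
The abstract names character accuracy rate as the OCR metric but does not define it. A minimal sketch follows, assuming the common convention that character accuracy equals one minus the character error rate, where the error rate is the Levenshtein edit distance between the recognized text and the reference divided by the reference length. The function names `edit_distance` and `character_accuracy` are illustrative and are not taken from the paper.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning one string into the other."""
    # prev[j] holds the distance between the processed prefix of
    # `reference` and the first j characters of `hypothesis`.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # delete a reference character
                curr[j - 1] + 1,                  # insert a hypothesis character
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - CER, clamped to [0, 1]."""
    if not reference:
        # Convention chosen for this sketch: an empty reference is
        # fully matched only by an empty hypothesis.
        return 1.0 if not hypothesis else 0.0
    cer = edit_distance(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)


if __name__ == "__main__":
    # One substituted character out of five.
    print(character_accuracy("hello", "hallo"))
```

Under this assumed definition, a hypothesis with many spurious insertions can push the error rate above one, which is why the score is clamped at zero. The paper may normalize differently.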