Please use this identifier to cite or link to this item: https://idr.l1.nitk.ac.in/jspui/handle/123456789/17747
Title: A Region Based Semantic Composition Framework to Visual Image and Video Event Specification
Authors: Naik, Dinesh
Supervisors: C D, Jaidhar
Keywords: Computer Vision;Object Detection;Semantic Segmentation;Object Recognition
Issue Date: 2023
Publisher: National Institute of Technology Karnataka, Surathkal
Abstract: A long-standing goal of artificial intelligence in Computer Vision has been to develop models capable of perceiving and comprehending the complex visual environment around us and communicating with us about it in natural language. Significant progress has been achieved toward this goal over the last few years as a result of parallel advancements in computing systems, data collection, and algorithms. Visual recognition has advanced at a breakneck pace, with computers now capable of classifying images, recognising their contents, and describing them in increasingly detailed sentences. They now rival human performance in several of these tasks, even surpassing it in some instances. Despite this progress, most improvements in visual recognition still occur in settings where an image is labelled with one or a few categories and briefly described in natural language. Most people find it straightforward to watch a brief video and describe, in words, what occurred; machines, by contrast, struggle to extract meaning from video frames and generate a sentence-level description. Computer vision research has long focused on comprehending visual media such as images and videos, and a newer problem within this area, dynamic image and video description, has attracted widespread interest. This research presents models and methods for associating visual data with semantic labels and with natural language utterances, thereby simplifying translation between the two domains.

Semantic segmentation is a fundamental component of object recognition models, as it aims to classify objects on a pixel-by-pixel basis. First, this research addresses the pixel-wise classification of individual objects within an image, evaluating the input image to ascertain the pixel-level properties it contains. Second, we propose an encoder-decoder architecture with a hybrid loss function that employs a stacked LSTM as the encoder and an LSTM combined with an attention mechanism as the decoder. Third, we propose a novel framework for video captioning that combines a bidirectional multi-layer LSTM encoder and a unidirectional decoder with a temporal attention technique to produce superior global representations of videos. Finally, we propose an efficient method for video captioning that uses a CNN in conjunction with a short-connected LSTM-based encoder-decoder model and a phrase context vector.
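
The abstract describes several attention-based LSTM encoder-decoder designs for captioning. As a rough illustration only (not the thesis code), the PyTorch sketch below shows the general shape of such a model: a stacked LSTM encoder over per-frame CNN features and an LSTM decoder with temporal soft attention. All class names, dimensions, and the additive attention form are assumptions made for this example.

    # Minimal illustrative sketch of a stacked-LSTM encoder with a
    # temporally attentive LSTM decoder for captioning. Names and sizes
    # are assumptions, not the thesis implementation.
    import torch
    import torch.nn as nn


    class TemporalAttention(nn.Module):
        """Additive (soft) attention over encoder timesteps."""
        def __init__(self, enc_dim, dec_dim, attn_dim=256):
            super().__init__()
            self.enc_proj = nn.Linear(enc_dim, attn_dim)
            self.dec_proj = nn.Linear(dec_dim, attn_dim)
            self.score = nn.Linear(attn_dim, 1)

        def forward(self, enc_outputs, dec_hidden):
            # enc_outputs: (B, T, enc_dim), dec_hidden: (B, dec_dim)
            scores = self.score(torch.tanh(
                self.enc_proj(enc_outputs)
                + self.dec_proj(dec_hidden).unsqueeze(1)))   # (B, T, 1)
            weights = torch.softmax(scores, dim=1)
            context = (weights * enc_outputs).sum(dim=1)     # (B, enc_dim)
            return context, weights


    class CaptioningModel(nn.Module):
        def __init__(self, feat_dim, vocab_size, hidden=512, layers=2, emb=300):
            super().__init__()
            # Stacked ("layered") LSTM encoder over frame/region features.
            self.encoder = nn.LSTM(feat_dim, hidden, num_layers=layers,
                                   batch_first=True)
            self.embed = nn.Embedding(vocab_size, emb)
            self.attention = TemporalAttention(hidden, hidden)
            # Decoder LSTM cell consumes [word embedding ; attended context].
            self.decoder = nn.LSTMCell(emb + hidden, hidden)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, feats, captions):
            # feats: (B, T, feat_dim) CNN features, captions: (B, L) token ids
            enc_out, _ = self.encoder(feats)
            B, L = captions.shape
            h = feats.new_zeros(B, self.decoder.hidden_size)
            c = feats.new_zeros(B, self.decoder.hidden_size)
            logits = []
            for t in range(L - 1):
                context, _ = self.attention(enc_out, h)
                step_in = torch.cat([self.embed(captions[:, t]), context], dim=1)
                h, c = self.decoder(step_in, (h, c))
                logits.append(self.out(h))
            return torch.stack(logits, dim=1)    # (B, L-1, vocab_size)


    if __name__ == "__main__":
        model = CaptioningModel(feat_dim=2048, vocab_size=1000)
        feats = torch.randn(4, 20, 2048)          # 20 frames of CNN features
        caps = torch.randint(0, 1000, (4, 12))    # toy caption token ids
        print(model(feats, caps).shape)           # torch.Size([4, 11, 1000])

Swapping the encoder to a bidirectional LSTM (bidirectional=True, with the attention's encoder dimension doubled accordingly) would approximate the bidirectional multi-layer variant mentioned in the abstract.
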
URI: http://idr.nitk.ac.in/jspui/handle/123456789/17747
Appears in Collections: 1. Ph.D. Theses

Files in This Item:
There are no files associated with this item.