Dense-CaptionNet: a sentence generation architecture for fine-grained description of image semantics
Khurram, I., Fraz, Muhammad Moazam, Shahzad, M. and Rajpoot, Nasir M. (2021) Dense-CaptionNet: a sentence generation architecture for fine-grained description of image semantics. Cognitive Computation, 13, pp. 595-611. doi:10.1007/s12559-019-09697-1. ISSN 1866-9956.
PDF: WRAP-Dense-CaptionNet-sentence-generation-architecture-Rajpoot-2019.pdf (Accepted Version, 2739 KB)
Official URL: https://doi.org/10.1007/s12559-019-09697-1
Abstract
Automatic image captioning, a highly challenging research problem, aims to understand and describe the contents of a complex scene in human-understandable natural language. The majority of recent solutions are based on holistic approaches in which the scene is described as a whole, potentially losing the important semantic relationships among objects in the scene. We propose Dense-CaptionNet, a region-based deep architecture for fine-grained description of image semantics, which localizes and describes each object/region in the image separately and generates a more detailed description of the scene. The proposed network contains three components that work together to generate a fine-grained description of image semantics. Region descriptions and object relationships are generated by the first module, whereas the second generates the attributes of objects present in the scene. The textual descriptions produced by these two modules are concatenated and fed as input to the sentence generation module, which uses an encoder-decoder formulation to generate a grammatically correct, single-sentence, fine-grained description of the whole scene. The proposed Dense-CaptionNet is trained and tested on the Visual Genome, MSCOCO, and IAPR TC-12 datasets. The results establish a new state of the art compared with existing top-performing methods, e.g., Up-Down-Captioner, Show, Attend and Tell, SemStyle, and NeuralTalk, especially on complex scenes.
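The three-stage pipeline described in the abstract can be sketched schematically. The sketch below is an illustrative assumption, not the authors' implementation: the function names and stub outputs are hypothetical, and each stub stands in for a trained deep network.

```python
# Illustrative sketch (assumption) of the Dense-CaptionNet pipeline from the
# abstract. All module bodies are placeholder stubs; in the paper, each is a
# deep network trained on Visual Genome, MSCOCO, and IAPR TC-12.

def region_description_module(image):
    # Module 1: localizes regions and describes objects and their
    # relationships (stubbed output for illustration).
    return ["a man riding a horse", "a dog next to the horse"]

def attribute_module(image):
    # Module 2: generates attributes of the objects present in the scene
    # (stubbed output for illustration).
    return ["brown horse", "small white dog"]

def sentence_generation_module(text_input):
    # Module 3: an encoder-decoder that fuses the concatenated textual
    # descriptions into one grammatically correct sentence (stubbed).
    return ("A man rides a brown horse while a small white dog "
            "walks beside it.")

def dense_captionnet(image):
    # The outputs of modules 1 and 2 are concatenated and fed as input
    # to the sentence generation module, as the abstract describes.
    descriptions = region_description_module(image)
    attributes = attribute_module(image)
    combined = "; ".join(descriptions + attributes)
    return sentence_generation_module(combined)

print(dense_captionnet(image=None))
```

The point of the sketch is the data flow: region/relationship text and attribute text are produced independently, joined as plain text, and only then decoded into a single fluent caption.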
Item Type: Journal Article
Subjects: P Language and Literature > P Philology. Linguistics; Q Science > QA Mathematics; T Technology > TA Engineering (General). Civil engineering (General)
Divisions: Faculty of Science, Engineering and Medicine > Science > Computer Science
Library of Congress Subject Headings (LCSH): Computer vision; Image processing -- Computer programs; Image processing -- Digital techniques; Photograph captions; Semantics -- Data processing; Computer graphics; Neural networks (Computer science)
Journal or Publication Title: Cognitive Computation
Publisher: Springer
ISSN: 1866-9956
Official Date: May 2021
Volume: 13
Page Range: pp. 595-611
DOI: 10.1007/s12559-019-09697-1
Status: Peer Reviewed
Publication Status: Published
Reuse Statement (publisher, data, author rights): This is a post-peer-review, pre-copyedit version of an article published in Cognitive Computation. The final authenticated version is available online at: http://dx.doi.org/10.1007/s12559-019-09697-1
Access rights to Published version: Restricted or Subscription Access
Date of first compliant deposit: 20 November 2019
Date of first compliant Open Access: 2 March 2021