University of Warwick
Publications service & WRAP

Mix-ViT: mixing attentive vision transformer for ultra-fine-grained visual categorization


Yu, Xiaohan, Wang, Jun, Zhao, Yang and Gao, Yongsheng (2023) Mix-ViT: mixing attentive vision transformer for ultra-fine-grained visual categorization. Pattern Recognition, 135. 109131. doi:10.1016/j.patcog.2022.109131 ISSN 0031-3203.

Research output not available from this repository.

Request a copy directly from the author, or use the Library's Get It For Me service.

Official URL: http://dx.doi.org/10.1016/j.patcog.2022.109131


Abstract

Ultra-fine-grained visual categorization (ultra-FGVC) moves down the taxonomy level to classify sub-granularity categories of fine-grained objects. This inevitably poses a challenge: classifying highly similar objects with limited samples, which impedes the performance of recent advanced vision transformer methods. To address this challenge, this paper introduces Mix-ViT, a novel mixing attentive vision transformer for improved ultra-FGVC. Its core design is a self-supervised module that attentively substitutes high-level tokens across samples and learns to predict whether each token has been substituted. This drives the model to capture the contextual discriminative details that distinguish inter-class samples. By incorporating such a self-supervised module, the network gains more knowledge from the intrinsic structure of the input data and thus improves its generalization capability with limited training samples. The proposed Mix-ViT achieves competitive performance on seven publicly available datasets, demonstrating for the first time the potential of vision transformers, compared with CNNs, in addressing the challenging ultra-FGVC tasks. The code is available at https://github.com/Markin-Wang/MixViT.
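The self-supervised mixing step described above can be illustrated with a short sketch. The PyTorch code below is one possible reading of the abstract, not the authors' implementation: it substitutes a fraction of high-level patch tokens with tokens from another sample in the batch and trains a per-token binary head to predict which tokens were substituted. The class name TokenMixSSL, the mix_ratio value, and the batch-roll substitution (standing in for the paper's attentive substitution) are illustrative assumptions; the actual module is defined in the linked repository.

# Illustrative sketch only -- not the authors' code. Assumes a generic ViT
# backbone producing per-patch token embeddings of shape (B, N, D).
import torch
import torch.nn as nn


class TokenMixSSL(nn.Module):
    """Mixes tokens across samples and predicts substitution per token."""

    def __init__(self, dim: int, mix_ratio: float = 0.3):
        super().__init__()
        self.mix_ratio = mix_ratio     # fraction of tokens to substitute (assumed value)
        self.head = nn.Linear(dim, 1)  # per-token "was this token swapped?" logit

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, N, D) high-level patch tokens from the transformer encoder
        B, N, D = tokens.shape
        num_mix = max(1, int(N * self.mix_ratio))

        # Randomly choose which token positions to substitute in each sample.
        scores = torch.rand(B, N, device=tokens.device)
        mix_idx = scores.topk(num_mix, dim=1).indices                 # (B, num_mix)
        mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
        mask.scatter_(1, mix_idx, True)                               # True = substituted

        # Substitute the selected tokens with tokens from another sample.
        # A simple roll of the batch stands in for the attentive selection.
        donor = tokens.roll(shifts=1, dims=0)
        mixed = torch.where(mask.unsqueeze(-1), donor, tokens)

        # Binary prediction per token: was this token substituted?
        logits = self.head(mixed).squeeze(-1)                         # (B, N)
        ssl_loss = nn.functional.binary_cross_entropy_with_logits(
            logits, mask.float()
        )
        return mixed, ssl_loss

In this reading, the substitution-prediction loss would be added to the usual classification loss during training, so the encoder must attend to fine contextual detail to tell original tokens from injected ones; the paper's attentive substitution and exact loss weighting are not reproduced here.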

Item Type: Journal Article
Divisions: Faculty of Science, Engineering and Medicine > Science > Computer Science
Journal or Publication Title: Pattern Recognition
Publisher: Pergamon
ISSN: 0031-3203
Official Date: March 2023
Dates:
Published: March 2023
Available: 28 October 2022
Accepted: 22 October 2022
Volume: 135
Article Number: 109131
DOI: 10.1016/j.patcog.2022.109131
Status: Peer Reviewed
Publication Status: Published
Access rights to Published version: Open Access (Creative Commons)
