
ViTGaze 👀

Gaze Following with Interaction Features in Vision Transformers

Yuehao Song¹, Xinggang Wang¹✉️, Jingfeng Yao¹, Wenyu Liu¹, Jinglin Zhang², Xiangmin Xu³

¹ Huazhong University of Science and Technology, ² Shandong University, ³ South China University of Technology

(✉️ corresponding author)

Accepted by Visual Intelligence (Paper)

Links: arXiv paper · 🤗 HF models · Papers with Code


[Demo GIFs: Demo0, Demo1]

News

  • Nov. 21st, 2024: ViTGaze is accepted by Visual Intelligence! 🎉
  • Mar. 25th, 2024: We released an initial version of ViTGaze.
  • Mar. 19th, 2024: We released our paper on arXiv. Code/Models are coming soon. Please stay tuned! ☕️

Introduction

A plain Vision Transformer can also do gaze following with the simple ViTGaze framework!

[Figure: ViTGaze framework overview]

Inspired by the remarkable success of pre-trained plain Vision Transformers (ViTs), we introduce ViTGaze, a novel single-modality gaze following framework. In contrast to previous methods, it builds a brand-new gaze following pipeline based mainly on a powerful encoder (the decoder accounts for less than 1% of the parameters). Our principal insight is that the inter-token interactions within self-attention can be transferred to interactions between humans and scenes. Our method achieves state-of-the-art (SOTA) performance among all single-modality methods (3.4% improvement in AUC, 5.1% improvement in AP) and highly comparable performance to multi-modality methods with 59% fewer parameters.
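To make the core idea concrete, here is a minimal, hypothetical sketch (not the released ViTGaze code; all module and tensor names below are illustrative) of reading person-to-scene interaction features out of a plain ViT's self-attention and decoding them with a tiny head:

```python
# Minimal, hypothetical sketch of the core idea: reuse a plain ViT's self-attention
# maps as person-scene interaction features and decode them with a tiny head.
# Shapes and names are illustrative; this is not the released ViTGaze implementation.
import torch
import torch.nn as nn


def person_scene_attention(q, k, head_token_idx):
    """Attention weights from the token covering the person's head to all N
    patch tokens, for every attention head: returns (B, heads, N)."""
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale      # (B, heads, N, N)
    attn = attn.softmax(dim=-1)
    return attn[:, :, head_token_idx]             # (B, heads, N)


class TinyGazeDecoder(nn.Module):
    """Deliberately small decoder: maps per-head interaction maps to one gaze heatmap."""

    def __init__(self, num_heads, feat_hw=32):
        super().__init__()
        self.feat_hw = feat_hw
        self.head = nn.Sequential(
            nn.Conv2d(num_heads, 32, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 1),
        )

    def forward(self, interaction):               # (B, heads, N)
        b, h, _ = interaction.shape
        maps = interaction.view(b, h, self.feat_hw, self.feat_hw)
        return self.head(maps)                    # (B, 1, H, W) gaze heatmap


if __name__ == "__main__":
    B, heads, N, dim = 2, 12, 32 * 32, 64
    q = torch.randn(B, heads, N, dim)
    k = torch.randn(B, heads, N, dim)
    feats = person_scene_attention(q, k, head_token_idx=0)
    print(TinyGazeDecoder(num_heads=heads)(feats).shape)   # torch.Size([2, 1, 32, 32])
```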

Results

Results from the ViTGaze paper

[Figure: comparison with previous methods]

Results on GazeFollow:

| AUC   | Avg. Dist. | Min. Dist. |
|-------|------------|------------|
| 0.949 | 0.105      | 0.047      |

Results on VideoAttentionTarget:

| AUC   | Dist. | AP    |
|-------|-------|-------|
| 0.938 | 0.102 | 0.905 |

The corresponding checkpoints are released (see the 🤗 HF models link above).

Getting Started

Acknowledgements

ViTGaze is based on detectron2. We use the efficient multi-head attention implemented in the xFormers library.
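For reference, a minimal sketch of how xFormers' memory-efficient attention is typically invoked (the shapes below are illustrative and this is not ViTGaze's actual attention module):

```python
# Hypothetical usage of xFormers' memory-efficient attention; the op expects
# tensors laid out as (batch, sequence, heads, head_dim).
import torch
from xformers.ops import memory_efficient_attention

B, N, H, D = 2, 1024, 12, 64                  # batch, tokens, heads, head dim
q = torch.randn(B, N, H, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, N, H, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, N, H, D, device="cuda", dtype=torch.float16)

# Drop-in replacement for softmax(q @ k^T / sqrt(D)) @ v that avoids
# materializing the full N x N attention matrix.
out = memory_efficient_attention(q, k, v)     # (B, N, H, D)
```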

Citation

If you find ViTGaze useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

@article{song2024vitgaze,
  title   = {ViTGaze: Gaze Following with Interaction Features in Vision Transformers},
  author  = {Song, Yuehao and Wang, Xinggang and Yao, Jingfeng and Liu, Wenyu and Zhang, Jinglin and Xu, Xiangmin},
  journal = {Visual Intelligence},
  volume  = {2},
  number  = {31},
  year    = {2024},
  url     = {https://doi.org/10.1007/s44267-024-00064-9}
}
