
Multi-modal Graphormer for Action Recognition in Egocentric Videos


Description: Given an egocentric video, the task of action recognition is to classify the action being performed in the video. The actions a person performs are closely tied to several modalities: hand-object interaction, eye-hand coordination, the motion of objects, hands, and gaze, and the semantics of the scene. Previous methods have used graphs to represent some of these modalities (e.g., objects and motion) and applied graph neural network-based methods to recognize actions. Recent work has shown the power of applying transformers to graphs. In this project, we will use graphs to represent all four modalities and develop a Graphormer-based method for action recognition in egocentric videos.
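
As a rough illustration of the idea (not part of the project specification; all node definitions, feature sources, and dimensions below are assumptions), the four modalities could be encoded as one graph per frame, with hands, detected objects, the gaze point, and a scene node as vertices and modality-specific edges:

import torch

# Hypothetical per-frame graph: 2 hand nodes, K detected object nodes,
# 1 gaze node, and 1 scene node. Real features would come from object
# detectors, hand trackers, and a gaze tracker; here they are random
# placeholders, and the feature dimension is illustrative only.
NUM_OBJECTS, FEAT_DIM = 4, 256
num_nodes = 2 + NUM_OBJECTS + 1 + 1            # hands + objects + gaze + scene
node_feats = torch.randn(num_nodes, FEAT_DIM)  # stand-in for real features

hands = range(0, 2)
objects = range(2, 2 + NUM_OBJECTS)
gaze, scene = 2 + NUM_OBJECTS, 2 + NUM_OBJECTS + 1

# Modality-specific edges (symmetric adjacency):
#   hand-object interaction -> hand-object edges
#   eye-hand coordination   -> gaze-hand edges
#   scene semantics         -> scene node connected to every other node
# Motion could be modelled by linking the same entity across frames
# (temporal edges), omitted here for brevity.
adj = torch.zeros(num_nodes, num_nodes, dtype=torch.bool)
for h in hands:
    for o in objects:
        adj[h, o] = adj[o, h] = True
    adj[gaze, h] = adj[h, gaze] = True
adj[scene, :] = adj[:, scene] = True
adj[scene, scene] = False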

Goal: Model the four modalities with graphs and develop a Graphormer-based method for action recognition.
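
For concreteness, a minimal Graphormer-style encoder in PyTorch might look as follows. This is only a sketch under strong assumptions: GraphormerLayer, MiniGraphormer, and shortest_path_dist are hypothetical names, the hyperparameters are illustrative, and only two ingredients of Graphormer (degree-based centrality encoding and the shortest-path spatial attention bias; Ying et al., 2021) are reproduced. It is not the project's prescribed implementation.

import torch
import torch.nn as nn

def shortest_path_dist(adj, max_dist):
    # Floyd-Warshall shortest-path distances, capped at max_dist
    # (unreachable node pairs simply stay in the max_dist bucket).
    n = adj.shape[0]
    dist = torch.full((n, n), max_dist)
    dist[adj] = 1
    dist.fill_diagonal_(0)
    for k in range(n):
        dist = torch.minimum(dist, dist[:, k:k + 1] + dist[k:k + 1, :])
    return dist

class GraphormerLayer(nn.Module):
    # Self-attention over graph nodes with a learned per-head additive
    # bias indexed by shortest-path distance (spatial encoding).
    def __init__(self, dim, heads, max_dist):
        super().__init__()
        self.heads = heads
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_bias = nn.Embedding(max_dist + 1, heads)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, dist):                 # x: (B, N, dim), dist: (N, N)
        B, N, _ = x.shape
        bias = self.spatial_bias(dist).permute(2, 0, 1)        # (heads, N, N)
        bias = bias.unsqueeze(0).expand(B, -1, -1, -1)
        bias = bias.reshape(B * self.heads, N, N)              # additive mask
        h = self.norm1(x)
        h, _ = self.attn(h, h, h, attn_mask=bias, need_weights=False)
        x = x + h
        return x + self.ffn(self.norm2(x))

class MiniGraphormer(nn.Module):
    # Degree (centrality) encoding + stacked Graphormer layers + mean
    # pooling into an action classifier head.
    def __init__(self, dim=256, heads=8, depth=2, max_dist=5,
                 max_degree=16, num_classes=10):
        super().__init__()
        self.max_dist = max_dist
        self.degree_emb = nn.Embedding(max_degree + 1, dim)
        self.layers = nn.ModuleList(
            GraphormerLayer(dim, heads, max_dist) for _ in range(depth))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, node_feats, adj):         # (N, dim), (N, N) bool
        deg = adj.sum(-1).clamp(max=self.degree_emb.num_embeddings - 1)
        x = (node_feats + self.degree_emb(deg)).unsqueeze(0)
        dist = shortest_path_dist(adj, self.max_dist)
        for layer in self.layers:
            x = layer(x, dist)
        return self.head(x.mean(dim=1))         # (1, num_classes) logits

# Usage with the toy graph from the previous sketch:
#   logits = MiniGraphormer(dim=FEAT_DIM)(node_feats, adj)

A spatio-temporal extension would stack per-frame graphs with temporal edges, and the mean pooling could be replaced by the virtual-node readout used in the Graphormer paper.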

Supervisor: Lei Shi

Distribution: 70% implementation, 20% analysis, 10% literature review

Requirements: Strong programming skills, experience in deep learning, familiarity with PyTorch, and knowledge of transformers.

Literature:

Ying, Chengxuan, et al. "Do Transformers Really Perform Badly for Graph Representation?" Advances in Neural Information Processing Systems 34 (2021): 28877-28888.

Kazakos, Evangelos, et al. "With a Little Help from My Temporal Context: Multimodal Egocentric Action Recognition." arXiv preprint arXiv:2111.01024 (2021).

Herzig, Roei, et al. "Spatio-Temporal Action Graph Networks." Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.