recent news
•I will join Northeastern University as Professor and President Joseph E. Aoun Chair in Fall 2025.
✓I have several openings for postdocs, visiting researchers, and PhD students to work on exciting new projects in embodied AI, video understanding, and multimodal learning. Prospective applicants should contact me with a CV and a one-page research statement.
•Our state-space video model BIMBA won first place in the EgoSchema Challenge at CVPR 2025.
•Three of our recently published articles received Distinguished Paper Awards at the CVPR 2025 EgoVis Workshop:
✓Video ReCap: Recursive Captioning of Hour-Long Videos,
with Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, and Gedas Bertasius. Originally published at CVPR 2024.
✓Ego4D Goal-Step: Toward Hierarchical Understanding of Procedural Activities,
with Yale Song, Eugene Byrne, Tushar Nagarajan, Huiyu Wang, and Miguel Martin. Originally published at NeurIPS 2023.
✓HierVL: Learning Hierarchical Video-Language Embeddings,
with Kumar Ashutosh, Rohit Girdhar, and Kristen Grauman. Originally published at CVPR 2023.
These awards recognize publications that have advanced egocentric vision through original and innovative contributions.
•Two papers presented at CVPR 2025:
✓BIMBA: Selective-Scan Compression for Long-Range Video Question Answering,
with Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang, and Gedas Bertasius.
✓VITED: Video Temporal Evidence Distillation,
with Yujie Lu, Yale Song, William Wang, and Tushar Nagarajan.
•One paper presented at ECCV 2024:
✓4Diff: 3D-Aware Diffusion Model for Third-to-First Viewpoint Translation,
with Feng Cheng, Mi Luo, Huiyu Wang, Alex Dimakis, Gedas Bertasius, and Kristen Grauman.
•Four papers presented at CVPR 2024:
✓Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives,
with 100 co-authors! Accepted as oral (<1% accept rate).
✓Learning to Segment Referred Objects from Narrated Egocentric Videos,
with Yuhan Shen, Huiyu Wang, Xitong Yang, Matt Feiszli, Ehsan Elhamifar, and Effrosyni Mavroudi. Accepted as oral (<1% accept rate).
✓Step Differences in Instructional Video,
with Tushar Nagarajan.
✓Video ReCap: Recursive Captioning of Hour-Long Videos,
with Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, and Gedas Bertasius.
research
My research is in computer vision and multimodal (video-audio-language) learning. I aim to develop perceptual AI agents that can assist humans in their daily activities by understanding their behavior from video and communicating through language and actions. I am particularly motivated to apply this research to AI/AR coaching, episodic memory retrieval, and human-robot interaction.