End-to-end contextual perception and prediction with interaction transformer

Lingyun Luke Li, Bin Yang, Ming Liang, Wenyuan Zeng, Mengye Ren, Sean Segal, Raquel Urtasun

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

In this paper, we tackle the problem of detecting objects in 3D and forecasting their future motion in the context of self-driving. Towards this goal, we design a novel approach that explicitly takes into account the interactions between actors. To capture their spatio-temporal dependencies, we propose a recurrent neural network with a novel Transformer [1] architecture, which we call the Interaction Transformer. Importantly, our model can be trained end-to-end and runs in real time. We validate our approach on two challenging real-world datasets: ATG4D [2] and nuScenes [3]. We show that our approach can outperform the state of the art on both datasets. In particular, we significantly improve the social compliance between the estimated future trajectories, resulting in far fewer collisions between the predicted actors.

Original language: English (US)
Title of host publication: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 5784-5791
Number of pages: 8
ISBN (Electronic): 9781728162126
State: Published - Oct 24 2020
Event: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020 - Las Vegas, United States
Duration: Oct 24 2020 to Jan 24 2021

Publication series

Name: IEEE International Conference on Intelligent Robots and Systems
ISSN (Print): 2153-0858
ISSN (Electronic): 2153-0866

Conference

Conference: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020
Country/Territory: United States
City: Las Vegas
Period: 10/24/20 to 1/24/21

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Software
  • Computer Vision and Pattern Recognition
  • Computer Science Applications
