TY - GEN
T1 - ActFormer: Scalable Collaborative Perception via Active Queries
T2 - 2024 IEEE International Conference on Robotics and Automation, ICRA 2024
AU - Huang, Suozhi
AU - Zhang, Juexiao
AU - Li, Yiming
AU - Feng, Chen
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Collaborative perception leverages rich visual observations from multiple robots to extend a single robot's perception ability beyond its field of view. Many prior works receive messages broadcast from all collaborators, leading to a scalability challenge when dealing with a large number of robots and sensors. In this work, we aim to address scalable camera-based collaborative perception with a Transformer-based architecture. Our key idea is to enable a single robot to intelligently discern the relevance of the collaborators and their associated cameras according to a learned spatial prior. This proactive understanding of the visual features' relevance does not require the transmission of the features themselves, enhancing both communication and computation efficiency. Specifically, we present ActFormer, a Transformer that learns bird's eye view (BEV) representations by using predefined BEV queries to interact with multi-robot multi-camera inputs. Each BEV query can actively select relevant cameras for information aggregation based on pose information, instead of interacting with all cameras indiscriminately. Experiments on the V2X-Sim dataset demonstrate that ActFormer improves the detection performance from 29.89% to 45.15% in terms of AP@0.7 with about 50% fewer queries, showcasing the effectiveness of ActFormer in multi-agent collaborative 3D object detection.
UR - http://www.scopus.com/inward/record.url?scp=85197598997&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85197598997&partnerID=8YFLogxK
U2 - 10.1109/ICRA57147.2024.10610997
DO - 10.1109/ICRA57147.2024.10610997
M3 - Conference contribution
AN - SCOPUS:85197598997
T3 - Proceedings - IEEE International Conference on Robotics and Automation
SP - 14716
EP - 14723
BT - 2024 IEEE International Conference on Robotics and Automation, ICRA 2024
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 13 May 2024 through 17 May 2024
ER -