TY - GEN
T1 - Swoosh! Rattle! Thump! - Actions that Sound
AU - Gandhi, Dhiraj
AU - Gupta, Abhinav
AU - Pinto, Lerrel
N1 - Funding Information:
Acknowledgments: We thank DARPA MCS, ONR MURI and ONR Young Investigator award for funding this work. We also thank Xiaolong Wang and Olivia Watkins for insightful comments and discussions.
Publisher Copyright:
© 2020, MIT Press Journals. All rights reserved.
PY - 2020
Y1 - 2020
N2 - Truly intelligent agents need to capture the interplay of all their senses to build a rich physical understanding of their world. In robotics, we have seen tremendous progress in using visual and tactile perception; however, we have often ignored a key sense: sound. This is primarily due to the lack of data that captures the interplay of action and sound. In this work, we perform the first large-scale study of the interactions between sound and robotic action. To do this, we create the largest available sound-action-vision dataset with 15,000 interactions on 60 objects using our robotic platform Tilt-Bot. By tilting objects and allowing them to crash into the walls of a robotic tray, we collect rich four-channel audio information. Using this data, we explore the synergies between sound and action and present three key insights. First, sound is indicative of fine-grained object class information, e.g., sound can differentiate a metal screwdriver from a metal wrench. Second, sound also contains information about the causal effects of an action, i.e., given the sound produced, we can predict what action was applied to the object. Finally, object representations derived from audio embeddings are indicative of implicit physical properties. We demonstrate that on previously unseen objects, audio embeddings generated through interactions can predict forward models 24% better than passive visual embeddings.
AB - Truly intelligent agents need to capture the interplay of all their senses to build a rich physical understanding of their world. In robotics, we have seen tremendous progress in using visual and tactile perception; however, we have often ignored a key sense: sound. This is primarily due to the lack of data that captures the interplay of action and sound. In this work, we perform the first large-scale study of the interactions between sound and robotic action. To do this, we create the largest available sound-action-vision dataset with 15,000 interactions on 60 objects using our robotic platform Tilt-Bot. By tilting objects and allowing them to crash into the walls of a robotic tray, we collect rich four-channel audio information. Using this data, we explore the synergies between sound and action and present three key insights. First, sound is indicative of fine-grained object class information, e.g., sound can differentiate a metal screwdriver from a metal wrench. Second, sound also contains information about the causal effects of an action, i.e., given the sound produced, we can predict what action was applied to the object. Finally, object representations derived from audio embeddings are indicative of implicit physical properties. We demonstrate that on previously unseen objects, audio embeddings generated through interactions can predict forward models 24% better than passive visual embeddings.
UR - http://www.scopus.com/inward/record.url?scp=85093960505&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85093960505&partnerID=8YFLogxK
U2 - 10.15607/RSS.2020.XVI.002
DO - 10.15607/RSS.2020.XVI.002
M3 - Conference contribution
AN - SCOPUS:85093960505
SN - 9780992374761
T3 - Robotics: Science and Systems
BT - Robotics: Science and Systems XVI
A2 - Toussaint, Marc
A2 - Bicchi, Antonio
A2 - Hermans, Tucker
PB - MIT Press Journals
T2 - 16th Robotics: Science and Systems, RSS 2020
Y2 - 12 July 2020 through 16 July 2020
ER -