TY - GEN
T1 - The curious robot: Learning visual representations via physical interactions
T2 - 14th European Conference on Computer Vision, ECCV 2016
AU - Pinto, Lerrel
AU - Gandhi, Dhiraj
AU - Han, Yuanfeng
AU - Park, Yong-Lae
AU - Gupta, Abhinav
N1 - Publisher Copyright:
© Springer International Publishing AG 2016.
PY - 2016
Y1 - 2016
N2 - What is the right supervisory signal to train visual representations? Current approaches in computer vision use category labels from datasets such as ImageNet to train ConvNets. However, in the case of biological agents, visual representation learning does not require millions of semantic labels. We argue that biological agents use physical interactions with the world to learn visual representations, unlike current vision systems, which just use passive observations (images and videos downloaded from the web). For example, babies push objects, poke them, put them in their mouth and throw them to learn representations. Towards this goal, we build one of the first systems on a Baxter platform that pushes, pokes, grasps and observes objects in a tabletop environment. It uses four different types of physical interactions to collect more than 130K datapoints, with each datapoint providing supervision to a shared ConvNet architecture, allowing us to learn visual representations. We show the quality of the learned representations by observing neuron activations and performing nearest-neighbor retrieval on this learned representation. Quantitatively, we evaluate our learned ConvNet on image classification tasks and show improvements compared to learning without external data. Finally, on the task of instance retrieval, our network outperforms the ImageNet network on recall@1 by 3%.
AB - What is the right supervisory signal to train visual representations? Current approaches in computer vision use category labels from datasets such as ImageNet to train ConvNets. However, in the case of biological agents, visual representation learning does not require millions of semantic labels. We argue that biological agents use physical interactions with the world to learn visual representations, unlike current vision systems, which just use passive observations (images and videos downloaded from the web). For example, babies push objects, poke them, put them in their mouth and throw them to learn representations. Towards this goal, we build one of the first systems on a Baxter platform that pushes, pokes, grasps and observes objects in a tabletop environment. It uses four different types of physical interactions to collect more than 130K datapoints, with each datapoint providing supervision to a shared ConvNet architecture, allowing us to learn visual representations. We show the quality of the learned representations by observing neuron activations and performing nearest-neighbor retrieval on this learned representation. Quantitatively, we evaluate our learned ConvNet on image classification tasks and show improvements compared to learning without external data. Finally, on the task of instance retrieval, our network outperforms the ImageNet network on recall@1 by 3%.
UR - http://www.scopus.com/inward/record.url?scp=84990833502&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84990833502&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-46475-6_1
DO - 10.1007/978-3-319-46475-6_1
M3 - Conference contribution
AN - SCOPUS:84990833502
SN - 9783319464749
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 3
EP - 18
BT - Computer Vision - 14th European Conference, ECCV 2016, Proceedings
A2 - Leibe, Bastian
A2 - Sebe, Nicu
A2 - Welling, Max
A2 - Matas, Jiri
PB - Springer Verlag
Y2 - 8 October 2016 through 16 October 2016
ER -