TY - GEN
T1 - Scene semantics from long-term observation of people
AU - Delaitre, Vincent
AU - Fouhey, David F.
AU - Laptev, Ivan
AU - Sivic, Josef
AU - Gupta, Abhinav
AU - Efros, Alexei A.
PY - 2012
Y1 - 2012
N2 - Our everyday objects support various tasks and can be used by people for different purposes. While object classification is a widely studied topic in computer vision, recognition of object function, i.e., what people can do with an object and how they do it, is rarely addressed. In this paper we construct a functional object description with the aim to recognize objects by the way people interact with them. We describe scene objects (sofas, tables, chairs) by associated human poses and object appearance. Our model is learned discriminatively from automatically estimated body poses in many realistic scenes. In particular, we make use of time-lapse videos from YouTube providing a rich source of common human-object interactions and minimizing the effort of manual object annotation. We show how the models learned from human observations significantly improve object recognition and enable prediction of characteristic human poses in new scenes. Results are shown on a dataset of more than 400,000 frames obtained from 146 time-lapse videos of challenging and realistic indoor scenes.
UR - http://www.scopus.com/inward/record.url?scp=84867866442&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84867866442&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-33783-3_21
DO - 10.1007/978-3-642-33783-3_21
M3 - Conference contribution
AN - SCOPUS:84867866442
SN - 9783642337826
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 284
EP - 298
BT - Computer Vision, ECCV 2012 - 12th European Conference on Computer Vision, Proceedings
T2 - 12th European Conference on Computer Vision, ECCV 2012
Y2 - 7 October 2012 through 13 October 2012
ER -