TY - JOUR
T1 - IntPhys 2019
T2 - A Benchmark for Visual Intuitive Physics Understanding
AU - Riochet, Ronan
AU - Castro, Mario Ynocente
AU - Bernard, Mathieu
AU - Lerer, Adam
AU - Fergus, Rob
AU - Izard, Veronique
AU - Dupoux, Emmanuel
PY - 2022/9/1
Y1 - 2022/9/1
AB - In order to reach human performance on complex visual tasks, artificial systems need to incorporate a significant amount of understanding of the world in terms of macroscopic objects, movements, forces, etc. Inspired by work on intuitive physics in infants, we propose an evaluation benchmark which diagnoses how much a given system understands about physics by testing whether it can tell apart well-matched videos of possible versus impossible events constructed with a game engine. The test requires systems to compute a physical plausibility score over an entire video. To prevent perceptual biases, the dataset is made of pixel-matched quadruplets of videos, forcing systems to focus on high-level temporal dependencies between frames rather than pixel-level details. We then describe two Deep Neural Network systems aimed at learning intuitive physics in an unsupervised way, using only physically possible videos. The systems are trained with a future semantic mask prediction objective and tested on the possible versus impossible discrimination task. The analysis of their results compared to human data gives novel insights into the potential and limitations of next-frame prediction architectures.
UR - http://www.scopus.com/inward/record.url?scp=85135597678&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85135597678&partnerID=8YFLogxK
U2 - 10.1109/TPAMI.2021.3083839
DO - 10.1109/TPAMI.2021.3083839
M3 - Article
C2 - 34038357
AN - SCOPUS:85135597678
SN - 0162-8828
VL - 44
SP - 5016
EP - 5025
JO - IEEE Transactions on Pattern Analysis and Machine Intelligence
JF - IEEE Transactions on Pattern Analysis and Machine Intelligence
IS - 9
ER -