TY - CONF
T1 - Oracle performance for visual captioning
AU - Yao, Li
AU - Ballas, Nicolas
AU - Cho, Kyunghyun
AU - Smith, John R.
AU - Bengio, Yoshua
N1 - Funding Information:
The authors would like to acknowledge the support of the following agencies for research funding and computing support: IBM T.J. Watson Research, NSERC, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR. We would also like to thank the developers of Theano [29], for developing such a powerful tool for scientific computing.
PY - 2016
Y1 - 2016
AB - The task of associating images and videos with natural language descriptions has attracted a great amount of attention recently. The state-of-the-art results on some of the standard datasets have been pushed into a regime where it has become more and more difficult to make significant improvements. Instead of proposing new models, this work investigates the performance that an oracle can obtain. In order to disentangle the contribution of the visual model from that of the language model, our oracle assumes that a high-quality visual concept extractor is available and focuses only on the language part. We demonstrate the construction of such oracles on MS-COCO, YouTube2Text and LSMDC (a combination of M-VAD and MPII-MD). Surprisingly, despite the simplicity of the model and the training procedure, we show that current state-of-the-art models fall short when compared with the learned oracle. Furthermore, this suggests that current models fail to capture important visual concepts in captioning tasks.
UR - http://www.scopus.com/inward/record.url?scp=85046865379&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85046865379&partnerID=8YFLogxK
U2 - 10.5244/C.30.141
DO - 10.5244/C.30.141
M3 - Paper
AN - SCOPUS:85046865379
SP - 141.1-141.13
T2 - 27th British Machine Vision Conference, BMVC 2016
Y2 - 19 September 2016 through 22 September 2016
ER -