TY - GEN
T1 - EPIC-KITCHENS VISOR Benchmark: VIdeo Segmentations and Object Relations
AU - Darkhalil, Ahmad
AU - Shan, Dandan
AU - Zhu, Bin
AU - Ma, Jian
AU - Kar, Amlan
AU - Higgins, Richard
AU - Fidler, Sanja
AU - Fouhey, David
AU - Damen, Dima
N1 - Funding Information:
We gratefully acknowledge valuable support from: Michael Wray for revising the EPIC-KITCHENS-100 classes; Seung Wook Kim and Marko Boben for technical support to TORAS; Srdjan Delic for quality checks, particularly on the Test set; several members of the MaVi group at Bristol for quality checking: Toby Perrett, Michael Wray, Dena Bazazian, Adriano Fragomeni, Kevin Flanagan, Daniel Whettam, Alexandros Stergiou, Jacob Chalk, Chiara Plizzari and Zhifan Zhu. Annotations were funded by charitable unrestricted donations to the University of Bristol from Procter and Gamble and DeepMind. Research at the University of Bristol is supported by the UKRI Engineering and Physical Sciences Research Council (EPSRC) Doctoral Training Program (DTP), EPSRC Fellowship UMPIRE (EP/T004991/1) and EPSRC Program Grant Visual AI (EP/T028572/1). We acknowledge the use of the EPSRC-funded Tier 2 facility, JADE, and the University of Bristol's Blue Crystal 4 facility. Research at the University of Michigan is based upon work supported by the National Science Foundation under Grant No. 2006619. Research at the University of Toronto is in part sponsored by NSERC. S.F. also acknowledges support through the Canada CIFAR AI Chair program.
Publisher Copyright:
© 2022 Neural information processing systems foundation. All rights reserved.
PY - 2022
Y1 - 2022
N2 - We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, which comes with a new set of challenges not encountered in current video segmentation datasets. Specifically, we need to ensure both short- and long-term consistency of pixel-level annotations as objects undergo transformative interactions, e.g. an onion is peeled, diced and cooked - where we aim to obtain accurate pixel-level annotations of the peel, onion pieces, chopping board, knife, pan, as well as the acting hands. VISOR introduces an annotation pipeline, AI-powered in parts, for scalability and quality. In total, we publicly release 272K manual semantic masks of 257 object classes, 9.9M interpolated dense masks, 67K hand-object relations, covering 36 hours of 179 untrimmed videos. Along with the annotations, we introduce three challenges in video object segmentation, interaction understanding and long-term reasoning. For data, code and leaderboards: http://epic-kitchens.github.io/VISOR.
AB - We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, which comes with a new set of challenges not encountered in current video segmentation datasets. Specifically, we need to ensure both short- and long-term consistency of pixel-level annotations as objects undergo transformative interactions, e.g. an onion is peeled, diced and cooked - where we aim to obtain accurate pixel-level annotations of the peel, onion pieces, chopping board, knife, pan, as well as the acting hands. VISOR introduces an annotation pipeline, AI-powered in parts, for scalability and quality. In total, we publicly release 272K manual semantic masks of 257 object classes, 9.9M interpolated dense masks, 67K hand-object relations, covering 36 hours of 179 untrimmed videos. Along with the annotations, we introduce three challenges in video object segmentation, interaction understanding and long-term reasoning. For data, code and leaderboards: http://epic-kitchens.github.io/VISOR.
UR - http://www.scopus.com/inward/record.url?scp=85141420276&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85141420276&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85141420276
T3 - Advances in Neural Information Processing Systems
BT - Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022
A2 - Koyejo, S.
A2 - Mohamed, S.
A2 - Agarwal, A.
A2 - Belgrave, D.
A2 - Cho, K.
A2 - Oh, A.
PB - Neural information processing systems foundation
T2 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022
Y2 - 28 November 2022 through 9 December 2022
ER -