TY - GEN
T1 - Deep End2End Voxel2Voxel Prediction
AU - Tran, Du
AU - Bourdev, Lubomir
AU - Fergus, Rob
AU - Torresani, Lorenzo
AU - Paluri, Manohar
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016/12/16
Y1 - 2016/12/16
AB - Over the last few years, deep learning methods have emerged as one of the most prominent approaches to video analysis. So far, however, their most successful applications have been in video classification and detection, i.e., problems involving the prediction of a single class label or a handful of output variables per video. Furthermore, while deep networks are commonly recognized as the best models for these domains, there is a widespread perception that achieving successful results with them often requires time-consuming architecture search, manual parameter tweaking, and computationally intensive preprocessing or post-processing. In this paper, we challenge these views by presenting a deep 3D convolutional architecture trained end to end to perform voxel-level prediction, i.e., to output a variable at every voxel of the video. Most importantly, we show that the exact same architecture achieves competitive results on three widely different voxel-prediction tasks: video semantic segmentation, optical flow estimation, and video coloring. The three networks are trained from raw video without any form of preprocessing, and their outputs require no post-processing to achieve outstanding performance. Thus, they offer an efficient alternative to traditional, much more computationally expensive methods in these video domains.
UR - http://www.scopus.com/inward/record.url?scp=85010192577&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85010192577&partnerID=8YFLogxK
U2 - 10.1109/CVPRW.2016.57
DO - 10.1109/CVPRW.2016.57
M3 - Conference contribution
AN - SCOPUS:85010192577
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 402
EP - 409
BT - Proceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2016
PB - IEEE Computer Society
T2 - 29th IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2016
Y2 - 26 June 2016 through 1 July 2016
ER -