Latent video transformer

Ruslan Rakhimov, Denis Volkhonskiy, Alexey Artemov, Denis Zorin, Evgeny Burnaev

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The video generation task can be formulated as the prediction of future video frames given some past frames. Recent generative models for video face the problem of high computational requirements: some require up to 512 Tensor Processing Units for parallel training. In this work, we address this problem by modeling the dynamics in a latent space. After transforming frames into the latent space, our model predicts the latent representations of the next frames in an autoregressive manner. We demonstrate the performance of our approach on the BAIR Robot Pushing and Kinetics-600 datasets. The approach reduces the training requirement to 8 Graphics Processing Units while maintaining comparable generation quality.
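The pipeline described in the abstract can be sketched in miniature. The snippet below is a hypothetical illustration, not the paper's implementation: the real model uses a learned encoder/decoder and an autoregressive transformer over discrete latents, whereas here a fixed random codebook stands in for the encoder and a simple frequency-based rule stands in for the transformer. All names (`CODEBOOK`, `encode_frame`, `predict_next_codes`) are invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in codebook: 8 discrete codes, each a 4-dim embedding.
CODEBOOK = rng.normal(size=(8, 4))

def encode_frame(frame_patches):
    """Quantize each 4-dim patch vector to the index of its nearest code."""
    d = np.linalg.norm(frame_patches[:, None, :] - CODEBOOK[None, :, :], axis=-1)
    return d.argmin(axis=1)                      # (num_patches,) integer codes

def decode_codes(codes):
    """Reconstruct patch vectors from code indices."""
    return CODEBOOK[codes]

def predict_next_codes(past_codes):
    """Toy autoregressive predictor: for each patch position, emit the
    most frequent code observed at that position in the past frames.
    (The paper uses a transformer here instead.)"""
    stacked = np.stack(past_codes)               # (T, num_patches)
    return np.array([
        np.bincount(stacked[:, i], minlength=len(CODEBOOK)).argmax()
        for i in range(stacked.shape[1])
    ])

# Three past "frames", each flattened into 16 patches of dimension 4.
past_frames = [rng.normal(size=(16, 4)) for _ in range(3)]
past_codes = [encode_frame(f) for f in past_frames]

# Predict the next frame's latent codes, then decode back to patch space.
next_codes = predict_next_codes(past_codes)      # (16,)
next_frame = decode_codes(next_codes)            # (16, 4)
```

The point of the sketch is the structure, not the components: all dynamics modeling happens on small integer code grids rather than raw pixels, which is what lets the paper's model train on far less hardware than pixel-space video models.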

Original language: English (US)
Title of host publication: VISAPP
Editors: Giovanni Maria Farinella, Petia Radeva, Jose Braz, Kadi Bouatouch
Publisher: SciTePress
Pages: 101-112
Number of pages: 12
ISBN (Electronic): 9789897584886
State: Published - 2021
Event: 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2021 - Virtual, Online
Duration: Feb 8 2021 - Feb 10 2021

Publication series

Name: VISIGRAPP 2021 - Proceedings of the 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications
Volume: 5

Conference

Conference: 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2021
City: Virtual, Online
Period: 2/8/21 - 2/10/21

Keywords

  • Deep learning
  • Generative adversarial networks
  • Video generation

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Computer Vision and Pattern Recognition
  • Computer Science Applications
  • Computer Graphics and Computer-Aided Design
