Westlake University Researchers Propose ‘SimVP,’ an AI-Based Video Prediction Model Built Entirely on CNNs and Trained with MSE Loss in an End-to-End Fashion

Video prediction is a challenging but critical task to enable intelligent machines to predict the future and ultimately perform better in the chaotic real world. The computer vision community has made significant efforts to develop effective video prediction models and tested these models in various application areas such as climate change, human movement prediction, and traffic flow prediction, to name a few.
The progress made in recent years is mainly due to the development of newer networks that combine the basic building blocks of modern deep learning architectures, i.e. recurrent, convolutional and attentional layers.

However, most video prediction architectures extract spatiotemporal features using elaborate techniques such as adversarial training, teacher-student distillation, and optical flow. Furthermore, most frameworks rely on complex combinations of recurrent or attention units with convolutional modules. Recurrent and attention modules are computationally less efficient than convolutions in terms of speed and memory usage. As a result, modern video prediction models are difficult to scale to large datasets, and clear performance gains over previous methods remain elusive.

Recent work proposes to simplify the architecture of video prediction models by relying only on convolutional layers. The proposed model, called Simple Video Prediction (SimVP), consists of an encoder, a translator and a decoder (shown in Figure 2 of the paper).
The first two modules decouple spatiotemporal feature learning by first extracting spatial information (encoder) and then integrating this knowledge to learn temporal evolution (translator). Finally, the decoder integrates the processed features to predict future frames. In terms of the low-level implementation of each module, the encoder is a simple stack of standard convolutional blocks consisting of convolution, normalization, and activation layers; the translator uses Inception layers to better capture information along the time axis; the decoder uses transposed convolution, normalization, and leaky ReLU layers to predict the next frames.
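To make the encoder, translator, and decoder pipeline more concrete, the following sketch only tracks tensor shapes through the three stages; it does not build a network. The function name `simvp_shapes`, the hidden width of 64, the two stride-2 downsampling steps, the 3x3 kernels, and the choice to fold time into the batch axis for the encoder and into the channel axis for the translator are all illustrative assumptions, not details stated in this article. The example input (10 frames of 64x64 grayscale images) is meant to resemble a typical Moving MNIST setup.

```python
# Shape bookkeeping only; every hyperparameter here is an assumption
# chosen for illustration, not a value taken from the SimVP paper.

def conv_out(size, kernel, stride, padding):
    """Spatial output size of a standard convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding, output_padding=0):
    """Spatial output size of a transposed convolution."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def simvp_shapes(batch, frames, channels, height, width, hidden=64, n_down=2):
    """Return the (assumed) tensor shape after each stage."""
    shapes = {"input": (batch, frames, channels, height, width)}

    # Encoder: each frame is treated as an independent image, so the
    # time axis is folded into the batch axis. Stride-2 convolutions
    # halve the spatial resolution n_down times.
    h, w = height, width
    for _ in range(n_down):
        h = conv_out(h, kernel=3, stride=2, padding=1)
        w = conv_out(w, kernel=3, stride=2, padding=1)
    shapes["encoder"] = (batch * frames, hidden, h, w)

    # Translator: frames are stacked along the channel axis so that
    # purely 2D (Inception-style) convolutions can mix information
    # across time. Spatial size is unchanged.
    shapes["translator"] = (batch, frames * hidden, h, w)

    # Decoder: transposed convolutions undo the downsampling,
    # again frame by frame.
    for _ in range(n_down):
        h = deconv_out(h, kernel=3, stride=2, padding=1, output_padding=1)
        w = deconv_out(w, kernel=3, stride=2, padding=1, output_padding=1)
    shapes["decoder"] = (batch * frames, channels, h, w)
    shapes["output"] = (batch, frames, channels, h, w)
    return shapes


if __name__ == "__main__":
    for stage, shape in simvp_shapes(16, 10, 1, 64, 64).items():
        print(f"{stage:>10}: {shape}")
```

Under these assumptions the predicted frames come out with the same shape as the input frames, which is what allows a pixel-wise loss such as MSE to be applied directly between prediction and ground truth.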

The new architecture performs on par with or better than most state-of-the-art video prediction models on popular benchmarks such as Moving MNIST, TrafficBJ, and Human3.6. SimVP achieves strong results on reconstruction metrics (e.g., MSE, SSIM) while reducing training time on Moving MNIST by a factor of five.
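For readers unfamiliar with the MSE metric mentioned above, which is also the loss SimVP is trained with, here is a minimal sketch of how it can be computed between a predicted frame and the ground-truth frame. The helper name `mse`, the flattened-list representation of a frame, and the averaging over all pixels are assumptions made for illustration; benchmark papers sometimes use a different normalization.

```python
def mse(pred, target):
    """Mean squared error between two equally sized sequences of pixel values."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# A toy 2x2 "frame", flattened to four pixel intensities in [0, 1].
predicted = [0.0, 0.5, 1.0, 0.5]
actual = [0.0, 1.0, 1.0, 0.0]

print(mse(predicted, actual))  # prints 0.125
```

A lower value means the predicted frame is closer to the real one pixel by pixel; SSIM, by contrast, compares local structure rather than raw intensity differences.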

SimVP is also tested on an unsupervised domain matching task to investigate how well it transfers across datasets (results shown in Table 6). The model was first trained on the KITTI dataset (a mobile robotics and autonomous driving dataset) and then evaluated on the Caltech Pedestrian dataset. Despite the challenging three-dimensional nonlinear dynamics of various moving objects, SimVP still manages to outperform previous methods. According to the authors, however, in this more difficult scenario there is still room to improve the reconstruction of the generated frames, especially of the objects.

In summary, this paper proposed a fully convolutional architecture for video prediction called SimVP. The new architecture achieves state-of-the-art results in several popular benchmarks and is computationally more efficient than competitors in terms of training speed.

Check out the paper and the code. All credit for this research goes to the researchers on this project. Also, don’t forget to join our Reddit page and Discord channel, where we share the latest AI research news, great AI projects, and more.

Lorenzo Brigato is a post-doctoral researcher at the ARTORG center, a research organization affiliated with the University of Bern, and currently works on the application of AI in health and nutrition. He holds a Ph.D. in Computer Science from Sapienza University of Rome (Italy). His Ph.D. thesis focused on image classification problems with data distributions with sparse samples and labels.

