New video-based approach to 3D motion capture makes virtual avatars more realistic than ever
With Video Inference for Body Pose and Shape Estimation (VIBE), scientists at the Max Planck Institute for Intelligent Systems have developed a neural network that makes video-based 3D motion capture more accurate, faster, and less expensive
Tübingen. June 17, 2020 – A team of scientists at the Max Planck Institute for Intelligent Systems (MPI-IS) in Germany has developed VIBE, an algorithmic model that enables more detailed and accurate estimates of 3D human motion from video than was previously possible. They describe the model in the recently published paper, “VIBE: Video Inference for Body Pose and Shape Estimation”, which they are presenting today at the Conference on Computer Vision and Pattern Recognition (CVPR). One of the most competitive conferences in the field, CVPR 2020 kicked off on June 14 and is being held online until June 18.
“Previous frameworks do a good job of estimating 3D human pose and shape from a single image. But video-based models have not been able to mimic human motion realistically because of limited training data,” said Muhammed Kocabas, a Ph.D. student in the Perceiving Systems Department at the MPI-IS and the paper’s co-author. “With VIBE, we have successfully addressed this challenge.”
VIBE is a learning-based framework that draws on AMASS, a large-scale motion capture dataset developed at MPI-IS that can be used for animation, visualization, and generating training data for deep learning. The scientists trained the VIBE algorithm on an NVIDIA GPU not only to estimate 3D human motion, but also to distinguish real movements from implausible ones; here, AMASS serves as the source of real human motion. Given a single video of a moving person, the model first extracts image features using a convolutional neural network (CNN), a type of network widely used in machine learning to recognize and classify images. These features are then processed by a recurrent neural network (RNN), a network capable of classifying temporal sequences and thus of capturing the sequential nature of human motion. The result is a smooth, realistic prediction of human pose, shape, and motion.
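The pipeline described above (per-frame CNN features, a recurrent encoder over the frame sequence, a regressor to body-model parameters, and a motion discriminator trained against real AMASS motion) can be sketched in miniature. The NumPy sketch below uses untrained random weights and toy dimensions purely to illustrate the data flow; every function name and size is illustrative, not taken from VIBE's actual architecture or code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cnn_features(frames, dim=256):
    """Stand-in for the pretrained image CNN: a fixed random
    projection of each frame instead of learned convolutions."""
    T = frames.shape[0]
    flat = frames.reshape(T, -1)                      # one row per frame
    W = rng.standard_normal((flat.shape[1], dim)) * 0.01
    return flat @ W                                   # (T, dim) features

def gru_encode(feats, hidden=64):
    """Minimal GRU run over the per-frame features, so each output
    carries temporal context from earlier frames in the video."""
    D = feats.shape[1]
    Wz, Wr, Wh = (rng.standard_normal((D + hidden, hidden)) * 0.01
                  for _ in range(3))
    h, out = np.zeros(hidden), []
    for x in feats:
        xh = np.concatenate([x, h])
        z = sigmoid(xh @ Wz)                          # update gate
        r = sigmoid(xh @ Wr)                          # reset gate
        h_new = np.tanh(np.concatenate([x, r * h]) @ Wh)
        h = (1 - z) * h + z * h_new
        out.append(h)
    return np.stack(out)                              # (T, hidden)

def regress_body_params(states):
    """Per-frame regression to body-model parameters, here sized
    like SMPL: 72 pose values (24 joints x 3) plus 10 shape values."""
    W = rng.standard_normal((states.shape[1], 82)) * 0.01
    params = states @ W
    return params[:, :72], params[:, 72:]             # pose, shape

def motion_discriminator(pose_seq):
    """Stand-in for the motion discriminator, which in VIBE is
    trained against real AMASS sequences; here an untrained scorer
    returning a plausibility probability in (0, 1)."""
    W = rng.standard_normal((pose_seq.shape[1], 1)) * 0.01
    return float(sigmoid(pose_seq @ W).mean())

# A 16-frame toy "video" of 32x32 RGB frames.
video = rng.random((16, 32, 32, 3))
pose, shape = regress_body_params(gru_encode(cnn_features(video)))
score = motion_discriminator(pose)
print(pose.shape, shape.shape)                        # (16, 72) (16, 10)
```

In training, the discriminator's score would penalize pose sequences that do not look like real AMASS motion, which is how the model learns to prefer plausible movement; in this untrained sketch the score is meaningless and only demonstrates the interface.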
“What sets VIBE apart is its ability to detect a human subject’s entire range of action and motion in detail, including the way limbs and extremities move,” says Nikos Athanasiou, who is also a Ph.D. student in the Perceiving Systems Department and co-author of the paper. “From a single video, VIBE can produce realistic human motion very quickly, without any additional effort.”
With VIBE, 3D motion capture can be easier, faster, and much less expensive
VIBE could have a decisive impact on 3D animation. While high-quality virtual movement has long been a fixture of animated film and video games, producing realistic human shapes and poses generally involves a great deal of handcrafting: annotating a few seconds of video takes graphic artists and technicians several hours and requires an elaborate set-up of sensors and cameras. VIBE promises to make that process easier, faster, and much less expensive.
“Understanding human behavior – how people move about in a scene, for example – is a fundamental task in the field of computer vision,” says Michael J. Black, Director at the Max Planck Institute for Intelligent Systems in Tübingen and head of the Perceiving Systems Department. “The VIBE model helps improve this understanding, and it shows promise for a broad range of applications, from augmented reality to autonomous driving, robotics, and medicine. More accurate 3D predictions of human motion will pave the way for computers to work more collaboratively with humans.”