Making the virtual world even more realistic
The MPI-IS gains an expert for capture and synthesis
Capturing the real world and synthesizing and modeling people, objects, and rooms so realistically that a viewer no longer notices the difference – Justus Thies has set himself ambitious goals for the coming five years. His research has already made a significant contribution to making video content more photorealistic. This comes with a huge responsibility: as facial reenactment continues to improve, deep fakes become increasingly difficult to detect. Drawing on his knowledge of the creation process, Thies therefore places a strong emphasis on designing advanced forgery detection algorithms that recognize and flag fake video content automatically and reliably.
Tübingen – Justus Thies will start a position as a Max Planck Research Group Leader on April 1, 2021. His “Neural Capture & Synthesis” group will be situated at the Tübingen site of the Max Planck Institute for Intelligent Systems (MPI-IS). Previously a postdoctoral scientist at the Technical University of Munich (TUM), Thies plans to push the boundaries in the research field of capturing and synthesizing the real world, which includes people, objects, rooms, and even entire scenes. His goal is to synthesize and model the human body and objects so realistically that a viewer will no longer notice the difference. With the rise of deep learning methods in general and neural rendering in particular, Thies’ facial reenactment method has reached a quality that has attracted a great deal of attention in academia, industry, and the media. By combining novel findings in machine learning with classical computer graphics and computer vision approaches, his technology shows promising results.
Creating ever more realistic avatars
Thies’ research could be applied in a broad range of fields, from autonomous vehicles or household robots to medical applications and video post-production. However, his main goal is to revolutionize telecommunications. “I want to change the way people interact with each other at a distance,” he says. “In the future, if we want to speak to a person who is elsewhere, we will wear a virtual or augmented reality lens that projects a photorealistic 3D avatar of that person right in front of us. The person will be digitally reconstructed in a way that feels like they are in the same room. This will be an amazing experience, but only once the digital human and their appearance, expressions, and natural movements are indistinguishable from the real person. Hence, I aspire for everyone to be able to create a 3D avatar of themselves that both looks and moves realistically.”
Thies’ goal is for users to capture themselves with commonly available cameras, without having to rely on a complex setup to create their own avatar. Placing people in a 4D body scanner would be far too complex and expensive. “One idea is to work with the cameras of game consoles, for example, whose Kinect sensors record color and depth. A built-in distance sensor measures how long it takes for light to travel from the camera to the scene and back. This gives me a three-dimensional point cloud of a person from which I can then create an avatar.”
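As a rough illustration of the two ideas mentioned here (not Thies’ actual pipeline), a time-of-flight reading converts to distance because the light travels out and back, and a depth image can be back-projected to a 3D point cloud with the standard pinhole camera model. The intrinsics `fx`, `fy`, `cx`, `cy` below are made-up example values:

```python
import numpy as np

SPEED_OF_LIGHT = 299_792_458.0  # m/s

def tof_distance(round_trip_seconds):
    """Distance from a time-of-flight reading: the light travels to the
    scene and back, so the one-way distance is half the round trip."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (in metres) to an N x 3 point cloud
    using the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Toy example: a flat wall 2 m away, seen through made-up intrinsics.
depth = np.full((4, 4), 2.0)
cloud = depth_to_point_cloud(depth, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
```

A real depth sensor would additionally require lens-distortion correction and noise filtering before the point cloud is usable for avatar creation.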
Syncing the movement of a face with the input audio
Revolutionizing the way people communicate is one potential area of application. Another is in video post-production, where dubbed films would show the actors' lips moving in the target language. “To learn the expressions and appearance of a target face, our algorithm only needs a short video clip of about two minutes,” Thies says.
Also known as real-time facial reenactment, audio-driven facial video synthesis, or neural voice puppetry, this technology is only just beginning to gain momentum. It is driven by a deep neural network that employs a latent 3D face model space. In this way, a video can be created showing a celebrity speaking in a different voice or even showing the person saying things in their own voice that they never said in real life.
“This is the negative side of facial reenactment, as the generation of photorealistic imagery can be misused. The algorithms applied when dubbing and lip-syncing a film can also be used to create a speech by a head of state, putting untrue words in their mouth. If people don’t notice the hoax, this could cause confusion and disruption, and even undermine a society. Deep fakes are increasingly popping up on the internet, and they can be difficult to recognize. Digital multi-media forensics is thus another pillar of my research. Here, I aim to develop algorithms that automatically detect whether a video sequence is synthetic or has been manipulated.”
Understanding deep fake technology in order to detect it
Multi-media forensics is about training an AI algorithm with a large dataset so it can identify fake videos based on the artifacts visible in a video sequence – often occurring along the seam between the synthesized face and the real background. The algorithm also examines the noise characteristics of the colors and assesses whether they are realistic. In this way, artificial intelligence can detect relatively easily whether there has been a manipulation. “This is exactly the kind of technique that can then be automated,” Thies says. “There are already attempts to offer this deep-fake detection technique as a browser plug-in for anyone surfing the internet, which automatically recognizes and flags fake images and videos. This would protect against fake news.”
It’s a game of cat and mouse, as deep fakes are continuously improving. Nowadays, it is relatively easy to recognize a deep fake. In the near future, however, it will become increasingly difficult. Creating ever-more photorealistic video content thus comes with a huge responsibility. “My work raises the bar so that not just anyone can fake a video that remains unnoticed. By gaining knowledge about the creation process, we can design the most advanced forgery detection algorithms. You have to be at the forefront of this technology in order to detect, flag, and reveal the fakes,” Thies concludes.
Justus Thies was most recently a postdoctoral researcher at the Technical University of Munich, where he joined Prof. Matthias Nießner’s Visual Computing Lab in September 2017. Previously, he completed his Ph.D. at the University of Erlangen-Nürnberg under the supervision of Gunther Greiner, having received his Master of Science from the same university. During his time as a doctoral student, he collaborated with other institutions, completing internships at Stanford University and the Max Planck Institute for Informatics in Saarbrücken.
A medical project marked the beginning of Justus Thies’ research on capturing and synthesizing image and video content. The aim was to capture the faces of patients with cleft palates. For several decades, such patients have been photographed repeatedly, allowing scientists to see how their faces changed over the course of treatment and how their wounds healed. With the approach he developed, patients with a cleft palate can now be shown what the results of particular treatment methods will look like in ten years’ time.
His research can be watched on this YouTube channel.