I've been looking into AI-generated videos to round off our trilogy of blog posts on AI and its ability to generate content. In recent posts we established that AI is getting reasonably good at generating images that could be used without modification, and we also took a look at AI's ability to generate written content on a particular subject.
Generating video takes the challenge of generating an image and turns it up a notch or two. Here is a little background to help put the problem into perspective.
While the technology used to record and display moving images has improved dramatically since the Zoetropes of the 1800s, the process used to trick our brains into ‘seeing something moving’ is basically the same.
Quickly moving between a series of similar images with only slight variations makes us believe something is moving in front of us. This is exactly how stop motion animation, flip books, movies, TV and digital videos work. 24 frames (or images if you will) per second is enough to make us see fluid movement.
What does this mean for AI-generated videos? It means the AI needs to generate 24 frames (images) for every second of required footage. Each frame must relate to both the previous frame and the next with only a slight variation; if the frames are too different, you get odd flickering in the video. It requires an almost human level of subtlety.
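The frame budget adds up quickly. As a rough sketch (assuming the conventional 24 frames per second mentioned above):

```python
# Rough sketch of the frame budget for an AI-generated clip,
# assuming the conventional 24 frames per second.

FRAMES_PER_SECOND = 24

def frames_needed(duration_seconds: float) -> int:
    """Total number of individual, mutually consistent images
    the model must generate for a clip of the given length."""
    return round(duration_seconds * FRAMES_PER_SECOND)

# Even a short 30-second clip requires hundreds of coherent frames.
print(frames_needed(30))  # 720
print(frames_needed(60))  # 1440
```

Every one of those hundreds of frames has to stay consistent with its neighbours, which is exactly where the examples below fall down.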
In terms of generating an entire video from a text prompt, it looks like we are currently on the cusp of tools similar to DALL·E 2 becoming available to the public, but there is nothing at the moment. There are a few research papers from different sources that describe the process and show examples.
This paper is fascinating; its example videos show some of the hurdles that still need to be overcome.
Prompt: A teddy bear washing dishes
We can see in the above video the importance of each frame relating to the last: the plate the bear is washing disappears at certain points during the video because it simply wasn't included in those generated frames.
Prompt: A happy elephant wearing a birthday hat walking under the sea.
In the video of the elephant walking above, we can see there are still issues with the elephant's leg motion: its legs seem to overlap and clip through each other in odd ways.
Another paper from Meta AI shows they are facing the same problems.
Prompt: Hyper-realistic spaceship landing on mars
This spaceship looks amazing; the AI has even managed to generate a realistic-looking shadow for the hovering UFO. Is it perfect? No. If you look closely, the terrain on the left side changes slightly as the UFO 'levels out'.
Prompt: A teddy bear painting a portrait
This video of a bear painting shows the importance of each frame relating to the others. As the bear's arm goes over the bottom right of the canvas, the definition of the canvas is completely lost.
There are a handful of companies out there at the moment that offer AI-generated videos built around an AI avatar. They don't generate something from nothing, as we've been discussing so far; instead, they combine a script with an avatar to create a video.
They market these for things like training videos and product marketing videos (honestly, anything where you'd expect a presenter on screen talking the viewer through a process would work).
One of the huge benefits of this kind of video is that you don’t have to pay for a presenter, and a system like this can deliver videos in a multitude of different languages with ease.
We've all heard the phrase 'deepfake', but what does it actually mean? A deepfake (or deepfake video, in this context) is a video produced by training an AI on a single individual with the aim of replicating that individual's unique characteristics. This differs from the AI avatars mentioned above, which aim to reproduce a convincing but nondescript human; a deepfake targets one particular person.
You have likely heard about deepfakes because they are often used for nefarious ends: imagine being able to make a deepfaked world leader say anything you wish. A deepfaked video could be realistic enough that anyone watching would believe it to be the real person speaking. There are more legitimate uses too; for instance, the movie industry can use the technology to de-age actors or to have a deceased actor reprise an iconic movie role.
As you'd expect, there are huge concerns regarding the legality of deepfakes and the problems they might cause. Most academic research around deepfakes concentrates on detection. Techniques include looking for inconsistencies in the reflections in the subject's eyes or detecting irregularities in blinking patterns, but as deepfake technology improves these techniques are bound to become redundant and we'll have to find new ways to separate fact from fiction.
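To make the blinking idea concrete, here is a toy sketch (not a real detector) of how such a check might work. It assumes we already have a per-frame "eye openness" score of the kind a face-landmark model could produce; the threshold and the "typical" blink range are illustrative numbers, not published values.

```python
# Toy illustration of blink-pattern analysis for deepfake detection.
# Assumes a hypothetical per-frame eye-openness score (0 = closed,
# 1 = fully open); all thresholds below are illustrative assumptions.

TYPICAL_BLINKS_PER_MINUTE = (8, 30)  # rough human range (assumed)
OPENNESS_THRESHOLD = 0.2             # below this, the eye counts as closed

def count_blinks(openness_per_frame):
    """Count open-to-closed transitions as blinks."""
    blinks = 0
    eye_closed = False
    for score in openness_per_frame:
        if score < OPENNESS_THRESHOLD and not eye_closed:
            blinks += 1
            eye_closed = True
        elif score >= OPENNESS_THRESHOLD:
            eye_closed = False
    return blinks

def looks_suspicious(openness_per_frame, fps=24):
    """Flag a clip whose blink rate falls outside the typical range."""
    minutes = len(openness_per_frame) / fps / 60
    rate = count_blinks(openness_per_frame) / minutes
    low, high = TYPICAL_BLINKS_PER_MINUTE
    return not (low <= rate <= high)
```

A minute of footage in which the subject never blinks would be flagged, which is roughly the kind of irregularity early detectors exploited; better generators that learn to blink naturally are precisely why such checks become redundant.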
It would be very easy for me to turn around and tell you that this is the next big thing, but these tools are not in the hands of the general population yet. It's very hard to tell whether this could be the next big thing or just turn out to be an over-engineered playground. If these kinds of systems aren't well received by potential users, they can fizzle out quickly even if the technology behind them is sound and groundbreaking. Only time will tell. One thing is for sure though: as this technology continues to advance, it will keep raising important questions about the ethical and societal implications of AI-generated videos.