Kling AI just released its new Kling 2.0 model, and with it come improvements to both image-to-video and text-to-video generation. In particular, Kling 2.0 handles dynamic, action-packed prompts and images far better than before, which translates into more striking videos. Let’s run through what’s new in Kling 2.0 and how you can make the most of your tokens.
Getting Started with Kling 2.0
First, let’s talk about what you can do with the new Kling 2.0 model.

Currently, Kling AI supports both text-to-video and image-to-video options with Kling 2.0. You can, of course, use any image you’d like, including generated images; my examples use images made with Flux. You’ll also notice a “Multi-Elements” option, which lets you swap, add, or delete elements in a video clip.
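If you’d rather generate those input images locally instead of with a web tool, here’s a minimal sketch using the diffusers FluxPipeline. Treat it as an assumption-laden starting point: the model ID, resolution, and settings below are placeholders, and the FLUX.1-dev weights are gated behind a license on Hugging Face.

```python
# Minimal sketch: generate an input image for image-to-video with Flux.
# Assumes the diffusers library and access to the gated FLUX.1-dev weights.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # assumed model ID; the schnell variant also works
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # helps on GPUs with limited VRAM

image = pipe(
    prompt="Three tiny fairies decorating a birthday cake, glowing magic bubbles, storybook style",
    height=768,
    width=1360,
    guidance_scale=3.5,
    num_inference_steps=50,
).images[0]
image.save("fairy_cake.png")  # upload this as the start frame in Kling's image-to-video tab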

All you have to do is pause the video at the right moments, select the elements you want to edit (in this case, I’m swapping one out), and Kling AI will handle the rest for you.

You’ll also want to add points across different regions of your selection to improve results. Generally, the more points you add, the better the AI is at tracking and masking the movement. I added quite a few points to this selection because human movement is complex, with many moving parts.
But you’re not quite done yet. For videos with particularly complex motion (such as dances), you won’t get the best results by adding selections to just one frame; you’ll want selections at several points along the timeline so the model can follow the motion.

If your video doesn’t have a lot of action, though, you’re in luck: you won’t need many masks to get a decent result. In this example, I only placed two masks on the video timeline, yet I still got a fairly consistent result, since the movements are relatively simple and the camera barely moves.

Kling 2.0 vs WAN 2.1
I mentioned earlier that Kling 2.0 lets you create videos much like WAN 2.1 VACE, an open-source model. And while it’s nice to have a free AI model running locally on your computer, most users are limited by hardware: unless you have a data-center GPU built for AI workloads, like the H100, you probably won’t get the best possible results. Even flagship consumer GPUs like the RTX 4090 and RTX 5090 will struggle to match the quality of videos generated through premium models like Kling 2.0.
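If you still want to try WAN 2.1 locally and run your own comparison, a rough sketch of the Hugging Face diffusers image-to-video pipeline for Wan 2.1 looks something like this. The checkpoint name, resolution, and frame count below are assumptions pulled from the public model card, so adjust them for your hardware.

```python
# Rough sketch: local WAN 2.1 image-to-video with Hugging Face diffusers.
# Model ID and settings are assumptions; the 14B checkpoint needs a lot of VRAM.
import torch
from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
from transformers import CLIPVisionModel

model_id = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"  # assumed checkpoint name
image_encoder = CLIPVisionModel.from_pretrained(
    model_id, subfolder="image_encoder", torch_dtype=torch.float32
)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanImageToVideoPipeline.from_pretrained(
    model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # trades speed for fitting on a consumer GPU

# Same workflow as Kling: start from a still image plus a motion prompt.
image = load_image("fairy_cake.png").resize((832, 480))
prompt = "Fairies fly around the birthday cake, casting sparkling spells as the cake grows"

frames = pipe(
    image=image,
    prompt=prompt,
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "fairy_cake_wan.mp4", fps=16)
```

Even with CPU offloading, expect this to be slow on consumer hardware, which is part of why a hosted model like Kling 2.0 is attractive.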
To showcase how differently WAN 2.1 VACE and Kling 2.0 perform, I ran the same images and the same prompts through each model’s image-to-video mode. The differences were immediately noticeable.

I used this image of fairies making a birthday cake in both models. With WAN 2.1, the video was pretty stale. The fairies mostly stood still, and the only real movement in the video came from the magic bubbles that floated above the cake. Not exactly a dynamic scene.
On the other hand, Kling 2.0’s video was far more action-packed. The little fairy in the middle ran around the cake, magic effects flew out of their wands, and the cake itself grew much larger. It looks much better than WAN 2.1’s result, and Kling 2.0’s handling of fast-paced scenes also outmatches its previous version, Kling 1.6.
Kling 2.0 vs Kling 1.6
In this next example, I had Kling 2.0 generate a fight scene between two female characters. The resulting video had complex martial arts movements and a fast-moving camera that circled the two as they fought. There were also lots of particle effects that gave the scene that extra flair.
On the other hand, Kling 1.6 struggled to keep up with Kling 2.0’s pace. Even with the same characters and prompt, Kling 1.6’s video was much slower, with barely any camera movement. The improvements in Kling 2.0 are most obvious when you compare the two versions on action-heavy scenes and prompts.
Kling 2.0’s Quirks
Kling 2.0 does have its quirks, though. When I tried to be a bit too specific with my prompt, the model didn’t handle it very well. This video of a woman on a jet-ski looks off because the woman’s head is turned backwards.
If you want to get natural-looking results, you need to keep your prompts simple. Using a simplified prompt, I got a much nicer-looking result here. This would also be a good time to mention that Kling 2.0 handles water pretty well, with realistic waves and splashes.
As long as you keep your prompts simple, you can also have the characters in your videos do interesting things, such as change their focus away from the camera.
The first frame of this video has the woman looking at the camera, but as it continues, she drives off, turning her head toward the road. This looks far more realistic than WAN 2.1’s version of the same prompt; while the open-source model handles reflections and lights well, there isn’t a whole lot of movement from the woman driving the motorcycle.