Repurposing Pre-trained Video Diffusion Models for Event-based Video Interpolation

¹University of Maryland, College Park   ²Massachusetts Institute of Technology
CVPR, 2025

Abstract

Video Frame Interpolation aims to recover realistic missing frames between observed frames, generating a high-frame-rate video from a low-frame-rate video. However, without additional guidance, large motion between frames makes this problem ill-posed. Event-based Video Frame Interpolation (EVFI) addresses this challenge by using sparse, high-temporal-resolution event measurements as motion guidance. This guidance allows EVFI methods to significantly outperform frame-only methods. However, to date, EVFI methods have relied upon a limited set of paired event-frame training data, severely limiting their performance and generalization capabilities. In this work, we overcome the limited data challenge by adapting pre-trained video diffusion models trained on internet-scale datasets to EVFI. We experimentally validate our approach on real-world EVFI datasets, including a new one we introduce. Our method outperforms existing methods and generalizes across cameras far better than existing approaches.

Generalized (Zero-Shot) and Consistent Video Frame Interpolation for Real-World Unseen Videos

We present zero-shot video results of our method on real-world unseen videos, compared with representative baselines. For the HQF dataset, we skip 3 frames between the start and end frames and interpolate the frames in between. For our self-collected Clear-Motion dataset, we skip 11 frames between the start and end frames and interpolate the intermediate frames. For the event-based video frame interpolation baseline CBMNet-Large, we use the publicly available model checkpoint trained on the same dataset (BS-ERGB) as our method. The videos display all interpolated frames between the start and end frames. As the results show, our method uniquely generalizes well across unseen real-world videos.
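To make the skip-and-interpolate protocol concrete, the short Python sketch below enumerates the (start, end, in-between) frame indices used for evaluation. The helper and its names are illustrative only and are not part of our released code.

# Minimal sketch of the skip-N evaluation protocol described above.
# Function and variable names are illustrative, not from the released code.

def make_eval_clips(num_frames: int, skip: int):
    """Split a frame sequence into (start, end, intermediate) index triples.

    With skip=3 (HQF) the model sees frames i and i+4 and must recover
    frames i+1..i+3; with skip=11 (Clear-Motion) it sees i and i+12 and
    must recover frames i+1..i+11.
    """
    clips = []
    step = skip + 1
    for start in range(0, num_frames - step, step):
        end = start + step
        intermediate = list(range(start + 1, end))
        clips.append((start, end, intermediate))
    return clips

# Example: 25 captured frames, HQF-style evaluation (skip 3 frames).
for start, end, targets in make_eval_clips(25, skip=3):
    print(f"input: frames {start} and {end}  ->  interpolate {targets}")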

On Clear-Motion Test Sequences: Large Translation and Rotation of Complex Texture

On Clear-Motion Test Sequences: Large Camera Motion Capturing Nearby Objects

On HQF Test Set: With Moving Cameras

More Qualitative Comparison Results

On Clear-Motion Test Sequences: Large Motion of a Checkerboard

On Clear-Motion Test Sequences: Large Motion of Simple Texture


On Clear-Motion Test Sequences: Large Camera Motion Capturing Distant Objects

Results Showcase

Here, we present additional video results of our method on real-world unseen videos. For our self-collected Clear-Motion dataset, we skip 11 frames between the start and end frames and interpolate all frames in between. For the HQF dataset, we skip 3 frames and interpolate all intermediate frames. In the results, Input refers to the low-temporal-resolution video containing only the start and end frames, Reference refers to the captured in-between frames, and Ours refers to the interpolated frames produced by our model.

On Clear-Motion Test Sequences: Deformable Motion of Simple Texture

On Clear-Motion Test Sequences: Large Motion Along the Depth Direction of a Checkerboard

On HQF Test Set: With Moving Cameras

Comparison between Event-based Video Generation and Interpolation

As mentioned in the main paper, the underlying model was originally trained for video generation, so our method can also perform video generation. The key difference between video generation and interpolation lies in the input: generation conditions only on the start frame, while interpolation conditions on both the start and end frames. Below, we compare the two tasks. Video generation often suffers from error accumulation and hallucination because it relies only on the information contained in the start frame, whereas interpolation produces better and more consistent results by leveraging information from both the start and end frames.
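The Python sketch below makes this distinction concrete by contrasting the two conditioning modes. The sampler interface (anchors, events, num_frames) is hypothetical and stands in for the event-conditioned video diffusion model; it is not our actual sampling code.

from typing import Callable, Dict, Sequence

# Hypothetical sketch of the two conditioning modes; `Sampler` stands in
# for the event-conditioned video diffusion sampler, which is not shown.

Frame = object  # placeholder for an image tensor
Sampler = Callable[[Dict[int, Frame], Sequence, int], Sequence[Frame]]

def generate(model: Sampler, start_frame: Frame, events, num_frames: int):
    """Video generation: conditioned on the start frame only, so later
    frames can drift and hallucinate content not visible at t = 0."""
    anchors = {0: start_frame}  # a single anchor at t = 0
    return model(anchors, events, num_frames)

def interpolate(model: Sampler, start_frame: Frame, end_frame: Frame,
                events, num_frames: int):
    """Video interpolation: conditioned on both endpoints, which constrains
    the in-between frames and yields more consistent results."""
    anchors = {0: start_frame, num_frames - 1: end_frame}  # both endpoints
    return model(anchors, events, num_frames)

In both cases the event stream supplies the motion guidance; the only change is whether the end frame is available as an anchor during sampling.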

On Clear-Motion Test Sequences: Large Motion of Simple Texture