Repurposing Pre-trained Video Diffusion Models for Event-based Video Interpolation

¹University of Maryland, College Park   ²Massachusetts Institute of Technology

Abstract

Video Frame Interpolation aims to recover realistic missing frames between observed frames, generating a high-frame-rate video from a low-frame-rate one. However, without additional guidance, large motion between frames makes this problem ill-posed. Event-based Video Frame Interpolation (EVFI) addresses this challenge by using sparse, high-temporal-resolution event measurements as motion guidance. This guidance allows EVFI methods to significantly outperform frame-only methods. However, to date, EVFI methods have relied on a limited set of paired event-frame training data, severely limiting their performance and generalization. In this work, we overcome this limited-data challenge by adapting pre-trained video diffusion models, trained on internet-scale datasets, to EVFI. We experimentally validate our approach on real-world EVFI datasets, including a new one we introduce. Our method outperforms existing approaches and generalizes across cameras far better than they do.
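To make the setup concrete, the sketch below shows one way the event stream recorded between two observed frames can be packed, together with those frames, into conditioning for a diffusion-based interpolator. The voxel-grid binning is a standard event representation in EVFI; the `build_interpolation_input` interface and all names here are hypothetical illustrations, not the architecture described in the paper.

```python
# Conceptual sketch only: events as motion guidance for frame interpolation.
# The conditioning dict format is an assumption for illustration.
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate events (t, x, y, polarity) into a temporal voxel grid.

    events: float array of shape (N, 4) with columns (t, x, y, p), p in {-1, +1}.
    Returns an array of shape (num_bins, height, width).
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    if len(events) == 0:
        return grid
    t = events[:, 0]
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    p = events[:, 3]
    # Normalize timestamps to [0, num_bins - 1] and assign each event to a bin.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    b = np.clip(t_norm.astype(int), 0, num_bins - 1)
    np.add.at(grid, (b, y, x), p)  # signed accumulation of polarities
    return grid

def build_interpolation_input(start_frame, end_frame, events, num_missing):
    """Pack the two observed frames and the in-between events into one
    conditioning dict (hypothetical format) for a diffusion-based interpolator."""
    h, w = start_frame.shape[:2]
    voxels = events_to_voxel_grid(events, num_bins=num_missing, height=h, width=w)
    return {"start": start_frame, "end": end_frame, "event_voxels": voxels}
```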

Generalized (Zero-Shot) and Consistent Video Frame Interpolation for Real-World Unseen Videos

We present zero-shot video results of our method on real-world unseen videos, compared against representative baselines. For the HQF dataset, we skip 3 frames between the start and end frames and interpolate the frames in between. For our self-collected Clear-Motion dataset, we skip 11 frames between the start and end frames and interpolate the intermediate frames. For the event-based video frame interpolation baseline CBMNet-Large, we use the publicly available model checkpoint trained on the same dataset (BS-ERGB) as our method. The videos display all interpolated frames between the start and end frames. As the results show, our method uniquely generalizes well across unseen real-world videos.
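For concreteness, the frame-skipping protocol above can be sketched as follows; the helper name and structure are illustrative only, not taken from our released evaluation code.

```python
# Sketch of the evaluation protocol: the start and end frames `skip + 1` apart
# form the low-frame-rate input, and the skipped frames serve as ground-truth
# references for the interpolated output (skip=3 for HQF, skip=11 for Clear-Motion).
def make_interpolation_pairs(frames, skip):
    """Split a captured sequence into (start, end, references) triplets.

    frames: list of frames from the high-frame-rate recording.
    skip:   number of frames dropped between start and end.
    """
    pairs = []
    step = skip + 1
    for i in range(0, len(frames) - step, step):
        start, end = frames[i], frames[i + step]
        references = frames[i + 1 : i + step]  # the `skip` frames to recover
        pairs.append((start, end, references))
    return pairs

# Example: with skip=3, frames 0 and 4 are the inputs and frames 1-3 are references.
```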

On Clear-Motion Test Sequences: Large Translation and Rotation of Complex Texture

On Clear-Motion Test Sequences: Large Camera Motion Capturing Distant Objects

On HQF Test Set: With Moving Cameras

More Qualitative Comparison Results

On Clear-Motion Test Sequences: Large Motion of Simple Texture

On Clear-Motion Test Sequences: Large Camera Motion Capturing Nearby Objects

On Clear-Motion Test Sequences: Large Motion of a Checkerboard

Results Showcase

Here, we present additional video results of our method on real-world unseen videos. For our self-collected Clear-Motion dataset, we skip 11 frames between the start and end frames and interpolate all frames in between. For the HQF dataset, we skip 3 frames and interpolate all intermediate frames. In the results, Input refers to the low-temporal-resolution video containing only the start and end frames, Reference refers to the captured in-between frames, and Ours refers to the interpolated frames produced by our model.

On Clear-Motion Test Sequences: Large Motion Along the Depth Direction of a Checkerboard

On HQF Test Set: With Moving Cameras