EIDT-V: Exploiting Intersections in Diffusion Trajectories for Model-Agnostic, Zero-Shot, Training-Free Text-to-Video Generation

University of Bath
Under Review

Generated Videos

SD3 Medium: A family of penguins huddling together in a snowstorm
SD3 Medium: A fairy flying around a glowing mushroom in an enchanted forest
SD3 Medium: A dancer's silhouette moving gracefully in slow motion
SD3 Medium: A galaxy swirling with stars and nebulae in deep space
SD3 Medium: A musician playing a slow, peaceful tune on an acoustic guitar
SD3 Medium: A peacock displaying its feathers in slow motion
SD3 Medium: A person writing slowly in a journal with an ink pen
SD3 Medium: A portal opening and closing slowly in a mystical cave
SD3 Medium: A squirrel nibbling on an acorn under a tree
SD3 Medium: A dolphin gracefully gliding through turquoise waves
SD3 Medium: A unicorn grazing in a meadow under a rainbow
SD3 Medium: Raindrops creating ripples on a still pond
SDXL: A dancer's silhouette moving gracefully in slow motion
SDXL: A figure skater gliding across an ice rink with smooth turns
SDXL: A full moon rising over a still ocean
SDXL: A galaxy swirling with stars and nebulae in deep space
SDXL: A lightning bug flying through a dark meadow
SDXL: A musician playing a slow, peaceful tune on an acoustic guitar
SDXL: A butterfly gently flapping its wings while resting on a flower
SDXL: A child blowing bubbles that float and pop gently

Abstract

Zero-shot, training-free, image-based text-to-video generation is an emerging area that aims to generate videos using existing image-based diffusion models. Current methods in this space require architecture-specific changes to the image-generation model, which limit their adaptability and scalability. In contrast, we present a model-agnostic approach that exploits intersections in diffusion trajectories, operating only on the latent values. Because trajectory intersections alone cannot provide localized frame-wise coherence and diversity, we adopt a grid-based approach. An in-context-trained LLM generates coherent frame-wise prompts, while a second identifies the differences between consecutive frames. From these, we derive a CLIP-based attention mask that controls when the prompt is switched for each grid cell: switching earlier yields higher variance, while switching later yields greater coherence, giving fine-grained control over the balance between the two. Our approach achieves state-of-the-art performance while remaining flexible across diverse image-generation models. Empirical analysis using quantitative metrics and user studies confirms our method's superior temporal consistency, visual fidelity, and user satisfaction, providing a novel route to training-free, image-based text-to-video generation.

Methodology

The EIDT-V pipeline leverages intersections in diffusion trajectories to enable frame-based video generation, integrating a text module and a video module. The text module generates frame-wise prompts and identifies the differences across consecutive frames, while the video module uses these prompts to ensure temporal consistency and semantic coherence. Through grid prompt switching and text-guided attention, the pipeline allows fine-grained control over the trade-off between variance and coherence, achieving high-quality video synthesis with generic, unmodified diffusion models.
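The per-cell timing logic can be sketched as follows. This is a minimal illustration, not the paper's implementation: `attention_mask` stands in for the CLIP-based mask derived from the LLM-identified frame differences (its construction is not shown), and the step range `t_min`/`t_max` is an assumed hyperparameter. Cells with high attention (large expected change) switch to the new frame's prompt earlier in denoising, producing more variance; low-attention cells switch later, preserving coherence.

```python
import numpy as np

def cell_switch_steps(attention_mask, total_steps, t_min, t_max):
    """Map a per-cell attention mask (values in [0, 1]) to a prompt-switch
    step index. Denoising steps count down from total_steps - 1 to 0, so a
    LARGER switch step means the switch happens EARLIER in denoising.

    High-attention cells get a larger switch step (earlier switch, more
    variance); low-attention cells get a smaller one (later switch, more
    coherence).
    """
    steps = t_min + attention_mask * (t_max - t_min)
    return np.clip(np.round(steps), 0, total_steps - 1).astype(int)

def use_new_prompt(step, switch_steps):
    """At denoising step `step` (counting down), return a boolean grid:
    True where a cell should already be conditioned on the new frame's
    prompt embedding, False where it still uses the previous frame's."""
    return step <= switch_steps

# Example: a 1x2 grid where the right cell changes a lot between frames.
mask = np.array([[0.0, 1.0]])
switch = cell_switch_steps(mask, total_steps=50, t_min=10, t_max=40)
# switch == [[10, 40]]: the right cell switches at step 40 (early),
# the left cell only at step 10 (late).
print(use_new_prompt(30, switch))  # [[False  True]]
```

In a full pipeline, this boolean grid would select, per cell and per step, which prompt embedding conditions the denoiser, so that static regions stay anchored to the previous frame while changing regions are free to evolve.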

Overview of the EIDT-V Pipeline for Frame-Based Video Generation
Figure 1: Overview of the EIDT-V Pipeline. This pipeline integrates text and video modules to generate coherent frames.
Grid Prompt Switching with Text-Guided Attention
Figure 2: Grid Prompt Switching with Text-Guided Attention. Controlled transitions enhance motion coherence.

Key Results

The quantitative comparison highlights the effectiveness of EIDT-V across multiple configurations and architectures. Our method demonstrates superior temporal coherence, perceptual quality, and structural consistency compared to baseline methods. The user study further corroborates these findings, indicating that EIDT-V achieves the best overall performance in temporal coherence, fidelity, and user satisfaction, all while requiring no modifications to the model architecture.

Main Quantitative Results
User Study Results

BibTeX

BibTeX entry coming soon.