Zero-shot, training-free, image-based text-to-video generation is an emerging area that aims to generate videos using existing image-based diffusion models. Current methods in this space require specific architectural changes to image-generation models, which limits their adaptability and scalability. In contrast, we provide a model-agnostic approach that uses intersections in diffusion trajectories, working only with latent values. Because trajectory intersection alone cannot provide localized frame-wise coherence and diversity, we adopt a grid-based approach instead. An in-context-trained LLM generates coherent frame-wise prompts, and a second LLM identifies the differences between frames. Based on these, we obtain a CLIP-based attention mask that controls, for each grid cell, the timing of switching between prompts: earlier switching yields higher variance, while later switching yields greater coherence. Our approach therefore offers fine-grained control over the trade-off between coherence and variance across frames. It achieves state-of-the-art performance while remaining flexible across diverse image-generation models. Empirical analysis using quantitative metrics and user studies confirms our model's superior temporal consistency, visual fidelity, and user satisfaction, providing a novel way to obtain training-free, image-based text-to-video generation.
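To make the grid-based switching concrete, the minimal sketch below shows how per-cell switch steps could gate a denoising loop between the previous frame's prompt and the current frame's prompt. The names `denoise_step`, `embed_prev`/`embed_next`, and the 8×8 grid are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of grid-based prompt switching during denoising.
# `denoise_step`, the embeddings, and the grid size are hypothetical
# placeholders standing in for a generic image diffusion model.
import torch

T = 50                                   # number of denoising steps
H = W = 64                               # latent spatial resolution
G = 8                                    # grid cells per side
cell = H // G

# Per-cell switch step: cells that switch EARLIER receive more variance,
# cells that switch LATER stay more coherent with the previous frame.
switch_step = torch.randint(low=10, high=40, size=(G, G))

def denoise_step(latent, prompt_embed, t):
    """Placeholder for one reverse-diffusion step of a generic image model."""
    return latent - 0.01 * torch.randn_like(latent) * prompt_embed.mean()

latent = torch.randn(1, 4, H, W)         # shared initial noise across frames
embed_prev = torch.randn(77, 768)        # previous frame's prompt embedding
embed_next = torch.randn(77, 768)        # current frame's prompt embedding

for t in range(T):
    # Denoise the full latent once under each prompt...
    out_prev = denoise_step(latent, embed_prev, t)
    out_next = denoise_step(latent, embed_next, t)
    # ...then, per grid cell, keep the old-prompt result until that cell's
    # switch step has passed, and the new-prompt result afterwards.
    switched = (t >= switch_step).repeat_interleave(cell, 0).repeat_interleave(cell, 1)
    latent = torch.where(switched.view(1, 1, H, W), out_next, out_prev)
```

Because the two branches share the same latent at every step and only the blend mask differs per cell, regions whose prompt switches late inherit the previous frame's structure almost unchanged, which is what yields the coherence side of the trade-off.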
The EIDT-V pipeline leverages diffusion-trajectory intersections to enable frame-wise video generation, integrating a text module and a video module. The text module generates frame-wise prompts and identifies differences across frames, while the video module uses these prompts to ensure temporal consistency and semantic coherence. Through techniques like grid-based prompt switching and text-guided attention, the pipeline allows fine-grained control over variance and coherence, achieving high-quality video synthesis with generic image diffusion models.
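One way to realize the text-guided attention step is to score each grid cell against the LLM-identified frame difference using CLIP, then map relevance to a switch step. The sketch below is an assumption-laden illustration: the crop-based scoring, the linear score-to-step mapping, and the `cell_switch_steps` helper are hypothetical, not the paper's exact mask derivation.

```python
# Hedged sketch: CLIP-scoring grid cells against the text describing the
# difference between consecutive frames, then assigning per-cell switch
# steps. Crop-based scoring is a simplifying assumption for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def cell_switch_steps(frame: Image.Image, diff_text: str, grid=8, steps=50):
    """Return a (grid, grid) tensor of per-cell prompt-switch steps."""
    w, h = frame.size
    cw, ch = w // grid, h // grid
    crops = [frame.crop((x * cw, y * ch, (x + 1) * cw, (y + 1) * ch))
             for y in range(grid) for x in range(grid)]
    inputs = processor(text=[diff_text], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Similarity of each cell crop to the difference description.
    sims = out.logits_per_image.squeeze(-1)          # (grid*grid,)
    scores = (sims - sims.min()) / (sims.max() - sims.min() + 1e-8)
    # High-relevance cells switch EARLY (more change expected there);
    # low-relevance cells switch LATE, preserving the previous frame.
    return ((1 - scores) * (steps - 1)).round().long().view(grid, grid)
```

The resulting map plays the role of the switch schedule in the denoising sketch above: cells CLIP deems relevant to the described change adopt the new prompt early, while the rest of the frame stays locked to the previous prompt for longer.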
The quantitative comparison highlights the effectiveness of EIDT-V across multiple configurations and architectures: our method demonstrates superior temporal coherence, perceptual quality, and structural consistency compared to baseline methods. The user study corroborates these findings, with EIDT-V rated best overall for temporal coherence, fidelity, and user satisfaction, all without any modifications to the underlying model architecture.
BibTeX and code coming soon.