Publications | Matan Levy

Story2Board teaser

Story2Board: A Training‑Free Approach for Expressive Storyboard Generation

arXiv, 2025

David Dinkevich, Matan Levy, Omri Avrahami, Dvir Samuel, Dani Lischinski

Training-free method that turns natural-language stories into coherent, expressive storyboards, keeping characters and layout consistent while scenes stay diverse. We also introduce a rich storyboard benchmark and a scene diversity metric.

Abstract Project Page arXiv

Abstract

Story2Board: A Training-Free Approach for Expressive Storyboard Generation

×

We present Story2Board, a training-free framework for expressive storyboard generation from natural language. Existing methods narrowly focus on subject identity, overlooking key aspects of visual storytelling such as spatial composition, background evolution, and narrative pacing. To address this, we introduce a lightweight consistency framework composed of two components: Latent Panel Anchoring, which preserves a shared character reference across panels, and Reciprocal Attention Value Mixing, which softly blends visual features between token pairs with strong reciprocal attention. Together, these mechanisms enhance coherence without architectural changes or fine-tuning, enabling state-of-the-art diffusion models to generate visually diverse yet consistent storyboards. To structure generation, we use an off-the-shelf language model to convert free-form stories into grounded panel-level prompts. To evaluate, we propose the Rich Storyboard Benchmark, a suite of open-domain narratives designed to assess layout diversity and background-grounded storytelling, in addition to consistency. We also introduce a new Scene Diversity metric that quantifies spatial and pose variation across storyboards. Our qualitative and quantitative results, as well as a user study, show that Story2Board produces more dynamic, coherent, and narratively engaging storyboards than existing baselines.

OmnimatteZero teaser

OmnimatteZero: Fast Training-free Omnimatte with Pre-trained Video Diffusion Models

SIGGRAPH Asia, 2025

Dvir Samuel, Matan Levy, Nir Darshan, Gal Chechik, Rami Ben-Ari

OmnimatteZero is a training-free, real-time method for decomposing videos into background and foreground layers. Unlike existing approaches that require heavy computation or supervised training, it can remove objects with their footprints (shadows, reflections) and blend them seamlessly into new videos. Running at 25 FPS on an A100 GPU, it achieves this by directly manipulating the spatio-temporal latent space of pre-trained video diffusion models.

Abstract Project Page arXiv

Abstract

OmnimatteZero: Training-free Real-time Omnimatte with Pre-trained Video Diffusion Models

×

In Omnimatte, one aims to decompose a given video into semantically meaningful layers, including the background and individual objects along with their associated effects, such as shadows and reflections. Existing methods often require extensive training or costly self-supervised optimization. In this paper, we present OmnimatteZero, a training-free approach that leverages off-the-shelf pre-trained video diffusion models for omnimatte. It can remove objects from videos, extract individual object layers along with their effects, and composite those objects onto new videos. These are accomplished by adapting zero-shot image inpainting techniques for video object removal, a task they fail to handle effectively out-of-the-box. To overcome this, we introduce temporal and spatial attention guidance modules that steer the diffusion process for accurate object removal and temporally consistent background reconstruction. We further show that self-attention maps capture information about the object and its footprints and use them to inpaint the object's effects, leaving a clean background. Additionally, through simple latent arithmetic, object layers can be isolated and recombined seamlessly with new video layers to produce new videos. Evaluations show that OmnimatteZero not only achieves superior performance in terms of background reconstruction but also sets a new record for the fastest Omnimatte approach, achieving real-time performance with minimal frame runtime.

Find your Needle teaser

Find your Needle: Small Object Image Retrieval via Multi-Object Attention Optimization

NeurIPS, 2025

Michael Green*, Matan Levy*, Issar Tzachor*, Dvir Samuel, Nir Darshan, Rami Ben-Ari

We tackle the problem of Small Object Image Retrieval (SoIR), where the goal is to retrieve images containing specific small objects within cluttered scenes. We establish new benchmarks and introduce Multi-object Attention Optimization (MaO), a novel framework that significantly outperforms existing methods, paving the way for future advancements in efficient, fine-grained retrieval tasks.

Abstract Project Page arXiv

Abstract

Find your Needle: Small Object Image Retrieval via Multi-Object Attention Optimization

×

We address the challenge of Small Object Image Retrieval (SoIR), where the goal is to retrieve images containing a specific small object, in a cluttered scene. The key challenge in this setting is constructing a single image descriptor, for scalable and efficient search, that effectively represents all objects in the image. In this paper, we first analyze the limitations of existing methods on this challenging task and then introduce new benchmarks to support SoIR evaluation. Next, we introduce Multi-object Attention Optimization (MaO), a novel retrieval framework which incorporates a dedicated multi-object pre-training phase. This is followed by a refinement process that leverages attention-based feature extraction with object masks, integrating them into a single unified image descriptor. Our MaO approach significantly outperforms existing retrieval methods and strong baselines, achieving notable improvements in both zero-shot and lightweight multi-object fine-tuning. We hope this work will lay the groundwork and inspire further research to enhance retrieval performance for this highly practical task.

Gray-box teaser

Task-Specific Adaptation with Restricted Model Access

arXiv, 2025

Matan Levy, Rami Ben-Ari, Dvir Samuel, Nir Darshan, Dani Lischinski

In this work, we propose "Gray-box" fine-tuning frameworks that enables task-specific adaptation of foundational models without exposing their weights or architecture. Using lightweight input and output adapters, our approach effectively adapts models while keeping them fixed. We introduce DarkGray-box and LightGray-box variants, demonstrating competitive performance with full fine-tuning on tasks like text-image and text-video alignment.

Abstract arXiv

Abstract

Task-Specific Adaptation with Restricted Model Access

×

The emergence of foundational models has greatly improved performance across various downstream tasks, with fine-tuning often yielding even better results. However, existing fine-tuning approaches typically require access to model weights and layers, leading to challenges such as managing multiple model copies or inference pipelines, inefficiencies in edge device optimization, and concerns over proprietary rights, privacy, and exposure to unsafe model variants. In this paper, we address these challenges by exploring "Gray-box" fine-tuning approaches, where the model's architecture and weights remain hidden, allowing only gradient propagation. We introduce a novel yet simple and effective framework that adapts to new tasks using two lightweight learnable modules at the model's input and output. Additionally, we present a less restrictive variant that offers more entry points into the model, balancing performance with model exposure. We evaluate our approaches across several backbones on benchmarks such as text-image alignment, text-video alignment, and sketch-image alignment. Results show that our Gray-box approaches are competitive with full-access fine-tuning methods, despite having limited access to the model.

EffoVPR teaser

EffoVPR: Effective Foundation Model Utilization for Visual Place Recognition

ICLR 2025

Issar Tzachor, Boaz Lerner, Matan Levy, Michael Green, Tal Berkovitz Shalev, Gavriel Habib, Dvir Samuel, Noam Korngut Zailer, Or Shimshi, Nir Darshan, Rami Ben-Ari

This work introduces a new method for visual place recognition (VPR) that uses features from foundation models to improve accuracy. It excels in handling challenging scenarios like occlusions, seasonal changes, and day-night variations, offering more efficient and accurate results than previous methods.

Abstract arXiv

Abstract

EffoVPR: Effective Foundation Model Utilization for Visual Place Recognition

×

The task of Visual Place Recognition (VPR) is to predict the location of a query image from a database of geo-tagged images. Recent studies in VPR have highlighted the significant advantage of employing pre-trained foundation models like DINOv2 for the VPR task. However, these models are often deemed inadequate for VPR without further fine-tuning on VPR-specific data. In this paper, we present an effective approach to harness the potential of a foundation model for VPR. We show that features extracted from self-attention layers can act as a powerful re-ranker for VPR, even in a zero-shot setting. Our method not only outperforms previous zero-shot approaches but also introduces results competitive with several supervised methods. We then show that a single-stage approach utilizing internal ViT layers for pooling can produce global features that achieve state-of-the-art performance, with impressive feature compactness down to 128D. Moreover, integrating our local foundation features for re-ranking further widens this performance gap. Our method also demonstrates exceptional robustness and generalization, setting new state-of-the-art performance, while handling challenging conditions such as occlusion, day-night transitions, and seasonal variations.

Where's Waldo teaser

Where's Waldo: Diffusion Features for Personalized Segmentation and Retrieval

NeurIPS 2024

Dvir Samuel, Rami Ben-Ari, Matan Levy, Nir Darshan, Gal Chechik

This work leverages text-to-image diffusion models for personalized image segmentation and retrieval, using features from pre-trained models. It surpasses existing methods in identifying specific objects within images without additional training.

Abstract Project Page arXiv Code

Abstract

Where's Waldo: Diffusion Features for Personalized Segmentation and Retrieval

×

Personalized retrieval and segmentation aim to locate specific instances within a dataset based on an input image and a short description of the reference instance. While supervised methods are effective, they require extensive labeled data for training. Recently, self-supervised foundation models have been introduced to these tasks showing comparable results to supervised methods. However, a significant flaw in these models is evident: they struggle to locate a desired instance when other instances within the same class are presented. In this paper, we explore text-to-image diffusion models for these tasks. Specifically, we propose a novel approach called PDM for Personalized Features Diffusion Matching, that leverages intermediate features of pre-trained text-to-image models for personalization tasks without any additional training. PDM demonstrates superior performance on popular retrieval and segmentation benchmarks, outperforming even supervised methods. We also highlight notable shortcomings in current instance and segmentation datasets and propose new benchmarks for these tasks.

ChatIR teaser

Chatting Makes Perfect: Chat-based Image Retrieval

NeurIPS 2023

Matan Levy, Rami Ben-Ari, Nir Darshan, Dani Lischinski

This work proposes a chat-based image retrieval system that refines search results through interactive dialogue. By asking follow-up questions, the system improves retrieval accuracy and surpasses traditional single-query methods in performance.

Abstract Project Page arXiv Code

Abstract

Chatting Makes Perfect: Chat-based Image Retrieval

×

Chats emerge as an effective user-friendly approach for information retrieval, and are successfully employed in many domains, such as customer service, healthcare, and finance. However, existing image retrieval approaches typically address the case of a single query-to-image round, and the use of chats for image retrieval has been mostly overlooked. In this work, we introduce ChatIR: a chat-based image retrieval system that engages in a conversation with the user to elicit information, in addition to an initial query, in order to clarify the user's search intent. Motivated by the capabilities of today's foundation models, we leverage Large Language Models to generate follow-up questions to an initial image description. These questions form a dialog with the user in order to retrieve the desired image from a large corpus. In this study, we explore the capabilities of such a system tested on a large dataset and reveal that engaging in a dialog yields significant gains in image retrieval. We start by building an evaluation pipeline from an existing manually generated dataset and explore different modules and training strategies for ChatIR. Our comparison includes strong baselines derived from related applications trained with Reinforcement Learning. Our system is capable of retrieving the target image from a pool of 50K images with over 78% success rate after 5 dialogue rounds, compared to 75% when questions are asked by humans, and 64% for a single shot text-to-image retrieval. Extensive evaluations reveal the strong capabilities and examine the limitations of CharIR under different settings. Project repository is available at https://github.com/levymsn/ChatIR.

LaSCo teaser

Data Roaming and Quality Assessment for Composed Image Retrieval

AAAI 2024

Matan Levy, Rami Ben-Ari, Nir Darshan, Dani Lischinski

This work introduces a new dataset for Composed Image Retrieval (CoIR) and a model that significantly improves retrieval tasks. The dataset enhances query richness and reduces redundancy, achieving state-of-the-art results on benchmarks like FashionIQ and CIRR.

Abstract Project Page arXiv Code

Abstract

Data Roaming and Quality Assessment for Composed Image Retrieval

×

The task of Composed Image Retrieval (CoIR) involves queries that combine image and text modalities, allowing users to express their intent more effectively. However, current CoIR datasets are orders of magnitude smaller compared to other vision and language (V&L) datasets. Additionally, some of these datasets have noticeable issues, such as queries containing redundant modalities. To address these shortcomings, we introduce the Large Scale Composed Image Retrieval (LaSCo) dataset, a new CoIR dataset which is ten times larger than existing ones. Pre-training on our LaSCo, shows a noteworthy improvement in performance, even in zero-shot. Furthermore, we propose a new approach for analyzing CoIR datasets and methods, which detects modality redundancy or necessity, in queries. We also introduce a new CoIR baseline, the Cross-Attention driven Shift Encoder (CASE). This baseline allows for early fusion of modalities using a cross-attention module and employs an additional auxiliary task during training. Our experiments demonstrate that this new baseline outperforms the current state-of-the-art methods on established benchmarks like FashionIQ and CIRR.

CRCT teaser

Classification-Regression for Chart Comprehension

ECCV 2022

Matan Levy, Rami Ben-Ari, Dani Lischinski

This work presents a model for chart question answering that combines visual and textual data, significantly improving performance on complex charts. It excels in handling out-of-vocabulary and regression tasks, achieving strong results on the PlotQA dataset.

Abstract Project Page arXiv Code

Abstract

Classification-Regression for Chart Comprehension

×

Chart question answering (CQA) is a task used for assessing chart comprehension, which is fundamentally different from understanding natural images. CQA requires analyzing the relationships between the textual and the visual components of a chart, in order to answer general questions or infer numerical values. Most existing CQA datasets and models are based on simplifying assumptions that often enable surpassing human performance. In this work, we address this outcome and propose a new model that jointly learns classification and regression. Our language-vision setup uses co-attention transformers to capture the complex real-world interactions between the question and the textual elements. We validate our design with extensive experiments on the realistic PlotQA dataset, outperforming previous approaches by a large margin, while showing competitive performance on FigureQA. Our model is particularly well suited for realistic questions with out-of-vocabulary answers that require regression.