This repository contains the official implementation of the accepted conference paper:
InfoCIR: Multimedia Analysis for Composed Image Retrieval
Paper ID: 7973
Accepted at IEEE PacificVis 2026 (Conference Papers Track)
InfoCIR is an open-source web application (built with Plotly Dash) for exploring a Composed Image Retrieval (CIR) system through interactive visualizations and explainability tools. It integrates a state-of-the-art CIR model (SEARLE) with a rich dashboard that lets you query by combining an image + text description, visualize the results in an embedding space, analyze class distributions, and refine your text prompts with AI assistance. The goal is to help users understand why certain images are retrieved and how slight changes in wording can affect the results, by coupling retrieval with explainability and prompt engineering in one interface.
The user interface of our system consists of six main panels. (A) The Composed Image Retrieval Panel allows users to input an image and a text prompt and select the number of images k to be retrieved. (B) The Query Results Panel displays the top-k images ranked by similarity; any image can be clicked to mark it as an ideal target for prompt refinement. (C) The Histogram / Wordcloud Panel includes a class-frequency histogram (C1) and a word cloud (C2), summarizing labels within the current top-k. (D) The central Embedding View shows a 2D UMAP projection of the dataset, highlighting the reference image, the composed query embedding, and the top-k results. (E) The Prompt Enhancement Panel proposes alternative prompts conditioned on the selected ideals, using an LLM and retrieval metrics. (F) The Explanation Panel visualizes model attribution using a saliency map (F1), a token attribution bar chart (F2), and a Rank-Δ heatmap (F3).
- Reference Image + Text Query: Upload an image and describe desired modifications (e.g., "make it red", "add a hat")
- SEARLE Integration: Uses pre-trained SEARLE models for high-quality composed queries
- Multiple Model Support: SEARLE, SEARLE-XL, Phi networks, and OTI methods
- Real-time Search: Interactive search with adjustable top-K results
- Visual Explanations: GradECLIP-based saliency maps showing which image regions influence retrieval
- Text Attribution: Token-level attribution showing which words matter most
- Reference vs Candidate Analysis: Compare attention patterns between query and results
- Interactive Navigation: Browse through saliency maps for multiple retrieved images
- Context-Aware Generation: Automatically enhance prompts using visual context from selected images
- Multi-Candidate Analysis: Generate and compare multiple enhanced prompts
- Performance Metrics: NDCG, mAP, and MRR evaluation of prompt improvements
- Rank Delta Analysis: Visualize how different prompts affect retrieval rankings
- UMAP Projections: 2D visualization of high-dimensional image embeddings
- Smart Clustering: Enhanced UMAP with debiasing techniques for better semantic grouping
- Gallery View: Responsive image grid with selection and filtering
- Word Clouds: Dynamic visualization of class distributions
- Histograms: Interactive class frequency analysis
- Scatterplot: Multi-selection and zoom-responsive visualizations
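The Smart Clustering debiasing mentioned above is not spelled out in this README; one common style-debiasing technique is to project out the top principal components of the embedding matrix, which tend to capture dominant low-level "style" variation rather than semantics. A minimal NumPy sketch of that idea (the repo's actual implementation lives in its UMAP pipeline and may differ):

```python
import numpy as np

def remove_top_components(embeddings: np.ndarray, k: int = 2) -> np.ndarray:
    """Project out the top-k principal directions of the embedding matrix.

    These directions often encode dominant 'style' variation; removing
    them can sharpen semantic clustering before running UMAP.
    """
    centered = embeddings - embeddings.mean(axis=0)
    # Right singular vectors = principal directions
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:k]  # (k, dim)
    # Subtract each embedding's projection onto the top-k directions
    return centered - centered @ top.T @ top

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))          # stand-in for CLIP embeddings
X_debiased = remove_top_components(X, k=2)
print(X_debiased.shape)  # (100, 32)
```

After this step the debiased embeddings have (numerically) zero projection onto the removed directions, so those directions no longer dominate the UMAP layout.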
- Python 3.8+ (tested with 3.8-3.11)
- CUDA-capable GPU (recommended for optimal performance)
- 8GB+ RAM (16GB+ recommended for large datasets)
git clone <repository-url>
cd multimedia-analytics

# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
cd cir_app
# Install dependencies
pip install -r requirements.txt

The app automatically downloads SEARLE models via PyTorch Hub. For offline usage:
# Pre-download SEARLE models
python -c "import torch; torch.hub.load('miccunifi/SEARLE', 'searle', backbone='ViT-B/32')"
python -c "import torch; torch.hub.load('miccunifi/SEARLE', 'searle', backbone='ViT-L/14')"

You need to have your dataset somewhere on your system with this structure:
/your/path/to/dataset/
├── class1_dir/          # class directory
├── class2_dir/          # class directory
├── ...
└── class_names.csv      # CSV with columns: dir_name,class_name
The class_names.csv should look like:
dir_name,class_name
class1_dir,Class 1 Name
class2_dir,Class 2 Name
...

Then, configure the dataset path:
Edit src/config.py and change:
DATASET_ROOT_PATH = '/your/actual/path/to/dataset'

cd cir_app
python run.py

The app will be available at http://localhost:8051
- Dataset Processing: The app will automatically process your dataset on first run
- Feature Extraction: CLIP embeddings and UMAP projections will be generated
- Database Creation: Search index will be built for fast retrieval
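Before the first run, it can save time to sanity-check that your dataset layout matches the expected structure. This standalone snippet (not part of the repo, just a convenience check) verifies that every `dir_name` listed in `class_names.csv` exists as a subdirectory of the dataset root:

```python
import csv
import os

def check_dataset(root: str) -> list:
    """Return the dir_name entries from class_names.csv that are missing
    as subdirectories of the dataset root (empty list = layout OK)."""
    missing = []
    with open(os.path.join(root, "class_names.csv"), newline="") as f:
        for row in csv.DictReader(f):
            if not os.path.isdir(os.path.join(root, row["dir_name"])):
                missing.append(row["dir_name"])
    return missing
```

Calling `check_dataset('/your/path/to/dataset')` should return an empty list before you point `DATASET_ROOT_PATH` at it.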
- Upload Image: Click the upload area and select a reference image
- Enter Query: Type a modification description (e.g., "wearing sunglasses")
- Search: Click "Search" to find similar images
- Explore Results: Use visualizations to understand retrieval patterns
- Enhance: Select ideal results and click "Enhance Prompt" for better queries
Reference Image + Text Modification → Composed Query
Example: [Dog Image] + "wearing a red collar" → Search for dogs with red collars
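SEARLE's actual composition maps the reference image into a pseudo-word token that is inserted into the text prompt; as a simplified illustration of the general idea, a naive late-fusion baseline just combines the image and text embeddings and ranks the gallery by cosine similarity. All names below are illustrative, not the repo's API:

```python
import numpy as np

def compose_query(image_emb: np.ndarray, text_emb: np.ndarray,
                  alpha: float = 0.5) -> np.ndarray:
    """Naive late fusion: weighted sum of L2-normalized embeddings."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_emb / np.linalg.norm(text_emb)
    q = alpha * img + (1 - alpha) * txt
    return q / np.linalg.norm(q)

def top_k(query: np.ndarray, gallery: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k gallery embeddings most cosine-similar to query."""
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ query
    return np.argsort(-sims)[:k]
```

The app's "Search" button performs this kind of ranking, but with SEARLE's learned composition instead of the fixed weighted sum shown here.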
- View Results: Top-K most similar images ranked by semantic similarity
- Visualize Embeddings: See how your query relates to the dataset in 2D space
- Analyze Patterns: Use histogram and word clouds to understand result distributions
- Saliency Maps: See which parts of images drive similarity
- Text Attribution: Understand which words are most influential
- Comparative Analysis: Compare attention between reference and candidates
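The repo's token attribution is gradient-based (GradECLIP); a model-agnostic alternative that conveys the same idea is leave-one-out ablation, where each token is dropped and the change in retrieval score is recorded. A sketch with a stand-in scoring function (`toy_score` is a toy, not the app's scorer):

```python
from typing import Callable, List, Tuple

def token_attribution(prompt: str,
                      score: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Leave-one-out attribution: a token's importance is the score drop
    observed when it is removed from the prompt."""
    tokens = prompt.split()
    base = score(prompt)
    attributions = []
    for i in range(len(tokens)):
        ablated = " ".join(tokens[:i] + tokens[i + 1:])
        attributions.append((tokens[i], base - score(ablated)))
    return attributions

# Stand-in scorer: counts how many 'important' words the prompt contains.
def toy_score(prompt: str) -> float:
    return sum(w in {"red", "collar"} for w in prompt.split())

print(token_attribution("wearing a red collar", toy_score))
# [('wearing', 0), ('a', 0), ('red', 1), ('collar', 1)]
```

The bar chart in the Explanation Panel plots exactly this kind of per-token importance, computed from gradients rather than ablation.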
- Select Ideal Results: Choose images that best match your intent
- Enhance Prompts: Generate contextually-aware improved queries
- Compare Performance: Use metrics to validate improvements
- Single Selection: Click any image to select and highlight in visualizations
- Multi-Selection: The gallery allows selecting multiple images at the same time
- Context Menu: Right-click for additional options
- Zoom: Click +/- buttons or select an area to zoom
- Pan: Click and drag to navigate
- Selection: Lasso or box select multiple points
- Class Highlighting: Click histogram bars to highlight classes
- Candidate Generation: System generates multiple enhanced prompts
- Performance Metrics: NDCG@K, mAP@K, MRR, Coverage, Mean Rank and Mean Similarity scores for each candidate
- Visual Comparison: Side-by-side results for prompt comparison
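As a reference for what two of these metrics measure (the repo's exact implementation may differ in details such as relevance grading), minimal versions of MRR and binary-relevance NDCG@K look like this:

```python
import math
from typing import List

def mrr(ranks: List[int]) -> float:
    """Mean reciprocal rank; ranks are the 1-based positions of the
    first relevant item, one per query."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def ndcg_at_k(relevances: List[int], k: int) -> float:
    """Binary-relevance NDCG@K for a single ranked result list."""
    rel = relevances[:k]
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rel))
    ideal = sorted(relevances, reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

print(mrr([1, 2]))  # 0.75
print(ndcg_at_k([1, 0, 1], 3))
```

A candidate prompt that pushes the selected ideal images toward rank 1 raises both numbers, which is how the enhancement panel compares candidates.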
# Add new CIR models in src/shared/cir_systems.py
class CustomCIRSystem:
    def query(self, reference_image_path, text_prompt, top_k):
        # Your custom implementation: build and return the top_k results
        results = []
        return results

# Configure saliency generation in config.py
SALIENCY_MAX_CANDIDATES = 5 # Limit saliency to top-5 results
SALIENCY_GENERATE_TEXT_ATTRIBUTION = True # Enable text attribution

# Enhanced UMAP with debiasing in config.py
NEW_UMAP_CONFIG = {
'style_debiasing': True, # Remove style bias
'contrastive_debiasing': True, # Enhance semantic separation
'semantic_enhancement': True, # Boost semantic clustering
'alternative_projection': 'ica', # Use ICA for projection
}

DATASET_ROOT_PATH = '/path/to/dataset' # Your dataset location
DATASET_SAMPLE_SIZE = 30000 # Max images to process
IMAGES_DIR = DATASET_ROOT_PATH # Images folder
CLASS_NAMES_PATH = 'class_names.csv' # Class labels file

CLIP_MODEL_NAME = 'ViT-B/32' # CLIP backbone
CIR_EVAL_TYPE = 'searle' # CIR method
CIR_PREPROCESS_TYPE = 'targetpad' # Image preprocessing

IMAGE_GALLERY_SIZE = 24 # Gallery grid size
IMAGE_GALLERY_ROW_SIZE = 4 # Images per row
MAX_IMAGES_ON_SCATTERPLOT = 100 # Scatterplot limit
PORT = 8051 # Server port

ENHANCEMENT_CANDIDATE_PROMPTS = 10 # Number of enhanced prompts
ENHANCEMENT_USE_CONTEXT = True # Use visual context

SALIENCY_ENABLED = True # Enable saliency maps
SALIENCY_MAX_CANDIDATES = None # Max candidates (None = all)
SALIENCY_GENERATE_REFERENCE = True # Reference saliency
SALIENCY_GENERATE_CANDIDATES = True # Candidate saliency
SALIENCY_GENERATE_TEXT_ATTRIBUTION = True # Text attribution

NEW_UMAP_CONFIG = {
# Core UMAP parameters
'pca_components': 80, # PCA preprocessing
'n_neighbors': 25, # Neighborhood size
'min_dist': 0.1, # Minimum distance
'spread': 55.0, # Spread parameter
'target_weight': 0.3, # Target weighting
'local_connectivity': 3.0, # Local connectivity
'n_epochs': 1200, # Training epochs
# Debiasing techniques
'style_debiasing': True, # Remove style bias
'contrastive_debiasing': True, # Enhance contrast
'contrastive_weight': 0.7, # Contrast strength
'alternative_projection': 'ica', # Projection method
'semantic_enhancement': False, # Semantic boost
'augment_embeddings': False, # Data augmentation
# Advanced options
'force_approximation_algorithm': False, # Force approximation
'enhanced_parameter_tuning': True, # Auto-tuning
'calculate_quality_metrics': False, # Quality metrics
'use_hdbscan': False, # Post-clustering
'hdbscan_min_cluster_size': 10, # Cluster size
}

multimedia-analytics/
├── cir_app/                       # Main application
│   ├── src/                       # Source code
│   │   ├── callbacks/             # Dash callbacks
│   │   │   ├── SEARLE/            # SEARLE integration
│   │   │   ├── freedom/           # Alternative CIR method
│   │   │   └── saliency_callbacks.py
│   │   ├── widgets/               # UI components
│   │   ├── saliency/              # Saliency system
│   │   ├── assets/                # CSS/JS assets
│   │   └── config.py              # Configuration
│   └── run.py                     # Application entry point
Dataset → CLIP Embeddings → UMAP Projection → Visualization
                  ↓
Reference Image + Text → SEARLE → Similarity Search → Results
                  ↓
Selected Results → Saliency Analysis → Explanations
                  ↓
Context Extraction → Prompt Enhancement → Improved Queries
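The UMAP stage in the pipeline above applies PCA preprocessing and can swap in an ICA alternative projection (per `NEW_UMAP_CONFIG`). With scikit-learn, those two steps look roughly like the following; the real pipeline runs umap-learn on the PCA-reduced features, which is omitted here, and the random embeddings are a stand-in for CLIP features:

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 512))   # stand-in for CLIP features

# PCA preprocessing (cf. 'pca_components': 80 in NEW_UMAP_CONFIG)
reduced = PCA(n_components=80, random_state=0).fit_transform(embeddings)

# 'alternative_projection': 'ica' -- a 2D ICA projection of the
# PCA-reduced features, used here in place of the UMAP step
coords = FastICA(n_components=2, random_state=0).fit_transform(reduced)
print(coords.shape)  # (200, 2)
```

PCA first reduces noise and dimensionality so the downstream 2D projection (UMAP or ICA) runs faster and clusters more cleanly.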
- CLIP: Multi-modal embeddings (OpenAI)
- SEARLE: Composed image retrieval (University of Florence)
- GradECLIP: Gradient-based saliency (Custom implementation)
- UMAP: Dimensionality reduction with enhancements
- Plotly: Interactive plotting and dashboards
- Dash: Web application framework
- PyTorch: Deep learning framework
- NumPy/Pandas: Data processing
- Pillow: Image processing
- Scikit-learn: Machine learning utilities