Space: manisharma494/Virtual-Search-System


manisharma494 committed on Sep 5

Commit fdd82d2 (verified) · 1 parent: dbb6043

Update README.md


Files changed (1)

README.md +49 -68


@@ -12,15 +12,15 @@ license: mit
 # 🔍 Visual Search System
-A comprehensive Streamlit application for browsing and searching through a large dataset of high-quality images from Unsplash.
 ## ✨ Features
-- **🔎 Search by ID**: Find specific images by their ID number
-- **📦 Browse by Block**: Navigate through images in organized blocks of 100
-- **📥 Automatic Downloads**: Automatically downloads missing images with parallel processing
-- **🚀 Smart Dependencies**: Auto-installs required packages
-- **📱 Responsive UI**: Clean, modern interface optimized for all devices
 ## 🚀 Quick Start
@@ -44,102 +44,83 @@ A comprehensive Streamlit application for browsing and searching through a large
 ### Hugging Face Spaces Deployment
-1. **Create a new Space** on Hugging Face
-2. **Choose Streamlit** as the SDK
-3. **Upload these files:**
-   - `app.py` (main application)
-   - `download_images.py` (image downloading logic)
-   - `photos_url.csv` (image dataset)
-   - `requirements.txt` (dependencies)
-   - `README.md` (this file)
-The app will automatically:
-- Install dependencies
-- Check for downloaded images
-- Download missing images if needed
-- Launch the Streamlit interface
 ## 📁 Project Structure
 ```
 visual-search-system/
-├── app.py                 # Main Streamlit application
-├── download_images.py     # Image downloading utilities
-├── photos_url.csv        # Dataset with 25,000+ image URLs
-├── requirements.txt      # Python dependencies
-├── README.md            # This file
-└── images/              # Downloaded images (created automatically)
 ```
 ## 🎯 How It Works
-### Search by ID
-- Enter a specific image ID (e.g., "0001", "1234")
-- Leave empty to browse the first 500 images
-- Results update in real-time
-### Range by Block
-- Each block contains 100 images
-- Enter a number between 1-250
-- Example: Block 100 shows images 10001-10100
-### Image Management
-- Automatically detects existing images
-- Downloads missing images in parallel (20 workers)
-- Optimizes images to 800x800 pixels
-- Saves as compressed JPEGs
 ## 📊 Dataset Information
-- **Total Images**: 25,000+
-- **Source**: Unsplash (high-quality stock photos)
-- **Format**: JPEG, optimized for web
-- **Size**: Approximately 1.5GB total
-- **Resolution**: 800x800 pixels (maintains aspect ratio)
 ## 🛠️ Technical Details
 ### Dependencies
-- `streamlit` - Web interface framework
-- `pandas` - Data manipulation
-- `requests` - HTTP requests for image downloads
-- `pillow` - Image processing
-- `tqdm` - Progress bars
 ### Performance Features
-- **Parallel Downloads**: Uses ThreadPoolExecutor for speed
-- **Retry Logic**: Handles failed downloads gracefully
-- **Smart Caching**: Skips already downloaded images
-- **Memory Efficient**: Processes images in chunks
 ## 🔧 Configuration
 ### Environment Variables
-- No environment variables required
-- All configuration is built-in
-### Customization
-- Modify `MAX_DISPLAY_IMAGES` in `app.py` to change display limit
-- Adjust `max_workers` in download functions for different performance
-- Change `target_size` for different image resolutions
 ## 🚨 Troubleshooting
 ### Common Issues
-1. **"No application file found" on Hugging Face**
-   - Ensure `app.py` is the main file (not `start_app.py`)
-   - Check that `requirements.txt` is present
-   - Verify Streamlit SDK is selected
 2. **Image download failures**
    - Check internet connection
    - Verify `photos_url.csv` is present
    - Check available disk space
-3. **Dependency issues**
-   - Ensure Python 3.8+ is used
-   - Try updating pip: `pip install --upgrade pip`
 ### Performance Tips

 # 🔍 Visual Search System
+A Streamlit app that downloads images from `photos_url.csv`, builds lightweight visual embeddings, and lets you search by text (optional, via Hugging Face Inference API) or by uploading an image.
 ## ✨ Features
+- **📥 Automatic downloads**: Pulls images from `photos_url.csv` with retries and optimization
+- **🧠 Embeddings**: Creates simple, robust RGB histogram embeddings locally (no GPU needed)
+- **🔤 Text search (optional)**: Uses `openai/clip-vit-base-patch32` via HF Inference API when `HF_TOKEN` is provided
+- **📁 Image similarity search**: Upload an image and find visually similar images
+- **📱 Modern UI**: Streamlit interface with responsive layout and status tracking
 ## 🚀 Quick Start
 ### Hugging Face Spaces Deployment
+1. Create a new Space and select the SDK: `Streamlit`.
+2. Ensure this repository contains at least: `app.py`, `photos_url.csv`, `requirements.txt`, `README.md`.
+3. Optional: Set a Space secret named `HF_TOKEN` if you want text search enabled.
+   - In your Space, go to Settings → Secrets → Add `HF_TOKEN` (a valid Hugging Face token).
+4. Push/Upload files. The build will install `requirements.txt` and start `app.py` automatically.
+Notes:
+- You do NOT need a Dockerfile for Streamlit Spaces (the metadata header in this README is sufficient).
+- Without `HF_TOKEN`, the app still works with image upload search; text search will be disabled with a warning.
 ## 📁 Project Structure
 ```
 visual-search-system/
+├── app.py                 # Main Streamlit application (entry point)
+├── download_images.py     # Optional: standalone downloader utility
+├── photos_url.csv         # Dataset with image URLs
+├── requirements.txt       # Python dependencies
+├── README.md              # This file (contains HF Spaces metadata)
+└── images/                # Downloaded images (created automatically)
 ```
 ## 🎯 How It Works
+1. On first run, the app reads `photos_url.csv` and downloads up to 250 images (configurable).
+2. It creates local visual embeddings using RGB histograms and saves them to `embeddings/`.
+3. In the UI you can:
+   - Perform text search (requires `HF_TOKEN`) against `openai/clip-vit-base-patch32` via Inference API.
+   - Upload an image to find visually similar images using cosine similarity over local embeddings.
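
The embedding and similarity steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in `app.py`: the function names (`rgb_histogram`, `cosine_similarity`, `rank_by_similarity`), the choice of 8 bins per channel, and the use of plain Python lists instead of `numpy` arrays are all assumptions made to keep the example self-contained.

```python
import math
from typing import Dict, Iterable, List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # one (R, G, B) triple, each channel in 0..255


def rgb_histogram(pixels: Iterable[Pixel], bins: int = 8) -> List[float]:
    """Concatenate per-channel histograms and L2-normalize the result.

    `bins` is a hypothetical parameter; the README does not state a bin count.
    """
    if bins <= 0 or 256 % bins:
        raise ValueError("bins must be a positive divisor of 256")
    width = 256 // bins
    hist = [0.0] * (3 * bins)
    for r, g, b in pixels:
        hist[r // width] += 1             # red channel bins
        hist[bins + g // width] += 1      # green channel bins
        hist[2 * bins + b // width] += 1  # blue channel bins
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm else hist


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_similarity(
    query: Sequence[float], index: Dict[str, List[float]], top_k: int = 5
) -> List[Tuple[str, float]]:
    """Return the `top_k` (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Tiny synthetic "images": flat colour patches given as pixel lists.
index = {
    "red.jpg": rgb_histogram([(255, 0, 0)] * 4),
    "blue.jpg": rgb_histogram([(0, 0, 255)] * 4),
}
query = rgb_histogram([(200, 10, 10)] * 4)  # a darker red
print(rank_by_similarity(query, index, top_k=2))
```

Colour histograms ignore where colours appear in the image, so this kind of embedding matches overall palette rather than shape or content.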
 ## 📊 Dataset Information
+This repository expects a `photos_url.csv` with at least one column containing HTTP/HTTPS image URLs.
+Images are stored as JPEG, optimized to ~800×800 pixels to balance quality and performance.
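
As a rough illustration of the CSV expectation above, the sketch below collects HTTP/HTTPS URLs from whichever column holds them. It uses the standard-library `csv` module rather than `pandas` to stay self-contained; the function name `extract_image_urls`, the sample column names, and the `example.com` URLs are hypothetical and not taken from the repository.

```python
import csv
import io
from typing import Iterable, List


def extract_image_urls(lines: Iterable[str], limit: int = 250) -> List[str]:
    """Return up to `limit` HTTP/HTTPS URLs, taking at most one per CSV row."""
    urls: List[str] = []
    for row in csv.reader(lines):
        if len(urls) >= limit:
            break
        for cell in row:
            value = cell.strip()
            if value.lower().startswith(("http://", "https://")):
                urls.append(value)
                break  # first URL-looking cell wins for this row
    return urls


# Hypothetical sample data; a real run would pass an open file instead:
#   with open("photos_url.csv", newline="", encoding="utf-8") as f:
#       urls = extract_image_urls(f)
sample = (
    "photo_id,photo_image_url\n"
    "abc,https://example.com/a.jpg\n"
    "def,not-a-url\n"
    "ghi,http://example.com/b.jpg\n"
)
print(extract_image_urls(io.StringIO(sample)))
```

The header row is skipped naturally because none of its cells start with `http://` or `https://`.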
 ## 🛠️ Technical Details
 ### Dependencies
+- `streamlit` - web interface
+- `pandas` - CSV handling
+- `requests` - HTTP downloads
+- `pillow` - image processing
+- `numpy` - embeddings and similarity
+- `tqdm` - used by `download_images.py` (optional utility)
 ### Performance Features
+- Parallel downloads with retries and exponential backoff
+- Atomic writes for embedding/index files to avoid corruption
+- Progress persisted to `progress.json` for resilience
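
Two of the techniques listed above, retries with exponential backoff and atomic file writes, could look something like the following. This is a sketch under stated assumptions: the helper names `retry_with_backoff` and `atomic_write_json`, the default of 3 attempts, and the 0.5 s base delay are illustrative choices, and the actual `app.py` may differ.

```python
import json
import os
import tempfile
import time
from typing import Any, Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], Any] = time.sleep,
) -> T:
    """Call `fn`, waiting base_delay * 2**n seconds after the n-th failure."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:  # a real downloader would catch narrower errors
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


def atomic_write_json(path: str, data: Any) -> None:
    """Write JSON to a temp file in the same directory, then rename over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)  # readers see the old or new file, never a partial one
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A usage example might be `atomic_write_json("progress.json", {"downloaded": 120})` after each batch, so an interrupted run leaves the previous progress file intact.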
 ## 🔧 Configuration
 ### Environment Variables
+- `HF_TOKEN` (optional): Hugging Face token to enable text search via Inference API.
+### Customization (in `app.py`)
+- `MAX_IMAGES`: number of images to process (default 250)
+- `MAX_WORKERS`: parallel download workers (default 6)
+- `TARGET_MAX_SIZE`: image resize target (default 800×800)
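
For orientation, the settings above might be wired up along these lines. Only the three constant names, their defaults, and the `HF_TOKEN` variable come from the README; representing `TARGET_MAX_SIZE` as a `(width, height)` tuple and the helpers `text_search_enabled` and `fit_within` are assumptions for illustration.

```python
import os
from typing import Mapping, Tuple

MAX_IMAGES = 250               # number of images to process
MAX_WORKERS = 6                # parallel download workers
TARGET_MAX_SIZE = (800, 800)   # assumed (width, height) bound for resizing


def text_search_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Text search is only available when a non-empty HF_TOKEN is set."""
    return bool(env.get("HF_TOKEN", "").strip())


def fit_within(size: Tuple[int, int], target: Tuple[int, int] = TARGET_MAX_SIZE) -> Tuple[int, int]:
    """Shrink `size` to fit inside `target`, keeping aspect ratio; never upscale."""
    width, height = size
    scale = min(target[0] / width, target[1] / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within((1600, 1200)))  # a landscape image scaled down to fit 800x800
print(text_search_enabled({}))   # no token in this mapping
```

Lowering `MAX_WORKERS` trades download speed for fewer simultaneous requests, which can help when the image host rate-limits.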
 ## 🚨 Troubleshooting
 ### Common Issues
+1. **Space fails to start (HF Spaces)**
+   - Ensure the SDK in the Space is set to Streamlit and this README has the metadata block
+   - Confirm `app.py` and `requirements.txt` exist at the repo root
 2. **Image download failures**
    - Check internet connection
    - Verify `photos_url.csv` is present
    - Check available disk space
+   - Reduce `MAX_WORKERS` if hitting rate limits
+3. **Text search not working**
+   - Add `HF_TOKEN` as a Space secret
+   - Ensure the CLIP model endpoint is reachable
 ### Performance Tips