manisharma494 committed
Commit fdd82d2 · verified · 1 Parent(s): dbb6043

Update README.md

Files changed (1)
  1. README.md +49 -68
README.md CHANGED
@@ -12,15 +12,15 @@ license: mit

  # 🔍 Visual Search System

- A comprehensive Streamlit application for browsing and searching through a large dataset of high-quality images from Unsplash.

  ## ✨ Features

- - **🔎 Search by ID**: Find specific images by their ID number
- - **📦 Browse by Block**: Navigate through images in organized blocks of 100
- - **📥 Automatic Downloads**: Automatically downloads missing images with parallel processing
- - **🚀 Smart Dependencies**: Auto-installs required packages
- - **📱 Responsive UI**: Clean, modern interface optimized for all devices

  ## 🚀 Quick Start

@@ -44,102 +44,83 @@ A comprehensive Streamlit application for browsing and searching through a large

  ### Hugging Face Spaces Deployment

- 1. **Create a new Space** on Hugging Face
- 2. **Choose Streamlit** as the SDK
- 3. **Upload these files:**
-    - `app.py` (main application)
-    - `download_images.py` (image downloading logic)
-    - `photos_url.csv` (image dataset)
-    - `requirements.txt` (dependencies)
-    - `README.md` (this file)
-
- The app will automatically:
- - Install dependencies
- - Check for downloaded images
- - Download missing images if needed
- - Launch the Streamlit interface

  ## 📁 Project Structure

  ```
  visual-search-system/
- ├── app.py               # Main Streamlit application
- ├── download_images.py   # Image downloading utilities
- ├── photos_url.csv       # Dataset with 25,000+ image URLs
- ├── requirements.txt     # Python dependencies
- ├── README.md            # This file
- └── images/              # Downloaded images (created automatically)
  ```

  ## 🎯 How It Works

- ### Search by ID
- - Enter a specific image ID (e.g., "0001", "1234")
- - Leave empty to browse the first 500 images
- - Results update in real-time
-
- ### Range by Block
- - Each block contains 100 images
- - Enter a number between 1-250
- - Example: Block 100 shows images 10001-10100
-
- ### Image Management
- - Automatically detects existing images
- - Downloads missing images in parallel (20 workers)
- - Optimizes images to 800x800 pixels
- - Saves as compressed JPEGs

  ## 📊 Dataset Information

- - **Total Images**: 25,000+
- - **Source**: Unsplash (high-quality stock photos)
- - **Format**: JPEG, optimized for web
- - **Size**: Approximately 1.5GB total
- - **Resolution**: 800x800 pixels (maintains aspect ratio)

  ## 🛠️ Technical Details

  ### Dependencies
- - `streamlit` - Web interface framework
- - `pandas` - Data manipulation
- - `requests` - HTTP requests for image downloads
- - `pillow` - Image processing
- - `tqdm` - Progress bars

  ### Performance Features
- - **Parallel Downloads**: Uses ThreadPoolExecutor for speed
- - **Retry Logic**: Handles failed downloads gracefully
- - **Smart Caching**: Skips already downloaded images
- - **Memory Efficient**: Processes images in chunks

  ## 🔧 Configuration

  ### Environment Variables
- - No environment variables required
- - All configuration is built-in

- ### Customization
- - Modify `MAX_DISPLAY_IMAGES` in `app.py` to change display limit
- - Adjust `max_workers` in download functions for different performance
- - Change `target_size` for different image resolutions

  ## 🚨 Troubleshooting

  ### Common Issues

- 1. **"No application file found" on Hugging Face**
-    - Ensure `app.py` is the main file (not `start_app.py`)
-    - Check that `requirements.txt` is present
-    - Verify Streamlit SDK is selected

  2. **Image download failures**
     - Check internet connection
     - Verify `photos_url.csv` is present
     - Check available disk space

- 3. **Dependency issues**
-    - Ensure Python 3.8+ is used
-    - Try updating pip: `pip install --upgrade pip`

  ### Performance Tips

 

  # 🔍 Visual Search System

+ A Streamlit app that downloads images from `photos_url.csv`, builds lightweight visual embeddings, and lets you search by text (optional, via Hugging Face Inference API) or by uploading an image.

  ## ✨ Features

+ - **📥 Automatic downloads**: Pulls images from `photos_url.csv` with retries and optimization
+ - **🧠 Embeddings**: Creates simple, robust RGB histogram embeddings locally (no GPU needed)
+ - **🔤 Text search (optional)**: Uses `openai/clip-vit-base-patch32` via HF Inference API when `HF_TOKEN` is provided
+ - **Image similarity search**: Upload an image and find visually similar images
+ - **📱 Modern UI**: Streamlit interface with responsive layout and status tracking

  ## 🚀 Quick Start

 

  ### Hugging Face Spaces Deployment

+ 1. Create a new Space and select the SDK: `Streamlit`.
+ 2. Ensure this repository contains at least: `app.py`, `photos_url.csv`, `requirements.txt`, `README.md`.
+ 3. Optional: Set a Space secret named `HF_TOKEN` if you want text search enabled.
+    - In your Space, go to Settings → Secrets → Add `HF_TOKEN` (a valid Hugging Face token).
+ 4. Push/Upload files. The build will install `requirements.txt` and start `app.py` automatically.
+
+ Notes:
+ - You do NOT need a Dockerfile for Streamlit Spaces (the metadata header in this README is sufficient).
+ - Without `HF_TOKEN`, the app still works with image upload search; text search will be disabled with a warning.

  ## 📁 Project Structure

  ```
  visual-search-system/
+ ├── app.py               # Main Streamlit application (entry point)
+ ├── download_images.py   # Optional: standalone downloader utility
+ ├── photos_url.csv       # Dataset with image URLs
+ ├── requirements.txt     # Python dependencies
+ ├── README.md            # This file (contains HF Spaces metadata)
+ └── images/              # Downloaded images (created automatically)
  ```

  ## 🎯 How It Works

+ 1. On first run, the app reads `photos_url.csv` and downloads up to 250 images (configurable).
+ 2. It creates local visual embeddings using RGB histograms and saves them to `embeddings/`.
+ 3. In the UI you can:
+    - Perform text search (requires `HF_TOKEN`) against `openai/clip-vit-base-patch32` via Inference API.
+    - Upload an image to find visually similar images using cosine similarity over local embeddings.
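
The embedding and similarity steps above can be sketched as follows. This is a minimal illustration, not the code in `app.py`: the bin count (8 per channel), the L2 normalization, and the names `rgb_histogram`, `cosine_similarity`, and `rank_similar` are assumptions made for the example. It uses only the standard library and takes pixels as plain `(r, g, b)` tuples; the app itself lists `numpy` for this step.

```python
import math
from typing import Dict, Iterable, List, Sequence, Tuple

Pixel = Tuple[int, int, int]


def rgb_histogram(pixels: Iterable[Pixel], bins: int = 8) -> List[float]:
    """Concatenate per-channel histograms (R, then G, then B) and L2-normalize."""
    hist = [0.0] * (3 * bins)
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            # Map a 0..255 channel value onto a bin index 0..bins-1.
            hist[channel * bins + value * bins // 256] += 1.0
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm else hist


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_similar(
    query: Sequence[float], index: Dict[str, List[float]], top_k: int = 5
) -> List[Tuple[str, float]]:
    """Return the top_k (name, score) pairs, most similar first."""
    scores = [(name, cosine_similarity(query, emb)) for name, emb in index.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical usage: two tiny "images" given as pixel lists.
index = {
    "red.jpg": rgb_histogram([(255, 0, 0)] * 4),
    "blue.jpg": rgb_histogram([(0, 0, 255)] * 4),
}
query = rgb_histogram([(250, 5, 5)] * 3)
ranked = rank_similar(query, index, top_k=2)
```

A histogram ignores where colors appear in the image, so this kind of embedding matches overall color distribution rather than objects or layout.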

  ## 📊 Dataset Information

+ This repository expects a `photos_url.csv` with at least one column containing HTTP/HTTPS image URLs.
+ Images are stored as JPEG, optimized to ~800×800 pixels to balance quality and performance.

  ## 🛠️ Technical Details

  ### Dependencies
+ - `streamlit` - web interface
+ - `pandas` - CSV handling
+ - `requests` - HTTP downloads
+ - `pillow` - image processing
+ - `numpy` - embeddings and similarity
+ - `tqdm` - used by `download_images.py` (optional utility)

  ### Performance Features
+ - Parallel downloads with retries and exponential backoff
+ - Atomic writes for embedding/index files to avoid corruption
+ - Progress persisted to `progress.json` for resilience
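
One way these three features could fit together is sketched below. It is an illustration under stated assumptions, not the implementation in `app.py`: the function names, the retry count, and the base delay are hypothetical, and `fetch` is a stand-in for whatever performs the HTTP request (in the app, presumably a wrapper around `requests`).

```python
import json
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def fetch_with_backoff(
    url: str,
    fetch: Callable[[str], bytes],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Call fetch(url); after each failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                return None  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)
    return None


def download_all(
    urls: Iterable[str], fetch: Callable[[str], bytes], max_workers: int = 6, **retry_options
) -> Dict[str, Optional[bytes]]:
    """Fetch all URLs in a thread pool; failed URLs map to None."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda u: fetch_with_backoff(u, fetch, **retry_options), url_list)
        return dict(zip(url_list, results))


def save_json_atomic(path: str, data: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # readers see the old file or the new one, never a partial write
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Lowering `max_workers` reduces the number of simultaneous requests, which is the usual first step when a server starts rate limiting.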
 
  ## 🔧 Configuration

  ### Environment Variables
+ - `HF_TOKEN` (optional): Hugging Face token to enable text search via Inference API.
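
A minimal sketch of how an optional token might be read and turned into a request header. The helper names `get_hf_token` and `auth_headers` are hypothetical and not taken from `app.py`; the only things assumed from the README are the variable name `HF_TOKEN` and that text search is disabled when it is missing.

```python
import os
from typing import Dict, Mapping, Optional


def get_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the token from HF_TOKEN, or None when it is unset or blank."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def auth_headers(token: str) -> Dict[str, str]:
    """Build a bearer-token Authorization header for an HTTP request."""
    return {"Authorization": f"Bearer {token}"}


token = get_hf_token()
if token is None:
    print("HF_TOKEN not set: text search disabled, image upload search still available")
else:
    print("HF_TOKEN found: text search enabled")
```

On a Space, a secret named `HF_TOKEN` is exposed to the app as an environment variable, so the same lookup works locally and when deployed.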
 
+ ### Customization (in `app.py`)
+ - `MAX_IMAGES`: number of images to process (default 250)
+ - `MAX_WORKERS`: parallel download workers (default 6)
+ - `TARGET_MAX_SIZE`: image resize target (default 800×800)
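
To make the effect of `TARGET_MAX_SIZE` concrete, the sketch below computes the dimensions an image would end up with if it is shrunk to fit inside the target box. It assumes the aspect ratio is preserved and that small images are never enlarged; the README does not confirm either point for the current version, and `fit_within` is a hypothetical helper, not a function from `app.py`.

```python
from typing import Tuple

TARGET_MAX_SIZE = (800, 800)  # default quoted in the README


def fit_within(width: int, height: int, max_size: Tuple[int, int] = TARGET_MAX_SIZE) -> Tuple[int, int]:
    """Scale (width, height) down to fit inside max_size, keeping the aspect ratio."""
    max_w, max_h = max_size
    # The smaller ratio is the binding constraint; 1.0 caps it so nothing is upscaled.
    scale = min(max_w / width, max_h / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


landscape = fit_within(1600, 1200)
portrait = fit_within(1000, 4000)
small = fit_within(400, 300)
```

Pillow's `Image.thumbnail` follows the same rule (fit inside the box, keep proportions, never enlarge), so it is a natural candidate if you change the resize logic.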

  ## 🚨 Troubleshooting

  ### Common Issues

+ 1. **Space fails to start (HF Spaces)**
+    - Ensure the SDK in the Space is set to Streamlit and this README has the metadata block
+    - Confirm `app.py` and `requirements.txt` exist at the repo root

  2. **Image download failures**
     - Check internet connection
     - Verify `photos_url.csv` is present
     - Check available disk space
+    - Reduce `MAX_WORKERS` if hitting rate limits

+ 3. **Text search not working**
+    - Add `HF_TOKEN` as a Space secret
+    - Ensure the CLIP model endpoint is reachable

  ### Performance Tips