---
title: TextLens - AI-Powered OCR
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
license: mit
---

# 🔍 TextLens - AI-Powered OCR

A modern OCR application built on a Vision-Language Model (VLM): it extracts text from images using Microsoft's Florence-2 model and falls back to EasyOCR, then to a demo mode, when Florence-2 is unavailable.

## ✨ Features

- **🤖 Advanced VLM OCR**: Uses Microsoft Florence-2 for state-of-the-art text extraction
- **🔄 Smart Fallback System**: Automatically falls back to EasyOCR if Florence-2 fails
- **🧪 Demo Mode**: Test mode for demonstration when other methods are unavailable
- **🎨 Modern UI**: Clean, responsive Gradio interface with excellent UX
- **📱 Multiple Input Methods**: Upload, webcam, clipboard support
- **⚡ Real-time Processing**: Automatic text extraction on image upload
- **📋 Copy Functionality**: Easy text copying from results
- **🚀 GPU Acceleration**: Supports CUDA, MPS, and CPU inference
- **🛡️ Error Handling**: Robust error handling and user-friendly messages

## 🏗️ Architecture

```
textlens-ocr/
├── app.py                 # Main Gradio application
├── requirements.txt       # Python dependencies
├── README.md              # Project documentation
├── models/                # OCR processing modules
│   ├── __init__.py
│   └── ocr_processor.py   # Advanced OCR class with fallbacks
├── utils/                 # Utility functions
│   ├── __init__.py
│   └── image_utils.py     # Image preprocessing utilities
└── ui/                    # User interface components
    ├── __init__.py
    ├── interface.py       # Gradio interface
    ├── handlers.py        # Event handlers
    └── styles.py          # CSS styling
```

## 🚀 Quick Start

### Local Development

1. **Clone the repository**

   ```bash
   git clone https://github.com/KumarAmrit30/textlens-ocr.git
   cd textlens-ocr
   ```

2. **Set up Python environment**

   ```bash
   python3 -m venv textlens_env
   source textlens_env/bin/activate  # On Windows: textlens_env\Scripts\activate
   ```

3. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

4. **Run the application**

   ```bash
   python app.py
   ```

5. **Open your browser**
   Navigate to `http://localhost:7860`

### Quick Test

Run the test suite to verify everything works:

```bash
python test_ocr.py
```

## 🔧 Technical Details

### OCR Processing Pipeline

1. **Primary**: Microsoft Florence-2 VLM

   - State-of-the-art vision-language model
   - Supports both basic OCR and region-based extraction
   - GPU accelerated inference

2. **Fallback**: EasyOCR

   - Traditional OCR with good accuracy
   - Works when Florence-2 fails to load
   - Multi-language support

3. **Demo Mode**: Test Mode
   - Demonstration functionality
   - Shows interface working correctly
   - Used when other methods are unavailable
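
The three-tier order above can be sketched as a simple fallback chain. This is a minimal illustration, not the project's actual implementation: `extract_with_fallback` and the three stand-in backend functions are hypothetical names, and the first backend raises on purpose to simulate Florence-2 failing to load.

```python
from typing import Callable, Sequence, Tuple

# Each backend is a (name, callable) pair; the callable takes an image and returns text.
Backend = Tuple[str, Callable[[object], str]]


def extract_with_fallback(image: object, backends: Sequence[Backend]) -> Tuple[str, str]:
    """Try each backend in order; return (backend_name, text) from the first that succeeds."""
    errors = []
    for name, run in backends:
        try:
            return name, run(image)
        except Exception as exc:  # any failure moves on to the next backend
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all OCR backends failed: " + "; ".join(errors))


# Hypothetical stand-ins for the three tiers described above.
def florence_ocr(image: object) -> str:
    raise RuntimeError("model not loaded")  # simulate Florence-2 being unavailable


def easyocr_ocr(image: object) -> str:
    return "text from EasyOCR"


def demo_ocr(image: object) -> str:
    return "demo text"


PIPELINE = [("florence-2", florence_ocr), ("easyocr", easyocr_ocr), ("demo", demo_ocr)]
```

With these stand-ins, `extract_with_fallback(image, PIPELINE)` skips the failing first tier and returns the EasyOCR result together with the name of the backend that produced it, which is useful for showing the user which method was used.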

### Model Loading Strategy

The application first tries a pinned revision of Florence-2 and falls back to the default revision if that fails:

```python
from transformers import AutoModelForCausalLM

MODEL_ID = "microsoft/Florence-2-base"

try:
    # Try Florence-2 pinned to a specific revision first
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        revision="refs/pr/6",
        trust_remote_code=True,
    )
except Exception:
    # Fall back to the default revision if the pinned one cannot be loaded
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        trust_remote_code=True,
    )
```

### Device Detection

Automatically detects and uses the best available device:

- **CUDA**: NVIDIA GPUs with CUDA support
- **MPS**: Apple Silicon Macs (M1/M2/M3)
- **CPU**: Fallback for all systems
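
A minimal sketch of this priority order, assuming PyTorch is the backend being queried. The function names `pick_device` and `detect_device` are illustrative and may not match what `ocr_processor.py` actually does.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device string given what the machine offers."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends; fall back to CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch builds
    return pick_device(torch.cuda.is_available(), bool(mps and mps.is_available()))
```

Keeping the priority logic in a pure function makes it easy to test without a GPU, while `detect_device()` does the actual hardware query.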

## 📊 Performance

| Model            | Size        | Speed  | Accuracy  | Use Case              |
| ---------------- | ----------- | ------ | --------- | --------------------- |
| Florence-2-base  | 230M params | Fast   | High      | General OCR           |
| Florence-2-large | 770M params | Medium | Very High | High accuracy needs   |
| EasyOCR          | ~100 MB     | Medium | Good      | Fallback/Multilingual |

## 🔍 Supported Image Formats

- **JPEG** (.jpg, .jpeg)
- **PNG** (.png)
- **WebP** (.webp)
- **BMP** (.bmp)
- **TIFF** (.tiff, .tif)
- **GIF** (.gif)
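
One way to validate an upload against this list is a case-insensitive extension check. The helper below is a hypothetical sketch (`is_supported_image` is not part of the documented API) and only inspects the file name, not the file contents.

```python
from pathlib import Path

# Extensions taken from the list above, lower-cased for comparison.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".bmp", ".tiff", ".tif", ".gif"}


def is_supported_image(path: str) -> bool:
    """Return True if the file extension matches a supported format (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, `is_supported_image("scan.JPG")` is true and `is_supported_image("notes.pdf")` is false.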

## 🎯 Use Cases

- **📄 Document Digitization**: Convert physical documents to text
- **🏪 Receipt Processing**: Extract data from receipts and invoices
- **📱 Screenshot Text Extraction**: Get text from app screenshots
- **🚗 License Plate Reading**: Extract text from vehicle plates
- **📚 Book/Article Scanning**: Digitize printed materials
- **🌐 Multilingual Text**: Process text in various languages

## 🛠️ Configuration

### Model Selection

Change the model in `models/ocr_processor.py`:

```python
# For faster inference
ocr = OCRProcessor(model_name="microsoft/Florence-2-base")

# For higher accuracy
ocr = OCRProcessor(model_name="microsoft/Florence-2-large")
```

### UI Customization

Modify the Gradio interface in `app.py`:

- Update colors and styling in the CSS section
- Change layout in the `create_interface()` function
- Add new features or components

## 🧪 Testing

The project includes comprehensive tests:

```bash
# Run all tests
python test_ocr.py

# Test specific functionality
python -c "from models.ocr_processor import OCRProcessor; ocr = OCRProcessor(); print(ocr.get_model_info())"
```

## 🚀 Deployment

### HuggingFace Spaces

1. Fork this repository
2. Create a new Space on HuggingFace
3. Connect your repository
4. The app will automatically deploy

### Docker Deployment

```dockerfile
FROM python:3.9-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

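COPY . .
# Bind to all interfaces so the app is reachable from outside the container
# (has no effect if app.py hard-codes server_name in launch())
ENV GRADIO_SERVER_NAME=0.0.0.0
EXPOSE 7860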

CMD ["python", "app.py"]
```

### Local Server

```bash
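# Gradio runs its own ASGI server, so launch app.py directly rather than
# through gunicorn, whose default workers expect a WSGI application.
# Assumes app.py does not hard-code server_name or server_port.
GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 python app.py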
```

## 🔐 Environment Variables

| Variable               | Description           | Default                    |
| ---------------------- | --------------------- | -------------------------- |
| `GRADIO_SERVER_PORT`   | Server port           | 7860                       |
| `TRANSFORMERS_CACHE`   | Model cache directory | `~/.cache/huggingface/hub` |
| `CUDA_VISIBLE_DEVICES` | GPU device selection  | All available              |
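
As a rough illustration of how such variables are typically read with their defaults, the sketch below uses only the standard library. The helper names are hypothetical, and Gradio and PyTorch already read these variables themselves, so application code rarely needs to.

```python
import os


def server_port(default: int = 7860) -> int:
    """Read GRADIO_SERVER_PORT, falling back to the documented default."""
    return int(os.environ.get("GRADIO_SERVER_PORT", default))


def visible_gpus() -> str:
    """Return the CUDA_VISIBLE_DEVICES value, or a label when it is unset."""
    return os.environ.get("CUDA_VISIBLE_DEVICES", "all available")
```

Setting `GRADIO_SERVER_PORT=8080` before starting the process would make `server_port()` return `8080`; with the variable unset it returns `7860`.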

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Submit a pull request

## 📝 API Reference

### OCRProcessor Class

```python
from models.ocr_processor import OCRProcessor

# Initialize
ocr = OCRProcessor(model_name="microsoft/Florence-2-base")

# Extract text
text = ocr.extract_text(image)

# Extract with regions
result = ocr.extract_text_with_regions(image)

# Get model info
info = ocr.get_model_info()
```

## 🐛 Troubleshooting

### Common Issues

1. **Model Loading Errors**

   ```bash
   # Install missing dependencies
   pip install einops timm
   ```

2. **CUDA Out of Memory**

   ```python
   # Use CPU instead
   ocr = OCRProcessor()
   ocr.device = "cpu"
   ```

3. **SSL Certificate Errors**
   ```bash
   # Update certificates (macOS)
   /Applications/Python\ 3.x/Install\ Certificates.command
   ```

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- **Microsoft** for the Florence-2 model
- **HuggingFace** for the transformers library
- **Gradio** for the web interface framework
- **EasyOCR** for fallback OCR capabilities

## 📞 Support

- Create an issue for bug reports
- Start a discussion for feature requests
- Check existing issues before posting

---

**Made with ❤️ for the AI community**