---
title: FAQ Chatbot Using RAG
emoji: 💬
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: "1.30.0"
app_file: app.py
pinned: false
---

# FAQ Chatbot Using RAG for Customer Support - Setup Instructions

Follow these steps to set up and run the e-commerce FAQ chatbot, optimized for hardware with 16-19 GB of RAM and an 8-11 GB GPU.

## Prerequisites

- Python 3.8 or higher
- CUDA-compatible GPU with 8-11 GB VRAM
- 16-19 GB RAM
- Internet connection (for downloading models and datasets)

## Step 1: Create the Project Directory Structure

```bash
# Create the project directory
mkdir faq-rag-chatbot
cd faq-rag-chatbot

# Create the source and data directories
mkdir -p src data
```

## Step 2: Create a Virtual Environment

```bash
# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
```

## Step 3: Create Project Files

Create all of the required files with the optimized code provided (an illustrative sketch of the embedding module follows this list):

1. `requirements.txt`
2. `src/__init__.py`
3. `src/data_processing.py`
4. `src/embedding.py`
5. `src/llm_response.py`
6. `src/utils.py`
7. `app.py`

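To illustrate how the retrieval side stays light on GPU memory, here is a hypothetical sketch of what `src/embedding.py` could look like; the model name (`all-MiniLM-L6-v2`), batch size, and function name are assumptions rather than the project's actual code:

```python
# Hypothetical sketch of src/embedding.py (model name, batch size, and function
# name are assumptions). The embedding model stays on the CPU and texts are
# encoded in small batches to keep memory usage low.
from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer

# Keep the embedding model on the CPU so GPU memory is reserved for the LLM
_embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

def embed_texts(texts: List[str], batch_size: int = 32) -> np.ndarray:
    """Encode texts in small batches to keep peak memory usage low."""
    return _embedder.encode(
        texts,
        batch_size=batch_size,
        convert_to_numpy=True,
        show_progress_bar=False,
    )
```
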
## Step 4: Install Dependencies

```bash
# Install the required packages
pip install -r requirements.txt

# Additional dependency for memory monitoring
pip install psutil
```

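The `psutil` package powers the memory readout shown in the sidebar. A minimal sketch of such a helper is shown below; the function name and exact fields are assumptions, not the project's actual `src/utils.py`:

```python
# Hypothetical memory-monitoring helper (function name and fields are assumptions).
# psutil reports system RAM; torch reports GPU memory when a GPU is available.
import psutil

def memory_report() -> dict:
    """Return current RAM (and GPU, if available) usage in GB."""
    vm = psutil.virtual_memory()
    report = {
        "ram_used_gb": round(vm.used / 1024**3, 2),
        "ram_total_gb": round(vm.total / 1024**3, 2),
    }
    try:
        import torch
        if torch.cuda.is_available():
            report["gpu_allocated_gb"] = round(torch.cuda.memory_allocated() / 1024**3, 2)
            report["gpu_reserved_gb"] = round(torch.cuda.memory_reserved() / 1024**3, 2)
    except ImportError:
        pass
    return report

if __name__ == "__main__":
    print(memory_report())
```
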
## Step 5: Run the Application

```bash
# Make sure the virtual environment is activated, then run the Streamlit app
streamlit run app.py
```

## Memory Optimization Notes

This implementation includes several optimizations for systems with 16-19 GB of RAM and an 8-11 GB GPU (a sketch of the 4-bit loading configuration follows this list):

1. **Default to a Smaller Model**: The app defaults to Phi-2, which works well on 8 GB GPUs
2. **4-bit Quantization**: Uses 4-bit quantization for larger models such as Mistral-7B
3. **Memory Offloading**: Offloads weights to the CPU when they are not in use
4. **Batch Processing**: Processes embeddings in smaller batches
5. **Garbage Collection**: Aggressively frees memory after operations
6. **Response Length Limits**: Generates shorter responses to save memory
7. **CPU Embedding**: Keeps the embedding model on the CPU to save GPU memory for the LLM

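For reference, 4-bit loading with CPU offloading typically looks like the sketch below (using `transformers` with `bitsandbytes`). The model ID and settings are assumptions for illustration; the app's actual configuration may differ:

```python
# Hypothetical sketch of 4-bit loading with CPU offloading; model ID and settings
# are assumptions and may differ from what app.py actually uses.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed; the app defaults to Phi-2

# NF4 4-bit quantization keeps a 7B-parameter model within roughly 5-6 GB of VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",  # lets accelerate offload layers to the CPU when VRAM is tight
)
```
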
## Using the Chatbot

1. The application automatically downloads the e-commerce FAQ dataset from Hugging Face
2. Choose an appropriate model based on your available GPU memory (see the sketch after this list):
   - For an 8 GB GPU: use Phi-2 (the default)
   - For a 10-11 GB GPU: you can try Mistral-7B with 4-bit quantization
   - For a limited GPU or CPU-only systems: use TinyLlama-1.1B
3. Type a question or select a sample question
4. The system retrieves the most relevant FAQs and generates a response
5. Monitor memory usage in the sidebar

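If you would rather automate the choice than pick a model in the sidebar, a helper along these lines could work; the model IDs and memory thresholds are assumptions, not values taken from the app:

```python
# Hypothetical model-selection helper; model IDs and memory thresholds are assumptions.
import torch

def pick_model() -> str:
    """Pick a model ID based on currently free GPU memory."""
    if not torch.cuda.is_available():
        return "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # CPU-only fallback
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    free_gb = free_bytes / 1024**3
    if free_gb >= 10:
        return "mistralai/Mistral-7B-Instruct-v0.2"  # needs 4-bit quantization
    if free_gb >= 6:
        return "microsoft/phi-2"                      # the default
    return "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

if __name__ == "__main__":
    print(pick_model())
```
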
## Troubleshooting

- **Out-of-Memory Errors**: If you encounter CUDA out-of-memory errors, switch to a smaller model such as TinyLlama-1.1B
- **Slow Response Times**: The first response may be slow while the model loads; subsequent responses will be faster
- **Model Loading Issues**: If Mistral-7B fails to load, the system automatically falls back to Phi-2 (see the sketch below)

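The fallback behaviour follows a try-the-larger-model-first pattern. The sketch below illustrates the idea; the loader arguments and model IDs are assumptions, not the project's actual `src/llm_response.py`:

```python
# Illustrative fallback loader; model IDs and loading arguments are assumptions.
import torch
from transformers import AutoModelForCausalLM

def load_with_fallback(primary: str = "mistralai/Mistral-7B-Instruct-v0.2",
                       fallback: str = "microsoft/phi-2"):
    """Try the larger model first; fall back to the smaller one on failure."""
    for model_id in (primary, fallback):
        try:
            return AutoModelForCausalLM.from_pretrained(
                model_id,
                torch_dtype=torch.float16,
                device_map="auto",
            )
        except (RuntimeError, OSError) as err:  # CUDA OOM surfaces as a RuntimeError
            print(f"Could not load {model_id}: {err}")
    raise RuntimeError("No model could be loaded")
```
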
## Performance Considerations

- The embedding and retrieval components work efficiently even on limited hardware
- Response generation speed depends on the model size and available GPU memory
- For optimal performance with an 8 GB GPU, stick with the Phi-2 model
- For faster responses with less accuracy, use TinyLlama-1.1B