---
title: FAQ Chatbot Using RAG
emoji: 💬
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
---
# FAQ Chatbot Using RAG for Customer Support - Setup Instructions
Follow these steps to set up and run the e-commerce FAQ chatbot, optimized for hardware with 16-19GB of RAM and an 8-11GB GPU.
## Prerequisites
- Python 3.8 or higher
- CUDA-compatible GPU with 8-11GB VRAM
- 16-19GB RAM
- Internet connection (for downloading models and datasets)
## Step 1: Create Project Directory Structure
```bash
# Create the project directory
mkdir faq-rag-chatbot
cd faq-rag-chatbot

# Create the source and data directories
mkdir -p src data
```
## Step 2: Create Virtual Environment
```bash
# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
```
## Step 3: Create Project Files
Create all the required files with the optimized code provided (a sample requirements.txt is sketched after this list):

- `requirements.txt`
- `src/__init__.py`
- `src/data_processing.py`
- `src/embedding.py`
- `src/llm_response.py`
- `src/utils.py`
- `app.py`
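The repository's own pinned versions are authoritative; as a rough sketch, a requirements.txt for this stack might look like the following (the package selection, e.g. accelerate and bitsandbytes for 4-bit loading, is an assumption rather than the project's actual file):

```text
streamlit==1.44.1
torch
transformers
accelerate
bitsandbytes
sentence-transformers
datasets
psutil
```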
## Step 4: Install Dependencies
```bash
# Install required packages
pip install -r requirements.txt

# Additional dependency for memory monitoring
pip install psutil
```
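For a sense of how psutil feeds the sidebar memory readout mentioned later, here is a minimal sketch; the `show_memory_usage` helper is illustrative and not necessarily how `app.py` implements it:

```python
import psutil
import streamlit as st
import torch

def show_memory_usage() -> None:
    """Show current RAM and (if available) GPU memory usage in the Streamlit sidebar."""
    ram = psutil.virtual_memory()
    st.sidebar.metric("RAM used", f"{ram.used / 1e9:.1f} / {ram.total / 1e9:.1f} GB")
    if torch.cuda.is_available():
        used_gb = torch.cuda.memory_allocated() / 1e9
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        st.sidebar.metric("GPU memory", f"{used_gb:.1f} / {total_gb:.1f} GB")
```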
## Step 5: Run the Application
```bash
# Make sure the virtual environment is activated
# Run the Streamlit app
streamlit run app.py
```
## Memory Optimization Notes
This implementation includes several optimizations for systems with 16-19GB RAM and 8-11GB GPU:
- Default to Smaller Models: The app defaults to Phi-2, which works well on 8GB GPUs
- 4-bit Quantization: Uses 4-bit quantization for larger models like Mistral-7B (see the sketch after this list)
- Memory Offloading: Offloads weights to CPU when not in use
- Batch Processing: Processes embeddings in smaller batches
- Garbage Collection: Aggressively frees memory after operations
- Response Length Limits: Generates shorter responses to save memory
- CPU Embedding: Keeps the embedding model on CPU to save GPU memory for the LLM
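As an illustration of the 4-bit quantization and offloading points above, loading Mistral-7B with bitsandbytes could look roughly like this; the model ID and configuration values are assumptions, not necessarily what `src/llm_response.py` does:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed model ID

# 4-bit NF4 quantization keeps the 7B weights within an 8-11GB VRAM budget
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",  # lets accelerate offload layers to CPU when VRAM runs out
)
```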
## Using the Chatbot
- The application will automatically download the e-commerce FAQ dataset from Hugging Face
- Choose an appropriate model based on your available GPU memory:
  - For 8GB GPU: Use Phi-2 (default)
  - For 10-11GB GPU: You can try Mistral-7B with 4-bit quantization
  - For limited GPU or CPU-only: Use TinyLlama-1.1B
- Type a question or select a sample question
- The system will retrieve relevant FAQs and generate a response (a retrieval sketch follows this list)
- Monitor memory usage in the sidebar
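To make the retrieval step concrete, a minimal sketch of CPU-side embedding and FAQ lookup with sentence-transformers might look like this; the embedding model, the placeholder FAQs, and the semantic-search approach are illustrative assumptions rather than the project's actual implementation:

```python
from sentence_transformers import SentenceTransformer, util

# Keep the embedding model on CPU so GPU memory stays free for the LLM
embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

# Placeholder FAQs; the app loads these from the Hugging Face dataset instead
faqs = [
    {"question": "How do I track my order?", "answer": "Use the tracking link in your confirmation email."},
    {"question": "What is your return policy?", "answer": "Items can be returned within 30 days of delivery."},
]

# Encode FAQ questions once, in small batches, to limit peak memory usage
faq_embeddings = embedder.encode(
    [f["question"] for f in faqs], batch_size=16, convert_to_tensor=True
)

def retrieve(query: str, top_k: int = 3):
    """Return the FAQs most similar to the user's query."""
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, faq_embeddings, top_k=top_k)[0]
    return [faqs[hit["corpus_id"]] for hit in hits]

print(retrieve("Where is my package?", top_k=2))
```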
## Troubleshooting
- Out of Memory Errors: If you encounter CUDA out of memory errors, switch to a smaller model like TinyLlama-1.1B
- Slow Response Times: The first response may be slow while the model loads; subsequent responses will be faster
- Model Loading Issues: If Mistral-7B fails to load, the system will automatically fall back to Phi-2 (a minimal sketch of this fallback follows)
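The fallback behaviour could be as simple as a try/except around model loading; this sketch only illustrates the idea, with a hypothetical `load_model` helper and an assumed Mistral model ID rather than the app's real loader:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model(model_id: str):
    """Hypothetical helper: load a causal LM and its tokenizer onto the best available device."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    return model, tokenizer

try:
    # Try the larger model first (assumed model ID)
    model, tokenizer = load_model("mistralai/Mistral-7B-Instruct-v0.2")
except (RuntimeError, OSError):
    # CUDA out-of-memory or download/loading failures land here; fall back to Phi-2
    model, tokenizer = load_model("microsoft/phi-2")
```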
## Performance Considerations
- The embedding and retrieval components work efficiently even on limited hardware
- Response generation speed depends on the model size and available GPU memory
- For optimal performance with an 8GB GPU, stick with the Phi-2 model
- For faster responses with less accuracy, use TinyLlama-1.1B