---
title: FAQ Chatbot Using RAG
emoji: 💬
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
---

# FAQ Chatbot Using RAG for Customer Support - Setup Instructions

Follow these steps to set up and run the e-commerce FAQ chatbot, which is optimized for hardware with 16-19GB of RAM and an 8-11GB GPU.

## Prerequisites

- Python 3.8 or higher
- CUDA-compatible GPU with 8-11GB VRAM
- 16-19GB RAM
- Internet connection (for downloading models and datasets)

## Step 1: Create Project Directory Structure

```bash
# Create the project directory
mkdir faq-rag-chatbot
cd faq-rag-chatbot

# Create the source and data directories
mkdir -p src data
```

## Step 2: Create Virtual Environment

```bash
# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
```

## Step 3: Create Project Files

Create all the required files with the optimized code provided (a sample `requirements.txt` is sketched after this list):

1. `requirements.txt`
2. `src/__init__.py`
3. `src/data_processing.py`
4. `src/embedding.py`
5. `src/llm_response.py`
6. `src/utils.py`
7. `app.py`
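
The exact contents of `requirements.txt` come from the provided code, so treat that file as authoritative. As a rough illustration only, a Streamlit + transformers + sentence-transformers stack of the kind described below typically needs packages along these lines (left unpinned here on purpose):

```text
# Illustrative only - use the packages and versions from the provided requirements.txt
streamlit
torch
transformers
accelerate
bitsandbytes
sentence-transformers
datasets
numpy
```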

## Step 4: Install Dependencies

```bash
# Install required packages
pip install -r requirements.txt

# Additional dependency for memory monitoring
pip install psutil
```
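
With the dependencies installed, you can optionally confirm that PyTorch can see your GPU and how much RAM the system reports. This is just a quick sanity check, assuming `torch` and `psutil` imported cleanly:

```python
# Quick hardware sanity check (illustrative; not part of the app itself).
import torch
import psutil

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")

print(f"System RAM: {psutil.virtual_memory().total / 1024**3:.1f} GB")
```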

## Step 5: Run the Application

```bash
# Make sure the virtual environment is activated
# Run the Streamlit app
streamlit run app.py
```

## Memory Optimization Notes

This implementation includes several optimizations for systems with 16-19GB RAM and an 8-11GB GPU (illustrative sketches follow the list):

1. **Default to Smaller Models**: The app defaults to Phi-2, which works well on 8GB GPUs
2. **4-bit Quantization**: Uses 4-bit quantization for larger models like Mistral-7B
3. **Memory Offloading**: Offloads weights to the CPU when not in use
4. **Batch Processing**: Processes embeddings in smaller batches
5. **Garbage Collection**: Aggressively frees memory after operations
6. **Response Length Limits**: Generates shorter responses to save memory
7. **CPU Embedding**: Keeps the embedding model on the CPU to save GPU memory for the LLM
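
To make items 2, 3, and 6 concrete, the sketch below shows one common way to load a larger model in 4-bit with `transformers` and `bitsandbytes`. The model ID and generation settings are assumptions for illustration, not necessarily what `src/llm_response.py` does:

```python
# Sketch: 4-bit loading with CPU offload and a short generation budget.
# Model ID and settings are illustrative; the app's own loader may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed model ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # lets accelerate offload layers to the CPU when VRAM is tight
)

inputs = tokenizer("What is your return policy?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)  # item 6: cap response length
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Items 4, 5, and 7 can be illustrated with `sentence-transformers`: keep the embedding model on the CPU and encode the FAQ texts in small batches (the model name and batch size here are assumptions):

```python
# Sketch: batched FAQ embedding on the CPU, keeping GPU memory free for the LLM.
import gc
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")  # assumed model

faq_texts = ["How do I track my order?", "What is your return policy?"]  # placeholder data
faq_embeddings = embedder.encode(
    faq_texts,
    batch_size=32,            # small batches bound peak memory usage
    show_progress_bar=False,
    convert_to_numpy=True,
)

gc.collect()  # item 5: aggressively release temporary buffers
```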

## Using the Chatbot

1. The application will automatically download the e-commerce FAQ dataset from Hugging Face
2. Choose an appropriate model based on your available GPU memory:
   - For an 8GB GPU: use Phi-2 (the default)
   - For a 10-11GB GPU: you can try Mistral-7B with 4-bit quantization
   - For a limited GPU or CPU-only setup: use TinyLlama-1.1B
3. Type a question or select a sample question
4. The system will retrieve the relevant FAQs and generate a response (a retrieval sketch follows this list)
5. Monitor memory usage in the sidebar
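
For reference, the retrieval step in item 4 usually boils down to something like the sketch below: embed the query, score it against the precomputed FAQ embeddings, and pass the top matches to the LLM as context. The function and variable names are illustrative, not the actual `src/embedding.py` API:

```python
# Sketch: retrieve the FAQs most similar to a query via cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")  # assumed model

def retrieve_top_k(query, faq_texts, faq_embeddings, k=3):
    """Return the k FAQ entries most similar to the query, with their scores."""
    q = embedder.encode([query], convert_to_numpy=True)[0]
    # Cosine similarity between the query and every precomputed FAQ embedding.
    sims = faq_embeddings @ q / (
        np.linalg.norm(faq_embeddings, axis=1) * np.linalg.norm(q) + 1e-10
    )
    top_idx = np.argsort(sims)[::-1][:k]
    return [(faq_texts[i], float(sims[i])) for i in top_idx]
```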

## Troubleshooting

- **Out of Memory Errors**: If you encounter CUDA out-of-memory errors, switch to a smaller model such as TinyLlama-1.1B
- **Slow Response Times**: The first response may be slow while the model loads; subsequent responses will be faster
- **Model Loading Issues**: If Mistral-7B fails to load, the system will automatically fall back to Phi-2 (see the sketch below)
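
The fallback behaviour mentioned above can be implemented with a simple try/except around model loading. The sketch below only illustrates the idea, with assumed model IDs, and is not the exact logic in `src/llm_response.py`:

```python
# Sketch: fall back to a smaller model if the preferred one fails to load.
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_with_fallback(preferred="mistralai/Mistral-7B-Instruct-v0.2",  # assumed IDs
                       fallback="microsoft/phi-2"):
    """Try the preferred model first; on failure (e.g. CUDA OOM), load the fallback."""
    for model_id in (preferred, fallback):
        try:
            tokenizer = AutoTokenizer.from_pretrained(model_id)
            model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
            return tokenizer, model, model_id
        except (RuntimeError, OSError) as err:
            print(f"Could not load {model_id}: {err}")
    raise RuntimeError("No model could be loaded")
```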

## Performance Considerations

- The embedding and retrieval components work efficiently even on limited hardware
- Response generation speed depends on the model size and available GPU memory
- For optimal performance with an 8GB GPU, stick with the Phi-2 model
- For faster responses with lower accuracy, use TinyLlama-1.1B