---
title: FAQ Chatbot Using RAG
emoji: 💬
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
---

# FAQ Chatbot Using RAG for Customer Support - Setup Instructions

Follow these steps to set up and run the e-commerce FAQ chatbot, which is optimized for hardware with 16-19GB of RAM and an 8-11GB GPU.

## Prerequisites

- Python 3.8 or higher
- CUDA-compatible GPU with 8-11GB VRAM
- 16-19GB RAM
- Internet connection (for downloading models and datasets)

## Step 1: Create Project Directory Structure

```bash
# Create the project directory
mkdir faq-rag-chatbot
cd faq-rag-chatbot

# Create the source and data directories
mkdir -p src data
```

## Step 2: Create Virtual Environment

```bash
# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
```

## Step 3: Create Project Files

Create all the required files with the optimized code provided (a sample `requirements.txt` is sketched after this list):

1. `requirements.txt`
2. `src/__init__.py`
3. `src/data_processing.py`
4. `src/embedding.py`
5. `src/llm_response.py`
6. `src/utils.py`
7. `app.py`
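
The exact contents of `requirements.txt` come from the provided code, so treat that file as authoritative. As a rough illustration only, a Streamlit + transformers + sentence-transformers stack of the kind described below typically needs packages along these lines (left unpinned here on purpose):

```text
# Illustrative only - use the packages and versions from the provided requirements.txt
streamlit
torch
transformers
accelerate
bitsandbytes
sentence-transformers
datasets
numpy
```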

## Step 4: Install Dependencies

```bash
# Install required packages
pip install -r requirements.txt

# Additional dependency for memory monitoring
pip install psutil
```
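
With the dependencies installed, you can optionally confirm that PyTorch can see your GPU and how much RAM the system reports. This is just a quick sanity check, assuming `torch` and `psutil` imported cleanly:

```python
# Quick hardware sanity check (illustrative; not part of the app itself).
import torch
import psutil

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")

print(f"System RAM: {psutil.virtual_memory().total / 1024**3:.1f} GB")
```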

## Step 5: Run the Application

```bash
# Make sure the virtual environment is activated
# Run the Streamlit app
streamlit run app.py
```

## Memory Optimization Notes

This implementation includes several optimizations for systems with 16-19GB RAM and an 8-11GB GPU (illustrative sketches follow the list):

1. **Default to Smaller Models**: The app defaults to Phi-2, which works well on 8GB GPUs
2. **4-bit Quantization**: Uses 4-bit quantization for larger models like Mistral-7B
3. **Memory Offloading**: Offloads weights to the CPU when not in use
4. **Batch Processing**: Processes embeddings in smaller batches
5. **Garbage Collection**: Aggressively frees memory after operations
6. **Response Length Limits**: Generates shorter responses to save memory
7. **CPU Embedding**: Keeps the embedding model on the CPU to save GPU memory for the LLM
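
To make items 2, 3, and 6 concrete, the sketch below shows one common way to load a larger model in 4-bit with `transformers` and `bitsandbytes`. The model ID and generation settings are assumptions for illustration, not necessarily what `src/llm_response.py` does:

```python
# Sketch: 4-bit loading with CPU offload and a short generation budget.
# Model ID and settings are illustrative; the app's own loader may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed model ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # lets accelerate offload layers to the CPU when VRAM is tight
)

inputs = tokenizer("What is your return policy?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)  # item 6: cap response length
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Items 4, 5, and 7 can be illustrated with `sentence-transformers`: keep the embedding model on the CPU and encode the FAQ texts in small batches (the model name and batch size here are assumptions):

```python
# Sketch: batched FAQ embedding on the CPU, keeping GPU memory free for the LLM.
import gc
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")  # assumed model

faq_texts = ["How do I track my order?", "What is your return policy?"]  # placeholder data
faq_embeddings = embedder.encode(
    faq_texts,
    batch_size=32,            # small batches bound peak memory usage
    show_progress_bar=False,
    convert_to_numpy=True,
)

gc.collect()  # item 5: aggressively release temporary buffers
```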

## Using the Chatbot

1. The application will automatically download the e-commerce FAQ dataset from Hugging Face
2. Choose an appropriate model based on your available GPU memory:
   - For an 8GB GPU: use Phi-2 (the default)
   - For a 10-11GB GPU: you can try Mistral-7B with 4-bit quantization
   - For a limited GPU or CPU-only setup: use TinyLlama-1.1B
3. Type a question or select a sample question
4. The system will retrieve the relevant FAQs and generate a response (a retrieval sketch follows this list)
5. Monitor memory usage in the sidebar
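
For reference, the retrieval step in item 4 usually boils down to something like the sketch below: embed the query, score it against the precomputed FAQ embeddings, and pass the top matches to the LLM as context. The function and variable names are illustrative, not the actual `src/embedding.py` API:

```python
# Sketch: retrieve the FAQs most similar to a query via cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")  # assumed model

def retrieve_top_k(query, faq_texts, faq_embeddings, k=3):
    """Return the k FAQ entries most similar to the query, with their scores."""
    q = embedder.encode([query], convert_to_numpy=True)[0]
    # Cosine similarity between the query and every precomputed FAQ embedding.
    sims = faq_embeddings @ q / (
        np.linalg.norm(faq_embeddings, axis=1) * np.linalg.norm(q) + 1e-10
    )
    top_idx = np.argsort(sims)[::-1][:k]
    return [(faq_texts[i], float(sims[i])) for i in top_idx]
```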

## Troubleshooting

- **Out of Memory Errors**: If you encounter CUDA out-of-memory errors, switch to a smaller model such as TinyLlama-1.1B
- **Slow Response Times**: The first response may be slow while the model loads; subsequent responses will be faster
- **Model Loading Issues**: If Mistral-7B fails to load, the system will automatically fall back to Phi-2 (see the sketch below)
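
The fallback behaviour mentioned above can be implemented with a simple try/except around model loading. The sketch below only illustrates the idea, with assumed model IDs, and is not the exact logic in `src/llm_response.py`:

```python
# Sketch: fall back to a smaller model if the preferred one fails to load.
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_with_fallback(preferred="mistralai/Mistral-7B-Instruct-v0.2",  # assumed IDs
                       fallback="microsoft/phi-2"):
    """Try the preferred model first; on failure (e.g. CUDA OOM), load the fallback."""
    for model_id in (preferred, fallback):
        try:
            tokenizer = AutoTokenizer.from_pretrained(model_id)
            model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
            return tokenizer, model, model_id
        except (RuntimeError, OSError) as err:
            print(f"Could not load {model_id}: {err}")
    raise RuntimeError("No model could be loaded")
```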

## Performance Considerations

- The embedding and retrieval components work efficiently even on limited hardware
- Response generation speed depends on the model size and available GPU memory
- For optimal performance with an 8GB GPU, stick with the Phi-2 model
- For faster responses with lower accuracy, use TinyLlama-1.1B