---
title: FAQ Chatbot Using RAG
emoji: 💬
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: "1.44.1"
app_file: app.py
pinned: false
---

# FAQ Chatbot Using RAG for Customer Support - Setup Instructions

Follow these steps to set up and run the e-commerce FAQ chatbot, optimized for hardware with 16-19GB of RAM and an 8-11GB GPU.

## Prerequisites

- Python 3.8 or higher
- CUDA-compatible GPU with 8-11GB VRAM
- 16-19GB RAM
- Internet connection (for downloading models and datasets)

## Step 1: Create Project Directory Structure

```bash
# Create the project directory
mkdir faq-rag-chatbot
cd faq-rag-chatbot

# Create the source and data directories
mkdir -p src data
```

## Step 2: Create Virtual Environment

```bash
# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
```

## Step 3: Create Project Files

Create all the required files with the optimized code provided (a rough `requirements.txt` sketch follows the list):

1. `requirements.txt`
2. `src/__init__.py`
3. `src/data_processing.py`
4. `src/embedding.py`
5. `src/llm_response.py`
6. `src/utils.py`
7. `app.py`
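
The project code supplies the actual `requirements.txt`. Purely as a reference, a minimal file covering the components described in this README (Streamlit UI, Hugging Face models and datasets, 4-bit quantization, sentence embeddings) might look like the sketch below; the exact package set and pinned versions in the real project may differ. `psutil` is installed separately in Step 4.

```text
streamlit
torch
transformers
accelerate
bitsandbytes
sentence-transformers
datasets
```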

## Step 4: Install Dependencies

```bash
# Install required packages
pip install -r requirements.txt

# Additional dependency for memory monitoring
pip install psutil
```
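
With `psutil` installed, memory usage can be reported in the Streamlit sidebar (as mentioned under "Using the Chatbot" below). The snippet is a minimal sketch under that assumption, not the app's actual implementation; the helper name is hypothetical.

```python
# Minimal sketch of a psutil-based memory readout for the Streamlit sidebar.
import psutil
import streamlit as st
import torch

def show_memory_usage():
    """Show current RAM (and GPU memory, if available) in the sidebar."""
    vm = psutil.virtual_memory()
    st.sidebar.metric(
        "RAM used",
        f"{vm.used / 1e9:.1f} / {vm.total / 1e9:.1f} GB",
    )
    if torch.cuda.is_available():
        gpu_used = torch.cuda.memory_allocated() / 1e9
        st.sidebar.metric("GPU memory allocated", f"{gpu_used:.1f} GB")
```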

## Step 5: Run the Application

```bash
# Make sure the virtual environment is activated
# Run the Streamlit app
streamlit run app.py
```

## Memory Optimization Notes

This implementation includes several optimizations for systems with 16-19GB RAM and 8-11GB GPU:

1. **Default to Smaller Models**: The app defaults to Phi-2, which works well on 8GB GPUs
2. **4-bit Quantization**: Uses 4-bit quantization for larger models like Mistral-7B (see the sketch after this list)
3. **Memory Offloading**: Offloads weights to CPU when not in use
4. **Batch Processing**: Processes embeddings in smaller batches
5. **Garbage Collection**: Aggressively frees memory after operations
6. **Response Length Limits**: Generates shorter responses to save memory
7. **CPU Embedding**: Keeps the embedding model on CPU to save GPU memory for the LLM
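
As an illustration of points 2 and 3, the sketch below shows how a larger model can be loaded in 4-bit using the Hugging Face `transformers` and `bitsandbytes` APIs, with `accelerate` offloading layers to CPU when VRAM is tight. The model ID and the exact settings are assumptions for illustration, not the app's actual configuration.

```python
# Sketch: load a ~7B model in 4-bit so it fits in 8-11GB of VRAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed model ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize weights to 4-bit
    bnb_4bit_compute_dtype=torch.float16,    # compute in fp16
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",  # lets accelerate offload layers to CPU if VRAM runs out
)
```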

## Using the Chatbot

1. The application will automatically download the e-commerce FAQ dataset from Hugging Face
2. Choose an appropriate model based on your available GPU memory:
   - For 8GB GPU: Use Phi-2 (default)
   - For 10-11GB GPU: You can try Mistral-7B with 4-bit quantization
   - For limited GPU or CPU-only: Use TinyLlama-1.1B
3. Type a question or select a sample question
4. The system will retrieve relevant FAQs and generate a response (a retrieval sketch follows this list)
5. Monitor memory usage in the sidebar
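
As an illustration of step 4, the sketch below shows one way the retrieval step can work with `sentence-transformers`: FAQ entries are embedded on the CPU and ranked by cosine similarity against the question, and the top hits would then be passed to the LLM as context. The embedding model name and the FAQ strings are placeholders, not the project's actual data or code.

```python
# Sketch: embed FAQs on CPU, then retrieve the entries closest to the question.
from sentence_transformers import SentenceTransformer, util

# Keeping the embedder on CPU reserves GPU memory for the LLM.
embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")  # assumed model

faqs = [
    "How do I track my order?",
    "What is your return policy?",
    "Do you ship internationally?",
]
faq_embeddings = embedder.encode(faqs, batch_size=16, convert_to_tensor=True)

question = "Can I return an item I bought last week?"
query_embedding = embedder.encode(question, convert_to_tensor=True)

# Cosine similarity ranks the FAQs; the top hits become context for the prompt.
hits = util.semantic_search(query_embedding, faq_embeddings, top_k=2)[0]
for hit in hits:
    print(faqs[hit["corpus_id"]], hit["score"])
```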

## Troubleshooting

- **Out of Memory Errors**: If you encounter CUDA out of memory errors, switch to a smaller model like TinyLlama-1.1B
- **Slow Response Times**: The first response may be slow while the model loads; subsequent responses will be faster
- **Model Loading Issues**: If Mistral-7B fails to load, the system will automatically fall back to Phi-2 (see the fallback sketch below)
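
The fallback behaviour described above could look roughly like the sketch below; the function and model IDs are illustrative, not the project's actual code.

```python
# Sketch: try the larger model first, fall back to a smaller one on failure.
import torch
from transformers import AutoModelForCausalLM

def load_with_fallback(primary="mistralai/Mistral-7B-Instruct-v0.2",
                       fallback="microsoft/phi-2"):
    """Return the first model that loads successfully (hypothetical helper)."""
    for model_id in (primary, fallback):
        try:
            return AutoModelForCausalLM.from_pretrained(
                model_id,
                device_map="auto",
                torch_dtype=torch.float16,
            )
        except (RuntimeError, OSError) as exc:
            print(f"Failed to load {model_id}: {exc}")
            torch.cuda.empty_cache()  # release any partially allocated VRAM
    raise RuntimeError("No model could be loaded.")
```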

## Performance Considerations

- The embedding and retrieval components work efficiently even on limited hardware
- Response generation speed depends on the model size and available GPU memory
- For optimal performance with an 8GB GPU, stick with the Phi-2 model
- For faster responses at some cost in accuracy, use TinyLlama-1.1B