---
title: FAQ Chatbot Using RAG
emoji: 💬
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: "1.44.1"
app_file: app.py
pinned: false
---

# FAQ Chatbot Using RAG for Customer Support - Setup Instructions

Follow these steps to set up and run the e-commerce FAQ chatbot, optimized for hardware with 16-19GB of RAM and an 8-11GB GPU.

## Prerequisites

- Python 3.8 or higher
- CUDA-compatible GPU with 8-11GB VRAM
- 16-19GB RAM
- Internet connection (for downloading models and datasets)

## Step 1: Create Project Directory Structure

```bash
# Create the project directory
mkdir faq-rag-chatbot
cd faq-rag-chatbot

# Create the source and data directories
mkdir -p src data
```

## Step 2: Create Virtual Environment

```bash
# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
```

## Step 3: Create Project Files

Create all the required files with the optimized code provided (a rough `requirements.txt` sketch follows the list):

1. `requirements.txt`
2. `src/__init__.py`
3. `src/data_processing.py`
4. `src/embedding.py`
5. `src/llm_response.py`
6. `src/utils.py`
7. `app.py`
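
The project code supplies the actual `requirements.txt`. Purely as a reference, a minimal file covering the components described in this README (Streamlit UI, Hugging Face models and datasets, 4-bit quantization, sentence embeddings) might look like the sketch below; the exact package set and pinned versions in the real project may differ. `psutil` is installed separately in Step 4.

```text
streamlit
torch
transformers
accelerate
bitsandbytes
sentence-transformers
datasets
```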

## Step 4: Install Dependencies

```bash
# Install required packages
pip install -r requirements.txt

# Additional dependency for memory monitoring
pip install psutil
```
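
With `psutil` installed, memory usage can be reported in the Streamlit sidebar (as mentioned under "Using the Chatbot" below). The snippet is a minimal sketch under that assumption, not the app's actual implementation; the helper name is hypothetical.

```python
# Minimal sketch of a psutil-based memory readout for the Streamlit sidebar.
import psutil
import streamlit as st
import torch

def show_memory_usage():
    """Show current RAM (and GPU memory, if available) in the sidebar."""
    vm = psutil.virtual_memory()
    st.sidebar.metric(
        "RAM used",
        f"{vm.used / 1e9:.1f} / {vm.total / 1e9:.1f} GB",
    )
    if torch.cuda.is_available():
        gpu_used = torch.cuda.memory_allocated() / 1e9
        st.sidebar.metric("GPU memory allocated", f"{gpu_used:.1f} GB")
```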

## Step 5: Run the Application

```bash
# Make sure the virtual environment is activated
# Run the Streamlit app
streamlit run app.py
```

## Memory Optimization Notes

This implementation includes several optimizations for systems with 16-19GB RAM and 8-11GB GPU:

1. **Default to Smaller Models**: The app defaults to Phi-2, which works well on 8GB GPUs
2. **4-bit Quantization**: Uses 4-bit quantization for larger models like Mistral-7B (see the sketch after this list)
3. **Memory Offloading**: Offloads weights to CPU when not in use
4. **Batch Processing**: Processes embeddings in smaller batches
5. **Garbage Collection**: Aggressively frees memory after operations
6. **Response Length Limits**: Generates shorter responses to save memory
7. **CPU Embedding**: Keeps the embedding model on CPU to save GPU memory for the LLM
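
As an illustration of points 2 and 3, the sketch below shows how a larger model can be loaded in 4-bit using the Hugging Face `transformers` and `bitsandbytes` APIs, with `accelerate` offloading layers to CPU when VRAM is tight. The model ID and the exact settings are assumptions for illustration, not the app's actual configuration.

```python
# Sketch: load a ~7B model in 4-bit so it fits in 8-11GB of VRAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed model ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize weights to 4-bit
    bnb_4bit_compute_dtype=torch.float16,    # compute in fp16
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",  # lets accelerate offload layers to CPU if VRAM runs out
)
```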

## Using the Chatbot

1. The application will automatically download the e-commerce FAQ dataset from Hugging Face
2. Choose an appropriate model based on your available GPU memory:
   - For 8GB GPU: Use Phi-2 (default)
   - For 10-11GB GPU: You can try Mistral-7B with 4-bit quantization
   - For limited GPU or CPU-only: Use TinyLlama-1.1B
3. Type a question or select a sample question
4. The system will retrieve relevant FAQs and generate a response (a retrieval sketch follows this list)
5. Monitor memory usage in the sidebar
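
As an illustration of step 4, the sketch below shows one way the retrieval step can work with `sentence-transformers`: FAQ entries are embedded on the CPU and ranked by cosine similarity against the question, and the top hits would then be passed to the LLM as context. The embedding model name and the FAQ strings are placeholders, not the project's actual data or code.

```python
# Sketch: embed FAQs on CPU, then retrieve the entries closest to the question.
from sentence_transformers import SentenceTransformer, util

# Keeping the embedder on CPU reserves GPU memory for the LLM.
embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")  # assumed model

faqs = [
    "How do I track my order?",
    "What is your return policy?",
    "Do you ship internationally?",
]
faq_embeddings = embedder.encode(faqs, batch_size=16, convert_to_tensor=True)

question = "Can I return an item I bought last week?"
query_embedding = embedder.encode(question, convert_to_tensor=True)

# Cosine similarity ranks the FAQs; the top hits become context for the prompt.
hits = util.semantic_search(query_embedding, faq_embeddings, top_k=2)[0]
for hit in hits:
    print(faqs[hit["corpus_id"]], hit["score"])
```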

## Troubleshooting

- **Out of Memory Errors**: If you encounter CUDA out of memory errors, switch to a smaller model like TinyLlama-1.1B
- **Slow Response Times**: The first response may be slow while the model loads; subsequent responses will be faster
- **Model Loading Issues**: If Mistral-7B fails to load, the system will automatically fall back to Phi-2 (see the fallback sketch below)
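
The fallback behaviour described above could look roughly like the sketch below; the function and model IDs are illustrative, not the project's actual code.

```python
# Sketch: try the larger model first, fall back to a smaller one on failure.
import torch
from transformers import AutoModelForCausalLM

def load_with_fallback(primary="mistralai/Mistral-7B-Instruct-v0.2",
                       fallback="microsoft/phi-2"):
    """Return the first model that loads successfully (hypothetical helper)."""
    for model_id in (primary, fallback):
        try:
            return AutoModelForCausalLM.from_pretrained(
                model_id,
                device_map="auto",
                torch_dtype=torch.float16,
            )
        except (RuntimeError, OSError) as exc:
            print(f"Failed to load {model_id}: {exc}")
            torch.cuda.empty_cache()  # release any partially allocated VRAM
    raise RuntimeError("No model could be loaded.")
```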

## Performance Considerations

- The embedding and retrieval components work efficiently even on limited hardware
- Response generation speed depends on the model size and available GPU memory
- For optimal performance with an 8GB GPU, stick with the Phi-2 model
- For faster responses at some cost in accuracy, use TinyLlama-1.1B