---
title: FAQ Chatbot Using RAG
emoji: 💬
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: "1.44.1"
app_file: app.py
pinned: false
---
# FAQ Chatbot Using RAG for Customer Support - Setup Instructions

Follow these steps to set up and run the e-commerce FAQ chatbot, optimized for hardware with 16-19GB RAM and an 8-11GB GPU.
## Prerequisites

- Python 3.8 or higher
- CUDA-compatible GPU with 8-11GB VRAM
- 16-19GB RAM
- Internet connection (for downloading models and datasets)
## Step 1: Create Project Directory Structure

```bash
# Create the project directory
mkdir faq-rag-chatbot
cd faq-rag-chatbot

# Create the source and data directories
mkdir -p src data
```
## Step 2: Create Virtual Environment

```bash
# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
```
## Step 3: Create Project Files

Create all the required files with the optimized code provided:

1. `requirements.txt` (a sample sketch is shown after this list)
2. `src/__init__.py`
3. `src/data_processing.py`
4. `src/embedding.py`
5. `src/llm_response.py`
6. `src/utils.py`
7. `app.py`
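
The exact contents of `requirements.txt` depend on the code provided; a plausible sketch, assuming the usual Hugging Face and retrieval stack, is:

```text
# Hypothetical requirements.txt -- adjust packages and pins to match the provided code
streamlit==1.44.1
torch
transformers
accelerate
bitsandbytes           # 4-bit quantization for larger models
sentence-transformers  # embedding model for retrieval
faiss-cpu              # vector similarity search (assumed retrieval backend)
datasets               # downloads the e-commerce FAQ dataset from Hugging Face
psutil                 # memory monitoring (also installed in Step 4)
```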
## Step 4: Install Dependencies

```bash
# Install required packages
pip install -r requirements.txt

# Additional dependency for memory monitoring
pip install psutil
```
## Step 5: Run the Application

```bash
# Make sure the virtual environment is activated
# Run the Streamlit app
streamlit run app.py
```
## Memory Optimization Notes

This implementation includes several optimizations for systems with 16-19GB RAM and an 8-11GB GPU:

1. **Default to Smaller Models**: The app defaults to Phi-2, which works well on 8GB GPUs
2. **4-bit Quantization**: Uses 4-bit quantization for larger models like Mistral-7B (see the sketch after this list)
3. **Memory Offloading**: Offloads weights to CPU when not in use
4. **Batch Processing**: Processes embeddings in smaller batches
5. **Garbage Collection**: Aggressively frees memory after operations
6. **Response Length Limits**: Generates shorter responses to save memory
7. **CPU Embedding**: Keeps the embedding model on CPU to save GPU memory for the LLM
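
To illustrate points 2, 5, and 7, here is a minimal sketch of loading a 4-bit quantized LLM on the GPU while keeping the embedder on the CPU. The model names and the `load_models` helper are illustrative assumptions; the actual logic lives in `src/embedding.py` and `src/llm_response.py`.

```python
# Minimal sketch (assumed model names); not the app's exact implementation
import gc
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer

def load_models(llm_name="mistralai/Mistral-7B-Instruct-v0.2",
                embed_name="sentence-transformers/all-MiniLM-L6-v2"):
    # 4-bit quantization keeps a 7B model within roughly 8-11GB of VRAM
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    tokenizer = AutoTokenizer.from_pretrained(llm_name)
    llm = AutoModelForCausalLM.from_pretrained(
        llm_name,
        quantization_config=bnb_config,
        device_map="auto",        # lets accelerate offload layers to CPU if needed
    )

    # Embedding model stays on CPU so the GPU is reserved for the LLM
    embedder = SentenceTransformer(embed_name, device="cpu")

    gc.collect()
    torch.cuda.empty_cache()      # aggressively free memory after loading
    return tokenizer, llm, embedder
```

Embeddings can then be computed in small batches (point 4), for example `embedder.encode(faq_texts, batch_size=16)`, which keeps peak memory usage low.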
## Using the Chatbot

1. The application will automatically download the e-commerce FAQ dataset from Hugging Face
2. Choose an appropriate model based on your available GPU memory (a simple selection heuristic is sketched after this list):
   - For 8GB GPU: Use Phi-2 (default)
   - For 10-11GB GPU: You can try Mistral-7B with 4-bit quantization
   - For limited GPU or CPU-only: Use TinyLlama-1.1B
3. Type a question or select a sample question
4. The system will retrieve relevant FAQs and generate a response
5. Monitor memory usage in the sidebar
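
If you want to automate that choice, one possible heuristic based on detected VRAM is sketched below; the thresholds and model identifiers are assumptions, not the app's actual selection logic.

```python
# Hedged sketch: pick a model by available VRAM (thresholds are assumptions)
import torch

def choose_model() -> str:
    if not torch.cuda.is_available():
        return "TinyLlama/TinyLlama-1.1B-Chat-v1.0"    # CPU-only fallback
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if vram_gb >= 10:
        return "mistralai/Mistral-7B-Instruct-v0.2"    # use with 4-bit quantization
    if vram_gb >= 8:
        return "microsoft/phi-2"                       # default
    return "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```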
## Troubleshooting

- **Out of Memory Errors**: If you encounter CUDA out-of-memory errors, switch to a smaller model like TinyLlama-1.1B
- **Slow Response Times**: The first response may be slow while the model loads; subsequent responses will be faster
- **Model Loading Issues**: If Mistral-7B fails to load, the system will automatically fall back to Phi-2 (sketched below)
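
That fallback could be implemented roughly as follows; the exception types and model names are assumptions about how `src/llm_response.py` might handle it.

```python
# Hedged sketch: fall back to Phi-2 if Mistral-7B cannot be loaded
from transformers import AutoModelForCausalLM

def load_with_fallback(primary="mistralai/Mistral-7B-Instruct-v0.2",
                       fallback="microsoft/phi-2", **kwargs):
    try:
        return AutoModelForCausalLM.from_pretrained(primary, **kwargs)
    except (RuntimeError, OSError) as err:   # e.g. CUDA OOM or download failure
        print(f"Falling back to {fallback}: {err}")
        return AutoModelForCausalLM.from_pretrained(fallback, **kwargs)
```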
## Performance Considerations

- The embedding and retrieval components work efficiently even on limited hardware
- Response generation speed depends on the model size and available GPU memory
- For optimal performance with an 8GB GPU, stick with the Phi-2 model
- For faster responses at some cost in accuracy, use TinyLlama-1.1B