---
title: TMD-SDG-via-LangGraph
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8501
pinned: false
---
# SDG via LangGraph
This project reproduces the RAGAS Synthetic Data Generation (SDG) steps using LangGraph instead of the RAGAS Knowledge Graph approach. A rough sketch of the graph wiring appears after the feature list below.
## Features
- Synthetic data generation using the Evol Instruct methodology
- Iterative question evolution with alternating prompts:
  - Even iterations: more challenging and insightful questions
  - Odd iterations: more creative and original questions
- Consistent state management across iterations
- Standardized JSON output format with linked questions, answers, and contexts
- Deployed as a Streamlit app on Hugging Face Spaces
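For orientation, the pipeline can be pictured as a small LangGraph `StateGraph` that chains the evolve, retrieve, and answer nodes. The sketch below is illustrative only: the import paths, function names, state fields, and edge order are assumptions, and the actual construction lives in `graph/build_graph.py`.

```python
from typing import List, TypedDict

from langgraph.graph import END, StateGraph

# Import paths are assumptions based on the project layout described below;
# the actual wiring lives in graph/build_graph.py.
from graph.nodes.answer import answer
from graph.nodes.evolve import evolve
from graph.nodes.retrieve import retrieve

class State(TypedDict):
    # Illustrative state fields; the real schema is defined in graph/types.py.
    evolved_questions: List[dict]
    contexts: List[dict]
    answers: List[dict]

builder = StateGraph(State)
builder.add_node("evolve", evolve)
builder.add_node("retrieve", retrieve)
builder.add_node("answer", answer)
builder.set_entry_point("evolve")
builder.add_edge("evolve", "retrieve")
builder.add_edge("retrieve", "answer")
builder.add_edge("answer", END)
graph = builder.compile()
```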
## Evol Instruct Implementation
This project implements the Evol Instruct methodology for evolving questions through multiple iterations. The implementation has several key aspects that should be considered when modifying the code:
### Core Principles
1. **Single Evolution Per Pass**: Each graph invocation performs one evolution step, maintaining clarity and control over the evolution process.
2. **Alternating Prompts**: The system alternates between the following (sketched after this list):
   - Challenging/insightful prompts (even-numbered iterations)
   - Creative/original prompts (odd-numbered iterations)
3. **State Management**: The evolution history is preserved across iterations of the question-evolution process, and each node in the chain processes only the latest evolved question.
4. **Configurable Evolution Count**: The number of evolution passes can be controlled through UI or environment variables, allowing flexibility in the evolution process.
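A minimal sketch of the alternating-prompt rule, assuming a list-based state and placeholder prompt text (the real state types live in `graph/types.py` and the real logic in `graph/nodes/evolve.py`):

```python
from typing import List, TypedDict

class EvolutionState(TypedDict):
    # Illustrative fields; the real state types live in graph/types.py.
    evolved_questions: List[dict]   # e.g. {"id": "q0", "question": "..."}
    contexts: List[dict]
    answers: List[dict]

# Placeholder prompt text; the real prompts are part of the evolve node.
CHALLENGING_PROMPT = "Rewrite the question to be more challenging and insightful: {question}"
CREATIVE_PROMPT = "Rewrite the question to be more creative and original: {question}"

def select_prompt(state: EvolutionState) -> str:
    """Pick the prompt from the number of existing evolutions:
    even iterations push for challenge/insight, odd ones for creativity."""
    iteration = len(state["evolved_questions"])
    return CHALLENGING_PROMPT if iteration % 2 == 0 else CREATIVE_PROMPT
```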
### Implementation Details
- The evolution logic is implemented in `graph/nodes/evolve.py`
- Prompt selection is based on the number of existing evolutions
- State management ensures each evolution builds upon previous results
- Results maintain consistent IDs (`q0`, `q1`, etc.) across questions, answers, and contexts (illustrated below)
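For illustration, the linked output might look like the following; the exact field names are assumptions, but the shared `q0`/`q1` keys are the invariant to preserve:

```python
# Hypothetical output shape: every collection is keyed by the same question IDs,
# so an answer and its contexts can always be traced back to the question.
results = {
    "questions": {"q0": "...", "q1": "..."},
    "answers":   {"q0": "...", "q1": "..."},
    "contexts":  {"q0": ["...retrieved chunk..."], "q1": ["...retrieved chunk..."]},
}
```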
### Configuration
- The number of evolution passes can be controlled via either of the following (a brief sketch follows):
  - Streamlit UI slider (web interface)
  - `NUM_EVOLVE_PASSES` environment variable (CLI)
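A sketch of how the pass count might be resolved; the variable names and slider bounds are assumptions, while the default of 2 matches the environment-variable list further down:

```python
import os
import streamlit as st

# Environment variable path (CLI); defaults to 2, matching the deployment settings below.
num_passes = int(os.getenv("NUM_EVOLVE_PASSES", "2"))

# Web interface path: a slider can override the value (the bounds here are illustrative).
num_passes = st.slider("Evolution passes", min_value=1, max_value=10, value=num_passes)
```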
### ⚠️ Important Considerations
When modifying this codebase, please keep in mind:
1. The evolution process is intentionally sequential and builds upon previous iterations (see the driver sketch after this list)
2. Maintaining the alternating prompt pattern is crucial for question diversity
3. State management between iterations must preserve the evolution history
4. The ID system (`q0`, `q1`, etc.) must remain consistent across all collections
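To make the sequential contract concrete, the driver can be thought of as the loop below. This is a sketch: the `build_graph` entry point and the state fields are assumptions, and the real loop lives in `app.py` / `main.py`.

```python
import os

# Assumed entry point; the real graph construction lives in graph/build_graph.py
# and the real driver loop in app.py / main.py.
from graph.build_graph import build_graph

graph = build_graph()
state = {"evolved_questions": [], "contexts": [], "answers": []}

# One graph invocation per pass; the returned state carries the full evolution
# history (and the q0/q1/... IDs) forward, so each pass builds on the last.
for _ in range(int(os.getenv("NUM_EVOLVE_PASSES", "2"))):
    state = graph.invoke(state)
```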
## Quick Start
### Local Development
1. Create a virtual environment:
```bash
python3.11 -m venv .venv
source .venv/bin/activate
```
2. Install dependencies:
```bash
pip install -e ".[dev]"
```
3. Run the application:
```bash
streamlit run app.py
```
4. Access the app at `http://localhost:8501`
## Deployment
### Hugging Face Spaces
1. Create a new Space on Hugging Face:
   - Go to https://huggingface.co/spaces
   - Click "New Space"
   - Choose "Docker" as the Space SDK (the Streamlit app runs inside the container, matching the `sdk: docker` setting in the front matter above)
   - Choose a hardware tier for the Space
2. Add the Hugging Face remote:
```bash
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
```
3. Push to Hugging Face:
```bash
git push hf main
```
### Environment Variables
The following environment variables need to be set in your Hugging Face Space settings (a loading sketch follows the list):
- `OPENAI_API_KEY`: Your OpenAI API key
- `LANGCHAIN_API_KEY`: Your LangChain API key (optional)
- `LANGCHAIN_PROJECT`: Your LangChain project name (optional)
- `LANGCHAIN_TRACING_V2`: Set to "true" to enable tracing (optional)
- `ENVIRONMENT`: Set to "production" for production mode
- `NUM_EVOLVE_PASSES`: Number of evolution iterations (default: 2)
- `VECTORSTORE_PATH`: Path to store vectors (default: /tmp/vectorstore)
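As a sketch, the application-level settings could be read as shown below; the `settings` dict and the `"development"` fallback are assumptions, while the other defaults mirror the list above:

```python
import os

# Illustrative settings loader; the dict layout and the "development" fallback are assumptions.
settings = {
    "openai_api_key": os.environ["OPENAI_API_KEY"],                        # required
    "environment": os.getenv("ENVIRONMENT", "development"),                # "production" on the Space
    "num_evolve_passes": int(os.getenv("NUM_EVOLVE_PASSES", "2")),
    "vectorstore_path": os.getenv("VECTORSTORE_PATH", "/tmp/vectorstore"),
}

# LANGCHAIN_TRACING_V2, LANGCHAIN_API_KEY, and LANGCHAIN_PROJECT do not need to be
# read explicitly: LangChain picks them up from the environment when tracing is enabled.
```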
## Project Structure
- `app.py`: Streamlit application for the Hugging Face deployment
- `main.py`: CLI interface with the same functionality as the web app
- `preprocess/`: Code for preprocessing HTML files and creating embeddings
- `graph/`: LangGraph implementation for synthetic data generation
  - `nodes/`: Individual graph nodes (evolve, retrieve, answer)
  - `types.py`: State management and data structures
  - `build_graph.py`: Graph construction and configuration
- `data/`: HTML files containing LLM evolution data
- `tests/`: Test files ensuring correct implementation
- `generated/`: Generated documents, vectorstore, and results