---
title: TMD-SDG-via-LangGraph
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8501
pinned: false
---
# SDG via LangGraph
This project reproduces the RAGAS Synthetic Data Generation (SDG) steps using LangGraph instead of the RAGAS Knowledge Graph approach. A rough sketch of the graph wiring appears after the feature list below.
## Features
- Synthetic data generation using the Evol Instruct methodology
- Iterative question evolution with alternating prompts:
  - Even iterations: more challenging and insightful questions
  - Odd iterations: more creative and original questions
- Consistent state management across iterations
- Standardized JSON output format with linked questions, answers, and contexts
- Deployed as a Streamlit app on Hugging Face Spaces
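For orientation, the pipeline can be pictured as a small LangGraph `StateGraph` that chains the evolve, retrieve, and answer nodes. The sketch below is illustrative only: the import paths, function names, state fields, and edge order are assumptions, and the actual construction lives in `graph/build_graph.py`.

```python
from typing import List, TypedDict

from langgraph.graph import END, StateGraph

# Import paths are assumptions based on the project layout described below;
# the actual wiring lives in graph/build_graph.py.
from graph.nodes.answer import answer
from graph.nodes.evolve import evolve
from graph.nodes.retrieve import retrieve

class State(TypedDict):
    # Illustrative state fields; the real schema is defined in graph/types.py.
    evolved_questions: List[dict]
    contexts: List[dict]
    answers: List[dict]

builder = StateGraph(State)
builder.add_node("evolve", evolve)
builder.add_node("retrieve", retrieve)
builder.add_node("answer", answer)
builder.set_entry_point("evolve")
builder.add_edge("evolve", "retrieve")
builder.add_edge("retrieve", "answer")
builder.add_edge("answer", END)
graph = builder.compile()
```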
## Evol Instruct Implementation
This project implements the Evol Instruct methodology for evolving questions through multiple iterations. The implementation has several key aspects that should be considered when modifying the code:
### Core Principles
1. **Single Evolution Per Pass**: Each graph invocation performs one evolution step, maintaining clarity and control over the evolution process.
2. **Alternating Prompts**: The system alternates between the following (sketched after this list):
   - Challenging/insightful prompts (even-numbered iterations)
   - Creative/original prompts (odd-numbered iterations)
3. **State Management**: The evolution history is preserved across iterations of the question-evolution process, and each node in the chain processes only the latest evolved question.
4. **Configurable Evolution Count**: The number of evolution passes can be controlled through UI or environment variables, allowing flexibility in the evolution process.
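A minimal sketch of the alternating-prompt rule, assuming a list-based state and placeholder prompt text (the real state types live in `graph/types.py` and the real logic in `graph/nodes/evolve.py`):

```python
from typing import List, TypedDict

class EvolutionState(TypedDict):
    # Illustrative fields; the real state types live in graph/types.py.
    evolved_questions: List[dict]   # e.g. {"id": "q0", "question": "..."}
    contexts: List[dict]
    answers: List[dict]

# Placeholder prompt text; the real prompts are part of the evolve node.
CHALLENGING_PROMPT = "Rewrite the question to be more challenging and insightful: {question}"
CREATIVE_PROMPT = "Rewrite the question to be more creative and original: {question}"

def select_prompt(state: EvolutionState) -> str:
    """Pick the prompt from the number of existing evolutions:
    even iterations push for challenge/insight, odd ones for creativity."""
    iteration = len(state["evolved_questions"])
    return CHALLENGING_PROMPT if iteration % 2 == 0 else CREATIVE_PROMPT
```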
### Implementation Details
- The evolution logic is implemented in `graph/nodes/evolve.py`
- Prompt selection is based on the number of existing evolutions
- State management ensures each evolution builds upon previous results
- Results maintain consistent IDs (`q0`, `q1`, etc.) across questions, answers, and contexts (illustrated below)
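For illustration, the linked output might look like the following; the exact field names are assumptions, but the shared `q0`/`q1` keys are the invariant to preserve:

```python
# Hypothetical output shape: every collection is keyed by the same question IDs,
# so an answer and its contexts can always be traced back to the question.
results = {
    "questions": {"q0": "...", "q1": "..."},
    "answers":   {"q0": "...", "q1": "..."},
    "contexts":  {"q0": ["...retrieved chunk..."], "q1": ["...retrieved chunk..."]},
}
```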
### Configuration
- The number of evolution passes can be controlled via either of the following (a brief sketch follows):
  - Streamlit UI slider (web interface)
  - `NUM_EVOLVE_PASSES` environment variable (CLI)
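A sketch of how the pass count might be resolved; the variable names and slider bounds are assumptions, while the default of 2 matches the environment-variable list further down:

```python
import os
import streamlit as st

# Environment variable path (CLI); defaults to 2, matching the deployment settings below.
num_passes = int(os.getenv("NUM_EVOLVE_PASSES", "2"))

# Web interface path: a slider can override the value (the bounds here are illustrative).
num_passes = st.slider("Evolution passes", min_value=1, max_value=10, value=num_passes)
```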
### ⚠️ Important Considerations
When modifying this codebase, please keep in mind:
1. The evolution process is intentionally sequential and builds upon previous iterations (see the driver sketch after this list)
2. Maintaining the alternating prompt pattern is crucial for question diversity
3. State management between iterations must preserve the evolution history
4. The ID system (`q0`, `q1`, etc.) must remain consistent across all collections
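To make the sequential contract concrete, the driver can be thought of as the loop below. This is a sketch: the `build_graph` entry point and the state fields are assumptions, and the real loop lives in `app.py` / `main.py`.

```python
import os

# Assumed entry point; the real graph construction lives in graph/build_graph.py
# and the real driver loop in app.py / main.py.
from graph.build_graph import build_graph

graph = build_graph()
state = {"evolved_questions": [], "contexts": [], "answers": []}

# One graph invocation per pass; the returned state carries the full evolution
# history (and the q0/q1/... IDs) forward, so each pass builds on the last.
for _ in range(int(os.getenv("NUM_EVOLVE_PASSES", "2"))):
    state = graph.invoke(state)
```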
## Quick Start
### Local Development
1. Create a virtual environment:
```bash
python3.11 -m venv .venv
source .venv/bin/activate
```
2. Install dependencies:
```bash
pip install -e ".[dev]"
```
3. Run the application:
```bash
streamlit run app.py
```
4. Access the app at `http://localhost:8501`
## Deployment
### Hugging Face Spaces
1. Create a new Space on Hugging Face:
   - Go to https://huggingface.co/spaces
   - Click "New Space"
   - Choose "Docker" as the Space SDK (the Streamlit app runs inside the container, matching the `sdk: docker` setting in the front matter above)
   - Choose a hardware tier for the Space
2. Add the Hugging Face remote:
```bash
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
```
3. Push to Hugging Face:
```bash
git push hf main
```
### Environment Variables
The following environment variables need to be set in your Hugging Face Space settings (a loading sketch follows the list):
- `OPENAI_API_KEY`: Your OpenAI API key
- `LANGCHAIN_API_KEY`: Your LangChain API key (optional)
- `LANGCHAIN_PROJECT`: Your LangChain project name (optional)
- `LANGCHAIN_TRACING_V2`: Set to "true" to enable tracing (optional)
- `ENVIRONMENT`: Set to "production" for production mode
- `NUM_EVOLVE_PASSES`: Number of evolution iterations (default: 2)
- `VECTORSTORE_PATH`: Path to store vectors (default: /tmp/vectorstore)
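As a sketch, the application-level settings could be read as shown below; the `settings` dict and the `"development"` fallback are assumptions, while the other defaults mirror the list above:

```python
import os

# Illustrative settings loader; the dict layout and the "development" fallback are assumptions.
settings = {
    "openai_api_key": os.environ["OPENAI_API_KEY"],                        # required
    "environment": os.getenv("ENVIRONMENT", "development"),                # "production" on the Space
    "num_evolve_passes": int(os.getenv("NUM_EVOLVE_PASSES", "2")),
    "vectorstore_path": os.getenv("VECTORSTORE_PATH", "/tmp/vectorstore"),
}

# LANGCHAIN_TRACING_V2, LANGCHAIN_API_KEY, and LANGCHAIN_PROJECT do not need to be
# read explicitly: LangChain picks them up from the environment when tracing is enabled.
```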
## Project Structure
- `app.py`: Streamlit application for the Hugging Face deployment
- `main.py`: CLI interface with the same functionality as the web app
- `preprocess/`: Code for preprocessing HTML files and creating embeddings
- `graph/`: LangGraph implementation for synthetic data generation
  - `nodes/`: Individual graph nodes (evolve, retrieve, answer)
  - `types.py`: State management and data structures
  - `build_graph.py`: Graph construction and configuration
- `data/`: HTML files containing LLM evolution data
- `tests/`: Test files ensuring correct implementation
- `generated/`: Generated documents, vectorstore, and results