metadata
title: TMD-SDG-via-LangGraph
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8501
pinned: false
SDG via LangGraph
This project reproduces the RAGAS Synthetic Data Generation steps using LangGraph instead of the Knowledge Graph approach.
Features
- Synthetic data generation using Evol Instruct methodology
- Iterative question evolution with alternating prompts:
  - Even iterations: More challenging and insightful questions
  - Odd iterations: More creative and original questions
- Consistent state management across iterations
- Standardized JSON output format with linked questions, answers, and contexts
- Deployed as a Streamlit app on Hugging Face Spaces
Evol Instruct Implementation
This project implements the Evol Instruct methodology for evolving questions through multiple iterations. The implementation has several key aspects that should be considered when modifying the code:
Core Principles
- Single Evolution Per Pass: Each graph invocation performs one evolution step, maintaining clarity and control over the evolution process.
- Alternating Prompts: The system alternates between:
  - Challenging/insightful prompts (even-numbered iterations)
  - Creative/original prompts (odd-numbered iterations)
- State Management: Evolution history is preserved across iterations, and each node in the chain processes only the latest evolved question.
- Configurable Evolution Count: The number of evolution passes can be controlled through UI or environment variables, allowing flexibility in the evolution process.
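The single-step and state-management principles above can be sketched in plain Python. This is a minimal illustration rather than the project's code: `EvolveState`, `evolve_once`, and `run_passes` are hypothetical names, and the stand-in `evolve_once` only appends a marker where the real node would call an LLM.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EvolveState:
    """Hypothetical state: the seed question plus every evolved version."""
    seed_question: str
    evolutions: List[str] = field(default_factory=list)

    @property
    def latest(self) -> str:
        # Nodes only ever look at the most recent question.
        return self.evolutions[-1] if self.evolutions else self.seed_question


def evolve_once(state: EvolveState) -> EvolveState:
    """One evolution step, standing in for a single graph invocation."""
    step = len(state.evolutions) + 1
    evolved = f"{state.latest} (evolved x{step})"  # placeholder for an LLM call
    # Build a new state so the history is extended, never overwritten.
    return EvolveState(state.seed_question, state.evolutions + [evolved])


def run_passes(seed_question: str, num_passes: int) -> EvolveState:
    """Invoke the single-step evolution num_passes times in sequence."""
    state = EvolveState(seed_question)
    for _ in range(num_passes):
        state = evolve_once(state)
    return state
```

Keeping each invocation to one step means the pass count stays an outer-loop concern, which is what lets a UI slider or an environment variable control it without touching the step itself.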
Implementation Details
- The evolution logic is implemented in graph/nodes/evolve.py
- Prompt selection is based on the number of existing evolutions
- State management ensures each evolution builds upon previous results
- Results maintain consistent IDs (q0, q1, etc.) across questions, answers, and contexts
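Prompt selection from the evolution count could look roughly like the sketch below. The template wording and the names `CHALLENGING_PROMPT`, `CREATIVE_PROMPT`, and `select_prompt` are assumptions for illustration, as is the zero-based counting (no existing evolutions counts as an even iteration); the actual logic lives in graph/nodes/evolve.py.

```python
# Hypothetical templates; the project's real prompt wording is not shown here.
CHALLENGING_PROMPT = (
    "Rewrite the question to be more challenging and insightful:\n{question}"
)
CREATIVE_PROMPT = (
    "Rewrite the question to be more creative and original:\n{question}"
)


def select_prompt(num_existing_evolutions: int) -> str:
    """Pick a template from the number of evolutions already performed."""
    if num_existing_evolutions < 0:
        raise ValueError("num_existing_evolutions must be non-negative")
    # Even count -> challenging/insightful, odd count -> creative/original.
    if num_existing_evolutions % 2 == 0:
        return CHALLENGING_PROMPT
    return CREATIVE_PROMPT
```

For example, `select_prompt(0).format(question="What is RAG?")` would produce the challenging variant for the first pass, and the next pass would switch to the creative one.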
Configuration
- Number of evolution passes can be controlled via:
  - Streamlit UI slider (web interface)
  - NUM_EVOLVE_PASSES environment variable (CLI)
⚠️ Important Considerations
When modifying this codebase, please keep in mind:
- The evolution process is intentionally sequential and builds upon previous iterations
- Maintaining the alternating prompt pattern is crucial for question diversity
- State management between iterations must preserve the evolution history
- The ID system (q0, q1, etc.) must remain consistent across all collections
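One way to guard the ID requirement when changing the code is a small consistency check over the results. The sketch below assumes a result shape that this README does not specify: a dict with `questions`, `answers`, and `contexts` lists, where questions carry an `id` field and the other two carry a `question_id` field. Those field names and the `check_linked_ids` helper are hypothetical.

```python
from typing import Dict, List


def check_linked_ids(results: Dict[str, List[dict]]) -> List[str]:
    """Return a list of problems; an empty list means the IDs line up."""
    problems = []
    question_ids = [q["id"] for q in results["questions"]]

    # Questions are expected to be numbered q0, q1, ... in order.
    expected = [f"q{i}" for i in range(len(question_ids))]
    if question_ids != expected:
        problems.append(f"questions: expected ids {expected}, got {question_ids}")

    # Every answer and context must point at a known question, and vice versa.
    for name in ("answers", "contexts"):
        linked = {item["question_id"] for item in results[name]}
        missing = sorted(set(question_ids) - linked)
        unknown = sorted(linked - set(question_ids))
        if missing:
            problems.append(f"{name}: no entry for {missing}")
        if unknown:
            problems.append(f"{name}: unknown ids {unknown}")
    return problems
```

Returning a list of messages instead of raising on the first mismatch makes it easier to see every broken link at once in a test failure.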
Quick Start
Local Development
- Create a virtual environment:
python3.11 -m venv .venv
source .venv/bin/activate
- Install dependencies:
pip install -e ".[dev]"
- Run the application:
streamlit run app.py
- Access the app at http://localhost:8501
Deployment
HuggingFace Spaces
- Create a new Space on HuggingFace:
  - Go to https://huggingface.co/spaces
  - Click "New Space"
  - Choose "Docker" as the SDK (the Streamlit app runs inside the container; the Space metadata sets sdk: docker and app_port: 8501)
- Add the HuggingFace remote:
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
- Push to HuggingFace:
git push hf main
Environment Variables
The following environment variables need to be set in your HuggingFace Space settings:
- OPENAI_API_KEY: Your OpenAI API key
- LANGCHAIN_API_KEY: Your LangChain API key (optional)
- LANGCHAIN_PROJECT: Your LangChain project name (optional)
- LANGCHAIN_TRACING_V2: Set to "true" to enable tracing
- ENVIRONMENT: Set to "production" for production mode
- NUM_EVOLVE_PASSES: Number of evolution iterations (default: 2)
- VECTORSTORE_PATH: Path to store vectors (default: /tmp/vectorstore)
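As a rough illustration of how these variables might be read at startup, the sketch below loads them into a small settings object using the defaults listed above. The `Settings` class and `load_settings` function are hypothetical and not part of this repository; treating a missing OPENAI_API_KEY as an error and rejecting a negative pass count are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the environment-driven configuration."""
    openai_api_key: str
    num_evolve_passes: int
    vectorstore_path: str
    environment: Optional[str]
    tracing_enabled: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        # Assumption: the app cannot run without an OpenAI key.
        raise RuntimeError("OPENAI_API_KEY is not set")

    num_passes = int(env.get("NUM_EVOLVE_PASSES", "2"))  # documented default: 2
    if num_passes < 0:
        raise ValueError("NUM_EVOLVE_PASSES must be non-negative")

    return Settings(
        openai_api_key=api_key,
        num_evolve_passes=num_passes,
        vectorstore_path=env.get("VECTORSTORE_PATH", "/tmp/vectorstore"),
        environment=env.get("ENVIRONMENT"),  # e.g. "production"
        tracing_enabled=env.get("LANGCHAIN_TRACING_V2", "").lower() == "true",
    )
```

Passing the environment in as a mapping keeps the loader easy to test: a plain dict can stand in for `os.environ`.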
Project Structure
- app.py: Streamlit application for the Hugging Face deployment
- main.py: CLI interface with the same functionality as the web app
- preprocess/: Code for preprocessing HTML files and creating embeddings
- graph/: LangGraph implementation for synthetic data generation
  - nodes/: Individual graph nodes (evolve, retrieve, answer)
  - types.py: State management and data structures
  - build_graph.py: Graph construction and configuration
- data/: HTML files containing LLM evolution data
- tests/: Test files ensuring correct implementation
- generated/: Generated documents, vectorstore, and results