metadata
title: TMD-SDG-via-LangGraph
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8501
pinned: false

SDG via LangGraph

This project reproduces the RAGAS Synthetic Data Generation steps using LangGraph instead of the Knowledge Graph approach.

Features

  • Synthetic data generation using the Evol Instruct methodology
  • Iterative question evolution with alternating prompts:
    • Even iterations: More challenging and insightful questions
    • Odd iterations: More creative and original questions
  • Consistent state management across iterations
  • Standardized JSON output format with linked questions, answers, and contexts (see the example after this list)
  • Deployed as a Streamlit app on Hugging Face Spaces
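
For illustration only (the field names here are assumptions, not taken from the code), a generated batch might link its pieces like this:

{
  "questions": {"q0": "What is Evol Instruct?", "q1": "How are contexts retrieved?"},
  "answers":   {"q0": "Evol Instruct is ...",   "q1": "Contexts are retrieved by ..."},
  "contexts":  {"q0": ["retrieved chunk A", "retrieved chunk B"], "q1": ["retrieved chunk C"]}
}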

Evol Instruct Implementation

This project implements the Evol Instruct methodology for evolving questions through multiple iterations. The implementation has several key aspects that should be considered when modifying the code:

Core Principles

  1. Single Evolution Per Pass: Each graph invocation performs one evolution step, maintaining clarity and control over the evolution process.
  2. Alternating Prompts: The system alternates between:
    • Challenging/insightful prompts (even-numbered iterations)
    • Creative/original prompts (odd-numbered iterations)
  3. State Management: Evolution history is preserved across iterations of the question-evolution process, and each node in the chain processes only the latest evolved question.
  4. Configurable Evolution Count: The number of evolution passes can be controlled through the UI or an environment variable, allowing flexibility in the evolution process. (A sketch of these principles follows this list.)
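
A minimal sketch of principles 1–3, assuming a LangChain chat model; the prompt wording, state fields, and function names below are illustrative, not the actual identifiers in graph/nodes/evolve.py:

from langchain_openai import ChatOpenAI

CHALLENGING_PROMPT = "Rewrite this question to be more challenging and insightful: {question}"
CREATIVE_PROMPT = "Rewrite this question to be more creative and original: {question}"

llm = ChatOpenAI(model="gpt-4o-mini")  # model choice is an assumption

def evolve_once(state: dict) -> dict:
    """One evolution pass per graph invocation, alternating prompts by iteration parity."""
    history = state.get("evolutions", [])
    prompt = CHALLENGING_PROMPT if len(history) % 2 == 0 else CREATIVE_PROMPT
    latest = history[-1] if history else state["question"]
    evolved = llm.invoke(prompt.format(question=latest)).content
    # Only the updated history is returned; downstream nodes read just the latest evolved question.
    return {"evolutions": history + [evolved]}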

Implementation Details

  • The evolution logic is implemented in graph/nodes/evolve.py
  • Prompt selection is based on the number of existing evolutions
  • State management ensures each evolution builds upon previous results
  • Results maintain consistent IDs (q0, q1, etc.) across questions, answers, and contexts
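
A rough sketch of what such a state shape could look like (field names are hypothetical; the actual definitions live in graph/types.py):

from typing import TypedDict

class SDGState(TypedDict, total=False):
    # Hypothetical fields, shown only to illustrate the linked-ID idea.
    questions: dict[str, str]       # keyed "q0", "q1", ...
    answers: dict[str, str]         # same keys as questions
    contexts: dict[str, list[str]]  # retrieved chunks, same keys again
    evolutions: list[str]           # evolution history for the question currently being processed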

Configuration

  • Number of evolution passes can be controlled via:
    • Streamlit UI slider (web interface)
    • NUM_EVOLVE_PASSES environment variable (CLI)
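
One plausible way to combine the two sources of this setting (the widget label, bounds, and default handling are assumptions, not taken from app.py):

import os
import streamlit as st

# The environment variable drives the CLI and serves as the slider's default in the web UI.
default_passes = int(os.environ.get("NUM_EVOLVE_PASSES", "2"))
num_passes = st.slider("Evolution passes", min_value=1, max_value=10, value=default_passes)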

⚠️ Important Considerations

When modifying this codebase, please keep in mind:

  1. The evolution process is intentionally sequential and builds upon previous iterations
  2. Maintaining the alternating prompt pattern is crucial for question diversity
  3. State management between iterations must preserve the evolution history
  4. The ID system (q0, q1, etc.) must remain consistent across all collections

Quick Start

Local Development

  1. Create a virtual environment:
python3.11 -m venv .venv
source .venv/bin/activate
  2. Install dependencies:
pip install -e ".[dev]"
  3. Run the application:
streamlit run app.py
  4. Access the app at http://localhost:8501

Deployment

HuggingFace Spaces

  1. Create a new Space on HuggingFace

  2. Add the HuggingFace remote:

git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
  3. Push to HuggingFace:
git push hf main

Environment Variables

The following environment variables are used; set them in your HuggingFace Space settings (or locally when running the CLI):

  • OPENAI_API_KEY: Your OpenAI API key
  • LANGCHAIN_API_KEY: Your LangChain API key (optional)
  • LANGCHAIN_PROJECT: Your LangChain project name (optional)
  • LANGCHAIN_TRACING_V2: Set to "true" to enable tracing
  • ENVIRONMENT: Set to "production" for production mode
  • NUM_EVOLVE_PASSES: Number of evolution iterations (default: 2)
  • VECTORSTORE_PATH: Path to store vectors (default: /tmp/vectorstore)

Project Structure

  • app.py: Streamlit application for the Hugging Face deployment
  • main.py: CLI interface with the same functionality as the web app
  • preprocess/: Code for preprocessing HTML files and creating embeddings
  • graph/: LangGraph implementation for synthetic data generation
    • nodes/: Individual graph nodes (evolve, retrieve, answer)
    • types.py: State management and data structures
    • build_graph.py: Graph construction and configuration (see the sketch after this list)
  • data/: HTML files containing LLM evolution data
  • tests/: Test files ensuring correct implementation
  • generated/: Generated documents, vectorstore, and results
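
As a rough idea of how graph/build_graph.py might wire these nodes together with LangGraph; the node bodies and state fields below are simplified stand-ins, not the project's actual code:

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict, total=False):
    question: str
    evolutions: list[str]
    contexts: dict[str, list[str]]
    answers: dict[str, str]

def evolve(state: State) -> dict:
    # One Evol Instruct pass (stand-in for graph/nodes/evolve.py).
    return {"evolutions": state.get("evolutions", []) + ["<evolved question>"]}

def retrieve(state: State) -> dict:
    # Fetch contexts for the latest evolved question (stand-in for the retrieve node).
    return {"contexts": {"q0": ["<retrieved chunk>"]}}

def answer(state: State) -> dict:
    # Generate an answer from the retrieved contexts (stand-in for the answer node).
    return {"answers": {"q0": "<generated answer>"}}

builder = StateGraph(State)
builder.add_node("evolve", evolve)
builder.add_node("retrieve", retrieve)
builder.add_node("answer", answer)
builder.add_edge(START, "evolve")
builder.add_edge("evolve", "retrieve")
builder.add_edge("retrieve", "answer")
builder.add_edge("answer", END)
graph = builder.compile()  # each invocation performs exactly one evolution pass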