metadata
title: TMD-SDG-via-LangGraph
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8501
pinned: false

SDG via LangGraph

This project reproduces the RAGAS Synthetic Data Generation steps using LangGraph instead of the Knowledge Graph approach.

Features

  • Synthetic data generation using the Evol Instruct methodology
  • Iterative question evolution with alternating prompts:
    • Even iterations: More challenging and insightful questions
    • Odd iterations: More creative and original questions
  • Consistent state management across iterations
  • Standardized JSON output format with linked questions, answers, and contexts (see the example after this list)
  • Deployed as a Streamlit app on Hugging Face Spaces
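
For illustration only (the field names here are assumptions, not taken from the code), a generated batch might link its pieces like this:

{
  "questions": {"q0": "What is Evol Instruct?", "q1": "How are contexts retrieved?"},
  "answers":   {"q0": "Evol Instruct is ...",   "q1": "Contexts are retrieved by ..."},
  "contexts":  {"q0": ["retrieved chunk A", "retrieved chunk B"], "q1": ["retrieved chunk C"]}
}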

Evol Instruct Implementation

This project implements the Evol Instruct methodology for evolving questions through multiple iterations. The implementation has several key aspects that should be considered when modifying the code:

Core Principles

  1. Single Evolution Per Pass: Each graph invocation performs one evolution step, maintaining clarity and control over the evolution process.
  2. Alternating Prompts: The system alternates between:
    • Challenging/insightful prompts (even-numbered iterations)
    • Creative/original prompts (odd-numbered iterations)
  3. State Management: Evolution history is preserved across iterations of the question-evolution process, and each node in the chain processes only the latest evolved question.
  4. Configurable Evolution Count: The number of evolution passes can be controlled through the UI or an environment variable, allowing flexibility in the evolution process. (A sketch of these principles follows this list.)
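
A minimal sketch of principles 1–3, assuming a LangChain chat model; the prompt wording, state fields, and function names below are illustrative, not the actual identifiers in graph/nodes/evolve.py:

from langchain_openai import ChatOpenAI

CHALLENGING_PROMPT = "Rewrite this question to be more challenging and insightful: {question}"
CREATIVE_PROMPT = "Rewrite this question to be more creative and original: {question}"

llm = ChatOpenAI(model="gpt-4o-mini")  # model choice is an assumption

def evolve_once(state: dict) -> dict:
    """One evolution pass per graph invocation, alternating prompts by iteration parity."""
    history = state.get("evolutions", [])
    prompt = CHALLENGING_PROMPT if len(history) % 2 == 0 else CREATIVE_PROMPT
    latest = history[-1] if history else state["question"]
    evolved = llm.invoke(prompt.format(question=latest)).content
    # Only the updated history is returned; downstream nodes read just the latest evolved question.
    return {"evolutions": history + [evolved]}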

Implementation Details

  • The evolution logic is implemented in graph/nodes/evolve.py
  • Prompt selection is based on the number of existing evolutions
  • State management ensures each evolution builds upon previous results
  • Results maintain consistent IDs (q0, q1, etc.) across questions, answers, and contexts
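
A rough sketch of what such a state shape could look like (field names are hypothetical; the actual definitions live in graph/types.py):

from typing import TypedDict

class SDGState(TypedDict, total=False):
    # Hypothetical fields, shown only to illustrate the linked-ID idea.
    questions: dict[str, str]       # keyed "q0", "q1", ...
    answers: dict[str, str]         # same keys as questions
    contexts: dict[str, list[str]]  # retrieved chunks, same keys again
    evolutions: list[str]           # evolution history for the question currently being processed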

Configuration

  • Number of evolution passes can be controlled via:
    • Streamlit UI slider (web interface)
    • NUM_EVOLVE_PASSES environment variable (CLI)
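
One plausible way to combine the two sources of this setting (the widget label, bounds, and default handling are assumptions, not taken from app.py):

import os
import streamlit as st

# The environment variable drives the CLI and serves as the slider's default in the web UI.
default_passes = int(os.environ.get("NUM_EVOLVE_PASSES", "2"))
num_passes = st.slider("Evolution passes", min_value=1, max_value=10, value=default_passes)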

⚠️ Important Considerations

When modifying this codebase, please keep in mind:

  1. The evolution process is intentionally sequential and builds upon previous iterations
  2. Maintaining the alternating prompt pattern is crucial for question diversity
  3. State management between iterations must preserve the evolution history
  4. The ID system (q0, q1, etc.) must remain consistent across all collections

Quick Start

Local Development

  1. Create a virtual environment:
python3.11 -m venv .venv
source .venv/bin/activate
  2. Install dependencies:
pip install -e ".[dev]"
  3. Run the application:
streamlit run app.py
  4. Access the app at http://localhost:8501

Deployment

HuggingFace Spaces

  1. Create a new Space on HuggingFace

  2. Add the HuggingFace remote:

git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
  3. Push to HuggingFace:
git push hf main

Environment Variables

The following environment variables are used; set them in your HuggingFace Space settings (or locally when running the CLI):

  • OPENAI_API_KEY: Your OpenAI API key
  • LANGCHAIN_API_KEY: Your LangChain API key (optional)
  • LANGCHAIN_PROJECT: Your LangChain project name (optional)
  • LANGCHAIN_TRACING_V2: Set to "true" to enable tracing
  • ENVIRONMENT: Set to "production" for production mode
  • NUM_EVOLVE_PASSES: Number of evolution iterations (default: 2)
  • VECTORSTORE_PATH: Path to store vectors (default: /tmp/vectorstore)

Project Structure

  • app.py: Streamlit application for the Hugging Face deployment
  • main.py: CLI interface with the same functionality as the web app
  • preprocess/: Code for preprocessing HTML files and creating embeddings
  • graph/: LangGraph implementation for synthetic data generation
    • nodes/: Individual graph nodes (evolve, retrieve, answer)
    • types.py: State management and data structures
    • build_graph.py: Graph construction and configuration (see the sketch after this list)
  • data/: HTML files containing LLM evolution data
  • tests/: Test files ensuring correct implementation
  • generated/: Generated documents, vectorstore, and results
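
As a rough idea of how graph/build_graph.py might wire these nodes together with LangGraph; the node bodies and state fields below are simplified stand-ins, not the project's actual code:

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict, total=False):
    question: str
    evolutions: list[str]
    contexts: dict[str, list[str]]
    answers: dict[str, str]

def evolve(state: State) -> dict:
    # One Evol Instruct pass (stand-in for graph/nodes/evolve.py).
    return {"evolutions": state.get("evolutions", []) + ["<evolved question>"]}

def retrieve(state: State) -> dict:
    # Fetch contexts for the latest evolved question (stand-in for the retrieve node).
    return {"contexts": {"q0": ["<retrieved chunk>"]}}

def answer(state: State) -> dict:
    # Generate an answer from the retrieved contexts (stand-in for the answer node).
    return {"answers": {"q0": "<generated answer>"}}

builder = StateGraph(State)
builder.add_node("evolve", evolve)
builder.add_node("retrieve", retrieve)
builder.add_node("answer", answer)
builder.add_edge(START, "evolve")
builder.add_edge("evolve", "retrieve")
builder.add_edge("retrieve", "answer")
builder.add_edge("answer", END)
graph = builder.compile()  # each invocation performs exactly one evolution pass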