mwalker22 committed
Commit e9f7aa8 · unverified · 2 Parent(s): 8ee27c8 ed80a59

Merge pull request #17 from mwalker-tmd/feature/evolve-instruct
Dockerfile CHANGED
@@ -1,6 +1,10 @@
 # Use Python 3.11 as base image
 FROM python:3.11-slim
 
+# Add build argument for version tracking
+ARG BUILD_VERSION=1.0.0
+ENV BUILD_VERSION=${BUILD_VERSION}
+
 # Set working directory
 WORKDIR /app
 
@@ -26,18 +30,21 @@ COPY data/ data/
 
 # Create a shell script to run the application
 RUN echo '#!/bin/bash\n\
+echo "Starting application version ${BUILD_VERSION}"\n\
 source /app/.venv/bin/activate\n\
-exec /app/.venv/bin/streamlit run app.py --server.port=8501 --server.address=0.0.0.0' > /app/run.sh && \
+PORT=${PORT:-8501}\n\
+exec /app/.venv/bin/streamlit run app.py --server.port=${PORT} --server.address=0.0.0.0' > /app/run.sh && \
 chmod +x /app/run.sh
 
-# Expose the port Streamlit runs on
-EXPOSE 8501
+# Expose the default port Streamlit runs on
+EXPOSE ${PORT:-8501}
 
 # Set environment variables
 ENV PYTHONUNBUFFERED=1
 ENV ENVIRONMENT=development
 ENV LANGCHAIN_TRACING_V2=false
 ENV PATH="/app/.venv/bin:$PATH"
+ENV PORT=8501
 
 # Command to run the application
 CMD ["/app/run.sh"]
README.md CHANGED
@@ -14,11 +14,48 @@ This project reproduces the RAGAS Synthetic Data Generation steps using LangGraph
 
 ## Features
 
-- Synthetic data generation using Evol Instruct method
-- Three evolution types: Simple, Multi-Context, and Reasoning
-- Output includes evolved questions, answers, and relevant contexts
+- Synthetic data generation using the Evol Instruct methodology
+- Iterative question evolution with alternating prompts:
+  - Even iterations: more challenging and insightful questions
+  - Odd iterations: more creative and original questions
+- Consistent state management across iterations
+- Standardized JSON output format with linked questions, answers, and contexts
 - Deployed as a Streamlit app on Hugging Face Spaces
 
+## Evol Instruct Implementation
+
+This project implements the Evol Instruct methodology for evolving questions through multiple iterations. The implementation has several key aspects that should be considered when modifying the code:
+
+### Core Principles
+
+1. **Single Evolution Per Pass**: Each graph invocation performs one evolution step, maintaining clarity and control over the evolution process.
+2. **Alternating Prompts**: The system alternates between:
+   - Challenging/insightful prompts (even-numbered iterations)
+   - Creative/original prompts (odd-numbered iterations)
+3. **State Management**: Evolution history is preserved between iterations of the question-evolution process, and each node in the chain processes only the latest evolved question.
+4. **Configurable Evolution Count**: The number of evolution passes can be controlled through the UI or an environment variable, allowing flexibility in the evolution process.
+
+### Implementation Details
+
+- The evolution logic is implemented in `graph/nodes/evolve.py`
+- Prompt selection is based on the number of existing evolutions
+- State management ensures each evolution builds upon previous results
+- Results maintain consistent IDs (`q0`, `q1`, etc.) across questions, answers, and contexts
+
+### Configuration
+
+- The number of evolution passes can be controlled via:
+  - Streamlit UI number input (web interface)
+  - `NUM_EVOLVE_PASSES` environment variable (CLI)
+
+### ⚠️ Important Considerations
+
+When modifying this codebase, please keep in mind:
+1. The evolution process is intentionally sequential and builds upon previous iterations
+2. Maintaining the alternating prompt pattern is crucial for question diversity
+3. State management between iterations must preserve the evolution history
+4. The ID system (`q0`, `q1`, etc.) must remain consistent across all collections
+
 ## Quick Start
 
 ### Local Development
@@ -68,13 +105,20 @@ The following environment variables need to be set in your HuggingFace Space settings:
 - `OPENAI_API_KEY`: Your OpenAI API key
 - `LANGCHAIN_API_KEY`: Your LangChain API key (optional)
 - `LANGCHAIN_PROJECT`: Your LangChain project name (optional)
+- `LANGCHAIN_TRACING_V2`: Set to "true" to enable tracing
 - `ENVIRONMENT`: Set to "production" for production mode
+- `NUM_EVOLVE_PASSES`: Number of evolution iterations (default: 2)
+- `VECTORSTORE_PATH`: Path to store vectors (default: /tmp/vectorstore)
 
 ## Project Structure
 
 - `app.py`: Streamlit application for the Hugging Face deployment
+- `main.py`: CLI interface with the same functionality as the web app
 - `preprocess/`: Code for preprocessing HTML files and creating embeddings
 - `graph/`: LangGraph implementation for synthetic data generation
+  - `nodes/`: Individual graph nodes (evolve, retrieve, answer)
+  - `types.py`: State management and data structures
+  - `build_graph.py`: Graph construction and configuration
 - `data/`: HTML files containing LLM evolution data
-- `tests/`: Test files
-- `generated/`: Generated documents and vectorstore
+- `tests/`: Test files ensuring correct implementation
+- `generated/`: Generated documents, vectorstore, and results
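Note: the alternation rule and the `q0`/`q1` ID linking described in the README are small enough to sketch. This mirrors the logic that lands in `graph/nodes/evolve.py` and `main.py` below; the helper names here (`pick_prompt`, `link_results`) are illustrative, not functions in the repo:

```python
PROMPTS = [
    "Rewrite or evolve the following question to be more challenging or insightful:\n\n{}",
    "Rewrite or evolve the following question to be more creative or original:\n\n{}",
]

def pick_prompt(evolved_questions: list[str], question: str) -> str:
    # Even count of prior evolutions -> challenging prompt, odd count -> creative prompt
    return PROMPTS[len(evolved_questions) % 2].format(question)

def link_results(questions: list[str], answers: list[str], contexts: list[list[str]]) -> dict:
    # The i-th question, answer, and context list all share the id f"q{i}"
    return {
        "evolved_questions": [{"id": f"q{i}", "question": q} for i, q in enumerate(questions)],
        "answers": [{"id": f"q{i}", "answer": a} for i, a in enumerate(answers)],
        "contexts": [{"id": f"q{i}", "contexts": c} for i, c in enumerate(contexts)],
    }
```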
app.py CHANGED
@@ -52,24 +52,46 @@ def initialize_resources():
 # Initialize resources
 docs, vectorstore, graph = initialize_resources()
 
+# Add a number input for evolution passes
+num_evolve_passes = st.number_input(
+    label="Number of Evolution Passes",
+    min_value=1,
+    max_value=10,
+    value=2,
+    step=1,
+    help="How many times to evolve the question (alternates between challenging and creative prompts)."
+)
+
 # Generate synthetic data button
 if st.button("Generate Synthetic Data"):
     with st.spinner("Generating synthetic data..."):
         # Create initial state
-        initial_state = SDGState(
+        state = SDGState(
            input="Generate synthetic data about LLM evolution",
            documents=[],
-            evolved_question="",
+            evolved_questions=[],
            context=[],
-            answer=""
+            answer="",
+            num_evolve_passes=num_evolve_passes
        )
-        logger.debug(f"Initial state before invoke: {initial_state}")
 
-        # Invoke the graph with the SDGState object
-        result = graph.invoke(initial_state)
-        logger.debug(f"Graph result: {result}")
-        if not isinstance(result, SDGState):
-            result = SDGState(**dict(result))
+        # Run the graph for each evolution pass
+        all_results = []
+        for i in range(num_evolve_passes):
+            logger.debug(f"Running evolution pass {i+1}/{num_evolve_passes}")
+            result = graph.invoke(state)
+            if not isinstance(result, SDGState):
+                result = SDGState(**dict(result))
+            all_results.append(result)
+            # Update state for next iteration with evolved questions
+            state = SDGState(
+                input=state.input,
+                documents=state.documents,
+                evolved_questions=result.evolved_questions,  # Pass forward all evolved questions
+                context=[],  # Reset context for next iteration
+                answer="",  # Reset answer for next iteration
+                num_evolve_passes=num_evolve_passes
+            )
 
         # Display results
         st.subheader("Generated Data")
@@ -77,22 +99,24 @@ if st.button("Generate Synthetic Data"):
         # Display evolved questions
         st.markdown("### Evolved Questions")
         evolved_questions = [
-            {"id": f"q{i}", "question": q, "evolution_type": "simple"}
-            for i, q in enumerate([result.evolved_question])  # Currently only one question
+            {"id": f"q{i}", "question": result.evolved_questions[-1], "evolution_type": "simple"}
+            for i, result in enumerate(all_results)
        ]
         st.json(evolved_questions)
 
         # Display answers
         st.markdown("### Answers")
         answers = [
-            {"id": "q0", "answer": result.answer}
+            {"id": f"q{i}", "answer": result.answer}
+            for i, result in enumerate(all_results)
        ]
         st.json(answers)
 
         # Display contexts
         st.markdown("### Contexts")
         contexts = [
-            {"id": "q0", "contexts": result.context}
+            {"id": f"q{i}", "contexts": result.context}
+            for i, result in enumerate(all_results)
        ]
         st.json(contexts)
 
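Note: each loop iteration carries `evolved_questions` forward while resetting `context` and `answer`, so record `q{i}` pairs pass i's newest question with the answer and contexts retrieved for it. After two passes, `st.json` renders data shaped like this (the content shown is hypothetical):

```python
evolved_questions = [
    {"id": "q0", "question": "Challenging rewrite of the seed question", "evolution_type": "simple"},
    {"id": "q1", "question": "Creative rewrite of the q0 question", "evolution_type": "simple"},
]
answers = [
    {"id": "q0", "answer": "Based on the retrieved context:\n..."},
    {"id": "q1", "answer": "Based on the retrieved context:\n..."},
]
contexts = [
    {"id": "q0", "contexts": ["retrieved chunk A", "retrieved chunk B"]},
    {"id": "q1", "contexts": ["retrieved chunk C"]},
]
```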
graph/nodes/answer.py CHANGED
@@ -17,9 +17,10 @@ def generate_answer(state: SDGState) -> SDGState:
     new_state = SDGState(
         input=state.input,
         documents=state.documents,
-        evolved_question=state.evolved_question,
+        evolved_questions=state.evolved_questions,
         context=state.context,
-        answer=f"Based on the retrieved context:\n{context_snippet}"
+        answer=f"Based on the retrieved context:\n{context_snippet}",
+        num_evolve_passes=state.num_evolve_passes
     )
 
     logger.debug(f"Answer node returning state: {new_state}")
graph/nodes/evolve.py CHANGED
@@ -5,20 +5,27 @@ import logging
 logger = logging.getLogger(__name__)
 
 def evolve_question(state: SDGState, llm) -> SDGState:
-    logger.debug(f"Evolve node received state: {state}")
+    prompts = [
+        "Rewrite or evolve the following question to be more challenging or insightful:\n\n{}",
+        "Rewrite or evolve the following question to be more creative or original:\n\n{}"
+    ]
 
-    # Use the LLM to generate an evolved question
-    prompt = f"Rewrite or evolve the following question to be more challenging or insightful:\n\n{state.input}"
+    # Choose prompt based on number of existing evolutions (even/odd)
+    prompt_idx = len(state.evolved_questions) % len(prompts)
+    prompt = prompts[prompt_idx].format(state.evolved_question)
+
+    # Generate new evolution
     response = llm.invoke(prompt)
-    evolved_question = response.content if hasattr(response, 'content') else str(response)
-
+    evolved = response.content if hasattr(response, 'content') else str(response)
+
+    # Create new state with appended evolution
     new_state = SDGState(
         input=state.input,
         documents=state.documents,
-        evolved_question=evolved_question,
+        evolved_questions=state.evolved_questions + [evolved],
         context=state.context,
-        answer=state.answer
+        answer=state.answer,
+        num_evolve_passes=state.num_evolve_passes
     )
-
     logger.debug(f"Evolve node returning state: {new_state}")
     return new_state
graph/nodes/retrieve.py CHANGED
@@ -14,9 +14,10 @@ def retrieve_relevant_context(state: SDGState, vectorstore) -> SDGState:
     new_state = SDGState(
         input=state.input,
         documents=state.documents,
-        evolved_question=state.evolved_question,
+        evolved_questions=state.evolved_questions,
         context=[doc.page_content for doc in retrieved_docs],
-        answer=state.answer
+        answer=state.answer,
+        num_evolve_passes=state.num_evolve_passes
     )
 
     logger.debug(f"Retrieve node returning state: {new_state}")
graph/types.py CHANGED
@@ -5,6 +5,11 @@ from pydantic import BaseModel, Field
 class SDGState(BaseModel):
     input: str = Field(default="")
     documents: List[Document] = Field(default_factory=list)
-    evolved_question: str = Field(default="")
+    evolved_questions: List[str] = Field(default_factory=list)
     context: List[str] = Field(default_factory=list)
     answer: str = Field(default="")
+    num_evolve_passes: int = Field(default=2)
+
+    @property
+    def evolved_question(self):
+        return self.evolved_questions[-1] if self.evolved_questions else self.input
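Note: the `evolved_question` property keeps the nodes' "latest question" access pattern working, falling back to the original `input` until the first evolution exists. A quick sketch of that behavior:

```python
from graph.types import SDGState

state = SDGState(input="How did LLMs evolve in 2023?")
assert state.evolved_question == "How did LLMs evolve in 2023?"  # no evolutions yet: falls back to input

state = SDGState(
    input="How did LLMs evolve in 2023?",
    evolved_questions=["First rewrite", "Second rewrite"],
)
assert state.evolved_question == "Second rewrite"  # always the most recent evolution
```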
main.py CHANGED
@@ -20,6 +20,7 @@ class DocumentEncoder(json.JSONEncoder):
         if isinstance(obj, SDGState):
             return {
                 "input": obj.input,
+                "evolved_questions": obj.evolved_questions,
                 "evolved_question": obj.evolved_question,
                 "context": obj.context,
                 "answer": obj.answer
@@ -63,6 +64,27 @@ def load_or_generate_documents() -> list[Document]:
     return docs
 
 
+def format_results(all_results):
+    """Format results into the standard JSON structure."""
+    evolved_questions = [
+        {"id": f"q{i}", "question": result.evolved_questions[-1], "evolution_type": "simple"}
+        for i, result in enumerate(all_results)
+    ]
+    answers = [
+        {"id": f"q{i}", "answer": result.answer}
+        for i, result in enumerate(all_results)
+    ]
+    contexts = [
+        {"id": f"q{i}", "contexts": result.context}
+        for i, result in enumerate(all_results)
+    ]
+    return {
+        "evolved_questions": evolved_questions,
+        "answers": answers,
+        "contexts": contexts
+    }
+
+
 def main():
     if is_dev_mode():
         print("🚧 Running in development mode...")
@@ -74,11 +96,50 @@ def main():
 
         llm = ChatOpenAI(model="gpt-3.5-turbo", openai_api_key=None)  # None will use env var
         graph = build_sdg_graph(docs, vectorstore, llm)
-        initial_state = SDGState(input="How did LLMs evolve in 2023?")
 
-        result = graph.invoke(initial_state)
-        print("🧠 Agent Output:")
-        print(json.dumps(result, indent=2, ensure_ascii=False, cls=DocumentEncoder))
+        # Set up initial state with desired number of passes
+        num_evolve_passes = int(os.environ.get("NUM_EVOLVE_PASSES", "2"))
+        state = SDGState(
+            input="How did LLMs evolve in 2023?",
+            documents=[],
+            evolved_questions=[],
+            context=[],
+            answer="",
+            num_evolve_passes=num_evolve_passes
+        )
+
+        # Run the graph for each evolution pass
+        all_results = []
+        print(f"🔄 Running {num_evolve_passes} evolution passes...")
+        for i in range(num_evolve_passes):
+            print(f"\n📝 Evolution pass {i+1}/{num_evolve_passes}:")
+            result = graph.invoke(state)
+            if not isinstance(result, SDGState):
+                result = SDGState(**dict(result))
+            all_results.append(result)
+            # Update state for next iteration with evolved questions
+            state = SDGState(
+                input=state.input,
+                documents=state.documents,
+                evolved_questions=result.evolved_questions,  # Pass forward all evolved questions
+                context=[],  # Reset context for next iteration
+                answer="",  # Reset answer for next iteration
+                num_evolve_passes=num_evolve_passes
+            )
+            print(f"  Question: {result.evolved_questions[-1]}")
+            print(f"  Answer: {result.answer[:100]}...")
+
+        # Format and output results
+        print("\n🧠 Final Output:")
+        results = format_results(all_results)
+        print(json.dumps(results, indent=2, ensure_ascii=False, cls=DocumentEncoder))
+
+        # Save results to file
+        output_file = Path("generated/results.json")
+        output_file.parent.mkdir(parents=True, exist_ok=True)
+        with open(output_file, "w", encoding="utf-8") as f:
+            json.dump(results, f, indent=2, ensure_ascii=False, cls=DocumentEncoder)
+        print(f"\n💾 Results saved to {output_file}")
     else:
         print("🔒 Production mode detected. Skipping document generation.")
 
pyproject.toml CHANGED
@@ -15,7 +15,8 @@ dependencies = [
     "openai",
     "tiktoken",
     "langchain-openai",
-    "faiss-cpu",
+    "faiss-cpu==1.7.4",
+    "numpy<2.0.0",
     "streamlit"
 ]
 
tests/graph/nodes/test_evolve.py CHANGED
@@ -1,12 +1,79 @@
 from graph.types import SDGState
 from graph.nodes.evolve import evolve_question
-from unittest.mock import MagicMock
+from unittest.mock import MagicMock, call
 
-def test_evolve_question_modifies_state():
+def test_evolve_question_initial_state():
+    # Test evolution from initial state (should use input)
     state = SDGState(input="What were the top LLMs in 2023?")
     mock_llm = MagicMock()
     mock_llm.invoke.return_value = MagicMock(content="Evolved: What were the top LLMs in 2023?")
     updated_state = evolve_question(state, mock_llm)
 
-    assert updated_state.evolved_question.startswith("Evolved:")
-    assert updated_state.evolved_question.endswith("2023?")
+    # Should use challenging prompt first (even index)
+    mock_llm.invoke.assert_called_once_with(
+        "Rewrite or evolve the following question to be more challenging or insightful:\n\nWhat were the top LLMs in 2023?"
+    )
+    assert len(updated_state.evolved_questions) == 1
+    assert updated_state.evolved_questions[0] == "Evolved: What were the top LLMs in 2023?"
+    assert updated_state.evolved_question == "Evolved: What were the top LLMs in 2023?"
+
+def test_evolve_question_with_one_evolution():
+    # Test evolution with one existing evolution (should use creative prompt)
+    state = SDGState(
+        input="Base question",
+        evolved_questions=["First evolution"]
+    )
+    mock_llm = MagicMock()
+    mock_llm.invoke.return_value = MagicMock(content="Creative evolution")
+    updated_state = evolve_question(state, mock_llm)
+
+    # Should use creative prompt (odd index)
+    mock_llm.invoke.assert_called_once_with(
+        "Rewrite or evolve the following question to be more creative or original:\n\nFirst evolution"
+    )
+    assert len(updated_state.evolved_questions) == 2
+    assert updated_state.evolved_questions == ["First evolution", "Creative evolution"]
+    assert updated_state.evolved_question == "Creative evolution"
+
+def test_evolve_question_with_two_evolutions():
+    # Test evolution with two existing evolutions (should use challenging prompt)
+    state = SDGState(
+        input="Base question",
+        evolved_questions=["First evolution", "Second evolution"]
+    )
+    mock_llm = MagicMock()
+    mock_llm.invoke.return_value = MagicMock(content="Challenging evolution")
+    updated_state = evolve_question(state, mock_llm)
+
+    # Should use challenging prompt (even index)
+    mock_llm.invoke.assert_called_once_with(
+        "Rewrite or evolve the following question to be more challenging or insightful:\n\nSecond evolution"
+    )
+    assert len(updated_state.evolved_questions) == 3
+    assert updated_state.evolved_questions == ["First evolution", "Second evolution", "Challenging evolution"]
+    assert updated_state.evolved_question == "Challenging evolution"
+
+def test_state_preservation():
+    # Test that other state fields are preserved
+    initial_state = SDGState(
+        input="Base question",
+        evolved_questions=["First evolution"],
+        documents=[],
+        context=["Some context"],
+        answer="Previous answer",
+        num_evolve_passes=5
+    )
+    mock_llm = MagicMock()
+    mock_llm.invoke.return_value = MagicMock(content="New evolution")
+    updated_state = evolve_question(initial_state, mock_llm)
+
+    # Check that all fields are preserved except evolved_questions
+    assert updated_state.input == initial_state.input
+    assert updated_state.documents == initial_state.documents
+    assert updated_state.context == initial_state.context
+    assert updated_state.answer == initial_state.answer
+    assert updated_state.num_evolve_passes == initial_state.num_evolve_passes
+    # Check that evolved_questions is updated correctly
+    assert len(updated_state.evolved_questions) == 2
+    assert updated_state.evolved_questions[0] == "First evolution"
+    assert updated_state.evolved_questions[1] == "New evolution"
tests/graph/test_build_graph.py CHANGED
@@ -17,6 +17,8 @@ def test_build_sdg_graph_runs():
     result = graph.invoke(state)
 
     assert isinstance(result, dict)
-    assert "evolved_question" in result
+    assert "evolved_questions" in result
+    if result["evolved_questions"]:
+        assert result["evolved_questions"][-1] == "Evolved test question"
     assert result["context"]
     assert "Relevant content" in result["context"][0]
uv.lock CHANGED
The diff for this file is too large to render. See raw diff