vonliechti committed (verified)
Commit f4644e9 · Parent: 17dccd8

Upload folder using huggingface_hub

Files changed (5):
  1. README.md +24 -19
  2. agent.py +2 -1
  3. app.py +15 -5
  4. benchmarking.ipynb +0 -0
  5. test_bots.py +32 -46
README.md CHANGED
@@ -12,6 +12,8 @@ python_version: 3.11.9
 
  The project is built using Transformers Agents 2.0, and uses the Stanford SQuAD dataset for training. The chatbot is designed to answer questions about the dataset, while also incorporating conversational context and various tools to provide a more natural and engaging conversational experience.
 
+ At the time of writing, the project is available on [Hugging Face Spaces](https://huggingface.co/spaces/kaiokendall/SQuAD_Agent_Experiment).
+
  ## Getting Started
 
  1. Install dependencies:
@@ -23,13 +25,16 @@ pip install -r pre-requirements.txt
  pip install -r requirements.txt
  ```
 
- 1. Set up required keys:
+ 2. Set up required keys:
+
+ Create a `.env` file and set the following environment variables:
 
  ```bash
  HF_TOKEN=<your token>
+ OPENAI_API_KEY=<your key>
  ```
 
- 1. Run the app:
+ 3. Run the app:
 
  ```bash
  python app.py
@@ -39,37 +44,37 @@ python app.py
 
  1. SQuAD Dataset: The dataset used for training the chatbot is the Stanford SQuAD dataset, which contains over 100,000 questions and answers extracted from 500+ articles.
  2. RAG: RAG is a technique used to improve the accuracy of chatbots by using a custom knowledge base. In this project, the Stanford SQuAD dataset is used as the knowledge base.
- 3. Llama 3.1: Llama 3.1 is a large language model used to generate responses to user questions. It is used in this project to generate responses to user questions, while also incorporating conversational context.
- 4. Transformers Agents 2.0: Transformers Agents 2.0 is a framework for building conversational AI systems. It is used in this project to build the chatbot.
- 5. Created a SquadRetrieverTool to integrate a fine-tuned BERT model into the agent, along with a TextToImageTool for a playful way to engage with the question-answering agent.
+ 3. Transformers Agents 2.0: Transformers Agents 2.0 is a framework for building conversational AI systems. It is used in this project to build the chatbot.
+ 4. SquadRetrieverTool: a custom tool created to integrate a fine-tuned BERT model into the agent, along with a TextToImageTool for a playful way to engage with the question-answering agent.
+ 5. Gradio: Gradio is used to create the chatbot interface, in `app.py`.
 
  ## Evaluation
 
- * [Agent Reasoning Benchmark](https://github.com/aymeric-roucher/agent_reasoning_benchmark)
- * [Hugging Face Blog: Open Source LLMs as Agents](https://huggingface.co/blog/open-source-llms-as-agents)
- * [Benchmarking Transformers Agents](https://github.com/aymeric-roucher/agent_reasoning_benchmark/blob/main/benchmark_transformers_agents.ipynb)
-
- ## Results
-
- TBD
-
- ## Limitations
-
- TBD
-
- ## Related Research
-
- * [Retro: A Generalist Agent for Science](https://arxiv.org/abs/2112.04426)
- * [RETRO-pytorch](https://github.com/lucidrains/RETRO-pytorch)
- * [Why isn't Retro mainstream? State-of-the-art within reach](https://www.reddit.com/r/MachineLearning/comments/1cffgkt/d_why_isnt_retro_mainstream_stateoftheart_within/)
-
- TBD
+ SemScore is used in this project to evaluate the chatbot's responses in the notebook `benchmarking.ipynb`.
+
+ See [SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity](https://doi.org/10.48550/arXiv.2401.17072).
+
+ In this experiment, the agent is evaluated with 3 different system prompting approaches:
+
+ 1. The default prompting approach, which is just the default system prompt used in Hugging Face Transformers Agents 2.0, with only an example of using the `squad_retriever` tool added.
+ 2. A succinct prompting approach, which guides the agent to be concise where possible while still answering the question.
+ 3. A focused prompting approach, which reframes the entire chatbot's purpose to focus more on the specific task of answering questions about the SQuAD dataset, while still being open to exploring other topics.
+
+ ## Results
+
+ ## Limitations
+
+ * This experiment is not designed for multiple users. While it has in-session memory, simply refreshing the browser will reset the chat history, which is convenient for experimentation.
+ * Some of the agent's underlying engines, models, and tools use keys that have usage limits, so the app may not work if those limits have been reached.
+ * It is recommended to clone the repo and run the code using your own keys, to avoid running into those limits.
 
  ## Acknowledgments
 
- * [Agents 2.0](https://github.com/huggingface/transformers/tree/main/src/transformers/agents)
+ * [Hugging Face Transformers Agents 2.0](https://huggingface.co/docs/transformers/en/main_classes/agent)
  * [SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity](https://arxiv.org/abs/2401.17072)
+ * `semscore.py` from [geronimi73/semscore](https://github.com/geronimi73/semscore/blob/main/semscore.py)
  * [SemScore](https://huggingface.co/blog/g-ronimo/semscore)
  * [Stanford SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)
- * [llama 3.1](https://github.com/meta-llama/Meta-Llama)
  * [Gradio](https://www.gradio.app/)
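
For context on the Evaluation section above, here is a minimal sketch of a SemScore-style scoring loop. `EmbeddingModelWrapper` is the helper from the vendored `semscore.py`, and the `get_embeddings`/`get_similarities` calls mirror the `test_default_agent` code this commit removes from `test_bots.py`; the answer strings are illustrative only.

```python
# Minimal sketch: SemScore-style semantic similarity between predicted
# and reference answers, following the call pattern of the removed
# test_default_agent in test_bots.py.
from semscore import EmbeddingModelWrapper

answers_ref = ["golden statue of the Virgin Mary", "1867"]        # illustrative
answers_pred = ["a golden statue of the Virgin Mary", "in 1867"]  # illustrative

em = EmbeddingModelWrapper()
similarities = em.get_similarities(
    em.get_embeddings(answers_pred),
    em.get_embeddings(answers_ref),
)
print(f"Mean semantic similarity: {similarities.mean():.3f}")
```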
agent.py CHANGED
@@ -3,6 +3,7 @@ from prompts import *
  from tools.squad_tools import SquadRetrieverTool, SquadQueryTool
  from transformers.agents.llm_engine import MessageRole, get_clean_message_list
  from openai import OpenAI
+ from prompts import FOCUSED_SQUAD_REACT_CODE_SYSTEM_PROMPT
 
  DEFAULT_TASK_SOLVING_TOOLBOX = [SquadRetrieverTool()]  # , SquadQueryTool()
 
@@ -30,7 +31,7 @@ class OpenAIModel:
 
  def get_agent(
      model_name=None,
-     system_prompt=DEFAULT_SQUAD_REACT_CODE_SYSTEM_PROMPT,
+     system_prompt=FOCUSED_SQUAD_REACT_CODE_SYSTEM_PROMPT,
      toolbox=DEFAULT_TASK_SOLVING_TOOLBOX,
      use_openai=True,
      openai_model_name="gpt-4o-mini-2024-07-18",
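
The `OpenAIModel` class named in the second hunk header sits mostly outside the diff, so for orientation here is a sketch of what such an engine adapter typically looks like in Transformers Agents 2.0. It follows the library's documented custom-engine pattern; the role-conversion mapping and client calls are assumptions, not the repository's actual implementation.

```python
# Sketch (assumed, not this repo's code): an OpenAI-backed LLM engine for
# Transformers Agents 2.0. The agent calls the engine with a message list;
# get_clean_message_list normalizes roles for the OpenAI chat API.
from openai import OpenAI
from transformers.agents.llm_engine import MessageRole, get_clean_message_list

# Agents emit tool-response messages; map them to a role OpenAI accepts.
openai_role_conversions = {MessageRole.TOOL_RESPONSE: MessageRole.USER}

class OpenAIModel:
    def __init__(self, model_name="gpt-4o-mini-2024-07-18"):
        self.model_name = model_name
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def __call__(self, messages, stop_sequences=[]):
        messages = get_clean_message_list(messages, role_conversions=openai_role_conversions)
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=messages,
            stop=stop_sequences,
        )
        return response.choices[0].message.content
```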
app.py CHANGED
@@ -40,6 +40,12 @@ model_name = (
      else "http://localhost:1234/v1"
  )
 
+ """
+ The ImageQuestionAnsweringTool from Transformers Agents 2.0 has a bug:
+ it claims to accept the path to an image, but it does not.
+ This class uses the adapter pattern to fix the issue, in a way that may remain
+ compatible with future versions of the tool even if the bug is fixed.
+ """
  class FixImageQuestionAnsweringTool(ImageQuestionAnsweringTool):
      def __init__(self, *args, **kwargs):
          super().__init__(*args, **kwargs)
@@ -49,6 +55,13 @@ class FixImageQuestionAnsweringTool(ImageQuestionAnsweringTool):
          image = Image.open(image)
          return super().encode(image, question)
 
+ """
+ The app version of the agent has access to additional tools that are not
+ available during benchmarking. We chose this approach to focus benchmarking
+ on the agent's ability to answer questions about the SQuAD dataset, without
+ the help of general knowledge available on the web. The demo app gets the
+ extra tools to provide a more interactive and engaging experience.
+ """
  ADDITIONAL_TOOLS = [
      DuckDuckGoSearchTool(),
      VisitWebpageTool(),
@@ -62,7 +75,7 @@ ADDITIONAL_TOOLS = [
  # Add image tools to the default task solving toolbox, for a more visually interactive experience
  TASK_SOLVING_TOOLBOX = DEFAULT_TASK_SOLVING_TOOLBOX + ADDITIONAL_TOOLS
 
- # system_prompt = DEFAULT_SQUAD_REACT_CODE_SYSTEM_PROMPT
+ # Using the focused prompt, which was the top-performing prompt during benchmarking
  system_prompt = FOCUSED_SQUAD_REACT_CODE_SYSTEM_PROMPT
 
  agent = get_agent(
@@ -72,9 +85,6 @@ agent = get_agent(
      use_openai=True,  # Use OpenAI instead of a local or HF model as the base LLM engine
  )
 
- app = None
-
-
  def append_example_message(x: gr.SelectData, messages):
      if x.value["text"] is not None:
          message = x.value["text"]
@@ -197,7 +207,7 @@ with gr.Blocks(
          "text": "What is on top of the Notre Dame building?",
      },
      {
-         "text": "Tell me what's on top of the Notre Dame building, and draw a picture of it.",
+         "text": "What is the Olympic Torch made of?",
      },
      {
          "text": "Draw a picture of whatever is on top of the Notre Dame building.",
 
benchmarking.ipynb CHANGED
The diff for this file is too large to render. See raw diff
 
test_bots.py CHANGED
@@ -1,53 +1,39 @@
- import pytest
  from deepeval import assert_test
  from deepeval.metrics import AnswerRelevancyMetric
  from deepeval.test_case import LLMTestCase
- import pandas as pd
- import os
  from agent import get_agent
- from semscore import EmbeddingModelWrapper
+ from prompts import FOCUSED_SQUAD_REACT_CODE_SYSTEM_PROMPT
  import logging
- from tqdm import tqdm
- from transformers.agents import agent_types
 
- def test_case():
-     answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
-     test_case = LLMTestCase(
-         input="What if these shoes don't fit?",
-         # Replace this with the actual output from your LLM application
-         actual_output="We offer a 30-day full refund at no extra costs.",
-         retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
-     )
-     assert_test(test_case, [answer_relevancy_metric])
-
-
- def test_default_agent():
-     SAMPLES_DIR = "samples"
-     os.makedirs(SAMPLES_DIR, exist_ok=True)
-     dfSample = pd.read_pickle(os.path.join(SAMPLES_DIR, f"samples.pkl"))
-     agent = get_agent()
-     # Suppress logging from the agent, which can be quite verbose
+ """
+ Test the chatbot's ability to carry out multi-turn conversations,
+ adapt to context, and handle a variety of topics.
+ """
+ def test_chatbot_goals():
+     user_messages = [
+         "What is on top of the Notre Dame building?",
+         "When did the United States purchase Alaska from Russia?",
+         "What year did Bern join the Swiss Confederacy?",
+         "Are there any other statues nearby the first one you mentioned?",
+     ]
+     minimum_acceptable_answers = [
+         "golden statue of the Virgin Mary",
+         "1867",
+         "1353",
+         "copper statue of Christ",
+     ]
+     agent = get_agent(system_prompt=FOCUSED_SQUAD_REACT_CODE_SYSTEM_PROMPT)
      agent.logger.setLevel(logging.CRITICAL)
-     answers_ref = []
-     answers_pred = []
-     for title, context, question, answer, synthesized_question in tqdm(dfSample.values):
-         class Output:
-             output: agent_types.AgentType | str = None
-
-         prompt = synthesized_question
-         answers_ref.append(answer)
-         final_answer = agent.run(prompt, stream=False, reset=True)
-         answers_pred.append(final_answer)
-
-     answers_ref = [str(answer) for answer in answers_ref]
-     answers_pred = [str(answer) for answer in answers_pred]
-
-     em = EmbeddingModelWrapper()
-     similarities = em.get_similarities(
-         em.get_embeddings(answers_pred),
-         em.get_embeddings(answers_ref),
-     )
-     mean_similarity = similarities.mean()
-
-     assert(mean_similarity >= 0.5, f"Mean similarity is too low: {mean_similarity}")
+     for i, (user_message, minimum_acceptable_answer) in enumerate(zip(user_messages, minimum_acceptable_answers)):
+         answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
+         reset = (i == 0)  # Reset the agent for the first message only
+         print(f"Running with reset={reset}")
+         answer = agent.run(user_message, stream=False, reset=reset)
+         print(f"User message: {user_message}")
+         print(f"Minimum acceptable answer: {minimum_acceptable_answer}")
+         print(f"Answer: {answer}")
+         test_case = LLMTestCase(
+             input=user_message,
+             actual_output=answer,
+         )
+         assert_test(test_case, [answer_relevancy_metric])
 
 
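
The new test leans on the agent's in-session memory via the `reset` flag. Here is a minimal sketch of the same pattern outside the test harness, assuming the keys from the README are set in the environment:

```python
# Sketch: multi-turn use of the agent outside pytest. reset=True starts a
# fresh conversation; reset=False keeps in-session memory so a follow-up
# like "the first one you mentioned" can resolve against the earlier turn.
from agent import get_agent
from prompts import FOCUSED_SQUAD_REACT_CODE_SYSTEM_PROMPT

agent = get_agent(system_prompt=FOCUSED_SQUAD_REACT_CODE_SYSTEM_PROMPT)
first = agent.run("What is on top of the Notre Dame building?", stream=False, reset=True)
follow_up = agent.run(
    "Are there any other statues nearby the first one you mentioned?",
    stream=False,
    reset=False,  # keep the conversational context from the first turn
)
print(first)
print(follow_up)
```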