Spaces: vonliechti/SQuAD_Agent_Experiment

vonliechti committed on Oct 13, 2024

Commit dd5fe55 (verified) · 1 Parent(s): c8e3129

Upload folder using huggingface_hub

Files changed (3)

README.md +5 -2
benchmarking.ipynb +404 -0
semscore.py +167 -0

README.md CHANGED

@@ -63,6 +63,9 @@ TBD
 ## Acknowledgments
-* [MemGPT](https://github.com/cpacker/MemGPT)
+* [Agents 2.0](https://github.com/huggingface/transformers/tree/main/src/transformers/agents)
+* [SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity](https://arxiv.org/abs/2401.17072)
+* [SemScore](https://huggingface.co/blog/g-ronimo/semscore)
 * [Stanford SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)
-* [GPT-4](https://openai.com/gpt-4/)
+* [llama 3.1](https://github.com/meta-llama/Meta-Llama)
+* [Gradio](https://www.gradio.app/)

benchmarking.ipynb ADDED

@@ -0,0 +1,404 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Load SQuAD data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 34,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import json\n",
+    "import pandas as pd\n",
+    "\n",
+    "def display_text_df(df):\n",
+    "    display(df.style.set_properties(**{'white-space': 'pre-wrap'}).set_table_styles(\n",
+    "        [{'selector': 'th', 'props': [('text-align', 'left')]},\n",
+    "         {'selector': 'td', 'props': [('text-align', 'left')]}\n",
+    "        ]\n",
+    "    ).hide())\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from data import get_data\n",
+    "data = get_data(download=False)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "('To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',\n",
+       " 'Saint Bernadette Soubirous')"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "data.question_answer_pairs[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<style type=\"text/css\">\n",
+       "#T_fc111 th {\n",
+       "  text-align: left;\n",
+       "}\n",
+       "#T_fc111 td {\n",
+       "  text-align: left;\n",
+       "}\n",
+       "#T_fc111_row0_col0, #T_fc111_row0_col1, #T_fc111_row1_col0, #T_fc111_row1_col1, #T_fc111_row2_col0, #T_fc111_row2_col1, #T_fc111_row3_col0, #T_fc111_row3_col1, #T_fc111_row4_col0, #T_fc111_row4_col1, #T_fc111_row5_col0, #T_fc111_row5_col1, #T_fc111_row6_col0, #T_fc111_row6_col1, #T_fc111_row7_col0, #T_fc111_row7_col1, #T_fc111_row8_col0, #T_fc111_row8_col1, #T_fc111_row9_col0, #T_fc111_row9_col1 {\n",
+       "  white-space: pre-wrap;\n",
+       "}\n",
+       "</style>\n",
+       "<table id=\"T_fc111\">\n",
+       "  <thead>\n",
+       "    <tr>\n",
+       "      <th id=\"T_fc111_level0_col0\" class=\"col_heading level0 col0\" >Question</th>\n",
+       "      <th id=\"T_fc111_level0_col1\" class=\"col_heading level0 col1\" >Answer</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <td id=\"T_fc111_row0_col0\" class=\"data row0 col0\" >What year was the Banská Akadémia founded?</td>\n",
+       "      <td id=\"T_fc111_row0_col1\" class=\"data row0 col1\" >1735</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td id=\"T_fc111_row1_col0\" class=\"data row1 col0\" >What is another speed that can also be reported by the camera?</td>\n",
+       "      <td id=\"T_fc111_row1_col1\" class=\"data row1 col1\" >SOS-based speed</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td id=\"T_fc111_row2_col0\" class=\"data row2 col0\" >Where were the use of advanced materials and techniques on display in Sumer?</td>\n",
+       "      <td id=\"T_fc111_row2_col1\" class=\"data row2 col1\" >Sumerian temples and palaces</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td id=\"T_fc111_row3_col0\" class=\"data row3 col0\" >Who is elected every even numbered year?</td>\n",
+       "      <td id=\"T_fc111_row3_col1\" class=\"data row3 col1\" >mayor</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td id=\"T_fc111_row4_col0\" class=\"data row4 col0\" >What was the purpose of top secret ICBM committee?</td>\n",
+       "      <td id=\"T_fc111_row4_col1\" class=\"data row4 col1\" >decide on the feasibility of building an ICBM large enough to carry a thermonuclear weapon</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td id=\"T_fc111_row5_col0\" class=\"data row5 col0\" >What conferences became a requirement after Vatican II?</td>\n",
+       "      <td id=\"T_fc111_row5_col1\" class=\"data row5 col1\" >National Bishop Conferences</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td id=\"T_fc111_row6_col0\" class=\"data row6 col0\" >Who does M fight with?</td>\n",
+       "      <td id=\"T_fc111_row6_col1\" class=\"data row6 col1\" >C</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td id=\"T_fc111_row7_col0\" class=\"data row7 col0\" >How many species of fungi have been found on Antarctica?</td>\n",
+       "      <td id=\"T_fc111_row7_col1\" class=\"data row7 col1\" >1150</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td id=\"T_fc111_row8_col0\" class=\"data row8 col0\" >After losing the battle of Guilford Courthouse, Cornawallis moved his troops where?</td>\n",
+       "      <td id=\"T_fc111_row8_col1\" class=\"data row8 col1\" >Virginia coastline</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td id=\"T_fc111_row9_col0\" class=\"data row9 col0\" >What is the Olympic Torch made from?</td>\n",
+       "      <td id=\"T_fc111_row9_col1\" class=\"data row9 col1\" >aluminum.</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n"
+      ],
+      "text/plain": [
+       "<pandas.io.formats.style.Styler at 0x3afc43c80>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "np.random.seed(42)\n",
+    "arr =np.array(data.question_answer_pairs)\n",
+    "n_samples = 10\n",
+    "indices = np.random.choice(len(arr), n_samples, replace=False)\n",
+    "random_sample = arr[indices]\n",
+    "# Display the questions and answers in the random sample as a dataframe\n",
+    "dfSample = pd.DataFrame(random_sample, columns=[\"Question\", \"Answer\"])\n",
+    "display_text_df(dfSample)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Create the agent to be evaluated"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from agent import get_agent\n",
+    "agent = get_agent()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Run the agent on the random sample of questions"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 36,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "4bce5a5c2449435dbd058ed938db2a91",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "  0%|          | 0/10 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "from gradio import ChatMessage\n",
+    "from transformers.agents import agent_types\n",
+    "from tqdm.notebook import tqdm\n",
+    "import logging\n",
+    "\n",
+    "answers_ref, answers_pred = [], []        \n",
+    "\n",
+    "# Suppress logging from the agent, which can be quite verbose\n",
+    "agent.logger.setLevel(logging.CRITICAL)\n",
+    "\n",
+    "for question, answer in tqdm(random_sample):\n",
+    "    class Output:\n",
+    "        output: agent_types.AgentType | str = None\n",
+    "\n",
+    "    prompt = question\n",
+    "    answers_ref.append(answer)\n",
+    "    final_answer = agent.run(prompt, stream=False, reset=True)\n",
+    "    answers_pred.append(final_answer)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Use semantic similarity to evaluate the agent's answers against the reference answers\n",
+    "\n",
+    "* One flaw of this approach is that it scores each prediction against a single reference answer, so other acceptable answers are penalized.\n",
+    "* It also compares the two answers in isolation, without the context of the question."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 37,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from semscore import EmbeddingModelWrapper\n",
+    "from statistics import mean\n",
+    "\n",
+    "answers_ref = [str(answer) for answer in answers_ref]\n",
+    "answers_pred = [str(answer) for answer in answers_pred]\n",
+    "\n",
+    "em = EmbeddingModelWrapper()\n",
+    "similarities = em.get_similarities(\n",
+    "    em.get_embeddings( answers_pred ),\n",
+    "    em.get_embeddings( answers_ref ),\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 39,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<style type=\"text/css\">\n",
+       "#T_67704 th {\n",
+       "  text-align: left;\n",
+       "}\n",
+       "#T_67704 td {\n",
+       "  text-align: left;\n",
+       "}\n",
+       "#T_67704_row0_col0, #T_67704_row0_col1, #T_67704_row0_col2, #T_67704_row0_col3, #T_67704_row1_col0, #T_67704_row1_col1, #T_67704_row1_col2, #T_67704_row1_col3, #T_67704_row2_col0, #T_67704_row2_col1, #T_67704_row2_col2, #T_67704_row2_col3, #T_67704_row3_col0, #T_67704_row3_col1, #T_67704_row3_col2, #T_67704_row3_col3, #T_67704_row4_col0, #T_67704_row4_col1, #T_67704_row4_col2, #T_67704_row4_col3, #T_67704_row5_col0, #T_67704_row5_col1, #T_67704_row5_col2, #T_67704_row5_col3, #T_67704_row6_col0, #T_67704_row6_col1, #T_67704_row6_col2, #T_67704_row6_col3, #T_67704_row7_col0, #T_67704_row7_col1, #T_67704_row7_col2, #T_67704_row7_col3, #T_67704_row8_col0, #T_67704_row8_col1, #T_67704_row8_col2, #T_67704_row8_col3, #T_67704_row9_col0, #T_67704_row9_col1, #T_67704_row9_col2, #T_67704_row9_col3 {\n",
+       "  white-space: pre-wrap;\n",
+       "}\n",
+       "</style>\n",
+       "<table id=\"T_67704\">\n",
+       "  <thead>\n",
+       "    <tr>\n",
+       "      <th id=\"T_67704_level0_col0\" class=\"col_heading level0 col0\" >Question</th>\n",
+       "      <th id=\"T_67704_level0_col1\" class=\"col_heading level0 col1\" >Reference Answer</th>\n",
+       "      <th id=\"T_67704_level0_col2\" class=\"col_heading level0 col2\" >Predicted Answer</th>\n",
+       "      <th id=\"T_67704_level0_col3\" class=\"col_heading level0 col3\" >Similarity</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <td id=\"T_67704_row0_col0\" class=\"data row0 col0\" >What year was the Banská Akadémia founded?</td>\n",
+       "      <td id=\"T_67704_row0_col1\" class=\"data row0 col1\" >1735</td>\n",
+       "      <td id=\"T_67704_row0_col2\" class=\"data row0 col2\" >1735</td>\n",
+       "      <td id=\"T_67704_row0_col3\" class=\"data row0 col3\" >1.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td id=\"T_67704_row1_col0\" class=\"data row1 col0\" >What is another speed that can also be reported by the camera?</td>\n",
+       "      <td id=\"T_67704_row1_col1\" class=\"data row1 col1\" >SOS-based speed</td>\n",
+       "      <td id=\"T_67704_row1_col2\" class=\"data row1 col2\" >Average speed</td>\n",
+       "      <td id=\"T_67704_row1_col3\" class=\"data row1 col3\" >0.433297</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td id=\"T_67704_row2_col0\" class=\"data row2 col0\" >Where were the use of advanced materials and techniques on display in Sumer?</td>\n",
+       "      <td id=\"T_67704_row2_col1\" class=\"data row2 col1\" >Sumerian temples and palaces</td>\n",
+       "      <td id=\"T_67704_row2_col2\" class=\"data row2 col2\" >Based on the information provided, it appears that the Sumerians developed and displayed advanced materials and techniques such as metrology, writing, and astronomy throughout their city-states. The specific locations where these advanced materials and techniques were on display are not explicitly mentioned.\n",
+       "\n",
+       "However, considering the context of the question, I would argue that the city-states of Sumer itself is the most relevant answer. The city-states of Sumer were the hub of Sumerian civilization, culture, and innovation, and it was likely there that these advanced materials and techniques were developed, displayed, and showcased.\n",
+       "\n",
+       "Therefore, my final answer to the user request is:\n",
+       "\n",
+       "The city-states of Sumer</td>\n",
+       "      <td id=\"T_67704_row2_col3\" class=\"data row2 col3\" >0.545807</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td id=\"T_67704_row3_col0\" class=\"data row3 col0\" >Who is elected every even numbered year?</td>\n",
+       "      <td id=\"T_67704_row3_col1\" class=\"data row3 col1\" >mayor</td>\n",
+       "      <td id=\"T_67704_row3_col2\" class=\"data row3 col2\" >mayor</td>\n",
+       "      <td id=\"T_67704_row3_col3\" class=\"data row3 col3\" >1.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td id=\"T_67704_row4_col0\" class=\"data row4 col0\" >What was the purpose of top secret ICBM committee?</td>\n",
+       "      <td id=\"T_67704_row4_col1\" class=\"data row4 col1\" >decide on the feasibility of building an ICBM large enough to carry a thermonuclear weapon</td>\n",
+       "      <td id=\"T_67704_row4_col2\" class=\"data row4 col2\" >decide on the feasibility of building an ICBM large enough to carry a thermonuclear weapon</td>\n",
+       "      <td id=\"T_67704_row4_col3\" class=\"data row4 col3\" >1.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td id=\"T_67704_row5_col0\" class=\"data row5 col0\" >What conferences became a requirement after Vatican II?</td>\n",
+       "      <td id=\"T_67704_row5_col1\" class=\"data row5 col1\" >National Bishop Conferences</td>\n",
+       "      <td id=\"T_67704_row5_col2\" class=\"data row5 col2\" >['National Bishop Conferences']</td>\n",
+       "      <td id=\"T_67704_row5_col3\" class=\"data row5 col3\" >0.937632</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td id=\"T_67704_row6_col0\" class=\"data row6 col0\" >Who does M fight with?</td>\n",
+       "      <td id=\"T_67704_row6_col1\" class=\"data row6 col1\" >C</td>\n",
+       "      <td id=\"T_67704_row6_col2\" class=\"data row6 col2\" >C</td>\n",
+       "      <td id=\"T_67704_row6_col3\" class=\"data row6 col3\" >1.000000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td id=\"T_67704_row7_col0\" class=\"data row7 col0\" >How many species of fungi have been found on Antarctica?</td>\n",
+       "      <td id=\"T_67704_row7_col1\" class=\"data row7 col1\" >1150</td>\n",
+       "      <td id=\"T_67704_row7_col2\" class=\"data row7 col2\" >Based on the output from the `squad_retriever` tool, I can see that there are two documents in the SQuAD dataset that answer the question \"How many species of fungi have been found on Antarctica?\".\n",
+       "\n",
+       "The first document states that about 1150 species of fungi have been recorded from Antarctica. The second document does not provide a different answer to this question.\n",
+       "\n",
+       "Therefore, my final answer is:\n",
+       "\n",
+       "There are approximately 1150 species of fungi that have been found on Antarctica.</td>\n",
+       "      <td id=\"T_67704_row7_col3\" class=\"data row7 col3\" >-0.020657</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td id=\"T_67704_row8_col0\" class=\"data row8 col0\" >After losing the battle of Guilford Courthouse, Cornawallis moved his troops where?</td>\n",
+       "      <td id=\"T_67704_row8_col1\" class=\"data row8 col1\" >Virginia coastline</td>\n",
+       "      <td id=\"T_67704_row8_col2\" class=\"data row8 col2\" >The Virginia coastline</td>\n",
+       "      <td id=\"T_67704_row8_col3\" class=\"data row8 col3\" >0.948570</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td id=\"T_67704_row9_col0\" class=\"data row9 col0\" >What is the Olympic Torch made from?</td>\n",
+       "      <td id=\"T_67704_row9_col1\" class=\"data row9 col1\" >aluminum.</td>\n",
+       "      <td id=\"T_67704_row9_col2\" class=\"data row9 col2\" >aluminum</td>\n",
+       "      <td id=\"T_67704_row9_col3\" class=\"data row9 col3\" >0.973508</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n"
+      ],
+      "text/plain": [
+       "<pandas.io.formats.style.Styler at 0x3b0db7320>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Mean similarity: 0.78\n"
+     ]
+    }
+   ],
+   "source": [
+    "import pandas as pd\n",
+    "questions = [question for question, _ in random_sample]\n",
+    "dfAnswers = pd.DataFrame(list(zip(questions, answers_ref, answers_pred)), columns=[\"Question\", \"Reference Answer\", \"Predicted Answer\"])\n",
+    "dfAnswers[\"Similarity\"] = similarities\n",
+    "display(dfAnswers.style.set_properties(**{'white-space': 'pre-wrap'}).set_table_styles(\n",
+    "    [{'selector': 'th', 'props': [('text-align', 'left')]},\n",
+    "     {'selector': 'td', 'props': [('text-align', 'left')]}\n",
+    "    ]\n",
+    ").hide())\n",
+    "print(f\"Mean similarity: {round(mean(similarities), 2)}\")\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "aai520",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
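
The notebook's markdown cell notes that scoring against a single reference answer ignores other acceptable answers. One possible mitigation, sketched here under stated assumptions, is to embed several references per question and keep the highest cosine similarity. The helper names `cosine` and `best_reference_similarity` and the toy vectors are hypothetical and not part of this commit; in practice the vectors would come from `EmbeddingModelWrapper.get_embeddings`.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float], eps: float = 1e-6) -> float:
    """Cosine similarity of two equal-length vectors, guarded against zero norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp each norm so an all-zero vector yields 0.0 instead of dividing by zero.
    return dot / (max(norm_a, eps) * max(norm_b, eps))


def best_reference_similarity(
    pred: Sequence[float], refs: Sequence[Sequence[float]]
) -> float:
    """Hypothetical helper: score a prediction against its closest reference."""
    if not refs:
        raise ValueError("at least one reference embedding is required")
    return max(cosine(pred, ref) for ref in refs)


if __name__ == "__main__":
    # Toy 2-d "embeddings": the second reference points the same way as the prediction.
    prediction = [1.0, 0.0]
    references = [[0.0, 1.0], [2.0, 0.0]]
    print(best_reference_similarity(prediction, references))
```

Taking the maximum rewards a prediction for matching any one acceptable answer. It does not address the second limitation listed in the notebook, since the question itself is still not part of the comparison.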

semscore.py ADDED

@@ -0,0 +1,167 @@

+from transformers import AutoTokenizer, AutoModel
+from accelerate import Accelerator
+from accelerate.utils import gather_object
+from tqdm import tqdm
+import torch, gc
+import torch.nn as nn
+class EmbeddingModelWrapper():
+    DEFAULT_MODEL="sentence-transformers/all-mpnet-base-v2"
+    def __init__(self, model_path=None, bs=8):
+        if model_path is None: model_path = self.DEFAULT_MODEL
+        self.model, self.tokenizer = self.load_model(model_path)
+        self.bs = bs
+        self.cos = nn.CosineSimilarity(dim=1, eps=1e-6)
+    def load_model(self, model_path):
+        model = AutoModel.from_pretrained(
+            model_path,
+        ).to("mps" if torch.backends.mps.is_available() else "cpu")  # fall back to CPU when MPS is unavailable
+        model.eval()
+        tokenizer = AutoTokenizer.from_pretrained(
+             model_path,
+        )
+        return model, tokenizer
+    def emb_mean_pooling(self, model_output, attention_mask):
+        token_embeddings = model_output[0]
+        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+    def get_embeddings(self, sentences):
+        embeddings=torch.tensor([],device=self.model.device)  # keep results on the model's device
+        if self.bs is None:
+            batches=[sentences]
+        else:
+            batches = [sentences[i:i + self.bs] for i in range(0, len(sentences), self.bs)]
+        for sentences in batches:
+            encoded_input = self.tokenizer(sentences, padding=True, truncation=True, return_tensors='pt').to(self.model.device)
+            with torch.no_grad():
+                model_output = self.model(**encoded_input)
+            batch_embeddings=self.emb_mean_pooling(model_output, encoded_input['attention_mask'])
+            embeddings=torch.cat( (embeddings, batch_embeddings), dim=0 )
+        return embeddings
+    def get_similarities(self, x, y=None):
+        if y is None:
+            num_samples=x.shape[0]
+            similarities = [[0 for i in range(num_samples)] for f in range(num_samples)]
+            for row in tqdm(range(num_samples)):
+                similarities[row][0:row+1]=self.cos(x[row].repeat(row+1,1), x[0:row+1]).tolist()
+            return similarities
+        else:
+            return self.cos(x,y).tolist()
+class ModelPredictionGenerator:
+    def __init__(self, model, tokenizer, eval_dataset, use_accelerate=False, bs=8, generation_config=None):
+        self.model=model
+        self.tokenizer=tokenizer
+        self.bs=bs
+        self.eval_prompts=self.messages_to_prompts( eval_dataset )
+        self.use_accelerate=use_accelerate
+        self.accelerator = Accelerator()
+        assert tokenizer.eos_token_id is not None
+        assert tokenizer.chat_template is not None
+        if tokenizer.pad_token_id is None:
+            tokenizer.pad_token_id = tokenizer.eos_token_id
+        # llama-precise
+        if generation_config is None:
+            self.generation_config = {
+                "temperature": 0.7,
+                "top_p": 0.1,
+                "repetition_penalty": 1.18,
+                "top_k": 40,
+                "do_sample": True,
+                "max_new_tokens": 100,
+                "pad_token_id": tokenizer.pad_token_id
+            }
+        else:
+            self.generation_config = generation_config
+    def clear_cache(self):
+        if torch.backends.mps.is_available(): torch.mps.empty_cache()  # MPS-only call; skipped on other backends
+        gc.collect()
+    def messages_to_prompts(self, ds):
+        prompts=[]
+        for conversation in ds["messages"]:
+            for i,msg in enumerate(conversation):
+                if msg["role"]=="user" and i+1 < len(conversation):  # skip a trailing user turn that has no reference reply
+                    prompts.append(
+                        dict (
+                            # prompt: format current messages up to the current user message and add a generation prompt
+                            prompt=self.tokenizer.apply_chat_template(conversation[:i+1], add_generation_prompt=True, tokenize=False),
+                            answer_ref=conversation[i+1]["content"]
+                        )
+                    )
+        return prompts
+    def get_batches(self, dataset, batch_size):
+        return [dataset[i:i + batch_size] for i in range(0, len(dataset), batch_size)]
+    def tokenize_batch(self, batch):
+        pad_side=self.tokenizer.padding_side
+        self.tokenizer.padding_side="left"     # left pad for inference
+        prompts=[ item["prompt"] for item in batch ]
+        prompts_tok=self.tokenizer(
+            prompts,
+            return_tensors="pt",
+            padding='longest',
+            truncation=True,
+            max_length=self.tokenizer.model_max_length,
+            return_length=True,
+            pad_to_multiple_of=8,
+            add_special_tokens=False
+        ).to(self.model.device)
+        self.tokenizer.padding_side=pad_side   # restore orig. padding side
+        return prompts_tok
+    def generate_batch(self, batch_tok):
+        with torch.no_grad():
+            outputs_tok=self.model.generate(
+                input_ids=batch_tok["input_ids"],
+                attention_mask=batch_tok["attention_mask"],
+                **self.generation_config
+            ).to("cpu")
+        outputs=[
+            # cut prompt from output
+            self.tokenizer.decode(
+                outputs_tok[i][outputs_tok[i] != self.tokenizer.pad_token_id][batch_tok["length"][i]:],
+                spaces_between_special_tokens=False,
+                skip_special_tokens=True
+                ).strip()
+            for i,t in enumerate(outputs_tok) ]
+        return outputs
+    def run(self):
+        self.model.eval()
+        self.clear_cache()
+        if self.use_accelerate:
+            with self.accelerator.split_between_processes(list(range(len(self.eval_prompts)))) as eval_prompts_local_idcs:
+                eval_prompts_local = [self.eval_prompts[i] for i in eval_prompts_local_idcs]
+        else:
+            eval_prompts_local = self.eval_prompts
+        for batch in tqdm( self.get_batches(eval_prompts_local, self.bs) ):
+            batch_tok = self.tokenize_batch( batch )
+            answers = self.generate_batch( batch_tok )
+            for i in range(len(batch)):
+                batch[i]["answer_pred"]=answers[i]
+                batch[i]["GPU"]=self.accelerator.process_index
+        if self.use_accelerate:
+            return gather_object(eval_prompts_local)
+        else:
+            return eval_prompts_local
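
`emb_mean_pooling` in `semscore.py` averages token embeddings while ignoring padded positions, using the attention mask as weights. As a minimal, dependency-free sketch of that idea for a single sentence, the function below does the same arithmetic on plain Python lists. The name `masked_mean_pool` and the toy inputs are hypothetical and only illustrate the computation; the committed code operates on batched `torch` tensors.

```python
from typing import List, Sequence


def masked_mean_pool(
    token_embeddings: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
    eps: float = 1e-9,
) -> List[float]:
    """Average the token vectors whose mask entry is 1; padded tokens are ignored."""
    if not token_embeddings:
        raise ValueError("token_embeddings must not be empty")
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for j, value in enumerate(vector):
                sums[j] += value
    # Clamp the token count so a fully masked input does not divide by zero.
    count = max(float(sum(attention_mask)), eps)
    return [total / count for total in sums]


if __name__ == "__main__":
    # Three tokens, the last one is padding and must not affect the mean.
    tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
    mask = [1, 1, 0]
    print(masked_mean_pool(tokens, mask))
```

The clamp on the token count plays the same role as `torch.clamp(..., min=1e-9)` in the committed method: it only matters when every position is masked.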