## Load SQuAD data

In [1]:
import os
import numpy as np
import pandas as pd
from transformers.agents import agent_types
from tqdm.notebook import tqdm
import logging
from semscore import EmbeddingModelWrapper
from statistics import mean


def display_text_df(df):
    display(df.style.set_properties(**{'white-space': 'pre-wrap'}).set_table_styles(
        [{'selector': 'th', 'props': [('text-align', 'left')]},
         {'selector': 'td', 'props': [('text-align', 'left')]}
        ]
    ).hide())


In [2]:
from data import get_data
data = get_data(download=False)


Initializing Data...
Download: False
Loading data...
Raw Data loaded
Chroma DB already exists
Loading index...
Index loaded


In [3]:
display_text_df(data.df.head(3))


Title,Context,Question,Answer
University_of_Notre_Dame,"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend ""Venite Ad Me Omnes"". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?,Saint Bernadette Soubirous
University_of_Notre_Dame,"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend ""Venite Ad Me Omnes"". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",What is in front of the Notre Dame Main Building?,a copper statue of Christ
University_of_Notre_Dame,"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend ""Venite Ad Me Omnes"". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",The Basilica of the Sacred heart at Notre Dame is beside to which structure?,the Main Building


In [4]:
np.random.seed(42)
# Select 10 random rows from data.df
dfSample = data.df.sample(n=10)
display_text_df(dfSample)

Title,Context,Question,Answer
Institute_of_technology,"The world's first institution of technology or technical university with tertiary technical education is the Banská Akadémia in Banská Štiavnica, Slovakia, founded in 1735, Academy since December 13, 1762 established by queen Maria Theresa in order to train specialists of silver and gold mining and metallurgy in neighbourhood. Teaching started in 1764. Later the department of Mathematics, Mechanics and Hydraulics and department of Forestry were settled. University buildings are still at their place today and are used for teaching. University has launched the first book of electrotechnics in the world.",What year was the Banská Akadémia founded?,1735
Film_speed,"The standard specifies how speed ratings should be reported by the camera. If the noise-based speed (40:1) is higher than the saturation-based speed, the noise-based speed should be reported, rounded downwards to a standard value (e.g. 200, 250, 320, or 400). The rationale is that exposure according to the lower saturation-based speed would not result in a visibly better image. In addition, an exposure latitude can be specified, ranging from the saturation-based speed to the 10:1 noise-based speed. If the noise-based speed (40:1) is lower than the saturation-based speed, or undefined because of high noise, the saturation-based speed is specified, rounded upwards to a standard value, because using the noise-based speed would lead to overexposed images. The camera may also report the SOS-based speed (explicitly as being an SOS speed), rounded to the nearest standard speed rating.",What is another speed that can also be reported by the camera?,SOS-based speed
Sumer,"The most impressive and famous of Sumerian buildings are the ziggurats, large layered platforms which supported temples. Sumerian cylinder seals also depict houses built from reeds not unlike those built by the Marsh Arabs of Southern Iraq until as recently as 400 CE. The Sumerians also developed the arch, which enabled them to develop a strong type of dome. They built this by constructing and linking several arches. Sumerian temples and palaces made use of more advanced materials and techniques,[citation needed] such as buttresses, recesses, half columns, and clay nails.",Where were the use of advanced materials and techniques on display in Sumer?,Sumerian temples and palaces
"Ann_Arbor,_Michigan","Ann Arbor has a council-manager form of government. The City Council has 11 voting members: the mayor and 10 city council members. The mayor and city council members serve two-year terms: the mayor is elected every even-numbered year, while half of the city council members are up for election annually (five in even-numbered and five in odd-numbered years). Two council members are elected from each of the city's five wards. The mayor is elected citywide. The mayor is the presiding officer of the City Council and has the power to appoint all Council committee members as well as board and commission members, with the approval of the City Council. The current mayor of Ann Arbor is Christopher Taylor, a Democrat who was elected as mayor in 2014. Day-to-day city operations are managed by a city administrator chosen by the city council.",Who is elected every even numbered year?,mayor
John_von_Neumann,"Shortly before his death, when he was already quite ill, von Neumann headed the United States government's top secret ICBM committee, and it would sometimes meet in his home. Its purpose was to decide on the feasibility of building an ICBM large enough to carry a thermonuclear weapon. Von Neumann had long argued that while the technical obstacles were sizable, they could be overcome in time. The SM-65 Atlas passed its first fully functional test in 1959, two years after his death. The feasibility of an ICBM owed as much to improved, smaller warheads as it did to developments in rocketry, and his understanding of the former made his advice invaluable.",What was the purpose of top secret ICBM committee?,decide on the feasibility of building an ICBM large enough to carry a thermonuclear weapon
Pope_Paul_VI,"Some critiqued Paul VI's decision; the newly created Synod of Bishops had an advisory role only and could not make decisions on their own, although the Council decided exactly that. During the pontificate of Paul VI, five such synods took place, and he is on record of implementing all their decisions. Related questions were raised about the new National Bishop Conferences, which became mandatory after Vatican II. Others questioned his Ostpolitik and contacts with Communism and the deals he engaged in for the faithful.",What conferences became a requirement after Vatican II?,National Bishop Conferences
Spectre_(2015_film),"Bond and Swann return to London where they meet M, Bill Tanner, Q, and Moneypenny; they intend to arrest C and stop Nine Eyes from going online. Swann leaves Bond, telling him she cannot be part of a life involving espionage, and is subsequently kidnapped. On the way, the group is ambushed and Bond is kidnapped, but the rest still proceed with the plan. After Q succeeds in preventing the Nine Eyes from going online, a brief struggle between M and C ends with the latter falling to his death. Meanwhile, Bond is taken to the old MI6 building, which is scheduled for demolition, and frees himself. Moving throughout the ruined labyrinth, he encounters a disfigured Blofeld, who tells him that he has three minutes to escape the building before explosives are detonated or die trying to save Swann. Bond finds Swann and the two escape by boat as the building collapses. Bond shoots down Blofeld's helicopter, which crashes onto Westminster Bridge. As Blofeld crawls away from the wreckage, Bond confronts him but ultimately leaves him to be arrested by M. Bond leaves the bridge with Swann.",Who does M fight with?,C
Antarctica,"About 1150 species of fungi have been recorded from Antarctica, of which about 750 are non-lichen-forming and 400 are lichen-forming. Some of these species are cryptoendoliths as a result of evolution under extreme conditions, and have significantly contributed to shaping the impressive rock formations of the McMurdo Dry Valleys and surrounding mountain ridges. The apparently simple morphology, scarcely differentiated structures, metabolic systems and enzymes still active at very low temperatures, and reduced life cycles shown by such fungi make them particularly suited to harsh environments such as the McMurdo Dry Valleys. In particular, their thick-walled and strongly melanized cells make them resistant to UV light. Those features can also be observed in algae and cyanobacteria, suggesting that these are adaptations to the conditions prevailing in Antarctica. This has led to speculation that, if life ever occurred on Mars, it might have looked similar to Antarctic fungi such as Cryomyces minteri. Some of these fungi are also apparently endemic to Antarctica. Endemic Antarctic fungi also include certain dung-inhabiting species which have had to evolve in response to the double challenge of extreme cold while growing on dung, and the need to survive passage through the gut of warm-blooded animals.",How many species of fungi have been found on Antarctica?,1150
North_Carolina,"In the Battle of Cowan's Ford, Cornwallis met resistance along the banks of the Catawba River at Cowan's Ford on February 1, 1781, in an attempt to engage General Morgan's forces during a tactical withdrawal. Morgan had moved to the northern part of the state to combine with General Greene's newly recruited forces. Generals Greene and Cornwallis finally met at the Battle of Guilford Courthouse in present-day Greensboro on March 15, 1781. Although the British troops held the field at the end of the battle, their casualties at the hands of the numerically superior Continental Army were crippling. Following this ""Pyrrhic victory"", Cornwallis chose to move to the Virginia coastline to get reinforcements, and to allow the Royal Navy to protect his battered army. This decision would result in Cornwallis' eventual defeat at Yorktown, Virginia, later in 1781. The Patriots' victory there guaranteed American independence.","After losing the battle of Guilford Courthouse, Cornawallis moved his troops where?",Virginia coastline
2008_Summer_Olympics_torch_relay,"The Olympic Torch is based on traditional scrolls and uses a traditional Chinese design known as ""Lucky Cloud"". It is made from aluminum. It is 72 centimetres high and weighs 985 grams. The torch is designed to remain lit in 65 kilometre per hour (37 mile per hour) winds, and in rain of up to 50 millimetres (2 inches) per hour. An ignition key is used to ignite and extinguish the flame. The torch is fueled by cans of propane. Each can will light the torch for 15 minutes. It is designed by a team from Lenovo Group. The Torch is designed in reference to the traditional Chinese concept of the 5 elements that make up the entire universe.",What is the Olympic Torch made from?,aluminum.


### Create the agent to be evaluated

### Run the agent on the random sample of questions

* Unlike SQuAD's usual retrieval QA or open generative QA settings, our agent would normally receive context over the course of a natural conversation, as the user elaborates on what they want to know.
* For benchmarking, we therefore include the context needed to answer each question directly in the prompt.

### Use semantic similarity to evaluate the agent's answers against the reference answers

* One flaw of this approach is that it does not account for multiple acceptable answers.
* Another flaw is that the agent may be unfairly penalized for elaborating on its answer, because the benchmark compares against only the single canonical answer.
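
As a rough illustration of the scoring idea and of the first flaw, the sketch below assumes that the similarity score is a cosine similarity between embedding vectors; the internals of `EmbeddingModelWrapper` are not shown in this notebook, so this is an assumption. The helper `best_reference_similarity` and the toy vectors are hypothetical. They show one way multiple acceptable answers could be handled: score against every reference and keep the best match.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_reference_similarity(pred_vec, ref_vecs):
    """Hypothetical helper: score against each acceptable answer, keep the best."""
    return max(cosine_similarity(pred_vec, ref) for ref in ref_vecs)


# Toy 3-dimensional "embeddings", made up for illustration only.
pred = [1.0, 0.0, 0.0]
refs = [
    [0.0, 1.0, 0.0],  # an unrelated reference answer
    [2.0, 0.0, 0.0],  # same direction as the prediction
]

score = best_reference_similarity(pred, refs)
print(score)
```

Because cosine similarity ignores vector length, the second reference matches the prediction perfectly even though its magnitude differs.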


In [6]:
BENCHMARKS_DIR = "benchmarks"

def benchmark_agent(agent, dfSample, name):
    answers_ref, answers_pred = [], []

    # Suppress logging from the agent, which can be quite verbose
    agent.logger.setLevel(logging.CRITICAL)

    for title, context, question, answer in tqdm(dfSample.values):
        prompt = f"""
            Read the following document and answer the question.

            Document Title: {title}
            Document Content: {context}

            Question: {question}
        """
        answers_ref.append(answer)
        final_answer = agent.run(prompt, stream=False, reset=True)
        answers_pred.append(final_answer)

    # Score once, after every question has been answered
    answers_ref = [str(answer) for answer in answers_ref]
    answers_pred = [str(answer) for answer in answers_pred]

    em = EmbeddingModelWrapper()
    similarities = em.get_similarities(
        em.get_embeddings(answers_pred),
        em.get_embeddings(answers_ref),
    )

    dfAnswers = dfSample.copy()
    dfAnswers["Predicted Answer"] = answers_pred
    dfAnswers["Similarity"] = similarities

    # Persist the results so that benchmarks can be compared later
    os.makedirs(BENCHMARKS_DIR, exist_ok=True)
    dfAnswers.to_pickle(os.path.join(BENCHMARKS_DIR, f"{name}.pkl"))


  0%|          | 0/10 [00:00<?, ?it/s]

### Set up and run the benchmarks

In [7]:
from agent import get_agent

benchmarks = [
    (get_agent(), "baseline"), # Baseline agent with default settings
]

for agent, name in tqdm(benchmarks):
    benchmark_agent(agent, dfSample, name)

In [11]:
# Load and display all benchmarks
def load_benchmarks():
    benchmarks_dir = "benchmarks"
    benchmarks = []
    for file in sorted(os.listdir(benchmarks_dir)):  # sorted for a stable display order
        if file.endswith(".pkl"):
            df = pd.read_pickle(os.path.join(benchmarks_dir, file))
            benchmarks.append(df)
    return benchmarks

benchmarks = load_benchmarks()

for benchmark in benchmarks:
    display_text_df(benchmark)


Title,Context,Question,Answer,Predicted Answer,Similarity
Institute_of_technology,"The world's first institution of technology or technical university with tertiary technical education is the Banská Akadémia in Banská Štiavnica, Slovakia, founded in 1735, Academy since December 13, 1762 established by queen Maria Theresa in order to train specialists of silver and gold mining and metallurgy in neighbourhood. Teaching started in 1764. Later the department of Mathematics, Mechanics and Hydraulics and department of Forestry were settled. University buildings are still at their place today and are used for teaching. University has launched the first book of electrotechnics in the world.",What year was the Banská Akadémia founded?,1735,1735,1.0
Film_speed,"The standard specifies how speed ratings should be reported by the camera. If the noise-based speed (40:1) is higher than the saturation-based speed, the noise-based speed should be reported, rounded downwards to a standard value (e.g. 200, 250, 320, or 400). The rationale is that exposure according to the lower saturation-based speed would not result in a visibly better image. In addition, an exposure latitude can be specified, ranging from the saturation-based speed to the 10:1 noise-based speed. If the noise-based speed (40:1) is lower than the saturation-based speed, or undefined because of high noise, the saturation-based speed is specified, rounded upwards to a standard value, because using the noise-based speed would lead to overexposed images. The camera may also report the SOS-based speed (explicitly as being an SOS speed), rounded to the nearest standard speed rating.",What is another speed that can also be reported by the camera?,SOS-based speed,saturation-based speed,0.555529
Sumer,"The most impressive and famous of Sumerian buildings are the ziggurats, large layered platforms which supported temples. Sumerian cylinder seals also depict houses built from reeds not unlike those built by the Marsh Arabs of Southern Iraq until as recently as 400 CE. The Sumerians also developed the arch, which enabled them to develop a strong type of dome. They built this by constructing and linking several arches. Sumerian temples and palaces made use of more advanced materials and techniques,[citation needed] such as buttresses, recesses, half columns, and clay nails.",Where were the use of advanced materials and techniques on display in Sumer?,Sumerian temples and palaces,temples and palaces,0.726322
"Ann_Arbor,_Michigan","Ann Arbor has a council-manager form of government. The City Council has 11 voting members: the mayor and 10 city council members. The mayor and city council members serve two-year terms: the mayor is elected every even-numbered year, while half of the city council members are up for election annually (five in even-numbered and five in odd-numbered years). Two council members are elected from each of the city's five wards. The mayor is elected citywide. The mayor is the presiding officer of the City Council and has the power to appoint all Council committee members as well as board and commission members, with the approval of the City Council. The current mayor of Ann Arbor is Christopher Taylor, a Democrat who was elected as mayor in 2014. Day-to-day city operations are managed by a city administrator chosen by the city council.",Who is elected every even numbered year?,mayor,The mayor is elected every even-numbered year.,0.493396
John_von_Neumann,"Shortly before his death, when he was already quite ill, von Neumann headed the United States government's top secret ICBM committee, and it would sometimes meet in his home. Its purpose was to decide on the feasibility of building an ICBM large enough to carry a thermonuclear weapon. Von Neumann had long argued that while the technical obstacles were sizable, they could be overcome in time. The SM-65 Atlas passed its first fully functional test in 1959, two years after his death. The feasibility of an ICBM owed as much to improved, smaller warheads as it did to developments in rocketry, and his understanding of the former made his advice invaluable.",What was the purpose of top secret ICBM committee?,decide on the feasibility of building an ICBM large enough to carry a thermonuclear weapon,decide on the feasibility of building an ICBM large enough to carry a thermonuclear weapon,1.0
Pope_Paul_VI,"Some critiqued Paul VI's decision; the newly created Synod of Bishops had an advisory role only and could not make decisions on their own, although the Council decided exactly that. During the pontificate of Paul VI, five such synods took place, and he is on record of implementing all their decisions. Related questions were raised about the new National Bishop Conferences, which became mandatory after Vatican II. Others questioned his Ostpolitik and contacts with Communism and the deals he engaged in for the faithful.",What conferences became a requirement after Vatican II?,National Bishop Conferences,The National Bishop Conferences became mandatory after Vatican II.,0.442729
Spectre_(2015_film),"Bond and Swann return to London where they meet M, Bill Tanner, Q, and Moneypenny; they intend to arrest C and stop Nine Eyes from going online. Swann leaves Bond, telling him she cannot be part of a life involving espionage, and is subsequently kidnapped. On the way, the group is ambushed and Bond is kidnapped, but the rest still proceed with the plan. After Q succeeds in preventing the Nine Eyes from going online, a brief struggle between M and C ends with the latter falling to his death. Meanwhile, Bond is taken to the old MI6 building, which is scheduled for demolition, and frees himself. Moving throughout the ruined labyrinth, he encounters a disfigured Blofeld, who tells him that he has three minutes to escape the building before explosives are detonated or die trying to save Swann. Bond finds Swann and the two escape by boat as the building collapses. Bond shoots down Blofeld's helicopter, which crashes onto Westminster Bridge. As Blofeld crawls away from the wreckage, Bond confronts him but ultimately leaves him to be arrested by M. Bond leaves the bridge with Swann.",Who does M fight with?,C,C,1.0
Antarctica,"About 1150 species of fungi have been recorded from Antarctica, of which about 750 are non-lichen-forming and 400 are lichen-forming. Some of these species are cryptoendoliths as a result of evolution under extreme conditions, and have significantly contributed to shaping the impressive rock formations of the McMurdo Dry Valleys and surrounding mountain ridges. The apparently simple morphology, scarcely differentiated structures, metabolic systems and enzymes still active at very low temperatures, and reduced life cycles shown by such fungi make them particularly suited to harsh environments such as the McMurdo Dry Valleys. In particular, their thick-walled and strongly melanized cells make them resistant to UV light. Those features can also be observed in algae and cyanobacteria, suggesting that these are adaptations to the conditions prevailing in Antarctica. This has led to speculation that, if life ever occurred on Mars, it might have looked similar to Antarctic fungi such as Cryomyces minteri. Some of these fungi are also apparently endemic to Antarctica. Endemic Antarctic fungi also include certain dung-inhabiting species which have had to evolve in response to the double challenge of extreme cold while growing on dung, and the need to survive passage through the gut of warm-blooded animals.",How many species of fungi have been found on Antarctica?,1150,1150,1.0
North_Carolina,"In the Battle of Cowan's Ford, Cornwallis met resistance along the banks of the Catawba River at Cowan's Ford on February 1, 1781, in an attempt to engage General Morgan's forces during a tactical withdrawal. Morgan had moved to the northern part of the state to combine with General Greene's newly recruited forces. Generals Greene and Cornwallis finally met at the Battle of Guilford Courthouse in present-day Greensboro on March 15, 1781. Although the British troops held the field at the end of the battle, their casualties at the hands of the numerically superior Continental Army were crippling. Following this ""Pyrrhic victory"", Cornwallis chose to move to the Virginia coastline to get reinforcements, and to allow the Royal Navy to protect his battered army. This decision would result in Cornwallis' eventual defeat at Yorktown, Virginia, later in 1781. The Patriots' victory there guaranteed American independence.","After losing the battle of Guilford Courthouse, Cornawallis moved his troops where?",Virginia coastline,the Virginia coastline,0.94857
2008_Summer_Olympics_torch_relay,"The Olympic Torch is based on traditional scrolls and uses a traditional Chinese design known as ""Lucky Cloud"". It is made from aluminum. It is 72 centimetres high and weighs 985 grams. The torch is designed to remain lit in 65 kilometre per hour (37 mile per hour) winds, and in rain of up to 50 millimetres (2 inches) per hour. An ignition key is used to ignite and extinguish the flame. The torch is fueled by cans of propane. Each can will light the torch for 15 minutes. It is designed by a team from Lenovo Group. The Torch is designed in reference to the traditional Chinese concept of the 5 elements that make up the entire universe.",What is the Olympic Torch made from?,aluminum.,aluminum,0.973508
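
The table above can be summarized with a short script. The sketch below copies the reference answers, predicted answers, and similarity scores from that table, computes the mean similarity, and flags low-scoring rows whose prediction still contains the reference answer verbatim. The `LOW_SCORE` cut-off and the `normalize` helper are assumptions made for this illustration, not part of the benchmark itself.

```python
import string
from statistics import mean

# (reference answer, predicted answer, similarity), copied from the table above
results = [
    ("1735", "1735", 1.0),
    ("SOS-based speed", "saturation-based speed", 0.555529),
    ("Sumerian temples and palaces", "temples and palaces", 0.726322),
    ("mayor", "The mayor is elected every even-numbered year.", 0.493396),
    ("decide on the feasibility of building an ICBM large enough to carry a thermonuclear weapon",
     "decide on the feasibility of building an ICBM large enough to carry a thermonuclear weapon",
     1.0),
    ("National Bishop Conferences",
     "The National Bishop Conferences became mandatory after Vatican II.", 0.442729),
    ("C", "C", 1.0),
    ("1150", "1150", 1.0),
    ("Virginia coastline", "the Virginia coastline", 0.94857),
    ("aluminum.", "aluminum", 0.973508),
]


def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


LOW_SCORE = 0.8  # hypothetical cut-off, chosen only for this illustration

mean_similarity = mean(sim for _, _, sim in results)

# Low-scoring rows whose prediction nevertheless contains the reference answer
penalized_elaborations = [
    ref for ref, pred, sim in results
    if sim < LOW_SCORE and normalize(ref) in normalize(pred)
]

print(f"Mean similarity: {mean_similarity:.3f}")
print("Possibly penalized for elaborating:", penalized_elaborations)
```

Under these assumptions, the two flagged rows are correct answers phrased as full sentences, which matches the second flaw noted earlier. The "SOS-based speed" row is not flagged, because there the prediction is actually wrong. A plain substring check is crude (a short reference such as "C" could match inside unrelated words), so it is best treated as a prompt for manual review rather than a metric.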
