+
+**Note:** More details on the commands are given below.
+
+## Other examples
+
+### Handle textual instructions
+
+In the `GoToDoor` environment, the agent receives an image along with a textual instruction. To handle the latter, add `--text` to the command:
+
+```
+python3 -m scripts.train --algo ppo --env MiniGrid-GoToDoor-5x5-v0 --model GoToDoor --text --save-interval 10 --frames 1000000
+```
+
+
+
+### Handle dialogue with a multi-headed agent
+
+In the `GoToDoorTalk` environment, the agent receives an image along with the dialogue. To handle the latter, add `--dialogue` and, to use the multi-headed agent, add `--multi-headed-agent` to the command:
+
+```
+python3 -m scripts.train --algo ppo --env MiniGrid-GoToDoorTalk-5x5-v0 --model GoToDoorMultiHead --dialogue --multi-headed-agent --save-interval 10 --frames 1000000
+```
+
+### Add memory
+
+In the `RedBlueDoors` environment, the agent has to open the red door then the blue one. To solve it efficiently, the agent has to remember that it has already opened the red door. To give the agent memory, add `--recurrence X` to the command:
+
+```
+python3 -m scripts.train --algo ppo --env MiniGrid-RedBlueDoors-6x6-v0 --model RedBlueDoors --recurrence 4 --save-interval 10 --frames 1000000
+```
+
+
+
+## Files
+
+This package contains:
+- scripts to:
+  - train an agent \
+  in `scripts/train.py` ([more details](#scripts-train))
+  - visualize agent's behavior \
+  in `scripts/visualize.py` ([more details](#scripts-visualize))
+  - evaluate agent's performance \
+  in `scripts/evaluate.py` ([more details](#scripts-evaluate))
+- a default agent's model \
+in `model.py` ([more details](#model))
+- utility classes and functions used by the scripts \
+in `utils`
+
+These files are suited for [`gym-minigrid`](https://github.com/maximecb/gym-minigrid) environments and [`torch-ac`](https://github.com/lcswillems/torch-ac) RL algorithms. They are easy to adapt to other environments and RL algorithms by modifying:
+- `model.py`
+- `utils/format.py`
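+
+For example, supporting a new observation format mostly comes down to changing how raw observations are turned into tensors. Below is a minimal, purely illustrative sketch of such a preprocessing function; the names and exact behaviour of the real code in `utils/format.py` may differ:
+
+```python
+import numpy
+import torch
+
+def preprocess_images(images, device=None):
+    # Stack observations from parallel environments into one float tensor
+    # of shape (batch, height, width, channels).
+    return torch.tensor(numpy.array(images), device=device, dtype=torch.float)
+```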
+
+
+<h2 id="scripts-train">scripts/train.py</h2>
+
+An example of use:
+
+```bash
+python3 -m scripts.train --algo ppo --env MiniGrid-DoorKey-5x5-v0 --model DoorKey --save-interval 10 --frames 80000
+```
+
+The script loads the model in `storage/DoorKey` or creates it if it doesn't exist, then trains it with the PPO algorithm on the MiniGrid DoorKey environment, and saves it every 10 updates in `storage/DoorKey`. It stops after 80 000 frames.
+
+**Note:** You can define a different storage location in the environment variable `PROJECT_STORAGE`.
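+
+For example (the path below is only a placeholder), models would then be saved under `$PROJECT_STORAGE/DoorKey` instead of `storage/DoorKey`:
+
+```bash
+export PROJECT_STORAGE=/path/to/my/storage
+python3 -m scripts.train --algo ppo --env MiniGrid-DoorKey-5x5-v0 --model DoorKey --save-interval 10 --frames 80000
+```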
+
+More generally, the script has 2 required arguments:
+- `--algo ALGO`: name of the RL algorithm used to train
+- `--env ENV`: name of the environment to train on
+
+and a bunch of optional arguments among which:
+- `--recurrence N`: gradient will be backpropagated over N timesteps. By default, N = 1. If N > 1, an LSTM is added to the model to give it memory.
+- `--text`: a GRU is added to the model to handle text input.
+- ... (see more using `--help`)
+
+During training, logs are printed in your terminal (and saved in text and CSV format):
+
+
+
+**Note:** `U` gives the update number, `F` the total number of frames, `FPS` the number of frames per second, `D` the total duration, `rR:μσmM` the mean, std, min and max reshaped return per episode, `F:μσmM` the mean, std, min and max number of frames per episode, `H` the entropy, `V` the value, `pL` the policy loss, `vL` the value loss and `∇` the gradient norm.
+
+During training, logs are also plotted in Tensorboard:
+
+
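+Assuming the default storage location, they can be viewed by pointing Tensorboard at the storage directory:
+
+```bash
+tensorboard --logdir storage
+```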
+
+
+<h2 id="scripts-visualize">scripts/visualize.py</h2>
+
+An example of use:
+
+```
+python3 -m scripts.visualize --env MiniGrid-DoorKey-5x5-v0 --model DoorKey
+```
+
+
+
+In this use case, the script displays how the model in `storage/DoorKey` behaves on the MiniGrid DoorKey environment.
+
+More generally, the script has 2 required arguments:
+- `--env ENV`: name of the environment to act on.
+- `--model MODEL`: name of the trained model.
+
+and a bunch of optional arguments among which:
+- `--argmax`: select the action with the highest probability instead of sampling one (illustrated below)
+- ... (see more using `--help`)
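+
+The difference between the default behaviour and `--argmax` boils down to sampling from the policy distribution versus taking its mode, as in this small illustrative sketch:
+
+```python
+import torch
+from torch.distributions import Categorical
+
+probs = torch.tensor([0.1, 0.7, 0.2])  # action probabilities output by the policy
+dist = Categorical(probs=probs)
+
+sampled_action = dist.sample()  # default: stochastic action selection
+argmax_action = probs.argmax()  # --argmax: always pick the most likely action
+```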
+
+
+<h2 id="scripts-evaluate">scripts/evaluate.py</h2>
+
+An example of use:
+
+```
+python3 -m scripts.evaluate --env MiniGrid-DoorKey-5x5-v0 --model DoorKey
+```
+
+
+
+In this use case, the script evaluates the model in `storage/DoorKey` over 100 episodes and prints its performance in the terminal.
+
+More generally, the script has 2 required arguments:
+- `--env ENV`: name of the environment to act on.
+- `--model MODEL`: name of the trained model.
+
+and a bunch of optional arguments among which:
+- `--episodes N`: number of episodes of evaluation. By default, N = 100.
+- ... (see more using `--help`)
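+
+Conceptually, the evaluation amounts to running the policy for N episodes and averaging the returns, as in the rough sketch below (the real script batches environments and reports more statistics; `select_action` stands for whatever policy is being evaluated):
+
+```python
+import gym
+import gym_minigrid  # registers the MiniGrid-* environments
+
+def evaluate(env_id, select_action, episodes=100):
+    env = gym.make(env_id)
+    returns = []
+    for _ in range(episodes):
+        obs, done, total = env.reset(), False, 0.0
+        while not done:
+            obs, reward, done, _ = env.step(select_action(obs))
+            total += reward
+        returns.append(total)
+    # Mean return over all evaluation episodes
+    return sum(returns) / len(returns)
+```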
+
+
+<h2 id="model">model.py</h2>
+
+The default model is described by the following schema:
+
+
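+As a rough, purely illustrative approximation, the default model is an actor-critic network: a convolutional embedding of the image observation feeds a policy (actor) head and a value (critic) head, with an optional LSTM memory (enabled by `--recurrence`) and an optional GRU over the instruction (enabled by `--text`). The sketch below only conveys that structure; layer sizes and names in `model.py` differ:
+
+```python
+import torch.nn as nn
+from torch.distributions import Categorical
+
+class ACModelSketch(nn.Module):
+    # Illustrative only: the real model also has an optional LSTM memory and a text GRU.
+    def __init__(self, n_actions, embedding_size=64):
+        super().__init__()
+        # Convolutional embedding of the 7x7x3 egocentric image observation
+        self.image_conv = nn.Sequential(
+            nn.Conv2d(3, 16, (2, 2)), nn.ReLU(),
+            nn.MaxPool2d((2, 2)),
+            nn.Conv2d(16, 32, (2, 2)), nn.ReLU(),
+            nn.Conv2d(32, 64, (2, 2)), nn.ReLU(),
+            nn.Flatten(),
+        )
+        # Separate heads for the policy (actor) and the value function (critic)
+        self.actor = nn.Sequential(nn.Linear(embedding_size, 64), nn.Tanh(), nn.Linear(64, n_actions))
+        self.critic = nn.Sequential(nn.Linear(embedding_size, 64), nn.Tanh(), nn.Linear(64, 1))
+
+    def forward(self, image):
+        # image: (batch, height, width, channels) -> (batch, channels, height, width)
+        embedding = self.image_conv(image.transpose(1, 3).transpose(2, 3))
+        return Categorical(logits=self.actor(embedding)), self.critic(embedding).squeeze(1)
+```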
+
+### Empty environment
+
+This environment is an empty room, and the goal of the agent is to reach the
+green goal square, which provides a sparse reward. A small penalty is
+subtracted for the number of steps to reach the goal. This environment is
+useful, with small rooms, to validate that your RL algorithm works correctly,
+and with large rooms to experiment with sparse rewards and exploration.
+The random variants of the environment have the agent starting at a random
+position for each episode, while the regular variants have the agent always
+starting in the corner opposite to the goal.
+
+### Four rooms environment
+
+Registered configurations:
+- `MiniGrid-FourRooms-v0`
+
+
+
+
+
+Classic four room reinforcement learning environment. The agent must navigate
+in a maze composed of four rooms interconnected by 4 gaps in the walls. To
+obtain a reward, the agent must reach the green goal square. Both the agent
+and the goal square are randomly placed in any of the four rooms.
+
+### Door & key environment
+
+Registered configurations:
+- `MiniGrid-DoorKey-5x5-v0`
+- `MiniGrid-DoorKey-6x6-v0`
+- `MiniGrid-DoorKey-8x8-v0`
+- `MiniGrid-DoorKey-16x16-v0`
+
+
+
+
+
+This environment has a key that the agent must pick up in order to unlock
+a door and then get to the green goal square. Because of the sparse reward,
+this environment is difficult to solve using classical RL algorithms. It is
+useful for experimenting with curiosity or curriculum learning.
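+
+Each registered configuration can be instantiated through the `gym` registry (the same pattern is used by `benchmark.py` further down), e.g. for the environment used in the training examples above:
+
+```python
+import gym
+import gym_minigrid  # importing the package registers the MiniGrid-* environments
+
+env = gym.make('MiniGrid-DoorKey-5x5-v0')
+obs = env.reset()
+obs, reward, done, info = env.step(env.action_space.sample())
+```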
+
+### Multi-room environment
+
+Registered configurations:
+- `MiniGrid-MultiRoom-N2-S4-v0` (two small rooms)
+- `MiniGrid-MultiRoom-N4-S5-v0` (four rooms)
+- `MiniGrid-MultiRoom-N6-v0` (six rooms)
+
+
+
+
+
+This environment has a series of connected rooms with doors that must be
+opened in order to get to the next room. The final room has the green goal
+square the agent must get to. This environment is extremely difficult to
+solve using RL alone. However, by gradually increasing the number of
+rooms and building a curriculum, the environment can be solved.
+
+### Fetch environment
+
+Registered configurations:
+- `MiniGrid-Fetch-5x5-N2-v0`
+- `MiniGrid-Fetch-6x6-N2-v0`
+- `MiniGrid-Fetch-8x8-N3-v0`
+
+
+
+
+
+This environment has multiple objects of assorted types and colors. The
+agent receives a textual string as part of its observation telling it
+which object to pick up. Picking up the wrong object produces a negative
+reward.
+
+### Go-to-door environment
+
+Registered configurations:
+- `MiniGrid-GoToDoor-5x5-v0`
+- `MiniGrid-GoToDoor-6x6-v0`
+- `MiniGrid-GoToDoor-8x8-v0`
+
+
+
+
+
+This environment is a room with four doors, one on each wall. The agent
+receives a textual (mission) string as input, telling it which door to go to
+(e.g. "go to the red door"). It receives a positive reward for performing the
+`done` action next to the correct door, as indicated in the mission string.
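+
+With the standard `gym-minigrid` observation format, this instruction is available in the observation dictionary under the `mission` key:
+
+```python
+import gym
+import gym_minigrid
+
+env = gym.make('MiniGrid-GoToDoor-5x5-v0')
+obs = env.reset()
+print(obs['mission'])  # e.g. "go to the red door"
+```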
+
+### Put-near environment
+
+Registered configurations:
+- `MiniGrid-PutNear-6x6-N2-v0`
+- `MiniGrid-PutNear-8x8-N3-v0`
+
+The agent is instructed through a textual string to pick up an object and
+place it next to another object. This environment is easy to solve with two
+objects, but difficult to solve with more, as it involves both textual
+understanding and spatial reasoning involving multiple objects.
+
+### Red and blue doors environment
+
+Registered configurations:
+- `MiniGrid-RedBlueDoors-6x6-v0`
+- `MiniGrid-RedBlueDoors-8x8-v0`
+
+The purpose of this environment is to test memory.
+The agent is randomly placed within a room with one red and one blue door
+facing opposite directions. The agent has to open the red door and then open
+the blue door, in that order. When facing one door, the agent cannot see
+the door behind it. Hence, the agent needs to remember whether or not it has
+previously opened the other door in order to reliably succeed at completing
+the task.
+
+### Memory environment
+
+Registered configurations:
+- `MiniGrid-MemoryS17Random-v0`
+- `MiniGrid-MemoryS13Random-v0`
+- `MiniGrid-MemoryS13-v0`
+- `MiniGrid-MemoryS11-v0`
+- `MiniGrid-MemoryS9-v0`
+- `MiniGrid-MemoryS7-v0`
+
+This environment is a memory test. The agent starts in a small room
+where it sees an object. It then has to go through a narrow hallway
+which ends in a split. At each end of the split there is an object,
+one of which is the same as the object in the starting room. The
+agent has to remember the initial object, and go to the matching
+object at the split.
+
+### Locked room environment
+
+Registered configurations:
+- `MiniGrid-LockedRoom-v0`
+
+The environment has six rooms, one of which is locked. The agent receives
+a textual mission string as input, telling it which room to go to in order
+to get the key that opens the locked room. It then has to go into the locked
+room in order to reach the final goal. This environment is extremely difficult
+to solve with vanilla reinforcement learning alone.
+
+### Key corridor environment
+
+Registered configurations:
+- `MiniGrid-KeyCorridorS3R1-v0`
+- `MiniGrid-KeyCorridorS3R2-v0`
+- `MiniGrid-KeyCorridorS3R3-v0`
+- `MiniGrid-KeyCorridorS4R3-v0`
+- `MiniGrid-KeyCorridorS5R3-v0`
+- `MiniGrid-KeyCorridorS6R3-v0`
+
+
+
+
+
+
+
+
+
+
+This environment is similar to the locked room environment, but there are
+multiple registered environment configurations of increasing size,
+making it easier to use curriculum learning to train an agent to solve it.
+The agent has to pick up an object which is behind a locked door. The key is
+hidden in another room, and the agent has to explore the environment to find
+it. The mission string does not give the agent any clues as to where the
+key is placed. This environment can be solved without relying on language.
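+
+Since `scripts/train.py` resumes training from an existing model of the same name, a simple hand-made curriculum can be run by training the same model on configurations of increasing size (the frame budgets below are arbitrary):
+
+```bash
+python3 -m scripts.train --algo ppo --env MiniGrid-KeyCorridorS3R1-v0 --model KeyCorridor --save-interval 10 --frames 500000
+python3 -m scripts.train --algo ppo --env MiniGrid-KeyCorridorS3R2-v0 --model KeyCorridor --save-interval 10 --frames 1000000
+python3 -m scripts.train --algo ppo --env MiniGrid-KeyCorridorS3R3-v0 --model KeyCorridor --save-interval 10 --frames 1500000
+```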
+
+### Unlock environment
+
+Registered configurations:
+- `MiniGrid-Unlock-v0`
+
+
+
+
+
+The agent has to open a locked door. This environment can be solved without
+relying on language.
+
+### Unlock pickup environment
+
+Registered configurations:
+- `MiniGrid-UnlockPickup-v0`
+
+
+
+
+
+The agent has to pick up a box which is placed in another room, behind a
+locked door. This environment can be solved without relying on language.
+
+### Blocked unlock pickup environment
+
+Registered configurations:
+- `MiniGrid-BlockedUnlockPickup-v0`
+
+
+
+
+
+The agent has to pick up a box which is placed in another room, behind a
+locked door. The door is also blocked by a ball which the agent has to move
+before it can unlock the door. Hence, the agent has to learn to move the ball,
+pick up the key, open the door and pick up the object in the other room.
+This environment can be solved without relying on language.
+
+### Obstructed maze environment
+
+Registered configurations:
+- `MiniGrid-ObstructedMaze-1Dl-v0`
+- `MiniGrid-ObstructedMaze-1Dlh-v0`
+- `MiniGrid-ObstructedMaze-1Dlhb-v0`
+- `MiniGrid-ObstructedMaze-2Dl-v0`
+- `MiniGrid-ObstructedMaze-2Dlh-v0`
+- `MiniGrid-ObstructedMaze-2Dlhb-v0`
+- `MiniGrid-ObstructedMaze-1Q-v0`
+- `MiniGrid-ObstructedMaze-2Q-v0`
+- `MiniGrid-ObstructedMaze-Full-v0`
+
+
+
+
+
+
+
+
+
+
+
+
+
+The agent has to pick up a box which is placed in a corner of a 3x3 maze.
+The doors are locked, the keys are hidden in boxes and doors are obstructed
+by balls. This environment can be solved without relying on language.
+
+### Distributional shift environment
+
+Registered configurations:
+- `MiniGrid-DistShift1-v0`
+- `MiniGrid-DistShift2-v0`
+
+This environment is based on one of the DeepMind [AI safety gridworlds](https://github.com/deepmind/ai-safety-gridworlds).
+The agent starts in the top-left corner and must reach the goal which is in the top-right corner, but has to avoid stepping
+into lava on its way. The aim of this environment is to test an agent's ability to generalize. There are two slightly
+different variants of the environment, so that the agent can be trained on one variant and tested on the other.
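+
+For instance, an agent can be trained on the first variant and evaluated on the second using the scripts described above:
+
+```bash
+python3 -m scripts.train --algo ppo --env MiniGrid-DistShift1-v0 --model DistShift --save-interval 10 --frames 1000000
+python3 -m scripts.evaluate --env MiniGrid-DistShift2-v0 --model DistShift
+```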
+
+
+
+### Lava gap environment
+
+The agent has to reach the green goal square at the opposite corner of the room,
+and must pass through a narrow gap in a vertical strip of deadly lava. Touching
+the lava terminates the episode with a zero reward. This environment is useful
+for studying safety and safe exploration.
+
+### Lava crossing environment
+
+Registered configurations:
+- `MiniGrid-LavaCrossingS9N1-v0`
+- `MiniGrid-LavaCrossingS9N2-v0`
+- `MiniGrid-LavaCrossingS9N3-v0`
+- `MiniGrid-LavaCrossingS11N5-v0`
+
+
+
+
+
+
+
+
+The agent has to reach the green goal square on the other corner of the room
+while avoiding rivers of deadly lava which terminate the episode in failure.
+Each lava stream runs across the room either horizontally or vertically, and
+has a single crossing point which can be safely used. Luckily, a path to the
+goal is guaranteed to exist. This environment is useful for studying safety and
+safe exploration.
+
+### Simple crossing environment
+
+Registered configurations:
+- `MiniGrid-SimpleCrossingS9N1-v0`
+- `MiniGrid-SimpleCrossingS9N2-v0`
+- `MiniGrid-SimpleCrossingS9N3-v0`
+- `MiniGrid-SimpleCrossingS11N5-v0`
+
+
+
+
+
+
+
+
+Similar to the `LavaCrossing` environment, the agent has to reach the green
+goal square on the other corner of the room; however, the lava is replaced by
+walls. This MDP is therefore much easier and may be useful for quickly
+testing your algorithms.
+
+### Dynamic obstacles environment
+
+Registered configurations:
+- `MiniGrid-Dynamic-Obstacles-5x5-v0`
+- `MiniGrid-Dynamic-Obstacles-Random-5x5-v0`
+- `MiniGrid-Dynamic-Obstacles-6x6-v0`
+- `MiniGrid-Dynamic-Obstacles-Random-6x6-v0`
+- `MiniGrid-Dynamic-Obstacles-8x8-v0`
+- `MiniGrid-Dynamic-Obstacles-16x16-v0`
+
+
+
+
+
+This environment is an empty room with moving obstacles. The goal of the agent is to reach the green goal square without colliding with any obstacle. A large penalty is subtracted if the agent collides with an obstacle, and the episode then finishes. This environment is useful to test dynamic obstacle avoidance for mobile robots with reinforcement learning under partial observability.
diff --git a/gym-minigrid/benchmark.py b/gym-minigrid/benchmark.py
new file mode 100755
index 0000000000000000000000000000000000000000..81840254ddfb148e32c2e42d366e511e04ab4737
--- /dev/null
+++ b/gym-minigrid/benchmark.py
@@ -0,0 +1,53 @@
+#!/usr/bin/env python3
+
+import time
+import argparse
+import gym_minigrid
+import gym
+from gym_minigrid.wrappers import *
+
+parser = argparse.ArgumentParser()
+parser.add_argument(
+ "--env-name",
+ dest="env_name",
+ help="gym environment to load",
+ default='MiniGrid-LavaGapS7-v0'
+)
+parser.add_argument("--num_resets", default=200, type=int)
+parser.add_argument("--num_frames", default=5000, type=int)
+args = parser.parse_args()
+
+env = gym.make(args.env_name)
+
+# Benchmark env.reset
+t0 = time.time()
+for i in range(args.num_resets):
+ env.reset()
+t1 = time.time()
+dt = t1 - t0
+reset_time = (1000 * dt) / args.num_resets
+
+# Benchmark rendering
+t0 = time.time()
+for i in range(args.num_frames):
+ env.render('rgb_array')
+t1 = time.time()
+dt = t1 - t0
+frames_per_sec = args.num_frames / dt
+
+# Create an environment with an RGB agent observation
+env = gym.make(args.env_name)
+env = RGBImgPartialObsWrapper(env)
+env = ImgObsWrapper(env)
+
+# Benchmark stepping with an RGB agent observation
+t0 = time.time()
+for i in range(args.num_frames):
+ obs, reward, done, info = env.step(0)
+t1 = time.time()
+dt = t1 - t0
+agent_view_fps = args.num_frames / dt
+
+print('Env reset time: {:.1f} ms'.format(reset_time))
+print('Rendering FPS : {:.0f}'.format(frames_per_sec))
+print('Agent view FPS: {:.0f}'.format(agent_view_fps))
diff --git a/gym-minigrid/gym_minigrid/__init__.py b/gym-minigrid/gym_minigrid/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..5c672525c51e14c4f3e5ccf9ee9467480e4d2c65
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/__init__.py
@@ -0,0 +1,6 @@
+# Import the envs module so that envs register themselves
+import gym_minigrid.envs
+import gym_minigrid.social_ai_envs
+
+# Import wrappers so it's accessible when installing with pip
+import gym_minigrid.wrappers
diff --git a/gym-minigrid/gym_minigrid/backup_envs/bobo.py b/gym-minigrid/gym_minigrid/backup_envs/bobo.py
new file mode 100644
index 0000000000000000000000000000000000000000..f1f44c09dcf75faa5c306f707092fc904ba37407
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/bobo.py
@@ -0,0 +1,301 @@
+import numpy as np
+
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+import time
+from collections import deque
+
+
+class Peer(NPC):
+ """
+ A peer NPC that, if knowledgeable, navigates to the door and exits the room
+ """
+
+ def __init__(self, color, name, env, knowledgeable=False):
+ super().__init__(color)
+ self.name = name
+ self.npc_dir = 1 # NPC initially looks downward
+ self.npc_type = 0
+ self.env = env
+ self.knowledgeable = knowledgeable
+ self.npc_actions = []
+ self.dancing_step_idx = 0
+ self.actions = MiniGridEnv.Actions
+ self.add_npc_direction = True
+ self.available_moves = [self.rotate_left, self.rotate_right, self.go_forward, self.toggle_action]
+ self.exited = False
+
+ def step(self):
+ if self.exited:
+ return
+
+ if all(np.array(self.cur_pos) == np.array(self.env.door_pos)):
+ # todo: disappear
+ # todo: close door
+ self.env.grid.set(*self.cur_pos, self.env.object)
+ self.cur_pos = np.array([np.nan, np.nan])
+
+ self.env.object.toggle(self.env, self.cur_pos)
+
+ self.exited = True
+
+ elif self.knowledgeable:
+
+ if all(self.front_pos == self.env.door_pos):
+ # in front of door
+ if self.env.object.is_open:
+ self.go_forward()
+ else:
+ self.toggle_action()
+
+ else:
+ if (self.cur_pos[0] == self.env.door_pos[0]) or (self.cur_pos[1] == self.env.door_pos[1]):
+ # is either in the correct row on in the correct column
+ next_wanted_position = self.env.door_pos
+ else:
+ # choose the midpoint
+ for cand_x, cand_y in [
+ (self.cur_pos[0], self.env.door_pos[1]),
+ (self.env.door_pos[0], self.cur_pos[1])
+ ]:
+ if (
+ cand_x > 0 and cand_x < self.env.wall_x
+ ) and (
+ cand_y > 0 and cand_y < self.env.wall_y
+ ):
+ next_wanted_position = (cand_x, cand_y)
+
+ if self.cur_pos[1] == next_wanted_position[1]:
+ # same y
+ if self.cur_pos[0] < next_wanted_position[0]:
+ wanted_dir = 0
+ else:
+ wanted_dir = 2
+ if self.npc_dir == wanted_dir:
+ self.go_forward()
+
+ else:
+ self.rotate_left()
+
+ elif self.cur_pos[0] == next_wanted_position[0]:
+ # same x
+ if self.cur_pos[1] < next_wanted_position[1]:
+ wanted_dir = 1
+ else:
+ wanted_dir = 3
+
+
+ if self.npc_dir == wanted_dir:
+ self.go_forward()
+
+ else:
+ self.rotate_left()
+ else:
+ raise ValueError("Something is wrong.")
+
+ else:
+ self.env._rand_elem(self.available_moves)()
+
+
+class BoboGrammar(object):
+
+ templates = ["Move your", "Shake your"]
+ things = ["body", "head"]
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+
+class BoboEnv(MultiModalMiniGridEnv):
+ """
+ Environment in which the agent has to reach the door of the room,
+ accompanied by a peer NPC
+ """
+
+ def __init__(
+ self,
+ size=5,
+ diminished_reward=True,
+ step_penalty=False,
+ knowledgeable=False,
+ ):
+ assert size >= 5
+ self.empty_symbol = "NA \n"
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+ self.knowledgeable = knowledgeable
+
+ super().__init__(
+ grid_size=size,
+ max_steps=5*size**2,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *BoboGrammar.grammar_action_space.nvec
+ ]),
+ add_npc_direction=True
+ )
+
+ print({
+ "size": size,
+ "diminished_reward": diminished_reward,
+ "step_penalty": step_penalty,
+ })
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height, nb_obj_dims=4)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ self.wall_x = width - 1
+ self.wall_y = height - 1
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ door_color = self._rand_elem(COLOR_NAMES)
+
+ wall_for_door = self._rand_int(0, 4)
+
+ if wall_for_door < 2:
+ w = self._rand_int(1, width-1)
+ h = height-1 if wall_for_door == 0 else 0
+ else:
+ w = width-1 if wall_for_door == 3 else 0
+ h = self._rand_int(1, height-1)
+
+ self.door_pos = (w, h)
+ self.door = Door(door_color)
+ self.grid.set(*self.door_pos, self.door)
+
+ # Set a randomly coloured Dancer NPC
+ color = self._rand_elem(COLOR_NAMES)
+ self.peer = Peer(color, "Jim", self, knowledgeable=self.knowledgeable)
+
+ # Place it on the middle left side of the room
+ peer_pos = np.array((self._rand_int(1, width - 1), self._rand_int(1, height - 1)))
+
+ self.grid.set(*peer_pos, self.peer)
+ self.peer.init_pos = peer_pos
+ self.peer.cur_pos = peer_pos
+
+ # Randomize the agent's start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Generate the mission string
+ self.mission = 'watch dancer and repeat his moves afterwards'
+
+ # Dummy beginning string
+ self.beginning_string = "This is what you hear. \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ # used for rendering
+ self.conversation = self.utterance
+
+ def step(self, action):
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ obs, reward, done, info = super().step(p_action)
+
+ if np.isnan(p_action):
+ pass
+
+ if p_action == self.actions.done:
+ done = True
+
+ self.peer.step()
+
+ if all(self.agent_pos == self.door_pos):
+ reward = self._reward()
+ done = True
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+ print("conversation:\n", self.conversation)
+ print("utterance_history:\n", self.utterance_history)
+ self.window.set_caption(self.conversation, [self.peer.name])
+ return obs
+
+
+class Bobo8x8Env(BoboEnv):
+ def __init__(self):
+ super().__init__(size=8)
+
+
+class Bobo6x6Env(BoboEnv):
+ def __init__(self):
+ super().__init__(size=6)
+
+
+# knowledgeable
+class BoboKnowledgeableEnv(BoboEnv):
+ def __init__(self):
+ super().__init__(size=5, knowledgeable=True)
+
+class BoboKnowledgeable6x6Env(BoboEnv):
+ def __init__(self):
+ super().__init__(size=6, knowledgeable=True)
+
+class BoboKnowledgeable8x8Env(BoboEnv):
+ def __init__(self):
+ super().__init__(size=8, knowledgeable=True)
+
+
+
+register(
+ id='MiniGrid-Bobo-5x5-v0',
+ entry_point='gym_minigrid.envs:BoboEnv'
+)
+
+register(
+ id='MiniGrid-Bobo-6x6-v0',
+ entry_point='gym_minigrid.envs:Bobo6x6Env'
+)
+
+register(
+ id='MiniGrid-Bobo-8x8-v0',
+ entry_point='gym_minigrid.envs:Bobo8x8Env'
+)
+
+register(
+ id='MiniGrid-BoboKnowledgeable-5x5-v0',
+ entry_point='gym_minigrid.envs:BoboKnowledgeableEnv'
+)
+
+register(
+ id='MiniGrid-BoboKnowledgeable-6x6-v0',
+ entry_point='gym_minigrid.envs:BoboKnowledgeable6x6Env'
+)
+
+register(
+ id='MiniGrid-BoboKnowledgeable-8x8-v0',
+ entry_point='gym_minigrid.envs:BoboKnowledgeable8x8Env'
+)
diff --git a/gym-minigrid/gym_minigrid/backup_envs/cointhief.py b/gym-minigrid/gym_minigrid/backup_envs/cointhief.py
new file mode 100644
index 0000000000000000000000000000000000000000..271b897e7c688976e05c36f9cc19f3be4b63a886
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/cointhief.py
@@ -0,0 +1,431 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+import time
+from collections import deque
+
+
+class Thief(NPC):
+ """
+ A thief NPC that stops the agent and demands the coins it has spotted
+ """
+
+ def __init__(self, color, name, env, thief_pos, hidden_npc=False, tag_visible_coins=False, view_size=5, look_around=True):
+ super().__init__(color)
+ self.name = name
+ self.npc_type = 0
+ self.env = env
+ self.cur_pos = thief_pos
+ self.init_pos = thief_pos
+ self.view_size = view_size
+ self.npc_dir = self._look_at_agent() # Thief stares at its victim
+ self.init_dir = self.npc_dir
+ self.actions = self.env.possible_actions
+ self.tag_visible_coins = tag_visible_coins
+
+ self.nb_given_coins = None
+ self.look_around = look_around
+ if self.look_around: # randomly select in which direction NPC is looking around
+ if np.random.random() > 0.5: # will look left
+ self.look = self.rotate_left
+ self.look_back = self.rotate_right
+ else: # will look right
+ self.look = self.rotate_right
+ self.look_back = self.rotate_left
+
+ self.nb_seen_coins = self._count_coins() # This is how many coins the Thief wants, at least
+ self.add_npc_direction = True
+ self.nb_steps = 0
+ self.hidden_npc = hidden_npc
+
+ def step(self, agent_action, agent_utterance):
+ agent_disobeyed = False
+ agent_gave_coins = False
+ utterance = None
+
+ if self.nb_steps == 0:
+ utterance = "Freeze! Give me all the coins you have!"
+
+ if self.nb_steps >= 0 and self.look_around:
+ if self.npc_dir == self.init_dir: # start to look around
+ self.look()
+ else: # resume looking to agent
+ self.look_back()
+
+ if not(agent_utterance is None):
+ self.nb_given_coins = int(agent_utterance[-2])
+
+ if self.nb_given_coins >= self.nb_seen_coins:
+ agent_gave_coins = True
+ else: # agent gave not enough coins
+ agent_disobeyed = True
+
+ # The thief forbids the agent to move, apart from looking around (rotating)
+ if not (np.isnan(agent_action) or agent_action == self.actions.left or agent_action == self.actions.right):
+ agent_disobeyed = True
+
+ self.nb_steps += 1
+ return agent_disobeyed, agent_gave_coins, utterance
+
+ def get_status_str(self):
+ return "thief sees: {} \n agent gives: {}".format(self.nb_seen_coins, self.nb_given_coins)
+
+ def _count_coins(self):
+ # get seen coins
+ coins_pos = self.get_pos_visible_coins()
+
+ if self.look_around:
+ self.look()
+ # add coins visible from this new direction
+ coins_pos += self.get_pos_visible_coins()
+ # remove coins that we already saw
+ if len(coins_pos) > 0:
+ coins_pos = np.unique(coins_pos, axis=0).tolist()
+ self.look_back()
+
+ return len(coins_pos)
+
+ def _look_at_agent(self):
+ npc_dir = None
+ ax, ay = self.env.agent_pos
+ tx, ty = self.cur_pos
+ delta_x, delta_y = ax - tx, ay - ty
+ if delta_x == 1:
+ npc_dir = 0
+ elif delta_x == -1:
+ npc_dir = 2
+ elif delta_y == 1:
+ npc_dir = 1
+ elif delta_y == -1:
+ npc_dir = 3
+ else:
+ raise NotImplementedError
+
+ return npc_dir
+
+ def gen_npc_obs_grid(self):
+ """
+ Generate the sub-grid observed by the npc.
+ This method also outputs a visibility mask telling us which grid
+ cells the npc can actually see.
+ """
+ view_size = self.view_size
+
+ topX, topY, botX, botY = self.env.get_view_exts(dir=self.npc_dir, view_size=view_size, pos=self.cur_pos)
+
+ grid = self.env.grid.slice(topX, topY, view_size, view_size)
+
+ for i in range(self.npc_dir + 1):
+ grid = grid.rotate_left()
+
+ # Process occluders and visibility
+ # Note that this incurs some performance cost
+ if not self.env.see_through_walls:
+ vis_mask = grid.process_vis(agent_pos=(view_size // 2, view_size - 1))
+ else:
+ vis_mask = np.ones(shape=(grid.width, grid.height), dtype=np.bool)
+
+ # Make it so the agent sees what it's carrying
+ # We do this by placing the carried object at the agent's position
+ # in the agent's partially observable view
+ # agent_pos = grid.width // 2, grid.height - 1
+ # if self.carrying:
+ # grid.set(*agent_pos, self.carrying)
+ # else:
+ # grid.set(*agent_pos, None)
+
+ return grid, vis_mask
+
+ def get_pos_visible_coins(self):
+ """
+ Generate the npc's view (partially observable, low-resolution encoding)
+ return the list of unique visible coins
+ """
+
+ grid, vis_mask = self.gen_npc_obs_grid()
+
+ coins_pos = []
+
+ for obj in grid.grid:
+ if isinstance(obj, Ball):
+ coins_pos.append(obj.cur_pos)
+ if self.tag_visible_coins:
+ obj.tag()
+
+ return coins_pos
+
+ def can_overlap(self):
+ # If the NPC is hidden, agent can overlap on it
+ return self.hidden_npc
+
+
+class CoinThiefGrammar(object):
+
+ templates = ["Here is"]
+ things = ["0","1","2","3","4","5","6"]
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+ @classmethod
+ def random_utterance(cls):
+ return np.random.choice(cls.templates) + " " + np.random.choice(cls.things) + " "
+
+
+class ThiefActions(IntEnum):
+ # Turn left, turn right, move forward
+ left = 0
+ right = 1
+ forward = 2
+
+
+class CoinThiefEnv(MultiModalMiniGridEnv):
+ """
+ Environment in which a thief NPC stops the agent and demands
+ the coins it has spotted
+ """
+
+ def __init__(
+ self,
+ size=5,
+ hear_yourself=False,
+ diminished_reward=True,
+ step_penalty=False,
+ hidden_npc=False,
+ max_steps=20,
+ full_obs=False,
+ few_actions=False,
+ tag_visible_coins=False,
+ nb_coins=6,
+ npc_view_size=5,
+ npc_look_around=True
+
+ ):
+ assert size >= 5
+ self.empty_symbol = "NA \n"
+ self.hear_yourself = hear_yourself
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+ self.hidden_npc = hidden_npc
+ self.few_actions = few_actions
+ self.possible_actions = ThiefActions if self.few_actions else MiniGridEnv.Actions
+ self.nb_coins = nb_coins
+ self.tag_visible_coins = tag_visible_coins
+ self.npc_view_size = npc_view_size
+ self.npc_look_around = npc_look_around
+ if max_steps is None:
+ max_steps = 5*size**2
+
+ super().__init__(
+ grid_size=size,
+ max_steps=max_steps,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ full_obs=full_obs,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(self.possible_actions),
+ *CoinThiefGrammar.grammar_action_space.nvec
+ ]),
+ add_npc_direction=True
+ )
+
+ print({
+ "size": size,
+ "hear_yourself": hear_yourself,
+ "diminished_reward": diminished_reward,
+ "step_penalty": step_penalty,
+ })
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height, nb_obj_dims=4)
+
+ # Randomly vary the room width and height
+ # width = self._rand_int(5, width+1)
+ # height = self._rand_int(5, height+1)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Randomize the agent's start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Get possible near-agent positions, and place thief in one of them
+ ax, ay = self.agent_pos
+ near_agent_pos = [[ax, ay + 1], [ax, ay - 1], [ax - 1, ay], [ax + 1, ay]]
+ # get empty cells positions
+ available_pos = []
+ for p in near_agent_pos:
+ if self.grid.get(*p) is None:
+ available_pos.append(p)
+ thief_pos = self._rand_elem(available_pos)
+
+ # Add randomly placed coins
+ # Types and colors of objects we can generate
+ types = ['ball']
+ objs = []
+ objPos = []
+
+ # Until we have generated all the objects
+ while len(objs) < self.nb_coins:
+ objType = self._rand_elem(types)
+ objColor = 'yellow'
+
+ if objType == 'ball':
+ obj = Ball(objColor)
+ else:
+ raise NotImplementedError
+
+ pos = self.place_obj(obj, reject_fn=lambda env,pos: pos.tolist() == thief_pos)
+ objs.append((objType, objColor))
+ objPos.append(pos)
+
+ # Set a randomly coloured Thief NPC next to the agent
+ color = self._rand_elem(COLOR_NAMES)
+
+ self.thief = Thief(color, "Eve", self, thief_pos,
+ hidden_npc=self.hidden_npc,
+ tag_visible_coins=self.tag_visible_coins,
+ view_size=self.npc_view_size,
+ look_around=self.npc_look_around)
+
+ self.grid.set(*thief_pos, self.thief)
+
+ # Generate the mission string
+ self.mission = 'save as much coins as possible'
+
+ # Dummy beginning string
+ self.beginning_string = "This is what you hear. \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ # used for rendering
+ self.conversation = self.utterance
+ self.outcome_info = None
+
+ def step(self, action):
+ p_action = action[0] if np.isnan(action[0]) else int(action[0])
+ if len(action) == 1: # agent cannot speak
+ utterance_action = [np.nan, np.nan]
+ else:
+ utterance_action = action[1:]
+
+ obs, reward, done, info = super().step(p_action)
+
+ # assert all nan or neither nan
+ assert len(set(np.isnan(utterance_action))) == 1
+ speak_flag = not all(np.isnan(utterance_action))
+
+ if speak_flag:
+ utterance = CoinThiefGrammar.construct_utterance(utterance_action)
+ self.conversation += "{}: {} \n".format("Agent", utterance)
+
+ # Don't let the agent open any doors
+ if not self.few_actions and p_action == self.actions.toggle:
+ done = True
+
+ if not self.few_actions and p_action == self.actions.done:
+ done = True
+
+ # npc's turn
+ agent_disobeyed, agent_gave_coins, npc_utterance = self.thief.step(p_action, utterance if speak_flag else None)
+
+ if self.hidden_npc:
+ npc_utterance = None
+
+ if npc_utterance:
+ self.utterance += "{} \n".format(npc_utterance)
+ self.conversation += "{}: {} \n".format(self.thief.name, npc_utterance)
+
+ if agent_disobeyed:
+ done = True
+
+ if agent_gave_coins:
+ done = True
+ if self.thief.nb_seen_coins == self.thief.nb_given_coins:
+ reward = self._reward()
+ self.outcome_info = "SUCCESS: agent got {} reward \n".format(np.round(reward,1))
+
+ if done and reward == 0:
+ self.outcome_info = "FAILURE: agent got {} reward \n".format(reward)
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ if self.hidden_npc:
+ # remove npc from agent view
+ npc_obs_idx = np.argwhere(obs['image'] == 11)
+ if npc_obs_idx.size != 0: # agent sees npc
+ obs['image'][npc_obs_idx[0][0], npc_obs_idx[0][1], :] = [1, 0, 0, 0]
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+
+ print("conversation:\n", self.conversation)
+ print("utterance_history:\n", self.utterance_history)
+
+ self.window.clear_text() # erase previous text
+
+ self.window.set_caption(self.conversation) # overwrites super class caption
+ self.window.ax.set_title(self.thief.get_status_str(), loc="left")
+ if self.outcome_info:
+ color = None
+ if "SUCCESS" in self.outcome_info:
+ color = "lime"
+ elif "FAILURE" in self.outcome_info:
+ color = "red"
+ self.window.add_text(*(0.01, 0.85, self.outcome_info),
+ **{'fontsize':15, 'color':color, 'weight':"bold"})
+
+ self.window.show_img(obs) # re-draw image to add changes to window
+
+ return obs
+
+
+class CoinThief8x8Env(CoinThiefEnv):
+ def __init__(self, **kwargs):
+ super().__init__(size=8, **kwargs)
+
+
+class CoinThief6x6Env(CoinThiefEnv):
+ def __init__(self, **kwargs):
+ super().__init__(size=6, **kwargs)
+
+
+register(
+ id='MiniGrid-CoinThief-5x5-v0',
+ entry_point='gym_minigrid.envs:CoinThiefEnv'
+)
+
+register(
+ id='MiniGrid-CoinThief-6x6-v0',
+ entry_point='gym_minigrid.envs:CoinThief6x6Env'
+)
+
+register(
+ id='MiniGrid-CoinThief-8x8-v0',
+ entry_point='gym_minigrid.envs:CoinThief8x8Env'
+)
\ No newline at end of file
diff --git a/gym-minigrid/gym_minigrid/backup_envs/dancewithonenpc.py b/gym-minigrid/gym_minigrid/backup_envs/dancewithonenpc.py
new file mode 100644
index 0000000000000000000000000000000000000000..1a8cb44c2406bbcc0ac38a0b6697bb5170a5e5d1
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/dancewithonenpc.py
@@ -0,0 +1,344 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+import time
+from collections import deque
+
+
+class Dancer(NPC):
+ """
+ A dancing NPC that the agent has to copy
+ NPC executes a sequence of movement and utterances
+ """
+
+ def __init__(self, color, name, env, dancing_pattern=None,
+ dance_len=3, p_sing=.5, hidden_npc=False, sing_only=False):
+ super().__init__(color)
+ self.name = name
+ self.npc_dir = 1 # NPC initially looks downward
+ self.npc_type = 0
+ self.env = env
+ self.actions = self.env.possible_actions
+ self.p_sing = p_sing
+ self.sing_only = sing_only
+ if self.sing_only:
+ p_sing = 1
+ self.dancing_pattern = dancing_pattern if dancing_pattern else self._gen_dancing_pattern(dance_len, p_sing)
+ self.agent_actions = deque(maxlen=len(self.dancing_pattern))
+ self.movement_id_to_fun = {self.actions.left: self.rotate_left,
+ self.actions.right: self.rotate_right,
+ self.actions.forward: self.go_forward}
+ # for visualization only
+ self.movement_id_to_str = {self.actions.left: "left",
+ self.actions.right: "right",
+ self.actions.forward: "forward",
+ self.actions.pickup: "pickup",
+ self.actions.drop: "drop",
+ self.actions.toggle: "toggle",
+ self.actions.done: "done",
+ None: "None"}
+ self.dancing_step_idx = 0
+ self.done_dancing = False
+ self.add_npc_direction = True
+ self.nb_steps = 0
+ self.hidden_npc = hidden_npc
+
+ def step(self, agent_action, agent_utterance):
+ agent_matched_moves = False
+ utterance = None
+
+ if self.nb_steps == 0:
+ utterance = "Look at me!"
+ if self.nb_steps >= 2: # Wait a couple steps before dancing
+ if not self.done_dancing:
+ if self.dancing_step_idx == len(self.dancing_pattern):
+ self.done_dancing = True
+ utterance = "Now repeat my moves!"
+ else:
+ # NPC moves and speaks according to dance step
+ move_id, utterance = self.dancing_pattern[self.dancing_step_idx]
+ self.movement_id_to_fun[move_id]()
+
+ self.dancing_step_idx += 1
+ else: # record agent dancing pattern
+ self.agent_actions.append((agent_action, agent_utterance))
+
+ if not self.sing_only and list(self.agent_actions) == list(self.dancing_pattern):
+ agent_matched_moves = True
+ if self.sing_only: # only compare utterances
+ if [x[1] for x in self.agent_actions] == [x[1] for x in self.dancing_pattern]:
+ agent_matched_moves = True
+
+ self.nb_steps += 1
+ return agent_matched_moves, utterance
+
+ def get_status_str(self):
+ readable_dancing_pattern = [(self.movement_id_to_str[dp[0]], dp[1]) for dp in self.dancing_pattern]
+ readable_agent_actions = [(self.movement_id_to_str[aa[0]], aa[1]) for aa in self.agent_actions]
+ return "dance: {} \n agent: {}".format(readable_dancing_pattern, readable_agent_actions)
+
+ def _gen_dancing_pattern(self, dance_len, p_sing):
+ available_moves = [self.actions.left, self.actions.right, self.actions.forward]
+ dance_pattern = []
+ for _ in range(dance_len):
+ move = self.env._rand_elem(available_moves)
+ sing = None
+ if np.random.random() < p_sing:
+ sing = DanceWithOneNPCGrammar.random_utterance()
+ dance_pattern.append((move, sing))
+ return dance_pattern
+
+ def can_overlap(self):
+ # If the NPC is hidden, agent can overlap on it
+ return self.hidden_npc
+
+
+
+class DanceWithOneNPCGrammar(object):
+
+ templates = ["Move your", "Shake your"]
+ things = ["body", "head"]
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+ @classmethod
+ def random_utterance(cls):
+ return np.random.choice(cls.templates) + " " + np.random.choice(cls.things) + " "
+
+
+
+class DanceActions(IntEnum):
+ # Turn left, turn right, move forward
+ left = 0
+ right = 1
+ forward = 2
+
+
+class DanceWithOneNPCEnv(MultiModalMiniGridEnv):
+ """
+ Environment in which the agent has to watch an NPC dance
+ and then repeat its moves
+ """
+
+ def __init__(
+ self,
+ size=5,
+ hear_yourself=False,
+ diminished_reward=True,
+ step_penalty=False,
+ dance_len=3,
+ hidden_npc=False,
+ p_sing=.5,
+ max_steps=20,
+ full_obs=False,
+ few_actions=False,
+ sing_only=False
+
+ ):
+ assert size >= 5
+ self.empty_symbol = "NA \n"
+ self.hear_yourself = hear_yourself
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+ self.dance_len = dance_len
+ self.hidden_npc = hidden_npc
+ self.p_sing = p_sing
+ self.few_actions = few_actions
+ self.possible_actions = DanceActions if self.few_actions else MiniGridEnv.Actions
+ self.sing_only = sing_only
+ if max_steps is None:
+ max_steps = 5*size**2
+
+ super().__init__(
+ grid_size=size,
+ max_steps=max_steps,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ full_obs=full_obs,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(self.possible_actions),
+ *DanceWithOneNPCGrammar.grammar_action_space.nvec
+ ]),
+ add_npc_direction=True
+ )
+
+ print({
+ "size": size,
+ "hear_yourself": hear_yourself,
+ "diminished_reward": diminished_reward,
+ "step_penalty": step_penalty,
+ })
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height, nb_obj_dims=4)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+
+ # Set a randomly coloured Dancer NPC
+ color = self._rand_elem(COLOR_NAMES)
+ self.dancer = Dancer(color, "Ren", self, dance_len=self.dance_len,
+ p_sing=self.p_sing, hidden_npc=self.hidden_npc, sing_only=self.sing_only)
+
+ # Place it on the middle left side of the room
+ left_pos = (int((width / 2) - 1), int(height / 2))
+ #right_pos = [(width / 2) + 1, height / 2]
+
+ self.grid.set(*left_pos, self.dancer)
+ self.dancer.init_pos = left_pos
+ self.dancer.cur_pos = left_pos
+
+ # Place it randomly left or right
+ #self.place_obj(self.dancer,
+ # size=(width, height))
+
+ # Randomize the agent's start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Generate the mission string
+ self.mission = 'watch dancer and repeat his moves afterwards'
+
+ # Dummy beginning string
+ self.beginning_string = "This is what you hear. \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ # used for rendering
+ self.conversation = self.utterance
+ self.outcome_info = None
+
+ def step(self, action):
+ p_action = action[0] if np.isnan(action[0]) else int(action[0])
+ if len(action) == 1: # agent cannot speak
+ assert self.p_sing == 0, "Non speaking agent used in a dance env requiring to speak"
+ utterance_action = [np.nan, np.nan]
+ else:
+ utterance_action = action[1:]
+
+ obs, reward, done, info = super().step(p_action)
+
+ if np.isnan(p_action):
+ pass
+
+
+ # assert all nan or neither nan
+ assert len(set(np.isnan(utterance_action))) == 1
+ speak_flag = not all(np.isnan(utterance_action))
+
+ if speak_flag:
+ utterance = DanceWithOneNPCGrammar.construct_utterance(utterance_action)
+ self.conversation += "{}: {} \n".format("Agent", utterance)
+
+ # Don't let the agent open any of the doors
+ if not self.few_actions and p_action == self.actions.toggle:
+ done = True
+
+ if not self.few_actions and p_action == self.actions.done:
+ done = True
+
+ # npc's turn
+ agent_matched_moves, npc_utterance = self.dancer.step(p_action if not np.isnan(p_action) else None,
+ utterance if speak_flag else None)
+ if self.hidden_npc:
+ npc_utterance = None
+ if npc_utterance:
+ self.utterance += "{} \n".format(npc_utterance)
+ self.conversation += "{}: {} \n".format(self.dancer.name, npc_utterance)
+ if agent_matched_moves:
+ reward = self._reward()
+ self.outcome_info = "SUCCESS: agent got {} reward \n".format(np.round(reward, 1))
+ done = True
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ if self.hidden_npc:
+ # remove npc from agent view
+ npc_obs_idx = np.argwhere(obs['image'] == 11)
+ if npc_obs_idx.size != 0: # agent sees npc
+ obs['image'][npc_obs_idx[0][0], npc_obs_idx[0][1], :] = [1, 0, 0, 0]
+
+ if done and reward == 0:
+ self.outcome_info = "FAILURE: agent got {} reward \n".format(reward)
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ return obs, reward, done, info
+
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+
+ print("conversation:\n", self.conversation)
+ print("utterance_history:\n", self.utterance_history)
+
+ self.window.clear_text() # erase previous text
+
+ self.window.set_caption(self.conversation) # overwrites super class caption
+ self.window.ax.set_title(self.dancer.get_status_str(), loc="left", fontsize=10)
+ if self.outcome_info:
+ color = None
+ if "SUCCESS" in self.outcome_info:
+ color = "lime"
+ elif "FAILURE" in self.outcome_info:
+ color = "red"
+ self.window.add_text(*(0.01, 0.85, self.outcome_info),
+ **{'fontsize':15, 'color':color, 'weight':"bold"})
+
+ self.window.show_img(obs) # re-draw image to add changes to window
+
+ return obs
+
+
+
+
+class DanceWithOneNPC8x8Env(DanceWithOneNPCEnv):
+ def __init__(self, **kwargs):
+ super().__init__(size=8, **kwargs)
+
+class DanceWithOneNPC6x6Env(DanceWithOneNPCEnv):
+ def __init__(self, **kwargs):
+ super().__init__(size=6, **kwargs)
+
+
+
+register(
+ id='MiniGrid-DanceWithOneNPC-5x5-v0',
+ entry_point='gym_minigrid.envs:DanceWithOneNPCEnv'
+)
+
+register(
+ id='MiniGrid-DanceWithOneNPC-6x6-v0',
+ entry_point='gym_minigrid.envs:DanceWithOneNPC6x6Env'
+)
+
+register(
+ id='MiniGrid-DanceWithOneNPC-8x8-v0',
+ entry_point='gym_minigrid.envs:DanceWithOneNPC8x8Env'
+)
\ No newline at end of file
diff --git a/gym-minigrid/gym_minigrid/backup_envs/diverseexit.py b/gym-minigrid/gym_minigrid/backup_envs/diverseexit.py
new file mode 100644
index 0000000000000000000000000000000000000000..2634484f8a32e7492d57338ac7535195240c80ae
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/diverseexit.py
@@ -0,0 +1,584 @@
+import numpy as np
+
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+import time
+from collections import deque
+
+class TeacherPeer(NPC):
+ """
+ A teacher NPC that tells the agent which door to exit through, once properly addressed
+ """
+
+ def __init__(self, color, name, env, npc_type=0, knowledgeable=False, easier=False, idl=False):
+ super().__init__(color)
+ self.name = name
+ self.npc_dir = 1 # NPC initially looks downward
+ self.npc_type = npc_type
+ self.env = env
+ self.knowledgeable = knowledgeable
+ self.npc_actions = []
+ self.dancing_step_idx = 0
+ self.actions = MiniGridEnv.Actions
+ self.add_npc_direction = True
+ self.available_moves = [self.rotate_left, self.rotate_right, self.go_forward, self.toggle_action]
+ self.was_introduced_to = False
+ self.easier = easier
+ assert not self.easier
+ self.idl = idl
+
+ self.must_eye_contact = True if (self.npc_type // 3) % 2 == 0 else False
+ self.wanted_intro_utterances = [
+ EasyTeachingGamesGrammar.construct_utterance([2, 2]),
+ EasyTeachingGamesGrammar.construct_utterance([0, 1])
+ ]
+ self.wanted_intro_utterance = self.wanted_intro_utterances[0] if (self.npc_type // 3) // 2 == 0 else self.wanted_intro_utterances[1]
+ if self.npc_type % 3 == 0:
+ # must be far, must not poke
+ self.must_be_poked = False
+ self.must_be_close = False
+
+ elif self.npc_type % 3 == 1:
+ # must be close, must not poke
+ self.must_be_poked = False
+ self.must_be_close = True
+
+ elif self.npc_type % 3 == 2:
+ # must be close, must poke
+ self.must_be_poked = True
+ self.must_be_close = True
+
+ else:
+ raise ValueError("npc type {} unknown".format(self.npc_type))
+
+ # print("Peer type: ", self.npc_type)
+ # print("Peer conf: ", self.wanted_intro_utterance, self.must_eye_contact, self.must_be_close, self.must_be_poked)
+
+
+ if self.must_be_poked and not self.must_be_close:
+ raise ValueError("Must be poked means it must be close also.")
+
+ self.poked = False
+
+ self.exited = False
+ self.joint_attention_achieved = False
+
+ def toggle(self, env, pos):
+ """Method to trigger/toggle an action this object performs"""
+ self.poked = True
+ return True
+
+ def is_introduction_state_ok(self):
+ if (self.must_be_poked and self.introduction_state["poked"]) or (
+ not self.must_be_poked and not self.introduction_state["poked"]):
+ if (self.must_be_close and self.introduction_state["close"]) or (
+ not self.must_be_close and not self.introduction_state["close"]):
+ if (self.must_eye_contact and self.introduction_state["eye_contact"]) or (
+ not self.must_eye_contact and not self.introduction_state["eye_contact"]
+ ):
+ if self.introduction_state["intro_utterance"] == self.wanted_intro_utterance:
+ return True
+
+ return False
+
+ def can_overlap(self):
+ # If the NPC is hidden, agent can overlap on it
+ return self.env.hidden_npc
+
+ def encode(self, nb_dims=3):
+ if self.env.hidden_npc:
+ if nb_dims == 3:
+ return (1, 0, 0)
+ elif nb_dims == 4:
+ return (1, 0, 0, 0)
+ else:
+ return super().encode(nb_dims=nb_dims)
+
+ def step(self, agent_utterance):
+ super().step()
+
+ if self.knowledgeable:
+ if self.easier:
+ raise DeprecationWarning()
+ # wanted_dir = self.compute_wanted_dir(self.env.agent_pos)
+ # action = self.compute_turn_action(wanted_dir)
+ # action()
+ # if not self.was_introduced_to and (agent_utterance in self.wanted_intro_utterances):
+ # self.was_introduced_to = True
+ # self.introduction_state = {
+ # "poked": self.poked,
+ # "close": self.is_near_agent(),
+ # "eye_contact": self.is_eye_contact(),
+ # "correct_intro_utterance": agent_utterance == self.wanted_intro_utterance
+ # }
+ # if self.is_introduction_state_ok():
+ # utterance = "Go to the {} door \n".format(self.env.target_color)
+ # return utterance
+
+ else:
+ wanted_dir = self.compute_wanted_dir(self.env.agent_pos)
+ action = self.compute_turn_action(wanted_dir)
+ action()
+ if not self.was_introduced_to and (agent_utterance in self.wanted_intro_utterances):
+ self.was_introduced_to = True
+ self.introduction_state = {
+ "poked": self.poked,
+ "close": self.is_near_agent(),
+ "eye_contact": self.is_eye_contact(),
+ "intro_utterance": agent_utterance,
+ }
+ if not self.is_introduction_state_ok():
+ if self.idl:
+ if self.env.hidden_npc:
+ return None
+ else:
+ return "I don't like that \n"
+ else:
+ return None
+
+ if self.is_eye_contact() and self.was_introduced_to:
+
+ if self.is_introduction_state_ok():
+ utterance = "Go to the {} door \n".format(self.env.target_color)
+ if self.env.hidden_npc:
+ return None
+ else:
+ return utterance
+ else:
+ # no utterance
+ return None
+
+ else:
+ self.env._rand_elem(self.available_moves)()
+ return None
+
+
+ def render(self, img):
+ c = COLORS[self.color]
+
+ npc_shapes = []
+ # Draw eyes
+
+ if self.npc_type % 3 == 0:
+ npc_shapes.append(point_in_circle(cx=0.70, cy=0.50, r=0.10))
+ npc_shapes.append(point_in_circle(cx=0.30, cy=0.50, r=0.10))
+ # Draw mouth
+ npc_shapes.append(point_in_rect(0.20, 0.80, 0.72, 0.81))
+ # Draw top hat
+ npc_shapes.append(point_in_rect(0.30, 0.70, 0.05, 0.28))
+
+ elif self.npc_type % 3 == 1:
+ npc_shapes.append(point_in_circle(cx=0.70, cy=0.50, r=0.10))
+ npc_shapes.append(point_in_circle(cx=0.30, cy=0.50, r=0.10))
+ # Draw mouth
+ npc_shapes.append(point_in_rect(0.20, 0.80, 0.72, 0.81))
+ # Draw bottom hat
+ npc_shapes.append(point_in_triangle((0.15, 0.28),
+ (0.85, 0.28),
+ (0.50, 0.05)))
+ elif self.npc_type % 3 == 2:
+ npc_shapes.append(point_in_circle(cx=0.70, cy=0.50, r=0.10))
+ npc_shapes.append(point_in_circle(cx=0.30, cy=0.50, r=0.10))
+ # Draw mouth
+ npc_shapes.append(point_in_rect(0.20, 0.80, 0.72, 0.81))
+ # Draw bottom hat
+ npc_shapes.append(point_in_triangle((0.15, 0.28),
+ (0.85, 0.28),
+ (0.50, 0.05)))
+ # Draw top hat
+ npc_shapes.append(point_in_rect(0.30, 0.70, 0.05, 0.28))
+
+
+ # todo: move this to super function
+ # todo: super.render should be able to take the npc_shapes and then rotate them
+
+ if hasattr(self, "npc_dir"):
+ # Pre-rotation to ensure npc_dir = 1 means NPC looks downwards
+ npc_shapes = [rotate_fn(v, cx=0.5, cy=0.5, theta=-1 * (math.pi / 2)) for v in npc_shapes]
+ # Rotate npc based on its direction
+ npc_shapes = [rotate_fn(v, cx=0.5, cy=0.5, theta=(math.pi / 2) * self.npc_dir) for v in npc_shapes]
+
+ # Draw shapes
+ for v in npc_shapes:
+ fill_coords(img, v, c)
+
+# class EasyTeachingGamesSmallGrammar(object):
+#
+# templates = ["Where is", "Open", "What is"]
+# things = ["sesame", "the exit", "the password"]
+#
+# grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+#
+# @classmethod
+# def construct_utterance(cls, action):
+# if all(np.isnan(action)):
+# return ""
+# return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+
+class EasyTeachingGamesGrammar(object):
+
+ templates = ["Where is", "Open", "Which is", "How are"]
+ things = [
+ "sesame", "the exit", "the correct door", "you", "the ceiling", "the window", "the entrance", "the closet",
+ "the drawer", "the fridge", "the floor", "the lamp", "the trash can", "the chair", "the bed", "the sofa"
+ ]
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ if all(np.isnan(action)):
+ return ""
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+
+class EasyTeachingGamesEnv(MultiModalMiniGridEnv):
+ """
+ Environment in which the agent has to exit through the correct door,
+ which a teacher NPC reveals once properly addressed
+ """
+
+ def __init__(
+ self,
+ size=5,
+ diminished_reward=True,
+ step_penalty=False,
+ knowledgeable=False,
+ hard_password=False,
+ max_steps=50,
+ n_switches=3,
+ peer_type=None,
+ no_turn_off=False,
+ easier=False,
+ idl=False,
+ hidden_npc = False,
+ ):
+ assert size >= 5
+ self.empty_symbol = "NA \n"
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+ self.knowledgeable = knowledgeable
+ self.hard_password = hard_password
+ self.n_switches = n_switches
+ self.peer_type = peer_type
+ self.no_turn_off = no_turn_off
+ self.easier = easier
+ self.idl = idl
+ self.hidden_npc = hidden_npc
+
+ super().__init__(
+ grid_size=size,
+ max_steps=max_steps,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *EasyTeachingGamesGrammar.grammar_action_space.nvec
+ ]),
+ add_npc_direction=True
+ )
+
+ print({
+ "size": size,
+ "diminished_reward": diminished_reward,
+ "step_penalty": step_penalty,
+ })
+
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height, nb_obj_dims=4)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ self.wall_x = width - 1
+ self.wall_y = height - 1
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ self.door_pos = []
+ self.door_front_pos = [] # Remembers positions in front of door to avoid setting wizard here
+
+ self.door_pos.append((self._rand_int(2, width-2), 0))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1]+1))
+
+ self.door_pos.append((self._rand_int(2, width-2), height-1))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1] - 1))
+
+ self.door_pos.append((0, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] + 1, self.door_pos[-1][1]))
+
+ self.door_pos.append((width-1, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] - 1, self.door_pos[-1][1]))
+
+ # Generate the door colors
+ self.door_colors = []
+ while len(self.door_colors) < len(self.door_pos):
+ color = self._rand_elem(COLOR_NAMES)
+ if color in self.door_colors:
+ continue
+ self.door_colors.append(color)
+
+ # Place the doors in the grid
+ for idx, pos in enumerate(self.door_pos):
+ color = self.door_colors[idx]
+ self.grid.set(*pos, Door(color))
+
+ # Select a random target door
+ self.doorIdx = self._rand_int(0, len(self.door_pos))
+ self.target_pos = self.door_pos[self.doorIdx]
+ self.target_color = self.door_colors[self.doorIdx]
+
+        # Set a randomly coloured teacher NPC (peer)
+ color = self._rand_elem(COLOR_NAMES)
+
+ if self.peer_type is None:
+ self.current_peer_type = self._rand_int(0, 12)
+ else:
+ self.current_peer_type = self.peer_type
+
+ self.peer = TeacherPeer(
+ color,
+ ["Bobby", "Robby", "Toby"][self.current_peer_type % 3],
+ self,
+ knowledgeable=self.knowledgeable,
+ npc_type=self.current_peer_type,
+ easier=self.easier,
+ idl=self.idl
+ )
+
+        # sample y up to height - 2 so the peer is not placed in the way near the bottom wall
+ while True:
+ peer_pos = np.array((self._rand_int(1, width - 1), self._rand_int(1, height - 2)))
+
+ if (
+ # not in front of any door
+ not tuple(peer_pos) in self.door_front_pos
+ ) and (
+                # an NPC that must stay far (not must_be_close) cannot sit in the centre of a 5x5 room
+ not (not self.peer.must_be_close and (width == 5 and height == 5) and all(peer_pos == (2, 2)))
+ ):
+ break
+
+ self.grid.set(*peer_pos, self.peer)
+ self.peer.init_pos = peer_pos
+ self.peer.cur_pos = peer_pos
+
+ # Randomize the agent's start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Generate the mission string
+ self.mission = 'exit the room'
+
+ # Dummy beginning string
+ self.beginning_string = "This is what you hear. \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ # used for rendering
+ self.conversation = self.utterance
+ self.outcome_info = None
+
+
+ def step(self, action):
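+        # the MultiDiscrete action is [primitive action, utterance template index, utterance thing index]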
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ obs, reward, done, info = super().step(p_action)
+
+ if p_action == self.actions.done:
+ done = True
+
+ peer_utterance = EasyTeachingGamesGrammar.construct_utterance(utterance_action)
+ peer_reply = self.peer.step(peer_utterance)
+
+ if peer_reply is not None:
+ self.utterance += "{}: {} \n".format(self.peer.name, peer_reply)
+ self.conversation += "{}: {} \n".format(self.peer.name, peer_reply)
+
+ if all(self.agent_pos == self.target_pos):
+ done = True
+ reward = self._reward()
+
+ elif tuple(self.agent_pos) in self.door_pos:
+ done = True
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ if self.hidden_npc:
+            # the NPC must not appear in the image observation or in the utterance
+ assert np.argwhere(obs['image'][:,:,0] == OBJECT_TO_IDX['npc']).size == 0
+ assert "{}:".format(self.peer.name) not in self.utterance
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ if done:
+ if reward > 0:
+ self.outcome_info = "SUCCESS: agent got {} reward \n".format(np.round(reward, 1))
+ else:
+ self.outcome_info = "FAILURE: agent got {} reward \n".format(reward)
+
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+ self.window.clear_text() # erase previous text
+
+ self.window.set_caption(self.conversation, self.peer.name)
+
+ self.window.ax.set_title("correct door: {}".format(self.target_color), loc="left", fontsize=10)
+ if self.outcome_info:
+ color = None
+ if "SUCCESS" in self.outcome_info:
+ color = "lime"
+ elif "FAILURE" in self.outcome_info:
+ color = "red"
+ self.window.add_text(*(0.01, 0.85, self.outcome_info),
+ **{'fontsize':15, 'color':color, 'weight':"bold"})
+
+ self.window.show_img(obs) # re-draw image to add changes to window
+ return obs
+
+
+# # must be far, must not poke
+# class EasyTeachingGames8x8Env(EasyTeachingGamesEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=50, peer_type=0)
+#
+# # must be close, must not poke
+# class EasyTeachingGamesClose8x8Env(EasyTeachingGamesEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=50, peer_type=1)
+#
+# # must be close, must poke
+# class EasyTeachingGamesPoke8x8Env(EasyTeachingGamesEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=50, peer_type=2)
+#
+# # 100 multi
+# class EasyTeachingGamesMulti8x8Env(EasyTeachingGamesEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=50, peer_type=None)
+#
+#
+#
+# # speaking 50 steps
+# register(
+# id='MiniGrid-EasyTeachingGames-8x8-v0',
+# entry_point='gym_minigrid.envs:EasyTeachingGames8x8Env'
+# )
+#
+# # demonstrating 50 steps
+# register(
+# id='MiniGrid-EasyTeachingGamesPoke-8x8-v0',
+# entry_point='gym_minigrid.envs:EasyTeachingGamesPoke8x8Env'
+# )
+#
+# # demonstrating 50 steps
+# register(
+# id='MiniGrid-EasyTeachingGamesClose-8x8-v0',
+# entry_point='gym_minigrid.envs:EasyTeachingGamesClose8x8Env'
+# )
+#
+# # speaking 50 steps
+# register(
+# id='MiniGrid-EasyTeachingGamesMulti-8x8-v0',
+# entry_point='gym_minigrid.envs:EasyTeachingGamesMulti8x8Env'
+# )
+
+# # must be far, must not poke
+# class EasierTeachingGames8x8Env(EasyTeachingGamesEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=50, peer_type=0, easier=True)
+#
+# # must be close, must not poke
+# class EasierTeachingGamesClose8x8Env(EasyTeachingGamesEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=50, peer_type=1, easier=True)
+#
+# # must be close, must poke
+# class EasierTeachingGamesPoke8x8Env(EasyTeachingGamesEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=50, peer_type=2, easier=True)
+#
+# # 100 multi
+# class EasierTeachingGamesMulti8x8Env(EasyTeachingGamesEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=50, peer_type=None, easier=True)
+#
+# # Multi Many
+# class ManyTeachingGamesMulti8x8Env(EasyTeachingGamesEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=50, peer_type=None, easier=False, many=True)
+#
+# class ManyTeachingGamesMultiIDL8x8Env(EasyTeachingGamesEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=50, peer_type=None, easier=False, many=True, idl=True)
+
+
+# # speaking 50 steps
+# register(
+# id='MiniGrid-EasierTeachingGames-8x8-v0',
+# entry_point='gym_minigrid.envs:EasierTeachingGames8x8Env'
+# )
+#
+# # demonstrating 50 steps
+# register(
+# id='MiniGrid-EasierTeachingGamesPoke-8x8-v0',
+# entry_point='gym_minigrid.envs:EasierTeachingGamesPoke8x8Env'
+# )
+#
+# # demonstrating 50 steps
+# register(
+# id='MiniGrid-EasierTeachingGamesClose-8x8-v0',
+# entry_point='gym_minigrid.envs:EasierTeachingGamesClose8x8Env'
+# )
+#
+# # speaking 50 steps
+# register(
+# id='MiniGrid-EasierTeachingGamesMulti-8x8-v0',
+# entry_point='gym_minigrid.envs:EasierTeachingGamesMulti8x8Env'
+# )
+#
+# # speaking 50 steps
+# register(
+# id='MiniGrid-ManyTeachingGamesMulti-8x8-v0',
+# entry_point='gym_minigrid.envs:ManyTeachingGamesMulti8x8Env'
+# )
+#
+# # speaking 50 steps
+# register(
+# id='MiniGrid-ManyTeachingGamesMultiIDL-8x8-v0',
+# entry_point='gym_minigrid.envs:ManyTeachingGamesMultiIDL8x8Env'
+# )
+
+# Multi Many
+class DiverseExit8x8Env(EasyTeachingGamesEnv):
+ def __init__(self, **kwargs):
+ super().__init__(size=8, knowledgeable=True, max_steps=50, peer_type=None, easier=False, **kwargs)
+
+# speaking 50 steps
+register(
+ id='MiniGrid-DiverseExit-8x8-v0',
+ entry_point='gym_minigrid.envs:DiverseExit8x8Env'
+)
+
diff --git a/gym-minigrid/gym_minigrid/backup_envs/exiter.py b/gym-minigrid/gym_minigrid/backup_envs/exiter.py
new file mode 100644
index 0000000000000000000000000000000000000000..8ed9c5d8e3da7a87f50d9d612e161756fa973b82
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/exiter.py
@@ -0,0 +1,347 @@
+import numpy as np
+
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+import time
+from collections import deque
+
+
+class Peer(NPC):
+ """
+    A helper NPC that, after making eye contact with the agent, toggles the switch for the door the agent is facing
+    (or moves randomly in the ablation)
+ """
+
+ def __init__(self, color, name, env, random_actions=False):
+ super().__init__(color)
+ self.name = name
+ self.npc_dir = 1 # NPC initially looks downward
+ self.npc_type = 0
+ self.env = env
+ self.npc_actions = []
+ self.dancing_step_idx = 0
+ self.actions = MiniGridEnv.Actions
+ self.add_npc_direction = True
+ self.available_moves = [self.rotate_left, self.rotate_right, self.go_forward, self.toggle_action]
+ self.random_actions = random_actions
+ self.joint_attention_achieved = False
+
+ def can_overlap(self):
+ # If the NPC is hidden, agent can overlap on it
+ return self.env.hidden_npc
+
+ def encode(self, nb_dims=3):
+ if self.env.hidden_npc:
+ if nb_dims == 3:
+ return (1, 0, 0)
+ elif nb_dims == 4:
+ return (1, 0, 0, 0)
+ else:
+ return super().encode(nb_dims=nb_dims)
+
+ def step(self):
+ super().step()
+ if self.random_actions:
+ if type(self.env.grid.get(*self.front_pos)) == Lava:
+ # can't walk into lava
+ act = self.env._rand_elem([
+ m for m in self.available_moves if m != self.go_forward
+ ])
+ elif type(self.env.grid.get(*self.front_pos)) == Switch:
+ # can't toggle switches
+ act = self.env._rand_elem([
+ m for m in self.available_moves if m != self.toggle_action
+ ])
+ else:
+ act = self.env._rand_elem(self.available_moves)
+
+ act()
+
+ else:
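+            # scripted behaviour: head for the switch that corresponds to the door closest to the agent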
+ distances = np.abs(self.env.agent_pos - self.env.door_pos).sum(-1)
+
+ door_id = np.argmin(distances)
+ wanted_switch_pos = self.env.switches_pos[door_id]
+ sw = self.env.switches[door_id]
+
+ distance_to_switch = np.abs(wanted_switch_pos - self.cur_pos ).sum(-1)
+
+ # corresponding switch
+ if all(self.front_pos == wanted_switch_pos) and self.joint_attention_achieved:
+                # the agent is in front of the door and looking at it
+ if tuple(self.env.front_pos) == tuple(self.env.door_pos[door_id]):
+ if not sw.is_on:
+ self.toggle_action()
+
+ elif distance_to_switch == 1:
+ if not self.joint_attention_achieved:
+                    # look at the agent
+ wanted_dir = self.compute_wanted_dir(self.env.agent_pos)
+ else:
+ # turns to the switch
+ wanted_dir = self.compute_wanted_dir(wanted_switch_pos)
+
+ action = self.compute_turn_action(wanted_dir)
+ action()
+ if self.is_eye_contact():
+ self.joint_attention_achieved = True
+
+
+ else:
+ act = self.path_to_pos(wanted_switch_pos)
+ act()
+
+ # not really important as the NPC doesn't speak
+ if self.env.hidden_npc:
+ return None
+
+
+
+class ExiterGrammar(object):
+
+ templates = ["Move your", "Shake your"]
+ things = ["body", "head"]
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+
+class ExiterEnv(MultiModalMiniGridEnv):
+ """
+    Environment in which the agent must exit through one of two locked doors;
+    a peer NPC can unlock a door for the agent by toggling the matching switch
+ """
+
+ def __init__(
+ self,
+ size=5,
+ diminished_reward=True,
+ step_penalty=False,
+ knowledgeable=False,
+ ablation=False,
+ max_steps=20,
+ hidden_npc=False,
+ ):
+ assert size >= 5
+ self.empty_symbol = "NA \n"
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+ self.knowledgeable = knowledgeable
+ self.ablation = ablation
+ self.hidden_npc = hidden_npc
+
+ super().__init__(
+ grid_size=size,
+ max_steps=max_steps,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *ExiterGrammar.grammar_action_space.nvec
+ ]),
+ add_npc_direction=True
+ )
+
+ print({
+ "size": size,
+ "diminished_reward": diminished_reward,
+ "step_penalty": step_penalty,
+ })
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height, nb_obj_dims=4)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ self.wall_x = width-1
+ self.wall_y = height-1
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # add lava
+ self.grid.vert_wall(width//2, 1, height - 2, Lava)
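+        # the lava wall splits the room: the agent spawns on the door side (right),
+        # while the peer operates the unlocking switches on the left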
+
+ # door top
+ door_color_top = self._rand_elem(COLOR_NAMES)
+ self.door_pos_top = (width-1, 1)
+ self.door_top = Door(door_color_top, is_locked=False if self.ablation else True)
+ self.grid.set(*self.door_pos_top, self.door_top)
+
+ # switch top
+ self.switch_pos_top = (0, 1)
+ self.switch_top = Switch(door_color_top, lockable_object=self.door_top, locker_switch=True)
+ self.grid.set(*self.switch_pos_top, self.switch_top)
+
+ # door bottom
+ door_color_bottom = self._rand_elem(COLOR_NAMES)
+ self.door_pos_bottom = (width-1, height-2)
+ self.door_bottom = Door(door_color_bottom, is_locked=False if self.ablation else True)
+ self.grid.set(*self.door_pos_bottom, self.door_bottom)
+
+ # switch bottom
+ self.switch_pos_bottom = (0, height-2)
+ self.switch_bottom = Switch(door_color_bottom, lockable_object=self.door_bottom, locker_switch=True)
+ self.grid.set(*self.switch_pos_bottom, self.switch_bottom)
+
+ self.switches = [self.switch_top, self.switch_bottom]
+ self.switches_pos = [self.switch_pos_top, self.switch_pos_bottom]
+ self.door = [self.door_top, self.door_bottom]
+ self.door_pos = [self.door_pos_top, self.door_pos_bottom]
+
+        # Set a randomly coloured peer NPC
+ color = self._rand_elem(COLOR_NAMES)
+ self.peer = Peer(color, "Jill", self, random_actions=self.ablation)
+
+ # Place it on the middle right side of the room
+ peer_pos = np.array((self._rand_int(1, width//2), self._rand_int(1, height - 1)))
+
+ self.grid.set(*peer_pos, self.peer)
+ self.peer.init_pos = peer_pos
+ self.peer.cur_pos = peer_pos
+
+ # Randomize the agent's start position and orientation
+ agent = self.place_agent(top=(width // 2, 0), size=(width // 2, height))
+
+ # Generate the mission string
+ self.mission = 'watch dancer and repeat his moves afterwards'
+
+ # Dummy beginning string
+ self.beginning_string = "This is what you hear. \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ # used for rendering
+ self.conversation = self.utterance
+ self.outcome_info = None
+
+ def step(self, action):
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ obs, reward, done, info = super().step(p_action)
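+        # the peer acts on every environment step, independently of the agent's action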
+ self.peer.step()
+
+ if np.isnan(p_action):
+ pass
+
+ if p_action == self.actions.done:
+ done = True
+
+ elif all([self.switch_top.is_on, self.switch_bottom.is_on]):
+            # if both switches are on, no reward is given and the episode ends
+ done = True
+
+ elif tuple(self.agent_pos) in [self.door_pos_top, self.door_pos_bottom]:
+ # agent has exited
+ reward = self._reward()
+ done = True
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ if self.hidden_npc:
+            # the NPC must not appear in the image observation or in the utterance
+ assert np.argwhere(obs['image'][:,:,0] == OBJECT_TO_IDX['npc']).size == 0
+ assert "{}:".format(self.peer.name) not in self.utterance
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ if done:
+ if reward > 0:
+ self.outcome_info = "SUCCESS: agent got {} reward \n".format(np.round(reward, 1))
+ else:
+ self.outcome_info = "FAILURE: agent got {} reward \n".format(reward)
+
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+ self.window.clear_text() # erase previous text
+
+ # self.window.set_caption(self.conversation, [self.peer.name])
+ # self.window.ax.set_title("correct door: {}".format(self.true_guide.target_color), loc="left", fontsize=10)
+ if self.outcome_info:
+ color = None
+ if "SUCCESS" in self.outcome_info:
+ color = "lime"
+ elif "FAILURE" in self.outcome_info:
+ color = "red"
+ self.window.add_text(*(0.01, 0.85, self.outcome_info),
+ **{'fontsize':15, 'color':color, 'weight':"bold"})
+
+ self.window.show_img(obs) # re-draw image to add changes to window
+ return obs
+
+
+class Exiter8x8Env(ExiterEnv):
+ def __init__(self, **kwargs):
+ super().__init__(size=8, max_steps=20, **kwargs)
+
+
+class Exiter6x6Env(ExiterEnv):
+ def __init__(self):
+ super().__init__(size=6, max_steps=20)
+
+class AblationExiterEnv(ExiterEnv):
+ def __init__(self):
+ super().__init__(size=5, ablation=True, max_steps=20)
+
+class AblationExiter8x8Env(ExiterEnv):
+ def __init__(self, **kwargs):
+ super().__init__(size=8, ablation=True, max_steps=20, **kwargs)
+
+
+class AblationExiter6x6Env(ExiterEnv):
+ def __init__(self):
+ super().__init__(size=6, ablation=True, max_steps=20)
+
+
+
+register(
+ id='MiniGrid-Exiter-5x5-v0',
+ entry_point='gym_minigrid.envs:ExiterEnv'
+)
+
+register(
+ id='MiniGrid-Exiter-6x6-v0',
+ entry_point='gym_minigrid.envs:Exiter6x6Env'
+)
+
+register(
+ id='MiniGrid-Exiter-8x8-v0',
+ entry_point='gym_minigrid.envs:Exiter8x8Env'
+)
+register(
+ id='MiniGrid-AblationExiter-5x5-v0',
+ entry_point='gym_minigrid.envs:AblationExiterEnv'
+)
+
+register(
+ id='MiniGrid-AblationExiter-6x6-v0',
+ entry_point='gym_minigrid.envs:AblationExiter6x6Env'
+)
+
+register(
+ id='MiniGrid-AblationExiter-8x8-v0',
+ entry_point='gym_minigrid.envs:AblationExiter8x8Env'
+)
\ No newline at end of file
diff --git a/gym-minigrid/gym_minigrid/backup_envs/gotodoorpolite.py b/gym-minigrid/gym_minigrid/backup_envs/gotodoorpolite.py
new file mode 100644
index 0000000000000000000000000000000000000000..584242a3132f4d3d3b882b9b77e5ab5b4a07f3a5
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/gotodoorpolite.py
@@ -0,0 +1,292 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+
+class Guide(NPC):
+ """
+    A simple NPC that wants the agent to go to a door (randomly chosen among the environment's doors)
+ """
+
+ def __init__(self, color, name, env):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+ self.introduced = False
+
+ # Select a random target object as mission
+ obj_idx = self.env._rand_int(0, len(self.env.door_pos))
+ self.target_pos = self.env.door_pos[obj_idx]
+ self.target_color = self.env.door_colors[obj_idx]
+
+ def listen(self, utterance):
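+        # the guide only reveals the mission (when asked "Where is the exit") after being greeted with "How are you"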
+ if utterance == PoliteGrammar.construct_utterance([0, 2]):
+ self.introduced = True
+ return "I am good. Thank you."
+ elif utterance == PoliteGrammar.construct_utterance([1, 1]):
+ if self.introduced:
+ return self.env.mission
+
+ return None
+
+ # def is_near_agent(self):
+ # ax, ay = self.env.agent_pos
+ # wx, wy = self.cur_pos
+ # if (ax == wx and abs(ay - wy) == 1) or (ay == wy and abs(ax - wx) == 1):
+ # return True
+ # return False
+
+
+class PoliteGrammar(object):
+
+ templates = ["How are", "Where is", "Open"]
+ things = ["sesame", "the exit", 'you']
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+
+class GoToDoorPoliteEnv(MultiModalMiniGridEnv):
+ """
+ Environment in which the agent is instructed to go to a given object
+ named using an English text string
+ """
+
+ def __init__(
+ self,
+ size=5,
+ hear_yourself=False,
+ diminished_reward=True,
+ step_penalty=False,
+ max_steps=100,
+ ):
+ assert size >= 5
+
+ super().__init__(
+ grid_size=size,
+ max_steps=max_steps,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *PoliteGrammar.grammar_action_space.nvec
+ ])
+ )
+ self.hear_yourself = hear_yourself
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+
+ self.empty_symbol = "NA \n"
+
+ print({
+ "size": size,
+ "hear_yourself": hear_yourself,
+ "diminished_reward": diminished_reward,
+ "step_penalty": step_penalty,
+ })
+
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Generate the 4 doors at random positions
+ self.door_pos = []
+        self.door_front_pos = []  # remember the cells in front of each door so the wizard is not placed there
+
+ self.door_pos.append((self._rand_int(2, width-2), 0))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1]+1))
+
+ self.door_pos.append((self._rand_int(2, width-2), height-1))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1] - 1))
+
+ self.door_pos.append((0, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] + 1, self.door_pos[-1][1]))
+
+ self.door_pos.append((width-1, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] - 1, self.door_pos[-1][1]))
+
+ # Generate the door colors
+ self.door_colors = []
+ while len(self.door_colors) < len(self.door_pos):
+ color = self._rand_elem(COLOR_NAMES)
+ if color in self.door_colors:
+ continue
+ self.door_colors.append(color)
+
+ # Place the doors in the grid
+ for idx, pos in enumerate(self.door_pos):
+ color = self.door_colors[idx]
+ self.grid.set(*pos, Door(color))
+
+ # Set a randomly coloured NPC at a random position
+ color = self._rand_elem(COLOR_NAMES)
+ self.wizard = Guide(color, "Gandalf", self)
+
+ # Place it randomly, omitting front of door positions
+ self.place_obj(self.wizard,
+ size=(width, height),
+ reject_fn=lambda _, p: tuple(p) in self.door_front_pos)
+
+ # Randomize the agent start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Select a random target door
+ self.doorIdx = self._rand_int(0, len(self.door_pos))
+ self.target_pos = self.door_pos[self.doorIdx]
+ self.target_color = self.door_colors[self.doorIdx]
+
+ # Generate the mission string
+ self.mission = 'go to the %s door' % self.target_color
+
+ # Dummy beginning string
+ self.beginning_string = "This is what you hear. \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ def step(self, action):
+ p_action = action[0]
+ utterance_action = action[1:]
+
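+        # utterance components are either all NaN (the agent stays silent) or all set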
+ assert len(set(np.isnan(utterance_action))) == 1
+
+ speak_flag = not all(np.isnan(utterance_action))
+
+ obs, reward, done, info = super().step(p_action)
+
+ if speak_flag:
+ agent_utterance = PoliteGrammar.construct_utterance(utterance_action)
+ if self.hear_yourself:
+ self.utterance += "YOU: {} \n".format(agent_utterance)
+
+ # check if near wizard
+ if self.wizard.is_near_agent():
+ reply = self.wizard.listen(agent_utterance)
+
+ if reply:
+ self.utterance += "{}: {} \n".format(self.wizard.name, reply)
+
+ # Don't let the agent open any of the doors
+ if p_action == self.actions.toggle:
+ done = True
+
+ if p_action == self.actions.done:
+ ax, ay = self.agent_pos
+ tx, ty = self.target_pos
+
+ if (ax == tx and abs(ay - ty) == 1) or (ay == ty and abs(ax - tx) == 1):
+ reward = self._reward()
+ done = True
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+ self.window.set_caption(self.utterance_history, [
+ "Gandalf:",
+ "Jack:",
+ "John:",
+ "Where is the exit",
+ "Open sesame",
+ ])
+ return obs
+
+
+class GoToDoorPoliteTesting(GoToDoorPoliteEnv):
+ def __init__(self):
+ super().__init__(
+ size=5,
+ hear_yourself=False,
+ diminished_reward=False,
+ step_penalty=True,
+ max_steps=100
+ )
+
+class GoToDoorPolite8x8Env(GoToDoorPoliteEnv):
+ def __init__(self):
+ super().__init__(size=8, max_steps=100)
+
+
+class GoToDoorPolite6x6Env(GoToDoorPoliteEnv):
+ def __init__(self):
+ super().__init__(size=6, max_steps=100)
+
+
+# hear yourself
+class GoToDoorPoliteHY8x8Env(GoToDoorPoliteEnv):
+ def __init__(self):
+ super().__init__(size=8, hear_yourself=True, max_steps=100)
+
+
+class GoToDoorPoliteHY6x6Env(GoToDoorPoliteEnv):
+ def __init__(self):
+ super().__init__(size=6, hear_yourself=True, max_steps=100)
+
+
+class GoToDoorPoliteHY5x5Env(GoToDoorPoliteEnv):
+ def __init__(self):
+ super().__init__(size=5, hear_yourself=True, max_steps=100)
+
+register(
+ id='MiniGrid-GoToDoorPolite-Testing-v0',
+ entry_point='gym_minigrid.envs:GoToDoorPoliteTesting'
+)
+
+register(
+ id='MiniGrid-GoToDoorPolite-5x5-v0',
+ entry_point='gym_minigrid.envs:GoToDoorPoliteEnv'
+)
+
+register(
+ id='MiniGrid-GoToDoorPolite-6x6-v0',
+ entry_point='gym_minigrid.envs:GoToDoorPolite6x6Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorPolite-8x8-v0',
+ entry_point='gym_minigrid.envs:GoToDoorPolite8x8Env'
+)
+register(
+ id='MiniGrid-GoToDoorPoliteHY-5x5-v0',
+ entry_point='gym_minigrid.envs:GoToDoorPoliteHY5x5Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorPoliteHY-6x6-v0',
+ entry_point='gym_minigrid.envs:GoToDoorPoliteHY6x6Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorPoliteHY-8x8-v0',
+ entry_point='gym_minigrid.envs:GoToDoorPoliteHY8x8Env'
+)
diff --git a/gym-minigrid/gym_minigrid/backup_envs/gotodoorsesame.py b/gym-minigrid/gym_minigrid/backup_envs/gotodoorsesame.py
new file mode 100644
index 0000000000000000000000000000000000000000..9e6672c1853992d09a58db4c4c56dea7c57981e4
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/gotodoorsesame.py
@@ -0,0 +1,165 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+
+class SesameGrammar(object):
+
+ templates = ["Open", "Who is", "Where is"]
+ things = ["the exit", "sesame", "the chest", "him", "that"]
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + "."
+
+
+class GoToDoorSesameEnv(MultiModalMiniGridEnv):
+ """
+ Environment in which the agent is instructed to go to a given object
+ named using an English text string
+ """
+
+ def __init__(
+ self,
+ size=5
+ ):
+ assert size >= 5
+
+ super().__init__(
+ grid_size=size,
+ max_steps=5*size**2,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *SesameGrammar.grammar_action_space.nvec
+ ])
+ )
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+        # Generate a single door at a random position on the top wall
+ doorPos = (self._rand_int(2, width-2), 0)
+ doorColors = self._rand_elem(COLOR_NAMES)
+ self.grid.set(*doorPos, Door(doorColors))
+
+ # doorPos = []
+ # doorPos.append((self._rand_int(2, width-2), 0))
+ #
+ # # Generate the door colors
+ # doorColors = []
+ # while len(doorColors) < len(doorPos):
+ # color = self._rand_elem(COLOR_NAMES)
+ # if color in doorColors:
+ # continue
+ # doorColors.append(color)
+ #
+ # # Place the doors in the grid
+ # for idx, pos in enumerate(doorPos):
+ # color = doorColors[idx]
+ # self.grid.set(*pos, Door(color))
+
+ # Randomize the agent start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Select a random target door
+ # doorIdx = self._rand_int(0, len(doorPos))
+ # self.target_pos = doorPos[doorIdx]
+ # self.target_color = doorColors[doorIdx]
+ self.target_pos = doorPos
+ self.target_color = doorColors
+
+ # Generate the mission string
+ self.mission = 'go to the %s door' % self.target_color
+
+ # Initialize the dialogue string
+ self.dialogue = "This is what you hear. \n"
+
+ def gen_obs(self):
+ obs = super().gen_obs()
+
+ # add dialogue to obs
+ obs["dialogue"] = self.dialogue
+
+ return obs
+
+ def step(self, action):
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ assert len(set(np.isnan(utterance_action))) == 1
+
+ speak_flag = not all(np.isnan(utterance_action))
+
+ obs, reward, done, info = super().step(p_action)
+
+ ax, ay = self.agent_pos
+ tx, ty = self.target_pos
+
+ # Don't let the agent open any of the doors
+ if p_action == self.actions.toggle:
+ done = True
+
+        # saying the magic words ("Open sesame.") while next to the door yields the reward
+ if speak_flag:
+ utterance = SesameGrammar.construct_utterance(utterance_action)
+ self.dialogue += "YOU: " + utterance + "\n"
+
+ if utterance == SesameGrammar.construct_utterance([0, 1]):
+ if (ax == tx and abs(ay - ty) == 1) or (ay == ty and abs(ax - tx) == 1):
+ reward = self._reward()
+ done = True
+
+ # Reward performing done action in front of the target door
+ # if p_action == self.actions.done:
+ # if (ax == tx and abs(ay - ty) == 1) or (ay == ty and abs(ax - tx) == 1):
+ # reward = self._reward()
+ # done = True
+
+ return obs, reward, done, info
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+ self.window.set_caption(self.dialogue, [
+ "Gandalf:",
+ "Jack:",
+ "John:",
+ "Where is the exit",
+ "Open sesame",
+ ])
+ return obs
+
+
+class GoToDoorSesame8x8Env(GoToDoorSesameEnv):
+ def __init__(self):
+ super().__init__(size=8)
+
+class GoToDoorSesame6x6Env(GoToDoorSesameEnv):
+ def __init__(self):
+ super().__init__(size=6)
+
+register(
+ id='MiniGrid-GoToDoorSesame-5x5-v0',
+ entry_point='gym_minigrid.envs:GoToDoorSesameEnv'
+)
+
+register(
+ id='MiniGrid-GoToDoorSesame-6x6-v0',
+ entry_point='gym_minigrid.envs:GoToDoorSesame6x6Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorSesame-8x8-v0',
+ entry_point='gym_minigrid.envs:GoToDoorSesame8x8Env'
+)
diff --git a/gym-minigrid/gym_minigrid/backup_envs/gotodoortalk.py b/gym-minigrid/gym_minigrid/backup_envs/gotodoortalk.py
new file mode 100644
index 0000000000000000000000000000000000000000..dde8579591bdd5b3c27c33ecafa27810efac0873
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/gotodoortalk.py
@@ -0,0 +1,189 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+
+# these two classes should maybe be extracted to a utils file so they can be used all over our envs
+
+
+class GoToDoorTalkEnv(MultiModalMiniGridEnv):
+ """
+ Environment in which the agent is instructed to go to a given object
+ named using an English text string
+ """
+
+ def __init__(
+ self,
+ size=5,
+ hear_yourself=False,
+ ):
+ assert size >= 5
+
+ super().__init__(
+ grid_size=size,
+ max_steps=5*size**2,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *Grammar.grammar_action_space.nvec
+ ])
+ )
+ self.hear_yourself = hear_yourself
+
+ self.empty_symbol = "NA \n"
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Generate the 4 doors at random positions
+ doorPos = []
+ doorPos.append((self._rand_int(2, width-2), 0))
+ doorPos.append((self._rand_int(2, width-2), height-1))
+ doorPos.append((0, self._rand_int(2, height-2)))
+ doorPos.append((width-1, self._rand_int(2, height-2)))
+
+ # Generate the door colors
+ doorColors = []
+ while len(doorColors) < len(doorPos):
+ color = self._rand_elem(COLOR_NAMES)
+ if color in doorColors:
+ continue
+ doorColors.append(color)
+
+ # Place the doors in the grid
+ for idx, pos in enumerate(doorPos):
+ color = doorColors[idx]
+ self.grid.set(*pos, Door(color))
+
+ # Randomize the agent start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Select a random target door
+ doorIdx = self._rand_int(0, len(doorPos))
+ self.target_pos = doorPos[doorIdx]
+ self.target_color = doorColors[doorIdx]
+
+ # Generate the mission string
+ self.mission = 'go to the %s door' % self.target_color
+
+ # Dummy beginning string
+ self.beginning_string = "This is what you hear. \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ def step(self, action):
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ assert len(set(np.isnan(utterance_action))) == 1
+
+ speak_flag = not all(np.isnan(utterance_action))
+
+ if speak_flag:
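+            # any utterance makes the (disembodied) Wizard reply with the mission string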
+ agent_utterance = Grammar.construct_utterance(utterance_action)
+
+ reply = self.mission
+ NPC_name = "Wizard"
+
+ if self.hear_yourself:
+ self.utterance += "YOU: {} \n".format(agent_utterance)
+
+ self.utterance += "{}: {} \n".format(NPC_name, reply)
+
+ obs, reward, done, info = super().step(p_action)
+
+ # Don't let the agent open any of the doors
+ if p_action == self.actions.toggle:
+ done = True
+
+ # Reward performing done action in front of the target door
+ if p_action == self.actions.done:
+ ax, ay = self.agent_pos
+ tx, ty = self.target_pos
+
+ if (ax == tx and abs(ay - ty) == 1) or (ay == ty and abs(ax - tx) == 1):
+ reward = self._reward()
+ done = True
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ return obs, reward, done, info
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+ self.window.set_caption(self.utterance_history, [
+ "Gandalf:",
+ "Jack:",
+ "John:",
+ "Where is the exit",
+ "Open sesame",
+ ])
+ return obs
+
+
+class GoToDoorTalk8x8Env(GoToDoorTalkEnv):
+ def __init__(self):
+ super().__init__(size=8)
+
+class GoToDoorTalk6x6Env(GoToDoorTalkEnv):
+ def __init__(self):
+ super().__init__(size=6)
+
+# hear yourself
+class GoToDoorTalkHY8x8Env(GoToDoorTalkEnv):
+ def __init__(self):
+ super().__init__(size=8, hear_yourself=True)
+
+class GoToDoorTalkHY6x6Env(GoToDoorTalkEnv):
+ def __init__(self):
+ super().__init__(size=6, hear_yourself=True)
+
+class GoToDoorTalkHYEnv(GoToDoorTalkEnv):
+ def __init__(self):
+ super().__init__(size=5, hear_yourself=True)
+
+
+register(
+ id='MiniGrid-GoToDoorTalk-5x5-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkEnv'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalk-6x6-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalk6x6Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalk-8x8-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalk8x8Env'
+)
+
+# hear yourself
+register(
+ id='MiniGrid-GoToDoorTalkHY-5x5-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHYEnv'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHY-6x6-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHY6x6Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHY-8x8-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHY8x8Env'
+)
diff --git a/gym-minigrid/gym_minigrid/backup_envs/gotodoortalkhard.py b/gym-minigrid/gym_minigrid/backup_envs/gotodoortalkhard.py
new file mode 100644
index 0000000000000000000000000000000000000000..f3043646391777d9c5ef3786fc17eb96c9e5ba47
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/gotodoortalkhard.py
@@ -0,0 +1,199 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+
+
+class TalkHardGrammar(object):
+
+ templates = ["Where is", "What is"]
+ things = ["the exit", "the chair"]
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + "."
+
+
+class GoToDoorTalkHardEnv(MultiModalMiniGridEnv):
+ """
+ Environment in which the agent is instructed to go to a given object
+ named using an English text string
+ """
+
+ def __init__(
+ self,
+ size=5,
+ hear_yourself=False,
+ ):
+ assert size >= 5
+
+ super().__init__(
+ grid_size=size,
+ max_steps=5*size**2,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *TalkHardGrammar.grammar_action_space.nvec
+ ])
+ )
+ self.hear_yourself = hear_yourself
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Generate the 4 doors at random positions
+ doorPos = []
+ doorPos.append((self._rand_int(2, width-2), 0))
+ doorPos.append((self._rand_int(2, width-2), height-1))
+ doorPos.append((0, self._rand_int(2, height-2)))
+ doorPos.append((width-1, self._rand_int(2, height-2)))
+
+ # Generate the door colors
+ doorColors = []
+ while len(doorColors) < len(doorPos):
+ color = self._rand_elem(COLOR_NAMES)
+ if color in doorColors:
+ continue
+ doorColors.append(color)
+
+ # Place the doors in the grid
+ for idx, pos in enumerate(doorPos):
+ color = doorColors[idx]
+ self.grid.set(*pos, Door(color))
+
+ # Randomize the agent start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Select a random target door
+ doorIdx = self._rand_int(0, len(doorPos))
+ self.target_pos = doorPos[doorIdx]
+ self.target_color = doorColors[doorIdx]
+
+ # Generate the mission string
+ self.mission = 'go to the %s door' % self.target_color
+
+ # Initialize the dialogue string
+ self.dialogue = "This is what you hear. "
+
+ def gen_obs(self):
+ obs = super().gen_obs()
+
+ # add dialogue to obs
+ obs["dialogue"] = self.dialogue
+
+ return obs
+
+ def step(self, action):
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ # assert all nan or neither nan
+ assert len(set(np.isnan(utterance_action))) == 1
+
+ speak_flag = not all(np.isnan(utterance_action))
+
+ if speak_flag:
+ utterance = TalkHardGrammar.construct_utterance(utterance_action)
+
+ reply = self.mission
+ NPC_name = "Wizard"
+
+ if self.hear_yourself:
+ self.dialogue += "YOU: {} \n".format(utterance)
+
+ if utterance == TalkHardGrammar.construct_utterance([0, 0]):
+ self.dialogue += "{}: {} \n".format(NPC_name, reply) # dummy reply gives mission
+
+ obs, reward, done, info = super().step(p_action)
+
+ ax, ay = self.agent_pos
+ tx, ty = self.target_pos
+
+ # Don't let the agent open any of the doors
+ if p_action == self.actions.toggle:
+ done = True
+
+ # Reward performing done action in front of the target door
+ if p_action == self.actions.done:
+ if (ax == tx and abs(ay - ty) == 1) or (ay == ty and abs(ax - tx) == 1):
+ reward = self._reward()
+ done = True
+
+ return obs, reward, done, info
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+ self.window.set_caption(self.dialogue, [
+ "Gandalf:",
+ "Jack:",
+ "John:",
+ "Where is the exit",
+ "Open sesame",
+ ])
+ return obs
+
+
+class GoToDoorTalkHard8x8Env(GoToDoorTalkHardEnv):
+ def __init__(self):
+ super().__init__(size=8)
+
+
+class GoToDoorTalkHard6x6Env(GoToDoorTalkHardEnv):
+ def __init__(self):
+ super().__init__(size=6)
+
+
+# hear yourself
+class GoToDoorTalkHardHY8x8Env(GoToDoorTalkHardEnv):
+ def __init__(self):
+ super().__init__(size=8, hear_yourself=True)
+
+
+class GoToDoorTalkHardHY6x6Env(GoToDoorTalkHardEnv):
+ def __init__(self):
+ super().__init__(size=6, hear_yourself=True)
+
+
+class GoToDoorTalkHardHY5x5Env(GoToDoorTalkHardEnv):
+ def __init__(self):
+ super().__init__(size=5, hear_yourself=True)
+
+register(
+ id='MiniGrid-GoToDoorTalkHard-5x5-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardEnv'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHard-6x6-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHard6x6Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHard-8x8-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHard8x8Env'
+)
+register(
+ id='MiniGrid-GoToDoorTalkHardHY-5x5-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardHY5x5Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardHY-6x6-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardHY6x6Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardHY-8x8-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardHY8x8Env'
+)
diff --git a/gym-minigrid/gym_minigrid/backup_envs/gotodoortalkhardnpc.py b/gym-minigrid/gym_minigrid/backup_envs/gotodoortalkhardnpc.py
new file mode 100644
index 0000000000000000000000000000000000000000..ddf3ccfe3756f0abc03b0dc6742ee241cddfe964
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/gotodoortalkhardnpc.py
@@ -0,0 +1,283 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+
+class Guide(NPC):
+ """
+    A simple wizard NPC that tells the agent where to go when asked where the exit is
+ """
+
+ def __init__(self, color, name, env):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+ self.has_spoken = False # wizards only speak once
+ self.npc_type = 0
+
+ def listen(self, utterance):
+ if utterance == TalkHardSesameGrammar.construct_utterance([0, 1]):
+ return self.env.mission
+
+ return None
+
+ # def is_near_agent(self):
+ # ax, ay = self.env.agent_pos
+ # wx, wy = self.cur_pos
+ # if (ax == wx and abs(ay - wy) == 1) or (ay == wy and abs(ax - wx) == 1):
+ # return True
+ # return False
+
+
+class TalkHardSesameGrammar(object):
+
+ templates = ["Where is", "Open"]
+ things = ["sesame", "the exit"]
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+
+class GoToDoorTalkHardNPCEnv(MultiModalMiniGridEnv):
+ """
+ Environment in which the agent is instructed to go to a given object
+ named using an English text string
+ """
+
+ def __init__(
+ self,
+ size=5,
+ hear_yourself=False,
+ diminished_reward=True,
+ step_penalty=False
+ ):
+ assert size >= 5
+
+ super().__init__(
+ grid_size=size,
+ max_steps=5*size**2,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *TalkHardSesameGrammar.grammar_action_space.nvec
+ ])
+ )
+ self.hear_yourself = hear_yourself
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+
+ self.empty_symbol = "NA \n"
+
+ print({
+ "size": size,
+ "hear_yourself": hear_yourself,
+ "diminished_reward": diminished_reward,
+ "step_penalty": step_penalty,
+ })
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Generate the 4 doors at random positions
+ self.door_pos = []
+        self.door_front_pos = []  # remember the cells in front of each door so the wizard is not placed there
+
+ self.door_pos.append((self._rand_int(2, width-2), 0))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1]+1))
+
+ self.door_pos.append((self._rand_int(2, width-2), height-1))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1] - 1))
+
+ self.door_pos.append((0, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] + 1, self.door_pos[-1][1]))
+
+ self.door_pos.append((width-1, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] - 1, self.door_pos[-1][1]))
+
+ # Generate the door colors
+ self.door_colors = []
+ while len(self.door_colors) < len(self.door_pos):
+ color = self._rand_elem(COLOR_NAMES)
+ if color in self.door_colors:
+ continue
+ self.door_colors.append(color)
+
+ # Place the doors in the grid
+ for idx, pos in enumerate(self.door_pos):
+ color = self.door_colors[idx]
+ self.grid.set(*pos, Door(color))
+
+ # Set a randomly coloured NPC at a random position
+ color = self._rand_elem(COLOR_NAMES)
+ self.wizard = Guide(color, "Gandalf", self)
+
+ # Place it randomly, omitting front of door positions
+ self.place_obj(self.wizard,
+ size=(width, height),
+ reject_fn=lambda _, p: tuple(p) in self.door_front_pos)
+
+ # Randomize the agent start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Select a random target door
+ self.doorIdx = self._rand_int(0, len(self.door_pos))
+ self.target_pos = self.door_pos[self.doorIdx]
+ self.target_color = self.door_colors[self.doorIdx]
+
+ # Generate the mission string
+ self.mission = 'go to the %s door' % self.target_color
+
+ # Dummy beginning string
+ self.beginning_string = "This is what you hear. \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ def step(self, action):
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ # assert all nan or neither nan
+ assert len(set(np.isnan(utterance_action))) == 1
+
+ speak_flag = not all(np.isnan(utterance_action))
+
+
+ obs, reward, done, info = super().step(p_action)
+
+ if speak_flag:
+ agent_utterance = TalkHardSesameGrammar.construct_utterance(utterance_action)
+ if self.hear_yourself:
+ self.utterance += "YOU: {} \n".format(agent_utterance)
+
+ # check if near wizard
+ if self.wizard.is_near_agent():
+ reply = self.wizard.listen(agent_utterance)
+
+ if reply:
+ self.utterance += "{}: {} \n".format(self.wizard.name, reply)
+
+ # Don't let the agent open any of the doors
+ if p_action == self.actions.toggle:
+ done = True
+
+ if p_action == self.actions.done:
+ ax, ay = self.agent_pos
+ tx, ty = self.target_pos
+
+ if (ax == tx and abs(ay - ty) == 1) or (ay == ty and abs(ax - tx) == 1):
+ reward = self._reward()
+ done = True
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+ self.window.set_caption(self.utterance_history, [
+ "Gandalf:",
+ "Jack:",
+ "John:",
+ "Where is the exit",
+ "Open sesame",
+ ])
+ return obs
+
+
+class GoToDoorTalkHardNPCTesting(GoToDoorTalkHardNPCEnv):
+ def __init__(self):
+ super().__init__(
+ size=5,
+ hear_yourself=False,
+ diminished_reward=False,
+ step_penalty=True
+ )
+
+class GoToDoorTalkHardNPC8x8Env(GoToDoorTalkHardNPCEnv):
+ def __init__(self):
+ super().__init__(size=8)
+
+
+class GoToDoorTalkHardNPC6x6Env(GoToDoorTalkHardNPCEnv):
+ def __init__(self):
+ super().__init__(size=6)
+
+
+# hear yourself
+class GoToDoorTalkHardNPCHY8x8Env(GoToDoorTalkHardNPCEnv):
+ def __init__(self):
+ super().__init__(size=8, hear_yourself=True)
+
+
+class GoToDoorTalkHardNPCHY6x6Env(GoToDoorTalkHardNPCEnv):
+ def __init__(self):
+ super().__init__(size=6, hear_yourself=True)
+
+
+class GoToDoorTalkHardNPCHY5x5Env(GoToDoorTalkHardNPCEnv):
+ def __init__(self):
+ super().__init__(size=5, hear_yourself=True)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardNPC-Testing-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardNPCTesting'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardNPC-5x5-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardNPCEnv'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardNPC-6x6-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardNPC6x6Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardNPC-8x8-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardNPC8x8Env'
+)
+register(
+ id='MiniGrid-GoToDoorTalkHardNPCHY-5x5-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardNPCHY5x5Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardNPCHY-6x6-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardNPCHY6x6Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardNPCHY-8x8-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardNPCHY8x8Env'
+)
diff --git a/gym-minigrid/gym_minigrid/backup_envs/gotodoortalkhardsesame.py b/gym-minigrid/gym_minigrid/backup_envs/gotodoortalkhardsesame.py
new file mode 100644
index 0000000000000000000000000000000000000000..862e95fb2edbc1a01cfc4e1d692b29d88dd2eeb4
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/gotodoortalkhardsesame.py
@@ -0,0 +1,204 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+
+
+class TalkHardSesameGrammar(object):
+
+ templates = ["Where is", "Open"]
+ things = ["sesame", "the exit"]
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+
+class GoToDoorTalkHardSesameEnv(MultiModalMiniGridEnv):
+ """
+ Environment in which the agent is instructed to go to a given object
+ named using an English text string
+ """
+
+ def __init__(
+ self,
+ size=5,
+ hear_yourself=False,
+ ):
+ assert size >= 5
+
+ super().__init__(
+ grid_size=size,
+ max_steps=5*size**2,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *TalkHardSesameGrammar.grammar_action_space.nvec
+ ])
+ )
+ self.hear_yourself = hear_yourself
+
+ self.empty_symbol = "NA \n"
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Generate the 4 doors at random positions
+ self.doorPos = []
+ self.doorPos.append((self._rand_int(2, width-2), 0))
+ self.doorPos.append((self._rand_int(2, width-2), height-1))
+ self.doorPos.append((0, self._rand_int(2, height-2)))
+ self.doorPos.append((width-1, self._rand_int(2, height-2)))
+
+ # Generate the door colors
+ doorColors = []
+ while len(doorColors) < len(self.doorPos):
+ color = self._rand_elem(COLOR_NAMES)
+ if color in doorColors:
+ continue
+ doorColors.append(color)
+
+ # Place the doors in the grid
+ for idx, pos in enumerate(self.doorPos):
+ color = doorColors[idx]
+ self.grid.set(*pos, Door(color))
+
+ # Randomize the agent start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Select a random target door
+ doorIdx = self._rand_int(0, len(self.doorPos))
+ self.target_pos = self.doorPos[doorIdx]
+ self.target_color = doorColors[doorIdx]
+
+ # Generate the mission string
+ self.mission = 'go to the %s door' % self.target_color
+
+ # Dummy beginning string
+ self.beginning_string = "This is what you hear. \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ def step(self, action):
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ # assert all nan or neither nan
+ assert len(set(np.isnan(utterance_action))) == 1
+
+ speak_flag = not all(np.isnan(utterance_action))
+
+ obs, reward, done, info = super().step(p_action)
+
+ if speak_flag:
+ utterance = TalkHardSesameGrammar.construct_utterance(utterance_action)
+
+ if self.hear_yourself:
+ self.utterance += "YOU: {} \n".format(utterance)
+
+ if utterance == TalkHardSesameGrammar.construct_utterance([0, 1]):
+ reply = self.mission
+ NPC_name = "Wizard"
+ self.utterance += "{}: {} \n".format(NPC_name, reply) # dummy reply gives mission
+
+ elif utterance == TalkHardSesameGrammar.construct_utterance([1, 0]):
+ ax, ay = self.agent_pos
+ tx, ty = self.target_pos
+
+ if (ax == tx and abs(ay - ty) == 1) or (ay == ty and abs(ax - tx) == 1):
+ reward = self._reward()
+
+ for dx, dy in self.doorPos:
+ if (ax == dx and abs(ay - dy) == 1) or (ay == dy and abs(ax - dx) == 1):
+                    # the agent has committed to a door; the episode ends whether or not it is the correct one
+ done = True
+
+ # Don't let the agent open any of the doors
+ if p_action == self.actions.toggle:
+ done = True
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ return obs, reward, done, info
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+        self.window.set_caption(self.utterance_history, [
+ "Gandalf:",
+ "Jack:",
+ "John:",
+ "Where is the exit",
+ "Open sesame",
+ ])
+ return obs
+
+
+class GoToDoorTalkHardSesame8x8Env(GoToDoorTalkHardSesameEnv):
+ def __init__(self):
+ super().__init__(size=8)
+
+
+class GoToDoorTalkHardSesame6x6Env(GoToDoorTalkHardSesameEnv):
+ def __init__(self):
+ super().__init__(size=6)
+
+
+# hear yourself
+class GoToDoorTalkHardSesameHY8x8Env(GoToDoorTalkHardSesameEnv):
+ def __init__(self):
+ super().__init__(size=8, hear_yourself=True)
+
+
+class GoToDoorTalkHardSesameHY6x6Env(GoToDoorTalkHardSesameEnv):
+ def __init__(self):
+ super().__init__(size=6, hear_yourself=True)
+
+
+class GoToDoorTalkHardSesameHY5x5Env(GoToDoorTalkHardSesameEnv):
+ def __init__(self):
+ super().__init__(size=5, hear_yourself=True)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardSesame-5x5-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardSesameEnv'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardSesame-6x6-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardSesame6x6Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardSesame-8x8-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardSesame8x8Env'
+)
+register(
+ id='MiniGrid-GoToDoorTalkHardSesameHY-5x5-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardSesameHY5x5Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardSesameHY-6x6-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardSesameHY6x6Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardSesameHY-8x8-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardSesameHY8x8Env'
+)
diff --git a/gym-minigrid/gym_minigrid/backup_envs/gotodoortalkhardsesamnpc.py b/gym-minigrid/gym_minigrid/backup_envs/gotodoortalkhardsesamnpc.py
new file mode 100644
index 0000000000000000000000000000000000000000..bf8d6b0cbc74b5a48a01291c6162c1656c6640c3
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/gotodoortalkhardsesamnpc.py
@@ -0,0 +1,294 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+
+class Guide(NPC):
+ """
+    A simple wizard NPC that reveals the mission when asked where the exit is
+ """
+
+ def __init__(self, color, name, env):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+ self.npc_type = 0
+
+ def listen(self, utterance):
+ if utterance == TalkHardSesameGrammar.construct_utterance([0, 1]):
+ return self.env.mission
+
+ return None
+
+ def is_near_agent(self):
+ ax, ay = self.env.agent_pos
+ wx, wy = self.cur_pos
+ if (ax == wx and abs(ay - wy) == 1) or (ay == wy and abs(ax - wx) == 1):
+ return True
+ return False
+
+
+class TalkHardSesameGrammar(object):
+
+ templates = ["Where is", "Open"]
+ things = ["sesame", "the exit"]
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+
+class GoToDoorTalkHardSesameNPCEnv(MultiModalMiniGridEnv):
+ """
+ Environment in which the agent is instructed to go to a given object
+ named using an English text string
+ """
+
+ def __init__(
+ self,
+ size=5,
+ hear_yourself=False,
+ diminished_reward=True,
+ step_penalty=False
+ ):
+ assert size >= 5
+
+ super().__init__(
+ grid_size=size,
+ max_steps=5*size**2,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *TalkHardSesameGrammar.grammar_action_space.nvec
+ ])
+ )
+ self.hear_yourself = hear_yourself
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+
+ self.empty_symbol = "NA \n"
+
+ print({
+ "size": size,
+ "hear_yourself": hear_yourself,
+ "diminished_reward": diminished_reward,
+ "step_penalty": step_penalty,
+ })
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Generate the 4 doors at random positions
+ self.door_pos = []
+        self.door_front_pos = []  # remember the cells in front of each door so the wizard is not placed there
+
+ self.door_pos.append((self._rand_int(2, width-2), 0))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1]+1))
+
+ self.door_pos.append((self._rand_int(2, width-2), height-1))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1] - 1))
+
+ self.door_pos.append((0, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] + 1, self.door_pos[-1][1]))
+
+ self.door_pos.append((width-1, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] - 1, self.door_pos[-1][1]))
+
+ # Generate the door colors
+ self.door_colors = []
+ while len(self.door_colors) < len(self.door_pos):
+ color = self._rand_elem(COLOR_NAMES)
+ if color in self.door_colors:
+ continue
+ self.door_colors.append(color)
+
+ # Place the doors in the grid
+ for idx, pos in enumerate(self.door_pos):
+ color = self.door_colors[idx]
+ self.grid.set(*pos, Door(color))
+
+ # Set a randomly coloured NPC at a random position
+ color = self._rand_elem(COLOR_NAMES)
+ self.wizard = Guide(color, "Gandalf", self)
+
+ # Place it randomly, omitting front of door positions
+ self.place_obj(self.wizard,
+ size=(width, height),
+ reject_fn=lambda _, p: tuple(p) in self.door_front_pos)
+
+ # Randomize the agent start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Select a random target door
+ self.doorIdx = self._rand_int(0, len(self.door_pos))
+ self.target_pos = self.door_pos[self.doorIdx]
+ self.target_color = self.door_colors[self.doorIdx]
+
+ # Generate the mission string
+ self.mission = 'go to the %s door' % self.target_color
+
+ # Dummy beginning string
+ self.beginning_string = "This is what you hear. \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ self.conversation = self.utterance
+
+ def step(self, action):
+ p_action = action[0]
+ utterance_action = action[1:]
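+        # action = [primitive_action, template_idx, thing_idx]; the utterance part is all
+        # NaN when the agent does not speak on this step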
+
+ # assert all nan or neither nan
+ assert len(set(np.isnan(utterance_action))) == 1
+
+ speak_flag = not all(np.isnan(utterance_action))
+
+ obs, reward, done, info = super().step(p_action)
+
+ if speak_flag:
+ utterance = TalkHardSesameGrammar.construct_utterance(utterance_action)
+ if self.hear_yourself:
+ self.utterance += "YOU: {} \n".format(utterance)
+
+ self.conversation += "YOU: {} \n".format(utterance)
+
+ # check if near wizard
+ if self.wizard.is_near_agent():
+ reply = self.wizard.listen(utterance)
+
+ if reply:
+ self.utterance += "{}: {} \n".format(self.wizard.name, reply)
+ self.conversation += "{}: {} \n".format(self.wizard.name, reply)
+
+ if utterance == TalkHardSesameGrammar.construct_utterance([1, 0]):
+ ax, ay = self.agent_pos
+ tx, ty = self.target_pos
+
+ if (ax == tx and abs(ay - ty) == 1) or (ay == ty and abs(ax - tx) == 1):
+ reward = self._reward()
+
+ for dx, dy in self.door_pos:
+ if (ax == dx and abs(ay - dy) == 1) or (ay == dy and abs(ax - dx) == 1):
+                        # the agent has chosen a door; the episode ends whether or not it is the correct one
+ done = True
+
+ # Don't let the agent open any of the doors
+ if p_action == self.actions.toggle:
+ done = True
+
+ if p_action == self.actions.done:
+ done = True
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+        # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+ self.window.set_caption(self.conversation, [
+ "Gandalf:",
+ "Jack:",
+ "John:",
+ "Where is the exit",
+ "Open sesame",
+ ])
+ return obs
+
+
+class GoToDoorTalkHardSesameNPCTesting(GoToDoorTalkHardSesameNPCEnv):
+ def __init__(self):
+ super().__init__(
+ size=5,
+ hear_yourself=False,
+ diminished_reward=False,
+ step_penalty=True
+ )
+
+class GoToDoorTalkHardSesameNPC8x8Env(GoToDoorTalkHardSesameNPCEnv):
+ def __init__(self):
+ super().__init__(size=8)
+
+
+class GoToDoorTalkHardSesameNPC6x6Env(GoToDoorTalkHardSesameNPCEnv):
+ def __init__(self):
+ super().__init__(size=6)
+
+
+# hear yourself
+class GoToDoorTalkHardSesameNPCHY8x8Env(GoToDoorTalkHardSesameNPCEnv):
+ def __init__(self):
+ super().__init__(size=8, hear_yourself=True)
+
+
+class GoToDoorTalkHardSesameNPCHY6x6Env(GoToDoorTalkHardSesameNPCEnv):
+ def __init__(self):
+ super().__init__(size=6, hear_yourself=True)
+
+
+class GoToDoorTalkHardSesameNPCHY5x5Env(GoToDoorTalkHardSesameNPCEnv):
+ def __init__(self):
+ super().__init__(size=5, hear_yourself=True)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardSesameNPC-Testing-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardSesameNPCTesting'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardSesameNPC-5x5-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardSesameNPCEnv'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardSesameNPC-6x6-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardSesameNPC6x6Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardSesameNPC-8x8-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardSesameNPC8x8Env'
+)
+register(
+ id='MiniGrid-GoToDoorTalkHardSesameNPCHY-5x5-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardSesameNPCHY5x5Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardSesameNPCHY-6x6-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardSesameNPCHY6x6Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardSesameNPCHY-8x8-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardSesameNPCHY8x8Env'
+)
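+
+# Example use (assuming `gym` is installed and this module has been imported so the
+# register() calls above have run):
+#   import gym
+#   env = gym.make('MiniGrid-GoToDoorTalkHardSesameNPC-8x8-v0')
+#   obs = env.reset()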
diff --git a/gym-minigrid/gym_minigrid/backup_envs/gotodoortalkhardsesamnpcguides.py b/gym-minigrid/gym_minigrid/backup_envs/gotodoortalkhardsesamnpcguides.py
new file mode 100644
index 0000000000000000000000000000000000000000..f95f1063849da87d6ebdd6dee32a0b12c44cf439
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/gotodoortalkhardsesamnpcguides.py
@@ -0,0 +1,384 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+
+class Wizard(NPC):
+ """
+ A simple NPC that knows who is telling the truth
+ """
+
+ def __init__(self, color, name, env):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+ self.npc_type = 0 # this will be put into the encoding
+
+ def listen(self, utterance):
+ if utterance == TalkHardSesameNPCGuidesGrammar.construct_utterance([0, 1]):
+ return "Ask {}.".format(self.env.true_guide.name)
+
+ return None
+
+ def is_near_agent(self):
+ ax, ay = self.env.agent_pos
+ wx, wy = self.cur_pos
+ if (ax == wx and abs(ay - wy) == 1) or (ay == wy and abs(ax - wx) == 1):
+ return True
+ return False
+
+
+class Guide(NPC):
+ """
+    A simple NPC that, when asked, tells the agent which door to go to, truthfully or deceptively depending on `liar`.
+ """
+
+ def __init__(self, color, name, env, liar=False):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+ self.liar = liar
+ self.npc_type = 1 # this will be put into the encoding
+
+ # Select a random target object as mission
+ obj_idx = self.env._rand_int(0, len(self.env.door_pos))
+ self.target_pos = self.env.door_pos[obj_idx]
+ self.target_color = self.env.door_colors[obj_idx]
+
+ def listen(self, utterance):
+ if utterance == TalkHardSesameNPCGuidesGrammar.construct_utterance([0, 1]):
+ if self.liar:
+ fake_colors = [c for c in self.env.door_colors if c != self.env.target_color]
+ fake_color = self.env._rand_elem(fake_colors)
+
+                # Construct a deceptive instruction pointing to a wrong door
+ assert fake_color != self.env.target_color
+ return 'go to the %s door' % fake_color
+
+ else:
+ return self.env.mission
+
+ return None
+
+ def render(self, img):
+ c = COLORS[self.color]
+
+ # Draw eyes
+ fill_coords(img, point_in_circle(cx=0.70, cy=0.50, r=0.10), c)
+ fill_coords(img, point_in_circle(cx=0.30, cy=0.50, r=0.10), c)
+
+ # Draw mouth
+ fill_coords(img, point_in_rect(0.20, 0.80, 0.72, 0.81), c)
+
+ # #Draw hat
+ # tri_fn = point_in_triangle(
+ # (0.15, 0.25),
+ # (0.85, 0.25),
+ # (0.50, 0.05),
+ # )
+ # fill_coords(img, tri_fn, c)
+
+ def is_near_agent(self):
+ ax, ay = self.env.agent_pos
+ wx, wy = self.cur_pos
+ if (ax == wx and abs(ay - wy) == 1) or (ay == wy and abs(ax - wx) == 1):
+ return True
+ return False
+
+
+class TalkHardSesameNPCGuidesGrammar(object):
+
+ templates = ["Where is", "Open"]
+ things = ["sesame", "the exit"]
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+
+class GoToDoorTalkHardSesameNPCGuidesEnv(MultiModalMiniGridEnv):
+ """
+    Environment in which the agent has to go to the target door; a wizard points to the
+    truthful guide, and the two guides (one truthful, one lying) answer the agent's question.
+ """
+
+ def __init__(
+ self,
+ size=5,
+ hear_yourself=False,
+ diminished_reward=True,
+ step_penalty=False
+ ):
+ assert size >= 5
+
+ super().__init__(
+ grid_size=size,
+ max_steps=5*size**2,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *TalkHardSesameNPCGuidesGrammar.grammar_action_space.nvec
+ ])
+ )
+ self.hear_yourself = hear_yourself
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+
+ self.empty_symbol = "NA \n"
+
+ print({
+ "size": size,
+ "hear_yourself": hear_yourself,
+ "diminished_reward": diminished_reward,
+ "step_penalty": step_penalty,
+ })
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Generate the 4 doors at random positions
+ self.door_pos = []
+ self.door_front_pos = [] # Remembers positions in front of door to avoid setting wizard here
+
+ self.door_pos.append((self._rand_int(2, width-2), 0))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1]+1))
+
+ self.door_pos.append((self._rand_int(2, width-2), height-1))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1] - 1))
+
+ self.door_pos.append((0, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] + 1, self.door_pos[-1][1]))
+
+ self.door_pos.append((width-1, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] - 1, self.door_pos[-1][1]))
+
+ # Generate the door colors
+ self.door_colors = []
+ while len(self.door_colors) < len(self.door_pos):
+ color = self._rand_elem(COLOR_NAMES)
+ if color in self.door_colors:
+ continue
+ self.door_colors.append(color)
+
+ # Place the doors in the grid
+ for idx, pos in enumerate(self.door_pos):
+ color = self.door_colors[idx]
+ self.grid.set(*pos, Door(color))
+
+
+ # Set a randomly coloured WIZARD at a random position
+ color = self._rand_elem(COLOR_NAMES)
+ self.wizard = Wizard(color, "Gandalf", self)
+
+ # Place it randomly, omitting front of door positions
+ self.place_obj(self.wizard,
+ size=(width, height),
+ reject_fn=lambda _, p: tuple(p) in self.door_front_pos)
+
+
+ # add guides
+ GUIDE_NAMES = ["John", "Jack"]
+
+ # Set a randomly coloured TRUE GUIDE at a random position
+ name = self._rand_elem(GUIDE_NAMES)
+ color = self._rand_elem(COLOR_NAMES)
+ self.true_guide = Guide(color, name, self, liar=False)
+
+ # Place it randomly, omitting invalid positions
+ self.place_obj(self.true_guide,
+ size=(width, height),
+ # reject_fn=lambda _, p: tuple(p) in self.door_front_pos)
+ reject_fn=lambda _, p: tuple(p) in [*self.door_front_pos, tuple(self.wizard.cur_pos)])
+
+ # Set a randomly coloured FALSE GUIDE at a random position
+ name = self._rand_elem([n for n in GUIDE_NAMES if n != self.true_guide.name])
+ color = self._rand_elem(COLOR_NAMES)
+ self.false_guide = Guide(color, name, self, liar=True)
+
+ # Place it randomly, omitting invalid positions
+ self.place_obj(self.false_guide,
+ size=(width, height),
+ reject_fn=lambda _, p: tuple(p) in [
+ *self.door_front_pos, tuple(self.wizard.cur_pos), tuple(self.true_guide.cur_pos)])
+ assert self.true_guide.name != self.false_guide.name
+
+ # Randomize the agent's start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Select a random target door
+ self.doorIdx = self._rand_int(0, len(self.door_pos))
+ self.target_pos = self.door_pos[self.doorIdx]
+ self.target_color = self.door_colors[self.doorIdx]
+
+ # Generate the mission string
+ self.mission = 'go to the %s door' % self.target_color
+
+ # Dummy beginning string
+ self.beginning_string = "This is what you hear. \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ self.conversation = self.utterance
+
+ def step(self, action):
+ p_action = action[0]
+ utterance_action = action[1:]
+
+        # the utterance components must be either all NaN (no speech) or all non-NaN
+ assert len(set(np.isnan(utterance_action))) == 1
+
+ speak_flag = not all(np.isnan(utterance_action))
+
+ obs, reward, done, info = super().step(p_action)
+
+ if speak_flag:
+ utterance = TalkHardSesameNPCGuidesGrammar.construct_utterance(utterance_action)
+ if self.hear_yourself:
+ self.utterance += "YOU: {} \n".format(utterance)
+
+ self.conversation += "YOU: {} \n".format(utterance)
+
+ # check if near wizard
+ if hasattr(self, "wizard"):
+ if self.wizard.is_near_agent():
+ reply = self.wizard.listen(utterance)
+
+ if reply:
+ self.utterance += "{}: {} \n".format(self.wizard.name, reply)
+ self.conversation += "{}: {} \n".format(self.wizard.name, reply)
+
+ if self.true_guide.is_near_agent():
+ reply = self.true_guide.listen(utterance)
+
+ if reply:
+ self.utterance += "{}: {} \n".format(self.true_guide.name, reply)
+ self.conversation += "{}: {} \n".format(self.true_guide.name, reply)
+
+ if hasattr(self, "false_guide"):
+ if self.false_guide.is_near_agent():
+ reply = self.false_guide.listen(utterance)
+
+ if reply:
+ self.utterance += "{}: {} \n".format(self.false_guide.name, reply)
+ self.conversation += "{}: {} \n".format(self.false_guide.name, reply)
+
+ if utterance == TalkHardSesameNPCGuidesGrammar.construct_utterance([1, 0]):
+ ax, ay = self.agent_pos
+ tx, ty = self.target_pos
+
+ if (ax == tx and abs(ay - ty) == 1) or (ay == ty and abs(ax - tx) == 1):
+ reward = self._reward()
+
+ for dx, dy in self.door_pos:
+ if (ax == dx and abs(ay - dy) == 1) or (ay == dy and abs(ax - dx) == 1):
+                        # the agent has chosen a door; the episode ends whether or not it is the correct one
+ done = True
+
+ # Don't let the agent open any of the doors
+ if p_action == self.actions.toggle:
+ done = True
+
+ if p_action == self.actions.done:
+ done = True
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+        # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+ print(self.conversation)
+ self.window.set_caption(self.conversation, [
+ "Gandalf:",
+ "Jack:",
+ "John:",
+ "Where is the exit",
+ "Open sesame",
+ ])
+ return obs
+
+
+
+class GoToDoorTalkHardSesameNPCGuides8x8Env(GoToDoorTalkHardSesameNPCGuidesEnv):
+ def __init__(self):
+ super().__init__(size=8)
+
+
+class GoToDoorTalkHardSesameNPCGuides6x6Env(GoToDoorTalkHardSesameNPCGuidesEnv):
+ def __init__(self):
+ super().__init__(size=6)
+
+
+# hear yourself
+class GoToDoorTalkHardSesameNPCGuidesHY8x8Env(GoToDoorTalkHardSesameNPCGuidesEnv):
+ def __init__(self):
+ super().__init__(size=8, hear_yourself=True)
+
+
+class GoToDoorTalkHardSesameNPCGuidesHY6x6Env(GoToDoorTalkHardSesameNPCGuidesEnv):
+ def __init__(self):
+ super().__init__(size=6, hear_yourself=True)
+
+
+class GoToDoorTalkHardSesameNPCGuidesHY5x5Env(GoToDoorTalkHardSesameNPCGuidesEnv):
+ def __init__(self):
+ super().__init__(size=5, hear_yourself=True)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardSesameNPCGuides-5x5-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardSesameNPCGuidesEnv'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardSesameNPCGuides-6x6-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardSesameNPCGuides6x6Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardSesameNPCGuides-8x8-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardSesameNPCGuides8x8Env'
+)
+register(
+ id='MiniGrid-GoToDoorTalkHardSesameNPCGuidesHY-5x5-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardSesameNPCGuidesHY5x5Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardSesameNPCGuidesHY-6x6-v0',
+    entry_point='gym_minigrid.envs:GoToDoorTalkHardSesameNPCGuidesHY6x6Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorTalkHardSesameNPCGuidesHY-8x8-v0',
+ entry_point='gym_minigrid.envs:GoToDoorTalkHardSesameNPCGuidesHY8x8Env'
+)
diff --git a/gym-minigrid/gym_minigrid/backup_envs/gotodoorwizard.py b/gym-minigrid/gym_minigrid/backup_envs/gotodoorwizard.py
new file mode 100644
index 0000000000000000000000000000000000000000..3d2441a851aab8abe00e5027ae7c62b78fa2582d
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/gotodoorwizard.py
@@ -0,0 +1,209 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+
+
+class simpleWizard(NPC):
+ """
+    A simple NPC that gives the agent a quest: go to a door chosen at random among the room's doors.
+ """
+ def __init__(self, color, name, env):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+ self.has_spoken = False # wizards only speak once
+
+ # Select a random target object as mission
+ obj_idx = self.env._rand_int(0, len(self.env.door_pos))
+ self.target_pos = self.env.door_pos[obj_idx]
+ self.target_color = self.env.door_colors[obj_idx]
+
+ # Generate the mission string
+ self.wizard_mission = 'go to the %s door' % self.target_color
+
+ def listen(self, utterance):
+ if not self.has_spoken:
+ self.has_spoken = True
+ return self.wizard_mission
+ return None
+
+ def is_satisfied(self):
+ ax, ay = self.env.agent_pos
+ tx, ty = self.target_pos
+ if (ax == tx and abs(ay - ty) == 1) or (ay == ty and abs(ax - tx) == 1):
+ return True
+ return False
+
+ def is_near_agent(self):
+ ax, ay = self.env.agent_pos
+ wx, wy = self.cur_pos
+ if (ax == wx and abs(ay - wy) == 1) or (ay == wy and abs(ax - wx) == 1):
+ return True
+ return False
+
+
+class GoToDoorWizard(MiniGridEnv):
+ """
+    Environment in which the agent is instructed to "please the wizard", i.e. to approach
+    the wizard, receive its go-to-door quest, and complete it.
+ """
+
+ def __init__(
+ self,
+ size=5,
+ hear_yourself=False,
+ ):
+ assert size >= 5
+
+ super().__init__(
+ grid_size=size,
+ max_steps=5*size**2,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *Grammar.grammar_action_space.nvec
+ ])
+ )
+ self.hear_yourself = hear_yourself
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Generate the 4 doors at random positions
+ self.door_pos = []
+ self.door_front_pos = [] # Remembers positions in front of door to avoid setting wizard here
+
+ self.door_pos.append((self._rand_int(2, width-2), 0))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1]+1))
+
+ self.door_pos.append((self._rand_int(2, width-2), height-1))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1] - 1))
+
+ self.door_pos.append((0, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] + 1, self.door_pos[-1][1]))
+
+ self.door_pos.append((width-1, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] - 1, self.door_pos[-1][1]))
+
+ # Generate the door colors
+ self.door_colors = []
+ while len(self.door_colors) < len(self.door_pos):
+ color = self._rand_elem(COLOR_NAMES)
+ if color in self.door_colors:
+ continue
+ self.door_colors.append(color)
+
+ # Place the doors in the grid
+ for idx, pos in enumerate(self.door_pos):
+ color = self.door_colors[idx]
+ self.grid.set(*pos, Door(color))
+
+ # Set a randomly coloured NPC at a random position
+ color = self._rand_elem(COLOR_NAMES)
+ self.wizard = simpleWizard(color, "Gandalf", self)
+
+ # Place it randomly, omitting front of door positions
+ self.place_obj(self.wizard,
+ size=(width, height),
+ reject_fn=lambda _, p: tuple(p) in self.door_front_pos)
+
+ # Randomize the agent start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Generate the mission string
+ self.mission = 'please the wizard'
+
+ # Initialize the dialogue string
+ self.dialogue = "This is what you hear. "
+
+ def gen_obs(self):
+ obs = super().gen_obs()
+
+ # add dialogue to obs
+ obs["dialogue"] = self.dialogue
+
+ return obs
+
+ def step(self, action):
+
+        # crude handling of the bare action passed by manual_control; TODO: improve
+ if type(action) == MiniGridEnv.Actions:
+ action = [action, None]
+
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ obs, reward, done, info = super().step(p_action)
+
+ # check if near wizard
+        if self.wizard.is_near_agent():
+ #utterance = Grammar.construct_utterance(utterance_action)
+ reply = self.wizard.listen("")
+ # if self.hear_yourself:
+ # self.dialogue += "YOU: " + utterance
+ if reply:
+ self.dialogue += "{}: {}".format(self.wizard.name, reply)
+
+ # Don't let the agent open any of the doors
+ if p_action == self.actions.toggle:
+ done = True
+
+ # Reward performing done action if pleasing the wizard
+ if p_action == self.actions.done:
+ if self.wizard.is_satisfied():
+ reward = self._reward()
+ done = True
+ return obs, reward, done, info
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+ self.window.set_caption(self.dialogue, [
+ "Gandalf:",
+ "Jack:",
+ "John:",
+ "Where is the exit",
+ "Open sesame",
+ ])
+ self.window.fig.gca().set_title("goal: "+self.mission)
+ return obs
+
+
+class GoToDoorWizard5x5Env(GoToDoorWizard):
+ def __init__(self):
+ super().__init__(size=5)
+
+
+class GoToDoorWizard7x7Env(GoToDoorWizard):
+ def __init__(self):
+ super().__init__(size=7)
+
+class GoToDoorWizard8x8Env(GoToDoorWizard):
+ def __init__(self):
+ super().__init__(size=8)
+
+
+
+register(
+ id='MiniGrid-GoToDoorWizard-5x5-v0',
+ entry_point='gym_minigrid.envs:GoToDoorWizard5x5Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorWizard-7x7-v0',
+ entry_point='gym_minigrid.envs:GoToDoorWizard7x7Env'
+)
+
+register(
+ id='MiniGrid-GoToDoorWizard-8x8-v0',
+ entry_point='gym_minigrid.envs:GoToDoorWizard8x8Env'
+)
\ No newline at end of file
diff --git a/gym-minigrid/gym_minigrid/backup_envs/guidethief.py b/gym-minigrid/gym_minigrid/backup_envs/guidethief.py
new file mode 100644
index 0000000000000000000000000000000000000000..2fd71ca6d707d46498d3683184660200f6617c9a
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/guidethief.py
@@ -0,0 +1,416 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+class Guide(NPC):
+ """
+    A simple NPC that, when asked, tells the agent which door to go to, truthfully or deceptively depending on `liar`.
+ """
+
+ def __init__(self, color, name, id, env, liar=False):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+ self.liar = liar
+ self.npc_dir = 1 # NPC initially looks downward
+ self.npc_type = id # this will be put into the encoding
+
+ # Select a random target object as mission
+ obj_idx = self.env._rand_int(0, len(self.env.door_pos))
+ self.target_pos = self.env.door_pos[obj_idx]
+ self.target_color = self.env.door_colors[obj_idx]
+
+ def listen(self, utterance):
+ if utterance == GuideThiefGrammar.construct_utterance([0, 1]):
+ if self.liar:
+ fake_colors = [c for c in self.env.door_colors if c != self.env.target_color]
+ fake_color = self.env._rand_elem(fake_colors)
+
+                # Construct a deceptive instruction pointing to a wrong door
+ assert fake_color != self.env.target_color
+ if self.env.one_word:
+ return '%s' % fake_color
+ elif self.env.very_diff:
+ return 'you want the %s door' % fake_color
+ else:
+ return 'go to the %s door' % fake_color
+
+ else:
+ return self.env.mission
+
+ return None
+
+ def render(self, img):
+ c = COLORS[self.color]
+
+ npc_shapes = []
+ # Draw eyes
+ npc_shapes.append(point_in_circle(cx=0.70, cy=0.50, r=0.10))
+ npc_shapes.append(point_in_circle(cx=0.30, cy=0.50, r=0.10))
+
+ # Draw mouth
+ npc_shapes.append(point_in_rect(0.20, 0.80, 0.72, 0.81))
+
+ # todo: move this to super function
+ # todo: super.render should be able to take the npc_shapes and then rotate them
+
+ if hasattr(self, "npc_dir"):
+ # Pre-rotation to ensure npc_dir = 1 means NPC looks downwards
+ npc_shapes = [rotate_fn(v, cx=0.5, cy=0.5, theta=-1*(math.pi / 2)) for v in npc_shapes]
+ # Rotate npc based on its direction
+ npc_shapes = [rotate_fn(v, cx=0.5, cy=0.5, theta=(math.pi/2) * self.npc_dir) for v in npc_shapes]
+
+ # Draw shapes
+ for v in npc_shapes:
+ fill_coords(img, v, c)
+
+ def is_near_agent(self):
+ ax, ay = self.env.agent_pos
+ wx, wy = self.cur_pos
+ if (ax == wx and abs(ay - wy) == 1) or (ay == wy and abs(ax - wx) == 1):
+ return True
+ return False
+
+
+class GuideThiefGrammar(object):
+
+ templates = ["Where is", "Open", "Close", "What is"]
+ things = [
+ "sesame", "the exit", "the wall", "the floor", "the ceiling", "the window", "the entrance", "the closet",
+ "the drawer", "the fridge", "oven", "the lamp", "the trash can", "the chair", "the bed", "the sofa"
+ ]
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+
+class GuideThiefEnv(MultiModalMiniGridEnv):
+ """
+    Environment in which the agent has to reach the target door; two guides answer its
+    question about the exit, one truthfully and one deceptively.
+ """
+
+ def __init__(
+ self,
+ size=5,
+ hear_yourself=False,
+ diminished_reward=True,
+ step_penalty=False,
+ nameless=False,
+ max_steps=None,
+ very_diff=False,
+ one_word=False,
+ ):
+ assert size >= 5
+ self.empty_symbol = "NA \n"
+ self.hear_yourself = hear_yourself
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+ self.nameless = nameless
+ self.very_diff = very_diff
+ self.one_word = one_word
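+        # nameless: NPC replies are appended to the utterance without the speaker's name
+        # very_diff / one_word: alternative phrasings used for the lying guide's reply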
+
+ super().__init__(
+ grid_size=size,
+ max_steps=max_steps or 5*size**2,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *GuideThiefGrammar.grammar_action_space.nvec
+ ]),
+ add_npc_direction=True
+ )
+
+ print({
+ "size": size,
+ "hear_yourself": hear_yourself,
+ "diminished_reward": diminished_reward,
+ "step_penalty": step_penalty,
+ })
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height, nb_obj_dims=4)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Generate the 4 doors at random positions
+ self.door_pos = []
+ self.door_front_pos = []
+
+ self.door_pos.append((self._rand_int(2, width-2), 0))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1]+1))
+
+ self.door_pos.append((self._rand_int(2, width-2), height-1))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1] - 1))
+
+ self.door_pos.append((0, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] + 1, self.door_pos[-1][1]))
+
+ self.door_pos.append((width-1, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] - 1, self.door_pos[-1][1]))
+
+ # Generate the door colors
+ self.door_colors = []
+ while len(self.door_colors) < len(self.door_pos):
+ color = self._rand_elem(COLOR_NAMES)
+ if color in self.door_colors:
+ continue
+ self.door_colors.append(color)
+
+ # Place the doors in the grid
+ for idx, pos in enumerate(self.door_pos):
+ color = self.door_colors[idx]
+ self.grid.set(*pos, Door(color))
+
+
+ # add guides
+ GUIDE_NAMES = ["John", "Jack"]
+ name_2_id = {name: id for id, name in enumerate(GUIDE_NAMES)}
+
+ # Set a randomly coloured TRUE GUIDE at a random position
+
+ true_guide_name = GUIDE_NAMES[0]
+ color = self._rand_elem(COLOR_NAMES)
+ self.true_guide = Guide(
+ color=color,
+ name=true_guide_name,
+ id=name_2_id[true_guide_name],
+ env=self,
+ liar=False
+ )
+
+ # Place it randomly, omitting invalid positions
+ self.place_obj(self.true_guide,
+ size=(width, height),
+ reject_fn=lambda _, p: tuple(p) in self.door_front_pos)
+
+ # Set a randomly coloured FALSE GUIDE at a random position
+ false_guide_name = GUIDE_NAMES[1]
+ if self.nameless:
+ color = self._rand_elem([c for c in COLOR_NAMES if c != self.true_guide.color])
+ else:
+ color = self._rand_elem(COLOR_NAMES)
+
+ self.false_guide = Guide(
+ color=color,
+ name=false_guide_name,
+ id=name_2_id[false_guide_name],
+ env=self,
+ liar=True
+ )
+
+ # Place it randomly, omitting invalid positions
+ self.place_obj(self.false_guide,
+ size=(width, height),
+ reject_fn=lambda _, p: tuple(p) in [
+ *self.door_front_pos, tuple(self.true_guide.cur_pos)])
+ assert self.true_guide.name != self.false_guide.name
+
+ # Randomize the agent's start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Select a random target door
+ self.doorIdx = self._rand_int(0, len(self.door_pos))
+ self.target_pos = self.door_pos[self.doorIdx]
+ self.target_color = self.door_colors[self.doorIdx]
+
+ # Generate the mission string
+ self.mission = 'go to the %s door' % self.target_color
+
+ # Dummy beginning string
+ self.beginning_string = "This is what you hear. \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ # used for rendering
+ self.conversation = self.utterance
+
+ def step(self, action):
+ p_action = action[0]
+ utterance_action = action[1:]
+
+        # the utterance components must be either all NaN (no speech) or all non-NaN
+ assert len(set(np.isnan(utterance_action))) == 1
+
+ speak_flag = not all(np.isnan(utterance_action))
+
+ obs, reward, done, info = super().step(p_action)
+
+ if speak_flag:
+ utterance = GuideThiefGrammar.construct_utterance(utterance_action)
+ if self.hear_yourself:
+ if self.nameless:
+ self.utterance += "{} \n".format(utterance)
+ else:
+ self.utterance += "YOU: {} \n".format(utterance)
+
+ self.conversation += "YOU: {} \n".format(utterance)
+
+ if self.true_guide.is_near_agent():
+ reply = self.true_guide.listen(utterance)
+
+ if reply:
+ if self.nameless:
+ self.utterance += "{} \n".format(reply)
+ else:
+ self.utterance += "{}: {} \n".format(self.true_guide.name, reply)
+
+ self.conversation += "{}: {} \n".format(self.true_guide.name, reply)
+
+ if self.false_guide.is_near_agent():
+ reply = self.false_guide.listen(utterance)
+
+ if reply:
+ if self.nameless:
+ self.utterance += "{} \n".format(reply)
+ else:
+ self.utterance += "{}: {} \n".format(self.false_guide.name, reply)
+
+ self.conversation += "{}: {} \n".format(self.false_guide.name, reply)
+
+ if utterance == GuideThiefGrammar.construct_utterance([1, 0]):
+ ax, ay = self.agent_pos
+ tx, ty = self.target_pos
+
+ if (ax == tx and abs(ay - ty) == 1) or (ay == ty and abs(ax - tx) == 1):
+ reward = self._reward()
+
+ for dx, dy in self.door_pos:
+ if (ax == dx and abs(ay - dy) == 1) or (ay == dy and abs(ax - dx) == 1):
+                        # the agent has chosen a door; the episode ends whether or not it is the correct one
+ done = True
+
+ # Don't let the agent open any of the doors
+ if p_action == self.actions.toggle:
+ done = True
+
+ if p_action == self.actions.done:
+ done = True
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+ print("conversation:\n", self.conversation)
+ print("utterance_history:\n", self.utterance_history)
+ self.window.set_caption(self.conversation, [
+ "Gandalf:",
+ "Jack:",
+ "John:",
+ "Where is the exit",
+ "Open sesame",
+ ])
+ return obs
+
+
+class GuideThief8x8Env(GuideThiefEnv):
+ def __init__(self):
+ super().__init__(size=8)
+
+
+class GuideThief6x6Env(GuideThiefEnv):
+ def __init__(self):
+ super().__init__(size=6)
+
+
+class GuideThiefNameless8x8Env(GuideThiefEnv):
+ def __init__(self):
+ super().__init__(size=8, nameless=True)
+
+
+class GuideThiefTestEnv(GuideThiefEnv):
+ def __init__(self):
+ super().__init__(
+ size=5,
+ nameless=False,
+ max_steps=20,
+ )
+
+class GuideThiefVeryDiff(GuideThiefEnv):
+ def __init__(self):
+ super().__init__(
+ size=5,
+ nameless=False,
+ max_steps=20,
+ very_diff=True,
+ )
+
+class GuideThiefOneWord(GuideThiefEnv):
+ def __init__(self):
+ super().__init__(
+ size=5,
+ nameless=False,
+ max_steps=20,
+ very_diff=False,
+ one_word=True
+ )
+
+register(
+ id='MiniGrid-GuideThief-5x5-v0',
+ entry_point='gym_minigrid.envs:GuideThiefEnv'
+)
+
+register(
+ id='MiniGrid-GuideThief-6x6-v0',
+ entry_point='gym_minigrid.envs:GuideThief6x6Env'
+)
+
+register(
+ id='MiniGrid-GuideThief-8x8-v0',
+ entry_point='gym_minigrid.envs:GuideThief8x8Env'
+)
+
+register(
+ id='MiniGrid-GuideThiefNameless-8x8-v0',
+ entry_point='gym_minigrid.envs:GuideThiefNameless8x8Env'
+)
+
+register(
+ id='MiniGrid-GuideThiefTest-v0',
+ entry_point='gym_minigrid.envs:GuideThiefTestEnv'
+)
+
+register(
+ id='MiniGrid-GuideThiefVeryDiff-v0',
+ entry_point='gym_minigrid.envs:GuideThiefVeryDiff'
+)
+register(
+ id='MiniGrid-GuideThiefOneWord-v0',
+ entry_point='gym_minigrid.envs:GuideThiefOneWord'
+)
\ No newline at end of file
diff --git a/gym-minigrid/gym_minigrid/backup_envs/helper.py b/gym-minigrid/gym_minigrid/backup_envs/helper.py
new file mode 100644
index 0000000000000000000000000000000000000000..0a3df741f3a6118297898574e4a7bf6921272038
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/helper.py
@@ -0,0 +1,295 @@
+import numpy as np
+
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+import time
+from collections import deque
+
+
+class Peer(NPC):
+ """
+    An NPC that heads for one of the two locked doors; the agent has to unlock that door so the peer can exit.
+ """
+
+ def __init__(self, color, name, env):
+ super().__init__(color)
+ self.name = name
+ self.npc_dir = 1 # NPC initially looks downward
+ self.npc_type = 0
+ self.env = env
+ self.npc_actions = []
+ self.dancing_step_idx = 0
+ self.actions = MiniGridEnv.Actions
+ self.add_npc_direction = True
+ self.available_moves = [self.rotate_left, self.rotate_right, self.go_forward, self.toggle_action]
+
+ selected_door_id = self.env._rand_elem([0, 1])
+ self.selected_door_pos = [self.env.door_pos_top, self.env.door_pos_bottom][selected_door_id]
+ self.selected_door = [self.env.door_top, self.env.door_bottom][selected_door_id]
+ self.joint_attention_achieved = False
+
+ def can_overlap(self):
+ # If the NPC is hidden, agent can overlap on it
+ return self.env.hidden_npc
+
+ def encode(self, nb_dims=3):
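+        # a hidden NPC encodes as an empty cell, so it does not show up in the symbolic observation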
+ if self.env.hidden_npc:
+ if nb_dims == 3:
+ return (1, 0, 0)
+ elif nb_dims == 4:
+ return (1, 0, 0, 0)
+ else:
+ return super().encode(nb_dims=nb_dims)
+
+ def step(self):
+
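+        # peer policy: head for its chosen door, seek eye contact with the agent once it is
+        # next to the door, then walk through as soon as the door is open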
+ distance_to_door = np.abs(self.selected_door_pos - self.cur_pos).sum(-1)
+
+ if all(self.front_pos == self.selected_door_pos) and self.selected_door.is_open:
+ # in front of door
+ self.go_forward()
+
+ elif distance_to_door == 1 and not self.joint_attention_achieved:
+ # before turning to the door look at the agent
+ wanted_dir = self.compute_wanted_dir(self.env.agent_pos)
+ act = self.compute_turn_action(wanted_dir)
+ act()
+ if self.is_eye_contact():
+ self.joint_attention_achieved = True
+
+ else:
+ act = self.path_to_toggle_pos(self.selected_door_pos)
+ act()
+
+ # not really important as the NPC doesn't speak
+ if self.env.hidden_npc:
+ return None
+
+
+
+class HelperGrammar(object):
+
+ templates = ["Move your", "Shake your"]
+ things = ["body", "head"]
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+
+class HelperEnv(MultiModalMiniGridEnv):
+ """
+    Environment in which the agent has to work out which door the NPC peer is heading for
+    and turn on the corresponding switch so that the peer can exit.
+ """
+
+ def __init__(
+ self,
+ size=5,
+ diminished_reward=True,
+ step_penalty=False,
+ knowledgeable=False,
+ max_steps=20,
+ hidden_npc=False,
+ ):
+ assert size >= 5
+ self.empty_symbol = "NA \n"
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+ self.knowledgeable = knowledgeable
+ self.hidden_npc = hidden_npc
+
+ super().__init__(
+ grid_size=size,
+ max_steps=max_steps,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *HelperGrammar.grammar_action_space.nvec
+ ]),
+ add_npc_direction=True
+ )
+
+ print({
+ "size": size,
+ "diminished_reward": diminished_reward,
+ "step_penalty": step_penalty,
+ })
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height, nb_obj_dims=4)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ self.wall_x = width-1
+ self.wall_y = height-1
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
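+        # Room layout: a lava wall splits the room; the two locked doors are on the right
+        # wall and their locker switches on the left wall, where the agent starts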
+ # add lava
+ self.grid.vert_wall(width//2, 1, height - 2, Lava)
+
+ # door top
+ door_color_top = self._rand_elem(COLOR_NAMES)
+ self.door_pos_top = (width-1, 1)
+ self.door_top = Door(door_color_top, is_locked=True)
+ self.grid.set(*self.door_pos_top, self.door_top)
+
+ # switch top
+ self.switch_pos_top = (0, 1)
+ self.switch_top = Switch(door_color_top, lockable_object=self.door_top, locker_switch=True)
+ self.grid.set(*self.switch_pos_top, self.switch_top)
+
+ # door bottom
+ door_color_bottom = self._rand_elem(COLOR_NAMES)
+ self.door_pos_bottom = (width-1, height-2)
+ self.door_bottom = Door(door_color_bottom, is_locked=True)
+ self.grid.set(*self.door_pos_bottom, self.door_bottom)
+
+ # switch bottom
+ self.switch_pos_bottom = (0, height-2)
+ self.switch_bottom = Switch(door_color_bottom, lockable_object=self.door_bottom, locker_switch=True)
+ self.grid.set(*self.switch_pos_bottom, self.switch_bottom)
+
+ # save to variables
+ self.switches = [self.switch_top, self.switch_bottom]
+ self.switches_pos = [self.switch_pos_top, self.switch_pos_bottom]
+ self.door = [self.door_top, self.door_bottom]
+ self.door_pos = [self.door_pos_top, self.door_pos_bottom]
+
+ # Set a randomly coloured Dancer NPC
+ color = self._rand_elem(COLOR_NAMES)
+ self.peer = Peer(color, "Jill", self)
+
+ # Place it on the middle right side of the room
+ peer_pos = np.array((self._rand_int(width//2+1, width - 1), self._rand_int(1, height - 1)))
+
+ self.grid.set(*peer_pos, self.peer)
+ self.peer.init_pos = peer_pos
+ self.peer.cur_pos = peer_pos
+
+ # Randomize the agent's start position and orientation
+ self.place_agent(size=(width//2, height))
+
+ # Generate the mission string
+ self.mission = 'watch dancer and repeat his moves afterwards'
+
+ # Dummy beginning string
+ self.beginning_string = "This is what you hear. \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ # used for rendering
+ self.conversation = self.utterance
+ self.outcome_info = None
+
+ def step(self, action):
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ obs, reward, done, info = super().step(p_action)
+ self.peer.step()
+
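+        # the episode ends if the agent uses done, steps onto either door cell, or turns
+        # both switches on; reward is given only when the peer reaches its chosen door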
+ if np.isnan(p_action):
+ pass
+
+ if p_action == self.actions.done:
+ done = True
+
+ elif all(self.agent_pos == self.door_pos_top):
+ done = True
+
+ elif all(self.agent_pos == self.door_pos_bottom):
+ done = True
+
+ elif all([self.switch_top.is_on, self.switch_bottom.is_on]):
+ # if both switches are on no reward is given and episode ends
+ done = True
+
+ elif all(self.peer.cur_pos == self.peer.selected_door_pos):
+ reward = self._reward()
+ done = True
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ if self.hidden_npc:
+ # all npc are hidden
+ assert np.argwhere(obs['image'][:,:,0] == OBJECT_TO_IDX['npc']).size == 0
+ assert "{}:".format(self.peer.name) not in self.utterance
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ if done:
+ if reward > 0:
+ self.outcome_info = "SUCCESS: agent got {} reward \n".format(np.round(reward, 1))
+ else:
+ self.outcome_info = "FAILURE: agent got {} reward \n".format(reward)
+
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+ self.window.clear_text() # erase previous text
+
+ # self.window.set_caption(self.conversation, [self.peer.name])
+ # self.window.ax.set_title("correct door: {}".format(self.true_guide.target_color), loc="left", fontsize=10)
+ if self.outcome_info:
+ color = None
+ if "SUCCESS" in self.outcome_info:
+ color = "lime"
+ elif "FAILURE" in self.outcome_info:
+ color = "red"
+ self.window.add_text(*(0.01, 0.85, self.outcome_info),
+ **{'fontsize':15, 'color':color, 'weight':"bold"})
+
+ self.window.show_img(obs) # re-draw image to add changes to window
+ return obs
+
+
+class Helper8x8Env(HelperEnv):
+ def __init__(self, **kwargs):
+ super().__init__(size=8, max_steps=20, **kwargs)
+
+
+class Helper6x6Env(HelperEnv):
+ def __init__(self):
+ super().__init__(size=6, max_steps=20)
+
+
+
+register(
+ id='MiniGrid-Helper-5x5-v0',
+ entry_point='gym_minigrid.envs:HelperEnv'
+)
+
+register(
+ id='MiniGrid-Helper-6x6-v0',
+ entry_point='gym_minigrid.envs:Helper6x6Env'
+)
+
+register(
+ id='MiniGrid-Helper-8x8-v0',
+ entry_point='gym_minigrid.envs:Helper8x8Env'
+)
diff --git a/gym-minigrid/gym_minigrid/backup_envs/showme.py b/gym-minigrid/gym_minigrid/backup_envs/showme.py
new file mode 100644
index 0000000000000000000000000000000000000000..5fdb3f2c1f627a658945b38925a7350c8218d5a8
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/showme.py
@@ -0,0 +1,525 @@
+import numpy as np
+
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+import time
+from collections import deque
+
+class DemonstratingPeer(NPC):
+ """
+    An NPC that demonstrates how to exit: it sets the switches to the password, opens the door and leaves.
+ """
+ def __init__(self, color, name, env, knowledgeable=False):
+ super().__init__(color)
+ self.name = name
+ self.npc_dir = 1 # NPC initially looks downward
+ self.npc_type = 0
+ self.env = env
+ self.knowledgeable = knowledgeable
+ self.npc_actions = []
+ self.dancing_step_idx = 0
+ self.actions = MiniGridEnv.Actions
+ self.add_npc_direction = True
+ self.available_moves = [self.rotate_left, self.rotate_right, self.go_forward, self.toggle_action]
+ self.exited = False
+ self.joint_attention_achieved = False
+
+ def can_overlap(self):
+ # If the NPC is hidden, agent can overlap on it
+ return self.env.hidden_npc
+
+ def encode(self, nb_dims=3):
+ if self.env.hidden_npc:
+ if nb_dims == 3:
+ return (1, 0, 0)
+ elif nb_dims == 4:
+ return (1, 0, 0, 0)
+ else:
+ return super().encode(nb_dims=nb_dims)
+
+ def step(self):
+ super().step()
+ reply = None
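+        # once the peer reaches the door cell it "exits": it disappears, closes the door and
+        # resets the switches; a knowledgeable peer first seeks eye contact with the agent,
+        # then toggles the first wrong switch until the password matches and walks out;
+        # a non-knowledgeable peer just moves randomly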
+ if self.exited:
+ return
+
+ if all(np.array(self.cur_pos) == np.array(self.env.door_pos)):
+ # disappear
+ self.env.grid.set(*self.cur_pos, self.env.object)
+ self.cur_pos = np.array([np.nan, np.nan])
+
+ # close door
+ self.env.object.toggle(self.env, self.cur_pos)
+
+ # reset switches door
+ for s in self.env.switches:
+ s.is_on = False
+
+ # update door
+ self.env.update_door_lock()
+
+ self.exited = True
+
+ elif self.knowledgeable:
+
+ if self.joint_attention_achieved:
+ if self.env.object.is_locked:
+ first_wrong_id = np.where(self.env.get_selected_password() != self.env.password)[0][0]
+ goal_pos = self.env.switches_pos[first_wrong_id]
+ act = self.path_to_toggle_pos(goal_pos)
+ act()
+
+ else:
+ if all(self.front_pos == self.env.door_pos) and self.env.object.is_open:
+ self.go_forward()
+
+ else:
+ act = self.path_to_toggle_pos(self.env.door_pos)
+ act()
+ else:
+ wanted_dir = self.compute_wanted_dir(self.env.agent_pos)
+ action = self.compute_turn_action(wanted_dir)
+ action()
+
+ if self.is_eye_contact():
+ self.joint_attention_achieved = True
+ reply = "Look at me"
+
+ else:
+ self.env._rand_elem(self.available_moves)()
+
+ self.env.update_door_lock()
+
+ if self.env.hidden_npc:
+ reply = None
+
+ return reply
+
+
+class DemonstrationGrammar(object):
+
+ templates = ["Move your", "Shake your"]
+ things = ["body", "head"]
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+
+class DemonstrationEnv(MultiModalMiniGridEnv):
+ """
+    Environment in which the agent has to exit through a locked door; the door unlocks when
+    the switches match a hidden password, which a knowledgeable peer can demonstrate.
+ """
+
+ def __init__(
+ self,
+ size=5,
+ diminished_reward=True,
+ step_penalty=False,
+ knowledgeable=False,
+ hard_password=False,
+ max_steps=100,
+ n_switches=3,
+ augmentation=False,
+ stump=False,
+ no_turn_off=False,
+ no_light=False,
+ hidden_npc=False
+ ):
+ assert size >= 5
+ self.empty_symbol = "NA \n"
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+ self.knowledgeable = knowledgeable
+ self.hard_password = hard_password
+ self.n_switches = n_switches
+ self.augmentation = augmentation
+ self.stump = stump
+ self.no_turn_off=no_turn_off
+ self.hidden_npc = hidden_npc
+
+ if self.augmentation:
+ assert not no_light
+
+ self.no_light = no_light
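+        # augmentation: non-social control condition; the switches are preset to the password
+        # for the first 10 steps (and no peer is spawned), after which they are reset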
+
+
+ super().__init__(
+ grid_size=size,
+ max_steps=max_steps,
+ # Set this to True for maximum speed
+ see_through_walls=False if self.stump else True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *DemonstrationGrammar.grammar_action_space.nvec
+ ]),
+ add_npc_direction=True
+ )
+
+ print({
+ "size": size,
+ "diminished_reward": diminished_reward,
+ "step_penalty": step_penalty,
+ })
+
+ def get_selected_password(self):
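+        # current on/off pattern of the switches; the door unlocks when this pattern
+        # equals self.password (see update_door_lock)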
+ return np.array([int(s.is_on) for s in self.switches])
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height, nb_obj_dims=4)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ self.wall_x = width - 1
+ self.wall_y = height - 1
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ door_color = self._rand_elem(COLOR_NAMES)
+
+ if self.stump:
+ wall_for_door = 1
+ else:
+ wall_for_door = self._rand_int(1, 4)
+
+ if wall_for_door < 2:
+ w = self._rand_int(1, width-1)
+ h = height-1 if wall_for_door == 0 else 0
+ else:
+ w = width-1 if wall_for_door == 3 else 0
+ h = self._rand_int(1, height-1)
+
+ assert h != height-1 # door mustn't be on the bottom wall
+
+ self.door_pos = (w, h)
+ self.door = Door(door_color, is_locked=True)
+ self.grid.set(*self.door_pos, self.door)
+
+ if self.stump:
+ self.stump_pos = (w, h+2)
+ self.stump_obj = Wall()
+ self.grid.set(*self.stump_pos, self.stump_obj)
+
+ # sample password
+ if self.hard_password:
+ self.password = np.array([self._rand_int(0, 2) for _ in range(self.n_switches)])
+
+ else:
+ idx = self._rand_int(0, self.n_switches)
+ self.password = np.zeros(self.n_switches)
+ self.password[idx] = 1.0
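+        # easy mode: exactly one switch must be on; hard_password: any binary pattern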
+
+ # add the switches
+ self.switches = []
+ self.switches_pos = []
+ for i in range(self.n_switches):
+ c = COLOR_NAMES[i]
+ pos = np.array([i+1, height-1])
+ sw = Switch(c, is_on=bool(self.password[i]) if self.augmentation else False, no_light=self.no_light)
+ self.grid.set(*pos, sw)
+ self.switches.append(sw)
+ self.switches_pos.append(pos)
+
+ # Set a randomly coloured Dancer NPC
+ color = self._rand_elem(COLOR_NAMES)
+
+ if not self.augmentation:
+ self.peer = DemonstratingPeer(color, "Jim", self, knowledgeable=self.knowledgeable)
+
+ # height -2 so its not in front of the buttons in the way
+ peer_pos = np.array((self._rand_int(1, width - 1), self._rand_int(1, height - 2)))
+
+ self.grid.set(*peer_pos, self.peer)
+ self.peer.init_pos = peer_pos
+ self.peer.cur_pos = peer_pos
+
+ # Randomize the agent's start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Generate the mission string
+ self.mission = 'exit the room'
+
+ # Dummy beginning string
+ self.beginning_string = "This is what you hear. \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ # used for rendering
+ self.conversation = self.utterance
+ self.outcome_info = None
+
+ def update_door_lock(self):
+ if self.augmentation and self.step_count <= 10:
+ self.door.is_locked = True
+ self.door.is_open = False
+ else:
+ if np.array_equal(self.get_selected_password(), self.password):
+ self.door.is_locked = False
+ else:
+ self.door.is_locked = True
+ self.door.is_open = False
+
+ def step(self, action):
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ obs, reward, done, info = super().step(p_action)
+ self.update_door_lock()
+ # print("pass:", self.password)
+ # print("selected pass:", self.get_selected_password())
+
+ if self.augmentation and self.step_count == 10:
+ # reset switches door
+ for s in self.switches:
+ s.is_on = False
+
+ # update door
+ self.update_door_lock()
+
+ if p_action == self.actions.done:
+ done = True
+
+ if not self.augmentation:
+ peer_reply = self.peer.step()
+
+ if peer_reply is not None:
+ self.utterance += "{}: {} \n".format(self.peer.name, peer_reply)
+ self.conversation += "{}: {} \n".format(self.peer.name, peer_reply)
+
+ if all(self.agent_pos == self.door_pos):
+ done = True
+ if not self.augmentation:
+ if self.peer.exited:
+ # only give reward if both exited
+ reward = self._reward()
+ else:
+ reward = self._reward()
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ if self.hidden_npc:
+ # all npc are hidden
+ assert np.argwhere(obs['image'][:,:,0] == OBJECT_TO_IDX['npc']).size == 0
+ if not self.augmentation:
+ assert "{}:".format(self.peer.name) not in self.utterance
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ if done:
+ if reward > 0:
+ self.outcome_info = "SUCCESS: agent got {} reward \n".format(np.round(reward, 1))
+ else:
+ self.outcome_info = "FAILURE: agent got {} reward \n".format(reward)
+
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+ self.window.clear_text() # erase previous text
+ self.window.set_caption(self.conversation)
+ sw_color = self.switches[np.argmax(self.password)].color
+ self.window.ax.set_title("correct switch: {}".format(sw_color), loc="left", fontsize=10)
+ if self.outcome_info:
+ color = None
+ if "SUCCESS" in self.outcome_info:
+ color = "lime"
+ elif "FAILURE" in self.outcome_info:
+ color = "red"
+ self.window.add_text(*(0.01, 0.85, self.outcome_info),
+ **{'fontsize':15, 'color':color, 'weight':"bold"})
+
+ self.window.show_img(obs) # re-draw image to add changes to window
+ return obs
+
+
+## 100 Demonstrating
+# register(
+# id='MiniGrid-DemonstrationNoLightNoTurnOff100-8x8-v0',
+# entry_point='gym_minigrid.envs:DemonstrationNoLightNoTurnOff1008x8Env'
+# )
+#class Demonstration100TwoSwitches8x8Env(DemonstrationEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=100, n_switches=2)
+#
+#class Demonstration100TwoSwitchesHard8x8Env(DemonstrationEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=100, n_switches=2, hard_password=True)
+#
+## 100 AUG Demonstrating
+#class AugmentationDemonstration100TwoSwitches8x8Env(DemonstrationEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=100, n_switches=2, augmentation=True)
+#
+#class AugmentationDemonstration100TwoSwitchesHard8x8Env(DemonstrationEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=100, n_switches=2, hard_password=True, augmentation=True)
+#
+#
+## Three switches
+## 100 Demonstrating
+#class Demonstration1008x8Env(DemonstrationEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=100)
+#
+#class Demonstration100Hard8x8Env(DemonstrationEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=100, hard_password=True)
+#
+## 100 AUG Demonstrating
+#class AugmentationDemonstration1008x8Env(DemonstrationEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=100, augmentation=True)
+#
+#class AugmentationDemonstration100Hard8x8Env(DemonstrationEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=100, hard_password=True, augmentation=True)
+#
+## No turn off
+## 100 Demonstrating: No light, no turn off
+#
+#class DemonstrationNoLightNoTurnOff100Hard8x8Env(DemonstrationEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=100, no_turn_off=True, hard_password=True, no_light=True)
+#
+## 100 no turn off
+#class DemonstrationNoTurnOff1008x8Env(DemonstrationEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=100, no_turn_off=True)
+#
+#class DemonstrationNoTurnOff100Hard8x8Env(DemonstrationEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=100, no_turn_off=True, hard_password=True)
+#
+## 100 AUG Demonstrating
+#
+#class AugmentationDemonstrationNoTurnOff100Hard8x8Env(DemonstrationEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=100, no_turn_off=True, hard_password=True, augmentation=True)
+
+
+## demonstrating 100 steps
+#register(
+# id='MiniGrid-Demonstration100TwoSwitches-8x8-v0',
+# entry_point='gym_minigrid.envs:Demonstration100TwoSwitches8x8Env'
+#)
+#register(
+# id='MiniGrid-Demonstration100TwoSwitchesHard-8x8-v0',
+# entry_point='gym_minigrid.envs:Demonstration100TwoSwitchesHard8x8Env'
+#)
+#
+## AUG demonstrating 100 steps
+#register(
+# id='MiniGrid-AugmentationDemonstration100TwoSwitches-8x8-v0',
+# entry_point='gym_minigrid.envs:AugmentationDemonstration100TwoSwitches8x8Env'
+#)
+#register(
+# id='MiniGrid-AugmentationDemonstration100TwoSwitchesHard-8x8-v0',
+# entry_point='gym_minigrid.envs:AugmentationDemonstration100TwoSwitchesHard8x8Env'
+#)
+#
+## three switches
+#
+## demonstrating 100 steps
+#register(
+# id='MiniGrid-Demonstration100-8x8-v0',
+# entry_point='gym_minigrid.envs:Demonstration1008x8Env'
+#)
+#register(
+# id='MiniGrid-Demonstration100Hard-8x8-v0',
+# entry_point='gym_minigrid.envs:Demonstration100Hard8x8Env'
+#)
+#
+## AUG demonstrating 100 steps
+#register(
+# id='MiniGrid-AugmentationDemonstration100-8x8-v0',
+# entry_point='gym_minigrid.envs:AugmentationDemonstration1008x8Env'
+#)
+#register(
+# id='MiniGrid-AugmentationDemonstration100Hard-8x8-v0',
+# entry_point='gym_minigrid.envs:AugmentationDemonstration100Hard8x8Env'
+#)
+#
+## no turn off three switches
+#
+## demonstrating 100 steps
+#register(
+# id='MiniGrid-DemonstrationNoTurnOff100-8x8-v0',
+# entry_point='gym_minigrid.envs:DemonstrationNoTurnOff1008x8Env'
+#)
+#register(
+# id='MiniGrid-DemonstrationNoTurnOff100Hard-8x8-v0',
+# entry_point='gym_minigrid.envs:DemonstrationNoTurnOff100Hard8x8Env'
+#)
+#
+## demonstrating 100 steps no light
+#register(
+# id='MiniGrid-DemonstrationNoLightNoTurnOff100-8x8-v0',
+# entry_point='gym_minigrid.envs:DemonstrationNoLightNoTurnOff1008x8Env'
+#)
+#register(
+# id='MiniGrid-DemonstrationNoLightNoTurnOff100Hard-8x8-v0',
+# entry_point='gym_minigrid.envs:DemonstrationNoLightNoTurnOff100Hard8x8Env'
+#)
+#
+## AUG demonstrating 100 steps
+#register(
+# id='MiniGrid-AugmentationDemonstrationNoTurnOff100-8x8-v0',
+# entry_point='gym_minigrid.envs:AugmentationDemonstrationNoTurnOff1008x8Env'
+#)
+#register(
+# id='MiniGrid-AugmentationDemonstrationNoTurnOff100Hard-8x8-v0',
+# entry_point='gym_minigrid.envs:AugmentationDemonstrationNoTurnOff100Hard8x8Env'
+#)
+#
+# class DemonstrationNoLightNoTurnOff1008x8Env(DemonstrationEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=100, no_turn_off=True, no_light=True)
+#
+# class AugmentationDemonstrationNoTurnOff1008x8Env(DemonstrationEnv):
+# def __init__(self):
+# super().__init__(size=8, knowledgeable=True, max_steps=100, no_turn_off=True, augmentation=True)
+
+class ShowMe8x8Env(DemonstrationEnv):
+ def __init__(self, **kwargs):
+ super().__init__(size=8, knowledgeable=True, max_steps=100, no_turn_off=True, no_light=True, **kwargs)
+
+class ShowMeNoSocial8x8Env(DemonstrationEnv):
+ def __init__(self, **kwargs):
+ super().__init__(size=8, knowledgeable=True, max_steps=100, no_turn_off=True, augmentation=True, **kwargs)
+
+
+# ShowMe environments (social and augmented no-social variants)
+register(
+ id='MiniGrid-ShowMeNoSocial-8x8-v0',
+ entry_point='gym_minigrid.envs:ShowMeNoSocial8x8Env'
+)
+register(
+ id='MiniGrid-ShowMe-8x8-v0',
+ entry_point='gym_minigrid.envs:ShowMe8x8Env'
+)
diff --git a/gym-minigrid/gym_minigrid/backup_envs/socialenv.py b/gym-minigrid/gym_minigrid/backup_envs/socialenv.py
new file mode 100644
index 0000000000000000000000000000000000000000..e76f7b459f8ae5be19da70bbe4b361fe4349ae4b
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/socialenv.py
@@ -0,0 +1,194 @@
+from itertools import chain
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+from gym_minigrid.envs import DanceWithOneNPC8x8Env, CoinThief8x8Env, TalkItOutPolite8x8Env, ShowMe8x8Env, \
+ DiverseExit8x8Env, Exiter8x8Env, Helper8x8Env
+from gym_minigrid.envs import DanceWithOneNPCGrammar, CoinThiefGrammar, TalkItOutPoliteGrammar, DemonstrationGrammar, \
+ EasyTeachingGamesGrammar, ExiterGrammar
+import time
+from collections import deque
+
+
+class SocialEnvMetaGrammar(object):
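+    """
+    Concatenates the grammars of all sub-environments into a single meta grammar
+    and keeps the mapping from meta indices back to each environment's own indices.
+    """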
+
+ def __init__(self, grammar_list, env_list):
+ self.templates = []
+ self.things = []
+ self.original_template_idx = []
+ self.original_thing_idx = []
+
+ self.meta_template_idx_to_env_name = {}
+ self.meta_thing_idx_to_env_name = {}
+ self.template_idx, self.thing_idx = 0, 0
+ env_names = [e.__class__.__name__ for e in env_list]
+
+ for g, env_name in zip(grammar_list, env_names):
+ # add templates
+ self.templates += g.templates
+ # add things
+ self.things += g.things
+
+ # save original idx for both
+ self.original_template_idx += list(range(0, len(g.templates)))
+ self.original_thing_idx += list(range(0, len(g.things)))
+
+ # update meta_idx to env_names dictionaries
+ self.meta_template_idx_to_env_name.update(dict.fromkeys(list(range(self.template_idx,
+ self.template_idx + len(g.templates))),
+ env_name))
+ self.template_idx += len(g.templates)
+
+ self.meta_thing_idx_to_env_name.update(dict.fromkeys(list(range(self.thing_idx,
+ self.thing_idx + len(g.things))),
+ env_name))
+ self.thing_idx += len(g.things)
+
+ self.grammar_action_space = spaces.MultiDiscrete([len(self.templates), len(self.things)])
+
+    def construct_utterance(self, action):
+        return self.templates[int(action[0])] + " " + self.things[int(action[1])] + " "
+
+    def random_utterance(self):
+        return np.random.choice(self.templates) + " " + np.random.choice(self.things) + " "
+
+ def construct_original_action(self, action, current_env_name):
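+        # map the meta-grammar indices back to the current sub-environment's grammar;
+        # if either index belongs to another environment's grammar, return a no-op (NaN) utterance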
+ template_env_name = self.meta_template_idx_to_env_name[int(action[0])]
+ thing_env_name = self.meta_thing_idx_to_env_name[int(action[1])]
+
+ if template_env_name == current_env_name and thing_env_name == current_env_name:
+ original_action = [self.original_template_idx[int(action[0])], self.original_thing_idx[int(action[1])]]
+ else:
+ original_action = [np.nan, np.nan]
+ return original_action
+
+
+class SocialEnv(gym.Env):
+ """
+    Meta-Environment containing all the other environments (multi-task learning)
+ """
+
+ def __init__(
+ self,
+ size=8,
+ hidden_npc=False,
+        is_test_env=False
+    ):
+
+ # Number of cells (width and height) in the agent view
+ self.agent_view_size = 7
+
+ # Number of object dimensions (i.e. number of channels in symbolic image)
+ self.nb_obj_dims = 4
+
+ # Observations are dictionaries containing an
+ # encoding of the grid and a textual 'mission' string
+ self.observation_space = spaces.Box(
+ low=0,
+ high=255,
+ shape=(self.agent_view_size, self.agent_view_size, self.nb_obj_dims),
+ dtype='uint8'
+ )
+ self.observation_space = spaces.Dict({
+ 'image': self.observation_space
+ })
+
+ self.hidden_npc = hidden_npc # TODO: implement hidden npc
+
+ # TODO get max step from env list
+
+ self.env_list = [DanceWithOneNPC8x8Env, CoinThief8x8Env, TalkItOutPolite8x8Env, ShowMe8x8Env, DiverseExit8x8Env,
+ Exiter8x8Env]
+ self.all_npc_utterance_actions = sorted(list(set(chain(*[e.all_npc_utterance_actions for e in self.env_list]))))
+ self.grammar_list = [DanceWithOneNPCGrammar, CoinThiefGrammar, TalkItOutPoliteGrammar, DemonstrationGrammar,
+ EasyTeachingGamesGrammar, ExiterGrammar]
+
+ if is_test_env:
+ self.env_list[-1] = Helper8x8Env
+
+        # instantiate all envs
+ self.env_list = [env() for env in self.env_list]
+
+ self.current_env = None
+
+ self.metaGrammar = SocialEnvMetaGrammar(self.grammar_list, self.env_list)
+
+ # Actions are discrete integer values
+ self.action_space = spaces.MultiDiscrete([len(MiniGridEnv.Actions),
+ *self.metaGrammar.grammar_action_space.nvec])
+ self.actions = MiniGridEnv.Actions
+
+ self._window = None
+
+ def reset(self):
+ # select a new social environment at random, for each new episode
+
+ old_window = None
+ if self.current_env: # a previous env exists, save old window
+ old_window = self.current_env.window
+
+ # sample new environment
+ self.current_env = np.random.choice(self.env_list)
+ obs = self.current_env.reset()
+
+ # carry on window if this env is not the first
+ if old_window:
+ self.current_env.window = old_window
+ return obs
+
+ def seed(self, seed=1337):
+ # Seed the random number generator
+ for env in self.env_list:
+ env.seed(seed)
+ np.random.seed(seed)
+ return [seed]
+
+ def step(self, action):
+ assert (self.current_env)
+ if len(action) == 1: # agent cannot speak
+ utterance_action = [np.nan, np.nan]
+ else:
+ utterance_action = action[1:]
+
+        if len(action) >= 1 and not all(np.isnan(utterance_action)):  # if agent speaks, construct the env-specific action
+ action[1:] = self.metaGrammar.construct_original_action(action[1:], self.current_env.__class__.__name__)
+
+ return self.current_env.step(action)
+
+ @property
+ def window(self):
+ return self.current_env.window
+
+ @window.setter
+ def window(self, value):
+ self.current_env.window = value
+
+ def render(self, *args, **kwargs):
+ assert self.current_env
+ return self.current_env.render(*args, **kwargs)
+
+ @property
+ def step_count(self):
+ return self.current_env.step_count
+
+ def get_mission(self):
+ return self.current_env.get_mission()
+
+
+class SocialEnv8x8Env(SocialEnv):
+ def __init__(self, **kwargs):
+ super().__init__(size=8, **kwargs)
+
+
+register(
+ id='MiniGrid-SocialEnv-5x5-v0',
+ entry_point='gym_minigrid.envs:SocialEnvEnv'
+)
+
+register(
+ id='MiniGrid-SocialEnv-8x8-v0',
+ entry_point='gym_minigrid.envs:SocialEnv8x8Env'
+)
diff --git a/gym-minigrid/gym_minigrid/backup_envs/spying.py b/gym-minigrid/gym_minigrid/backup_envs/spying.py
new file mode 100644
index 0000000000000000000000000000000000000000..31fe5d6fa1cf339a8f52d7cf3c37d8d8d18b9647
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/spying.py
@@ -0,0 +1,429 @@
+import numpy as np
+
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+import time
+from collections import deque
+
+
+class Peer(NPC):
+ """
+    A peer NPC that, when knowledgeable, sets the switches correctly and exits through the door
+ """
+
+ def __init__(self, color, name, env, knowledgeable=False):
+ super().__init__(color)
+ self.name = name
+ self.npc_dir = 1 # NPC initially looks downward
+ self.npc_type = 0
+ self.env = env
+ self.knowledgeable = knowledgeable
+ self.npc_actions = []
+ self.dancing_step_idx = 0
+ self.actions = MiniGridEnv.Actions
+ self.add_npc_direction = True
+ self.available_moves = [self.rotate_left, self.rotate_right, self.go_forward, self.toggle_action]
+ self.exited = False
+
+ def step(self):
+ if self.exited:
+ return
+
+ if all(np.array(self.cur_pos) == np.array(self.env.door_pos)):
+ # disappear
+            self.env.grid.set(*self.cur_pos, self.env.door)
+            self.cur_pos = np.array([np.nan, np.nan])
+
+            # close the door
+            self.env.door.toggle(self.env, self.cur_pos)
+
+            # reset the switches
+ for s in self.env.switches:
+ s.is_on = False
+
+ # update door
+ self.env.update_door_lock()
+
+ self.exited = True
+
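+        # a knowledgeable peer first toggles the first switch whose state differs from the
+        # password; once the door is unlocked it walks to the door, opens it and steps through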
+ elif self.knowledgeable:
+
+            if self.env.door.is_locked:
+ first_wrong_id = np.where(self.env.get_selected_password() != self.env.password)[0][0]
+ print("first_wrong_id:", first_wrong_id)
+ goal_pos = self.env.switches_pos[first_wrong_id]
+ act = self.path_to_toggle_pos(goal_pos)
+ act()
+
+ else:
+                if all(self.front_pos == self.env.door_pos) and self.env.door.is_open:
+ self.go_forward()
+
+ else:
+ act = self.path_to_toggle_pos(self.env.door_pos)
+ act()
+
+ else:
+ self.env._rand_elem(self.available_moves)()
+
+ self.env.update_door_lock()
+
+
+class SpyingGrammar(object):
+
+ templates = ["Move your", "Shake your"]
+ things = ["body", "head"]
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+
+class SpyingEnv(MultiModalMiniGridEnv):
+ """
+ Environment in which the agent is instructed to go to a given object
+ named using an English text string
+ """
+
+ def __init__(
+ self,
+ size=5,
+ diminished_reward=True,
+ step_penalty=False,
+ knowledgeable=False,
+ hard_password=False,
+ max_steps=None,
+ n_switches=3
+ ):
+ assert size >= 5
+ self.empty_symbol = "NA \n"
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+ self.knowledgeable = knowledgeable
+ self.hard_password = hard_password
+ self.n_switches = n_switches
+
+ super().__init__(
+ grid_size=size,
+ max_steps=max_steps or 5*size**2,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *SpyingGrammar.grammar_action_space.nvec
+ ]),
+ add_npc_direction=True
+ )
+
+ print({
+ "size": size,
+ "diminished_reward": diminished_reward,
+ "step_penalty": step_penalty,
+ })
+
+ def get_selected_password(self):
+ return np.array([int(s.is_on) for s in self.switches])
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height, nb_obj_dims=4)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ self.wall_x = width - 1
+ self.wall_y = height - 1
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ door_color = self._rand_elem(COLOR_NAMES)
+
+ wall_for_door = self._rand_int(1, 4)
+
+ if wall_for_door < 2:
+ w = self._rand_int(1, width-1)
+ h = height-1 if wall_for_door == 0 else 0
+ else:
+ w = width-1 if wall_for_door == 3 else 0
+ h = self._rand_int(1, height-1)
+
+ assert h != height-1 # door mustn't be on the bottom wall
+
+ self.door_pos = (w, h)
+ self.door = Door(door_color, is_locked=True)
+ self.grid.set(*self.door_pos, self.door)
+
+ # add the switches
+ self.switches = []
+ self.switches_pos = []
+ for i in range(self.n_switches):
+ c = COLOR_NAMES[i]
+ pos = np.array([i+1, height-1])
+ sw = Switch(c)
+ self.grid.set(*pos, sw)
+ self.switches.append(sw)
+ self.switches_pos.append(pos)
+
+ # sample password
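+        # hard: any on/off combination of the switches; easy: exactly one switch must be on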
+ if self.hard_password:
+ self.password = np.array([self._rand_int(0, 2) for _ in range(self.n_switches)])
+
+ else:
+ idx = self._rand_int(0, self.n_switches)
+ self.password = np.zeros(self.n_switches)
+ self.password[idx] = 1.0
+
+        # Set a randomly coloured Peer NPC
+        color = self._rand_elem(COLOR_NAMES)
+        self.peer = Peer(color, "Jim", self, knowledgeable=self.knowledgeable)
+
+        # Place it at a random position inside the room
+ peer_pos = np.array((self._rand_int(1, width - 1), self._rand_int(1, height - 1)))
+
+ self.grid.set(*peer_pos, self.peer)
+ self.peer.init_pos = peer_pos
+ self.peer.cur_pos = peer_pos
+
+ # Randomize the agent's start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Generate the mission string
+ self.mission = 'exit the room'
+
+ # Dummy beginning string
+ self.beginning_string = "This is what you hear. \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ # used for rendering
+ self.conversation = self.utterance
+
+ def update_door_lock(self):
+ if np.array_equal(self.get_selected_password(), self.password):
+ self.door.is_locked = False
+ else:
+ self.door.is_locked = True
+ self.door.is_open = False
+
+ def step(self, action):
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ obs, reward, done, info = super().step(p_action)
+ self.update_door_lock()
+
+ print("pass:", self.password)
+
+ if p_action == self.actions.done:
+ done = True
+
+ self.peer.step()
+
+ if all(self.agent_pos == self.door_pos):
+ done = True
+ if self.peer.exited:
+                # only give a reward if the peer exited as well
+ reward = self._reward()
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+ print("conversation:\n", self.conversation)
+ print("utterance_history:\n", self.utterance_history)
+ self.window.set_caption(self.conversation, [self.peer.name])
+ return obs
+
+
+class Spying8x8Env(SpyingEnv):
+ def __init__(self):
+ super().__init__(size=8)
+
+
+class Spying6x6Env(SpyingEnv):
+ def __init__(self):
+ super().__init__(size=6)
+
+
+# knowledgeable
+class SpyingKnowledgeableEnv(SpyingEnv):
+ def __init__(self):
+ super().__init__(size=5, knowledgeable=True)
+
+class SpyingKnowledgeable6x6Env(SpyingEnv):
+ def __init__(self):
+ super().__init__(size=6, knowledgeable=True)
+
+class SpyingKnowledgeable8x8Env(SpyingEnv):
+ def __init__(self):
+ super().__init__(size=8, knowledgeable=True)
+
+class SpyingKnowledgeableHardPassword8x8Env(SpyingEnv):
+ def __init__(self):
+ super().__init__(size=8, knowledgeable=True, hard_password=True)
+
+class Spying508x8Env(SpyingEnv):
+ def __init__(self):
+ super().__init__(size=8, max_steps=50)
+
+class SpyingKnowledgeable508x8Env(SpyingEnv):
+ def __init__(self):
+ super().__init__(size=8, knowledgeable=True, max_steps=50)
+
+class SpyingKnowledgeableHardPassword508x8Env(SpyingEnv):
+ def __init__(self):
+ super().__init__(size=8, knowledgeable=True, hard_password=True, max_steps=50)
+
+class SpyingKnowledgeable1008x8Env(SpyingEnv):
+ def __init__(self):
+ super().__init__(size=8, knowledgeable=True, max_steps=100)
+
+class SpyingKnowledgeable100OneSwitch8x8Env(SpyingEnv):
+ def __init__(self):
+ super().__init__(size=8, knowledgeable=True, max_steps=100, n_switches=1)
+
+class SpyingKnowledgeable50OneSwitch5x5Env(SpyingEnv):
+ def __init__(self):
+ super().__init__(size=5, knowledgeable=True, max_steps=50, n_switches=1)
+
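+# assumed definition: the unknowledgeable (knowledgeable=False) variant referenced by the
+# 'MiniGrid-SpyingUnknowledgeable50OneSwitch-5x5-v0' registration below
+class SpyingUnknowledgeable50OneSwitch5x5Env(SpyingEnv):
+    def __init__(self):
+        super().__init__(size=5, knowledgeable=False, max_steps=50, n_switches=1)
+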
+
+class SpyingKnowledgeable505x5Env(SpyingEnv):
+ def __init__(self):
+ super().__init__(size=5, knowledgeable=True, max_steps=50, n_switches=3)
+
+class SpyingKnowledgeable50TwoSwitches8x8Env(SpyingEnv):
+ def __init__(self):
+ super().__init__(size=8, knowledgeable=True, max_steps=50, n_switches=2)
+
+class SpyingKnowledgeable50TwoSwitchesHard8x8Env(SpyingEnv):
+ def __init__(self):
+ super().__init__(size=8, knowledgeable=True, max_steps=50, n_switches=2, hard_password=True)
+
+
+class SpyingKnowledgeable100TwoSwitches8x8Env(SpyingEnv):
+ def __init__(self):
+ super().__init__(size=8, knowledgeable=True, max_steps=100, n_switches=2)
+
+class SpyingKnowledgeable100TwoSwitchesHard8x8Env(SpyingEnv):
+ def __init__(self):
+ super().__init__(size=8, knowledgeable=True, max_steps=100, n_switches=2, hard_password=True)
+
+
+
+
+register(
+ id='MiniGrid-Spying-5x5-v0',
+ entry_point='gym_minigrid.envs:SpyingEnv'
+)
+
+register(
+ id='MiniGrid-Spying-6x6-v0',
+ entry_point='gym_minigrid.envs:Spying6x6Env'
+)
+
+register(
+ id='MiniGrid-Spying-8x8-v0',
+ entry_point='gym_minigrid.envs:Spying8x8Env'
+)
+
+register(
+ id='MiniGrid-SpyingKnowledgeable-5x5-v0',
+ entry_point='gym_minigrid.envs:SpyingKnowledgeableEnv'
+)
+
+register(
+ id='MiniGrid-SpyingKnowledgeable-6x6-v0',
+ entry_point='gym_minigrid.envs:SpyingKnowledgeable6x6Env'
+)
+
+register(
+ id='MiniGrid-SpyingKnowledgeable-8x8-v0',
+ entry_point='gym_minigrid.envs:SpyingKnowledgeable8x8Env'
+)
+
+register(
+ id='MiniGrid-SpyingKnowledgeableHardPassword-8x8-v0',
+ entry_point='gym_minigrid.envs:SpyingKnowledgeableHardPassword8x8Env'
+)
+
+# max len 50
+register(
+ id='MiniGrid-Spying50-8x8-v0',
+ entry_point='gym_minigrid.envs:Spying508x8Env'
+)
+
+register(
+ id='MiniGrid-SpyingKnowledgeable50-8x8-v0',
+ entry_point='gym_minigrid.envs:SpyingKnowledgeable508x8Env'
+)
+
+register(
+ id='MiniGrid-SpyingKnowledgeableHardPassword50-8x8-v0',
+ entry_point='gym_minigrid.envs:SpyingKnowledgeableHardPassword508x8Env'
+)
+
+# max len 100
+register(
+ id='MiniGrid-SpyingKnowledgeable100-8x8-v0',
+ entry_point='gym_minigrid.envs:SpyingKnowledgeable1008x8Env'
+)
+
+# max len OneSwitch
+register(
+ id='MiniGrid-SpyingKnowledgeable100OneSwitch-8x8-v0',
+ entry_point='gym_minigrid.envs:SpyingKnowledgeable100OneSwitch8x8Env'
+)
+
+register(
+ id='MiniGrid-SpyingKnowledgeable50OneSwitch-5x5-v0',
+ entry_point='gym_minigrid.envs:SpyingKnowledgeable50OneSwitch5x5Env'
+)
+
+register(
+ id='MiniGrid-SpyingUnknowledgeable50OneSwitch-5x5-v0',
+ entry_point='gym_minigrid.envs:SpyingUnknowledgeable50OneSwitch5x5Env'
+)
+
+register(
+ id='MiniGrid-SpyingKnowledgeable50-5x5-v0',
+ entry_point='gym_minigrid.envs:SpyingKnowledgeable505x5Env'
+)
+
+register(
+ id='MiniGrid-SpyingKnowledgeable50TwoSwitches-8x8-v0',
+ entry_point='gym_minigrid.envs:SpyingKnowledgeable50TwoSwitches8x8Env'
+)
+register(
+ id='MiniGrid-SpyingKnowledgeable50TwoSwitchesHard-8x8-v0',
+ entry_point='gym_minigrid.envs:SpyingKnowledgeable50TwoSwitchesHard8x8Env'
+)
+register(
+ id='MiniGrid-SpyingKnowledgeable100TwoSwitches-8x8-v0',
+ entry_point='gym_minigrid.envs:SpyingKnowledgeable100TwoSwitches8x8Env'
+)
+register(
+ id='MiniGrid-SpyingKnowledgeable100TwoSwitchesHard-8x8-v0',
+ entry_point='gym_minigrid.envs:SpyingKnowledgeable100TwoSwitchesHard8x8Env'
+)
diff --git a/gym-minigrid/gym_minigrid/backup_envs/talkitout.py b/gym-minigrid/gym_minigrid/backup_envs/talkitout.py
new file mode 100644
index 0000000000000000000000000000000000000000..2256d3ca03603b141cab34e350019af7edffa0ed
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/talkitout.py
@@ -0,0 +1,385 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+
+class Wizard(NPC):
+ """
+ A simple NPC that knows who is telling the truth
+ """
+
+ def __init__(self, color, name, env):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+ self.npc_dir = 1 # NPC initially looks downward
+ # todo: this should be id == name
+ self.npc_type = 0 # this will be put into the encoding
+
+ def listen(self, utterance):
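+        # "Where is the exit" -> point the agent to the truthful guide (by name, or by colour when nameless)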
+ if utterance == TalkItOutGrammar.construct_utterance([0, 1]):
+ if self.env.nameless:
+ return "Ask the {} guide.".format(self.env.true_guide.color)
+ else:
+ return "Ask {}.".format(self.env.true_guide.name)
+
+ return None
+
+
+class Guide(NPC):
+ """
+ A simple NPC that knows the correct door.
+ """
+
+ def __init__(self, color, name, env, liar=False):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+ self.liar = liar
+ self.npc_dir = 1 # NPC initially looks downward
+ # todo: this should be id == name
+ self.npc_type = 1 # this will be put into the encoding
+
+ # Select a random target object as mission
+ obj_idx = self.env._rand_int(0, len(self.env.door_pos))
+ self.target_pos = self.env.door_pos[obj_idx]
+ self.target_color = self.env.door_colors[obj_idx]
+
+ def listen(self, utterance):
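+        # "Where is the exit" -> a lying guide names a random wrong door colour, a truthful guide repeats the mission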
+ if utterance == TalkItOutGrammar.construct_utterance([0, 1]):
+ if self.liar:
+ fake_colors = [c for c in self.env.door_colors if c != self.env.target_color]
+ fake_color = self.env._rand_elem(fake_colors)
+
+ # Generate the mission string
+ assert fake_color != self.env.target_color
+ return 'go to the %s door' % fake_color
+
+ else:
+ return self.env.mission
+
+ return None
+
+ def render(self, img):
+ c = COLORS[self.color]
+
+ npc_shapes = []
+ # Draw eyes
+ npc_shapes.append(point_in_circle(cx=0.70, cy=0.50, r=0.10))
+ npc_shapes.append(point_in_circle(cx=0.30, cy=0.50, r=0.10))
+
+ # Draw mouth
+ npc_shapes.append(point_in_rect(0.20, 0.80, 0.72, 0.81))
+
+ # todo: move this to super function
+ # todo: super.render should be able to take the npc_shapes and then rotate them
+
+ if hasattr(self, "npc_dir"):
+ # Pre-rotation to ensure npc_dir = 1 means NPC looks downwards
+ npc_shapes = [rotate_fn(v, cx=0.5, cy=0.5, theta=-1*(math.pi / 2)) for v in npc_shapes]
+ # Rotate npc based on its direction
+ npc_shapes = [rotate_fn(v, cx=0.5, cy=0.5, theta=(math.pi/2) * self.npc_dir) for v in npc_shapes]
+
+ # Draw shapes
+ for v in npc_shapes:
+ fill_coords(img, v, c)
+
+
+class TalkItOutGrammar(object):
+
+ templates = ["Where is", "Open", "Close", "What is"]
+ things = [
+ "sesame", "the exit", "the wall", "the floor", "the ceiling", "the window", "the entrance", "the closet",
+ "the drawer", "the fridge", "oven", "the lamp", "the trash can", "the chair", "the bed", "the sofa"
+ ]
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+
+class TalkItOutEnv(MultiModalMiniGridEnv):
+ """
+ Environment in which the agent is instructed to go to a given object
+ named using an English text string
+ """
+
+ def __init__(
+ self,
+ size=5,
+ hear_yourself=False,
+ diminished_reward=True,
+ step_penalty=False,
+ nameless=False,
+ ):
+ assert size >= 5
+ self.empty_symbol = "NA \n"
+ self.hear_yourself = hear_yourself
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+ self.nameless = nameless
+
+ super().__init__(
+ grid_size=size,
+ max_steps=5*size**2,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *TalkItOutGrammar.grammar_action_space.nvec
+ ]),
+ add_npc_direction=True
+ )
+
+ print({
+ "size": size,
+ "hear_yourself": hear_yourself,
+ "diminished_reward": diminished_reward,
+ "step_penalty": step_penalty,
+ })
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height, nb_obj_dims=4)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Generate the 4 doors at random positions
+ self.door_pos = []
+ self.door_front_pos = [] # Remembers positions in front of door to avoid setting wizard here
+
+ self.door_pos.append((self._rand_int(2, width-2), 0))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1]+1))
+
+ self.door_pos.append((self._rand_int(2, width-2), height-1))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1] - 1))
+
+ self.door_pos.append((0, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] + 1, self.door_pos[-1][1]))
+
+ self.door_pos.append((width-1, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] - 1, self.door_pos[-1][1]))
+
+ # Generate the door colors
+ self.door_colors = []
+ while len(self.door_colors) < len(self.door_pos):
+ color = self._rand_elem(COLOR_NAMES)
+ if color in self.door_colors:
+ continue
+ self.door_colors.append(color)
+
+ # Place the doors in the grid
+ for idx, pos in enumerate(self.door_pos):
+ color = self.door_colors[idx]
+ self.grid.set(*pos, Door(color))
+
+
+ # Set a randomly coloured WIZARD at a random position
+ color = self._rand_elem(COLOR_NAMES)
+ self.wizard = Wizard(color, "Gandalf", self)
+
+ # Place it randomly, omitting front of door positions
+ self.place_obj(self.wizard,
+ size=(width, height),
+ reject_fn=lambda _, p: tuple(p) in self.door_front_pos)
+
+ # add guides
+ GUIDE_NAMES = ["John", "Jack"]
+
+ # Set a randomly coloured TRUE GUIDE at a random position
+ name = self._rand_elem(GUIDE_NAMES)
+ color = self._rand_elem(COLOR_NAMES)
+ self.true_guide = Guide(color, name, self, liar=False)
+
+ # Place it randomly, omitting invalid positions
+ self.place_obj(self.true_guide,
+ size=(width, height),
+ # reject_fn=lambda _, p: tuple(p) in self.door_front_pos)
+ reject_fn=lambda _, p: tuple(p) in [*self.door_front_pos, tuple(self.wizard.cur_pos)])
+
+ # Set a randomly coloured FALSE GUIDE at a random position
+ name = self._rand_elem([n for n in GUIDE_NAMES if n != self.true_guide.name])
+
+ if self.nameless:
+ color = self._rand_elem([c for c in COLOR_NAMES if c != self.true_guide.color])
+ else:
+ color = self._rand_elem(COLOR_NAMES)
+
+ self.false_guide = Guide(color, name, self, liar=True)
+
+ # Place it randomly, omitting invalid positions
+ self.place_obj(self.false_guide,
+ size=(width, height),
+ reject_fn=lambda _, p: tuple(p) in [
+ *self.door_front_pos, tuple(self.wizard.cur_pos), tuple(self.true_guide.cur_pos)])
+ assert self.true_guide.name != self.false_guide.name
+
+ # Randomize the agent's start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Select a random target door
+ self.doorIdx = self._rand_int(0, len(self.door_pos))
+ self.target_pos = self.door_pos[self.doorIdx]
+ self.target_color = self.door_colors[self.doorIdx]
+
+ # Generate the mission string
+ self.mission = 'go to the %s door' % self.target_color
+
+ # Dummy beginning string
+ self.beginning_string = "This is what you hear. \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ # used for rendering
+ self.conversation = self.utterance
+
+ def step(self, action):
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ # assert all nan or neither nan
+ assert len(set(np.isnan(utterance_action))) == 1
+
+ speak_flag = not all(np.isnan(utterance_action))
+
+ obs, reward, done, info = super().step(p_action)
+
+ if speak_flag:
+ utterance = TalkItOutGrammar.construct_utterance(utterance_action)
+ if self.hear_yourself:
+ if self.nameless:
+ self.utterance += "{} \n".format(utterance)
+ else:
+ self.utterance += "YOU: {} \n".format(utterance)
+
+ self.conversation += "YOU: {} \n".format(utterance)
+
+ # check if near wizard
+ if self.wizard.is_near_agent():
+ reply = self.wizard.listen(utterance)
+
+ if reply:
+ if self.nameless:
+ self.utterance += "{} \n".format(reply)
+ else:
+ self.utterance += "{}: {} \n".format(self.wizard.name, reply)
+
+ self.conversation += "{}: {} \n".format(self.wizard.name, reply)
+
+ if self.true_guide.is_near_agent():
+ reply = self.true_guide.listen(utterance)
+
+ if reply:
+ if self.nameless:
+ self.utterance += "{} \n".format(reply)
+ else:
+ self.utterance += "{}: {} \n".format(self.true_guide.name, reply)
+
+ self.conversation += "{}: {} \n".format(self.true_guide.name, reply)
+
+ if self.false_guide.is_near_agent():
+ reply = self.false_guide.listen(utterance)
+
+ if reply:
+ if self.nameless:
+ self.utterance += "{} \n".format(reply)
+ else:
+ self.utterance += "{}: {} \n".format(self.false_guide.name, reply)
+
+ self.conversation += "{}: {} \n".format(self.false_guide.name, reply)
+
+ if utterance == TalkItOutGrammar.construct_utterance([1, 0]):
+ ax, ay = self.agent_pos
+ tx, ty = self.target_pos
+
+ if (ax == tx and abs(ay - ty) == 1) or (ay == ty and abs(ax - tx) == 1):
+ reward = self._reward()
+
+ for dx, dy in self.door_pos:
+ if (ax == dx and abs(ay - dy) == 1) or (ay == dy and abs(ax - dx) == 1):
+                        # the agent has chosen a door; the episode ends regardless of whether it is the correct one
+ done = True
+
+ # Don't let the agent open any of the doors
+ if p_action == self.actions.toggle:
+ done = True
+
+ if p_action == self.actions.done:
+ done = True
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+ print("conversation:\n", self.conversation)
+ print("utterance_history:\n", self.utterance_history)
+ self.window.set_caption(self.conversation, [
+ "Gandalf:",
+ "Jack:",
+ "John:",
+ "Where is the exit",
+ "Open sesame",
+ ])
+ return obs
+
+
+class TalkItOut8x8Env(TalkItOutEnv):
+ def __init__(self):
+ super().__init__(size=8)
+
+
+class TalkItOut6x6Env(TalkItOutEnv):
+ def __init__(self):
+ super().__init__(size=6)
+
+
+class TalkItOutNameless8x8Env(TalkItOutEnv):
+ def __init__(self):
+ super().__init__(size=8, nameless=True)
+
+register(
+ id='MiniGrid-TalkItOut-5x5-v0',
+ entry_point='gym_minigrid.envs:TalkItOutEnv'
+)
+
+register(
+ id='MiniGrid-TalkItOut-6x6-v0',
+ entry_point='gym_minigrid.envs:TalkItOut6x6Env'
+)
+
+register(
+ id='MiniGrid-TalkItOut-8x8-v0',
+ entry_point='gym_minigrid.envs:TalkItOut8x8Env'
+)
+
+register(
+ id='MiniGrid-TalkItOutNameless-8x8-v0',
+ entry_point='gym_minigrid.envs:TalkItOutNameless8x8Env'
+)
\ No newline at end of file
diff --git a/gym-minigrid/gym_minigrid/backup_envs/talkitoutnoliar.py b/gym-minigrid/gym_minigrid/backup_envs/talkitoutnoliar.py
new file mode 100644
index 0000000000000000000000000000000000000000..864a147f8cfee7fd15bdf3bfff4cf88187192c80
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/talkitoutnoliar.py
@@ -0,0 +1,384 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+
+class Wizard(NPC):
+ """
+ A simple NPC that knows who is telling the truth
+ """
+
+ def __init__(self, color, name, env):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+ self.npc_dir = 1 # NPC initially looks downward
+ # todo: this should be id == name
+ self.npc_type = 0 # this will be put into the encoding
+
+ def listen(self, utterance):
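+        # "Where is the exit" -> point the agent to the (always truthful) guide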
+ if utterance == TalkItOutNoLiarGrammar.construct_utterance([0, 1]):
+ if self.env.nameless:
+ return "Ask the {} guide.".format(self.env.true_guide.color)
+ else:
+ return "Ask {}.".format(self.env.true_guide.name)
+
+ return None
+
+ def is_near_agent(self):
+ ax, ay = self.env.agent_pos
+ wx, wy = self.cur_pos
+ if (ax == wx and abs(ay - wy) == 1) or (ay == wy and abs(ax - wx) == 1):
+ return True
+ return False
+
+
+class Guide(NPC):
+ """
+ A simple NPC that knows the correct door.
+ """
+
+ def __init__(self, color, name, env, liar=False):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+ self.liar = liar
+ self.npc_dir = 1 # NPC initially looks downward
+ assert not self.liar # in this env the guide is always good
+ # todo: this should be id == name
+ self.npc_type = 1 # this will be put into the encoding
+
+ # Select a random target object as mission
+ obj_idx = self.env._rand_int(0, len(self.env.door_pos))
+ self.target_pos = self.env.door_pos[obj_idx]
+ self.target_color = self.env.door_colors[obj_idx]
+
+ def listen(self, utterance):
+ if utterance == TalkItOutNoLiarGrammar.construct_utterance([0, 1]):
+ return self.env.mission
+
+ return None
+
+ def render(self, img):
+ c = COLORS[self.color]
+
+ npc_shapes = []
+ # Draw eyes
+ npc_shapes.append(point_in_circle(cx=0.70, cy=0.50, r=0.10))
+ npc_shapes.append(point_in_circle(cx=0.30, cy=0.50, r=0.10))
+
+ # Draw mouth
+ npc_shapes.append(point_in_rect(0.20, 0.80, 0.72, 0.81))
+
+ # todo: move this to super function
+ # todo: super.render should be able to take the npc_shapes and then rotate them
+
+ if hasattr(self, "npc_dir"):
+ # Pre-rotation to ensure npc_dir = 1 means NPC looks downwards
+ npc_shapes = [rotate_fn(v, cx=0.5, cy=0.5, theta=-1*(math.pi / 2)) for v in npc_shapes]
+ # Rotate npc based on its direction
+ npc_shapes = [rotate_fn(v, cx=0.5, cy=0.5, theta=(math.pi/2) * self.npc_dir) for v in npc_shapes]
+
+ # Draw shapes
+ for v in npc_shapes:
+ fill_coords(img, v, c)
+
+ def is_near_agent(self):
+ ax, ay = self.env.agent_pos
+ wx, wy = self.cur_pos
+ if (ax == wx and abs(ay - wy) == 1) or (ay == wy and abs(ax - wx) == 1):
+ return True
+ return False
+
+
+class TalkItOutNoLiarGrammar(object):
+
+ templates = ["Where is", "Open", "Close", "What is"]
+ things = [
+ "sesame", "the exit", "the wall", "the floor", "the ceiling", "the window", "the entrance", "the closet",
+ "the drawer", "the fridge", "oven", "the lamp", "the trash can", "the chair", "the bed", "the sofa"
+ ]
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+
+class TalkItOutNoLiarEnv(MultiModalMiniGridEnv):
+ """
+ Environment in which the agent is instructed to go to a given object
+ named using an English text string
+ """
+
+ def __init__(
+ self,
+ size=5,
+ hear_yourself=False,
+ diminished_reward=True,
+ step_penalty=False,
+ nameless=False,
+ ):
+ assert size >= 5
+ self.empty_symbol = "NA \n"
+ self.hear_yourself = hear_yourself
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+ self.nameless = nameless
+
+ super().__init__(
+ grid_size=size,
+ max_steps=5*size**2,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *TalkItOutNoLiarGrammar.grammar_action_space.nvec
+ ]),
+ add_npc_direction=True
+ )
+
+ print({
+ "size": size,
+ "hear_yourself": hear_yourself,
+ "diminished_reward": diminished_reward,
+ "step_penalty": step_penalty,
+ })
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height, nb_obj_dims=4)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Generate the 4 doors at random positions
+ self.door_pos = []
+ self.door_front_pos = [] # Remembers positions in front of door to avoid setting wizard here
+
+ self.door_pos.append((self._rand_int(2, width-2), 0))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1]+1))
+
+ self.door_pos.append((self._rand_int(2, width-2), height-1))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1] - 1))
+
+ self.door_pos.append((0, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] + 1, self.door_pos[-1][1]))
+
+ self.door_pos.append((width-1, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] - 1, self.door_pos[-1][1]))
+
+ # Generate the door colors
+ self.door_colors = []
+ while len(self.door_colors) < len(self.door_pos):
+ color = self._rand_elem(COLOR_NAMES)
+ if color in self.door_colors:
+ continue
+ self.door_colors.append(color)
+
+ # Place the doors in the grid
+ for idx, pos in enumerate(self.door_pos):
+ color = self.door_colors[idx]
+ self.grid.set(*pos, Door(color))
+
+
+ # Set a randomly coloured WIZARD at a random position
+ color = self._rand_elem(COLOR_NAMES)
+ self.wizard = Wizard(color, "Gandalf", self)
+
+ # Place it randomly, omitting front of door positions
+ self.place_obj(self.wizard,
+ size=(width, height),
+ reject_fn=lambda _, p: tuple(p) in self.door_front_pos)
+
+ # Set a randomly coloured TRUE GUIDE at a random position
+ color = self._rand_elem(COLOR_NAMES)
+ self.true_guide = Guide(color, "Jack", self, liar=False)
+
+ # Place it randomly, omitting invalid positions
+ self.place_obj(self.true_guide,
+ size=(width, height),
+ # reject_fn=lambda _, p: tuple(p) in self.door_front_pos)
+ reject_fn=lambda _, p: tuple(p) in [*self.door_front_pos, tuple(self.wizard.cur_pos)])
+
+ # Randomize the agent's start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Select a random target door
+ self.doorIdx = self._rand_int(0, len(self.door_pos))
+ self.target_pos = self.door_pos[self.doorIdx]
+ self.target_color = self.door_colors[self.doorIdx]
+
+ # Generate the mission string
+ self.mission = 'go to the %s door' % self.target_color
+
+ # Dummy beginning string
+ self.beginning_string = "This is what you hear. \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ # used for rendering
+ self.conversation = self.utterance
+
+ def step(self, action):
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ # assert all nan or neither nan
+ assert len(set(np.isnan(utterance_action))) == 1
+
+ speak_flag = not all(np.isnan(utterance_action))
+
+ obs, reward, done, info = super().step(p_action)
+
+ if speak_flag:
+ utterance = TalkItOutNoLiarGrammar.construct_utterance(utterance_action)
+ if self.hear_yourself:
+ if self.nameless:
+ self.utterance += "{} \n".format(utterance)
+ else:
+ self.utterance += "YOU: {} \n".format(utterance)
+
+ self.conversation += "YOU: {} \n".format(utterance)
+
+ # check if near wizard
+ if self.wizard.is_near_agent():
+ reply = self.wizard.listen(utterance)
+
+ if reply:
+ if self.nameless:
+ self.utterance += "{} \n".format(reply)
+ else:
+ self.utterance += "{}: {} \n".format(self.wizard.name, reply)
+
+ self.conversation += "{}: {} \n".format(self.wizard.name, reply)
+
+ if self.true_guide.is_near_agent():
+ reply = self.true_guide.listen(utterance)
+
+ if reply:
+ if self.nameless:
+ self.utterance += "{} \n".format(reply)
+ else:
+ self.utterance += "{}: {} \n".format(self.true_guide.name, reply)
+
+ self.conversation += "{}: {} \n".format(self.true_guide.name, reply)
+
+ if utterance == TalkItOutNoLiarGrammar.construct_utterance([1, 0]):
+ ax, ay = self.agent_pos
+ tx, ty = self.target_pos
+
+ if (ax == tx and abs(ay - ty) == 1) or (ay == ty and abs(ax - tx) == 1):
+ reward = self._reward()
+
+ for dx, dy in self.door_pos:
+ if (ax == dx and abs(ay - dy) == 1) or (ay == dy and abs(ax - dx) == 1):
+                        # the agent has chosen a door; the episode ends regardless of whether it is the correct one
+ done = True
+
+ # Don't let the agent open any of the doors
+ if p_action == self.actions.toggle:
+ done = True
+
+ if p_action == self.actions.done:
+ done = True
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ return obs, reward, done, info
+
+ # def reset(self):
+ # obs = super().reset()
+ # self.append_existing_utterance_to_history()
+ # obs = self.add_utterance_to_observation(obs)
+ # self.reset_utterance()
+ # return obs
+ #
+ # def append_existing_utterance_to_history(self):
+ # if self.utterance != self.empty_symbol:
+ # if self.utterance.startswith(self.empty_symbol):
+ # self.utterance_history += self.utterance[len(self.empty_symbol):]
+ # else:
+ # assert self.utterance == self.beginning_string
+ # self.utterance_history += self.utterance
+ #
+ # def add_utterance_to_observation(self, obs):
+ # obs["utterance"] = self.utterance
+ # obs["utterance_history"] = self.utterance_history
+ # obs["mission"] = "Hidden"
+ # return obs
+ #
+ # def reset_utterance(self):
+ # # set utterance to empty indicator
+ # self.utterance = self.empty_symbol
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+ print("conversation:\n", self.conversation)
+ print("utterance_history:\n", self.utterance_history)
+ self.window.set_caption(self.conversation, [
+ "Gandalf:",
+ "Jack:",
+ "John:",
+ "Where is the exit",
+ "Open sesame",
+ ])
+ return obs
+
+
+class TalkItOutNoLiar8x8Env(TalkItOutNoLiarEnv):
+ def __init__(self):
+ super().__init__(size=8)
+
+
+class TalkItOutNoLiar6x6Env(TalkItOutNoLiarEnv):
+ def __init__(self):
+ super().__init__(size=6)
+
+
+class TalkItOutNoLiarNameless8x8Env(TalkItOutNoLiarEnv):
+ def __init__(self):
+ super().__init__(size=8, nameless=True)
+
+register(
+ id='MiniGrid-TalkItOutNoLiar-5x5-v0',
+ entry_point='gym_minigrid.envs:TalkItOutNoLiarEnv'
+)
+
+register(
+ id='MiniGrid-TalkItOutNoLiar-6x6-v0',
+ entry_point='gym_minigrid.envs:TalkItOutNoLiar6x6Env'
+)
+
+register(
+ id='MiniGrid-TalkItOutNoLiar-8x8-v0',
+ entry_point='gym_minigrid.envs:TalkItOutNoLiar8x8Env'
+)
+
+register(
+ id='MiniGrid-TalkItOutNoLiarNameless-8x8-v0',
+ entry_point='gym_minigrid.envs:TalkItOutNoLiarNameless8x8Env'
+)
\ No newline at end of file
diff --git a/gym-minigrid/gym_minigrid/backup_envs/talkitoutnoliarpolite.py b/gym-minigrid/gym_minigrid/backup_envs/talkitoutnoliarpolite.py
new file mode 100644
index 0000000000000000000000000000000000000000..248fede7bebc471394579675db392972edac0556
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/talkitoutnoliarpolite.py
@@ -0,0 +1,428 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+
+class Wizard(NPC):
+ """
+ A simple NPC that knows who is telling the truth
+ """
+
+ def __init__(self, color, name, env):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+ self.npc_dir = 1 # NPC initially looks downward
+ # todo: this should be id == name
+ self.npc_type = 0 # this will be put into the encoding
+ self.was_introduced_to = False
+
+ def can_overlap(self):
+ # If the NPC is hidden, agent can overlap on it
+ return self.env.hidden_npc
+
+ def encode(self, nb_dims=3):
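+        # when hidden, the NPC is encoded like an empty cell so it is invisible in the symbolic observation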
+ if self.env.hidden_npc:
+ if nb_dims == 3:
+ return (1, 0, 0)
+ elif nb_dims == 4:
+ return (1, 0, 0, 0)
+ else:
+ return super().encode(nb_dims=nb_dims)
+
+ def listen(self, utterance):
+ if self.env.hidden_npc:
+ return None
+
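+        # polite protocol: the wizard only answers "Where is the exit" after the agent has greeted it with "How are you"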
+ if self.was_introduced_to:
+ if utterance == TalkItOutNoLiarPoliteGrammar.construct_utterance([0, 1]):
+ if self.env.nameless:
+ return "Ask the {} guide.".format(self.env.true_guide.color)
+ else:
+ return "Ask {}.".format(self.env.true_guide.name)
+ else:
+ if utterance == TalkItOutNoLiarPoliteGrammar.construct_utterance([3, 3]):
+ self.was_introduced_to = True
+ return "I am well."
+
+ return None
+
+
+
+class Guide(NPC):
+ """
+ A simple NPC that knows the correct door.
+ """
+
+ def __init__(self, color, name, env, liar=False):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+ self.liar = liar
+ self.npc_dir = 1 # NPC initially looks downward
+ # todo: this should be id == name
+ self.npc_type = 1 # this will be put into the encoding
+ self.was_introduced_to = False
+ assert not self.liar # in this env the guide is always good
+
+ # Select a random target object as mission
+ obj_idx = self.env._rand_int(0, len(self.env.door_pos))
+ self.target_pos = self.env.door_pos[obj_idx]
+ self.target_color = self.env.door_colors[obj_idx]
+
+ def can_overlap(self):
+ # If the NPC is hidden, agent can overlap on it
+ return self.env.hidden_npc
+
+ def encode(self, nb_dims=3):
+ if self.env.hidden_npc:
+ if nb_dims == 3:
+ return (1, 0, 0)
+ elif nb_dims == 4:
+ return (1, 0, 0, 0)
+ else:
+ return super().encode(nb_dims=nb_dims)
+
+ def listen(self, utterance):
+ if self.was_introduced_to:
+ if utterance == TalkItOutNoLiarPoliteGrammar.construct_utterance([0, 1]):
+ return self.env.mission
+ else:
+ if utterance == TalkItOutNoLiarPoliteGrammar.construct_utterance([3, 3]):
+ self.was_introduced_to = True
+ return "I am well."
+
+
+ def render(self, img):
+ c = COLORS[self.color]
+
+ npc_shapes = []
+ # Draw eyes
+ npc_shapes.append(point_in_circle(cx=0.70, cy=0.50, r=0.10))
+ npc_shapes.append(point_in_circle(cx=0.30, cy=0.50, r=0.10))
+
+ # Draw mouth
+ npc_shapes.append(point_in_rect(0.20, 0.80, 0.72, 0.81))
+
+ # todo: move this to super function
+ # todo: super.render should be able to take the npc_shapes and then rotate them
+
+ if hasattr(self, "npc_dir"):
+ # Pre-rotation to ensure npc_dir = 1 means NPC looks downwards
+ npc_shapes = [rotate_fn(v, cx=0.5, cy=0.5, theta=-1*(math.pi / 2)) for v in npc_shapes]
+ # Rotate npc based on its direction
+ npc_shapes = [rotate_fn(v, cx=0.5, cy=0.5, theta=(math.pi/2) * self.npc_dir) for v in npc_shapes]
+
+ # Draw shapes
+ for v in npc_shapes:
+ fill_coords(img, v, c)
+
+ def is_near_agent(self):
+ ax, ay = self.env.agent_pos
+ wx, wy = self.cur_pos
+ if (ax == wx and abs(ay - wy) == 1) or (ay == wy and abs(ax - wx) == 1):
+ return True
+ return False
+
+
+class TalkItOutNoLiarPoliteGrammar(object):
+
+ templates = ["Where is", "Open", "Close", "How are"]
+ things = [
+ "sesame", "the exit", "the wall", "you", "the ceiling", "the window", "the entrance", "the closet",
+ "the drawer", "the fridge", "the floor", "the lamp", "the trash can", "the chair", "the bed", "the sofa"
+ ]
+ assert len(templates)*len(things) == 64
+ print("language complexity {}:".format(len(templates)*len(things)))
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+
+class TalkItOutNoLiarPoliteEnv(MultiModalMiniGridEnv):
+ """
+ Environment in which the agent is instructed to go to a given object
+ named using an English text string
+ """
+
+ def __init__(
+ self,
+ size=5,
+ hear_yourself=False,
+ diminished_reward=True,
+ step_penalty=False,
+ nameless=False,
+ max_steps=100,
+ hidden_npc=False,
+ ):
+ assert size >= 5
+ self.empty_symbol = "NA \n"
+ self.hear_yourself = hear_yourself
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+ self.nameless = nameless
+ self.hidden_npc = hidden_npc
+
+ if max_steps is None:
+ max_steps = 5*size**2
+
+ super().__init__(
+ grid_size=size,
+ max_steps=max_steps,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *TalkItOutNoLiarPoliteGrammar.grammar_action_space.nvec
+ ]),
+ add_npc_direction=True
+ )
+
+ print({
+ "size": size,
+ "hear_yourself": hear_yourself,
+ "diminished_reward": diminished_reward,
+ "step_penalty": step_penalty,
+ })
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height, nb_obj_dims=4)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Generate the 4 doors at random positions
+ self.door_pos = []
+ self.door_front_pos = [] # Remembers positions in front of door to avoid setting wizard here
+
+ self.door_pos.append((self._rand_int(2, width-2), 0))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1]+1))
+
+ self.door_pos.append((self._rand_int(2, width-2), height-1))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1] - 1))
+
+ self.door_pos.append((0, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] + 1, self.door_pos[-1][1]))
+
+ self.door_pos.append((width-1, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] - 1, self.door_pos[-1][1]))
+
+ # Generate the door colors
+ self.door_colors = []
+ while len(self.door_colors) < len(self.door_pos):
+ color = self._rand_elem(COLOR_NAMES)
+ if color in self.door_colors:
+ continue
+ self.door_colors.append(color)
+
+ # Place the doors in the grid
+ for idx, pos in enumerate(self.door_pos):
+ color = self.door_colors[idx]
+ self.grid.set(*pos, Door(color))
+
+
+ # Set a randomly coloured WIZARD at a random position
+ color = self._rand_elem(COLOR_NAMES)
+ self.wizard = Wizard(color, "Gandalf", self)
+
+ # Place it randomly, omitting front of door positions
+ self.place_obj(self.wizard,
+ size=(width, height),
+ reject_fn=lambda _, p: tuple(p) in self.door_front_pos)
+
+ # Set a randomly coloured TRUE GUIDE at a random position
+ name = "John"
+ color = self._rand_elem(COLOR_NAMES)
+ self.true_guide = Guide(color, name, self, liar=False)
+
+ # Place it randomly, omitting invalid positions
+ self.place_obj(self.true_guide,
+ size=(width, height),
+ # reject_fn=lambda _, p: tuple(p) in self.door_front_pos)
+ reject_fn=lambda _, p: tuple(p) in [*self.door_front_pos, tuple(self.wizard.cur_pos)])
+
+ # Randomize the agent's start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Select a random target door
+ self.doorIdx = self._rand_int(0, len(self.door_pos))
+ self.target_pos = self.door_pos[self.doorIdx]
+ self.target_color = self.door_colors[self.doorIdx]
+
+ # Generate the mission string
+ self.mission = 'go to the %s door' % self.target_color
+
+ # Dummy beginning string
+ self.beginning_string = "This is what you hear. \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ # used for rendering
+ self.conversation = self.utterance
+ self.outcome_info = None
+
+ def step(self, action):
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ # assert all nan or neither nan
+ assert len(set(np.isnan(utterance_action))) == 1
+
+ speak_flag = not all(np.isnan(utterance_action))
+
+ obs, reward, done, info = super().step(p_action)
+
+ if speak_flag:
+ utterance = TalkItOutNoLiarPoliteGrammar.construct_utterance(utterance_action)
+ if self.hear_yourself:
+ if self.nameless:
+ self.utterance += "{} \n".format(utterance)
+ else:
+ self.utterance += "YOU: {} \n".format(utterance)
+
+ self.conversation += "YOU: {} \n".format(utterance)
+
+ # check if near wizard
+ if self.wizard.is_near_agent():
+ reply = self.wizard.listen(utterance)
+
+ if reply:
+ if self.nameless:
+ self.utterance += "{} \n".format(reply)
+ else:
+ self.utterance += "{}: {} \n".format(self.wizard.name, reply)
+
+ self.conversation += "{}: {} \n".format(self.wizard.name, reply)
+
+ if self.true_guide.is_near_agent():
+ reply = self.true_guide.listen(utterance)
+
+ if reply:
+ if self.nameless:
+ self.utterance += "{} \n".format(reply)
+ else:
+ self.utterance += "{}: {} \n".format(self.true_guide.name, reply)
+
+ self.conversation += "{}: {} \n".format(self.true_guide.name, reply)
+
+ if utterance == TalkItOutNoLiarPoliteGrammar.construct_utterance([1, 0]):
+ ax, ay = self.agent_pos
+ tx, ty = self.target_pos
+
+ if (ax == tx and abs(ay - ty) == 1) or (ay == ty and abs(ax - tx) == 1):
+ reward = self._reward()
+
+ for dx, dy in self.door_pos:
+ if (ax == dx and abs(ay - dy) == 1) or (ay == dy and abs(ax - dx) == 1):
+                        # the agent has chosen a door; the episode ends regardless of whether it is the correct one
+ done = True
+
+ # Don't let the agent open any of the doors
+ if p_action == self.actions.toggle:
+ done = True
+
+ if p_action == self.actions.done:
+ done = True
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ if self.hidden_npc:
+            # all NPCs are hidden
+ assert np.argwhere(obs['image'][:,:,0] == OBJECT_TO_IDX['npc']).size == 0
+ assert "{}:".format(self.wizard.name) not in self.utterance
+ #assert "{}:".format(self.true_guide.name) not in self.utterance
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ if done:
+ if reward > 0:
+ self.outcome_info = "SUCCESS: agent got {} reward \n".format(np.round(reward, 1))
+ else:
+ self.outcome_info = "FAILURE: agent got {} reward \n".format(reward)
+
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+
+ self.window.clear_text() # erase previous text
+
+ self.window.set_caption(self.conversation, [
+ "Gandalf:",
+ "Jack:",
+ "John:",
+ "Where is the exit",
+ "Open sesame",
+ ])
+
+        self.window.ax.set_title("correct door: {}".format(self.target_color), loc="left", fontsize=10)
+ if self.outcome_info:
+ color = None
+ if "SUCCESS" in self.outcome_info:
+ color = "lime"
+ elif "FAILURE" in self.outcome_info:
+ color = "red"
+ self.window.add_text(*(0.01, 0.85, self.outcome_info),
+ **{'fontsize':15, 'color':color, 'weight':"bold"})
+
+ self.window.show_img(obs) # re-draw image to add changes to window
+ return obs
+
+
+class TalkItOutNoLiarPolite8x8Env(TalkItOutNoLiarPoliteEnv):
+ def __init__(self, **kwargs):
+ super().__init__(size=8, max_steps=100, **kwargs)
+
+
+class TalkItOutNoLiarPolite6x6Env(TalkItOutNoLiarPoliteEnv):
+ def __init__(self):
+ super().__init__(size=6, max_steps=100)
+
+
+class TalkItOutNoLiarPoliteNameless8x8Env(TalkItOutNoLiarPoliteEnv):
+ def __init__(self):
+ super().__init__(size=8, max_steps=100, nameless=True)
+
+register(
+ id='MiniGrid-TalkItOutNoLiarPolite-5x5-v0',
+ entry_point='gym_minigrid.envs:TalkItOutNoLiarPoliteEnv'
+)
+
+register(
+ id='MiniGrid-TalkItOutNoLiarPolite-6x6-v0',
+ entry_point='gym_minigrid.envs:TalkItOutNoLiarPolite6x6Env'
+)
+
+register(
+ id='MiniGrid-TalkItOutNoLiarPolite-8x8-v0',
+ entry_point='gym_minigrid.envs:TalkItOutNoLiarPolite8x8Env'
+)
+
+register(
+ id='MiniGrid-TalkItOutNoLiarPoliteNameless-8x8-v0',
+ entry_point='gym_minigrid.envs:TalkItOutNoLiarPoliteNameless8x8Env'
+)
\ No newline at end of file
diff --git a/gym-minigrid/gym_minigrid/backup_envs/talkitoutpolite.py b/gym-minigrid/gym_minigrid/backup_envs/talkitoutpolite.py
new file mode 100644
index 0000000000000000000000000000000000000000..25b681a0df38d614b3373bcae41192f82072f556
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/talkitoutpolite.py
@@ -0,0 +1,464 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+
+class Wizard(NPC):
+ """
+ A simple NPC that knows who is telling the truth
+ """
+
+ def __init__(self, color, name, env):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+ self.npc_dir = 1 # NPC initially looks downward
+ # todo: this should be id == name
+ self.npc_type = 0 # this will be put into the encoding
+ self.was_introduced_to = False
+
+ def can_overlap(self):
+ # If the NPC is hidden, agent can overlap on it
+ return self.env.hidden_npc
+
+ def encode(self, nb_dims=3):
+ if self.env.hidden_npc:
+ if nb_dims == 3:
+ return (1, 0, 0)
+ elif nb_dims == 4:
+ return (1, 0, 0, 0)
+ else:
+ return super().encode(nb_dims=nb_dims)
+
+ def listen(self, utterance):
+ if self.env.hidden_npc:
+ return None
+
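+        # polite protocol: the wizard only answers "Where is the exit" after the agent has greeted it with "How are you"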
+ if self.was_introduced_to:
+ if utterance == TalkItOutPoliteGrammar.construct_utterance([0, 1]):
+ if self.env.nameless:
+ return "Ask the {} guide.".format(self.env.true_guide.color)
+ else:
+ return "Ask {}.".format(self.env.true_guide.name)
+ else:
+ if utterance == TalkItOutPoliteGrammar.construct_utterance([3, 3]):
+ self.was_introduced_to = True
+ return "I am well."
+
+ return None
+
+
+
+class Guide(NPC):
+ """
+ A simple NPC that knows the correct door.
+ """
+
+ def __init__(self, color, name, env, liar=False):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+ self.liar = liar
+ self.npc_dir = 1 # NPC initially looks downward
+ # todo: this should be id == name
+ self.npc_type = 1 # this will be put into the encoding
+ self.was_introduced_to = False
+
+ # Select a random target object as mission
+ obj_idx = self.env._rand_int(0, len(self.env.door_pos))
+ self.target_pos = self.env.door_pos[obj_idx]
+ self.target_color = self.env.door_colors[obj_idx]
+
+ def can_overlap(self):
+ # If the NPC is hidden, agent can overlap on it
+ return self.env.hidden_npc
+
+ def encode(self, nb_dims=3):
+ if self.env.hidden_npc:
+ if nb_dims == 3:
+ return (1, 0, 0)
+ elif nb_dims == 4:
+ return (1, 0, 0, 0)
+ else:
+ return super().encode(nb_dims=nb_dims)
+
+ def listen(self, utterance):
+ if self.env.hidden_npc:
+ return None
+
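+        # guides must also be greeted first; the true guide then repeats the real
+        # mission, while the liar names one of the other door colors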
+ if self.was_introduced_to:
+ if utterance == TalkItOutPoliteGrammar.construct_utterance([0, 1]):
+ if self.liar:
+ fake_colors = [c for c in self.env.door_colors if c != self.env.target_color]
+ fake_color = self.env._rand_elem(fake_colors)
+
+                    # Reply with a misleading mission that points to a wrong door
+ assert fake_color != self.env.target_color
+ return 'go to the %s door' % fake_color
+
+ else:
+ return self.env.mission
+ else:
+ if utterance == TalkItOutPoliteGrammar.construct_utterance([3, 3]):
+ self.was_introduced_to = True
+ return "I am well."
+
+
+ def render(self, img):
+ c = COLORS[self.color]
+
+ npc_shapes = []
+ # Draw eyes
+ npc_shapes.append(point_in_circle(cx=0.70, cy=0.50, r=0.10))
+ npc_shapes.append(point_in_circle(cx=0.30, cy=0.50, r=0.10))
+
+ # Draw mouth
+ npc_shapes.append(point_in_rect(0.20, 0.80, 0.72, 0.81))
+
+ # todo: move this to super function
+ # todo: super.render should be able to take the npc_shapes and then rotate them
+
+ if hasattr(self, "npc_dir"):
+ # Pre-rotation to ensure npc_dir = 1 means NPC looks downwards
+ npc_shapes = [rotate_fn(v, cx=0.5, cy=0.5, theta=-1*(math.pi / 2)) for v in npc_shapes]
+ # Rotate npc based on its direction
+ npc_shapes = [rotate_fn(v, cx=0.5, cy=0.5, theta=(math.pi/2) * self.npc_dir) for v in npc_shapes]
+
+ # Draw shapes
+ for v in npc_shapes:
+ fill_coords(img, v, c)
+
+
+class TalkItOutPoliteGrammar(object):
+
+ templates = ["Where is", "Open", "Close", "How are"]
+ things = [
+ "sesame", "the exit", "the wall", "you", "the ceiling", "the window", "the entrance", "the closet",
+ "the drawer", "the fridge", "the floor", "the lamp", "the trash can", "the chair", "the bed", "the sofa"
+ ]
+ assert len(templates)*len(things) == 64
+ print("language complexity {}:".format(len(templates)*len(things)))
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+
+class TalkItOutPoliteEnv(MultiModalMiniGridEnv):
+ """
+ Environment in which the agent is instructed to go to a given object
+ named using an English text string
+ """
+
+ def __init__(
+ self,
+ size=5,
+ hear_yourself=False,
+ diminished_reward=True,
+ step_penalty=False,
+ nameless=False,
+ max_steps=100,
+ hidden_npc=False,
+
+ ):
+ assert size >= 5
+ self.empty_symbol = "NA \n"
+ self.hear_yourself = hear_yourself
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+ self.nameless = nameless
+ self.hidden_npc = hidden_npc
+
+ if max_steps is None:
+ max_steps = 5*size**2
+
+ super().__init__(
+ grid_size=size,
+ max_steps=max_steps,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *TalkItOutPoliteGrammar.grammar_action_space.nvec
+ ]),
+ add_npc_direction=True
+ )
+
+ print({
+ "size": size,
+ "hear_yourself": hear_yourself,
+ "diminished_reward": diminished_reward,
+ "step_penalty": step_penalty,
+ })
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height, nb_obj_dims=4)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Generate the 4 doors at random positions
+ self.door_pos = []
+ self.door_front_pos = [] # Remembers positions in front of door to avoid setting wizard here
+
+ self.door_pos.append((self._rand_int(2, width-2), 0))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1]+1))
+
+ self.door_pos.append((self._rand_int(2, width-2), height-1))
+ self.door_front_pos.append((self.door_pos[-1][0], self.door_pos[-1][1] - 1))
+
+ self.door_pos.append((0, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] + 1, self.door_pos[-1][1]))
+
+ self.door_pos.append((width-1, self._rand_int(2, height-2)))
+ self.door_front_pos.append((self.door_pos[-1][0] - 1, self.door_pos[-1][1]))
+
+ # Generate the door colors
+ self.door_colors = []
+ while len(self.door_colors) < len(self.door_pos):
+ color = self._rand_elem(COLOR_NAMES)
+ if color in self.door_colors:
+ continue
+ self.door_colors.append(color)
+
+ # Place the doors in the grid
+ for idx, pos in enumerate(self.door_pos):
+ color = self.door_colors[idx]
+ self.grid.set(*pos, Door(color))
+
+
+ # Set a randomly coloured WIZARD at a random position
+ color = self._rand_elem(COLOR_NAMES)
+ self.wizard = Wizard(color, "Gandalf", self)
+
+ # Place it randomly, omitting front of door positions
+ self.place_obj(self.wizard,
+ size=(width, height),
+ reject_fn=lambda _, p: tuple(p) in self.door_front_pos)
+
+ # add guides
+ GUIDE_NAMES = ["John", "Jack"]
+
+ # Set a randomly coloured TRUE GUIDE at a random position
+ name = self._rand_elem(GUIDE_NAMES)
+ color = self._rand_elem(COLOR_NAMES)
+ self.true_guide = Guide(color, name, self, liar=False)
+
+ # Place it randomly, omitting invalid positions
+ self.place_obj(self.true_guide,
+ size=(width, height),
+ # reject_fn=lambda _, p: tuple(p) in self.door_front_pos)
+ reject_fn=lambda _, p: tuple(p) in [*self.door_front_pos, tuple(self.wizard.cur_pos)])
+
+ # Set a randomly coloured FALSE GUIDE at a random position
+ name = self._rand_elem([n for n in GUIDE_NAMES if n != self.true_guide.name])
+
+ color = self._rand_elem([c for c in COLOR_NAMES if c != self.true_guide.color])
+
+ self.false_guide = Guide(color, name, self, liar=True)
+
+ # Place it randomly, omitting invalid positions
+ self.place_obj(self.false_guide,
+ size=(width, height),
+ reject_fn=lambda _, p: tuple(p) in [
+ *self.door_front_pos, tuple(self.wizard.cur_pos), tuple(self.true_guide.cur_pos)])
+
+ assert self.true_guide.name != self.false_guide.name
+ assert self.true_guide.color != self.false_guide.color
+
+ # Randomize the agent's start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Select a random target door
+ self.doorIdx = self._rand_int(0, len(self.door_pos))
+ self.target_pos = self.door_pos[self.doorIdx]
+ self.target_color = self.door_colors[self.doorIdx]
+
+ # Generate the mission string
+ self.mission = 'go to the %s door' % self.target_color
+
+ # Dummy beginning string
+ self.beginning_string = "This is what you hear. \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ # used for rendering
+ self.conversation = self.utterance
+ self.outcome_info = None
+
+ def step(self, action):
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ # assert all nan or neither nan
+ assert len(set(np.isnan(utterance_action))) == 1
+
+ speak_flag = not all(np.isnan(utterance_action))
+
+ obs, reward, done, info = super().step(p_action)
+
+ if speak_flag:
+ utterance = TalkItOutPoliteGrammar.construct_utterance(utterance_action)
+ if self.hear_yourself:
+ if self.nameless:
+ self.utterance += "{} \n".format(utterance)
+ else:
+ self.utterance += "YOU: {} \n".format(utterance)
+
+ self.conversation += "YOU: {} \n".format(utterance)
+
+ # check if near wizard
+ if self.wizard.is_near_agent():
+ reply = self.wizard.listen(utterance)
+
+ if reply:
+ if self.nameless:
+ self.utterance += "{} \n".format(reply)
+ else:
+ self.utterance += "{}: {} \n".format(self.wizard.name, reply)
+
+ self.conversation += "{}: {} \n".format(self.wizard.name, reply)
+
+ if self.true_guide.is_near_agent():
+ reply = self.true_guide.listen(utterance)
+
+ if reply:
+ if self.nameless:
+ self.utterance += "{} \n".format(reply)
+ else:
+ self.utterance += "{}: {} \n".format(self.true_guide.name, reply)
+
+ self.conversation += "{}: {} \n".format(self.true_guide.name, reply)
+
+ if self.false_guide.is_near_agent():
+ reply = self.false_guide.listen(utterance)
+
+ if reply:
+ if self.nameless:
+ self.utterance += "{} \n".format(reply)
+ else:
+ self.utterance += "{}: {} \n".format(self.false_guide.name, reply)
+
+ self.conversation += "{}: {} \n".format(self.false_guide.name, reply)
+
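+            # saying "Open sesame" next to any door ends the episode; the reward
+            # is given only if that door is the target one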
+ if utterance == TalkItOutPoliteGrammar.construct_utterance([1, 0]):
+ ax, ay = self.agent_pos
+ tx, ty = self.target_pos
+
+ if (ax == tx and abs(ay - ty) == 1) or (ay == ty and abs(ax - tx) == 1):
+ reward = self._reward()
+
+ for dx, dy in self.door_pos:
+ if (ax == dx and abs(ay - dy) == 1) or (ay == dy and abs(ax - dx) == 1):
+                        # the agent has chosen a door; the episode ends regardless of whether it is the correct one
+ done = True
+
+ # Don't let the agent open any of the doors
+ if p_action == self.actions.toggle:
+ done = True
+
+ if p_action == self.actions.done:
+ done = True
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ if self.hidden_npc:
+ # all npc are hidden
+ assert np.argwhere(obs['image'][:,:,0] == OBJECT_TO_IDX['npc']).size == 0
+ assert "{}:".format(self.wizard.name) not in self.utterance
+ # assert "{}:".format(self.true_guide.name) not in self.utterance
+ # assert "{}:".format(self.false_guide.name) not in self.utterance
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ if done:
+ if reward > 0:
+ self.outcome_info = "SUCCESS: agent got {} reward \n".format(np.round(reward, 1))
+ else:
+ self.outcome_info = "FAILURE: agent got {} reward \n".format(reward)
+
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+
+ self.window.clear_text() # erase previous text
+
+ self.window.set_caption(self.conversation, [
+ "Gandalf:",
+ "Jack:",
+ "John:",
+ "Where is the exit",
+ "Open sesame",
+ ])
+
+ self.window.ax.set_title("correct door: {}".format(self.true_guide.target_color), loc="left", fontsize=10)
+ if self.outcome_info:
+ color = None
+ if "SUCCESS" in self.outcome_info:
+ color = "lime"
+ elif "FAILURE" in self.outcome_info:
+ color = "red"
+            self.window.add_text(0.01, 0.85, self.outcome_info,
+                                 fontsize=15, color=color, weight="bold")
+
+ self.window.show_img(obs) # re-draw image to add changes to window
+ return obs
+
+
+class TalkItOutPolite8x8Env(TalkItOutPoliteEnv):
+ def __init__(self, **kwargs):
+ super().__init__(size=8, max_steps=100, **kwargs)
+
+
+class TalkItOutPolite6x6Env(TalkItOutPoliteEnv):
+ def __init__(self):
+ super().__init__(size=6, max_steps=100)
+
+
+class TalkItOutPoliteNameless8x8Env(TalkItOutPoliteEnv):
+ def __init__(self):
+ super().__init__(size=8, max_steps=100, nameless=True)
+
+register(
+ id='MiniGrid-TalkItOutPolite-5x5-v0',
+ entry_point='gym_minigrid.envs:TalkItOutPoliteEnv'
+)
+
+register(
+ id='MiniGrid-TalkItOutPolite-6x6-v0',
+ entry_point='gym_minigrid.envs:TalkItOutPolite6x6Env'
+)
+
+register(
+ id='MiniGrid-TalkItOutPolite-8x8-v0',
+ entry_point='gym_minigrid.envs:TalkItOutPolite8x8Env'
+)
+
+register(
+ id='MiniGrid-TalkItOutPoliteNameless-8x8-v0',
+ entry_point='gym_minigrid.envs:TalkItOutPoliteNameless8x8Env'
+)
\ No newline at end of file
diff --git a/gym-minigrid/gym_minigrid/backup_envs/twodoorsintent.py b/gym-minigrid/gym_minigrid/backup_envs/twodoorsintent.py
new file mode 100644
index 0000000000000000000000000000000000000000..5602c331eb581d646a26f9d8c3b467903e04d80e
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/backup_envs/twodoorsintent.py
@@ -0,0 +1,274 @@
+import numpy as np
+
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+import time
+from collections import deque
+
+
+class Peer(NPC):
+ """
+    A peer NPC that chooses one of the two doors and walks towards it
+ """
+
+ def __init__(self, color, name, env):
+ super().__init__(color)
+ self.name = name
+ self.npc_dir = 1 # NPC initially looks downward
+ self.npc_type = 0
+ self.env = env
+ self.npc_actions = []
+ self.dancing_step_idx = 0
+ self.actions = MiniGridEnv.Actions
+ self.add_npc_direction = True
+ self.available_moves = [self.rotate_left, self.rotate_right, self.go_forward, self.toggle_action]
+
+ selected_door_id = self.env._rand_elem([0, 1])
+ self.selected_door_pos = [self.env.door_pos_top, self.env.door_pos_bottom][selected_door_id]
+ self.selected_door = [self.env.door_top, self.env.door_bottom][selected_door_id]
+
+ def step(self):
+
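+        # Walk towards the selected door: first reach its row or column, then
+        # turn until facing it and step forward; enter once the door is open.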
+ if all(self.front_pos == self.selected_door_pos):
+ # in front of door
+ if self.selected_door.is_open:
+ self.go_forward()
+
+ else:
+ if (self.cur_pos[0] == self.selected_door_pos[0]) or (self.cur_pos[1] == self.selected_door_pos[1]):
+                # the peer is already in the correct row or in the correct column
+ next_wanted_position = self.selected_door_pos
+ else:
+ # choose the midpoint
+ for cand_x, cand_y in [
+ (self.cur_pos[0], self.selected_door_pos[1]),
+ (self.selected_door_pos[0], self.cur_pos[1])
+ ]:
+ print("wX:", self.env.wall_x)
+ print("wY:", self.env.wall_y)
+ if (
+ cand_x > 0 and cand_x < self.env.wall_x
+ ) and (
+ cand_y > 0 and cand_y < self.env.wall_y
+ ):
+ next_wanted_position = (cand_x, cand_y)
+ print("wanted_pos:", next_wanted_position)
+
+ if self.cur_pos[1] == next_wanted_position[1]:
+ # same y
+ if self.cur_pos[0] < next_wanted_position[0]:
+ wanted_dir = 0
+ else:
+ wanted_dir = 2
+ if self.npc_dir == wanted_dir:
+ self.go_forward()
+
+ else:
+ self.rotate_left()
+
+ elif self.cur_pos[0] == next_wanted_position[0]:
+ # same x
+ if self.cur_pos[1] < next_wanted_position[1]:
+ wanted_dir = 1
+ else:
+ wanted_dir = 3
+
+ if self.npc_dir == wanted_dir:
+ self.go_forward()
+
+ else:
+ self.rotate_left()
+ else:
+ raise ValueError("Something is wrong.")
+
+
+class TwoDoorsIntentGrammar(object):
+
+ templates = ["Move your", "Shake your"]
+ things = ["body", "head"]
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+
+class TwoDoorsIntentEnv(MultiModalMiniGridEnv):
+ """
+    Environment with two doors, each operated by a switch. The agent has to
+    infer which door the peer intends to go through and open it for them.
+ """
+
+ def __init__(
+ self,
+ size=5,
+ diminished_reward=True,
+ step_penalty=False,
+ knowledgeable=False,
+ ):
+ assert size >= 5
+ self.empty_symbol = "NA \n"
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+ self.knowledgeable = knowledgeable
+
+ super().__init__(
+ grid_size=size,
+ max_steps=5*size**2,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ actions=MiniGridEnv.Actions,
+ action_space=spaces.MultiDiscrete([
+ len(MiniGridEnv.Actions),
+ *TwoDoorsIntentGrammar.grammar_action_space.nvec
+ ]),
+ add_npc_direction=True
+ )
+
+ print({
+ "size": size,
+ "diminished_reward": diminished_reward,
+ "step_penalty": step_penalty,
+ })
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height, nb_obj_dims=4)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ self.wall_x = width-1
+ self.wall_y = height-1
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # door top
+ door_color_top = self._rand_elem(COLOR_NAMES)
+ self.door_pos_top = (width-1, 1)
+ self.door_top = Door(door_color_top)
+ self.grid.set(*self.door_pos_top, self.door_top)
+
+ # switch top
+ self.switch_pos_top = (0, 1)
+ self.switch_top = Switch(door_color_top, lockable_object=self.door_top)
+ self.grid.set(*self.switch_pos_top, self.switch_top)
+
+ # door bottom
+ door_color_bottom = self._rand_elem(COLOR_NAMES)
+ self.door_pos_bottom = (width-1, height-2)
+ self.door_bottom = Door(door_color_bottom)
+ self.grid.set(*self.door_pos_bottom, self.door_bottom)
+
+ # switch bottom
+ self.switch_pos_bottom = (0, height-2)
+ self.switch_bottom = Switch(door_color_bottom, lockable_object=self.door_bottom)
+ self.grid.set(*self.switch_pos_bottom, self.switch_bottom)
+
+        # Set a randomly coloured peer NPC
+ color = self._rand_elem(COLOR_NAMES)
+ self.peer = Peer(color, "Jill", self)
+
+        # Place it at a random position in the room
+ peer_pos = np.array((self._rand_int(1, width - 1), self._rand_int(1, height - 1)))
+
+ self.grid.set(*peer_pos, self.peer)
+ self.peer.init_pos = peer_pos
+ self.peer.cur_pos = peer_pos
+
+ # Randomize the agent's start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Generate the mission string
+ self.mission = 'watch dancer and repeat his moves afterwards'
+
+ # Dummy beginning string
+ self.beginning_string = "This is what you hear. \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ # used for rendering
+ self.conversation = self.utterance
+
+ def step(self, action):
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ obs, reward, done, info = super().step(p_action)
+ self.peer.step()
+
+ if np.isnan(p_action):
+ pass
+
+ if p_action == self.actions.done:
+ done = True
+
+ elif all(self.agent_pos == self.door_pos_top):
+ done = True
+
+ elif all(self.agent_pos == self.door_pos_bottom):
+ done = True
+
+ elif all([self.switch_top.is_on, self.switch_bottom.is_on]):
+ # if both switches are on no reward is given and episode ends
+ done = True
+
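+        # the episode is successful when the peer has reached the door it selected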
+ elif all(self.peer.cur_pos == self.peer.selected_door_pos):
+ reward = self._reward()
+ done = True
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+ print("conversation:\n", self.conversation)
+ print("utterance_history:\n", self.utterance_history)
+ self.window.set_caption(self.conversation, [self.peer.name])
+ return obs
+
+
+class TwoDoorsIntent8x8Env(TwoDoorsIntentEnv):
+ def __init__(self):
+ super().__init__(size=8)
+
+
+class TwoDoorsIntent6x6Env(TwoDoorsIntentEnv):
+ def __init__(self):
+ super().__init__(size=6)
+
+
+
+register(
+ id='MiniGrid-TwoDoorsIntent-5x5-v0',
+ entry_point='gym_minigrid.envs:TwoDoorsIntentEnv'
+)
+
+register(
+ id='MiniGrid-TwoDoorsIntent-6x6-v0',
+ entry_point='gym_minigrid.envs:TwoDoorsIntent6x6Env'
+)
+
+register(
+ id='MiniGrid-TwoDoorsIntent-8x8-v0',
+ entry_point='gym_minigrid.envs:TwoDoorsIntent8x8Env'
+)
diff --git a/gym-minigrid/gym_minigrid/curriculums/__init__.py b/gym-minigrid/gym_minigrid/curriculums/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..42a356d2ed38ff3d9f69328ff5ae09b6ccfb4c00
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/curriculums/__init__.py
@@ -0,0 +1 @@
+from gym_minigrid.curriculums.expertcurriculumsocialaiparamenv import *
diff --git a/gym-minigrid/gym_minigrid/curriculums/expertcurriculumsocialaiparamenv.py b/gym-minigrid/gym_minigrid/curriculums/expertcurriculumsocialaiparamenv.py
new file mode 100644
index 0000000000000000000000000000000000000000..1196189dc51d2584097f6e596c5fe1d631ad82d9
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/curriculums/expertcurriculumsocialaiparamenv.py
@@ -0,0 +1,143 @@
+import warnings
+
+import numpy as np
+import random
+
+class ScaffoldingExpertCurriculum:
+
+ def __init__(self, type, minimum_episodes=1000, average_interval=500, phase_thresholds=(0.75, 0.75)):
+ self.phase = 1
+ self.performance_history = []
+ self.phase_two_current_type = None
+        self.minimum_episodes = minimum_episodes  # how many episodes to wait for before starting to compute the estimate
+        self.phase_thresholds = phase_thresholds  # success-rate threshold(s) for advancing to the next phase
+        self.average_interval = average_interval  # number of episodes used to estimate current performance (100 ~ 10 updates)
+ self.mean_perf = 0
+ self.max_mean_perf = 0
+ self.type = type
+
+ def get_status_dict(self):
+ return {
+ "curriculum_phase": self.phase,
+ "curriculum_performance_history": self.performance_history,
+ }
+
+ def load_status_dict(self, status):
+ self.phase = status["curriculum_phase"]
+ self.performance_history = status["curriculum_performance_history"]
+
+ @staticmethod
+ def select(children, label):
+ ch = list(filter(lambda c: c.label == label, children))
+
+ if len(ch) == 0:
+ raise ValueError(f"Label {label} not found in children {children}.")
+ elif len(ch) > 1:
+ raise ValueError(f"Multiple labels {label} found in children {children}.")
+
+ selected = ch[0]
+ assert selected is not None
+ return selected
+
+ def choose(self, node, chosen_parameters):
+ """
+ Choose a child of the parameter node.
+ All the parameters used here should be updated by set_curriculum_parameters.
+ """
+ assert node.type == 'param'
+
+ # E + scaf
+ # E + full
+ # AE + full
+
+ # N cs -> N full -> A/E/N/AE full -> AE full
+
+ # A/E/N/AE scaf/full -> AE full
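+        # In short: phase 1 samples the pragmatic frame among No/Ask/Eye_contact/
+        # Ask_Eye_contact (and, for "intro_seq_scaf", a random scaffolding);
+        # phase 2 always uses the Ask_Eye_contact frame with no scaffolding ("N").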
+ if len(self.phase_thresholds) < 2:
+ warnings.WarningMessage(f"Num of thresholds ({len(self.phase_thresholds)}) is less than the num of phases.")
+
+ if node.label == "Scaffolding":
+
+ if self.type == "intro_seq":
+ return ScaffoldingExpertCurriculum.select(node.children, "N")
+
+ elif self.type == "intro_seq_scaf":
+ if self.phase in [1]:
+ return random.choice(node.children)
+
+ elif self.phase in [2]:
+ return ScaffoldingExpertCurriculum.select(node.children, "N")
+
+ else:
+ raise ValueError(f"Undefined phase {self.phase}.")
+
+ else:
+ raise ValueError(f"Curriculum type {self.type} unknown.")
+
+ elif node.label == "Pragmatic_frame_complexity":
+
+ if self.type not in ["intro_seq", "intro_seq_scaf"]:
+ raise ValueError(f"Undefined type {self.type}.")
+
+ if self.phase in [1]:
+ # return random.choice(node.children)
+ return random.choice([
+ ScaffoldingExpertCurriculum.select(node.children, "No"),
+ ScaffoldingExpertCurriculum.select(node.children, "Ask"),
+ ScaffoldingExpertCurriculum.select(node.children, "Eye_contact"),
+ ScaffoldingExpertCurriculum.select(node.children, "Ask_Eye_contact"),
+ ])
+
+ elif self.phase in [2]:
+ return ScaffoldingExpertCurriculum.select(node.children, "Ask_Eye_contact")
+
+ else:
+ raise ValueError(f"Undefined phase {self.phase}")
+
+ else:
+ return random.choice(node.children)
+
+ def set_parameters(self, params):
+ """
+ Set ALL the parameters used in choose.
+ This is important for parallel environments. This function is called by broadcast_curriculum_parameters()
+ """
+ self.phase = params["phase"]
+ self.mean_perf = params["mean_perf"]
+ self.max_mean_perf = params["max_mean_perf"]
+
+ def get_parameters(self):
+ """
+ Get ALL the parameters used in choose. Used when restoring the curriculum.
+ """
+ return {
+ "phase": self.phase,
+ "mean_perf": self.mean_perf,
+ "max_mean_perf": self.max_mean_perf,
+ }
+
+ def update_parameters(self, data):
+ """
+ Updates the parameters of the ACL used in choose().
+ If using parallel processes these parameters should be broadcasted with broadcast_curriculum_parameters()
+ """
+ for obs, reward, done, info in zip(data["obs"], data["reward"], data["done"], data["info"]):
+ if not done:
+ continue
+
+ self.performance_history.append(info["success"])
+ self.mean_perf = np.mean(self.performance_history[-self.average_interval:])
+ self.max_mean_perf = max(self.mean_perf, self.max_mean_perf)
+
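+            # advance to the next phase once enough episodes have been collected
+            # and the running success rate reaches this phase's threshold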
+ if self.phase in [1]:
+ if len(self.performance_history) > self.minimum_episodes and self.mean_perf >= self.phase_thresholds[self.phase-1]:
+ # next phase
+ self.phase = self.phase + 1
+ self.performance_history = []
+ self.max_mean_perf = 0
+
+ return self.get_parameters()
+
+ def get_info(self):
+ return {"param": self.phase, "mean_perf": self.mean_perf, "max_mean_perf": self.max_mean_perf}
+
diff --git a/gym-minigrid/gym_minigrid/envs/__init__.py b/gym-minigrid/gym_minigrid/envs/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..56075e09633a6fdf23ab95e5625e045e30220c69
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/__init__.py
@@ -0,0 +1,22 @@
+from gym_minigrid.envs.empty import *
+from gym_minigrid.envs.doorkey import *
+from gym_minigrid.envs.multiroom import *
+from gym_minigrid.envs.multiroom_noisytv import *
+from gym_minigrid.envs.fetch import *
+from gym_minigrid.envs.gotoobject import *
+from gym_minigrid.envs.gotodoor import *
+from gym_minigrid.envs.putnear import *
+from gym_minigrid.envs.lockedroom import *
+from gym_minigrid.envs.keycorridor import *
+from gym_minigrid.envs.unlock import *
+from gym_minigrid.envs.unlockpickup import *
+from gym_minigrid.envs.blockedunlockpickup import *
+from gym_minigrid.envs.playground_v0 import *
+from gym_minigrid.envs.redbluedoors import *
+from gym_minigrid.envs.obstructedmaze import *
+from gym_minigrid.envs.memory import *
+from gym_minigrid.envs.fourrooms import *
+from gym_minigrid.envs.crossing import *
+from gym_minigrid.envs.lavagap import *
+from gym_minigrid.envs.dynamicobstacles import *
+from gym_minigrid.envs.distshift import *
\ No newline at end of file
diff --git a/gym-minigrid/gym_minigrid/envs/blockedunlockpickup.py b/gym-minigrid/gym_minigrid/envs/blockedunlockpickup.py
new file mode 100644
index 0000000000000000000000000000000000000000..4ff8d53faeabed0dd3a00f560f70e2032c3cf440
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/blockedunlockpickup.py
@@ -0,0 +1,52 @@
+from gym_minigrid.minigrid import Ball
+from gym_minigrid.roomgrid import RoomGrid
+from gym_minigrid.register import register
+
+class BlockedUnlockPickup(RoomGrid):
+ """
+ Unlock a door blocked by a ball, then pick up a box
+ in another room
+ """
+
+ def __init__(self, seed=None):
+ room_size = 6
+ super().__init__(
+ num_rows=1,
+ num_cols=2,
+ room_size=room_size,
+ max_steps=16*room_size**2,
+ seed=seed
+ )
+
+ def _gen_grid(self, width, height):
+ super()._gen_grid(width, height)
+
+ # Add a box to the room on the right
+ obj, _ = self.add_object(1, 0, kind="box")
+ # Make sure the two rooms are directly connected by a locked door
+ door, pos = self.add_door(0, 0, 0, locked=True)
+ # Block the door with a ball
+ color = self._rand_color()
+ self.grid.set(pos[0]-1, pos[1], Ball(color))
+ # Add a key to unlock the door
+ self.add_object(0, 0, 'key', door.color)
+
+ self.place_agent(0, 0)
+
+ self.obj = obj
+ self.mission = "pick up the %s %s" % (obj.color, obj.type)
+
+ def step(self, action):
+ obs, reward, done, info = super().step(action)
+
+ if action == self.actions.pickup:
+ if self.carrying and self.carrying == self.obj:
+ reward = self._reward()
+ done = True
+
+ return obs, reward, done, info
+
+register(
+ id='MiniGrid-BlockedUnlockPickup-v0',
+ entry_point='gym_minigrid.envs:BlockedUnlockPickup'
+)
diff --git a/gym-minigrid/gym_minigrid/envs/crossing.py b/gym-minigrid/gym_minigrid/envs/crossing.py
new file mode 100644
index 0000000000000000000000000000000000000000..cc499bd1d5b6db20e956c9bcf0e0e0046b98b880
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/crossing.py
@@ -0,0 +1,155 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+import itertools as itt
+
+
+class CrossingEnv(MiniGridEnv):
+ """
+ Environment with wall or lava obstacles, sparse reward.
+ """
+
+ def __init__(self, size=9, num_crossings=1, obstacle_type=Lava, seed=None):
+ self.num_crossings = num_crossings
+ self.obstacle_type = obstacle_type
+ super().__init__(
+ grid_size=size,
+ max_steps=4*size*size,
+ # Set this to True for maximum speed
+ see_through_walls=False,
+            seed=seed
+ )
+
+ def _gen_grid(self, width, height):
+ assert width % 2 == 1 and height % 2 == 1 # odd size
+
+ # Create an empty grid
+ self.grid = Grid(width, height)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Place the agent in the top-left corner
+ self.agent_pos = (1, 1)
+ self.agent_dir = 0
+
+ # Place a goal square in the bottom-right corner
+ self.put_obj(Goal(), width - 2, height - 2)
+
+ # Place obstacles (lava or walls)
+ v, h = object(), object() # singleton `vertical` and `horizontal` objects
+
+ # Lava rivers or walls specified by direction and position in grid
+ rivers = [(v, i) for i in range(2, height - 2, 2)]
+ rivers += [(h, j) for j in range(2, width - 2, 2)]
+ self.np_random.shuffle(rivers)
+ rivers = rivers[:self.num_crossings] # sample random rivers
+ rivers_v = sorted([pos for direction, pos in rivers if direction is v])
+ rivers_h = sorted([pos for direction, pos in rivers if direction is h])
+ obstacle_pos = itt.chain(
+ itt.product(range(1, width - 1), rivers_h),
+ itt.product(rivers_v, range(1, height - 1)),
+ )
+ for i, j in obstacle_pos:
+ self.put_obj(self.obstacle_type(), i, j)
+
+ # Sample path to goal
+ path = [h] * len(rivers_v) + [v] * len(rivers_h)
+ self.np_random.shuffle(path)
+
+ # Create openings
+ limits_v = [0] + rivers_v + [height - 1]
+ limits_h = [0] + rivers_h + [width - 1]
+ room_i, room_j = 0, 0
+ for direction in path:
+ if direction is h:
+ i = limits_v[room_i + 1]
+ j = self.np_random.choice(
+ range(limits_h[room_j] + 1, limits_h[room_j + 1]))
+ room_i += 1
+ elif direction is v:
+ i = self.np_random.choice(
+ range(limits_v[room_i] + 1, limits_v[room_i + 1]))
+ j = limits_h[room_j + 1]
+ room_j += 1
+ else:
+ assert False
+ self.grid.set(i, j, None)
+
+ self.mission = (
+ "avoid the lava and get to the green goal square"
+ if self.obstacle_type == Lava
+ else "find the opening and get to the green goal square"
+ )
+
+class LavaCrossingEnv(CrossingEnv):
+ def __init__(self):
+ super().__init__(size=9, num_crossings=1)
+
+class LavaCrossingS9N2Env(CrossingEnv):
+ def __init__(self):
+ super().__init__(size=9, num_crossings=2)
+
+class LavaCrossingS9N3Env(CrossingEnv):
+ def __init__(self):
+ super().__init__(size=9, num_crossings=3)
+
+class LavaCrossingS11N5Env(CrossingEnv):
+ def __init__(self):
+ super().__init__(size=11, num_crossings=5)
+
+register(
+ id='MiniGrid-LavaCrossingS9N1-v0',
+ entry_point='gym_minigrid.envs:LavaCrossingEnv'
+)
+
+register(
+ id='MiniGrid-LavaCrossingS9N2-v0',
+ entry_point='gym_minigrid.envs:LavaCrossingS9N2Env'
+)
+
+register(
+ id='MiniGrid-LavaCrossingS9N3-v0',
+ entry_point='gym_minigrid.envs:LavaCrossingS9N3Env'
+)
+
+register(
+ id='MiniGrid-LavaCrossingS11N5-v0',
+ entry_point='gym_minigrid.envs:LavaCrossingS11N5Env'
+)
+
+class SimpleCrossingEnv(CrossingEnv):
+ def __init__(self):
+ super().__init__(size=9, num_crossings=1, obstacle_type=Wall)
+
+class SimpleCrossingS9N2Env(CrossingEnv):
+ def __init__(self):
+ super().__init__(size=9, num_crossings=2, obstacle_type=Wall)
+
+class SimpleCrossingS9N3Env(CrossingEnv):
+ def __init__(self):
+ super().__init__(size=9, num_crossings=3, obstacle_type=Wall)
+
+class SimpleCrossingS11N5Env(CrossingEnv):
+ def __init__(self):
+ super().__init__(size=11, num_crossings=5, obstacle_type=Wall)
+
+register(
+ id='MiniGrid-SimpleCrossingS9N1-v0',
+ entry_point='gym_minigrid.envs:SimpleCrossingEnv'
+)
+
+register(
+ id='MiniGrid-SimpleCrossingS9N2-v0',
+ entry_point='gym_minigrid.envs:SimpleCrossingS9N2Env'
+)
+
+register(
+ id='MiniGrid-SimpleCrossingS9N3-v0',
+ entry_point='gym_minigrid.envs:SimpleCrossingS9N3Env'
+)
+
+register(
+ id='MiniGrid-SimpleCrossingS11N5-v0',
+ entry_point='gym_minigrid.envs:SimpleCrossingS11N5Env'
+)
diff --git a/gym-minigrid/gym_minigrid/envs/distshift.py b/gym-minigrid/gym_minigrid/envs/distshift.py
new file mode 100644
index 0000000000000000000000000000000000000000..437a6180846a8c1c278ddba932c4dc1ab185fe67
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/distshift.py
@@ -0,0 +1,70 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+class DistShiftEnv(MiniGridEnv):
+ """
+ Distributional shift environment.
+ """
+
+ def __init__(
+ self,
+ width=9,
+ height=7,
+ agent_start_pos=(1,1),
+ agent_start_dir=0,
+ strip2_row=2
+ ):
+ self.agent_start_pos = agent_start_pos
+ self.agent_start_dir = agent_start_dir
+ self.goal_pos = (width-2, 1)
+ self.strip2_row = strip2_row
+
+ super().__init__(
+ width=width,
+ height=height,
+ max_steps=4*width*height,
+ # Set this to True for maximum speed
+ see_through_walls=True
+ )
+
+ def _gen_grid(self, width, height):
+ # Create an empty grid
+ self.grid = Grid(width, height)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Place a goal square in the bottom-right corner
+ self.put_obj(Goal(), *self.goal_pos)
+
+ # Place the lava rows
+ for i in range(self.width - 6):
+ self.grid.set(3+i, 1, Lava())
+ self.grid.set(3+i, self.strip2_row, Lava())
+
+ # Place the agent
+ if self.agent_start_pos is not None:
+ self.agent_pos = self.agent_start_pos
+ self.agent_dir = self.agent_start_dir
+ else:
+ self.place_agent()
+
+ self.mission = "get to the green goal square"
+
+class DistShift1(DistShiftEnv):
+ def __init__(self):
+ super().__init__(strip2_row=2)
+
+class DistShift2(DistShiftEnv):
+ def __init__(self):
+ super().__init__(strip2_row=5)
+
+register(
+ id='MiniGrid-DistShift1-v0',
+ entry_point='gym_minigrid.envs:DistShift1'
+)
+
+register(
+ id='MiniGrid-DistShift2-v0',
+ entry_point='gym_minigrid.envs:DistShift2'
+)
diff --git a/gym-minigrid/gym_minigrid/envs/doorkey.py b/gym-minigrid/gym_minigrid/envs/doorkey.py
new file mode 100644
index 0000000000000000000000000000000000000000..3bcc74128ba7b1dae5bc3538f5f029dca0efd819
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/doorkey.py
@@ -0,0 +1,76 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+class DoorKeyEnv(MiniGridEnv):
+ """
+ Environment with a door and key, sparse reward
+ """
+
+ def __init__(self, size=8):
+ super().__init__(
+ grid_size=size,
+ max_steps=10*size*size
+ )
+
+ def _gen_grid(self, width, height):
+ # Create an empty grid
+ self.grid = Grid(width, height)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Place a goal in the bottom-right corner
+ self.put_obj(Goal(), width - 2, height - 2)
+
+ # Create a vertical splitting wall
+ splitIdx = self._rand_int(2, width-2)
+ self.grid.vert_wall(splitIdx, 0)
+
+ # Place the agent at a random position and orientation
+ # on the left side of the splitting wall
+ self.place_agent(size=(splitIdx, height))
+
+ # Place a door in the wall
+ doorIdx = self._rand_int(1, width-2)
+ self.put_obj(Door('yellow', is_locked=True), splitIdx, doorIdx)
+
+ # Place a yellow key on the left side
+ self.place_obj(
+ obj=Key('yellow'),
+ top=(0, 0),
+ size=(splitIdx, height)
+ )
+
+ self.mission = "use the key to open the door and then get to the goal"
+
+class DoorKeyEnv5x5(DoorKeyEnv):
+ def __init__(self):
+ super().__init__(size=5)
+
+class DoorKeyEnv6x6(DoorKeyEnv):
+ def __init__(self):
+ super().__init__(size=6)
+
+class DoorKeyEnv16x16(DoorKeyEnv):
+ def __init__(self):
+ super().__init__(size=16)
+
+register(
+ id='MiniGrid-DoorKey-5x5-v0',
+ entry_point='gym_minigrid.envs:DoorKeyEnv5x5'
+)
+
+register(
+ id='MiniGrid-DoorKey-6x6-v0',
+ entry_point='gym_minigrid.envs:DoorKeyEnv6x6'
+)
+
+register(
+ id='MiniGrid-DoorKey-8x8-v0',
+ entry_point='gym_minigrid.envs:DoorKeyEnv'
+)
+
+register(
+ id='MiniGrid-DoorKey-16x16-v0',
+ entry_point='gym_minigrid.envs:DoorKeyEnv16x16'
+)
diff --git a/gym-minigrid/gym_minigrid/envs/dynamicobstacles.py b/gym-minigrid/gym_minigrid/envs/dynamicobstacles.py
new file mode 100644
index 0000000000000000000000000000000000000000..48dd0fbe168c726c5c2393acda96ec64534f103d
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/dynamicobstacles.py
@@ -0,0 +1,139 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+from operator import add
+
+class DynamicObstaclesEnv(MiniGridEnv):
+ """
+ Single-room square grid environment with moving obstacles
+ """
+
+ def __init__(
+ self,
+ size=8,
+ agent_start_pos=(1, 1),
+ agent_start_dir=0,
+ n_obstacles=4
+ ):
+ self.agent_start_pos = agent_start_pos
+ self.agent_start_dir = agent_start_dir
+
+ # Reduce obstacles if there are too many
+ if n_obstacles <= size/2 + 1:
+ self.n_obstacles = int(n_obstacles)
+ else:
+ self.n_obstacles = int(size/2)
+ super().__init__(
+ grid_size=size,
+ max_steps=4 * size * size,
+ # Set this to True for maximum speed
+ see_through_walls=True,
+ )
+ # Allow only 3 actions permitted: left, right, forward
+ self.action_space = spaces.Discrete(self.actions.forward + 1)
+ self.reward_range = (-1, 1)
+
+ def _gen_grid(self, width, height):
+ # Create an empty grid
+ self.grid = Grid(width, height)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Place a goal square in the bottom-right corner
+ self.grid.set(width - 2, height - 2, Goal())
+
+ # Place the agent
+ if self.agent_start_pos is not None:
+ self.agent_pos = self.agent_start_pos
+ self.agent_dir = self.agent_start_dir
+ else:
+ self.place_agent()
+
+ # Place obstacles
+ self.obstacles = []
+ for i_obst in range(self.n_obstacles):
+ self.obstacles.append(Ball())
+ self.place_obj(self.obstacles[i_obst], max_tries=100)
+
+ self.mission = "get to the green goal square"
+
+ def step(self, action):
+ # Invalid action
+ if action >= self.action_space.n:
+ action = 0
+
+ # Check if there is an obstacle in front of the agent
+ front_cell = self.grid.get(*self.front_pos)
+ not_clear = front_cell and front_cell.type != 'goal'
+
+ # Update obstacle positions
+ for i_obst in range(len(self.obstacles)):
+ old_pos = self.obstacles[i_obst].cur_pos
+ top = tuple(map(add, old_pos, (-1, -1)))
+
+ try:
+ self.place_obj(self.obstacles[i_obst], top=top, size=(3,3), max_tries=100)
+ self.grid.set(*old_pos, None)
+ except:
+ pass
+
+ # Update the agent's position/direction
+ obs, reward, done, info = MiniGridEnv.step(self, action)
+
+ # If the agent tried to walk over an obstacle or wall
+ if action == self.actions.forward and not_clear:
+ reward = -1
+ done = True
+ return obs, reward, done, info
+
+ return obs, reward, done, info
+
+class DynamicObstaclesEnv5x5(DynamicObstaclesEnv):
+ def __init__(self):
+ super().__init__(size=5, n_obstacles=2)
+
+class DynamicObstaclesRandomEnv5x5(DynamicObstaclesEnv):
+ def __init__(self):
+ super().__init__(size=5, agent_start_pos=None, n_obstacles=2)
+
+class DynamicObstaclesEnv6x6(DynamicObstaclesEnv):
+ def __init__(self):
+ super().__init__(size=6, n_obstacles=3)
+
+class DynamicObstaclesRandomEnv6x6(DynamicObstaclesEnv):
+ def __init__(self):
+ super().__init__(size=6, agent_start_pos=None, n_obstacles=3)
+
+class DynamicObstaclesEnv16x16(DynamicObstaclesEnv):
+ def __init__(self):
+ super().__init__(size=16, n_obstacles=8)
+
+register(
+ id='MiniGrid-Dynamic-Obstacles-5x5-v0',
+ entry_point='gym_minigrid.envs:DynamicObstaclesEnv5x5'
+)
+
+register(
+ id='MiniGrid-Dynamic-Obstacles-Random-5x5-v0',
+ entry_point='gym_minigrid.envs:DynamicObstaclesRandomEnv5x5'
+)
+
+register(
+ id='MiniGrid-Dynamic-Obstacles-6x6-v0',
+ entry_point='gym_minigrid.envs:DynamicObstaclesEnv6x6'
+)
+
+register(
+ id='MiniGrid-Dynamic-Obstacles-Random-6x6-v0',
+ entry_point='gym_minigrid.envs:DynamicObstaclesRandomEnv6x6'
+)
+
+register(
+ id='MiniGrid-Dynamic-Obstacles-8x8-v0',
+ entry_point='gym_minigrid.envs:DynamicObstaclesEnv'
+)
+
+register(
+ id='MiniGrid-Dynamic-Obstacles-16x16-v0',
+ entry_point='gym_minigrid.envs:DynamicObstaclesEnv16x16'
+)
diff --git a/gym-minigrid/gym_minigrid/envs/empty.py b/gym-minigrid/gym_minigrid/envs/empty.py
new file mode 100644
index 0000000000000000000000000000000000000000..53357a1ca411a9c92d57b7fcfbcc431b5b1ab14c
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/empty.py
@@ -0,0 +1,92 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+class EmptyEnv(MiniGridEnv):
+ """
+ Empty grid environment, no obstacles, sparse reward
+ """
+
+ def __init__(
+ self,
+ size=8,
+ agent_start_pos=(1,1),
+ agent_start_dir=0,
+ ):
+ self.agent_start_pos = agent_start_pos
+ self.agent_start_dir = agent_start_dir
+
+ super().__init__(
+ grid_size=size,
+ max_steps=4*size*size,
+ # Set this to True for maximum speed
+ see_through_walls=True
+ )
+
+ def _gen_grid(self, width, height):
+ # Create an empty grid
+ self.grid = Grid(width, height)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Place a goal square in the bottom-right corner
+ self.put_obj(Goal(), width - 2, height - 2)
+
+ # Place the agent
+ if self.agent_start_pos is not None:
+ self.agent_pos = self.agent_start_pos
+ self.agent_dir = self.agent_start_dir
+ else:
+ self.place_agent()
+
+ self.mission = "get to the green goal square"
+
+class EmptyEnv5x5(EmptyEnv):
+ def __init__(self, **kwargs):
+ super().__init__(size=5, **kwargs)
+
+class EmptyRandomEnv5x5(EmptyEnv):
+ def __init__(self):
+ super().__init__(size=5, agent_start_pos=None)
+
+class EmptyEnv6x6(EmptyEnv):
+ def __init__(self, **kwargs):
+ super().__init__(size=6, **kwargs)
+
+class EmptyRandomEnv6x6(EmptyEnv):
+ def __init__(self):
+ super().__init__(size=6, agent_start_pos=None)
+
+class EmptyEnv16x16(EmptyEnv):
+ def __init__(self, **kwargs):
+ super().__init__(size=16, **kwargs)
+
+register(
+ id='MiniGrid-Empty-5x5-v0',
+ entry_point='gym_minigrid.envs:EmptyEnv5x5'
+)
+
+register(
+ id='MiniGrid-Empty-Random-5x5-v0',
+ entry_point='gym_minigrid.envs:EmptyRandomEnv5x5'
+)
+
+register(
+ id='MiniGrid-Empty-6x6-v0',
+ entry_point='gym_minigrid.envs:EmptyEnv6x6'
+)
+
+register(
+ id='MiniGrid-Empty-Random-6x6-v0',
+ entry_point='gym_minigrid.envs:EmptyRandomEnv6x6'
+)
+
+register(
+ id='MiniGrid-Empty-8x8-v0',
+ entry_point='gym_minigrid.envs:EmptyEnv'
+)
+
+register(
+ id='MiniGrid-Empty-16x16-v0',
+ entry_point='gym_minigrid.envs:EmptyEnv16x16'
+)
diff --git a/gym-minigrid/gym_minigrid/envs/fetch.py b/gym-minigrid/gym_minigrid/envs/fetch.py
new file mode 100644
index 0000000000000000000000000000000000000000..7e5846848b1c550fef180b4e05491adc279f51dd
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/fetch.py
@@ -0,0 +1,109 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+class FetchEnv(MiniGridEnv):
+ """
+ Environment in which the agent has to fetch a random object
+ named using English text strings
+ """
+
+ def __init__(
+ self,
+ size=8,
+ numObjs=3
+ ):
+ self.numObjs = numObjs
+
+ super().__init__(
+ grid_size=size,
+ max_steps=5*size**2,
+ # Set this to True for maximum speed
+ see_through_walls=True
+ )
+
+ def _gen_grid(self, width, height):
+ self.grid = Grid(width, height)
+
+ # Generate the surrounding walls
+ self.grid.horz_wall(0, 0)
+ self.grid.horz_wall(0, height-1)
+ self.grid.vert_wall(0, 0)
+ self.grid.vert_wall(width-1, 0)
+
+ types = ['key', 'ball']
+
+ objs = []
+
+ # For each object to be generated
+ while len(objs) < self.numObjs:
+ objType = self._rand_elem(types)
+ objColor = self._rand_elem(COLOR_NAMES)
+
+ if objType == 'key':
+ obj = Key(objColor)
+ elif objType == 'ball':
+ obj = Ball(objColor)
+
+ self.place_obj(obj)
+ objs.append(obj)
+
+ # Randomize the player start position and orientation
+ self.place_agent()
+
+ # Choose a random object to be picked up
+ target = objs[self._rand_int(0, len(objs))]
+ self.targetType = target.type
+ self.targetColor = target.color
+
+ descStr = '%s %s' % (self.targetColor, self.targetType)
+
+ # Generate the mission string
+ idx = self._rand_int(0, 5)
+ if idx == 0:
+ self.mission = 'get a %s' % descStr
+ elif idx == 1:
+ self.mission = 'go get a %s' % descStr
+ elif idx == 2:
+ self.mission = 'fetch a %s' % descStr
+ elif idx == 3:
+ self.mission = 'go fetch a %s' % descStr
+ elif idx == 4:
+ self.mission = 'you must fetch a %s' % descStr
+ assert hasattr(self, 'mission')
+
+ def step(self, action):
+ obs, reward, done, info = MiniGridEnv.step(self, action)
+
+ if self.carrying:
+ if self.carrying.color == self.targetColor and \
+ self.carrying.type == self.targetType:
+ reward = self._reward()
+ done = True
+ else:
+ reward = 0
+ done = True
+
+ return obs, reward, done, info
+
+class FetchEnv5x5N2(FetchEnv):
+ def __init__(self):
+ super().__init__(size=5, numObjs=2)
+
+class FetchEnv6x6N2(FetchEnv):
+ def __init__(self):
+ super().__init__(size=6, numObjs=2)
+
+register(
+ id='MiniGrid-Fetch-5x5-N2-v0',
+ entry_point='gym_minigrid.envs:FetchEnv5x5N2'
+)
+
+register(
+ id='MiniGrid-Fetch-6x6-N2-v0',
+ entry_point='gym_minigrid.envs:FetchEnv6x6N2'
+)
+
+register(
+ id='MiniGrid-Fetch-8x8-N3-v0',
+ entry_point='gym_minigrid.envs:FetchEnv'
+)
diff --git a/gym-minigrid/gym_minigrid/envs/fourrooms.py b/gym-minigrid/gym_minigrid/envs/fourrooms.py
new file mode 100644
index 0000000000000000000000000000000000000000..b02fbbdd5709659efd647d0bb7e6d4e6b7f6be4a
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/fourrooms.py
@@ -0,0 +1,78 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+
+class FourRoomsEnv(MiniGridEnv):
+ """
+ Classic 4 rooms gridworld environment.
+    The agent and goal positions can be specified; if not, they are set at random.
+ """
+
+ def __init__(self, agent_pos=None, goal_pos=None):
+ self._agent_default_pos = agent_pos
+ self._goal_default_pos = goal_pos
+ super().__init__(grid_size=19, max_steps=100)
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height)
+
+ # Generate the surrounding walls
+ self.grid.horz_wall(0, 0)
+ self.grid.horz_wall(0, height - 1)
+ self.grid.vert_wall(0, 0)
+ self.grid.vert_wall(width - 1, 0)
+
+ room_w = width // 2
+ room_h = height // 2
+
+ # For each row of rooms
+ for j in range(0, 2):
+
+ # For each column
+ for i in range(0, 2):
+ xL = i * room_w
+ yT = j * room_h
+ xR = xL + room_w
+ yB = yT + room_h
+
+                # Right wall and door
+ if i + 1 < 2:
+ self.grid.vert_wall(xR, yT, room_h)
+ pos = (xR, self._rand_int(yT + 1, yB))
+ self.grid.set(*pos, None)
+
+ # Bottom wall and door
+ if j + 1 < 2:
+ self.grid.horz_wall(xL, yB, room_w)
+ pos = (self._rand_int(xL + 1, xR), yB)
+ self.grid.set(*pos, None)
+
+ # Randomize the player start position and orientation
+ if self._agent_default_pos is not None:
+ self.agent_pos = self._agent_default_pos
+ self.grid.set(*self._agent_default_pos, None)
+ self.agent_dir = self._rand_int(0, 4) # assuming random start direction
+ else:
+ self.place_agent()
+
+ if self._goal_default_pos is not None:
+ goal = Goal()
+ self.put_obj(goal, *self._goal_default_pos)
+ goal.init_pos, goal.cur_pos = self._goal_default_pos
+ else:
+ self.place_obj(Goal())
+
+ self.mission = 'Reach the goal'
+
+ def step(self, action):
+ obs, reward, done, info = MiniGridEnv.step(self, action)
+ return obs, reward, done, info
+
+register(
+ id='MiniGrid-FourRooms-v0',
+ entry_point='gym_minigrid.envs:FourRoomsEnv'
+)
diff --git a/gym-minigrid/gym_minigrid/envs/gotodoor.py b/gym-minigrid/gym_minigrid/envs/gotodoor.py
new file mode 100644
index 0000000000000000000000000000000000000000..2817c2cc0c5e710f35f7f689eb9151afcc48e1f6
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/gotodoor.py
@@ -0,0 +1,104 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+class GoToDoorEnv(MiniGridEnv):
+ """
+ Environment in which the agent is instructed to go to a given object
+ named using an English text string
+ """
+
+ def __init__(
+ self,
+ size=5
+ ):
+ assert size >= 5
+
+ super().__init__(
+ grid_size=size,
+ max_steps=5*size**2,
+ # Set this to True for maximum speed
+ see_through_walls=True
+ )
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height)
+
+ # Randomly vary the room width and height
+ width = self._rand_int(5, width+1)
+ height = self._rand_int(5, height+1)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Generate the 4 doors at random positions
+ doorPos = []
+ doorPos.append((self._rand_int(2, width-2), 0))
+ doorPos.append((self._rand_int(2, width-2), height-1))
+ doorPos.append((0, self._rand_int(2, height-2)))
+ doorPos.append((width-1, self._rand_int(2, height-2)))
+
+ # Generate the door colors
+ doorColors = []
+ while len(doorColors) < len(doorPos):
+ color = self._rand_elem(COLOR_NAMES)
+ if color in doorColors:
+ continue
+ doorColors.append(color)
+
+ # Place the doors in the grid
+ for idx, pos in enumerate(doorPos):
+ color = doorColors[idx]
+ self.grid.set(*pos, Door(color))
+
+ # Randomize the agent start position and orientation
+ self.place_agent(size=(width, height))
+
+ # Select a random target door
+ doorIdx = self._rand_int(0, len(doorPos))
+ self.target_pos = doorPos[doorIdx]
+ self.target_color = doorColors[doorIdx]
+
+ # Generate the mission string
+ self.mission = 'go to the %s door' % self.target_color
+
+ def step(self, action):
+ obs, reward, done, info = super().step(action)
+
+ ax, ay = self.agent_pos
+ tx, ty = self.target_pos
+
+ # Don't let the agent open any of the doors
+ if action == self.actions.toggle:
+ done = True
+
+ # Reward performing done action in front of the target door
+ if action == self.actions.done:
+ if (ax == tx and abs(ay - ty) == 1) or (ay == ty and abs(ax - tx) == 1):
+ reward = self._reward()
+ done = True
+
+ return obs, reward, done, info
+
+class GoToDoor8x8Env(GoToDoorEnv):
+ def __init__(self):
+ super().__init__(size=8)
+
+class GoToDoor6x6Env(GoToDoorEnv):
+ def __init__(self):
+ super().__init__(size=6)
+
+register(
+ id='MiniGrid-GoToDoor-5x5-v0',
+ entry_point='gym_minigrid.envs:GoToDoorEnv'
+)
+
+register(
+ id='MiniGrid-GoToDoor-6x6-v0',
+ entry_point='gym_minigrid.envs:GoToDoor6x6Env'
+)
+
+register(
+ id='MiniGrid-GoToDoor-8x8-v0',
+ entry_point='gym_minigrid.envs:GoToDoor8x8Env'
+)
diff --git a/gym-minigrid/gym_minigrid/envs/gotoobject.py b/gym-minigrid/gym_minigrid/envs/gotoobject.py
new file mode 100644
index 0000000000000000000000000000000000000000..ffab837745109459a64cbc1abca21f3de4968138
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/gotoobject.py
@@ -0,0 +1,98 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+class GoToObjectEnv(MiniGridEnv):
+ """
+ Environment in which the agent is instructed to go to a given object
+ named using an English text string
+ """
+
+ def __init__(
+ self,
+ size=6,
+ numObjs=2
+ ):
+ self.numObjs = numObjs
+
+ super().__init__(
+ grid_size=size,
+ max_steps=5*size**2,
+ # Set this to True for maximum speed
+ see_through_walls=True
+ )
+
+ def _gen_grid(self, width, height):
+ self.grid = Grid(width, height)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Types and colors of objects we can generate
+ types = ['key', 'ball', 'box']
+
+ objs = []
+ objPos = []
+
+ # Until we have generated all the objects
+ while len(objs) < self.numObjs:
+ objType = self._rand_elem(types)
+ objColor = self._rand_elem(COLOR_NAMES)
+
+ # If this object already exists, try again
+ if (objType, objColor) in objs:
+ continue
+
+ if objType == 'key':
+ obj = Key(objColor)
+ elif objType == 'ball':
+ obj = Ball(objColor)
+ elif objType == 'box':
+ obj = Box(objColor)
+
+ pos = self.place_obj(obj)
+ objs.append((objType, objColor))
+ objPos.append(pos)
+
+ # Randomize the agent start position and orientation
+ self.place_agent()
+
+ # Choose a random object to be picked up
+ objIdx = self._rand_int(0, len(objs))
+ self.targetType, self.target_color = objs[objIdx]
+ self.target_pos = objPos[objIdx]
+
+ descStr = '%s %s' % (self.target_color, self.targetType)
+ self.mission = 'go to the %s' % descStr
+ #print(self.mission)
+
+ def step(self, action):
+ obs, reward, done, info = MiniGridEnv.step(self, action)
+
+ ax, ay = self.agent_pos
+ tx, ty = self.target_pos
+
+        # Toggle action terminates the episode
+ if action == self.actions.toggle:
+ done = True
+
+ # Reward performing the done action next to the target object
+ if action == self.actions.done:
+ if abs(ax - tx) <= 1 and abs(ay - ty) <= 1:
+ reward = self._reward()
+ done = True
+
+ return obs, reward, done, info
+
+class GotoEnv8x8N2(GoToObjectEnv):
+ def __init__(self):
+ super().__init__(size=8, numObjs=2)
+
+register(
+ id='MiniGrid-GoToObject-6x6-N2-v0',
+ entry_point='gym_minigrid.envs:GoToObjectEnv'
+)
+
+register(
+ id='MiniGrid-GoToObject-8x8-N2-v0',
+ entry_point='gym_minigrid.envs:GotoEnv8x8N2'
+)
diff --git a/gym-minigrid/gym_minigrid/envs/keycorridor.py b/gym-minigrid/gym_minigrid/envs/keycorridor.py
new file mode 100644
index 0000000000000000000000000000000000000000..f51dc8c6a0403667c6bb029f5df411672fd40e1d
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/keycorridor.py
@@ -0,0 +1,137 @@
+from gym_minigrid.roomgrid import RoomGrid
+from gym_minigrid.register import register
+
+class KeyCorridor(RoomGrid):
+ """
+ A ball is behind a locked door, the key is placed in a
+ random room.
+ """
+
+ def __init__(
+ self,
+ num_rows=3,
+ obj_type="ball",
+ room_size=6,
+ seed=None
+ ):
+ self.obj_type = obj_type
+
+ super().__init__(
+ room_size=room_size,
+ num_rows=num_rows,
+ max_steps=30*room_size**2,
+ seed=seed,
+ )
+
+ def _gen_grid(self, width, height):
+ super()._gen_grid(width, height)
+
+ # Connect the middle column rooms into a hallway
+ for j in range(1, self.num_rows):
+ self.remove_wall(1, j, 3)
+
+ # Add a locked door on the bottom right
+ # Add an object behind the locked door
+ room_idx = self._rand_int(0, self.num_rows)
+ door, _ = self.add_door(2, room_idx, 2, locked=True)
+ obj, _ = self.add_object(2, room_idx, kind=self.obj_type)
+
+ # Add a key in a random room on the left side
+ self.add_object(0, self._rand_int(0, self.num_rows), 'key', door.color)
+
+ # Place the agent in the middle
+ self.place_agent(1, self.num_rows // 2)
+
+ # Make sure all rooms are accessible
+ self.connect_all()
+
+ self.obj = obj
+ self.mission = "pick up the %s %s" % (obj.color, obj.type)
+
+ def step(self, action):
+ obs, reward, done, info = super().step(action)
+
+ if action == self.actions.pickup:
+ if self.carrying and self.carrying == self.obj:
+ reward = self._reward()
+ done = True
+
+ return obs, reward, done, info
+
+class KeyCorridorS3R1(KeyCorridor):
+ def __init__(self, seed=None):
+ super().__init__(
+ room_size=3,
+ num_rows=1,
+ seed=seed
+ )
+
+class KeyCorridorS3R2(KeyCorridor):
+ def __init__(self, seed=None):
+ super().__init__(
+ room_size=3,
+ num_rows=2,
+ seed=seed
+ )
+
+class KeyCorridorS3R3(KeyCorridor):
+ def __init__(self, seed=None):
+ super().__init__(
+ room_size=3,
+ num_rows=3,
+ seed=seed
+ )
+
+class KeyCorridorS4R3(KeyCorridor):
+ def __init__(self, seed=None):
+ super().__init__(
+ room_size=4,
+ num_rows=3,
+ seed=seed
+ )
+
+class KeyCorridorS5R3(KeyCorridor):
+ def __init__(self, seed=None):
+ super().__init__(
+ room_size=5,
+ num_rows=3,
+ seed=seed
+ )
+
+class KeyCorridorS6R3(KeyCorridor):
+ def __init__(self, seed=None):
+ super().__init__(
+ room_size=6,
+ num_rows=3,
+ seed=seed
+ )
+
+register(
+ id='MiniGrid-KeyCorridorS3R1-v0',
+ entry_point='gym_minigrid.envs:KeyCorridorS3R1'
+)
+
+register(
+ id='MiniGrid-KeyCorridorS3R2-v0',
+ entry_point='gym_minigrid.envs:KeyCorridorS3R2'
+)
+
+register(
+ id='MiniGrid-KeyCorridorS3R3-v0',
+ entry_point='gym_minigrid.envs:KeyCorridorS3R3'
+)
+
+register(
+ id='MiniGrid-KeyCorridorS4R3-v0',
+ entry_point='gym_minigrid.envs:KeyCorridorS4R3'
+)
+
+register(
+ id='MiniGrid-KeyCorridorS5R3-v0',
+ entry_point='gym_minigrid.envs:KeyCorridorS5R3'
+)
+
+register(
+ id='MiniGrid-KeyCorridorS6R3-v0',
+ entry_point='gym_minigrid.envs:KeyCorridorS6R3'
+)
diff --git a/gym-minigrid/gym_minigrid/envs/lavagap.py b/gym-minigrid/gym_minigrid/envs/lavagap.py
new file mode 100644
index 0000000000000000000000000000000000000000..04368a1446365270dc70677b26a287c029b75848
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/lavagap.py
@@ -0,0 +1,80 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+class LavaGapEnv(MiniGridEnv):
+ """
+ Environment with one wall of lava with a small gap to cross through
+ This environment is similar to LavaCrossing but simpler in structure.
+ """
+
+ def __init__(self, size, obstacle_type=Lava, seed=None):
+ self.obstacle_type = obstacle_type
+ super().__init__(
+ grid_size=size,
+ max_steps=4*size*size,
+ # Set this to True for maximum speed
+ see_through_walls=False,
+            seed=seed
+ )
+
+ def _gen_grid(self, width, height):
+ assert width >= 5 and height >= 5
+
+ # Create an empty grid
+ self.grid = Grid(width, height)
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, width, height)
+
+ # Place the agent in the top-left corner
+ self.agent_pos = (1, 1)
+ self.agent_dir = 0
+
+ # Place a goal square in the bottom-right corner
+ self.goal_pos = np.array((width - 2, height - 2))
+ self.put_obj(Goal(), *self.goal_pos)
+
+ # Generate and store random gap position
+ self.gap_pos = np.array((
+ self._rand_int(2, width - 2),
+ self._rand_int(1, height - 1),
+ ))
+
+ # Place the obstacle wall
+ self.grid.vert_wall(self.gap_pos[0], 1, height - 2, self.obstacle_type)
+
+ # Put a hole in the wall
+ self.grid.set(*self.gap_pos, None)
+
+ self.mission = (
+ "avoid the lava and get to the green goal square"
+ if self.obstacle_type == Lava
+ else "find the opening and get to the green goal square"
+ )
+
+class LavaGapS5Env(LavaGapEnv):
+ def __init__(self):
+ super().__init__(size=5)
+
+class LavaGapS6Env(LavaGapEnv):
+ def __init__(self):
+ super().__init__(size=6)
+
+class LavaGapS7Env(LavaGapEnv):
+ def __init__(self):
+ super().__init__(size=7)
+
+register(
+ id='MiniGrid-LavaGapS5-v0',
+ entry_point='gym_minigrid.envs:LavaGapS5Env'
+)
+
+register(
+ id='MiniGrid-LavaGapS6-v0',
+ entry_point='gym_minigrid.envs:LavaGapS6Env'
+)
+
+register(
+ id='MiniGrid-LavaGapS7-v0',
+ entry_point='gym_minigrid.envs:LavaGapS7Env'
+)
diff --git a/gym-minigrid/gym_minigrid/envs/lockedroom.py b/gym-minigrid/gym_minigrid/envs/lockedroom.py
new file mode 100644
index 0000000000000000000000000000000000000000..a0ad0fa639dfc5a6eb23d55946fae119383ce982
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/lockedroom.py
@@ -0,0 +1,124 @@
+from gym import spaces
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+class Room:
+ def __init__(self,
+ top,
+ size,
+ doorPos
+ ):
+ self.top = top
+ self.size = size
+ self.doorPos = doorPos
+ self.color = None
+ self.locked = False
+
+ def rand_pos(self, env):
+ topX, topY = self.top
+ sizeX, sizeY = self.size
+ return env._rand_pos(
+ topX + 1, topX + sizeX - 1,
+ topY + 1, topY + sizeY - 1
+ )
+
+class LockedRoom(MiniGridEnv):
+ """
+    Environment in which the agent has to fetch the key for a locked room,
+    unlock that room and reach the goal inside, following an English text
+    mission string
+ """
+
+ def __init__(
+ self,
+ size=19
+ ):
+ super().__init__(grid_size=size, max_steps=10*size)
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height)
+
+ # Generate the surrounding walls
+ for i in range(0, width):
+ self.grid.set(i, 0, Wall())
+ self.grid.set(i, height-1, Wall())
+ for j in range(0, height):
+ self.grid.set(0, j, Wall())
+ self.grid.set(width-1, j, Wall())
+
+ # Hallway walls
+ lWallIdx = width // 2 - 2
+ rWallIdx = width // 2 + 2
+ for j in range(0, height):
+ self.grid.set(lWallIdx, j, Wall())
+ self.grid.set(rWallIdx, j, Wall())
+
+ self.rooms = []
+
+ # Room splitting walls
+ for n in range(0, 3):
+ j = n * (height // 3)
+ for i in range(0, lWallIdx):
+ self.grid.set(i, j, Wall())
+ for i in range(rWallIdx, width):
+ self.grid.set(i, j, Wall())
+
+ roomW = lWallIdx + 1
+ roomH = height // 3 + 1
+ self.rooms.append(Room(
+ (0, j),
+ (roomW, roomH),
+ (lWallIdx, j + 3)
+ ))
+ self.rooms.append(Room(
+ (rWallIdx, j),
+ (roomW, roomH),
+ (rWallIdx, j + 3)
+ ))
+
+ # Choose one random room to be locked
+ lockedRoom = self._rand_elem(self.rooms)
+ lockedRoom.locked = True
+ goalPos = lockedRoom.rand_pos(self)
+ self.grid.set(*goalPos, Goal())
+
+ # Assign the door colors
+ colors = set(COLOR_NAMES)
+ for room in self.rooms:
+ color = self._rand_elem(sorted(colors))
+ colors.remove(color)
+ room.color = color
+ if room.locked:
+ self.grid.set(*room.doorPos, Door(color, is_locked=True))
+ else:
+ self.grid.set(*room.doorPos, Door(color))
+
+ # Select a random room to contain the key
+ while True:
+ keyRoom = self._rand_elem(self.rooms)
+ if keyRoom != lockedRoom:
+ break
+ keyPos = keyRoom.rand_pos(self)
+ self.grid.set(*keyPos, Key(lockedRoom.color))
+
+ # Randomize the player start position and orientation
+ self.agent_pos = self.place_agent(
+ top=(lWallIdx, 0),
+ size=(rWallIdx-lWallIdx, height)
+ )
+
+ # Generate the mission string
+ self.mission = (
+ 'get the %s key from the %s room, '
+ 'unlock the %s door and '
+ 'go to the goal'
+ ) % (lockedRoom.color, keyRoom.color, lockedRoom.color)
+
+ def step(self, action):
+ obs, reward, done, info = MiniGridEnv.step(self, action)
+ return obs, reward, done, info
+
+register(
+ id='MiniGrid-LockedRoom-v0',
+ entry_point='gym_minigrid.envs:LockedRoom'
+)
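
For reference, the mission string assembled above has the following shape (colors vary per episode):

```
get the red key from the blue room, unlock the red door and go to the goal
```
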
diff --git a/gym-minigrid/gym_minigrid/envs/memory.py b/gym-minigrid/gym_minigrid/envs/memory.py
new file mode 100644
index 0000000000000000000000000000000000000000..ff9ca86a18619f1eb9bc77a365582925205b9b2e
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/memory.py
@@ -0,0 +1,154 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+class MemoryEnv(MiniGridEnv):
+ """
+ This environment is a memory test. The agent starts in a small room
+ where it sees an object. It then has to go through a narrow hallway
+ which ends in a split. At each end of the split there is an object,
+ one of which is the same as the object in the starting room. The
+ agent has to remember the initial object, and go to the matching
+    object at the split.
+ """
+
+ def __init__(
+ self,
+ seed,
+ size=8,
+ random_length=False,
+ ):
+ self.random_length = random_length
+ super().__init__(
+ seed=seed,
+ grid_size=size,
+ max_steps=5*size**2,
+ # Set this to True for maximum speed
+ see_through_walls=False,
+ )
+
+ def _gen_grid(self, width, height):
+ self.grid = Grid(width, height)
+
+ # Generate the surrounding walls
+ self.grid.horz_wall(0, 0)
+ self.grid.horz_wall(0, height-1)
+ self.grid.vert_wall(0, 0)
+ self.grid.vert_wall(width - 1, 0)
+
+ assert height % 2 == 1
+ upper_room_wall = height // 2 - 2
+ lower_room_wall = height // 2 + 2
+ if self.random_length:
+ hallway_end = self._rand_int(4, width - 2)
+ else:
+ hallway_end = width - 3
+
+ # Start room
+ for i in range(1, 5):
+ self.grid.set(i, upper_room_wall, Wall())
+ self.grid.set(i, lower_room_wall, Wall())
+ self.grid.set(4, upper_room_wall + 1, Wall())
+ self.grid.set(4, lower_room_wall - 1, Wall())
+
+ # Horizontal hallway
+ for i in range(5, hallway_end):
+ self.grid.set(i, upper_room_wall + 1, Wall())
+ self.grid.set(i, lower_room_wall - 1, Wall())
+
+ # Vertical hallway
+ for j in range(0, height):
+ if j != height // 2:
+ self.grid.set(hallway_end, j, Wall())
+ self.grid.set(hallway_end + 2, j, Wall())
+
+ # Fix the player's start position and orientation
+ self.agent_pos = (self._rand_int(1, hallway_end + 1), height // 2)
+ self.agent_dir = 0
+
+ # Place objects
+ start_room_obj = self._rand_elem([Key, Ball])
+ self.grid.set(1, height // 2 - 1, start_room_obj('green'))
+
+ other_objs = self._rand_elem([[Ball, Key], [Key, Ball]])
+ pos0 = (hallway_end + 1, height // 2 - 2)
+ pos1 = (hallway_end + 1, height // 2 + 2)
+ self.grid.set(*pos0, other_objs[0]('green'))
+ self.grid.set(*pos1, other_objs[1]('green'))
+
+ # Choose the target objects
+ if start_room_obj == other_objs[0]:
+ self.success_pos = (pos0[0], pos0[1] + 1)
+ self.failure_pos = (pos1[0], pos1[1] - 1)
+ else:
+ self.success_pos = (pos1[0], pos1[1] - 1)
+ self.failure_pos = (pos0[0], pos0[1] + 1)
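+        # Note: success_pos / failure_pos are the cells directly in front of the
+        # matching / non-matching object at the end of the hallway, so the
+        # episode ends as soon as the agent steps onto one of them (see step())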
+
+ self.mission = 'go to the matching object at the end of the hallway'
+
+ def step(self, action):
+ if action == MiniGridEnv.Actions.pickup:
+ action = MiniGridEnv.Actions.toggle
+ obs, reward, done, info = MiniGridEnv.step(self, action)
+
+ if tuple(self.agent_pos) == self.success_pos:
+ reward = self._reward()
+ done = True
+ if tuple(self.agent_pos) == self.failure_pos:
+ reward = 0
+ done = True
+
+ return obs, reward, done, info
+
+class MemoryS17Random(MemoryEnv):
+ def __init__(self, seed=None):
+ super().__init__(seed=seed, size=17, random_length=True)
+
+register(
+ id='MiniGrid-MemoryS17Random-v0',
+ entry_point='gym_minigrid.envs:MemoryS17Random',
+)
+
+class MemoryS13Random(MemoryEnv):
+ def __init__(self, seed=None):
+ super().__init__(seed=seed, size=13, random_length=True)
+
+register(
+ id='MiniGrid-MemoryS13Random-v0',
+ entry_point='gym_minigrid.envs:MemoryS13Random',
+)
+
+class MemoryS13(MemoryEnv):
+ def __init__(self, seed=None):
+ super().__init__(seed=seed, size=13)
+
+register(
+ id='MiniGrid-MemoryS13-v0',
+ entry_point='gym_minigrid.envs:MemoryS13',
+)
+
+class MemoryS11(MemoryEnv):
+ def __init__(self, seed=None):
+ super().__init__(seed=seed, size=11)
+
+register(
+ id='MiniGrid-MemoryS11-v0',
+ entry_point='gym_minigrid.envs:MemoryS11',
+)
+
+class MemoryS9(MemoryEnv):
+ def __init__(self, seed=None):
+ super().__init__(seed=seed, size=9)
+
+register(
+ id='MiniGrid-MemoryS9-v0',
+ entry_point='gym_minigrid.envs:MemoryS9',
+)
+
+class MemoryS7(MemoryEnv):
+ def __init__(self, seed=None):
+ super().__init__(seed=seed, size=7)
+
+register(
+ id='MiniGrid-MemoryS7-v0',
+ entry_point='gym_minigrid.envs:MemoryS7',
+)
diff --git a/gym-minigrid/gym_minigrid/envs/multiroom.py b/gym-minigrid/gym_minigrid/envs/multiroom.py
new file mode 100644
index 0000000000000000000000000000000000000000..31f1d92801a3da89fbd971a870dd0bfb98068c6a
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/multiroom.py
@@ -0,0 +1,340 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+class Room:
+ def __init__(self,
+ top,
+ size,
+ entryDoorPos,
+ exitDoorPos
+ ):
+ self.top = top
+ self.size = size
+ self.entryDoorPos = entryDoorPos
+ self.exitDoorPos = exitDoorPos
+
+class MultiRoomEnv(MiniGridEnv):
+ """
+ Environment with multiple rooms (subgoals)
+ """
+
+ def __init__(self,
+ minNumRooms,
+ maxNumRooms,
+ maxRoomSize=10
+ ):
+ assert minNumRooms > 0
+ assert maxNumRooms >= minNumRooms
+ assert maxRoomSize >= 4
+
+ self.minNumRooms = minNumRooms
+ self.maxNumRooms = maxNumRooms
+ self.maxRoomSize = maxRoomSize
+
+ self.rooms = []
+
+ super(MultiRoomEnv, self).__init__(
+ grid_size=25,
+ max_steps=self.maxNumRooms * 20
+ )
+
+ def _gen_grid(self, width, height):
+ roomList = []
+
+ # Choose a random number of rooms to generate
+ numRooms = self._rand_int(self.minNumRooms, self.maxNumRooms+1)
+
+ while len(roomList) < numRooms:
+ curRoomList = []
+
+ entryDoorPos = (
+ self._rand_int(0, width - 2),
+ self._rand_int(0, width - 2)
+ )
+
+ # Recursively place the rooms
+ self._placeRoom(
+ numRooms,
+ roomList=curRoomList,
+ minSz=4,
+ maxSz=self.maxRoomSize,
+ entryDoorWall=2,
+ entryDoorPos=entryDoorPos
+ )
+
+ if len(curRoomList) > len(roomList):
+ roomList = curRoomList
+
+ # Store the list of rooms in this environment
+ assert len(roomList) > 0
+ self.rooms = roomList
+
+ # Create the grid
+ self.grid = Grid(width, height, nb_obj_dims=self.nb_obj_dims)
+ wall = Wall()
+
+ prevDoorColor = None
+
+ # For each room
+ for idx, room in enumerate(roomList):
+
+ topX, topY = room.top
+ sizeX, sizeY = room.size
+
+ # Draw the top and bottom walls
+ for i in range(0, sizeX):
+ self.grid.set(topX + i, topY, wall)
+ self.grid.set(topX + i, topY + sizeY - 1, wall)
+
+ # Draw the left and right walls
+ for j in range(0, sizeY):
+ self.grid.set(topX, topY + j, wall)
+ self.grid.set(topX + sizeX - 1, topY + j, wall)
+
+ # If this isn't the first room, place the entry door
+ if idx > 0:
+ # Pick a door color different from the previous one
+ doorColors = set(COLOR_NAMES)
+ if prevDoorColor:
+ doorColors.remove(prevDoorColor)
+                # Note: sorting here guarantees determinism; this is needed
+                # because the iteration order of a Python set is not deterministic
+ doorColor = self._rand_elem(sorted(doorColors))
+
+ entryDoor = Door(doorColor)
+ self.grid.set(*room.entryDoorPos, entryDoor)
+ prevDoorColor = doorColor
+
+ prevRoom = roomList[idx-1]
+ prevRoom.exitDoorPos = room.entryDoorPos
+
+ # Randomize the starting agent position and direction
+ self.place_agent(roomList[0].top, roomList[0].size)
+
+ # Place the final goal in the last room
+ self.goal_pos = self.place_obj(Goal(), roomList[-1].top, roomList[-1].size)
+
+ self.mission = 'traverse the rooms to get to the goal'
+
+ def _placeRoom(
+ self,
+ numLeft,
+ roomList,
+ minSz,
+ maxSz,
+ entryDoorWall,
+ entryDoorPos
+ ):
+ # Choose the room size randomly
+ sizeX = self._rand_int(minSz, maxSz+1)
+ sizeY = self._rand_int(minSz, maxSz+1)
+
+ # The first room will be at the door position
+ if len(roomList) == 0:
+ topX, topY = entryDoorPos
+ # Entry on the right
+ elif entryDoorWall == 0:
+ topX = entryDoorPos[0] - sizeX + 1
+ y = entryDoorPos[1]
+ topY = self._rand_int(y - sizeY + 2, y)
+ # Entry wall on the south
+ elif entryDoorWall == 1:
+ x = entryDoorPos[0]
+ topX = self._rand_int(x - sizeX + 2, x)
+ topY = entryDoorPos[1] - sizeY + 1
+ # Entry wall on the left
+ elif entryDoorWall == 2:
+ topX = entryDoorPos[0]
+ y = entryDoorPos[1]
+ topY = self._rand_int(y - sizeY + 2, y)
+ # Entry wall on the top
+ elif entryDoorWall == 3:
+ x = entryDoorPos[0]
+ topX = self._rand_int(x - sizeX + 2, x)
+ topY = entryDoorPos[1]
+ else:
+ assert False, entryDoorWall
+
+ # If the room is out of the grid, can't place a room here
+ if topX < 0 or topY < 0:
+ return False
+ if topX + sizeX > self.width or topY + sizeY >= self.height:
+ return False
+
+ # If the room intersects with previous rooms, can't place it here
+ for room in roomList[:-1]:
+ nonOverlap = \
+ topX + sizeX < room.top[0] or \
+ room.top[0] + room.size[0] <= topX or \
+ topY + sizeY < room.top[1] or \
+ room.top[1] + room.size[1] <= topY
+
+ if not nonOverlap:
+ return False
+
+ # Add this room to the list
+ roomList.append(Room(
+ (topX, topY),
+ (sizeX, sizeY),
+ entryDoorPos,
+ None
+ ))
+
+ # If this was the last room, stop
+ if numLeft == 1:
+ return True
+
+ # Try placing the next room
+ for i in range(0, 8):
+
+ # Pick which wall to place the out door on
+ wallSet = set((0, 1, 2, 3))
+ wallSet.remove(entryDoorWall)
+ exitDoorWall = self._rand_elem(sorted(wallSet))
+ nextEntryWall = (exitDoorWall + 2) % 4
+
+ # Pick the exit door position
+ # Exit on right wall
+ if exitDoorWall == 0:
+ exitDoorPos = (
+ topX + sizeX - 1,
+ topY + self._rand_int(1, sizeY - 1)
+ )
+ # Exit on south wall
+ elif exitDoorWall == 1:
+ exitDoorPos = (
+ topX + self._rand_int(1, sizeX - 1),
+ topY + sizeY - 1
+ )
+ # Exit on left wall
+ elif exitDoorWall == 2:
+ exitDoorPos = (
+ topX,
+ topY + self._rand_int(1, sizeY - 1)
+ )
+ # Exit on north wall
+ elif exitDoorWall == 3:
+ exitDoorPos = (
+ topX + self._rand_int(1, sizeX - 1),
+ topY
+ )
+ else:
+ assert False
+
+ # Recursively create the other rooms
+ success = self._placeRoom(
+ numLeft - 1,
+ roomList=roomList,
+ minSz=minSz,
+ maxSz=maxSz,
+ entryDoorWall=nextEntryWall,
+ entryDoorPos=exitDoorPos
+ )
+
+ if success:
+ break
+
+ return True
+
+class MultiRoomEnvN2S4(MultiRoomEnv):
+ def __init__(self):
+ super().__init__(
+ minNumRooms=2,
+ maxNumRooms=2,
+ maxRoomSize=4
+ )
+
+class MultiRoomEnvN4S5(MultiRoomEnv):
+ def __init__(self):
+ super().__init__(
+ minNumRooms=4,
+ maxNumRooms=4,
+ maxRoomSize=5
+ )
+
+class MultiRoomEnvN6(MultiRoomEnv):
+ def __init__(self):
+ super().__init__(
+ minNumRooms=6,
+ maxNumRooms=6
+ )
+
+class MultiRoomEnvN7S4(MultiRoomEnv):
+ def __init__(self):
+ super().__init__(
+ minNumRooms=7,
+ maxNumRooms=7,
+ maxRoomSize=4
+ )
+
+class MultiRoomEnvN7S8(MultiRoomEnv):
+ def __init__(self):
+ super().__init__(
+ minNumRooms=7,
+ maxNumRooms=7,
+ maxRoomSize=8
+ )
+
+class MultiRoomEnvN10S4(MultiRoomEnv):
+ def __init__(self):
+ super().__init__(
+ minNumRooms=10,
+ maxNumRooms=10,
+ maxRoomSize=4
+ )
+
+class MultiRoomEnvN10S10(MultiRoomEnv):
+ def __init__(self):
+ super().__init__(
+ minNumRooms=10,
+ maxNumRooms=10,
+ maxRoomSize=10
+ )
+
+class MultiRoomEnvN12S10(MultiRoomEnv):
+ def __init__(self):
+ super().__init__(
+ minNumRooms=12,
+ maxNumRooms=12,
+ maxRoomSize=10
+ )
+
+register(
+ id='MiniGrid-MultiRoom-N2-S4-v0',
+ entry_point='gym_minigrid.envs:MultiRoomEnvN2S4'
+)
+
+register(
+ id='MiniGrid-MultiRoom-N4-S5-v0',
+ entry_point='gym_minigrid.envs:MultiRoomEnvN4S5'
+)
+
+register(
+ id='MiniGrid-MultiRoom-N6-v0',
+ entry_point='gym_minigrid.envs:MultiRoomEnvN6'
+)
+
+register(
+ id='MiniGrid-MultiRoom-N7-S4-v0',
+ entry_point='gym_minigrid.envs:MultiRoomEnvN7S4'
+)
+
+register(
+ id='MiniGrid-MultiRoom-N7-S8-v0',
+ entry_point='gym_minigrid.envs:MultiRoomEnvN7S8'
+)
+
+register(
+ id='MiniGrid-MultiRoom-N10-S4-v0',
+ entry_point='gym_minigrid.envs:MultiRoomEnvN10S4'
+)
+
+register(
+ id='MiniGrid-MultiRoom-N10-S10-v0',
+ entry_point='gym_minigrid.envs:MultiRoomEnvN10S10'
+)
+
+register(
+ id='MiniGrid-MultiRoom-N12-S10-v0',
+ entry_point='gym_minigrid.envs:MultiRoomEnvN12S10'
+)
\ No newline at end of file
diff --git a/gym-minigrid/gym_minigrid/envs/multiroom_noisytv.py b/gym-minigrid/gym_minigrid/envs/multiroom_noisytv.py
new file mode 100644
index 0000000000000000000000000000000000000000..9487b024dba8eb921c83dfb435f17a52693aa6b5
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/multiroom_noisytv.py
@@ -0,0 +1,325 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+class Room:
+ def __init__(self,
+ top,
+ size,
+ entryDoorPos,
+ exitDoorPos
+ ):
+ self.top = top
+ self.size = size
+ self.entryDoorPos = entryDoorPos
+ self.exitDoorPos = exitDoorPos
+
+class MultiRoomNoisyTVEnv(MiniGridEnv):
+ """
+    Environment with multiple rooms (subgoals) and a "noisy TV": a ball in the
+    first room whose color the agent can randomize with a dedicated action
+ """
+
+ def __init__(self,
+ minNumRooms,
+ maxNumRooms,
+ maxRoomSize=10
+ ):
+ assert minNumRooms > 0
+ assert maxNumRooms >= minNumRooms
+ assert maxRoomSize >= 4
+
+ self.minNumRooms = minNumRooms
+ self.maxNumRooms = maxNumRooms
+ self.maxRoomSize = maxRoomSize
+
+ self.rooms = []
+
+ super(MultiRoomNoisyTVEnv, self).__init__(
+ grid_size=25,
+ max_steps=self.maxNumRooms * 20
+ )
+
+ def _gen_grid(self, width, height):
+ roomList = []
+
+ # Choose a random number of rooms to generate
+ numRooms = self._rand_int(self.minNumRooms, self.maxNumRooms+1)
+
+ while len(roomList) < numRooms:
+ curRoomList = []
+
+ entryDoorPos = (
+ self._rand_int(0, width - 2),
+ self._rand_int(0, width - 2)
+ )
+
+ # Recursively place the rooms
+ self._placeRoom(
+ numRooms,
+ roomList=curRoomList,
+ minSz=4,
+ maxSz=self.maxRoomSize,
+ entryDoorWall=2,
+ entryDoorPos=entryDoorPos
+ )
+
+ if len(curRoomList) > len(roomList):
+ roomList = curRoomList
+
+ # Store the list of rooms in this environment
+ assert len(roomList) > 0
+ self.rooms = roomList
+
+ # Create the grid
+ self.grid = Grid(width, height)
+ wall = Wall()
+
+ prevDoorColor = None
+
+ # For each room
+ for idx, room in enumerate(roomList):
+
+ topX, topY = room.top
+ sizeX, sizeY = room.size
+
+ # Draw the top and bottom walls
+ for i in range(0, sizeX):
+ self.grid.set(topX + i, topY, wall)
+ self.grid.set(topX + i, topY + sizeY - 1, wall)
+
+ # Draw the left and right walls
+ for j in range(0, sizeY):
+ self.grid.set(topX, topY + j, wall)
+ self.grid.set(topX + sizeX - 1, topY + j, wall)
+
+            # Create the noisy TV: a ball of arbitrary color whose color
+            # the agent can change with a dedicated action (see step() below)
+ if idx == 0:
+ self.noisy_tv = Ball(self._rand_elem(COLOR_NAMES))
+ self.place_obj(
+ self.noisy_tv,
+ top=room.top,
+ size=room.size,
+ max_tries=100,
+ )
+
+ # If this isn't the first room, place the entry door
+ if idx > 0:
+ # Pick a door color different from the previous one
+ doorColors = set(COLOR_NAMES)
+ if prevDoorColor:
+ doorColors.remove(prevDoorColor)
+                # Note: sorting here guarantees determinism; this is needed
+                # because the iteration order of a Python set is not deterministic
+ doorColor = self._rand_elem(sorted(doorColors))
+
+ entryDoor = Door(doorColor)
+ self.grid.set(*room.entryDoorPos, entryDoor)
+ prevDoorColor = doorColor
+
+ prevRoom = roomList[idx-1]
+ prevRoom.exitDoorPos = room.entryDoorPos
+
+ # Randomize the starting agent position and direction
+ self.place_agent(roomList[0].top, roomList[0].size)
+
+ # Place the final goal in the last room
+ self.goal_pos = self.place_obj(Goal(), roomList[-1].top, roomList[-1].size)
+
+ self.mission = 'traverse the rooms to get to the goal'
+
+ def _placeRoom(
+ self,
+ numLeft,
+ roomList,
+ minSz,
+ maxSz,
+ entryDoorWall,
+ entryDoorPos
+ ):
+ # Choose the room size randomly
+ sizeX = self._rand_int(minSz, maxSz+1)
+ sizeY = self._rand_int(minSz, maxSz+1)
+
+ # The first room will be at the door position
+ if len(roomList) == 0:
+ topX, topY = entryDoorPos
+ # Entry on the right
+ elif entryDoorWall == 0:
+ topX = entryDoorPos[0] - sizeX + 1
+ y = entryDoorPos[1]
+ topY = self._rand_int(y - sizeY + 2, y)
+ # Entry wall on the south
+ elif entryDoorWall == 1:
+ x = entryDoorPos[0]
+ topX = self._rand_int(x - sizeX + 2, x)
+ topY = entryDoorPos[1] - sizeY + 1
+ # Entry wall on the left
+ elif entryDoorWall == 2:
+ topX = entryDoorPos[0]
+ y = entryDoorPos[1]
+ topY = self._rand_int(y - sizeY + 2, y)
+ # Entry wall on the top
+ elif entryDoorWall == 3:
+ x = entryDoorPos[0]
+ topX = self._rand_int(x - sizeX + 2, x)
+ topY = entryDoorPos[1]
+ else:
+ assert False, entryDoorWall
+
+ # If the room is out of the grid, can't place a room here
+ if topX < 0 or topY < 0:
+ return False
+ if topX + sizeX > self.width or topY + sizeY >= self.height:
+ return False
+
+ # If the room intersects with previous rooms, can't place it here
+ for room in roomList[:-1]:
+ nonOverlap = \
+ topX + sizeX < room.top[0] or \
+ room.top[0] + room.size[0] <= topX or \
+ topY + sizeY < room.top[1] or \
+ room.top[1] + room.size[1] <= topY
+
+ if not nonOverlap:
+ return False
+
+ # Add this room to the list
+ roomList.append(Room(
+ (topX, topY),
+ (sizeX, sizeY),
+ entryDoorPos,
+ None
+ ))
+
+ # If this was the last room, stop
+ if numLeft == 1:
+ return True
+
+ # Try placing the next room
+ for i in range(0, 8):
+
+ # Pick which wall to place the out door on
+ wallSet = set((0, 1, 2, 3))
+ wallSet.remove(entryDoorWall)
+ exitDoorWall = self._rand_elem(sorted(wallSet))
+ nextEntryWall = (exitDoorWall + 2) % 4
+
+ # Pick the exit door position
+ # Exit on right wall
+ if exitDoorWall == 0:
+ exitDoorPos = (
+ topX + sizeX - 1,
+ topY + self._rand_int(1, sizeY - 1)
+ )
+ # Exit on south wall
+ elif exitDoorWall == 1:
+ exitDoorPos = (
+ topX + self._rand_int(1, sizeX - 1),
+ topY + sizeY - 1
+ )
+ # Exit on left wall
+ elif exitDoorWall == 2:
+ exitDoorPos = (
+ topX,
+ topY + self._rand_int(1, sizeY - 1)
+ )
+ # Exit on north wall
+ elif exitDoorWall == 3:
+ exitDoorPos = (
+ topX + self._rand_int(1, sizeX - 1),
+ topY
+ )
+ else:
+ assert False
+
+ # Recursively create the other rooms
+ success = self._placeRoom(
+ numLeft - 1,
+ roomList=roomList,
+ minSz=minSz,
+ maxSz=maxSz,
+ entryDoorWall=nextEntryWall,
+ entryDoorPos=exitDoorPos
+ )
+
+ if success:
+ break
+
+ return True
+
+ def step(self, action):
+ self.step_count += 1
+
+ reward = 0
+ done = False
+
+ # Get the position in front of the agent
+ fwd_pos = self.front_pos
+
+ # Get the contents of the cell in front of the agent
+ fwd_cell = self.grid.get(*fwd_pos)
+
+ # Rotate left
+ if action == self.actions.left:
+ self.agent_dir -= 1
+ if self.agent_dir < 0:
+ self.agent_dir += 4
+
+ # Rotate right
+ elif action == self.actions.right:
+ self.agent_dir = (self.agent_dir + 1) % 4
+
+ # Move forward
+ elif action == self.actions.forward:
+            if fwd_cell is None or fwd_cell.can_overlap():
+                self.agent_pos = fwd_pos
+            if fwd_cell is not None and fwd_cell.type == 'goal':
+                done = True
+                reward = self._reward()
+            if fwd_cell is not None and fwd_cell.type == 'lava':
+ done = True
+
+        # Pick up an object -- here the agent should indeed pick up the ball if it is in front of a door
+ elif action == self.actions.pickup:
+ if fwd_cell and fwd_cell.can_pickup():
+ if self.carrying is None:
+ self.carrying = fwd_cell
+ self.carrying.cur_pos = np.array([-1, -1])
+ self.grid.set(*fwd_pos, None)
+
+ # NOTE: there is no Drop action in this case
+ # Instead, this action is used to randomly change the color of the noisy-tv
+ elif action == self.actions.drop:
+ self.noisy_tv.color = self._rand_elem(COLOR_NAMES)
+
+ # Toggle/activate an object
+ elif action == self.actions.toggle:
+ if fwd_cell:
+ fwd_cell.toggle(self, fwd_pos)
+
+ # Done action (not used by default)
+ elif action == self.actions.done:
+ pass
+
+ else:
+ assert False, "unknown action"
+
+ if self.step_count >= self.max_steps:
+ done = True
+
+ obs = self.gen_obs()
+
+ return obs, reward, done, {}
+
+class MultiRoomNoisyTVEnvN7S4(MultiRoomNoisyTVEnv):
+ def __init__(self):
+ super().__init__(
+ minNumRooms=7,
+ maxNumRooms=7,
+ maxRoomSize=4
+ )
+
+register(
+ id='MiniGrid-MultiRoomNoisyTV-N7-S4-v0',
+ entry_point='gym_minigrid.envs:MultiRoomNoisyTVEnvN7S4'
+)
\ No newline at end of file
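
A quick way to see the repurposed `drop` action in effect (a sketch only, assuming the registration above and the pre-0.26 `gym` step API):

```
import gym
import gym_minigrid

env = gym.make('MiniGrid-MultiRoomNoisyTV-N7-S4-v0')
env.reset()

# 'drop' is repurposed here: it recolors the noisy-TV ball at random and
# never yields a reward on its own
drop = env.unwrapped.actions.drop
obs, reward, done, info = env.step(drop)
```
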
diff --git a/gym-minigrid/gym_minigrid/envs/obstructedmaze.py b/gym-minigrid/gym_minigrid/envs/obstructedmaze.py
new file mode 100644
index 0000000000000000000000000000000000000000..2b7987c12930d850e0ceceea88c4593771694f4b
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/obstructedmaze.py
@@ -0,0 +1,224 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.roomgrid import RoomGrid
+from gym_minigrid.register import register
+
+class ObstructedMazeEnv(RoomGrid):
+ """
+ A blue ball is hidden in the maze. Doors may be locked,
+ doors may be obstructed by a ball and keys may be hidden in boxes.
+ """
+
+ def __init__(self,
+ num_rows,
+ num_cols,
+ num_rooms_visited,
+ seed=None
+ ):
+ room_size = 6
+ max_steps = 4*num_rooms_visited*room_size**2
+
+ super().__init__(
+ room_size=room_size,
+ num_rows=num_rows,
+ num_cols=num_cols,
+ max_steps=max_steps,
+ seed=seed
+ )
+
+ def _gen_grid(self, width, height):
+ super()._gen_grid(width, height)
+
+ # Define all possible colors for doors
+ self.door_colors = self._rand_subset(COLOR_NAMES, len(COLOR_NAMES))
+ # Define the color of the ball to pick up
+ self.ball_to_find_color = COLOR_NAMES[0]
+ # Define the color of the balls that obstruct doors
+ self.blocking_ball_color = COLOR_NAMES[1]
+ # Define the color of boxes in which keys are hidden
+ self.box_color = COLOR_NAMES[2]
+
+ self.mission = "pick up the %s ball" % self.ball_to_find_color
+
+ def step(self, action):
+ obs, reward, done, info = super().step(action)
+
+ if action == self.actions.pickup:
+ if self.carrying and self.carrying == self.obj:
+ reward = self._reward()
+ done = True
+
+ return obs, reward, done, info
+
+ def add_door(self, i, j, door_idx=0, color=None, locked=False, key_in_box=False, blocked=False):
+ """
+ Add a door. If the door must be locked, it also adds the key.
+ If the key must be hidden, it is put in a box. If the door must
+ be obstructed, it adds a ball in front of the door.
+ """
+
+ door, door_pos = super().add_door(i, j, door_idx, color, locked=locked)
+
+ if blocked:
+ vec = DIR_TO_VEC[door_idx]
+            blocking_ball = Ball(self.blocking_ball_color)
+ self.grid.set(door_pos[0]-vec[0], door_pos[1]-vec[1], blocking_ball)
+
+ if locked:
+ obj = Key(door.color)
+ if key_in_box:
+                box = Box(self.box_color)
+ box.contains = obj
+ obj = box
+ self.place_in_room(i, j, obj)
+
+ return door, door_pos
+
+class ObstructedMaze_1Dlhb(ObstructedMazeEnv):
+ """
+ A blue ball is hidden in a 2x1 maze. A locked door separates
+ rooms. Doors are obstructed by a ball and keys are hidden in boxes.
+ """
+
+ def __init__(self, key_in_box=True, blocked=True, seed=None):
+ self.key_in_box = key_in_box
+ self.blocked = blocked
+
+ super().__init__(
+ num_rows=1,
+ num_cols=2,
+ num_rooms_visited=2,
+ seed=seed
+ )
+
+ def _gen_grid(self, width, height):
+ super()._gen_grid(width, height)
+
+ self.add_door(0, 0, door_idx=0, color=self.door_colors[0],
+ locked=True,
+ key_in_box=self.key_in_box,
+ blocked=self.blocked)
+
+ self.obj, _ = self.add_object(1, 0, "ball", color=self.ball_to_find_color)
+ self.place_agent(0, 0)
+
+class ObstructedMaze_1Dl(ObstructedMaze_1Dlhb):
+ def __init__(self, seed=None):
+ super().__init__(False, False, seed)
+
+class ObstructedMaze_1Dlh(ObstructedMaze_1Dlhb):
+ def __init__(self, seed=None):
+ super().__init__(True, False, seed)
+
+class ObstructedMaze_Full(ObstructedMazeEnv):
+ """
+ A blue ball is hidden in one of the 4 corners of a 3x3 maze. Doors
+ are locked, doors are obstructed by a ball and keys are hidden in
+ boxes.
+ """
+
+ def __init__(self, agent_room=(1, 1), key_in_box=True, blocked=True,
+ num_quarters=4, num_rooms_visited=25, seed=None):
+ self.agent_room = agent_room
+ self.key_in_box = key_in_box
+ self.blocked = blocked
+ self.num_quarters = num_quarters
+
+ super().__init__(
+ num_rows=3,
+ num_cols=3,
+ num_rooms_visited=num_rooms_visited,
+ seed=seed
+ )
+
+ def _gen_grid(self, width, height):
+ super()._gen_grid(width, height)
+
+ middle_room = (1, 1)
+ # Define positions of "side rooms" i.e. rooms that are neither
+ # corners nor the center.
+ side_rooms = [(2, 1), (1, 2), (0, 1), (1, 0)][:self.num_quarters]
+ for i in range(len(side_rooms)):
+ side_room = side_rooms[i]
+
+ # Add a door between the center room and the side room
+ self.add_door(*middle_room, door_idx=i, color=self.door_colors[i], locked=False)
+
+ for k in [-1, 1]:
+ # Add a door to each side of the side room
+ self.add_door(*side_room, locked=True,
+ door_idx=(i+k)%4,
+ color=self.door_colors[(i+k)%len(self.door_colors)],
+ key_in_box=self.key_in_box,
+ blocked=self.blocked)
+
+ corners = [(2, 0), (2, 2), (0, 2), (0, 0)][:self.num_quarters]
+ ball_room = self._rand_elem(corners)
+
+ self.obj, _ = self.add_object(*ball_room, "ball", color=self.ball_to_find_color)
+ self.place_agent(*self.agent_room)
+
+class ObstructedMaze_2Dl(ObstructedMaze_Full):
+ def __init__(self, seed=None):
+ super().__init__((2, 1), False, False, 1, 4, seed)
+
+class ObstructedMaze_2Dlh(ObstructedMaze_Full):
+ def __init__(self, seed=None):
+ super().__init__((2, 1), True, False, 1, 4, seed)
+
+
+class ObstructedMaze_2Dlhb(ObstructedMaze_Full):
+ def __init__(self, seed=None):
+ super().__init__((2, 1), True, True, 1, 4, seed)
+
+class ObstructedMaze_1Q(ObstructedMaze_Full):
+ def __init__(self, seed=None):
+ super().__init__((1, 1), True, True, 1, 5, seed)
+
+class ObstructedMaze_2Q(ObstructedMaze_Full):
+ def __init__(self, seed=None):
+ super().__init__((1, 1), True, True, 2, 11, seed)
+
+register(
+ id="MiniGrid-ObstructedMaze-1Dl-v0",
+ entry_point="gym_minigrid.envs:ObstructedMaze_1Dl"
+)
+
+register(
+ id="MiniGrid-ObstructedMaze-1Dlh-v0",
+ entry_point="gym_minigrid.envs:ObstructedMaze_1Dlh"
+)
+
+register(
+ id="MiniGrid-ObstructedMaze-1Dlhb-v0",
+ entry_point="gym_minigrid.envs:ObstructedMaze_1Dlhb"
+)
+
+register(
+ id="MiniGrid-ObstructedMaze-2Dl-v0",
+ entry_point="gym_minigrid.envs:ObstructedMaze_2Dl"
+)
+
+register(
+ id="MiniGrid-ObstructedMaze-2Dlh-v0",
+ entry_point="gym_minigrid.envs:ObstructedMaze_2Dlh"
+)
+
+register(
+ id="MiniGrid-ObstructedMaze-2Dlhb-v0",
+ entry_point="gym_minigrid.envs:ObstructedMaze_2Dlhb"
+)
+
+register(
+ id="MiniGrid-ObstructedMaze-1Q-v0",
+ entry_point="gym_minigrid.envs:ObstructedMaze_1Q"
+)
+
+register(
+ id="MiniGrid-ObstructedMaze-2Q-v0",
+ entry_point="gym_minigrid.envs:ObstructedMaze_2Q"
+)
+
+register(
+ id="MiniGrid-ObstructedMaze-Full-v0",
+ entry_point="gym_minigrid.envs:ObstructedMaze_Full"
+)
\ No newline at end of file
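
Reading the constructors above, the suffixes of the `ObstructedMaze` variants appear to encode the obstructions used: `Dl` for locked doors, `h` for keys hidden in boxes, `b` for doors blocked by a ball, and `NQ` for how many quarters of the full 3x3 maze are in play.
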
diff --git a/gym-minigrid/gym_minigrid/envs/playground_v0.py b/gym-minigrid/gym_minigrid/envs/playground_v0.py
new file mode 100644
index 0000000000000000000000000000000000000000..226bb1cc3751f3ab32bee39bd9bb08f6b2ac2779
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/playground_v0.py
@@ -0,0 +1,76 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+class PlaygroundV0(MiniGridEnv):
+ """
+ Environment with multiple rooms and random objects.
+ This environment has no specific goals or rewards.
+ """
+
+ def __init__(self):
+ super().__init__(grid_size=19, max_steps=100)
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height)
+
+ # Generate the surrounding walls
+ self.grid.horz_wall(0, 0)
+ self.grid.horz_wall(0, height-1)
+ self.grid.vert_wall(0, 0)
+ self.grid.vert_wall(width-1, 0)
+
+ roomW = width // 3
+ roomH = height // 3
+
+ # For each row of rooms
+ for j in range(0, 3):
+
+ # For each column
+ for i in range(0, 3):
+ xL = i * roomW
+ yT = j * roomH
+ xR = xL + roomW
+ yB = yT + roomH
+
+                # Right wall and door
+ if i+1 < 3:
+ self.grid.vert_wall(xR, yT, roomH)
+ pos = (xR, self._rand_int(yT+1, yB-1))
+ color = self._rand_elem(COLOR_NAMES)
+ self.grid.set(*pos, Door(color))
+
+ # Bottom wall and door
+ if j+1 < 3:
+ self.grid.horz_wall(xL, yB, roomW)
+ pos = (self._rand_int(xL+1, xR-1), yB)
+ color = self._rand_elem(COLOR_NAMES)
+ self.grid.set(*pos, Door(color))
+
+ # Randomize the player start position and orientation
+ self.place_agent()
+
+ # Place random objects in the world
+ types = ['key', 'ball', 'box']
+ for i in range(0, 12):
+ objType = self._rand_elem(types)
+ objColor = self._rand_elem(COLOR_NAMES)
+ if objType == 'key':
+ obj = Key(objColor)
+ elif objType == 'ball':
+ obj = Ball(objColor)
+ elif objType == 'box':
+ obj = Box(objColor)
+ self.place_obj(obj)
+
+ # No explicit mission in this environment
+ self.mission = ''
+
+ def step(self, action):
+ obs, reward, done, info = MiniGridEnv.step(self, action)
+ return obs, reward, done, info
+
+register(
+ id='MiniGrid-Playground-v0',
+ entry_point='gym_minigrid.envs:PlaygroundV0'
+)
diff --git a/gym-minigrid/gym_minigrid/envs/putnear.py b/gym-minigrid/gym_minigrid/envs/putnear.py
new file mode 100644
index 0000000000000000000000000000000000000000..19ee1a534e2c805a7de3784d90ddecd8a9f2b2db
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/putnear.py
@@ -0,0 +1,126 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+class PutNearEnv(MiniGridEnv):
+ """
+ Environment in which the agent is instructed to place an object near
+ another object through a natural language string.
+ """
+
+ def __init__(
+ self,
+ size=6,
+ numObjs=2
+ ):
+ self.numObjs = numObjs
+
+ super().__init__(
+ grid_size=size,
+ max_steps=5*size,
+ # Set this to True for maximum speed
+ see_through_walls=True
+ )
+
+ def _gen_grid(self, width, height):
+ self.grid = Grid(width, height)
+
+ # Generate the surrounding walls
+ self.grid.horz_wall(0, 0)
+ self.grid.horz_wall(0, height-1)
+ self.grid.vert_wall(0, 0)
+ self.grid.vert_wall(width-1, 0)
+
+ # Types and colors of objects we can generate
+ types = ['key', 'ball', 'box']
+
+ objs = []
+ objPos = []
+
+ def near_obj(env, p1):
+ for p2 in objPos:
+ dx = p1[0] - p2[0]
+ dy = p1[1] - p2[1]
+ if abs(dx) <= 1 and abs(dy) <= 1:
+ return True
+ return False
+
+ # Until we have generated all the objects
+ while len(objs) < self.numObjs:
+ objType = self._rand_elem(types)
+ objColor = self._rand_elem(COLOR_NAMES)
+
+ # If this object already exists, try again
+ if (objType, objColor) in objs:
+ continue
+
+ if objType == 'key':
+ obj = Key(objColor)
+ elif objType == 'ball':
+ obj = Ball(objColor)
+ elif objType == 'box':
+ obj = Box(objColor)
+
+ pos = self.place_obj(obj, reject_fn=near_obj)
+
+ objs.append((objType, objColor))
+ objPos.append(pos)
+
+ # Randomize the agent start position and orientation
+ self.place_agent()
+
+ # Choose a random object to be moved
+ objIdx = self._rand_int(0, len(objs))
+ self.move_type, self.moveColor = objs[objIdx]
+ self.move_pos = objPos[objIdx]
+
+ # Choose a target object (to put the first object next to)
+ while True:
+ targetIdx = self._rand_int(0, len(objs))
+ if targetIdx != objIdx:
+ break
+ self.target_type, self.target_color = objs[targetIdx]
+ self.target_pos = objPos[targetIdx]
+
+ self.mission = 'put the %s %s near the %s %s' % (
+ self.moveColor,
+ self.move_type,
+ self.target_color,
+ self.target_type
+ )
+
+ def step(self, action):
+ preCarrying = self.carrying
+
+ obs, reward, done, info = super().step(action)
+
+ u, v = self.dir_vec
+ ox, oy = (self.agent_pos[0] + u, self.agent_pos[1] + v)
+ tx, ty = self.target_pos
+
+ # If we picked up the wrong object, terminate the episode
+ if action == self.actions.pickup and self.carrying:
+ if self.carrying.type != self.move_type or self.carrying.color != self.moveColor:
+ done = True
+
+ # If successfully dropping an object near the target
+ if action == self.actions.drop and preCarrying:
+ if self.grid.get(ox, oy) is preCarrying:
+ if abs(ox - tx) <= 1 and abs(oy - ty) <= 1:
+ reward = self._reward()
+ done = True
+
+ return obs, reward, done, info
+
+class PutNear8x8N3(PutNearEnv):
+ def __init__(self):
+ super().__init__(size=8, numObjs=3)
+
+register(
+ id='MiniGrid-PutNear-6x6-N2-v0',
+ entry_point='gym_minigrid.envs:PutNearEnv'
+)
+
+register(
+ id='MiniGrid-PutNear-8x8-N3-v0',
+ entry_point='gym_minigrid.envs:PutNear8x8N3'
+)
diff --git a/gym-minigrid/gym_minigrid/envs/redbluedoors.py b/gym-minigrid/gym_minigrid/envs/redbluedoors.py
new file mode 100644
index 0000000000000000000000000000000000000000..cea95b40e77fc6060eb9d9a70a17ec742073fdad
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/redbluedoors.py
@@ -0,0 +1,80 @@
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+
+class RedBlueDoorEnv(MiniGridEnv):
+ """
+ Single room with red and blue doors on opposite sides.
+ The red door must be opened before the blue door to
+ obtain a reward.
+ """
+
+ def __init__(self, size=8):
+ self.size = size
+
+ super().__init__(
+ width=2*size,
+ height=size,
+ max_steps=20*size*size
+ )
+
+ def _gen_grid(self, width, height):
+ # Create an empty grid
+ self.grid = Grid(width, height)
+
+ # Generate the grid walls
+ self.grid.wall_rect(0, 0, 2*self.size, self.size)
+ self.grid.wall_rect(self.size//2, 0, self.size, self.size)
+
+ # Place the agent in the top-left corner
+ self.place_agent(top=(self.size//2, 0), size=(self.size, self.size))
+
+ # Add a red door at a random position in the left wall
+ pos = self._rand_int(1, self.size - 1)
+ self.red_door = Door("red")
+ self.grid.set(self.size//2, pos, self.red_door)
+
+ # Add a blue door at a random position in the right wall
+ pos = self._rand_int(1, self.size - 1)
+ self.blue_door = Door("blue")
+ self.grid.set(self.size//2 + self.size - 1, pos, self.blue_door)
+
+ # Generate the mission string
+ self.mission = "open the red door then the blue door"
+
+ def step(self, action):
+ red_door_opened_before = self.red_door.is_open
+ blue_door_opened_before = self.blue_door.is_open
+
+ obs, reward, done, info = MiniGridEnv.step(self, action)
+
+ red_door_opened_after = self.red_door.is_open
+ blue_door_opened_after = self.blue_door.is_open
+
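+        # Compare the door states before and after the step: the episode only
+        # pays off when the blue door gets opened while the red door is already
+        # open; opening them in the wrong order terminates with zero reward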
+ if blue_door_opened_after:
+ if red_door_opened_before:
+ reward = self._reward()
+ done = True
+ else:
+ reward = 0
+ done = True
+
+ elif red_door_opened_after:
+ if blue_door_opened_before:
+ reward = 0
+ done = True
+
+ return obs, reward, done, info
+
+class RedBlueDoorEnv6x6(RedBlueDoorEnv):
+ def __init__(self):
+ super().__init__(size=6)
+
+register(
+ id='MiniGrid-RedBlueDoors-6x6-v0',
+ entry_point='gym_minigrid.envs:RedBlueDoorEnv6x6'
+)
+
+register(
+ id='MiniGrid-RedBlueDoors-8x8-v0',
+ entry_point='gym_minigrid.envs:RedBlueDoorEnv'
+)
diff --git a/gym-minigrid/gym_minigrid/envs/unlock.py b/gym-minigrid/gym_minigrid/envs/unlock.py
new file mode 100644
index 0000000000000000000000000000000000000000..9f93a1c1108923288a5e5495585ef906f72645b1
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/unlock.py
@@ -0,0 +1,46 @@
+from gym_minigrid.minigrid import Ball
+from gym_minigrid.roomgrid import RoomGrid
+from gym_minigrid.register import register
+
+class Unlock(RoomGrid):
+ """
+ Unlock a door
+ """
+
+ def __init__(self, seed=None):
+ room_size = 6
+ super().__init__(
+ num_rows=1,
+ num_cols=2,
+ room_size=room_size,
+ max_steps=8*room_size**2,
+ seed=seed
+ )
+
+ def _gen_grid(self, width, height):
+ super()._gen_grid(width, height)
+
+ # Make sure the two rooms are directly connected by a locked door
+ door, _ = self.add_door(0, 0, 0, locked=True)
+ # Add a key to unlock the door
+ self.add_object(0, 0, 'key', door.color)
+
+ self.place_agent(0, 0)
+
+ self.door = door
+ self.mission = "open the door"
+
+ def step(self, action):
+ obs, reward, done, info = super().step(action)
+
+ if action == self.actions.toggle:
+ if self.door.is_open:
+ reward = self._reward()
+ done = True
+
+ return obs, reward, done, info
+
+register(
+ id='MiniGrid-Unlock-v0',
+ entry_point='gym_minigrid.envs:Unlock'
+)
diff --git a/gym-minigrid/gym_minigrid/envs/unlockpickup.py b/gym-minigrid/gym_minigrid/envs/unlockpickup.py
new file mode 100644
index 0000000000000000000000000000000000000000..38f54a36ad2f279fe2fe3adc5c57cedbc0479151
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/envs/unlockpickup.py
@@ -0,0 +1,48 @@
+from gym_minigrid.minigrid import Ball
+from gym_minigrid.roomgrid import RoomGrid
+from gym_minigrid.register import register
+
+class UnlockPickup(RoomGrid):
+ """
+ Unlock a door, then pick up a box in another room
+ """
+
+ def __init__(self, seed=None):
+ room_size = 6
+ super().__init__(
+ num_rows=1,
+ num_cols=2,
+ room_size=room_size,
+ max_steps=8*room_size**2,
+ seed=seed
+ )
+
+ def _gen_grid(self, width, height):
+ super()._gen_grid(width, height)
+
+ # Add a box to the room on the right
+ obj, _ = self.add_object(1, 0, kind="box")
+ # Make sure the two rooms are directly connected by a locked door
+ door, _ = self.add_door(0, 0, 0, locked=True)
+ # Add a key to unlock the door
+ self.add_object(0, 0, 'key', door.color)
+
+ self.place_agent(0, 0)
+
+ self.obj = obj
+ self.mission = "pick up the %s %s" % (obj.color, obj.type)
+
+ def step(self, action):
+ obs, reward, done, info = super().step(action)
+
+ if action == self.actions.pickup:
+ if self.carrying and self.carrying == self.obj:
+ reward = self._reward()
+ done = True
+
+ return obs, reward, done, info
+
+register(
+ id='MiniGrid-UnlockPickup-v0',
+ entry_point='gym_minigrid.envs:UnlockPickup'
+)
diff --git a/gym-minigrid/gym_minigrid/minigrid.py b/gym-minigrid/gym_minigrid/minigrid.py
new file mode 100644
index 0000000000000000000000000000000000000000..e652e60d8c21539fc4b9060735a4d0b8b548119a
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/minigrid.py
@@ -0,0 +1,3492 @@
+import math
+import random
+import hashlib
+import gym
+from enum import IntEnum
+import numpy as np
+from gym import error, spaces, utils
+from gym.utils import seeding
+from .rendering import *
+from abc import ABC, abstractmethod
+import warnings
+import astar
+
+import traceback
+import warnings
+from functools import wraps
+
+SocialAINPCActionsDict = {
+ "go_forward": 0,
+ "rotate_left": 1,
+ "rotate_right": 2,
+ "toggle_action": 3,
+ "point_stop_point": 4,
+ "point_E": 5,
+ "point_S": 6,
+ "point_W": 7,
+ "point_N": 8,
+ "stop_point": 9,
+ "no_op": 10
+}
+
+point_dir_encoding = {
+ "point_E": 0,
+ "point_S": 1,
+ "point_W": 2,
+ "point_N": 3,
+}
+
+def get_traceback():
+ tb = traceback.extract_stack()
+ return "".join(traceback.format_list(tb)[:-1])
+
+
+# Size in pixels of a tile in the full-scale human view
+TILE_PIXELS = 32
+
+# Map of color names to RGB values
+COLORS = {
+ 'red' : np.array([255, 0, 0]),
+ 'green' : np.array([0, 255, 0]),
+ 'blue' : np.array([0, 0, 255]),
+ 'purple': np.array([112, 39, 195]),
+ 'yellow': np.array([255, 255, 0]),
+ 'grey' : np.array([100, 100, 100]),
+ 'brown': np.array([82, 36, 19])
+}
+
+COLOR_NAMES = sorted(list(COLORS.keys()))
+
+# Used to map colors to integers
+COLOR_TO_IDX = {
+ 'red' : 0,
+ 'green' : 1,
+ 'blue' : 2,
+ 'purple': 3,
+ 'yellow': 4,
+ 'grey' : 5,
+ 'brown' : 6,
+}
+
+IDX_TO_COLOR = dict(zip(COLOR_TO_IDX.values(), COLOR_TO_IDX.keys()))
+
+# Map of object type to integers
+OBJECT_TO_IDX = {
+ 'unseen' : 0,
+ 'empty' : 1,
+ 'wall' : 2,
+ 'floor' : 3,
+ 'door' : 4,
+ 'key' : 5,
+ 'ball' : 6,
+ 'box' : 7,
+ 'goal' : 8,
+ 'lava' : 9,
+ 'agent' : 10,
+ 'npc' : 11,
+ 'switch' : 12,
+ 'lockablebox' : 13,
+ 'apple' : 14,
+ 'applegenerator' : 15,
+ 'generatorplatform': 16,
+ 'marble' : 17,
+ 'marbletee' : 18,
+ 'fence' : 19,
+ 'remotedoor' : 20,
+ 'lever' : 21,
+}
+
+IDX_TO_OBJECT = dict(zip(OBJECT_TO_IDX.values(), OBJECT_TO_IDX.keys()))
+
+# Map of state names to integers
+STATE_TO_IDX = {
+ 'open' : 0,
+ 'closed': 1,
+ 'locked': 2,
+}
+
+# Map of agent direction indices to vectors
+DIR_TO_VEC = [
+ # Pointing right (positive X)
+ np.array((1, 0)),
+ # Down (positive Y)
+ np.array((0, 1)),
+ # Pointing left (negative X)
+ np.array((-1, 0)),
+ # Up (negative Y)
+ np.array((0, -1)),
+]
+
+class WorldObj:
+ """
+ Base class for grid world objects
+ """
+
+ def __init__(self, type, color):
+ assert type in OBJECT_TO_IDX, type
+ assert color in COLOR_TO_IDX, color
+ self.type = type
+ self.color = color
+ self.contains = None
+
+ # Initial position of the object
+ self.init_pos = None
+
+ # Current position of the object
+ self.cur_pos = np.array((0, 0))
+
+ def can_overlap(self):
+ """Can the agent overlap with this?"""
+ return False
+
+ def can_push(self):
+ """Can the agent push the object?"""
+ return False
+
+ def can_pickup(self):
+ """Can the agent pick this up?"""
+ return False
+
+ def can_contain(self):
+ """Can this contain another object?"""
+ return False
+
+ def see_behind(self):
+ """Can the agent see behind this object?"""
+ return True
+
+ def toggle(self, env, pos):
+ """Method to trigger/toggle an action this object performs"""
+ return False
+
+ def encode(self, nb_dims=3, absolute_coordinates=False):
+ """Encode the a description of this object as a nb_dims-tuple of integers"""
+ if absolute_coordinates:
+ core = (OBJECT_TO_IDX[self.type], *self.cur_pos, COLOR_TO_IDX[self.color])
+ else:
+ core = (OBJECT_TO_IDX[self.type], COLOR_TO_IDX[self.color])
+
+ return core + (0,) * (nb_dims - len(core))
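+        # For example (illustrative only): a red key at cur_pos (3, 5) with
+        # nb_dims=4 encodes as (5, 0, 0, 0) without coordinates, and as
+        # (5, 3, 5, 0), i.e. (type_idx, x, y, color_idx), with
+        # absolute_coordinates=True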
+
+ def cache(self, *args, **kwargs):
+ """Used for cached rendering."""
+ return self.encode(*args, **kwargs)
+
+ @staticmethod
+ def decode(type_idx, color_idx, state):
+ """Create an object from a 3-tuple state description"""
+
+ obj_type = IDX_TO_OBJECT[type_idx]
+ color = IDX_TO_COLOR[color_idx]
+
+ if obj_type == 'empty' or obj_type == 'unseen':
+ return None
+
+ if obj_type == 'wall':
+ v = Wall(color)
+ elif obj_type == 'floor':
+ v = Floor(color)
+ elif obj_type == 'ball':
+ v = Ball(color)
+ elif obj_type == 'marble':
+ v = Marble(color)
+ elif obj_type == 'apple':
+ eaten = state == 1
+ v = Apple(color, eaten=eaten)
+        elif obj_type == 'applegenerator':
+ is_pressed = state == 2
+ v = AppleGenerator(color, is_pressed=is_pressed)
+ elif obj_type == 'key':
+ v = Key(color)
+ elif obj_type == 'box':
+ v = Box(color)
+ elif obj_type == 'lockablebox':
+ is_locked = state == 2
+ v = LockableBox(color, is_locked=is_locked)
+ elif obj_type == 'door':
+ # State, 0: open, 1: closed, 2: locked
+ is_open = state == 0
+ is_locked = state == 2
+ v = Door(color, is_open, is_locked)
+ elif obj_type == 'remotedoor':
+ # State, 0: open, 1: closed
+ is_open = state == 0
+ v = RemoteDoor(color, is_open)
+ elif obj_type == 'goal':
+ v = Goal()
+ elif obj_type == 'lava':
+ v = Lava()
+ elif obj_type == 'fence':
+ v = Fence()
+ elif obj_type == 'switch':
+ v = Switch(color, is_on=state)
+ elif obj_type == 'lever':
+ v = Lever(color, is_on=state)
+ elif obj_type == 'npc':
+ warnings.warn("NPC's internal state cannot be decoded. Only the icon is shown.")
+ v = NPC(color)
+ v.npc_type=0
+ else:
+ assert False, "unknown object type in decode '%s'" % obj_type
+
+ return v
+
+ def render(self, r):
+ """Draw this object with the given renderer"""
+ raise NotImplementedError
+
+
+class BlockableWorldObj(WorldObj):
+
+ def __init__(self, type, color, block_set):
+ super(BlockableWorldObj, self).__init__(type, color)
+ self.block_set = block_set
+ self.blocked = False
+
+
+ def can_push(self):
+ return True
+
+ def push(self, *args, **kwargs):
+ return self.block_block_set()
+
+ def toggle(self, *args, **kwargs):
+ return self.block_block_set()
+
+ def block_block_set(self):
+ """A function that blocks the block set"""
+ if not self.blocked:
+ if self.block_set is not None:
+ # cprint("BLOCKED!", "red")
+ for e in self.block_set:
+ e.block()
+
+ return True
+
+ else:
+ return False
+
+ def block(self):
+ self.blocked = True
+
+
+class Goal(WorldObj):
+ def __init__(self):
+ super().__init__('goal', 'green')
+
+ def can_overlap(self):
+ return True
+
+ def render(self, img):
+ fill_coords(img, point_in_rect(0, 1, 0, 1), COLORS[self.color])
+
+
+class Floor(WorldObj):
+ """
+ Colored floor tile the agent can walk over
+ """
+
+ def __init__(self, color='blue'):
+ super().__init__('floor', color)
+
+ def can_overlap(self):
+ return True
+
+ def render(self, img):
+ # Give the floor a pale color
+ color = COLORS[self.color] / 2
+ fill_coords(img, point_in_rect(0.031, 1, 0.031, 1), color)
+
+
+class Lava(WorldObj):
+ def __init__(self):
+ super().__init__('lava', 'red')
+
+ def can_overlap(self):
+ return True
+
+ def render(self, img):
+ c = (255, 128, 0)
+
+ # Background color
+ fill_coords(img, point_in_rect(0, 1, 0, 1), c)
+
+ # Little waves
+ for i in range(3):
+ ylo = 0.3 + 0.2 * i
+ yhi = 0.4 + 0.2 * i
+ fill_coords(img, point_in_line(0.1, ylo, 0.3, yhi, r=0.03), (0,0,0))
+ fill_coords(img, point_in_line(0.3, yhi, 0.5, ylo, r=0.03), (0,0,0))
+ fill_coords(img, point_in_line(0.5, ylo, 0.7, yhi, r=0.03), (0,0,0))
+ fill_coords(img, point_in_line(0.7, yhi, 0.9, ylo, r=0.03), (0,0,0))
+
+
+class Fence(WorldObj):
+ """
+ Same as Lava but can't overlap.
+ """
+ def __init__(self):
+ super().__init__('fence', 'grey')
+
+ def can_overlap(self):
+ return False
+
+ def render(self, img):
+ c = COLORS[self.color]
+
+ # ugly fence
+ fill_coords(img, point_in_rect(
+ 0.1, 0.9, 0.5, 0.9
+ # (0.15, 0.9),
+ # (0.10, 0.5),
+ # (0.95, 0.9),
+ # (0.90, 0.5),
+ # (0.10, 0.9),
+ # (0.10, 0.5),
+ # (0.95, 0.9),
+ # (0.95, 0.5),
+ ), c)
+ fill_coords(img, point_in_quadrangle(
+ # (0.15, 0.9),
+ # (0.10, 0.5),
+ # (0.95, 0.9),
+ # (0.90, 0.5),
+ (0.10, 0.9),
+ (0.10, 0.5),
+ (0.95, 0.9),
+ (0.95, 0.5),
+ ), c)
+ return
+
+        # pretty fence (unreachable: the simple drawing above returns first; kept for reference)
+ fill_coords(img, point_in_quadrangle(
+ (0.15, 0.3125),
+ (0.15, 0.4125),
+ (0.85, 0.4875),
+ (0.85, 0.5875),
+ ), c)
+
+ # h2
+ fill_coords(img, point_in_quadrangle(
+ (0.15, 0.6125),
+ (0.15, 0.7125),
+ (0.85, 0.7875),
+ (0.85, 0.8875),
+ ), c)
+
+ # vm
+ fill_coords(img, point_in_quadrangle(
+ (0.45, 0.2875),
+ (0.45, 0.8875),
+ (0.55, 0.3125),
+ (0.55, 0.9125),
+ ), c)
+ fill_coords(img, point_in_triangle(
+ (0.45, 0.2875),
+ (0.55, 0.3125),
+ (0.5, 0.25),
+ ), c)
+
+ # vl
+ fill_coords(img, point_in_quadrangle(
+ (0.25, 0.2375),
+ (0.25, 0.8375),
+ (0.35, 0.2625),
+ (0.35, 0.8625),
+ ), c)
+ # vl
+ fill_coords(img, point_in_triangle(
+ (0.25, 0.2375),
+ (0.35, 0.2625),
+ (0.3, 0.2),
+ ), c)
+
+
+ # vr
+ fill_coords(img, point_in_quadrangle(
+ (0.65, 0.3375),
+ (0.65, 0.9375),
+ (0.75, 0.3625),
+ (0.75, 0.9625),
+ ), c)
+ fill_coords(img, point_in_triangle(
+ (0.65, 0.3375),
+ (0.75, 0.3625),
+ (0.7, 0.3),
+ ), c)
+
+
+class Wall(WorldObj):
+ def __init__(self, color='grey'):
+ super().__init__('wall', color)
+
+ def see_behind(self):
+ return False
+
+ def render(self, img):
+ fill_coords(img, point_in_rect(0, 1, 0, 1), COLORS[self.color])
+
+
+class Lever(BlockableWorldObj):
+ def __init__(self, color, object=None, is_on=False, block_set=None, active_steps=None):
+ super().__init__('lever', color, block_set)
+ self.is_on = is_on
+ self.object = object
+
+ self.active_steps = active_steps
+ self.countdown = None # countdown timer
+
+ self.was_activated = False
+
+ if self.block_set is not None:
+ if self.is_on:
+ raise ValueError("If using a block set, a Switch must be initialized as OFF")
+
+ def can_overlap(self):
+ """The agent can only walk over this cell when the door is open"""
+ return False
+
+ def see_behind(self):
+ return True
+
+ def step(self):
+ if self.countdown is not None:
+ self.countdown = self.countdown - 1
+
+ if self.countdown <= 0:
+                # countdown expired: toggle the lever back and deactivate the timer
+ self.toggle()
+ self.countdown = None
+
+ def toggle(self, env=None, pos=None):
+
+ if self.blocked:
+ return False
+
+ if self.was_activated and not self.is_on:
+ # cannot be activated twice
+ return False
+
+ self.is_on = not self.is_on
+
+ if self.is_on:
+ if self.active_steps is not None:
+ # activate countdown to shutdown
+ self.countdown = self.active_steps
+ self.was_activated = True
+
+ if self.object is not None:
+ # open object
+ self.object.open_close()
+
+ if self.is_on:
+ self.block_block_set()
+
+ return True
+
+ def block(self):
+ self.blocked = True
+
+ def encode(self, nb_dims=3, absolute_coordinates=False):
+ """Encode the a description of this object as a 3-tuple of integers"""
+
+ # State, 0: off, 1: on
+ state = 1 if self.is_on else 0
+
+ count = self.countdown if self.countdown is not None else 255
+
+ if absolute_coordinates:
+ v = (OBJECT_TO_IDX[self.type], *self.cur_pos, COLOR_TO_IDX[self.color], state, count)
+ else:
+ v = (OBJECT_TO_IDX[self.type], COLOR_TO_IDX[self.color], state, count)
+
+ v += (0,) * (nb_dims-len(v))
+
+ return v
+
+ def render(self, img):
+ c = COLORS[self.color]
+ black = (0, 0, 0)
+
+ # off_angle = -math.pi/3
+ off_angle = -math.pi/2
+ on_angle = -math.pi/8
+
+
+ rotating_lever_shapes = []
+ rotating_lever_shapes.append((point_in_rect(0.5, 0.9, 0.77, 0.83), c))
+ rotating_lever_shapes.append((point_in_circle(0.9, 0.8, 0.1), c))
+
+ rotating_lever_shapes.append((point_in_circle(0.5, 0.8, 0.08), c))
+
+ if self.is_on:
+ if self.countdown is None:
+ angle = on_angle
+ else:
+ angle = (self.countdown/self.active_steps) * (on_angle-off_angle) + off_angle
+
+ else:
+ angle = off_angle
+
+ fill_coords(img, point_in_circle_clip(0.5, 0.8, 0.12, theta_end=-math.pi), c)
+ # fill_coords(img, point_in_circle_clip(0.5, 0.8, 0.08, theta_end=-math.pi), black)
+
+ rotating_lever_shapes = [(rotate_fn(v, cx=0.5, cy=0.8, theta=angle), col) for v, col in rotating_lever_shapes]
+
+ for v, col in rotating_lever_shapes:
+ fill_coords(img, v, col)
+
+ fill_coords(img, point_in_rect(0.2, 0.8, 0.78, 0.82), c)
+ fill_coords(img, point_in_circle(0.5, 0.8, 0.03), (0, 0, 0))
+
+
+class RemoteDoor(BlockableWorldObj):
+ """Door that are unlocked by a lever"""
+ def __init__(self, color, is_open=False, block_set=None):
+ super().__init__('remotedoor', color, block_set)
+ self.is_open = is_open
+
+ def can_overlap(self):
+ """The agent can only walk over this cell when the door is open"""
+ return self.is_open
+
+ def see_behind(self):
+ return self.is_open
+
+ # def toggle(self, env, pos=None):
+ # return False
+
+ def open_close(self):
+ # If the player has the right key to open the door
+
+ self.is_open = not self.is_open
+ return True
+
+ def encode(self, nb_dims=3, absolute_coordinates=False):
+ """Encode the a description of this object as a 3-tuple of integers"""
+
+ # State, 0: open, 1: closed
+ state = 0 if self.is_open else 1
+
+ if absolute_coordinates:
+ v = (OBJECT_TO_IDX[self.type], *self.cur_pos, COLOR_TO_IDX[self.color], state)
+ else:
+ v = (OBJECT_TO_IDX[self.type], COLOR_TO_IDX[self.color], state)
+
+ v += (0,) * (nb_dims-len(v))
+ return v
+
+ def block(self):
+ self.blocked = True
+
+ def render(self, img):
+ c = COLORS[self.color]
+
+ if self.is_open:
+ fill_coords(img, point_in_rect(0.88, 1.00, 0.00, 1.00), c)
+ fill_coords(img, point_in_rect(0.92, 0.96, 0.04, 0.96), (0,0,0))
+ else:
+
+ fill_coords(img, point_in_rect(0.00, 1.00, 0.00, 1.00), c)
+ fill_coords(img, point_in_rect(0.04, 0.96, 0.04, 0.96), (0,0,0))
+ fill_coords(img, point_in_rect(0.08, 0.92, 0.08, 0.92), c)
+ fill_coords(img, point_in_rect(0.12, 0.88, 0.12, 0.88), (0,0,0))
+
+ # wifi symbol
+ fill_coords(img, point_in_circle_clip(cx=0.5, cy=0.8, r=0.5, theta_start=-np.pi/3, theta_end=-2*np.pi/3), c)
+ fill_coords(img, point_in_circle_clip(cx=0.5, cy=0.8, r=0.45, theta_start=-np.pi/3, theta_end=-2*np.pi/3), (0,0,0))
+ fill_coords(img, point_in_circle_clip(cx=0.5, cy=0.8, r=0.4, theta_start=-np.pi/3, theta_end=-2*np.pi/3), c)
+ fill_coords(img, point_in_circle_clip(cx=0.5, cy=0.8, r=0.35, theta_start=-np.pi/3, theta_end=-2*np.pi/3), (0,0,0))
+ fill_coords(img, point_in_circle_clip(cx=0.5, cy=0.8, r=0.3, theta_start=-np.pi/3, theta_end=-2*np.pi/3), c)
+ fill_coords(img, point_in_circle_clip(cx=0.5, cy=0.8, r=0.25, theta_start=-np.pi/3, theta_end=-2*np.pi/3), (0,0,0))
+ fill_coords(img, point_in_circle_clip(cx=0.5, cy=0.8, r=0.2, theta_start=-np.pi/3, theta_end=-2*np.pi/3), c)
+ fill_coords(img, point_in_circle_clip(cx=0.5, cy=0.8, r=0.15, theta_start=-np.pi/3, theta_end=-2*np.pi/3), (0,0,0))
+ fill_coords(img, point_in_circle_clip(cx=0.5, cy=0.8, r=0.1, theta_start=-np.pi/3, theta_end=-2*np.pi/3), c)
+
+ return
+
+
+class Door(BlockableWorldObj):
+ def __init__(self, color, is_open=False, is_locked=False, block_set=None):
+ super().__init__('door', color, block_set)
+ self.is_open = is_open
+ self.is_locked = is_locked
+
+ def can_overlap(self):
+ """The agent can only walk over this cell when the door is open"""
+ return self.is_open
+
+ def see_behind(self):
+ return self.is_open
+
+ def toggle(self, env, pos=None):
+
+ if self.blocked:
+ return False
+
+ # If the player has the right key to open the door
+ if self.is_locked:
+ if isinstance(env.carrying, Key) and env.carrying.color == self.color:
+ self.is_locked = False
+ self.is_open = True
+ ret = True
+ else:
+ ret = False
+
+ else:
+ self.is_open = not self.is_open
+ ret = True
+
+ self.block_block_set()
+
+ return ret
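+
+ # Usage sketch (illustrative, assuming an env with the `carrying` attribute
+ # used above): toggling a locked door only succeeds when the agent carries a
+ # Key of the same color; the door is then unlocked and opened in one toggle.
+ # An unlocked door simply flips between open and closed.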
+
+
+ def encode(self, nb_dims=3, absolute_coordinates=False):
+ """Encode the a description of this object as a 3-tuple of integers"""
+
+ # State, 0: open, 1: closed, 2: locked
+ if self.is_open:
+ state = 0
+ elif self.is_locked:
+ state = 2
+ elif not self.is_open:
+ state = 1
+
+ if absolute_coordinates:
+ v = (OBJECT_TO_IDX[self.type], *self.cur_pos, COLOR_TO_IDX[self.color], state)
+ else:
+ v = (OBJECT_TO_IDX[self.type], COLOR_TO_IDX[self.color], state)
+
+ v += (0,) * (nb_dims-len(v))
+ return v
+
+ def render(self, img):
+ c = COLORS[self.color]
+
+ if self.is_open:
+ fill_coords(img, point_in_rect(0.88, 1.00, 0.00, 1.00), c)
+ fill_coords(img, point_in_rect(0.92, 0.96, 0.04, 0.96), (0,0,0))
+ return
+
+ # Door frame and door
+ if self.is_locked:
+ fill_coords(img, point_in_rect(0.00, 1.00, 0.00, 1.00), c)
+ fill_coords(img, point_in_rect(0.06, 0.94, 0.06, 0.94), 0.45 * np.array(c))
+
+ # Draw key slot
+ fill_coords(img, point_in_rect(0.52, 0.75, 0.50, 0.56), c)
+ else:
+ fill_coords(img, point_in_rect(0.00, 1.00, 0.00, 1.00), c)
+ fill_coords(img, point_in_rect(0.04, 0.96, 0.04, 0.96), (0,0,0))
+ fill_coords(img, point_in_rect(0.08, 0.92, 0.08, 0.92), c)
+ fill_coords(img, point_in_rect(0.12, 0.88, 0.12, 0.88), (0,0,0))
+
+ # Draw door handle
+ fill_coords(img, point_in_circle(cx=0.75, cy=0.50, r=0.08), c)
+
+
+class Switch(BlockableWorldObj):
+ def __init__(self, color, lockable_object=None, is_on=False, no_turn_off=True, no_light=True, locker_switch=False, block_set=None):
+ super().__init__('switch', color, block_set)
+ self.is_on = is_on
+ self.lockable_object = lockable_object
+ self.no_turn_off = no_turn_off
+ self.no_light = no_light
+ self.locker_switch = locker_switch
+
+ if self.block_set is not None:
+
+ if self.is_on:
+ raise ValueError("If using a block set, a Switch must be initialized as OFF")
+
+ if not self.no_turn_off:
+ raise ValueError("If using a block set, a Switch must be initialized can't be turned off")
+
+
+ def can_overlap(self):
+ """The agent can only walk over this cell when the door is open"""
+ return False
+
+ def see_behind(self):
+ return True
+
+ def toggle(self, env, pos=None):
+
+ if self.blocked:
+ return False
+
+ if self.is_on:
+ if self.no_turn_off:
+ return False
+
+ self.is_on = not self.is_on
+ if self.lockable_object is not None:
+ if self.locker_switch:
+ # locker/unlocker switch
+ self.lockable_object.is_locked = not self.lockable_object.is_locked
+ else:
+ # opener switch
+ self.lockable_object.toggle(env, pos)
+
+
+ if self.is_on:
+ self.block_block_set()
+
+ if self.no_turn_off:
+ # assert that obj is toggled only once
+ assert not hasattr(self, "was_toggled")
+ self.was_toggled = True
+
+ return True
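+
+ # Note (added for clarity): a "locker" switch flips the locked state of its
+ # lockable_object, while an "opener" switch forwards the toggle to it. With
+ # no_turn_off=True the switch is effectively one-shot, which the was_toggled
+ # assertion above enforces.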
+
+ def block(self):
+ self.blocked = True
+
+ def encode(self, nb_dims=3, absolute_coordinates=False):
+ """Encode the a description of this object as a 3-tuple of integers"""
+
+ # State, 0: off, 1: on
+ state = 1 if self.is_on else 0
+
+ if self.no_light:
+ state = 0
+
+ if absolute_coordinates:
+ v = (OBJECT_TO_IDX[self.type], *self.cur_pos, COLOR_TO_IDX[self.color], state)
+ else:
+ v = (OBJECT_TO_IDX[self.type], COLOR_TO_IDX[self.color], state)
+
+ v += (0,) * (nb_dims-len(v))
+
+ return v
+
+
+ def render(self, img):
+ c = COLORS[self.color]
+
+ # Door frame and door
+ if self.is_on and not self.no_light:
+ fill_coords(img, point_in_rect(0.00, 1.00, 0.00, 1.00), c)
+ fill_coords(img, point_in_rect(0.04, 0.96, 0.04, 0.96), (0,0,0))
+ fill_coords(img, point_in_rect(0.08, 0.92, 0.08, 0.92), c)
+ fill_coords(img, point_in_rect(0.12, 0.88, 0.12, 0.88), 0.45 * np.array(c))
+
+ else:
+
+ fill_coords(img, point_in_rect(0.00, 1.00, 0.00, 1.00), c)
+ fill_coords(img, point_in_rect(0.04, 0.96, 0.04, 0.96), (0,0,0))
+ fill_coords(img, point_in_rect(0.08, 0.92, 0.08, 0.92), c)
+ fill_coords(img, point_in_rect(0.12, 0.88, 0.12, 0.88), (0,0,0))
+
+
+class Key(WorldObj):
+ def __init__(self, color='blue'):
+ super(Key, self).__init__('key', color)
+
+ def can_pickup(self):
+ return True
+
+ def render(self, img):
+ c = COLORS[self.color]
+
+ # Vertical quad
+ fill_coords(img, point_in_rect(0.50, 0.63, 0.31, 0.88), c)
+
+ # Teeth
+ fill_coords(img, point_in_rect(0.38, 0.50, 0.59, 0.66), c)
+ fill_coords(img, point_in_rect(0.38, 0.50, 0.81, 0.88), c)
+
+ # Ring
+ fill_coords(img, point_in_circle(cx=0.56, cy=0.28, r=0.190), c)
+ fill_coords(img, point_in_circle(cx=0.56, cy=0.28, r=0.064), (0,0,0))
+
+
+class MarbleTee(WorldObj):
+ def __init__(self, color="red"):
+ super(MarbleTee, self).__init__('marbletee', color)
+
+ def can_pickup(self):
+ return False
+
+ def can_push(self):
+ return False
+
+ def render(self, img):
+ c = COLORS[self.color]
+
+ fill_coords(img, point_in_quadrangle(
+ (0.2, 0.5),
+ (0.8, 0.5),
+ (0.4, 0.6),
+ (0.6, 0.6),
+ ), c)
+
+ fill_coords(img, point_in_triangle(
+ (0.4, 0.6),
+ (0.6, 0.6),
+ (0.5, 0.9),
+ ), c)
+
+
+class Marble(WorldObj):
+ def __init__(self, color='blue', env=None):
+ super(Marble, self).__init__('marble', color)
+ self.is_tagged = False
+ self.moving_dir = None
+ self.env = env
+ self.was_pushed = False
+ self.tee = MarbleTee(color)
+ self.tee_uncovered = False
+
+ def can_pickup(self):
+ return True
+
+ def step(self):
+ if self.moving_dir is not None:
+ prev_pos = self.cur_pos
+ self.go_forward()
+
+ if any(prev_pos != self.cur_pos) and not self.tee_uncovered:
+ assert self.was_pushed
+
+ # if Marble was moved for the first time, uncover the Tee
+ # self.env.grid.set(*prev_pos, self.tee)
+ self.env.put_obj_np(self.tee, prev_pos)
+ self.tee_uncovered = True
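+
+ # Note (added for clarity): step() is expected to be called by the
+ # environment once per env step; a pushed Marble keeps rolling in
+ # moving_dir until go_forward() fails, and the first time it leaves its
+ # start cell it leaves its MarbleTee behind at the previous position.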
+
+ @property
+ def is_moving(self):
+ return self.moving_dir is not None
+
+ @property
+ def dir_vec(self):
+ """
+ Get the direction vector for the agent, pointing in the direction
+ of forward movement.
+ """
+ if self.moving_dir is not None:
+ return DIR_TO_VEC[self.moving_dir]
+ else:
+ return np.array((0, 0))
+
+ @property
+ def front_pos(self):
+ """
+ Get the position of the cell that is right in front of the agent
+ """
+ return self.cur_pos + self.dir_vec
+
+ def go_forward(self):
+ # Get the position in front of the agent
+ fwd_pos = self.front_pos
+
+ # Get the contents of the cell in front of the agent
+ fwd_cell = self.env.grid.get(*fwd_pos)
+ # Don't move if you are going to collide
+ if fwd_pos.tolist() != self.env.agent_pos.tolist() and (fwd_cell is None or fwd_cell.can_overlap()):
+ self.env.grid.set(*self.cur_pos, None)
+ self.env.grid.set(*fwd_pos, self)
+ self.cur_pos = fwd_pos
+ return True
+
+ # push object if pushable
+ if fwd_pos.tolist() != self.env.agent_pos.tolist() and (fwd_cell is not None and fwd_cell.can_push()):
+ fwd_cell.push(push_dir=self.moving_dir, pusher=self)
+ self.moving_dir = None
+ return True
+
+ else:
+ self.moving_dir = None
+ return False
+
+ def can_push(self):
+ return True
+
+ def push(self, push_dir, pusher=None):
+ if type(push_dir) is not int:
+ raise ValueError("Direction must be of type int and is of type {}".format(type(push_dir)))
+
+ self.moving_dir = push_dir
+ if self.moving_dir is not None:
+ self.was_pushed = True
+
+ def render(self, img):
+ color = COLORS[self.color]
+ if self.is_tagged:
+ color = color / 2
+
+ fill_coords(img, point_in_circle(0.5, 0.5, 0.20), color)
+ fill_coords(img, point_in_circle(0.55, 0.45, 0.07), (0, 0, 0))
+
+ def tag(self,):
+ self.is_tagged = True
+
+ def encode(self, nb_dims=3, absolute_coordinates=False):
+ """Encode the a description of this object as a nb_dims-tuple of integers"""
+ if absolute_coordinates:
+ core = (OBJECT_TO_IDX[self.type], *self.cur_pos, COLOR_TO_IDX[self.color])
+ else:
+ core = (OBJECT_TO_IDX[self.type], COLOR_TO_IDX[self.color])
+
+ return core + (1 if self.is_tagged else 0,) * (nb_dims - len(core))
+
+
+class Ball(WorldObj):
+ def __init__(self, color='blue'):
+ super(Ball, self).__init__('ball', color)
+ self.is_tagged = False
+
+ def can_pickup(self):
+ return True
+
+ def render(self, img):
+ color = COLORS[self.color]
+ if self.is_tagged:
+ color = color / 2
+ fill_coords(img, point_in_circle(0.5, 0.5, 0.31), color)
+
+ def tag(self,):
+ self.is_tagged = True
+
+ def encode(self, nb_dims=3, absolute_coordinates=False):
+ """Encode the a description of this object as a nb_dims-tuple of integers"""
+ if absolute_coordinates:
+ core = (OBJECT_TO_IDX[self.type], *self.cur_pos, COLOR_TO_IDX[self.color])
+ else:
+ core = (OBJECT_TO_IDX[self.type], COLOR_TO_IDX[self.color])
+
+ return core + (1 if self.is_tagged else 0,) * (nb_dims - len(core))
+
+
+class Apple(WorldObj):
+ def __init__(self, color='red', eaten=False):
+ super(Apple, self).__init__('apple', color)
+ self.is_tagged = False
+ self.eaten = eaten
+ assert self.color != "yellow"
+
+ def revert(self, color='red', eaten=False):
+ self.color = color
+ self.is_tagged = False
+ self.eaten = eaten
+ assert self.color != "yellow"
+
+ def can_pickup(self):
+ return False
+
+ def render(self, img):
+ color = COLORS[self.color]
+
+ if self.is_tagged:
+ color = color / 2
+
+ fill_coords(img, point_in_circle(0.5, 0.5, 0.31), color)
+ fill_coords(img, point_in_rect(0.1, 0.9, 0.1, 0.55), (0, 0, 0))
+ fill_coords(img, point_in_circle(0.35, 0.5, 0.15), color)
+ fill_coords(img, point_in_circle(0.65, 0.5, 0.15), color)
+
+ fill_coords(img, point_in_rect(0.48, 0.52, 0.2, 0.45), COLORS["brown"])
+
+ # quadrangle
+ fill_coords(img, point_in_quadrangle(
+ (0.52, 0.25),
+ (0.65, 0.1),
+ (0.75, 0.3),
+ (0.90, 0.15),
+ ), COLORS["green"])
+
+
+ if self.eaten:
+ assert self.color == "yellow"
+ fill_coords(img, point_in_circle(0.74, 0.6, 0.23), (0,0,0))
+ fill_coords(img, point_in_circle(0.26, 0.6, 0.23), (0,0,0))
+
+ def toggle(self, env, pos):
+ if not self.eaten:
+ self.eaten = True
+
+ assert self.color != "yellow"
+ self.color = "yellow"
+
+ return True
+
+ else:
+ assert self.color == "yellow"
+ return False
+
+ def tag(self,):
+ self.is_tagged = True
+
+ def encode(self, nb_dims=3, absolute_coordinates=False):
+ """Encode the a description of this object as a nb_dims-tuple of integers"""
+
+ # eaten <=> yellow
+ assert self.eaten == (self.color == "yellow")
+ if absolute_coordinates:
+ core = (OBJECT_TO_IDX[self.type], *self.cur_pos, COLOR_TO_IDX[self.color])
+ else:
+ core = (OBJECT_TO_IDX[self.type], COLOR_TO_IDX[self.color])
+
+ return core + (1 if self.is_tagged else 0,) * (nb_dims - len(core))
+
+
+class GeneratorPlatform(WorldObj):
+ def __init__(self, color="red"):
+ super(GeneratorPlatform, self).__init__('generatorplatform', color)
+
+ def can_pickup(self):
+ return False
+
+ def can_push(self):
+ return False
+
+ def render(self, img):
+ c = COLORS[self.color]
+
+ fill_coords(img, point_in_circle(0.5, 0.5, 0.2), c)
+ fill_coords(img, point_in_circle(0.5, 0.5, 0.18), (0,0,0))
+
+ fill_coords(img, point_in_circle(0.5, 0.5, 0.16), c)
+ fill_coords(img, point_in_circle(0.5, 0.5, 0.14), (0,0,0))
+
+
+class AppleGenerator(BlockableWorldObj):
+ def __init__(self, color="red", is_pressed=False, block_set=None, on_push=None, marble_activation=False):
+ super(AppleGenerator, self).__init__('applegenerator', color, block_set)
+ self.is_pressed = is_pressed
+ self.on_push = on_push
+ self.marble_activation = marble_activation
+
+ def can_pickup(self):
+ return False
+
+ def block(self):
+ self.blocked = True
+
+ def can_push(self):
+ return True
+
+ def push(self, push_dir=None, pusher=None):
+
+ if self.marble_activation:
+ # check that it is marble that pushed the generator
+ if type(pusher) != Marble:
+ return self.block_block_set()
+
+ if not self.blocked:
+ self.is_pressed = True
+
+ if self.on_push:
+ self.on_push()
+
+ return self.block_block_set()
+
+ else:
+ return False
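+
+ # Note (added for clarity): when marble_activation is set, only a Marble
+ # pusher can press the generator; any other pusher only triggers
+ # block_block_set() and the generator stays unpressed.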
+
+ def render(self, img):
+ c = COLORS[self.color]
+
+ if not self.marble_activation:
+ # Outline
+ fill_coords(img, point_in_rect(0.15, 0.85, 0.15, 0.85), c)
+ # fill_coords(img, point_in_rect(0.17, 0.83, 0.17, 0.83), (0, 0, 0))
+ fill_coords(img, point_in_rect(0.16, 0.84, 0.16, 0.84), (0, 0, 0))
+
+ # Outline
+ fill_coords(img, point_in_rect(0.22, 0.78, 0.22, 0.78), c)
+ fill_coords(img, point_in_rect(0.24, 0.76, 0.24, 0.76), (0, 0, 0))
+ else:
+ # Outline
+ fill_coords(img, point_in_circle(0.5, 0.5, 0.37), c)
+ fill_coords(img, point_in_circle(0.5, 0.5, 0.35), (0, 0, 0))
+
+ # Outline
+ fill_coords(img, point_in_circle(0.5, 0.5, 0.32), c)
+ fill_coords(img, point_in_circle(0.5, 0.5, 0.30), (0, 0, 0))
+
+ # Apple inside
+ fill_coords(img, point_in_circle(0.5, 0.5, 0.2), COLORS["red"])
+ # fill_coords(img, point_in_rect(0.18, 0.82, 0.18, 0.55), (0, 0, 0))
+ fill_coords(img, point_in_rect(0.30, 0.65, 0.30, 0.55), (0, 0, 0))
+ fill_coords(img, point_in_circle(0.42, 0.5, 0.12), COLORS["red"])
+ fill_coords(img, point_in_circle(0.58, 0.5, 0.12), COLORS["red"])
+
+ # stem
+ fill_coords(img, point_in_rect(0.49, 0.50, 0.25, 0.45), COLORS["brown"])
+
+ # leaf
+ fill_coords(img, point_in_quadrangle(
+ (0.52, 0.32),
+ (0.60, 0.21),
+ (0.70, 0.34),
+ (0.80, 0.23),
+ ), COLORS["green"])
+
+ def encode(self, nb_dims=3, absolute_coordinates=False):
+ """Encode the a description of this object as a 3-tuple of integers"""
+
+ type = 2 if self.marble_activation else 1
+
+ if absolute_coordinates:
+ v = (OBJECT_TO_IDX[self.type], *self.cur_pos, COLOR_TO_IDX[self.color], type)
+ else:
+ v = (OBJECT_TO_IDX[self.type], COLOR_TO_IDX[self.color], type)
+
+ v += (0,) * (nb_dims - len(v))
+
+ return v
+
+
+class Box(WorldObj):
+ def __init__(self, color="red", contains=None):
+ super(Box, self).__init__('box', color)
+ self.contains = contains
+
+ def can_pickup(self):
+ return True
+
+ def render(self, img):
+ c = COLORS[self.color]
+
+ # Outline
+ fill_coords(img, point_in_rect(0.12, 0.88, 0.12, 0.88), c)
+ fill_coords(img, point_in_rect(0.18, 0.82, 0.18, 0.82), (0,0,0))
+
+ # Horizontal slit
+ fill_coords(img, point_in_rect(0.16, 0.84, 0.47, 0.53), c)
+
+ def toggle(self, env, pos):
+ # Replace the box by its contents
+ env.grid.set(*pos, self.contains)
+ return True
+
+
+class LockableBox(BlockableWorldObj):
+ def __init__(self, color="red", is_locked=False, contains=None, block_set=None):
+ super(LockableBox, self).__init__('lockablebox', color, block_set)
+ self.contains = contains
+ self.is_locked = is_locked
+
+ self.is_open = False
+
+ def can_pickup(self):
+ return True
+
+ def encode(self, nb_dims=3, absolute_coordinates=False):
+ """Encode the a description of this object as a 3-tuple of integers"""
+
+ # State, 0: open, 1: closed, 2: locked
+ # 2 and 1 to be consistent with doors
+ if self.is_locked:
+ state = 2
+ else:
+ state = 1
+
+ if absolute_coordinates:
+ v = (OBJECT_TO_IDX[self.type], *self.cur_pos, COLOR_TO_IDX[self.color], state)
+ else:
+ v = (OBJECT_TO_IDX[self.type], COLOR_TO_IDX[self.color], state)
+
+ v += (0,) * (nb_dims - len(v))
+
+ return v
+
+ def render(self, img):
+ c = COLORS[self.color]
+
+ # Outline
+ fill_coords(img, point_in_rect(0.12, 0.88, 0.12, 0.88), c)
+
+ if self.is_locked:
+ fill_coords(img, point_in_rect(0.18, 0.82, 0.18, 0.82), 0.45 * np.array(c))
+ else:
+ fill_coords(img, point_in_rect(0.18, 0.82, 0.18, 0.82), (0, 0, 0))
+
+ # Horizontal slit
+ fill_coords(img, point_in_rect(0.16, 0.84, 0.47, 0.53), c)
+
+ def toggle(self, env, pos):
+ if self.blocked:
+ return False
+
+ if self.is_locked:
+ if isinstance(env.carrying, Key) and env.carrying.color == self.color:
+ self.is_locked = False
+ self.is_open = True
+ return True
+ return False
+
+ self.is_open = True
+ # Replace the box by its contents
+ env.grid.set(*pos, self.contains)
+
+ self.block_block_set()
+
+ # assert that obj is toggled only once
+ assert not hasattr(self, "was_toggled")
+ self.was_toggled = True
+
+ return True
+
+ def block(self):
+ self.blocked = True
+
+
+class NPC(ABC, WorldObj):
+ def __init__(self, color, view_size=7):
+ super().__init__('npc', color)
+ self.point_dir = 255 # initially no point
+ self.introduction_statement = "Help please "
+ self.list_of_possible_utterances = NPC.get_list_of_possible_utterances()
+ self.view_size = view_size
+ self.carrying = False
+ self.prim_actions_dict = SocialAINPCActionsDict
+
+ self.reset_last_action()
+
+ @staticmethod
+ def get_list_of_possible_utterances():
+ return ["no_op"]
+
+ def _npc_action(func):
+ """
+ Decorator that logs the last action
+ """
+ @wraps(func)
+ def func_wrapper(self, *args, **kwargs):
+
+ if self.env.add_npc_last_prim_action:
+ self.last_action = func.__name__
+
+ return func(self, *args, **kwargs)
+
+ return func_wrapper
+
+ def reset_last_action(self):
+ self.last_action = "no_op"
+
+ def step(self):
+ self.reset_last_action()
+
+ if self.env.hidden_npc:
+ info = {
+ "prim_action": "no_op",
+ "utterance": "no_op",
+ "was_introduced_to": self.was_introduced_to
+ }
+ return None, info
+
+ else:
+ return None, None
+
+ def handle_introduction(self, utterance):
+ reply, action = None, None
+ # introduction and language
+ if self.env.parameters.get("Pragmatic_frame_complexity", "No") == "No":
+
+ # only once
+ if not self.was_introduced_to:
+ self.was_introduced_to = True
+
+ elif self.env.parameters["Pragmatic_frame_complexity"] == "Eye_contact":
+
+ # only first time at eye contact
+ if self.is_eye_contact() and not self.was_introduced_to:
+ self.was_introduced_to = True
+
+ # if not self.was_introduced_to:
+ # rotate to see the agent
+ # action = self.look_at_action(self.env.agent_pos)
+
+ elif self.env.parameters["Pragmatic_frame_complexity"] == "Ask":
+
+ # every time asked
+ if utterance == self.introduction_statement:
+ self.was_introduced_to = True
+
+ elif self.env.parameters["Pragmatic_frame_complexity"] == "Ask_Eye_contact":
+
+ # only first time at eye contact with the introduction statement
+ if (self.is_eye_contact() and utterance == self.introduction_statement) and not self.was_introduced_to:
+ self.was_introduced_to = True
+
+ # if not self.was_introduced_to:
+ # # rotate to see the agent
+ # action = self.look_at_action(self.env.agent_pos)
+
+ else:
+ raise NotImplementedError()
+
+ return reply, action
+
+ def look_at_action(self, target_pos):
+ # rotate to see the target_pos
+ wanted_dir = self.compute_wanted_dir(target_pos)
+ action = self.compute_turn_action(wanted_dir)
+ return action
+
+ @_npc_action
+ def rotate_left(self):
+ self.npc_dir -= 1
+ if self.npc_dir < 0:
+ self.npc_dir += 4
+ return True
+
+ @_npc_action
+ def rotate_right(self):
+ self.npc_dir = (self.npc_dir + 1) % 4
+ return True
+
+ def path_to_toggle_pos(self, goal_pos):
+ """
+ Return the next action from the path to toggling an object at goal_pos
+ """
+ if type(goal_pos) != np.ndarray or goal_pos.shape != (2,):
+ raise ValueError(f"goal_pos must be a np.ndarray of shape (2,) and is {goal_pos}")
+
+ assert type(self.front_pos) == np.ndarray and self.front_pos.shape == (2,)
+
+ if all(self.front_pos == goal_pos):
+ # in front of door
+ return self.toggle_action
+
+ else:
+ return self.path_to_pos(goal_pos)
+
+ def turn_to_see_agent(self):
+ wanted_dir = self.compute_wanted_dir(self.env.agent_pos)
+ action = self.compute_turn_action(wanted_dir)
+ return action
+
+ def relative_coords(self, x, y):
+ """
+ Check if a grid position belongs to the npc's field of view, and return the corresponding coordinates
+ """
+
+ vx, vy = self.get_view_coords(x, y)
+
+ if vx < 0 or vy < 0 or vx >= self.view_size or vy >= self.view_size:
+ return None
+
+ return vx, vy
+
+
+ def get_view_coords(self, i, j):
+ """
+ Translate and rotate absolute grid coordinates (i, j) into the
+ npc's partially observable view (sub-grid). Note that the resulting
+ coordinates may be negative or outside of the npc's view size.
+ """
+
+ ax, ay = self.cur_pos
+ dx, dy = self.dir_vec
+ rx, ry = self.right_vec
+
+ # Compute the absolute coordinates of the top-left view corner
+ sz = self.view_size
+ hs = self.view_size // 2
+ tx = ax + (dx * (sz-1)) - (rx * hs)
+ ty = ay + (dy * (sz-1)) - (ry * hs)
+
+ lx = i - tx
+ ly = j - ty
+
+ # Project the coordinates of the object relative to the top-left
+ # corner onto the agent's own coordinate system
+ vx = (rx*lx + ry*ly)
+ vy = -(dx*lx + dy*ly)
+
+ return vx, vy
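+
+ # Worked example (illustrative): an NPC at (3, 3) facing east
+ # (npc_dir=0, dir_vec=(1, 0), right_vec=(0, 1)) with view_size=7 has its
+ # top-left view corner at (tx, ty) = (9, 0); its own cell then maps to
+ # (vx, vy) = (3, 6), i.e. the bottom-centre of its 7x7 view.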
+
+ def is_pointing(self):
+ return self.point_dir != 255
+
+ def path_to_pos(self, goal_pos):
+ """
+ Return the next action from the path to goal_pos
+ """
+
+ if type(goal_pos) != np.ndarray or goal_pos.shape != (2,):
+ raise ValueError(f"goal_pos must be a np.ndarray of shape (2,) and is {goal_pos}")
+
+ def neighbors(n):
+
+ n_nd = np.array(n)
+
+ adjacent_positions = [
+ n_nd + np.array([ 0, 1]),
+ n_nd + np.array([ 0,-1]),
+ n_nd + np.array([ 1, 0]),
+ n_nd + np.array([-1, 0]),
+ ]
+ adjacent_cells = map(lambda pos: self.env.grid.get(*pos), adjacent_positions)
+
+ # keep the positions that don't have anything on or can_overlap
+ neighbors = [
+ tuple(pos) for pos, cell in
+ zip(adjacent_positions, adjacent_cells) if (
+ all(pos == goal_pos)
+ or cell is None
+ or cell.can_overlap()
+ ) and not all(pos == self.env.agent_pos)
+ ]
+
+ for n1 in neighbors:
+ yield n1
+
+ def distance(n1, n2):
+ return 1
+
+ def cost(n, goal):
+ # manhattan
+ return int(np.abs(np.array(n) - np.array(goal)).sum())
+
+ # def is_goal_reached(n, goal):
+ # return all(n == goal)
+
+ path = astar.find_path(
+ # tuples because np.ndarray is not hashable
+ tuple(self.cur_pos),
+ tuple(goal_pos),
+ neighbors_fnct=neighbors,
+ heuristic_cost_estimate_fnct=cost,
+ distance_between_fnct=distance,
+ # is_goal_reached_fnct=is_goal_reached
+ )
+
+ if path is None:
+ # no possible path
+ return None
+
+ path = list(path)
+ assert all(path[0] == self.cur_pos)
+ next_step = path[1]
+ wanted_dir = self.compute_wanted_dir(next_step)
+
+ if self.npc_dir == wanted_dir:
+ return self.go_forward
+
+ else:
+ return self.compute_turn_action(wanted_dir)
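+
+ # Note (added for clarity): path_to_pos() plans with A* over walkable
+ # neighbours (Manhattan heuristic, unit step cost) and returns the NPC's
+ # *next* primitive action towards goal_pos: go_forward if already facing
+ # the next waypoint, otherwise the appropriate rotation, or None if no
+ # path exists.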
+
+ def gen_obs_grid(self):
+ """
+ Generate the sub-grid observed by the npc.
+ This method also outputs a visibility mask telling us which grid
+ cells the npc can actually see.
+ """
+ view_size = self.view_size
+
+ topX, topY, botX, botY = self.env.get_view_exts(dir=self.npc_dir, view_size=view_size, pos=self.cur_pos)
+
+ grid = self.env.grid.slice(topX, topY, view_size, view_size)
+
+ for i in range(self.npc_dir + 1):
+ grid = grid.rotate_left()
+
+ # Process occluders and visibility
+ # Note that this incurs some performance cost
+ if not self.env.see_through_walls:
+ vis_mask = grid.process_vis(agent_pos=(view_size // 2, view_size - 1))
+ else:
+ vis_mask = np.ones(shape=(grid.width, grid.height), dtype=np.bool)
+
+ # Make it so the npc sees what it's carrying
+ # We do this by placing the carried object at the agent's position
+ # in the agent's partially observable view
+ npc_pos = grid.width // 2, grid.height - 1
+ if self.carrying:
+ grid.set(*npc_pos, self.carrying)
+ else:
+ grid.set(*npc_pos, None)
+
+ return grid, vis_mask
+
+ def is_near_agent(self):
+ ax, ay = self.env.agent_pos
+ wx, wy = self.cur_pos
+ if (ax == wx and abs(ay - wy) == 1) or (ay == wy and abs(ax - wx) == 1):
+ return True
+ return False
+
+ def is_eye_contact(self):
+ """
+ Returns true if the agent and the NPC are looking at each other
+ """
+ if self.cur_pos[1] == self.env.agent_pos[1]:
+ # same y
+ if self.cur_pos[0] > self.env.agent_pos[0]:
+ return self.npc_dir == 2 and self.env.agent_dir == 0
+ else:
+ return self.npc_dir == 0 and self.env.agent_dir == 2
+
+ if self.cur_pos[0] == self.env.agent_pos[0]:
+ # same x
+ if self.cur_pos[1] > self.env.agent_pos[1]:
+ return self.npc_dir == 3 and self.env.agent_dir == 1
+ else:
+ return self.npc_dir == 1 and self.env.agent_dir == 3
+
+ return False
+
+ def compute_wanted_dir(self, target_pos):
+ """
+ Computes the direction in which the NPC should look to see target pos
+ """
+
+ distance_vec = target_pos - self.cur_pos
+ angle = np.degrees(np.arctan2(*distance_vec))
+ if angle < 0:
+ angle += 360
+
+ if angle < 45:
+ wanted_dir = 1 # S
+ elif angle < 135:
+ wanted_dir = 0 # E
+ elif angle < 225:
+ wanted_dir = 3 # N
+ elif angle < 315:
+ wanted_dir = 2 # W
+ elif angle < 360:
+ wanted_dir = 1 # S
+
+ return wanted_dir
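+
+ # Worked example (illustrative): with grid y growing downwards,
+ # arctan2(dx, dy) gives 0 deg for a target directly south, 90 deg for east,
+ # 180 deg for north and 270 deg for west, which the thresholds above map to
+ # wanted_dir 1 (S), 0 (E), 3 (N) and 2 (W) respectively.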
+
+ def compute_wanted_point_dir(self, target_pos):
+ point_dir = self.compute_wanted_dir(target_pos)
+
+ return point_dir
+
+ # dir = 0 # E
+ # dir = 1 # S
+ # dir = 2 # W
+ # dir = 3 # N
+ # dir = 255 # no point
+
+ @_npc_action
+ def stop_point(self):
+ self.point_dir = 255
+ return True
+
+ @_npc_action
+ def point_E(self):
+ self.point_dir = point_dir_encoding["point_E"]
+ return True
+
+ @_npc_action
+ def point_S(self):
+ self.point_dir = point_dir_encoding["point_S"]
+ return True
+
+ @_npc_action
+ def point_W(self):
+ self.point_dir = point_dir_encoding["point_W"]
+ return True
+
+ @_npc_action
+ def point_N(self):
+ self.point_dir = point_dir_encoding["point_N"]
+ return True
+
+ def compute_wanted_point_action(self, target_pos):
+ point_dir = self.compute_wanted_dir(target_pos)
+
+ if point_dir == point_dir_encoding["point_E"]:
+ return self.point_E
+ elif point_dir == point_dir_encoding["point_S"]:
+ return self.point_S
+ elif point_dir == point_dir_encoding["point_W"]:
+ return self.point_W
+ elif point_dir == point_dir_encoding["point_N"]:
+ return self.point_N
+ else:
+ raise ValueError("Unknown direction {}".format(point_dir))
+
+
+ def compute_turn_action(self, wanted_dir):
+ """
+ Return the action that turns the NPC towards wanted_dir
+ """
+ if self.npc_dir == wanted_dir:
+ # return lambda *args: None
+ return None
+ if (wanted_dir - self.npc_dir) == 1 or (wanted_dir == 0 and self.npc_dir == 3):
+ return self.rotate_right
+ if (wanted_dir - self.npc_dir) == - 1 or (wanted_dir == 3 and self.npc_dir == 0):
+ return self.rotate_left
+ else:
+ return self.env._rand_elem([self.rotate_left, self.rotate_right])
+
+ @_npc_action
+ def go_forward(self):
+ # Get the position in front of the agent
+ fwd_pos = self.front_pos
+
+ # Get the contents of the cell in front of the agent
+ fwd_cell = self.env.grid.get(*fwd_pos)
+ # Don't move if you are going to collide
+ if fwd_pos.tolist() != self.env.agent_pos.tolist() and (fwd_cell is None or fwd_cell.can_overlap()):
+ self.env.grid.set(*self.cur_pos, None)
+ self.env.grid.set(*fwd_pos, self)
+ self.cur_pos = fwd_pos
+ return True
+
+ # push object if pushable
+ if fwd_pos.tolist() != self.env.agent_pos.tolist() and (fwd_cell is not None and fwd_cell.can_push()):
+ fwd_cell.push(push_dir=self.npc_dir, pusher=self)
+ return True
+
+ else:
+ return False
+
+ @_npc_action
+ def toggle_action(self):
+ fwd_pos = self.front_pos
+ fwd_cell = self.env.grid.get(*fwd_pos)
+ if fwd_cell:
+ return fwd_cell.toggle(self.env, fwd_pos)
+
+ return False
+
+ @property
+ def dir_vec(self):
+ """
+ Get the direction vector for the agent, pointing in the direction
+ of forward movement.
+ """
+
+ assert self.npc_dir >= 0 and self.npc_dir < 4
+ return DIR_TO_VEC[self.npc_dir]
+
+ @property
+ def right_vec(self):
+ """
+ Get the vector pointing to the right of the agent.
+ """
+
+ dx, dy = self.dir_vec
+ return np.array((-dy, dx))
+
+
+ @property
+ def front_pos(self):
+ """
+ Get the position of the cell that is right in front of the npc
+ """
+
+ return self.cur_pos + self.dir_vec
+
+ @property
+ def back_pos(self):
+ """
+ Get the position of the cell that is right behind the npc
+ """
+
+ return self.cur_pos - self.dir_vec
+
+ @property
+ def right_pos(self):
+ """
+ Get the position of the cell to the right of the npc
+ """
+
+ return self.cur_pos + self.right_vec
+
+ @property
+ def left_pos(self):
+ """
+ Get the position of the cell that is right in front of the agent
+ """
+
+ return self.cur_pos - self.right_vec
+
+ def draw_npc_face(self, c):
+ assert self.npc_type == 0
+
+ assert all(COLORS[self.color] == c)
+
+ shapes = []
+ shapes_colors = []
+
+ # Draw eyes
+ shapes.append(point_in_circle(cx=0.70, cy=0.50, r=0.10))
+ shapes_colors.append(c)
+ shapes.append(point_in_circle(cx=0.30, cy=0.50, r=0.10))
+ shapes_colors.append(c)
+
+ # Draw mouth
+ shapes.append(point_in_rect(0.20, 0.80, 0.72, 0.81))
+ shapes_colors.append(c)
+
+ # Draw bottom hat
+ shapes.append(point_in_triangle((0.15, 0.28),
+ (0.85, 0.28),
+ (0.50, 0.05)))
+ shapes_colors.append(c)
+ # Draw top hat
+ shapes.append(point_in_rect(0.30, 0.70, 0.05, 0.28))
+ shapes_colors.append(c)
+ return shapes, shapes_colors
+
+ def render(self, img):
+
+
+ c = COLORS[self.color]
+
+ npc_shapes = []
+ npc_shapes_colors = []
+
+
+ npc_face_shapes, npc_face_shapes_colors = self.draw_npc_face(c=c)
+
+ npc_shapes.extend(npc_face_shapes)
+ npc_shapes_colors.extend(npc_face_shapes_colors)
+
+ if hasattr(self, "npc_dir"):
+ # Pre-rotation to ensure npc_dir = 1 means NPC looks downwards
+ npc_shapes = [rotate_fn(v, cx=0.5, cy=0.5, theta=-1*(math.pi / 2)) for v in npc_shapes]
+ # Rotate npc based on its direction
+ npc_shapes = [rotate_fn(v, cx=0.5, cy=0.5, theta=(math.pi/2) * self.npc_dir) for v in npc_shapes]
+
+ if hasattr(self, "point_dir"):
+ if self.is_pointing():
+ # default points east
+ finger = point_in_triangle((0.85, 0.1),
+ (0.85, 0.3),
+ (0.99, 0.2))
+ finger = rotate_fn(finger, cx=0.5, cy=0.5, theta=(math.pi/2) * self.point_dir)
+
+ npc_shapes.append(finger)
+ npc_shapes_colors.append(c)
+
+ if self.last_action == self.toggle_action.__name__:
+ # T symbol
+ t_symbol = [point_in_rect(0.8, 0.84, 0.02, 0.18), point_in_rect(0.8, 0.95, 0.08, 0.12)]
+ t_symbol = [rotate_fn(v, cx=0.5, cy=0.5, theta=(math.pi/2) * self.npc_dir) for v in t_symbol]
+ npc_shapes.extend(t_symbol)
+ npc_shapes_colors.extend([c, c])
+
+ elif self.last_action == self.go_forward.__name__:
+ # symbol for go_forward (omitted for speed)
+ pass
+
+ if self.env.hidden_npc:
+ # crossed eye symbol
+ dx, dy = 0.15, -0.2
+
+ # draw eye
+ npc_shapes.append(point_in_circle(cx=0.70+dx, cy=0.48+dy, r=0.11))
+ npc_shapes_colors.append((128,128,128))
+
+ npc_shapes.append(point_in_circle(cx=0.30+dx, cy=0.52+dy, r=0.11))
+ npc_shapes_colors.append((128,128,128))
+
+ npc_shapes.append(point_in_circle(0.5+dx, 0.5+dy, 0.25))
+ npc_shapes_colors.append((128, 128, 128))
+
+ npc_shapes.append(point_in_circle(0.5+dx, 0.5+dy, 0.20))
+ npc_shapes_colors.append((0, 0, 0))
+
+ npc_shapes.append(point_in_circle(0.5+dx, 0.5+dy, 0.1))
+ npc_shapes_colors.append((128, 128, 128))
+
+ # cross it
+ npc_shapes.append(point_in_line(0.2+dx, 0.7+dy, 0.8+dx, 0.3+dy, 0.04))
+ npc_shapes_colors.append((128, 128, 128))
+
+
+ # Draw shapes
+ for v, c in zip(npc_shapes, npc_shapes_colors):
+ fill_coords(img, v, c)
+
+ def cache(self, *args, **kwargs):
+ """Used for cached rendering."""
+ # adding npc_dir and point_dir because, when egocentric coordinates are used,
+ # they can result in the same encoding but require new rendering
+ return self.encode(*args, **kwargs) + (self.npc_dir, self.point_dir,)
+
+ def can_overlap(self):
+ # If the NPC is hidden, agent can overlap on it
+ return self.env.hidden_npc
+
+ def encode(self, nb_dims=3, absolute_coordinates=False):
+ if not hasattr(self, "npc_type"):
+ raise ValueError("An NPC class must implement the npc_type (int)")
+
+ if not hasattr(self, "env"):
+ raise ValueError("An NPC class must have the env")
+
+ assert nb_dims == 6+2*bool(absolute_coordinates)
+
+ if self.env.hidden_npc:
+ return (1,) + (0,) * (nb_dims-1)
+
+ assert self.env.egocentric_observation == (not absolute_coordinates)
+
+ if absolute_coordinates:
+ v = (OBJECT_TO_IDX[self.type], *self.cur_pos, COLOR_TO_IDX[self.color], self.npc_type)
+ else:
+ v = (OBJECT_TO_IDX[self.type], COLOR_TO_IDX[self.color], self.npc_type)
+
+ if self.env.add_npc_direction:
+ assert hasattr(self, "npc_dir"), "4D but there is no npc dir in NPC state"
+ assert self.npc_dir >= 0
+
+ if self.env.egocentric_observation:
+ assert self.env.agent_dir >= 0
+
+ # 0 - eye contact; 2 - gaze in same direction; 1 - to left; 3 - to right
+ npc_dir_enc = (self.npc_dir - self.env.agent_dir + 2) % 4
+
+ v += (npc_dir_enc,)
+ else:
+ v += (self.npc_dir,)
+
+ if self.env.add_npc_point_direction:
+ assert hasattr(self, "point_dir"), "5D but there is no npc point dir in NPC state"
+
+ if self.point_dir == 255:
+ # no pointing
+ v += (self.point_dir,)
+
+ elif 0 <= self.point_dir <= 3:
+ # pointing
+
+ if self.env.egocentric_observation:
+ assert self.env.agent_dir >= 0
+
+ # 0 - pointing at agent; 2 - point in direction of agent gaze; 1 - to left; 3 - to right
+ point_enc = (self.point_dir - self.env.agent_dir + 2) % 4
+ v += (point_enc,)
+
+ else:
+ v += (self.point_dir,)
+
+ else:
+ raise ValueError(f"Undefined point direction {self.point_dir}")
+
+ if self.env.add_npc_last_prim_action:
+ assert hasattr(self, "last_action"), "6D but there is no last action in NPC state"
+
+ if self.last_action in ["point_E", "point_S", "point_W", "point_N"] and self.env.egocentric_observation:
+
+ # get the direction of the last point
+ last_action_point_dir = point_dir_encoding[self.last_action]
+
+ # convert to relative dir
+ # 0 - pointing at agent; 2 - point in direction of agent gaze; 1 - to left; 3 - to right
+ last_action_relative_point_dir = (last_action_point_dir - self.env.agent_dir + 2) % 4
+
+ # the point_X action ids are in range [point_E, ... , point_N]
+ # id of point_E is the starting one, we use the same range [E, S, W ,N ] -> [at, left, same, right]
+ last_action_id = self.prim_actions_dict["point_E"] + last_action_relative_point_dir
+
+ else:
+ last_action_id = self.prim_actions_dict[self.last_action]
+
+ v += (last_action_id,)
+
+ assert self.point_dir >= 0
+ assert len(v) == nb_dims
+
+ return v
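+
+ # Encoding example (illustrative): with egocentric observations, an agent
+ # facing east (agent_dir=0) and an NPC facing west (npc_dir=2) yield
+ # npc_dir_enc = (2 - 0 + 2) % 4 = 0, i.e. "eye contact"; an NPC facing the
+ # same way as the agent gives 2, as described in the comments above.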
+
+
+class Grid:
+ """
+ Represent a grid and operations on it
+ """
+
+ # Static cache of pre-renderer tiles
+ tile_cache = {}
+
+ def __init__(self, width, height, nb_obj_dims):
+ assert width >= 3
+ assert height >= 3
+
+ self.width = width
+ self.height = height
+ self.nb_obj_dims = nb_obj_dims
+
+ self.grid = [None] * width * height
+
+ def __contains__(self, key):
+ if isinstance(key, WorldObj):
+ for e in self.grid:
+ if e is key:
+ return True
+ elif isinstance(key, tuple):
+ for e in self.grid:
+ if e is None:
+ continue
+ if (e.color, e.type) == key:
+ return True
+ if key[0] is None and key[1] == e.type:
+ return True
+ return False
+
+ def __eq__(self, other):
+ grid1 = self.encode()
+ grid2 = other.encode()
+ return np.array_equal(grid2, grid1)
+
+ def __ne__(self, other):
+ return not self == other
+
+ def copy(self):
+ from copy import deepcopy
+ return deepcopy(self)
+
+ def set(self, i, j, v):
+ assert i >= 0 and i < self.width
+ assert j >= 0 and j < self.height
+ self.grid[j * self.width + i] = v
+
+ def get(self, i, j):
+ assert i >= 0 and i < self.width
+ assert j >= 0 and j < self.height
+ return self.grid[j * self.width + i]
+
+ def horz_wall(self, x, y, length=None, obj_type=Wall):
+ if length is None:
+ length = self.width - x
+ for i in range(0, length):
+ o = obj_type()
+ o.cur_pos = np.array((x+i, y))
+ self.set(x + i, y, o)
+
+ def vert_wall(self, x, y, length=None, obj_type=Wall):
+ if length is None:
+ length = self.height - y
+ for j in range(0, length):
+ o = obj_type()
+ o.cur_pos = np.array((x, y+j))
+ self.set(x, y + j, o)
+
+ def wall_rect(self, x, y, w, h):
+ self.horz_wall(x, y, w)
+ self.horz_wall(x, y+h-1, w)
+ self.vert_wall(x, y, h)
+ self.vert_wall(x+w-1, y, h)
+
+ def rotate_left(self):
+ """
+ Rotate the grid to the left (counter-clockwise)
+ """
+
+ grid = Grid(self.height, self.width, self.nb_obj_dims)
+
+ for i in range(self.width):
+ for j in range(self.height):
+ v = self.get(i, j)
+ grid.set(j, grid.height - 1 - i, v)
+
+ return grid
+
+ def slice(self, topX, topY, width, height):
+ """
+ Get a subset of the grid
+ """
+
+ grid = Grid(width, height, self.nb_obj_dims)
+
+ for j in range(0, height):
+ for i in range(0, width):
+ x = topX + i
+ y = topY + j
+
+ if x >= 0 and x < self.width and \
+ y >= 0 and y < self.height:
+ v = self.get(x, y)
+ else:
+ v = Wall()
+
+ grid.set(i, j, v)
+
+ return grid
+
+ @classmethod
+ def render_tile(
+ cls,
+ obj,
+ agent_dir=None,
+ highlight=False,
+ tile_size=TILE_PIXELS,
+ subdivs=3,
+ nb_obj_dims=3,
+ mask_unobserved=False
+ ):
+ """
+ Render a tile and cache the result
+ """
+ # Hash map lookup key for the cache
+ key = (agent_dir, highlight, tile_size, mask_unobserved)
+ # key = obj.encode(nb_dims=nb_obj_dims) + key if obj else key
+ key = obj.cache(nb_dims=nb_obj_dims) + key if obj else key
+
+ if key in cls.tile_cache:
+ return cls.tile_cache[key]
+
+ img = np.zeros(shape=(tile_size * subdivs, tile_size * subdivs, 3), dtype=np.uint8) # 3D for rendering
+
+ # Draw the grid lines (top and left edges)
+ fill_coords(img, point_in_rect(0, 0.031, 0, 1), (100, 100, 100))
+ fill_coords(img, point_in_rect(0, 1, 0, 0.031), (100, 100, 100))
+
+ if obj != None:
+ obj.render(img)
+
+ # Overlay the agent on top
+ if agent_dir is not None:
+ tri_fn = point_in_triangle(
+ (0.12, 0.19),
+ (0.87, 0.50),
+ (0.12, 0.81),
+ )
+
+ # Rotate the agent based on its direction
+ tri_fn = rotate_fn(tri_fn, cx=0.5, cy=0.5, theta=0.5*math.pi*agent_dir)
+ fill_coords(img, tri_fn, (255, 0, 0))
+
+ # Highlight the cell if needed
+ if highlight:
+ highlight_img(img)
+ elif mask_unobserved:
+ # mask unobserved and not highlighted -> unobserved by the agent
+ img *= 0
+
+ # Downsample the image to perform supersampling/anti-aliasing
+ img = downsample(img, subdivs)
+
+ # Cache the rendered tile
+ cls.tile_cache[key] = img
+
+ return img
+
+ def render(
+ self,
+ tile_size,
+ agent_pos=None,
+ agent_dir=None,
+ highlight_mask=None,
+ mask_unobserved=False,
+ ):
+ """
+ Render this grid at a given scale
+ :param r: target renderer object
+ :param tile_size: tile size in pixels
+ """
+
+ if highlight_mask is None:
+ highlight_mask = np.zeros(shape=(self.width, self.height), dtype=np.bool)
+
+ # Compute the total grid size
+ width_px = self.width * tile_size
+ height_px = self.height * tile_size
+ img = np.zeros(shape=(height_px, width_px, 3), dtype=np.uint8)
+
+ # Render the grid
+ for j in range(0, self.height):
+ for i in range(0, self.width):
+ cell = self.get(i, j)
+
+ agent_here = np.array_equal(agent_pos, (i, j))
+ tile_img = Grid.render_tile(
+ cell,
+ agent_dir=agent_dir if agent_here else None,
+ highlight=highlight_mask[i, j],
+ tile_size=tile_size,
+ nb_obj_dims=self.nb_obj_dims,
+ mask_unobserved=mask_unobserved
+ )
+
+ ymin = j * tile_size
+ ymax = (j+1) * tile_size
+ xmin = i * tile_size
+ xmax = (i+1) * tile_size
+ img[ymin:ymax, xmin:xmax, :] = tile_img
+
+ return img
+
+ def encode(self, vis_mask=None, absolute_coordinates=False):
+ """
+ Produce a compact numpy encoding of the grid
+ """
+
+ if vis_mask is None:
+ vis_mask = np.ones((self.width, self.height), dtype=bool)
+
+ array = np.zeros((self.width, self.height, self.nb_obj_dims), dtype='uint8')
+
+ for i in range(self.width):
+ for j in range(self.height):
+ if vis_mask[i, j]:
+ v = self.get(i, j)
+
+ if v is None:
+ array[i, j, 0] = OBJECT_TO_IDX['empty']
+ array[i, j, 1:] = 0
+
+ else:
+ array[i, j, :] = v.encode(nb_dims=self.nb_obj_dims, absolute_coordinates=absolute_coordinates)
+
+ return array
+
+ @staticmethod
+ def decode(array):
+ """
+ Decode an array grid encoding back into a grid
+ """
+
+ width, height, channels = array.shape
+ assert channels in [5, 4, 3]
+
+ vis_mask = np.ones(shape=(width, height), dtype=np.bool)
+
+ grid = Grid(width, height, nb_obj_dims=channels)
+ for i in range(width):
+ for j in range(height):
+ if len(array[i, j]) == 3:
+ type_idx, color_idx, state = array[i, j]
+ else:
+ type_idx, color_idx, state, orient = array[i, j]
+
+ v = WorldObj.decode(type_idx, color_idx, state)
+ grid.set(i, j, v)
+ vis_mask[i, j] = (type_idx != OBJECT_TO_IDX['unseen'])
+
+ return grid, vis_mask
+
+ def process_vis(grid, agent_pos):
+ # mask = np.zeros(shape=(grid.width, grid.height), dtype=np.bool)
+ #
+ # mask[agent_pos[0], agent_pos[1]] = True
+ #
+ # for j in reversed(range(0, grid.height)):
+ # for i in range(0, grid.width-1):
+ # if not mask[i, j]:
+ # continue
+ #
+ # cell = grid.get(i, j)
+ # if cell and not cell.see_behind():
+ # continue
+ #
+ # mask[i+1, j] = True
+ # if j > 0:
+ # mask[i+1, j-1] = True
+ # mask[i, j-1] = True
+ #
+ # for i in reversed(range(1, grid.width)):
+ # if not mask[i, j]:
+ # continue
+ #
+ # cell = grid.get(i, j)
+ # if cell and not cell.see_behind():
+ # continue
+ #
+ # mask[i-1, j] = True
+ # if j > 0:
+ # mask[i-1, j-1] = True
+ # mask[i, j-1] = True
+
+ mask = np.ones(shape=(grid.width, grid.height), dtype=np.bool)
+ # handle frontal occlusions
+
+ # 45 deg
+ for j in reversed(range(0, agent_pos[1]+1)):
+ dy = abs(agent_pos[1] - j)
+
+ # in front of the agent
+ i = agent_pos[0]
+ cell = grid.get(i, j)
+ if (cell and not cell.see_behind()) or mask[i, j] == False:
+
+ if j < agent_pos[1] and j > 0:
+ # 45 deg
+ mask[i-1,j-1] = False
+ mask[i,j-1] = False
+ mask[i+1,j-1] = False
+
+ # agent -> to the left
+ for i in reversed(range(1, agent_pos[0])):
+ dx = abs(agent_pos[0] - i)
+ cell = grid.get(i, j)
+
+ if (cell and not cell.see_behind()) or mask[i,j] == False:
+ # angle
+ if dx >= dy:
+ mask[i - 1, j] = False
+
+ if j > 0:
+ mask[i - 1, j - 1] = False
+ if dy >= dx:
+ mask[i, j - 1] = False
+
+ # agent -> to the right
+ for i in range(agent_pos[0]+1, grid.width-1):
+ dx = abs(agent_pos[0] - i)
+ cell = grid.get(i, j)
+
+ if (cell and not cell.see_behind()) or mask[i,j] == False:
+ # angle
+ if dx >= dy:
+ mask[i + 1, j] = False
+
+ if j > 0:
+ mask[i + 1, j - 1] = False
+ if dy >= dx:
+ mask[i, j - 1] = False
+
+ # for i in range(0, grid.width):
+ # cell = grid.get(i, j)
+ # if (cell and not cell.see_behind()) or mask[i,j] == False:
+ # mask[i, j-1] = False
+
+ # grid
+ # for j in reversed(range(0, agent_pos[1]+1)):
+ #
+ # i = agent_pos[0]
+ # cell = grid.get(i, j)
+ # if (cell and not cell.see_behind()) or mask[i, j] == False:
+ # if j < agent_pos[1]:
+ # # grid
+ # mask[i,j-1] = False
+ #
+ # for i in reversed(range(1, agent_pos[0])):
+ # # agent -> to the left
+ # cell = grid.get(i, j)
+ # if (cell and not cell.see_behind()) or mask[i,j] == False:
+ # # grid
+ # mask[i-1, j] = False
+ # if j < agent_pos[1] and j > 0:
+ # mask[i, j-1] = False
+ #
+ # for i in range(agent_pos[0]+1, grid.width-1):
+ # # agent -> to the right
+ # cell = grid.get(i, j)
+ # if (cell and not cell.see_behind()) or mask[i,j] == False:
+ # # grid
+ # mask[i+1, j] = False
+ # if j < agent_pos[1] and j > 0:
+ # mask[i, j-1] = False
+
+ for j in range(0, grid.height):
+ for i in range(0, grid.width):
+ if not mask[i, j]:
+ grid.set(i, j, None)
+
+ return mask
+
+
+class MiniGridEnv(gym.Env):
+ """
+ 2D grid world game environment
+ """
+
+ metadata = {
+ 'render.modes': ['human', 'rgb_array'],
+ 'video.frames_per_second' : 10
+ }
+
+ # Enumeration of possible actions
+ class Actions(IntEnum):
+ # Turn left, turn right, move forward
+ left = 0
+ right = 1
+ forward = 2
+
+ # Pick up an object
+ pickup = 3
+ # Drop an object
+ drop = 4
+ # Toggle/activate an object
+ toggle = 5
+
+ # Done completing task
+ done = 6
+
+ def __init__(
+ self,
+ grid_size=None,
+ width=None,
+ height=None,
+ max_steps=100,
+ see_through_walls=False,
+ full_obs=False,
+ seed=None,
+ agent_view_size=7,
+ actions=None,
+ action_space=None,
+ add_npc_direction=False,
+ add_npc_point_direction=False,
+ add_npc_last_prim_action=False,
+ reward_diminish_factor=0.9,
+ egocentric_observation=True,
+ ):
+
+ # sanity check params for SocialAI experiments
+ if "SocialAI" in type(self).__name__:
+ assert egocentric_observation
+ assert grid_size == 10
+ assert not see_through_walls
+ assert max_steps == 80
+ assert agent_view_size == 7
+ assert not full_obs
+ assert add_npc_direction and add_npc_point_direction and add_npc_last_prim_action
+
+ self.egocentric_observation = egocentric_observation
+
+ if hasattr(self, "lever_active_steps"):
+ assert self.lever_active_steps == 10
+
+ # Can't set both grid_size and width/height
+ if grid_size:
+ assert width == None and height == None
+ width = grid_size
+ height = grid_size
+
+ # Action enumeration for this environment
+ if actions:
+ self.actions = actions
+ else:
+ self.actions = MiniGridEnv.Actions
+
+ # Actions are discrete integer values
+ if action_space:
+ self.action_space = action_space
+ else:
+ self.action_space = spaces.MultiDiscrete([len(self.actions)])
+
+ # Number of cells (width and height) in the agent view
+ assert agent_view_size % 2 == 1
+ assert agent_view_size >= 3
+ self.agent_view_size = agent_view_size
+
+ # Number of object dimensions (i.e. number of channels in symbolic image)
+ self.add_npc_direction = add_npc_direction
+ self.add_npc_point_direction = add_npc_point_direction
+ self.add_npc_last_prim_action = add_npc_last_prim_action
+ self.nb_obj_dims = 3 + 2*bool(not self.egocentric_observation) + int(self.add_npc_direction) + int(self.add_npc_point_direction) + int(self.add_npc_last_prim_action)
+
+ # Observations are dictionaries containing an
+ # encoding of the grid and a textual 'mission' string
+ self.observation_space = spaces.Box(
+ low=0,
+ high=255,
+ shape=(self.agent_view_size, self.agent_view_size, self.nb_obj_dims),
+ dtype='uint8'
+ )
+ self.observation_space = spaces.Dict({
+ 'image': self.observation_space
+ })
+
+ # Range of possible rewards
+ self.reward_range = (0, 1)
+
+ # Window to use for human rendering mode
+ self.window = None
+
+ # Environment configuration
+ self.width = width
+ self.height = height
+ self.max_steps = max_steps
+ self.see_through_walls = see_through_walls
+ self.full_obs = full_obs
+
+ self.reward_diminish_factor = reward_diminish_factor
+
+ # Current position and direction of the agent
+ self.agent_pos = None
+ self.agent_dir = None
+
+ # Initialize the RNG
+ self.seed(seed=seed)
+
+ # Initialize the state
+ self.reset()
+
+ def reset(self):
+ # Current position and direction of the agent
+ self.agent_pos = None
+ self.agent_dir = None
+
+ # Generate a new random grid at the start of each episode
+ # To keep the same grid for each episode, call env.seed() with
+ # the same seed before calling env.reset()
+ self._gen_grid(self.width, self.height)
+
+ # These fields should be defined by _gen_grid
+ assert self.agent_pos is not None
+ assert self.agent_dir is not None
+
+ # Check that the agent doesn't overlap with an object
+ start_cell = self.grid.get(*self.agent_pos)
+ assert start_cell is None or start_cell.can_overlap()
+
+ # Item picked up, being carried, initially nothing
+ self.carrying = None
+
+ # Step count since episode start
+ self.step_count = 0
+
+ # Return first observation
+ obs = self.gen_obs(full_obs=self.full_obs)
+ return obs
+
+ def reset_with_info(self, *args, **kwargs):
+ obs = self.reset(*args, **kwargs)
+ info = self.generate_info(done=False, reward=0)
+ return obs, info
+
+ def seed(self, seed=1337):
+ # Seed the random number generator
+ self.np_random, _ = seeding.np_random(seed)
+ return [seed]
+
+ def hash(self, size=16):
+ """Compute a hash that uniquely identifies the current state of the environment.
+ :param size: Size of the hashing
+ """
+ sample_hash = hashlib.sha256()
+
+ to_encode = [self.grid.encode(), self.agent_pos, self.agent_dir]
+ for item in to_encode:
+ sample_hash.update(str(item).encode('utf8'))
+
+ return sample_hash.hexdigest()[:size]
+
+ @property
+ def steps_remaining(self):
+ return self.max_steps - self.step_count
+
+ def is_near(self, pos1, pos2):
+ ax, ay = pos1
+ wx, wy = pos2
+ if (ax == wx and abs(ay - wy) == 1) or (ay == wy and abs(ax - wx) == 1):
+ return True
+ return False
+
+ def get_cell(self, x, y):
+ return self.grid.get(x, y)
+
+ def __str__(self):
+ """
+ Produce a pretty string of the environment's grid along with the agent.
+ A grid cell is represented by 2-character string, the first one for
+ the object and the second one for the color.
+ """
+
+ # Map of object types to short string
+ OBJECT_TO_STR = {
+ 'wall' : 'W',
+ 'floor' : 'F',
+ 'door' : 'D',
+ 'key' : 'K',
+ 'ball' : 'A',
+ 'box' : 'B',
+ 'goal' : 'G',
+ 'lava' : 'V',
+ }
+
+ # Short string for opened door
+ OPENDED_DOOR_IDS = '_'
+
+ # Map agent's direction to short string
+ AGENT_DIR_TO_STR = {
+ 0: '>',
+ 1: 'V',
+ 2: '<',
+ 3: '^'
+ }
+
+ str = ''
+
+ for j in range(self.grid.height):
+
+ for i in range(self.grid.width):
+ if i == self.agent_pos[0] and j == self.agent_pos[1]:
+ str += 2 * AGENT_DIR_TO_STR[self.agent_dir]
+ continue
+
+ c = self.grid.get(i, j)
+
+ if c == None:
+ str += ' '
+ continue
+
+ if c.type == 'door':
+ if c.is_open:
+ str += '__'
+ elif c.is_locked:
+ str += 'L' + c.color[0].upper()
+ else:
+ str += 'D' + c.color[0].upper()
+ continue
+
+ str += OBJECT_TO_STR[c.type] + c.color[0].upper()
+
+ if j < self.grid.height - 1:
+ str += '\n'
+
+ return str
+
+ def _gen_grid(self, width, height):
+ assert False, "_gen_grid needs to be implemented by each environment"
+
+ def _reward(self):
+ """
+ Compute the reward to be given upon success
+ """
+
+ return 1 - self.reward_diminish_factor * (self.step_count / self.max_steps)
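+
+ # Worked example (illustrative): with reward_diminish_factor=0.9 and
+ # max_steps=80, succeeding at step 20 gives 1 - 0.9 * (20 / 80) = 0.775,
+ # while succeeding on the very last step gives 0.1.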
+
+ def _rand_int(self, low, high):
+ """
+ Generate random integer in [low,high[
+ """
+ return self.np_random.randint(low, high)
+
+ def _rand_float(self, low, high):
+ """
+ Generate random float in [low,high[
+ """
+
+ return self.np_random.uniform(low, high)
+
+ def _rand_bool(self):
+ """
+ Generate random boolean value
+ """
+
+ return (self.np_random.randint(0, 2) == 0)
+
+ def _rand_elem(self, iterable):
+ """
+ Pick a random element in a list
+ """
+
+ lst = list(iterable)
+ idx = self._rand_int(0, len(lst))
+ return lst[idx]
+
+ def _rand_subset(self, iterable, num_elems):
+ """
+ Sample a random subset of distinct elements of a list
+ """
+
+ lst = list(iterable)
+ assert num_elems <= len(lst)
+
+ out = []
+
+ while len(out) < num_elems:
+ elem = self._rand_elem(lst)
+ lst.remove(elem)
+ out.append(elem)
+
+ return out
+
+ def _rand_color(self):
+ """
+ Generate a random color name (string)
+ """
+
+ return self._rand_elem(COLOR_NAMES)
+
+ def _rand_pos(self, xLow, xHigh, yLow, yHigh):
+ """
+ Generate a random (x,y) position tuple
+ """
+
+ return (
+ self.np_random.randint(xLow, xHigh),
+ self.np_random.randint(yLow, yHigh)
+ )
+
+ def find_loc(self,
+ top=None,
+ size=None,
+ reject_fn=None,
+ max_tries=math.inf,
+ reject_agent_pos=True,
+ reject_taken_pos=True
+ ):
+ """
+ Place an object at an empty position in the grid
+
+ :param top: top-left position of the rectangle where to place
+ :param size: size of the rectangle where to place
+ :param reject_fn: function to filter out potential positions
+ """
+
+ if top is None:
+ top = (0, 0)
+ else:
+ top = (max(top[0], 0), max(top[1], 0))
+
+ if size is None:
+ size = (self.grid.width, self.grid.height)
+
+ num_tries = 0
+
+ while True:
+ # This is to handle with rare cases where rejection sampling
+ # gets stuck in an infinite loop
+ if num_tries > max_tries:
+ raise RecursionError('rejection sampling failed in place_obj')
+ if num_tries % 10000 == 0 and num_tries > 0:
+ warnings.warn("num_tries = {}. This is probably an infinite loop. {}".format(num_tries, get_traceback()))
+ # warnings.warn("num_tries = {}. This is probably an infinite loop.".format(num_tries))
+ exit()
+
+ num_tries += 1
+
+ pos = np.array((
+ self._rand_int(top[0], min(top[0] + size[0], self.grid.width)),
+ self._rand_int(top[1], min(top[1] + size[1], self.grid.height))
+ ))
+
+ # Don't place the object on top of another object
+ if reject_taken_pos:
+ if self.grid.get(*pos) != None:
+ continue
+
+ # Don't place the object where the agent is
+ if reject_agent_pos and np.array_equal(pos, self.agent_pos):
+ continue
+
+ # Check if there is a filtering criterion
+ if reject_fn and reject_fn(self, pos):
+ continue
+
+ break
+
+ return pos
+
+ def place_obj(self,
+ obj,
+ top=None,
+ size=None,
+ reject_fn=None,
+ max_tries=math.inf
+ ):
+ """
+ Place an object at an empty position in the grid
+
+ :param top: top-left position of the rectangle where to place
+ :param size: size of the rectangle where to place
+ :param reject_fn: function to filter out potential positions
+ """
+
+ # if top is None:
+ # top = (0, 0)
+ # else:
+ # top = (max(top[0], 0), max(top[1], 0))
+ #
+ # if size is None:
+ # size = (self.grid.width, self.grid.height)
+ #
+ # num_tries = 0
+ #
+ # while True:
+ # # This is to handle with rare cases where rejection sampling
+ # # gets stuck in an infinite loop
+ # if num_tries > max_tries:
+ # raise RecursionError('rejection sampling failed in place_obj')
+ #
+ # num_tries += 1
+ #
+ # pos = np.array((
+ # self._rand_int(top[0], min(top[0] + size[0], self.grid.width)),
+ # self._rand_int(top[1], min(top[1] + size[1], self.grid.height))
+ # ))
+ #
+ # # Don't place the object on top of another object
+ # if self.grid.get(*pos) != None:
+ # continue
+ #
+ # # Don't place the object where the agent is
+ # if np.array_equal(pos, self.agent_pos):
+ # continue
+ #
+ # # Check if there is a filtering criterion
+ # if reject_fn and reject_fn(self, pos):
+ # continue
+ #
+ # break
+ #
+ # self.grid.set(*pos, obj)
+ #
+ # if obj is not None:
+ # obj.init_pos = pos
+ # obj.cur_pos = pos
+ #
+ # return pos
+
+ pos = self.find_loc(
+ top=top,
+ size=size,
+ reject_fn=reject_fn,
+ max_tries=max_tries
+ )
+
+ if obj is None:
+ self.grid.set(*pos, obj)
+ else:
+ self.put_obj_np(obj, pos)
+
+ return pos
+
+ def put_obj_np(self, obj, pos):
+ """
+ Put an object at a specific position in the grid
+ """
+
+ assert isinstance(pos, np.ndarray)
+
+ i, j = pos
+
+ cell = self.grid.get(i, j)
+ if cell is not None:
+ raise ValueError("trying to put {} on {}".format(obj, cell))
+
+ self.grid.set(i, j, obj)
+ obj.init_pos = np.array((i, j))
+ obj.cur_pos = np.array((i, j))
+
+ def put_obj(self, obj, i, j):
+ """
+ Put an object at a specific position in the grid
+ """
+ warnings.warn(
+ "This function is kept for backwards compatiblity with minigrid. It is recommended to use put_object_np()."
+ )
+ raise DeprecationWarning("Deprecated use put_obj_np. (or remove this Warning)")
+
+ self.grid.set(i, j, obj)
+ obj.init_pos = (i, j)
+ obj.cur_pos = (i, j)
+
+ def place_agent(
+ self,
+ top=None,
+ size=None,
+ rand_dir=True,
+ max_tries=math.inf
+ ):
+ """
+ Set the agent's starting point at an empty position in the grid
+ """
+
+ self.agent_pos = None
+ pos = self.place_obj(None, top, size, max_tries=max_tries)
+ self.agent_pos = pos
+
+ if rand_dir:
+ self.agent_dir = self._rand_int(0, 4)
+
+ return pos
+
+ @property
+ def dir_vec(self):
+ """
+ Get the direction vector for the agent, pointing in the direction
+ of forward movement.
+ """
+ assert self.agent_dir >= 0 and self.agent_dir < 4
+ return DIR_TO_VEC[self.agent_dir]
+
+ @property
+ def right_vec(self):
+ """
+ Get the vector pointing to the right of the agent.
+ """
+
+ dx, dy = self.dir_vec
+ return np.array((-dy, dx))
+
+ @property
+ def front_pos(self):
+ """
+ Get the position of the cell that is right in front of the agent
+ """
+
+ return self.agent_pos + self.dir_vec
+
+ @property
+ def back_pos(self):
+ """
+ Get the position of the cell that is right behind the agent
+ """
+
+ return self.agent_pos - self.dir_vec
+
+ @property
+ def right_pos(self):
+ """
+ Get the position of the cell to the right of the agent
+ """
+
+ return self.agent_pos + self.right_vec
+
+ @property
+ def left_pos(self):
+ """
+ Get the position of the cell to the left of the agent
+ """
+
+ return self.agent_pos - self.right_vec
+
+ def get_view_coords(self, i, j):
+ """
+ Translate and rotate absolute grid coordinates (i, j) into the
+ agent's partially observable view (sub-grid). Note that the resulting
+ coordinates may be negative or outside of the agent's view size.
+ """
+
+ ax, ay = self.agent_pos
+ dx, dy = self.dir_vec
+ rx, ry = self.right_vec
+
+ # Compute the absolute coordinates of the top-left view corner
+ sz = self.agent_view_size
+ hs = self.agent_view_size // 2
+ tx = ax + (dx * (sz-1)) - (rx * hs)
+ ty = ay + (dy * (sz-1)) - (ry * hs)
+
+ lx = i - tx
+ ly = j - ty
+
+ # Project the coordinates of the object relative to the top-left
+ # corner onto the agent's own coordinate system
+ vx = (rx*lx + ry*ly)
+ vy = -(dx*lx + dy*ly)
+
+ return vx, vy
+
+ def get_view_exts(self, dir=None, view_size=None, pos=None):
+ """
+ Get the extents of the square set of tiles visible to the agent (or to an npc, if specified).
+ Note: the bottom extent indices are not included in the set
+ """
+
+ # by default compute view exts for agent
+ if (dir is None) and (view_size is None) and (pos is None):
+ dir = self.agent_dir
+ view_size = self.agent_view_size
+ pos = self.agent_pos
+
+ # Facing right
+ if dir == 0:
+ topX = pos[0]
+ topY = pos[1] - view_size // 2
+ # Facing down
+ elif dir == 1:
+ topX = pos[0] - view_size // 2
+ topY = pos[1]
+ # Facing left
+ elif dir == 2:
+ topX = pos[0] - view_size + 1
+ topY = pos[1] - view_size // 2
+ # Facing up
+ elif dir == 3:
+ topX = pos[0] - view_size // 2
+ topY = pos[1] - view_size + 1
+ else:
+ assert False, "invalid agent direction: {}".format(dir)
+
+ botX = topX + view_size
+ botY = topY + view_size
+
+ return (topX, topY, botX, botY)
+
+ def relative_coords(self, x, y):
+ """
+        Check if a grid position is within the agent's field of view and, if so, return the corresponding view coordinates (otherwise return None)
+ """
+
+ vx, vy = self.get_view_coords(x, y)
+
+ if vx < 0 or vy < 0 or vx >= self.agent_view_size or vy >= self.agent_view_size:
+ return None
+
+ return vx, vy
+
+ def in_view(self, x, y):
+ """
+        Check if a grid position is visible to the agent
+ """
+
+ return self.relative_coords(x, y) is not None
+
+ def agent_sees(self, x, y):
+ """
+ Check if a non-empty grid position is visible to the agent
+ """
+
+ coordinates = self.relative_coords(x, y)
+ if coordinates is None:
+ return False
+ vx, vy = coordinates
+ assert not self.full_obs, "agent sees function not implemented with full_obs"
+ obs = self.gen_obs()
+ obs_grid, _ = Grid.decode(obs['image'])
+ obs_cell = obs_grid.get(vx, vy)
+ world_cell = self.grid.get(x, y)
+
+ return obs_cell is not None and obs_cell.type == world_cell.type
+
+ def step(self, action):
+ self.step_count += 1
+
+ reward = 0
+ done = False
+
+ # Get the position in front of the agent
+ fwd_pos = self.front_pos
+
+ # Get the contents of the cell in front of the agent
+ fwd_cell = self.grid.get(*fwd_pos)
+
+ # Rotate left
+ if action == self.actions.left:
+ self.agent_dir -= 1
+ if self.agent_dir < 0:
+ self.agent_dir += 4
+
+ # Rotate right
+ elif action == self.actions.right:
+ self.agent_dir = (self.agent_dir + 1) % 4
+
+ # Move forward
+        elif action == self.actions.forward:
+            if fwd_cell is not None and fwd_cell.can_push():
+                fwd_cell.push(push_dir=self.agent_dir, pusher="agent")
+
+            if fwd_cell is None or fwd_cell.can_overlap():
+                self.agent_pos = fwd_pos
+            if fwd_cell is not None and fwd_cell.type == 'goal':
+                done = True
+                reward = self._reward()
+            if fwd_cell is not None and fwd_cell.type == 'lava':
+                done = True
+
+ # Pick up an object
+ elif hasattr(self.actions, "pickup") and action == self.actions.pickup:
+ if fwd_cell and fwd_cell.can_pickup():
+ if self.carrying is None:
+ self.carrying = fwd_cell
+ self.carrying.cur_pos = np.array([-1, -1])
+ self.grid.set(*fwd_pos, None)
+
+ # Drop an object
+ elif hasattr(self.actions, "drop") and action == self.actions.drop:
+ if not fwd_cell and self.carrying:
+ self.grid.set(*fwd_pos, self.carrying)
+ self.carrying.cur_pos = fwd_pos
+ self.carrying = None
+
+ # Toggle/activate an object
+ elif action == self.actions.toggle:
+ if fwd_cell:
+ fwd_cell.toggle(self, fwd_pos)
+
+ # Done action (not used by default)
+ elif action == self.actions.done:
+ pass
+
+ elif action in map(int, self.actions):
+ # action that was added in an inheriting class (ex. talk action)
+ pass
+
+ elif np.isnan(action):
+ # action skip
+ pass
+
+ else:
+ assert False, f"unknown action {action}"
+
+ if self.step_count >= self.max_steps:
+ done = True
+
+ obs = self.gen_obs(full_obs=self.full_obs)
+
+ info = self.generate_info(done, reward)
+
+ return obs, reward, done, info
+
+ def generate_info(self, done, reward):
+
+ success = done and reward > 0
+
+ info = {"success": success}
+
+ gen_extra_info_dict = self.gen_extra_info() # add stuff needed for textual observations here
+
+ assert not any(item in info for item in gen_extra_info_dict), "Duplicate keys found with gen_extra_info"
+
+ info = {
+ **info,
+ **gen_extra_info_dict,
+ }
+ return info
+
+ def gen_extra_info(self):
+ grid, vis_mask = self.gen_obs_grid()
+ carrying = self.carrying
+ agent_pos_vx, agent_pos_vy = self.get_view_coords(self.agent_pos[0], self.agent_pos[1])
+ npc_actions_dict = SocialAINPCActionsDict
+
+ extra_info = {
+ "image": grid.encode(vis_mask),
+ "vis_mask": vis_mask,
+ "carrying": carrying,
+ "agent_pos_vx": agent_pos_vx,
+ "agent_pos_vy": agent_pos_vy,
+ "npc_actions_dict": npc_actions_dict
+ }
+ return extra_info
+
+ def gen_obs_grid(self):
+ """
+ Generate the sub-grid observed by the agent.
+ This method also outputs a visibility mask telling us which grid
+ cells the agent can actually see.
+ """
+
+ topX, topY, botX, botY = self.get_view_exts()
+
+ grid = self.grid.slice(topX, topY, self.agent_view_size, self.agent_view_size)
+
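+        # Rotate the sliced view so that the agent always faces "up" in its own frame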
+ for i in range(self.agent_dir + 1):
+ grid = grid.rotate_left()
+
+ # Process occluders and visibility
+ # Note that this incurs some performance cost
+ if not self.see_through_walls:
+ vis_mask = grid.process_vis(agent_pos=(self.agent_view_size // 2, self.agent_view_size - 1))
+ else:
+            vis_mask = np.ones(shape=(grid.width, grid.height), dtype=bool)  # np.bool is removed in recent NumPy versions
+
+ # Make it so the agent sees what it's carrying
+ # We do this by placing the carried object at the agent's position
+ # in the agent's partially observable view
+ agent_pos = grid.width // 2, grid.height - 1
+ if self.carrying:
+ grid.set(*agent_pos, self.carrying)
+ else:
+ grid.set(*agent_pos, None)
+
+ return grid, vis_mask
+
+ def add_agent_to_grid(self, image):
+ """
+ Add agent to symbolic pixel image, used only for full observation
+ """
+ ax, ay = self.agent_pos
+ image[ax,ay] = [9,9,9,self.agent_dir] # could be made cleaner by creating an Agent_id (here we use Lava_id)
+ return image
+
+ def gen_obs(self, full_obs=False):
+ """
+ Generate the agent's view (partially observable, low-resolution encoding)
+ Fully observable view can be returned when full_obs is set to True
+ """
+ if full_obs:
+ image = self.add_agent_to_grid(self.grid.encode())
+
+ else:
+ grid, vis_mask = self.gen_obs_grid()
+
+ # Encode the partially observable view into a numpy array
+ image = grid.encode(vis_mask, absolute_coordinates=not self.egocentric_observation)
+
+ assert hasattr(self, 'mission'), "environments must define a textual mission string"
+
+ # Observations are dictionaries containing:
+ # - an image (partially observable view of the environment)
+ # - the agent's direction/orientation (acting as a compass)
+ # - a textual mission string (instructions for the agent)
+ obs = {
+ 'image': image,
+ 'direction': self.agent_dir,
+ 'mission': self.mission
+ }
+
+ return obs
+
+ def get_obs_render(self, obs, tile_size=TILE_PIXELS//2):
+ """
+ Render an agent observation for visualization
+ """
+
+ grid, vis_mask = Grid.decode(obs)
+
+ # Render the whole grid
+ img = grid.render(
+ tile_size,
+ agent_pos=(self.agent_view_size // 2, self.agent_view_size - 1),
+ agent_dir=3,
+ highlight_mask=vis_mask
+ )
+
+ return img
+
+ def render(self, mode='human', close=False, highlight=True, tile_size=TILE_PIXELS, mask_unobserved=False):
+ """
+ Render the whole-grid human view
+ """
+ if mode == 'human' and close:
+ if self.window:
+ self.window.close()
+ return
+
+ if mode == 'human' and not self.window:
+ import gym_minigrid.window
+ self.window = gym_minigrid.window.Window('gym_minigrid')
+ self.window.show(block=False)
+
+ # Compute which cells are visible to the agent
+ _, vis_mask = self.gen_obs_grid()
+
+ # Compute the world coordinates of the bottom-left corner
+ # of the agent's view area
+ f_vec = self.dir_vec
+ r_vec = self.right_vec
+ top_left = self.agent_pos + f_vec * (self.agent_view_size-1) - r_vec * (self.agent_view_size // 2)
+
+ # Mask of which cells to highlight
+        highlight_mask = np.zeros(shape=(self.width, self.height), dtype=bool)  # np.bool is removed in recent NumPy versions
+
+ # For each cell in the visibility mask
+ for vis_j in range(0, self.agent_view_size):
+ for vis_i in range(0, self.agent_view_size):
+ # If this cell is not visible, don't highlight it
+ if not vis_mask[vis_i, vis_j]:
+ continue
+
+ # Compute the world coordinates of this cell
+ abs_i, abs_j = top_left - (f_vec * vis_j) + (r_vec * vis_i)
+
+ if abs_i < 0 or abs_i >= self.width:
+ continue
+ if abs_j < 0 or abs_j >= self.height:
+ continue
+
+ # Mark this cell to be highlighted
+ highlight_mask[abs_i, abs_j] = True
+
+ # Render the whole grid
+ img = self.grid.render(
+ tile_size,
+ self.agent_pos,
+ self.agent_dir,
+ highlight_mask=highlight_mask if highlight else None,
+ mask_unobserved=mask_unobserved
+ )
+ if mode == 'human':
+ # self.window.set_caption(self.mission)
+ self.window.show_img(img)
+
+ return img
+
+ def get_mission(self):
+ return self.mission
+
+ def close(self):
+ if self.window:
+ self.window.close()
+ return
+
+ def gen_text_obs(self):
+ grid, vis_mask = self.gen_obs_grid()
+
+ # Encode the partially observable view into a numpy array
+ image = grid.encode(vis_mask)
+
+ # (OBJECT_TO_IDX[self.type], COLOR_TO_IDX[self.color], state)
+ # State, 0: open, 1: closed, 2: locked
+ IDX_TO_COLOR = dict(zip(COLOR_TO_IDX.values(), COLOR_TO_IDX.keys()))
+ IDX_TO_OBJECT = dict(zip(OBJECT_TO_IDX.values(), OBJECT_TO_IDX.keys()))
+
+ list_textual_descriptions = []
+
+ if self.carrying is not None:
+ list_textual_descriptions.append("You carry a {} {}".format(self.carrying.color, self.carrying.type))
+
+ agent_pos_vx, agent_pos_vy = self.get_view_coords(self.agent_pos[0], self.agent_pos[1])
+
+ view_field_dictionary = dict()
+
+        for i in range(image.shape[0]):
+            for j in range(image.shape[1]):
+                # skip unseen (0), empty (1) and wall (2) cells
+                if image[i][j][0] not in (0, 1, 2):
+                    if i not in view_field_dictionary:
+                        view_field_dictionary[i] = dict()
+                    view_field_dictionary[i][j] = image[i][j]
+
+ # Find the wall if any
+ # We describe a wall only if there is no objects between the agent and the wall in straight line
+
+ # Find wall in front
+ add_wall_descr = False
+ if add_wall_descr:
+ j = agent_pos_vy - 1
+ object_seen = False
+ while j >= 0 and not object_seen:
+ if image[agent_pos_vx][j][0] != 0 and image[agent_pos_vx][j][0] != 1:
+ if image[agent_pos_vx][j][0] == 2:
+ list_textual_descriptions.append(
+ f"A wall is {agent_pos_vy - j} steps in front of you. \n") # forward
+ object_seen = True
+ else:
+ object_seen = True
+ j -= 1
+ # Find wall left
+ i = agent_pos_vx - 1
+ object_seen = False
+ while i >= 0 and not object_seen:
+ if image[i][agent_pos_vy][0] != 0 and image[i][agent_pos_vy][0] != 1:
+ if image[i][agent_pos_vy][0] == 2:
+ list_textual_descriptions.append(
+ f"A wall is {agent_pos_vx - i} steps to the left. \n") # left
+ object_seen = True
+ else:
+ object_seen = True
+ i -= 1
+ # Find wall right
+ i = agent_pos_vx + 1
+ object_seen = False
+ while i < image.shape[0] and not object_seen:
+ if image[i][agent_pos_vy][0] != 0 and image[i][agent_pos_vy][0] != 1:
+ if image[i][agent_pos_vy][0] == 2:
+ list_textual_descriptions.append(
+ f"A wall is {i - agent_pos_vx} steps to the right. \n") # right
+ object_seen = True
+ else:
+ object_seen = True
+ i += 1
+
+ # list_textual_descriptions.append("You see the following objects: ")
+ # returns the position of seen objects relative to you
+ for i in view_field_dictionary.keys():
+ for j in view_field_dictionary[i].keys():
+ if i != agent_pos_vx or j != agent_pos_vy:
+ object = view_field_dictionary[i][j]
+
+ front_dist = agent_pos_vy - j
+ left_right_dist = i-agent_pos_vx
+
+ loc_descr = ""
+ if front_dist == 1 and left_right_dist == 0:
+ loc_descr += "Right in front of you "
+
+ elif left_right_dist == 1 and front_dist == 0:
+ loc_descr += "Just to the right of you"
+
+ elif left_right_dist == -1 and front_dist == 0:
+ loc_descr += "Just to the left of you"
+
+ else:
+                        front_suff = "s" if front_dist > 1 else ""
+                        front_str = f"{front_dist} step{front_suff} in front of you " if front_dist > 0 else ""
+
+                        loc_descr += front_str
+
+                        suff = "s" if abs(left_right_dist) > 1 else ""
+ and_ = "and" if loc_descr != "" else ""
+
+ if left_right_dist < 0:
+ left_right_str = f"{and_} {-left_right_dist} step{suff} to the left"
+ loc_descr += left_right_str
+
+ elif left_right_dist > 0:
+ left_right_str = f"{and_} {left_right_dist} step{suff} to the right"
+ loc_descr += left_right_str
+
+ else:
+ left_right_str = ""
+ loc_descr += left_right_str
+
+ loc_descr += f" there is a "
+
+ obj_type = IDX_TO_OBJECT[object[0]]
+ if obj_type == "npc":
+ IDX_TO_STATE = {0: 'friendly', 1: 'antagonistic'}
+
+ description = f"{IDX_TO_STATE[object[2]]} {IDX_TO_COLOR[object[1]]} peer. "
+
+ # gaze
+ gaze_dir = {
+ 0: "towards you",
+ 1: "to the left of you",
+ 2: "in the same direction as you",
+ 3: "to the right of you",
+ }
+ description += f"It is looking {gaze_dir[object[3]]}. "
+
+ # point
+ point_dir = {
+ 0: "towards you",
+ 1: "to the left of you",
+ 2: "in the same direction as you",
+ 3: "to the right of you",
+ }
+
+ if object[4] != 255:
+ description += f"It is pointing {point_dir[object[4]]}. "
+
+ # last action
+ last_action = {v: k for k, v in SocialAINPCActionsDict.items()}[object[5]]
+
+
+ last_action = {
+ "go_forward": "foward",
+ "rotate_left": "turn left",
+ "rotate_right": "turn right",
+ "toggle_action": "toggle",
+ "point_stop_point": "stop pointing",
+ "point_E": "",
+ "point_S": "",
+ "point_W": "",
+ "point_N": "",
+ "stop_point": "stop pointing",
+ "no_op": ""
+ }[last_action]
+
+ if last_action not in ["no_op", ""]:
+ description += f"It's last action is {last_action}. "
+
+ elif obj_type in ["switch", "apple", "generatorplatform", "marble", "marbletee", "fence"]:
+
+ if obj_type == "switch":
+ # assumes that Switch.no_light == True
+ assert object[-1] == 0
+
+ description = f"{IDX_TO_COLOR[object[1]]} {IDX_TO_OBJECT[object[0]]} "
+ assert object[2:].mean() == 0
+
+ elif obj_type == "lockablebox":
+ IDX_TO_STATE = {0: 'open', 1: 'closed', 2: 'locked'}
+ description = f"{IDX_TO_STATE[object[2]]} {IDX_TO_COLOR[object[1]]} {IDX_TO_OBJECT[object[0]]} "
+ assert object[3:].mean() == 0
+
+ elif obj_type == "applegenerator":
+ IDX_TO_STATE = {1: 'square', 2: 'round'}
+ description = f"{IDX_TO_STATE[object[2]]} {IDX_TO_COLOR[object[1]]} {IDX_TO_OBJECT[object[0]]} "
+ assert object[3:].mean() == 0
+
+ elif obj_type == "remotedoor":
+ IDX_TO_STATE = {0: 'open', 1: 'closed'}
+ description = f"{IDX_TO_STATE[object[2]]} {IDX_TO_COLOR[object[1]]} {IDX_TO_OBJECT[object[0]]} "
+ assert object[3:].mean() == 0
+
+ elif obj_type == "door":
+ IDX_TO_STATE = {0: 'open', 1: 'closed', 2: 'locked'}
+ description = f"{IDX_TO_STATE[object[2]]} {IDX_TO_COLOR[object[1]]} {IDX_TO_OBJECT[object[0]]} "
+ assert object[3:].mean() == 0
+
+ elif obj_type == "lever":
+ IDX_TO_STATE = {1: 'activated', 0: 'unactivated'}
+ if object[3] == 255:
+ countdown_txt = ""
+ else:
+ countdown_txt = f"with {object[3]} timesteps left. "
+
+ description = f"{IDX_TO_STATE[object[2]]} {IDX_TO_COLOR[object[1]]} {IDX_TO_OBJECT[object[0]]} {countdown_txt}"
+
+ assert object[4:].mean() == 0
+ else:
+ raise ValueError(f"Undefined object type {obj_type}")
+
+ full_destr = loc_descr + description + "\n"
+
+ list_textual_descriptions.append(full_destr)
+
+ if len(list_textual_descriptions) == 0:
+ list_textual_descriptions.append("\n")
+
+ return {'descriptions': list_textual_descriptions}
+
+class MultiModalMiniGridEnv(MiniGridEnv):
+
+ grammar = None
+
+ def reset(self, *args, **kwargs):
+ obs = super().reset()
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+ return obs
+
+ def append_existing_utterance_to_history(self):
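+        # Append whatever was said during the last step (everything after the empty
+        # symbol / beginning string) to the persistent utterance history.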
+ if self.utterance != self.empty_symbol:
+ if self.utterance.startswith(self.empty_symbol):
+ self.utterance_history += self.utterance[len(self.empty_symbol):]
+ else:
+ assert self.utterance == self.beginning_string
+ self.utterance_history += self.utterance
+
+ def add_utterance_to_observation(self, obs):
+ obs["utterance"] = self.utterance
+ obs["utterance_history"] = self.utterance_history
+ obs["mission"] = "Hidden"
+ return obs
+
+ def reset_utterance(self):
+ # set utterance to empty indicator
+ self.utterance = self.empty_symbol
+
+    def render(self, *args, show_dialogue=True, **kwargs):
+
+        obs = super().render(*args, **kwargs)
+
+        # mode can be passed positionally or as a keyword; default matches MiniGridEnv.render
+        mode = args[0] if args else kwargs.get('mode', 'human')
+
+        if mode == 'human':
+ # draw text to the side of the image
+ self.window.clear_text() # erase previous text
+ if show_dialogue:
+ self.window.set_caption(self.full_conversation)
+
+ # self.window.ax.set_title("correct color: {}".format(self.box.target_color), loc="left", fontsize=10)
+
+ if self.outcome_info:
+ color = None
+ if "SUCCESS" in self.outcome_info:
+ color = "lime"
+ elif "FAILURE" in self.outcome_info:
+ color = "red"
+ self.window.add_text(*(0.01, 0.85, self.outcome_info),
+ **{'fontsize': 15, 'color': color, 'weight': "bold"})
+
+ self.window.show_img(obs) # re-draw image to add changes to window
+ return obs
+
+ def add_obstacles(self):
+ self.obstacles = self.parameters.get("Obstacles", "No") if self.parameters else "No"
+
+ if self.obstacles != "No":
+ n_stumps_range = {
+ "A_bit": (1, 2),
+ "Medium": (3, 4),
+ "A_lot": (5, 6),
+ }[self.obstacles]
+
+ n_stumps = random.randint(*n_stumps_range)
+
+ for _ in range(n_stumps):
+ self.wall_start_x = self._rand_int(1, self.current_width - 2)
+ self.wall_start_y = self._rand_int(1, self.current_height - 2)
+ if random.choice([True, False]):
+ self.grid.horz_wall(
+ x=self.wall_start_x,
+ y=self.wall_start_y,
+ length=1
+ )
+                else:
+                    # presumably intended as a vertical wall (the original branch repeated horz_wall);
+                    # with length=1 both calls place a single wall cell
+                    self.grid.vert_wall(
+                        x=self.wall_start_x,
+                        y=self.wall_start_y,
+                        length=1
+                    )
\ No newline at end of file
diff --git a/gym-minigrid/gym_minigrid/parametric_env.py b/gym-minigrid/gym_minigrid/parametric_env.py
new file mode 100644
index 0000000000000000000000000000000000000000..ce5f2e979df70cb0f6f44b5b67820b0be7907acb
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/parametric_env.py
@@ -0,0 +1,301 @@
+from abc import ABC
+from graphviz import Digraph
+import re
+import random
+from termcolor import cprint
+
+
+class Node:
+ def __init__(self, id, label, type, parent=None):
+ """
+ type: type must be "param" or "value"
+ for type "param" one of the children must be chosen
+ for type "value" all of the children must be set
+ """
+ self.id = id
+ self.label = label
+ self.parent = parent
+ self.children = []
+ self.type = type
+
+ def __repr__(self):
+ return f"{self.id}({self.type})-'{self.label}'"
+
+
+class ParameterTree(ABC):
+
+ def __init__(self):
+
+ self.last_node_id = 0
+
+ self.create_digraph()
+
+ self.nodes = {}
+ self.root = None
+
+ def create_digraph(self):
+ self.tree = Digraph("unix", format='svg')
+ self.tree.attr(size='30,100')
+
+ def add_node(self, label, parent=None, type="param"):
+ """
+ All children of this node must be set
+ """
+ if type not in ["param", "value"]:
+ raise ValueError('Node type must be "param" or "value"')
+
+ if parent is None and self.root is not None:
+ raise ValueError("Root already set: {}. parent cannot be None. ".format(self.root.id))
+
+ # add to graph
+ node_id = self.new_node_id()
+ self.nodes[node_id] = Node(id=node_id, label=label, parent=parent, type=type)
+
+ if parent is None:
+ self.root = self.nodes[node_id]
+ else:
+ self.nodes[parent.id].children.append(self.nodes[node_id])
+
+ return self.nodes[node_id]
+
+ def sample_env_params(self, ACL=None):
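+        # Returns a dict mapping each visited "param" node to its chosen "value" node;
+        # choices are random unless an ACL object is given to make them.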
+ parameters = {}
+
+ nodes = [self.root]
+
+ # BFS
+ while nodes:
+ node = nodes.pop(0)
+
+ if node.type == "param":
+
+ if len(node.children) == 0:
+ raise ValueError("Node {} doesn't have any children.".format(node.label))
+
+ if ACL is None:
+ # choose randomly
+ chosen = random.choice(node.children)
+ else:
+ # let the ACL choose
+ chosen = ACL.choose(node, parameters)
+
+ assert chosen.type == "value"
+ nodes.append(chosen)
+
+ parameters[node] = chosen
+
+ elif node.type == "value":
+ nodes.extend(node.children)
+
+ else:
+ raise ValueError('Node type must be "param" or "value" and is {}'.format(node.type))
+
+ return parameters
+
+ def new_node_id(self):
+ new_id = self.last_node_id + 1
+ self.last_node_id = new_id
+ return str("node_"+str(new_id))
+
+ def print_tree(self, selected_parameters={}):
+ print("Parameter tree")
+
+ nodes = [self.root]
+ color = None
+
+ # BFS
+ while nodes:
+ node = nodes.pop(0)
+
+ if node.type == "param":
+ if node in selected_parameters.keys():
+ color = "blue"
+ else:
+ color = None
+
+ if node.parent is not None:
+
+ cprint("{}: {} ({}) -----> {}: {} ({})".format(
+ node.parent.type, node.parent.label, node.parent.id,
+ node.type, node.label, node.id
+ ), color)
+
+ else:
+ cprint("{}: {} ({})".format(node.type, node.label, node.id), color)
+
+ nodes.extend(node.children)
+
+ def draw_tree(self, filename, selected_parameters={}, ignore_labels=[], folded_nodes=[], label_parser={}, save=True):
+
+ self.create_digraph()
+
+ nodes = [self.root]
+
+ color_param = "grey60"
+ color_value = "lightgray"
+ fontcolor = "black"
+ fontsize = "18"
+
+ dots_fontsize = "30"
+ folded_param = "grey95"
+ folded_value = "grey95"
+ folded_fontcolor = "gray70"
+
+ def add_fold_symbol(label, folded=False):
+ return label
+ # return label + " ❯" if folded else label
+
+ # BFS - construct vizgraph
+ while nodes:
+
+ node = nodes.pop(0)
+
+ while node.label in ignore_labels:
+ node = nodes.pop(0)
+
+ if node.label in folded_nodes:
+
+ n_label = label_parser.get(node.label, node.label)
+ n_label = add_fold_symbol(n_label, folded=True)
+
+ if node.type == "param":
+ color = folded_param
+ self.tree.attr('node', shape='box', style="filled", color=color, fontcolor=folded_fontcolor, fontsize=fontsize)
+ self.tree.node(name=node.id, label=n_label, type="parameter")
+
+ elif node.type == "value":
+ color = folded_value
+ self.tree.attr('node', shape='ellipse', style='filled', color=color, fontcolor=folded_fontcolor, fontsize=fontsize)
+ self.tree.node(name=node.id, label=n_label, type="value")
+
+
+ else:
+ raise ValueError(f"Undefined node type {node.type}")
+
+ # add folded node sign
+ folded_node_id = node.id+"_fold"
+ # self.tree.attr('node', shape='ellipse', style='filled', color="white", fontcolor=folded_fontcolor, fontsize=fontsize)
+ # self.tree.attr('node', shape='none', style='filled', color="gray", fontcolor=folded_fontcolor, fontsize=dots_fontsize)
+ self.tree.attr('node', shape='none', color="white",fontcolor=folded_fontcolor, fontsize=dots_fontsize)
+ self.tree.node(name=folded_node_id, label="...", type="value")
+ self.tree.edge(node.id, folded_node_id, color=folded_fontcolor)
+
+ elif node.type == "param":
+
+ if node.label in selected_parameters.keys() and (node == self.root or node.parent.selected):
+ color = "lightblue3"
+ node.selected=True
+ else:
+ color = color_param
+ node.selected=False
+
+ n_label = label_parser.get(node.label, node.label)
+ n_label = add_fold_symbol(n_label, folded=False)
+
+ self.tree.attr('node', shape='box', style="filled", color=color, fontcolor=fontcolor, fontsize=fontsize)
+ self.tree.node(name=node.id, label=n_label, type="parameter")
+
+ nodes.extend(node.children)
+
+ elif node.type == "value":
+
+ if (selected_parameters.get(node.parent.label, "Not existent") == node.label) and (node == self.root or node.parent.selected):
+ # if node.label in selected_parameters.values() and (node == self.root or node.parent.selected):
+ color = "lightblue2"
+ node.selected = True
+ else:
+ color = color_value
+ node.selected = False
+
+ n_label = label_parser.get(node.label, node.label)
+ n_label = add_fold_symbol(n_label, folded=False)
+
+ # add to vizgraph
+ self.tree.attr('node', shape='ellipse', style='filled', color=color, fontcolor=fontcolor, fontsize=fontsize)
+ self.tree.node(name=node.id, label=n_label, type="value")
+
+ nodes.extend(node.children)
+ else:
+ raise ValueError(f"Undefined node type {node.type}")
+
+ if node.parent is not None:
+ self.tree.edge(node.parent.id, node.id)
+
+
+ # draw image
+ if save:
+ self.tree.render(filename)
+ print("Tree image saved in : {}".format(filename))
+
+
+if __name__ == '__main__':
+ # demo of how to use the ParameterTree class
+
+ tree = ParameterTree()
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+ collab_nd = tree.add_node("Collaboration", parent=env_type_nd, type="value")
+ perc_inf_nd = tree.add_node("Perception_inference", parent=env_type_nd, type="value")
+    # NOTE: the parameter values below are deprecated; the raise is commented out so the demo runs end to end
+    # raise DeprecationWarning("deprecated parameters")
+
+ # Information seeking
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ tree.add_node("lot", parent=scaffolding_nd, type="value")
+ tree.add_node("medium", parent=scaffolding_nd, type="value")
+ tree.add_node("little", parent=scaffolding_nd, type="value")
+ tree.add_node("none", parent=scaffolding_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("Eye contact", parent=prag_fr_compl_nd, type="value")
+ tree.add_node("Hello", parent=prag_fr_compl_nd, type="value")
+
+ emulation_nd = tree.add_node("Emulation", parent=inf_seeking_nd, type="param")
+ tree.add_node("N", parent=emulation_nd, type="value")
+ tree.add_node("Y", parent=emulation_nd, type="value")
+
+ pointing_nd = tree.add_node("Pointing", parent=inf_seeking_nd, type="param")
+ tree.add_node("No", parent=pointing_nd, type="value")
+ tree.add_node("Direct", parent=pointing_nd, type="value")
+ tree.add_node("Indirect", parent=pointing_nd, type="value")
+
+ language_graounding_nd = tree.add_node("Language_grounding", parent=inf_seeking_nd, type="param")
+ tree.add_node("No", parent=language_graounding_nd, type="value")
+ tree.add_node("Color", parent=language_graounding_nd, type="value")
+ tree.add_node("Feedback", parent=language_graounding_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+ tree.add_node("Boxes", parent=problem_nd, type="value")
+ tree.add_node("Switches", parent=problem_nd, type="value")
+ tree.add_node("Corridors", parent=problem_nd, type="value")
+
+ obstacles_nd = tree.add_node("Obstacles", parent=inf_seeking_nd, type="param")
+ tree.add_node("no", parent=obstacles_nd, type="value")
+ tree.add_node("lava", parent=obstacles_nd, type="value")
+ tree.add_node("walls", parent=obstacles_nd, type="value")
+
+ # Collaboration
+ colab_type_nd = tree.add_node("Collaboration type", parent=collab_nd, type="param")
+ tree.add_node("Door Lever", parent=colab_type_nd, type="value")
+ tree.add_node("Door Button", parent=colab_type_nd, type="value")
+ tree.add_node("Marble Run", parent=colab_type_nd, type="value")
+ tree.add_node("Marble Pass", parent=colab_type_nd, type="value")
+
+ role_nd = tree.add_node("Role", parent=collab_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+ tree.add_node("asocial", parent=role_nd, type="value")
+
+ # Perception inference
+ NPC_movement_nd = tree.add_node("NPC movement", parent=perc_inf_nd, type="param")
+ tree.add_node("can't turn; can't move", parent=NPC_movement_nd, type="value")
+ tree.add_node("can turn; can't move", parent=NPC_movement_nd, type="value")
+ tree.add_node("can turn; can move", parent=NPC_movement_nd, type="value")
+
+ occlusion_nd = tree.add_node("Occlusions", parent=perc_inf_nd, type="param")
+ tree.add_node("no", parent=occlusion_nd, type="value")
+ tree.add_node("walls", parent=occlusion_nd, type="value")
+
+ params = tree.sample_env_params()
+ tree.draw_tree("viz/demotree", params)
+
diff --git a/gym-minigrid/gym_minigrid/register.py b/gym-minigrid/gym_minigrid/register.py
new file mode 100644
index 0000000000000000000000000000000000000000..5c7fff1a53a8318f085a7b55439f2d34411817cd
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/register.py
@@ -0,0 +1,25 @@
+from gym.envs.registration import register as gym_register
+
+env_list = []
+
+def register(
+ id,
+ entry_point,
+ reward_threshold=0.95,
+ kwargs={}
+):
+ assert id.startswith("MiniGrid-") or id.startswith("SocialAI-")
+ assert id not in env_list
+
+ # print("Registered:", id)
+
+ # Register the environment with OpenAI gym
+ gym_register(
+ id=id,
+ entry_point=entry_point,
+ reward_threshold=reward_threshold,
+ kwargs=kwargs
+ )
+
+ # Add the environment to the set
+ env_list.append(id)
diff --git a/gym-minigrid/gym_minigrid/rendering.py b/gym-minigrid/gym_minigrid/rendering.py
new file mode 100644
index 0000000000000000000000000000000000000000..de2024e4576cccd4606e5d3eed7be631f806c308
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/rendering.py
@@ -0,0 +1,137 @@
+import math
+import numpy as np
+
+def downsample(img, factor):
+ """
+ Downsample an image along both dimensions by some factor
+ """
+
+ assert img.shape[0] % factor == 0
+ assert img.shape[1] % factor == 0
+
+ img = img.reshape([img.shape[0]//factor, factor, img.shape[1]//factor, factor, 3])
+ img = img.mean(axis=3)
+ img = img.mean(axis=1)
+
+ return img
+
+def fill_coords(img, fn, color):
+ """
+ Fill pixels of an image with coordinates matching a filter function
+ """
+
+ for y in range(img.shape[0]):
+ for x in range(img.shape[1]):
+ yf = (y + 0.5) / img.shape[0]
+ xf = (x + 0.5) / img.shape[1]
+ if fn(xf, yf):
+ img[y, x] = color
+
+ return img
+
+def rotate_fn(fin, cx, cy, theta):
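+    """
+    Return a copy of the filter function `fin` evaluated on coordinates rotated
+    around the point (cx, cy), so that shapes drawn with the returned filter
+    appear rotated by `theta`.
+    """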
+ def fout(x, y):
+ x = x - cx
+ y = y - cy
+
+ x2 = cx + x * math.cos(-theta) - y * math.sin(-theta)
+ y2 = cy + y * math.cos(-theta) + x * math.sin(-theta)
+
+ return fin(x2, y2)
+
+ return fout
+
+def point_in_line(x0, y0, x1, y1, r):
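+    """
+    Return a filter selecting points within distance r of the segment (x0, y0)-(x1, y1)
+    """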
+ p0 = np.array([x0, y0])
+ p1 = np.array([x1, y1])
+ dir = p1 - p0
+ dist = np.linalg.norm(dir)
+ dir = dir / dist
+
+ xmin = min(x0, x1) - r
+ xmax = max(x0, x1) + r
+ ymin = min(y0, y1) - r
+ ymax = max(y0, y1) + r
+
+ def fn(x, y):
+ # Fast, early escape test
+ if x < xmin or x > xmax or y < ymin or y > ymax:
+ return False
+
+ q = np.array([x, y])
+ pq = q - p0
+
+ # Closest point on line
+ a = np.dot(pq, dir)
+ a = np.clip(a, 0, dist)
+ p = p0 + a * dir
+
+ dist_to_line = np.linalg.norm(q - p)
+ return dist_to_line <= r
+
+ return fn
+
+def point_in_circle(cx, cy, r):
+ def fn(x, y):
+ return (x-cx)*(x-cx) + (y-cy)*(y-cy) <= r * r
+ return fn
+
+
+def point_in_circle_clip(cx, cy, r, theta_start=0, theta_end=-np.pi):
+ def fn(x, y):
+
+ if (x-cx)*(x-cx) + (y-cy)*(y-cy) <= r * r:
+ if theta_start < 0:
+ return theta_start > np.arctan2(y-cy, x-cx) > theta_end
+ else:
+                return theta_start < np.arctan2(y - cy, x - cx) < theta_end
+
+        return False  # outside the circle
+
+    return fn
+
+def point_in_rect(xmin, xmax, ymin, ymax):
+ def fn(x, y):
+ return x >= xmin and x <= xmax and y >= ymin and y <= ymax
+ return fn
+
+def point_in_triangle(a, b, c):
+ a = np.array(a)
+ b = np.array(b)
+ c = np.array(c)
+
+ def fn(x, y):
+ v0 = c - a
+ v1 = b - a
+ v2 = np.array((x, y)) - a
+
+ # Compute dot products
+ dot00 = np.dot(v0, v0)
+ dot01 = np.dot(v0, v1)
+ dot02 = np.dot(v0, v2)
+ dot11 = np.dot(v1, v1)
+ dot12 = np.dot(v1, v2)
+
+ # Compute barycentric coordinates
+ inv_denom = 1 / (dot00 * dot11 - dot01 * dot01)
+ u = (dot11 * dot02 - dot01 * dot12) * inv_denom
+ v = (dot00 * dot12 - dot01 * dot02) * inv_denom
+
+ # Check if point is in triangle
+ return (u >= 0) and (v >= 0) and (u + v) < 1
+
+ return fn
+
+def point_in_quadrangle(a, b, c, d):
+ fn1 = point_in_triangle(a, b, c)
+ fn2 = point_in_triangle(b, c, d)
+
+ fn = lambda x, y: fn1(x, y) or fn2(x, y)
+ return fn
+
+def highlight_img(img, color=(255, 255, 255), alpha=0.30):
+ """
+ Add highlighting to an image
+ """
+
+ blend_img = img + alpha * (np.array(color, dtype=np.uint8) - img)
+ blend_img = blend_img.clip(0, 255).astype(np.uint8)
+ img[:, :, :] = blend_img
diff --git a/gym-minigrid/gym_minigrid/roomgrid.py b/gym-minigrid/gym_minigrid/roomgrid.py
new file mode 100644
index 0000000000000000000000000000000000000000..de64ac9d9a8b5fe63552511ef791c359ab237fb5
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/roomgrid.py
@@ -0,0 +1,395 @@
+from .minigrid import *
+
+def reject_next_to(env, pos):
+ """
+ Function to filter out object positions that are right next to
+ the agent's starting point
+ """
+
+ sx, sy = env.agent_pos
+ x, y = pos
+ d = abs(sx - x) + abs(sy - y)
+ return d < 2
+
+class Room:
+ def __init__(
+ self,
+ top,
+ size
+ ):
+ # Top-left corner and size (tuples)
+ self.top = top
+ self.size = size
+
+ # List of door objects and door positions
+ # Order of the doors is right, down, left, up
+ self.doors = [None] * 4
+ self.door_pos = [None] * 4
+
+ # List of rooms adjacent to this one
+ # Order of the neighbors is right, down, left, up
+ self.neighbors = [None] * 4
+
+ # Indicates if this room is behind a locked door
+ self.locked = False
+
+ # List of objects contained
+ self.objs = []
+
+ def rand_pos(self, env):
+ topX, topY = self.top
+ sizeX, sizeY = self.size
+ return env._randPos(
+ topX + 1, topX + sizeX - 1,
+ topY + 1, topY + sizeY - 1
+ )
+
+ def pos_inside(self, x, y):
+ """
+ Check if a position is within the bounds of this room
+ """
+
+ topX, topY = self.top
+ sizeX, sizeY = self.size
+
+ if x < topX or y < topY:
+ return False
+
+ if x >= topX + sizeX or y >= topY + sizeY:
+ return False
+
+ return True
+
+class RoomGrid(MiniGridEnv):
+ """
+ Environment with multiple rooms and random objects.
+ This is meant to serve as a base class for other environments.
+ """
+
+ def __init__(
+ self,
+ room_size=7,
+ num_rows=3,
+ num_cols=3,
+ max_steps=100,
+ seed=0
+ ):
+ assert room_size > 0
+ assert room_size >= 3
+ assert num_rows > 0
+ assert num_cols > 0
+ self.room_size = room_size
+ self.num_rows = num_rows
+ self.num_cols = num_cols
+
+ height = (room_size - 1) * num_rows + 1
+ width = (room_size - 1) * num_cols + 1
+
+ # By default, this environment has no mission
+ self.mission = ''
+
+ super().__init__(
+ width=width,
+ height=height,
+ max_steps=max_steps,
+ see_through_walls=False,
+ seed=seed
+ )
+
+ def room_from_pos(self, x, y):
+ """Get the room a given position maps to"""
+
+ assert x >= 0
+ assert y >= 0
+
+ i = x // (self.room_size-1)
+ j = y // (self.room_size-1)
+
+ assert i < self.num_cols
+ assert j < self.num_rows
+
+ return self.room_grid[j][i]
+
+ def get_room(self, i, j):
+ assert i < self.num_cols
+ assert j < self.num_rows
+ return self.room_grid[j][i]
+
+ def _gen_grid(self, width, height):
+ # Create the grid
+ self.grid = Grid(width, height, self.nb_obj_dims)
+
+ self.room_grid = []
+
+ # For each row of rooms
+ for j in range(0, self.num_rows):
+ row = []
+
+ # For each column of rooms
+ for i in range(0, self.num_cols):
+ room = Room(
+ (i * (self.room_size-1), j * (self.room_size-1)),
+ (self.room_size, self.room_size)
+ )
+ row.append(room)
+
+ # Generate the walls for this room
+ self.grid.wall_rect(*room.top, *room.size)
+
+ self.room_grid.append(row)
+
+ # For each row of rooms
+ for j in range(0, self.num_rows):
+ # For each column of rooms
+ for i in range(0, self.num_cols):
+ room = self.room_grid[j][i]
+
+ x_l, y_l = (room.top[0] + 1, room.top[1] + 1)
+ x_m, y_m = (room.top[0] + room.size[0] - 1, room.top[1] + room.size[1] - 1)
+
+ # Door positions, order is right, down, left, up
+ if i < self.num_cols - 1:
+ room.neighbors[0] = self.room_grid[j][i+1]
+ room.door_pos[0] = (x_m, self._rand_int(y_l, y_m))
+ if j < self.num_rows - 1:
+ room.neighbors[1] = self.room_grid[j+1][i]
+ room.door_pos[1] = (self._rand_int(x_l, x_m), y_m)
+ if i > 0:
+ room.neighbors[2] = self.room_grid[j][i-1]
+ room.door_pos[2] = room.neighbors[2].door_pos[0]
+ if j > 0:
+ room.neighbors[3] = self.room_grid[j-1][i]
+ room.door_pos[3] = room.neighbors[3].door_pos[1]
+
+ # The agent starts in the middle, facing right
+ self.agent_pos = (
+ (self.num_cols // 2) * (self.room_size-1) + (self.room_size // 2),
+ (self.num_rows // 2) * (self.room_size-1) + (self.room_size // 2)
+ )
+ self.agent_dir = 0
+
+ def place_in_room(self, i, j, obj):
+ """
+ Add an existing object to room (i, j)
+ """
+
+ room = self.get_room(i, j)
+
+ pos = self.place_obj(
+ obj,
+ room.top,
+ room.size,
+ reject_fn=reject_next_to,
+ max_tries=1000
+ )
+
+ room.objs.append(obj)
+
+ return obj, pos
+
+ def add_object(self, i, j, kind=None, color=None):
+ """
+ Add a new object to room (i, j)
+ """
+ if kind == None:
+ kind = self._rand_elem(['key', 'ball', 'box'])
+
+ if color == None:
+ color = self._rand_color()
+
+ assert kind in ['key', 'ball', 'box']
+ if kind == 'key':
+ obj = Key(color)
+ elif kind == 'ball':
+ obj = Ball(color)
+ elif kind == 'box':
+ obj = Box(color)
+
+ return self.place_in_room(i, j, obj)
+
+ def add_door(self, i, j, door_idx=None, color=None, locked=None):
+ """
+ Add a door to a room, connecting it to a neighbor
+ """
+
+ room = self.get_room(i, j)
+
+ if door_idx == None:
+ # Need to make sure that there is a neighbor along this wall
+ # and that there is not already a door
+ while True:
+ door_idx = self._rand_int(0, 4)
+ if room.neighbors[door_idx] and room.doors[door_idx] is None:
+ break
+
+ if color == None:
+ color = self._rand_color()
+
+ if locked is None:
+ locked = self._rand_bool()
+
+ assert room.doors[door_idx] is None, "door already exists"
+
+ room.locked = locked
+ door = Door(color, is_locked=locked)
+
+ pos = room.door_pos[door_idx]
+ self.grid.set(*pos, door)
+ door.cur_pos = pos
+
+ neighbor = room.neighbors[door_idx]
+ room.doors[door_idx] = door
+ neighbor.doors[(door_idx+2) % 4] = door
+
+ return door, pos
+
+ def remove_wall(self, i, j, wall_idx):
+ """
+ Remove a wall between two rooms
+ """
+
+ room = self.get_room(i, j)
+
+ assert wall_idx >= 0 and wall_idx < 4
+ assert room.doors[wall_idx] is None, "door exists on this wall"
+ assert room.neighbors[wall_idx], "invalid wall"
+
+ neighbor = room.neighbors[wall_idx]
+
+ tx, ty = room.top
+ w, h = room.size
+
+ # Ordering of walls is right, down, left, up
+ if wall_idx == 0:
+ for i in range(1, h - 1):
+ self.grid.set(tx + w - 1, ty + i, None)
+ elif wall_idx == 1:
+ for i in range(1, w - 1):
+ self.grid.set(tx + i, ty + h - 1, None)
+ elif wall_idx == 2:
+ for i in range(1, h - 1):
+ self.grid.set(tx, ty + i, None)
+ elif wall_idx == 3:
+ for i in range(1, w - 1):
+ self.grid.set(tx + i, ty, None)
+ else:
+ assert False, "invalid wall index"
+
+ # Mark the rooms as connected
+ room.doors[wall_idx] = True
+ neighbor.doors[(wall_idx+2) % 4] = True
+
+ def place_agent(self, i=None, j=None, rand_dir=True):
+ """
+ Place the agent in a room
+ """
+
+ if i == None:
+ i = self._rand_int(0, self.num_cols)
+ if j == None:
+ j = self._rand_int(0, self.num_rows)
+
+ room = self.room_grid[j][i]
+
+ # Find a position that is not right in front of an object
+ while True:
+ super().place_agent(room.top, room.size, rand_dir, max_tries=1000)
+ front_cell = self.grid.get(*self.front_pos)
+            if front_cell is None or front_cell.type == 'wall':
+ break
+
+ return self.agent_pos
+
+ def connect_all(self, door_colors=COLOR_NAMES, max_itrs=5000):
+ """
+ Make sure that all rooms are reachable by the agent from its
+ starting position
+ """
+
+ start_room = self.room_from_pos(*self.agent_pos)
+
+ added_doors = []
+
+ def find_reach():
+ reach = set()
+ stack = [start_room]
+ while len(stack) > 0:
+ room = stack.pop()
+ if room in reach:
+ continue
+ reach.add(room)
+ for i in range(0, 4):
+ if room.doors[i]:
+ stack.append(room.neighbors[i])
+ return reach
+
+ num_itrs = 0
+
+ while True:
+            # This is to handle rare situations where random sampling produces
+            # a level that cannot be connected, resulting in an infinite loop
+ if num_itrs > max_itrs:
+ raise RecursionError('connect_all failed')
+ num_itrs += 1
+
+ # If all rooms are reachable, stop
+ reach = find_reach()
+ if len(reach) == self.num_rows * self.num_cols:
+ break
+
+ # Pick a random room and door position
+ i = self._rand_int(0, self.num_cols)
+ j = self._rand_int(0, self.num_rows)
+ k = self._rand_int(0, 4)
+ room = self.get_room(i, j)
+
+ # If there is already a door there, skip
+ if not room.door_pos[k] or room.doors[k]:
+ continue
+
+ if room.locked or room.neighbors[k].locked:
+ continue
+
+ color = self._rand_elem(door_colors)
+ door, _ = self.add_door(i, j, k, color, False)
+ added_doors.append(door)
+
+ return added_doors
+
+ def add_distractors(self, i=None, j=None, num_distractors=10, all_unique=True):
+ """
+ Add random objects that can potentially distract/confuse the agent.
+ """
+
+ # Collect a list of existing objects
+ objs = []
+ for row in self.room_grid:
+ for room in row:
+ for obj in room.objs:
+ objs.append((obj.type, obj.color))
+
+ # List of distractors added
+ dists = []
+
+ while len(dists) < num_distractors:
+ color = self._rand_elem(COLOR_NAMES)
+ type = self._rand_elem(['key', 'ball', 'box'])
+ obj = (type, color)
+
+ if all_unique and obj in objs:
+ continue
+
+ # Add the object to a random room if no room specified
+ room_i = i
+ room_j = j
+ if room_i == None:
+ room_i = self._rand_int(0, self.num_cols)
+ if room_j == None:
+ room_j = self._rand_int(0, self.num_rows)
+
+ dist, pos = self.add_object(room_i, room_j, *obj)
+
+ objs.append(obj)
+ dists.append(dist)
+
+ return dists
diff --git a/gym-minigrid/gym_minigrid/social_ai_envs/__init__.py b/gym-minigrid/gym_minigrid/social_ai_envs/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..d257e316295bcf6550d6b89d9e997f744731ea31
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/social_ai_envs/__init__.py
@@ -0,0 +1,31 @@
+from gym_minigrid.social_ai_envs.informationseekingenv import *
+
+from gym_minigrid.social_ai_envs.leverdoorenv import *
+from gym_minigrid.social_ai_envs.marblepassenv import *
+from gym_minigrid.social_ai_envs.marblepushenv import *
+from gym_minigrid.social_ai_envs.objectscollaborationenv import *
+
+from gym_minigrid.social_ai_envs.applestealingenv import *
+
+# from gym_minigrid.social_ai_envs.othersperceptioninferenceparamenv import *
+# from gym_minigrid.social_ai_envs.informationseekingparamenv import *
+# from gym_minigrid.social_ai_envs.collaborationparamenv import *
+
+from gym_minigrid.social_ai_envs.socialaiparamenv import *
+
+# from gym_minigrid.social_ai_envs.testsocialaienvs import *
+
+from gym_minigrid.social_ai_envs.case_studies_envs.casestudiesenvs import *
+
+# from gym_minigrid.social_ai_envs.case_studies_envs.pointingcasestudyenvs import *
+# from gym_minigrid.social_ai_envs.case_studies_envs.langcolorcasestudyenvs import *
+# from gym_minigrid.social_ai_envs.case_studies_envs.langfeedbackcasestudyenvs import *
+from gym_minigrid.social_ai_envs.case_studies_envs.informationseekingcasestudyenvs import *
+
+from gym_minigrid.social_ai_envs.case_studies_envs.imitationcasestudyenvs import *
+
+from gym_minigrid.social_ai_envs.case_studies_envs.formatscasestudyenvs import *
+
+from gym_minigrid.social_ai_envs.case_studies_envs.applestealingcasestudiesenvs import *
+
+from gym_minigrid.social_ai_envs.case_studies_envs.LLMcasestudyenvs import *
diff --git a/gym-minigrid/gym_minigrid/social_ai_envs/applestealingenv.py b/gym-minigrid/gym_minigrid/social_ai_envs/applestealingenv.py
new file mode 100644
index 0000000000000000000000000000000000000000..efeb3a74dab2f7988a36e62be3358e500a282d9c
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/social_ai_envs/applestealingenv.py
@@ -0,0 +1,405 @@
+import time
+
+import numpy as np
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+from gym_minigrid.social_ai_envs.socialaigrammar import SocialAIGrammar, SocialAIActions, SocialAIActionSpace
+import time
+from collections import deque
+
+
+class AppleGuardingNPC(NPC):
+ """
+ A simple NPC that knows who is telling the truth
+ """
+ def __init__(self, color, name, env):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+        self.npc_dir = np.random.randint(0, 4)  # NPC starts facing a random direction
+ self.npc_type = 1 # this will be put into the encoding
+
+ self.was_introduced_to = False
+
+ self.ate_an_apple = False
+ self.demo_over = False
+ self.demo_over_and_position_safe = False
+ self.apple_unlocked_for_agent = False
+
+
+ self.target_obj = self.env.apple
+
+ self.waiting_counter = 0
+ self.wait_steps = 4
+
+ assert self.env.grammar.contains_utterance(self.introduction_statement)
+
+ def draw_npc_face(self, c):
+ assert self.npc_type == 1
+
+ assert all(COLORS[self.color] == c)
+
+ shapes = []
+ shapes_colors = []
+
+ # Draw eyes
+ shapes.append(point_in_circle(cx=0.70, cy=0.50, r=0.10))
+ shapes_colors.append(c)
+
+ shapes.append(point_in_circle(cx=0.30, cy=0.50, r=0.10))
+ shapes_colors.append(c)
+
+ # Draw mouth
+ shapes.append(point_in_rect(0.20, 0.80, 0.72, 0.81))
+ shapes_colors.append(c)
+
+ # Draw eyebrows
+ shapes.append(point_in_triangle((0.15, 0.20),
+ (0.85, 0.20),
+ (0.50, 0.35)))
+ shapes_colors.append(c)
+
+ shapes.append(point_in_triangle((0.30, 0.20),
+ (0.70, 0.20),
+ (0.5, 0.35)))
+ shapes_colors.append((0,0,0))
+
+ return shapes, shapes_colors
+
+ def can_see_pos(self, obj_pos):
+
+ # is the npc seen by the agent
+ npc_view_obj = self.relative_coords(*obj_pos)
+ grid, vis_mask = self.gen_obs_grid()
+
+ if npc_view_obj is not None:
+ # in the agent's field of view
+ ag_view_npc_x, ag_view_npc_y = npc_view_obj
+
+ # is it occluded
+ object_observed = vis_mask[ag_view_npc_x, ag_view_npc_y]
+ else:
+ object_observed = False
+
+ return object_observed, grid, vis_mask
+
+ def step(self, utterance):
+ reply, info = super().step()
+
+ if self.env.hidden_npc:
+ return reply, info
+
+ # reply, action = self.handle_introduction(utterance) # revert this?
+ reply, action = None, None
+
+ NPC_movement = self.env.parameters.get("NPC_movement", "Rotating")
+
+ if self.waiting_counter >= self.wait_steps:
+ self.waiting_counter = 0
+
+ if NPC_movement == "Rotating":
+ action = random.choice([self.rotate_left, self.rotate_right])
+
+ elif NPC_movement == "Walking":
+ action = random.choice([
+ random.choice([
+ self.rotate_left, # 25 %
+ self.rotate_right # 25 %
+ ]),
+ self.go_forward # 50%
+ ])
+ else:
+                raise ValueError(f"Undefined movement option {NPC_movement}")
+
+ else:
+ self.waiting_counter += 1
+
+ if action is not None:
+ action()
+
+ info = {
+ "prim_action": action.__name__ if action is not None else "no_op",
+ "utterance": reply or "no_op",
+ "was_introduced_to": self.was_introduced_to
+ }
+
+ assert (reply or "no_op") in self.list_of_possible_utterances
+
+ return reply, info
+
+
+class AppleStealingEnv(MultiModalMiniGridEnv):
+ """
+ Environment in which the agent is instructed to go to a given object
+ named using an English text string
+ """
+
+ def __init__(
+ self,
+ size=10,
+ diminished_reward=True,
+ step_penalty=False,
+ knowledgeable=False,
+ max_steps=80,
+ hidden_npc=False,
+ switch_no_light=False,
+ reward_diminish_factor=0.1,
+ see_through_walls=False,
+ egocentric_observation=True,
+ tagged_apple=False,
+ ):
+ assert size >= 5
+ self.empty_symbol = "NA \n"
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+ self.knowledgeable = knowledgeable
+ self.hidden_npc = hidden_npc
+ self.hear_yourself = False
+ self.switch_no_light = switch_no_light
+
+ self.grammar = SocialAIGrammar()
+
+ self.init_done = False
+ # parameters - to be set in reset
+ self.parameters = None
+
+ # encoding size should be 5
+ self.add_npc_direction = True
+ self.add_npc_point_direction = True
+ self.add_npc_last_prim_action = True
+
+ self.reward_diminish_factor = reward_diminish_factor
+
+ self.egocentric_observation = egocentric_observation
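+        # encoding per cell: (type, color, state) + optional absolute (x, y) when the
+        # observation is not egocentric + optional NPC gaze direction, pointing direction
+        # and last primitive action channels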
+ self.encoding_size = 3 + 2*bool(not self.egocentric_observation) + bool(self.add_npc_direction) + bool(self.add_npc_point_direction) + bool(self.add_npc_last_prim_action)
+
+ super().__init__(
+ grid_size=size,
+ max_steps=max_steps,
+ # Set this to True for maximum speed
+ see_through_walls=see_through_walls,
+ actions=SocialAIActions, # primitive actions
+ action_space=SocialAIActionSpace,
+ add_npc_direction=self.add_npc_direction,
+ add_npc_point_direction=self.add_npc_point_direction,
+ add_npc_last_prim_action=self.add_npc_last_prim_action,
+ reward_diminish_factor=self.reward_diminish_factor,
+ )
+ self.all_npc_utterance_actions = AppleGuardingNPC.get_list_of_possible_utterances()
+ self.prim_actions_dict = SocialAINPCActionsDict
+
+ self.tagged_apple = tagged_apple
+
+ def _gen_grid(self, width_, height_):
+ # Create the grid
+ self.grid = Grid(width_, height_, nb_obj_dims=self.encoding_size)
+
+ # new
+ self.current_width = self._rand_int(7, width_+1)
+ self.current_height = self._rand_int(7, height_+1)
+ # print("Room size: {}x{}".format(self.current_width, self.current_height))
+
+ self.wall_x = self.current_width-1
+ self.wall_y = self.current_height-1
+
+ self.version = self.parameters["Version"] if self.parameters else "Asocial"
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, self.current_width, self.current_height)
+
+ self.add_obstacles()
+
+ # apple
+ self.apple_pos = (self.current_width, self.current_height)
+
+ # find the position for the apple/box/generator_platform
+
+ self.apple_current_pos = self.find_loc(size=self.apple_pos, reject_agent_pos=True, reject_taken_pos=True)
+ assert all(self.apple_current_pos < np.array([self.current_width-1, self.current_height-1]))
+
+ self.apple = Apple()
+ self.put_obj_np(self.apple, self.apple_current_pos)
+
+ # NPC
+ color = self._rand_elem(COLOR_NAMES)
+ self.caretaker = AppleGuardingNPC(color, "Peer", self)
+
+ if self.version == "Social":
+ self.place_obj(self.caretaker, size=(self.current_width, self.current_height))
+
+ # Randomize the agent's start position and orientation
+ self.place_agent(size=(self.current_width, self.current_height))
+
+ # Generate the mission string
+ self.mission = 'undefined'
+
+ # Dummy beginning string
+ # self.beginning_string = "This is what you hear. \n"
+ self.beginning_string = "Conversation: \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ # used for rendering
+ self.full_conversation = self.utterance
+ self.outcome_info = None
+
+
+
+ def reset(
+ self, *args, **kwargs
+ ):
+ # This env must be used inside the parametric env
+ if not kwargs:
+ # The only place when kwargs can empty is during the class construction
+ # reset should be called again before using the env (paramenv does it in its constructor)
+ assert self.parameters is None
+ assert not self.init_done
+ self.init_done = True
+
+ obs = super().reset()
+ return obs
+
+ else:
+ assert self.init_done
+
+ self.parameters = dict(kwargs)
+
+ assert self.parameters is not None
+ assert len(self.parameters) > 0
+
+ obs = super().reset()
+
+ self.agent_ate_the_apple = False
+
+ return obs
+
+ def step(self, action):
+
+ success = False
+
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ apple_had_been_eaten = self.apple.eaten
+ if self.version == "Social":
+ agent_seen_by_npc, _, _ = self.caretaker.can_see_pos(self.agent_pos)
+ else:
+ agent_seen_by_npc = False
+
+ # primitive actions
+ _, reward, done, info = super().step(p_action)
+
+ if not self.agent_ate_the_apple:
+ self.agent_ate_the_apple = self.apple.eaten and not apple_had_been_eaten
+
+ # utterances
+ agent_spoke = not all(np.isnan(utterance_action))
+ if agent_spoke:
+ utterance = self.grammar.construct_utterance(utterance_action)
+
+ if self.hear_yourself:
+ self.utterance += "YOU: {} \n".format(utterance)
+ self.full_conversation += "YOU: {} \n".format(utterance)
+ else:
+ utterance = None
+
+ if self.version == "Social":
+ reply, npc_info = self.caretaker.step(utterance)
+ if reply:
+ self.utterance += "{}: {} \n".format(self.caretaker.name, reply)
+ self.full_conversation += "{}: {} \n".format(self.caretaker.name, reply)
+
+ else:
+ npc_info = {
+ "prim_action": "no_op",
+ "utterance": "no_op",
+ "was_introduced_to": False,
+ }
+
+ # aftermath
+ if p_action == self.actions.done:
+ done = True
+
+ elif self.agent_ate_the_apple:
+ # check that it is the agent who ate it
+ assert self.actions(p_action) == self.actions.toggle
+ assert self.get_cell(*self.front_pos) == self.apple
+
+ if agent_seen_by_npc:
+ reward = 0
+ success = False
+
+ else:
+ reward = self._reward()
+ success = True
+
+ done = True
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ # update obs with NPC movement
+ obs = self.gen_obs(full_obs=self.full_obs)
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ # for rendering
+ if done:
+ if reward > 0:
+ self.outcome_info = "SUCCESS: agent got {} reward \n".format(np.round(reward, 1))
+ else:
+ self.outcome_info = "FAILURE: agent got {} reward \n".format(reward)
+
+ if self.version == "Social":
+ # is the npc seen by the agent
+ ag_view_npc = self.relative_coords(*self.caretaker.cur_pos)
+
+ if ag_view_npc is not None:
+ # in the agent's field of view
+ ag_view_npc_x, ag_view_npc_y = ag_view_npc
+
+ n_dims = obs['image'].shape[-1]
+ npc_encoding = self.caretaker.encode(n_dims)
+
+ # is it occluded
+ npc_observed = all(obs['image'][ag_view_npc_x, ag_view_npc_y] == npc_encoding)
+ else:
+ npc_observed = False
+
+ else:
+ npc_observed = False
+
+ info = {**info, **{"NPC_"+k: v for k, v in npc_info.items()}}
+
+ info["NPC_observed"] = npc_observed
+ info["success"] = success
+ assert success == (reward > 0)
+
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, show_dialogue=False, **kwargs)
+ return obs
+
+
+register(
+ id='SocialAI-AppleStealingEnv-v0',
+ entry_point='gym_minigrid.social_ai_envs:AppleStealingEnv'
+)
diff --git a/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/LLMcasestudyenvs.py b/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/LLMcasestudyenvs.py
new file mode 100644
index 0000000000000000000000000000000000000000..ad4926b6b600b49b3b7ef2858aa5315cdca27519
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/LLMcasestudyenvs.py
@@ -0,0 +1,176 @@
+from gym_minigrid.social_ai_envs.socialaiparamenv import SocialAIParamEnv
+from gym_minigrid.parametric_env import *
+from gym_minigrid.register import register
+
+'''
+These are the environments for case studies 1-3: Pointing, Language (Color and Feedback), and Joint Attention.
+
+Intro sequence is always eye contact (E) in both the training and testing envs
+
+The Training environments have the 5 problems and Marbles in the Asocial version (no distractor, no peer)
+registered training envs : cues x {joint attention, no}
+
+The Testing environments are always one problem per env - i.e. no testing on two problems at the same time
+registered testing envs : cues x problems x {social, asocial} x {joint attention, no}
+'''
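+
+# Minimal usage sketch (hypothetical; assumes a SocialAIParamEnv subclass can be constructed with no arguments):
+#   env = ColorBoxesLLMCSParamEnv()
+#   tree = env.construct_tree()
+#   params = tree.sample_env_params()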
+
+PROBLEMS = ["Boxes", "Switches", "Generators", "Levers", "Doors", "Marble"]
+CUES = ["Pointing", "LangFeedback", "LangColor"]
+INTRO_SEC = ["E"]
+# INTRO_SEC = ["N", "E", "A", "AE"]
+
+
+class AsocialBoxInformationSeekingParamEnv(SocialAIParamEnv):
+ '''
+ Env with all problems in the asocial version -> just for testing
+ '''
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("No", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+ tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+ tree.add_node("Pointing", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ boxes_nd = tree.add_node("Boxes", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ return tree
+
+
+class ColorBoxesLLMCSParamEnv(SocialAIParamEnv):
+
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("No", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+ # tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+ # tree.add_node("Pointing", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ boxes_nd = tree.add_node("Boxes", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ return tree
+
+
+class ColorLLMCSParamEnv(SocialAIParamEnv):
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("No", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+ # tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+ # tree.add_node("Pointing", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ # boxes_nd = tree.add_node("Boxes", parent=problem_nd, type="value")
+ # version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ # tree.add_node("2", parent=version_nd, type="value")
+ # peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ # tree.add_node("Y", parent=peer_nd, type="value")
+
+ boxes_nd = tree.add_node("Boxes", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ switches_nd = tree.add_node("Switches", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=switches_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=switches_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ generators_nd = tree.add_node("Generators", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=generators_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=generators_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ levers_nd = tree.add_node("Levers", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=levers_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=levers_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ doors_nd = tree.add_node("Doors", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=doors_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=doors_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ marble_nd = tree.add_node("Marble", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=marble_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=marble_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ return tree
+
+# register dummy env
+register(
+ id='SocialAI-AsocialBoxInformationSeekingParamEnv-v1',
+ entry_point='gym_minigrid.social_ai_envs:AsocialBoxInformationSeekingParamEnv',
+)
+
+
+register(
+ id='SocialAI-ColorBoxesLLMCSParamEnv-v1',
+ entry_point='gym_minigrid.social_ai_envs:ColorBoxesLLMCSParamEnv',
+)
+
+register(
+ id='SocialAI-ColorLLMCSParamEnv-v1',
+ entry_point='gym_minigrid.social_ai_envs:ColorLLMCSParamEnv',
+)
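
As a quick sanity check, the three environments registered above can be created through gym's standard registry. This is a minimal usage sketch, assuming `gym_minigrid`'s `register` helper wraps gym's registration and that importing the package triggers the `register()` calls above:

```python
import gym
import gym_minigrid  # noqa: F401 -- importing the package runs the register() calls above

# ID taken verbatim from the register() call for ColorBoxesLLMCSParamEnv
env = gym.make('SocialAI-ColorBoxesLLMCSParamEnv-v1')

obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random policy, only to step through one episode
    obs, reward, done, info = env.step(action)
env.close()
```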
diff --git a/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/applestealingcasestudiesenvs.py b/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/applestealingcasestudiesenvs.py
new file mode 100644
index 0000000000000000000000000000000000000000..f03e6052301ca5322a766e04c3c5cf093e773c44
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/applestealingcasestudiesenvs.py
@@ -0,0 +1,87 @@
+from gym_minigrid.social_ai_envs.socialaiparamenv import SocialAIParamEnv
+from gym_minigrid.parametric_env import *
+from gym_minigrid.register import register
+
+import inspect, importlib
+
+# used for the automatic registration of environments
+defined_classes = [name for name, _ in inspect.getmembers(importlib.import_module(__name__), inspect.isclass)]
+
+class AppleStealingParamEnv(SocialAIParamEnv):
+
+ def __init__(self, obstacles, asocial, walk, **kwargs):
+
+ self.asocial = asocial
+ self.obstacles = obstacles
+ self.walk = walk
+
+ super(AppleStealingParamEnv, self).__init__(**kwargs)
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Collaboration
+ collab_nd = tree.add_node("AppleStealing", parent=env_type_nd, type="value")
+
+ # colab_type_nd = tree.add_node("Problem", parent=collab_nd, type="param")
+ # tree.add_node("AppleStealing", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Version", parent=collab_nd, type="param")
+ if self.asocial:
+ tree.add_node("Asocial", parent=role_nd, type="value")
+ else:
+ social_nd = tree.add_node("Social", parent=role_nd, type="value")
+
+ role_nd = tree.add_node("NPC_movement", parent=social_nd, type="param")
+ if self.walk:
+ tree.add_node("Walking", parent=role_nd, type="value")
+ else:
+ tree.add_node("Rotating", parent=role_nd, type="value")
+
+ obstacles_nd = tree.add_node("Obstacles", parent=collab_nd, type="param")
+
+ if self.obstacles not in ["No", "A_bit", "Medium", "A_lot"]:
+ raise ValueError("Undefined obstacle amount.")
+
+ tree.add_node(self.obstacles, parent=obstacles_nd, type="value")
+
+ return tree
+
+
+# automatic registration of environments
+defined_classes_ = [name for name, _ in inspect.getmembers(importlib.import_module(__name__), inspect.isclass)]
+
+envs = list(set(defined_classes_) - set(defined_classes))
+assert all([e.endswith("Env") for e in envs])
+
+
+# register envs : {asocial, social} x obstacle amount x (for social) {walking, rotating}
+for asocial in [True, False]:
+ for obst in ["No", "A_bit", "Medium", "A_lot"]:
+ if asocial:
+ env_name = f'{"Asocial" if asocial else ""}AppleStealingObst_{obst}ParamEnv'
+
+ register(
+ id='SocialAI-{}-v1'.format(env_name),
+ entry_point='gym_minigrid.social_ai_envs:AppleStealingParamEnv',
+ kwargs={
+ 'asocial': asocial,
+ 'obstacles': obst,
+ 'walk': False,
+ }
+ )
+
+ else:
+ for walk in [True, False]:
+ env_name = f'{"Asocial" if asocial else ""}AppleStealing{"Walk" if walk and not asocial else ""}Obst_{obst}ParamEnv'
+
+ register(
+ id='SocialAI-{}-v1'.format(env_name),
+ entry_point='gym_minigrid.social_ai_envs:AppleStealingParamEnv',
+ kwargs={
+ 'asocial': asocial,
+ 'obstacles': obst,
+ 'walk': walk,
+ }
+ )
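
For reference, the two loops above register one ID per combination, following directly from the f-strings in the code (the obstacle amount `<obst>` ranges over `No`, `A_bit`, `Medium`, `A_lot`):

```
SocialAI-AsocialAppleStealingObst_<obst>ParamEnv-v1      # asocial, no NPC
SocialAI-AppleStealingObst_<obst>ParamEnv-v1             # social, rotating NPC (walk=False)
SocialAI-AppleStealingWalkObst_<obst>ParamEnv-v1         # social, walking NPC (walk=True)
```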
diff --git a/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/casestudiesenvs.py b/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/casestudiesenvs.py
new file mode 100644
index 0000000000000000000000000000000000000000..c4342b269d95abdd83e8ab676f469da40aad5b4a
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/casestudiesenvs.py
@@ -0,0 +1,1950 @@
+from gym_minigrid.social_ai_envs.socialaiparamenv import SocialAIParamEnv
+from gym_minigrid.parametric_env import *
+from gym_minigrid.register import register
+
+import inspect, importlib
+
+# used for the automatic registration of environments
+defined_classes = [name for name, _ in inspect.getmembers(importlib.import_module(__name__), inspect.isclass)]
+
+
+# # Pointing case study (table 1)
+#
+# # training
+# class EPointingInformationSeekingParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# # tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+# # tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+# tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Boxes", parent=problem_nd, type="value")
+# tree.add_node("Switches", parent=problem_nd, type="value")
+# tree.add_node("Marbles", parent=problem_nd, type="value")
+# tree.add_node("Generators", parent=problem_nd, type="value")
+# tree.add_node("Levers", parent=problem_nd, type="value")
+# tree.add_node("Doors", parent=problem_nd, type="value")
+#
+# return tree
+#
+# # testing
+# class EPointingBoxesInformationSeekingParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Boxes", parent=problem_nd, type="value")
+#
+# return tree
+#
+# class EPointingSwitchesInformationSeekingParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Switches", parent=problem_nd, type="value")
+#
+# return tree
+#
+# class EPointingMarbleInformationSeekingParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Marble", parent=problem_nd, type="value")
+#
+# return tree
+#
+# class EPointingGeneratorsInformationSeekingParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Generators", parent=problem_nd, type="value")
+#
+# return tree
+#
+# class EPointingLeversInformationSeekingParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Levers", parent=problem_nd, type="value")
+#
+# return tree
+#
+# class EPointingDoorsInformationSeekingParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Doors", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+#
+# # Lang Color case study (table 1)
+# # training
+# class EPointingInformationSeekingParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# # tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+# # tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+# tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Boxes", parent=problem_nd, type="value")
+# tree.add_node("Switches", parent=problem_nd, type="value")
+# tree.add_node("Marbles", parent=problem_nd, type="value")
+# tree.add_node("Generators", parent=problem_nd, type="value")
+# tree.add_node("Levers", parent=problem_nd, type="value")
+# tree.add_node("Doors", parent=problem_nd, type="value")
+#
+# return tree
+#
+# # testing
+# class EPointingBoxesInformationSeekingParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Boxes", parent=problem_nd, type="value")
+#
+# return tree
+#
+# class EPointingSwitchesInformationSeekingParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Switches", parent=problem_nd, type="value")
+#
+# return tree
+#
+# class EPointingMarbleInformationSeekingParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Marble", parent=problem_nd, type="value")
+#
+# return tree
+#
+# class EPointingGeneratorsInformationSeekingParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Generators", parent=problem_nd, type="value")
+#
+# return tree
+#
+# class EPointingLeversInformationSeekingParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Levers", parent=problem_nd, type="value")
+#
+# return tree
+#
+# class EPointingDoorsInformationSeekingParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Doors", parent=problem_nd, type="value")
+#
+# return tree
+
+
+
+
+# grid searches envs
+# Doors
+# class ELanguageColorDoorsInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Color", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Doors", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# class ELanguageFeedbackDoorsInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Feedback", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Doors", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+#
+#
+#
+# # Levers
+# class ELanguageColorLeversInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Color", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Levers", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# class ELanguageFeedbackLeversInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Feedback", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Levers", parent=problem_nd, type="value")
+#
+# return tree
+# # Switches
+# class ELanguageColorSwitchesInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Color", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Switches", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# class ELanguageFeedbackSwitchesInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Feedback", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Switches", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# # Marble
+# class ELanguageColorMarbleInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Color", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Marble", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# class ELanguageFeedbackMarbleInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Feedback", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Marble", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+#
+# # Generators
+# class ELanguageColorGeneratorsInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Color", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Generators", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# class ELanguageFeedbackGeneratorsInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Feedback", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Generators", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+#
+#
+# class CuesGridSearchParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+# tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+# tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Boxes", parent=problem_nd, type="value")
+# tree.add_node("Switches", parent=problem_nd, type="value")
+# tree.add_node("Marbles", parent=problem_nd, type="value")
+# tree.add_node("Generators", parent=problem_nd, type="value")
+# tree.add_node("Levers", parent=problem_nd, type="value")
+# tree.add_node("Doors", parent=problem_nd, type="value")
+#
+# return tree
+#
+# class EmulationGridSearchParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Emulation", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Boxes", parent=problem_nd, type="value")
+# tree.add_node("Switches", parent=problem_nd, type="value")
+# tree.add_node("Marbles", parent=problem_nd, type="value")
+# tree.add_node("Generators", parent=problem_nd, type="value")
+# tree.add_node("Levers", parent=problem_nd, type="value")
+# tree.add_node("Doors", parent=problem_nd, type="value")
+#
+# return tree
+#
+# class CuesGridSearchPointingParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# # tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+# # tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+# tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Boxes", parent=problem_nd, type="value")
+# tree.add_node("Switches", parent=problem_nd, type="value")
+# tree.add_node("Marbles", parent=problem_nd, type="value")
+# tree.add_node("Generators", parent=problem_nd, type="value")
+# tree.add_node("Levers", parent=problem_nd, type="value")
+# tree.add_node("Doors", parent=problem_nd, type="value")
+#
+# return tree
+#
+# class CuesGridSearchLangColorParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+# # tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+# # tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Boxes", parent=problem_nd, type="value")
+# tree.add_node("Switches", parent=problem_nd, type="value")
+# tree.add_node("Marbles", parent=problem_nd, type="value")
+# tree.add_node("Generators", parent=problem_nd, type="value")
+# tree.add_node("Levers", parent=problem_nd, type="value")
+# tree.add_node("Doors", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# class CuesGridSearchLangFeedbackParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# # tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+# tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+# # tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Boxes", parent=problem_nd, type="value")
+# tree.add_node("Switches", parent=problem_nd, type="value")
+# tree.add_node("Marbles", parent=problem_nd, type="value")
+# tree.add_node("Generators", parent=problem_nd, type="value")
+# tree.add_node("Levers", parent=problem_nd, type="value")
+# tree.add_node("Doors", parent=problem_nd, type="value")
+#
+# return tree
+#
+# class GridSearchParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+# tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+# tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Boxes", parent=problem_nd, type="value")
+#
+# return tree
+#
+# class GridSearchPointingParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# # tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+# # tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+# tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Boxes", parent=problem_nd, type="value")
+#
+# return tree
+#
+# class GridSearchLangColorParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+# # tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+# # tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Boxes", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# class GridSearchLangFeedbackParamEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+# # tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+# tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+# # tree.add_node("Pointing", parent=cue_type_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Boxes", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# # Boxes
+# class ELanguageColorBoxesInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Color", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Boxes", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# class ELanguageFeedbackBoxesInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Feedback", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Boxes", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# class EPointingBoxesInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# pointing_nd = tree.add_node("Pointing", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Direct", parent=pointing_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Boxes", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+#
+# # Levers
+# class ELanguageColorLeversInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Color", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Levers", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# class ELanguageFeedbackLeversInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Feedback", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Levers", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# class EPointingLeversInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# pointing_nd = tree.add_node("Pointing", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Direct", parent=pointing_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Levers", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+#
+#
+# # Doors
+# class ELanguageColorDoorsInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Color", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Doors", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# class ELanguageFeedbackDoorsInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Feedback", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Doors", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# class EPointingDoorsInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# pointing_nd = tree.add_node("Pointing", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Direct", parent=pointing_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Doors", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+#
+# # Switches
+# class ELanguageColorSwitchesInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Color", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Switches", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# class ELanguageFeedbackSwitchesInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Feedback", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Switches", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# class EPointingSwitchesInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# pointing_nd = tree.add_node("Pointing", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Direct", parent=pointing_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Switches", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+#
+#
+# # Marble
+# class ELanguageColorMarbleInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Color", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Marble", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# class ELanguageFeedbackMarbleInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Feedback", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Marble", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# class EPointingMarbleInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# pointing_nd = tree.add_node("Pointing", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Direct", parent=pointing_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Marble", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# # Generators
+# class ELanguageColorGeneratorsInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Color", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Generators", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# class ELanguageFeedbackGeneratorsInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# language_grounding_nd = tree.add_node("Language_grounding", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Feedback", parent=language_grounding_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Generators", parent=problem_nd, type="value")
+#
+# return tree
+#
+#
+# class EPointingGeneratorsInformationSeekingEnv(SocialAIParamEnv):
+#
+# def construct_tree(self):
+# tree = ParameterTree()
+#
+# env_type_nd = tree.add_node("Env_type", type="param")
+#
+# # Information seeking
+# inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+#
+# prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+# tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+#
+# # scaffolding
+# scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+# scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+#
+# pointing_nd = tree.add_node("Pointing", parent=scaffolding_N_nd, type="param")
+# tree.add_node("Direct", parent=pointing_nd, type="value")
+#
+# N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+# tree.add_node("2", parent=N_bo_nd, type="value")
+#
+# problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+# tree.add_node("Generators", parent=problem_nd, type="value")
+#
+# return tree
+
+
+# Collaboration
+class LeverDoorCollaborationParamEnv(SocialAIParamEnv):
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Collaboration
+ collab_nd = tree.add_node("Collaboration", parent=env_type_nd, type="value")
+
+ colab_type_nd = tree.add_node("Problem", parent=collab_nd, type="param")
+ tree.add_node("LeverDoor", parent=colab_type_nd, type="value")
+
+ role_nd = tree.add_node("Version", parent=collab_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+
+ role_nd = tree.add_node("Role", parent=collab_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ obstacles_nd = tree.add_node("Obstacles", parent=collab_nd, type="param")
+ tree.add_node("No", parent=obstacles_nd, type="value")
+
+ return tree
+
+
+class MarblePushCollaborationParamEnv(SocialAIParamEnv):
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Collaboration
+ collab_nd = tree.add_node("Collaboration", parent=env_type_nd, type="value")
+
+ colab_type_nd = tree.add_node("Problem", parent=collab_nd, type="param")
+ tree.add_node("MarblePush", parent=colab_type_nd, type="value")
+
+ role_nd = tree.add_node("Version", parent=collab_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+
+ role_nd = tree.add_node("Role", parent=collab_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ return tree
+
+
+class MarblePassCollaborationParamEnv(SocialAIParamEnv):
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Collaboration
+ collab_nd = tree.add_node("Collaboration", parent=env_type_nd, type="value")
+
+ colab_type_nd = tree.add_node("Problem", parent=collab_nd, type="param")
+ tree.add_node("MarblePass", parent=colab_type_nd, type="value")
+
+ role_nd = tree.add_node("Version", parent=collab_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+
+ role_nd = tree.add_node("Role", parent=collab_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ return tree
+
+class MarblePassACollaborationParamEnv(SocialAIParamEnv):
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Collaboration
+ collab_nd = tree.add_node("Collaboration", parent=env_type_nd, type="value")
+
+ colab_type_nd = tree.add_node("Problem", parent=collab_nd, type="param")
+ tree.add_node("MarblePass", parent=colab_type_nd, type="value")
+
+ role_nd = tree.add_node("Version", parent=collab_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+
+ role_nd = tree.add_node("Role", parent=collab_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+
+ return tree
+
+class MarblePassBCollaborationParamEnv(SocialAIParamEnv):
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Collaboration
+ collab_nd = tree.add_node("Collaboration", parent=env_type_nd, type="value")
+
+ colab_type_nd = tree.add_node("Problem", parent=collab_nd, type="param")
+ tree.add_node("MarblePass", parent=colab_type_nd, type="value")
+
+ role_nd = tree.add_node("Version", parent=collab_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+
+ role_nd = tree.add_node("Role", parent=collab_nd, type="param")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ return tree
+
+class ObjectsCollaborationParamEnv(SocialAIParamEnv):
+ def __init__(self, problem=None, **kwargs):
+
+ self.problem = problem
+
+        super(ObjectsCollaborationParamEnv, self).__init__(**kwargs)
+
+    def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Collaboration
+ collab_nd = tree.add_node("Collaboration", parent=env_type_nd, type="value")
+
+ colab_type_nd = tree.add_node("Problem", parent=collab_nd, type="param")
+ if self.problem is None:
+ tree.add_node("Boxes", parent=colab_type_nd, type="value")
+ tree.add_node("Switches", parent=colab_type_nd, type="value")
+ tree.add_node("Generators", parent=colab_type_nd, type="value")
+ tree.add_node("Marble", parent=colab_type_nd, type="value")
+ else:
+ tree.add_node(self.problem, parent=colab_type_nd, type="value")
+
+ role_nd = tree.add_node("Version", parent=collab_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+
+ role_nd = tree.add_node("Role", parent=collab_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ return tree
+
+
+
+class RoleReversalCollaborationParamEnv(SocialAIParamEnv):
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Collaboration
+ collab_nd = tree.add_node("Collaboration", parent=env_type_nd, type="value")
+
+ colab_type_nd = tree.add_node("Problem", parent=collab_nd, type="param")
+ tree.add_node("Boxes", parent=colab_type_nd, type="value")
+ tree.add_node("Switches", parent=colab_type_nd, type="value")
+ tree.add_node("Generators", parent=colab_type_nd, type="value")
+ tree.add_node("Marble", parent=colab_type_nd, type="value")
+ tree.add_node("MarblePass", parent=colab_type_nd, type="value")
+ tree.add_node("MarblePush", parent=colab_type_nd, type="value")
+ tree.add_node("LeverDoor", parent=colab_type_nd, type="value")
+
+ role_nd = tree.add_node("Version", parent=collab_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+
+ role_nd = tree.add_node("Role", parent=collab_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ # obstacles_nd = tree.add_node("Obstacles", parent=collab_nd, type="param")
+ # tree.add_node("No", parent=obstacles_nd, type="value")
+
+ return tree
+
+class RoleReversalGroupExperimentalCollaborationParamEnv(SocialAIParamEnv):
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Collaboration
+ collab_nd = tree.add_node("Collaboration", parent=env_type_nd, type="value")
+
+ colab_type_nd = tree.add_node("Problem", parent=collab_nd, type="param")
+
+ problem_nd = tree.add_node("Boxes", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ problem_nd = tree.add_node("Switches", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ problem_nd = tree.add_node("Generators", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ problem_nd = tree.add_node("Marble", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ problem_nd = tree.add_node("MarblePass", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ # tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ problem_nd = tree.add_node("MarblePush", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ problem_nd = tree.add_node("LeverDoor", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ role_nd = tree.add_node("Version", parent=collab_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+
+
+ # obstacles_nd = tree.add_node("Obstacles", parent=collab_nd, type="param")
+ # tree.add_node("No", parent=obstacles_nd, type="value")
+
+ return tree
+
+class RoleReversalGroupControlCollaborationParamEnv(SocialAIParamEnv):
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Collaboration
+ collab_nd = tree.add_node("Collaboration", parent=env_type_nd, type="value")
+
+ colab_type_nd = tree.add_node("Problem", parent=collab_nd, type="param")
+
+ problem_nd = tree.add_node("Boxes", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ role_nd = tree.add_node("Version", parent=problem_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+
+ problem_nd = tree.add_node("Switches", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ role_nd = tree.add_node("Version", parent=problem_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+
+ problem_nd = tree.add_node("Generators", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ role_nd = tree.add_node("Version", parent=problem_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+
+ problem_nd = tree.add_node("Marble", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ role_nd = tree.add_node("Version", parent=problem_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+
+ problem_nd = tree.add_node("MarblePass", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ role_nd = tree.add_node("Version", parent=problem_nd, type="param")
+ tree.add_node("Asocial", parent=role_nd, type="value")
+
+ problem_nd = tree.add_node("MarblePush", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ role_nd = tree.add_node("Version", parent=problem_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+
+ problem_nd = tree.add_node("LeverDoor", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ role_nd = tree.add_node("Version", parent=problem_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+
+
+ # obstacles_nd = tree.add_node("Obstacles", parent=collab_nd, type="param")
+ # tree.add_node("No", parent=obstacles_nd, type="value")
+
+ return tree
+
+class AsocialMarbleCollaborationParamEnv(SocialAIParamEnv):
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Collaboration
+ collab_nd = tree.add_node("Collaboration", parent=env_type_nd, type="value")
+
+ colab_type_nd = tree.add_node("Problem", parent=collab_nd, type="param")
+
+ problem_nd = tree.add_node("MarblePass", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ # tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+
+ role_nd = tree.add_node("Version", parent=problem_nd, type="param")
+ tree.add_node("Asocial", parent=role_nd, type="value")
+
+ return tree
+
+class AsocialMarbleInformationSeekingParamEnv(SocialAIParamEnv):
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ # irrelevant because no peer: todo: remove?
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+
+ tree.add_node("No", parent=prag_fr_compl_nd, type="value")
+
+ # irrelevant because no peer, todo: remove?
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+ boxes_nd = tree.add_node("Marble", parent=problem_nd, type="value")
+
+ version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+
+ peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ return tree
+
+# automatic registration of environments
+defined_classes_ = [name for name, _ in inspect.getmembers(importlib.import_module(__name__), inspect.isclass)]
+
+envs = list(set(defined_classes_) - set(defined_classes))
+assert all([e.endswith("Env") for e in envs])
+
+for env in envs:
+ register(
+ id='SocialAI-{}-v1'.format(env),
+ entry_point='gym_minigrid.social_ai_envs:{}'.format(env)
+ )
+
+PROBLEMS = ["Boxes", "Switches", "Generators", "Marble"]
+for problem in PROBLEMS:
+ env_name = f'Objects{problem}CollaborationParamEnv'
+
+ register(
+ id='SocialAI-{}-v1'.format(env_name),
+ entry_point='gym_minigrid.social_ai_envs:ObjectsCollaborationParamEnv',
+ kwargs={
+ 'problem': problem,
+ }
+ )
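+
+# Illustrative usage only (a sketch, not executed here, and assuming `register` backs the
+# standard Gym registry): the loop above registers per-problem IDs such as
+# "SocialAI-ObjectsBoxesCollaborationParamEnv-v1", e.g.
+#
+#   import gym
+#   env = gym.make("SocialAI-ObjectsBoxesCollaborationParamEnv-v1")
+#   obs = env.reset()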
+
+
+role_reversal_test_set = [
+ "SocialAI-LeverDoorCollaborationParamEnv-v1",
+ "SocialAI-MarblePushCollaborationParamEnv-v1",
+ "SocialAI-MarblePassACollaborationParamEnv-v1",
+ "SocialAI-MarblePassBCollaborationParamEnv-v1",
+ "SocialAI-AsocialMarbleCollaborationParamEnv-v1",
+ "SocialAI-ObjectsBoxesCollaborationParamEnv-v1",
+ "SocialAI-ObjectsSwitchesCollaborationParamEnv-v1",
+ "SocialAI-ObjectsGeneratorsCollaborationParamEnv-v1",
+ "SocialAI-ObjectsMarbleCollaborationParamEnv-v1",
+]
\ No newline at end of file
diff --git a/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/formatscasestudyenvs.py b/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/formatscasestudyenvs.py
new file mode 100644
index 0000000000000000000000000000000000000000..252aa334fe00d702783b45505f812b7b069800e4
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/formatscasestudyenvs.py
@@ -0,0 +1,253 @@
+from gym_minigrid.social_ai_envs.socialaiparamenv import SocialAIParamEnv
+from gym_minigrid.parametric_env import *
+from gym_minigrid.register import register
+
+'''
+These are the environments for the formats case study:
+
+All the environments use the Language Feedback cue type.
+
+There are four Intro sequences:
+- no (N)
+- eye contact (E)
+- ask (A)
+- ask during eye contact (AE)
+
+The Training environments all have the 6 problems.
+Four training environments are registered, one for each of the four introductory sequences.
+
+The Testing environments always contain a single problem per env, i.e. no testing on two problems at the same time.
+Registered testing envs: problems x {N, E, A, AE}
+'''
+
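+# Naming scheme of the registration loops at the bottom of this file (illustration only):
+# training envs are registered as 'SocialAI-{intro}{cue}TrainFormatsCSParamEnv-v1' (or
+# 'SocialAI-{intro}{cue}TrainScaffoldingCSParamEnv-v1' for the scaffolding variant) and
+# testing envs as 'SocialAI-{intro}{cue}{problem}TestFormatsCSParamEnv-v1', e.g.
+# 'SocialAI-ELangFeedbackBoxesTestFormatsCSParamEnv-v1'.
+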
+PROBLEMS = ["Boxes", "Switches", "Generators", "Levers", "Doors", "Marble"]
+CUES = ["LangFeedback"]
+INTRO_SEC = ["N", "E", "A", "AE"]
+
+# training
+class TrainingFormatsCSParamEnv(SocialAIParamEnv):
+
+ def __init__(self, cue, intro_sec, scaffolding=False, **kwargs):
+
+ if cue not in CUES:
+ raise ValueError(f"Cue {cue} undefined.")
+ self.cue = cue
+
+ if intro_sec not in INTRO_SEC:
+            raise ValueError(f"Intro sequence {intro_sec} undefined.")
+ self.intro_sec = intro_sec
+
+ self.scaffolding = scaffolding
+
+ super(TrainingFormatsCSParamEnv, self).__init__(**kwargs)
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+
+ if self.scaffolding:
+ tree.add_node("No", parent=prag_fr_compl_nd, type="value")
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+ tree.add_node("Ask", parent=prag_fr_compl_nd, type="value")
+ tree.add_node("Ask_Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ else:
+ if self.intro_sec == "N":
+ tree.add_node("No", parent=prag_fr_compl_nd, type="value")
+ elif self.intro_sec == "E":
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+ elif self.intro_sec == "A":
+ tree.add_node("Ask", parent=prag_fr_compl_nd, type="value")
+ elif self.intro_sec == "AE":
+ tree.add_node("Ask_Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+ if self.scaffolding:
+ scaffolding_Y_nd = tree.add_node("Y", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+
+ if self.cue == "Pointing":
+ tree.add_node("Pointing", parent=cue_type_nd, type="value")
+ elif self.cue == "LangColor":
+ tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+ elif self.cue == "LangFeedback":
+ tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ boxes_nd = tree.add_node("Boxes", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ switches_nd = tree.add_node("Switches", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=switches_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=switches_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ generators_nd = tree.add_node("Generators", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=generators_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=generators_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ levers_nd = tree.add_node("Levers", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=levers_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=levers_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ doors_nd = tree.add_node("Doors", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=doors_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=doors_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ marble_nd = tree.add_node("Marble", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=marble_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=marble_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ return tree
+
+# testing
+class TestingFormatsCSParamEnv(SocialAIParamEnv):
+
+ def __init__(self, cue, intro_sec, problem, **kwargs):
+
+ if cue not in CUES:
+ raise ValueError(f"Cue {cue} undefined.")
+ self.cue = cue
+
+ if intro_sec not in INTRO_SEC:
+            raise ValueError(f"Intro sequence {intro_sec} undefined.")
+ self.intro_sec = intro_sec
+
+ self.problem = problem
+ if self.problem not in PROBLEMS:
+ raise ValueError(f"Problem {self.problem} undefined.")
+
+ super(TestingFormatsCSParamEnv, self).__init__(**kwargs)
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+
+ if self.intro_sec == "N":
+ tree.add_node("No", parent=prag_fr_compl_nd, type="value")
+ elif self.intro_sec == "E":
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+ elif self.intro_sec == "A":
+ tree.add_node("Ask", parent=prag_fr_compl_nd, type="value")
+ elif self.intro_sec == "AE":
+ tree.add_node("Ask_Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ if self.cue == "Pointing":
+ tree.add_node("Pointing", parent=cue_type_nd, type="value")
+ elif self.cue == "LangColor":
+ tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+ elif self.cue == "LangFeedback":
+ tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+        problem_value_nd = tree.add_node(self.problem, parent=problem_nd, type="value")
+
+        version_nd = tree.add_node("N", parent=problem_value_nd, type="param")
+        tree.add_node("2", parent=version_nd, type="value")
+
+        peer_nd = tree.add_node("Peer", parent=problem_value_nd, type="param")
+        tree.add_node("Y", parent=peer_nd, type="value")
+
+ return tree
+
+
+
+# register training envs
+for cue in CUES:
+ for intro_sec in INTRO_SEC:
+ env_name = f'{intro_sec}{cue}TrainFormatsCSParamEnv'
+
+ assert cue == "LangFeedback"
+
+ register(
+ id='SocialAI-{}-v1'.format(env_name),
+ entry_point='gym_minigrid.social_ai_envs:TrainingFormatsCSParamEnv',
+ kwargs={
+ 'intro_sec': intro_sec,
+ 'cue': cue,
+ }
+ )
+
+for intro_sec in INTRO_SEC:
+ # scaffolding train env
+ for cue in CUES:
+ # intro_sec = "AE"
+ env_name = f'{intro_sec}{cue}TrainScaffoldingCSParamEnv'
+
+ assert cue == "LangFeedback"
+
+ register(
+ id='SocialAI-{}-v1'.format(env_name),
+ entry_point='gym_minigrid.social_ai_envs:TrainingFormatsCSParamEnv',
+ kwargs={
+ 'intro_sec': intro_sec,
+ 'cue': cue,
+ 'scaffolding': True
+ }
+ )
+
+
+# register testing envs: cues x problems x intro sequences ({N, E, A, AE})
+for cue in CUES:
+ for problem in PROBLEMS:
+ for intro_sec in INTRO_SEC:
+ env_name = f'{intro_sec}{cue}{problem}TestFormatsCSParamEnv'
+
+ assert cue == "LangFeedback"
+
+ register(
+ id='SocialAI-{}-v1'.format(env_name),
+ entry_point='gym_minigrid.social_ai_envs:TestingFormatsCSParamEnv',
+ kwargs={
+ 'problem': problem,
+ 'cue': cue,
+ 'intro_sec': intro_sec
+ }
+ )
+
+N_formats_test_set = [
+ f"SocialAI-NLangFeedback{problem}TestFormatsCSParamEnv-v1" for problem in PROBLEMS
+]
+E_formats_test_set = [
+ f"SocialAI-ELangFeedback{problem}TestFormatsCSParamEnv-v1" for problem in PROBLEMS
+]
+A_formats_test_set = [
+ f"SocialAI-ALangFeedback{problem}TestFormatsCSParamEnv-v1" for problem in PROBLEMS
+]
+AE_formats_test_set = [
+ f"SocialAI-AELangFeedback{problem}TestFormatsCSParamEnv-v1" for problem in PROBLEMS
+]
diff --git a/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/imitationcasestudyenvs.py b/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/imitationcasestudyenvs.py
new file mode 100644
index 0000000000000000000000000000000000000000..e2c05e7226cb74beb37f49e2a4f56961386a36ed
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/imitationcasestudyenvs.py
@@ -0,0 +1,224 @@
+from gym_minigrid.social_ai_envs.socialaiparamenv import SocialAIParamEnv
+from gym_minigrid.parametric_env import *
+from gym_minigrid.register import register
+
+import inspect, importlib
+
+# used for the automatic registration of environments
+defined_classes = [name for name, _ in inspect.getmembers(importlib.import_module(__name__), inspect.isclass)]
+
+
+# Emulation case study (table 2)
+
+# emulation without distractor
+# training
+class EEmulationNoDistrInformationSeekingParamEnv(SocialAIParamEnv):
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Emulation", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ boxes_nd = tree.add_node("Boxes", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ switches_nd = tree.add_node("Switches", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=switches_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=switches_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ generators_nd = tree.add_node("Generators", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=generators_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=generators_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ levers_nd = tree.add_node("Levers", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=levers_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=levers_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+        marble_nd = tree.add_node("Marble", parent=problem_nd, type="value")
+        version_nd = tree.add_node("N", parent=marble_nd, type="param")
+        tree.add_node("1", parent=version_nd, type="value")
+        peer_nd = tree.add_node("Peer", parent=marble_nd, type="param")
+        tree.add_node("Y", parent=peer_nd, type="value")
+
+ return tree
+
+# testing
+class EEmulationNoDistrDoorsInformationSeekingParamEnv(SocialAIParamEnv):
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Emulation", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+        doors_nd = tree.add_node("Doors", parent=problem_nd, type="value")
+        version_nd = tree.add_node("N", parent=doors_nd, type="param")
+        tree.add_node("1", parent=version_nd, type="value")
+        peer_nd = tree.add_node("Peer", parent=doors_nd, type="param")
+        tree.add_node("Y", parent=peer_nd, type="value")
+
+ return tree
+
+
+
+# emulation with a distractor
+
+# training
+class EEmulationDistrInformationSeekingParamEnv(SocialAIParamEnv):
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Emulation", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ boxes_nd = tree.add_node("Boxes", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ switches_nd = tree.add_node("Switches", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=switches_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=switches_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ generators_nd = tree.add_node("Generators", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=generators_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=generators_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ levers_nd = tree.add_node("Levers", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=levers_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=levers_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+        marble_nd = tree.add_node("Marble", parent=problem_nd, type="value")
+        version_nd = tree.add_node("N", parent=marble_nd, type="param")
+        tree.add_node("2", parent=version_nd, type="value")
+        peer_nd = tree.add_node("Peer", parent=marble_nd, type="param")
+        tree.add_node("Y", parent=peer_nd, type="value")
+
+ return tree
+
+# testing
+class EEmulationDistrDoorsInformationSeekingParamEnv(SocialAIParamEnv):
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Emulation", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ doors_nd = tree.add_node("Doors", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=doors_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=doors_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ return tree
+
+
+# automatic registration of environments
+defined_classes_ = [name for name, _ in inspect.getmembers(importlib.import_module(__name__), inspect.isclass)]
+
+envs = list(set(defined_classes_) - set(defined_classes))
+assert all([e.endswith("Env") for e in envs])
+
+for env in envs:
+ try:
+ register(
+ id='SocialAI-{}-v1'.format(env),
+ entry_point='gym_minigrid.social_ai_envs:{}'.format(env)
+ )
+    except Exception as e:
+        print(f"Env: {env} registration failed: {e}")
+ exit()
+
+
+distr_emulation_test_set = [
+ # "SocialAI-EEmulationDistrBoxesInformationSeekingParamEnv-v1",
+ # "SocialAI-EEmulationDistrSwitchesInformationSeekingParamEnv-v1",
+ # "SocialAI-EEmulationDistrMarbleInformationSeekingParamEnv-v1",
+ # "SocialAI-EEmulationDistrGeneratorsInformationSeekingParamEnv-v1",
+ # "SocialAI-EEmulationDistrLeversInformationSeekingParamEnv-v1",
+ "SocialAI-EEmulationDistrDoorsInformationSeekingParamEnv-v1",
+]
+
+no_distr_emulation_test_set = [
+ # "SocialAI-EEmulationNoDistrBoxesInformationSeekingParamEnv-v1",
+ # "SocialAI-EEmulationNoDistrSwitchesInformationSeekingParamEnv-v1",
+ # "SocialAI-EEmulationNoDistrMarbleInformationSeekingParamEnv-v1",
+ # "SocialAI-EEmulationNoDistrGeneratorsInformationSeekingParamEnv-v1",
+ # "SocialAI-EEmulationNoDistrLeversInformationSeekingParamEnv-v1",
+ "SocialAI-EEmulationNoDistrDoorsInformationSeekingParamEnv-v1",
+]
diff --git a/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/informationseekingcasestudyenvs.py b/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/informationseekingcasestudyenvs.py
new file mode 100644
index 0000000000000000000000000000000000000000..935f40f4b32ebb3400f568f558eb665a2dcfc72d
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/informationseekingcasestudyenvs.py
@@ -0,0 +1,484 @@
+from gym_minigrid.social_ai_envs.socialaiparamenv import SocialAIParamEnv
+from gym_minigrid.parametric_env import *
+from gym_minigrid.register import register
+
+'''
+These are the environments for case studies 1-3: Pointing, Language (Color and Feedback), and Joint Attention.
+
+The intro sequence is always eye contact (E) in both the training and testing envs.
+
+The Training environments have the 5 problems and Marble in the asocial version (no distractor, no peer).
+Registered training envs: cues x {joint attention, no}
+
+The Testing environments are always one problem per env, i.e. no testing on two problems at the same time.
+Registered testing envs: cues x problems x {social, asocial} x {joint attention, no}
+'''
+
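+# Illustration only (a sketch; the classes referenced here are defined below in this
+# module): a single training or testing configuration can also be constructed directly,
+# e.g.
+#
+#   train_env = TrainingInformationSeekingParamEnv(cue="Pointing", intro_sec="E", ja=False)
+#   test_env = TestingInformationSeekingParamEnv(
+#       cue="Pointing", intro_sec="E", ja=False, problem="Boxes", asocial=False)
+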
+PROBLEMS = ["Boxes", "Switches", "Generators", "Levers", "Doors", "Marble"]
+CUES = ["Pointing", "LangFeedback", "LangColor"]
+INTRO_SEC = ["E"]
+# INTRO_SEC = ["N", "E", "A", "AE"]
+
+
+class EAsocialInformationSeekingParamEnv(SocialAIParamEnv):
+ '''
+ Env with all problems in the asocial version -> just for testing
+ '''
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+ tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+ tree.add_node("Pointing", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ boxes_nd = tree.add_node("Boxes", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ switches_nd = tree.add_node("Switches", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=switches_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=switches_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ generators_nd = tree.add_node("Generators", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=generators_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=generators_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ levers_nd = tree.add_node("Levers", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=levers_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=levers_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ doors_nd = tree.add_node("Doors", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=doors_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=doors_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ marble_nd = tree.add_node("Marble", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=marble_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=marble_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ return tree
+
+# Pointing case study
+
+# training
+class TrainingInformationSeekingParamEnv(SocialAIParamEnv):
+
+ def __init__(self, cue, intro_sec, ja, heldout="Doors", **kwargs):
+
+ if cue not in CUES:
+ raise ValueError(f"Cue {cue} undefined.")
+ self.cue = cue
+
+ if intro_sec not in INTRO_SEC:
+            raise ValueError(f"Intro sequence {intro_sec} undefined.")
+ self.intro_sec = intro_sec
+
+        self.heldout = heldout
+
+ if ja not in [True, False]:
+ raise ValueError(f"JA {ja} undefined.")
+ self.ja = ja
+
+ super(TrainingInformationSeekingParamEnv, self).__init__(**kwargs)
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+
+ if self.intro_sec == "N":
+ tree.add_node("No", parent=prag_fr_compl_nd, type="value")
+ elif self.intro_sec == "E":
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+ elif self.intro_sec == "A":
+ tree.add_node("Ask", parent=prag_fr_compl_nd, type="value")
+ elif self.intro_sec == "AE":
+ tree.add_node("Ask_Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+
+ if self.cue == "Pointing":
+ tree.add_node("Pointing", parent=cue_type_nd, type="value")
+ elif self.cue == "LangColor":
+ tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+ elif self.cue == "LangFeedback":
+ tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ boxes_nd = tree.add_node("Boxes", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ switches_nd = tree.add_node("Switches", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=switches_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=switches_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ generators_nd = tree.add_node("Generators", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=generators_nd, type="param")
+
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=generators_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ levers_nd = tree.add_node("Levers", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=levers_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=levers_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ doors_nd = tree.add_node("Doors", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=doors_nd, type="param")
+
+ if self.heldout == "Doors":
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=doors_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+ else:
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=doors_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ marble_nd = tree.add_node("Marble", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=marble_nd, type="param")
+
+ if self.heldout == "Marble":
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=marble_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+ else:
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=marble_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ if self.ja:
+ N_bo_nd = tree.add_node("JA_recursive", parent=inf_seeking_nd, type="param")
+ tree.add_node("Y", parent=N_bo_nd, type="value")
+ tree.add_node("N", parent=N_bo_nd, type="value")
+
+ return tree
+
+# testing
+class TestingInformationSeekingParamEnv(SocialAIParamEnv):
+
+ def __init__(self, cue, intro_sec, ja, problem, asocial, **kwargs):
+
+ if cue not in CUES:
+ raise ValueError(f"Cue {cue} undefined.")
+ self.cue = cue
+
+ if intro_sec not in INTRO_SEC:
+            raise ValueError(f"Intro sequence {intro_sec} undefined.")
+ self.intro_sec = intro_sec
+
+ if ja not in [True, False]:
+ raise ValueError(f"JA {ja} undefined.")
+ self.ja = ja
+
+ self.problem = problem
+ if self.problem not in PROBLEMS:
+ raise ValueError(f"Problem {self.problem} undefined.")
+
+ self.asocial = asocial
+
+ super(TestingInformationSeekingParamEnv, self).__init__(**kwargs)
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+
+ if self.intro_sec == "N":
+ tree.add_node("No", parent=prag_fr_compl_nd, type="value")
+ elif self.intro_sec == "E":
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+ elif self.intro_sec == "A":
+ tree.add_node("Ask", parent=prag_fr_compl_nd, type="value")
+ elif self.intro_sec == "AE":
+ tree.add_node("Ask_Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ if self.cue == "Pointing":
+ tree.add_node("Pointing", parent=cue_type_nd, type="value")
+ elif self.cue == "LangColor":
+ tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+ elif self.cue == "LangFeedback":
+ tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+        problem_value_nd = tree.add_node(self.problem, parent=problem_nd, type="value")
+
+        N = "1" if self.asocial else "2"
+        version_nd = tree.add_node("N", parent=problem_value_nd, type="param")
+        tree.add_node(N, parent=version_nd, type="value")
+
+        peer = "N" if self.asocial else "Y"
+        peer_nd = tree.add_node("Peer", parent=problem_value_nd, type="param")
+        tree.add_node(peer, parent=peer_nd, type="value")
+
+ if self.ja:
+ N_bo_nd = tree.add_node("JA_recursive", parent=inf_seeking_nd, type="param")
+ tree.add_node("Y", parent=N_bo_nd, type="value")
+
+ return tree
+
+
+class DrawingEnv(SocialAIParamEnv):
+
+ def __init__(self, **kwargs):
+
+ self.cue = "Pointing"
+
+ self.intro_sec = "E"
+
+ self.heldout = "Doors"
+
+ self.ja = False
+
+ super(DrawingEnv, self).__init__(**kwargs)
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # collab_nd = tree.add_node("Collaboration", parent=env_type_nd, type="value")
+ # colab_type_nd = tree.add_node("Problem", parent=collab_nd, type="param")
+ # tree.add_node("MarblePush", parent=colab_type_nd, type="value")
+ #
+ # role_nd = tree.add_node("Role", parent=collab_nd, type="param")
+ # tree.add_node("A", parent=role_nd, type="value")
+ #
+ # role_nd = tree.add_node("Version", parent=collab_nd, type="param")
+ # tree.add_node("Social", parent=role_nd, type="value")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+
+ # as_nd = tree.add_node("AppleStealing", parent=env_type_nd, type="value")
+ # ver_nd = tree.add_node("Version", parent=as_nd, type="param")
+ # tree.add_node("Asocial", parent=ver_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+
+ if self.intro_sec == "N":
+ tree.add_node("No", parent=prag_fr_compl_nd, type="value")
+ elif self.intro_sec == "E":
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+ elif self.intro_sec == "A":
+ tree.add_node("Ask", parent=prag_fr_compl_nd, type="value")
+ elif self.intro_sec == "AE":
+ tree.add_node("Ask_Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Peer_help", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+
+ if self.cue == "Pointing":
+ tree.add_node("Pointing", parent=cue_type_nd, type="value")
+ elif self.cue == "LangColor":
+ tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+ elif self.cue == "LangFeedback":
+ tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ boxes_nd = tree.add_node("Boxes", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ switches_nd = tree.add_node("Switches", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=switches_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=switches_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ generators_nd = tree.add_node("Generators", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=generators_nd, type="param")
+
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=generators_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ levers_nd = tree.add_node("Levers", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=levers_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=levers_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ doors_nd = tree.add_node("Doors", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=doors_nd, type="param")
+
+ if self.heldout == "Doors":
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=doors_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+ else:
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=doors_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ marble_nd = tree.add_node("Marble", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=marble_nd, type="param")
+
+ if self.heldout == "Marble":
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=marble_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+ else:
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=marble_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ if self.ja:
+ N_bo_nd = tree.add_node("JA_recursive", parent=inf_seeking_nd, type="param")
+ tree.add_node("Y", parent=N_bo_nd, type="value")
+ tree.add_node("N", parent=N_bo_nd, type="value")
+
+ return tree
+
+# register drawing env
+register(
+ id='SocialAI-DrawingEnv-v1',
+ entry_point='gym_minigrid.social_ai_envs:DrawingEnv',
+)
+
+# register dummy env
+register(
+ id='SocialAI-EAsocialInformationSeekingParamEnv-v1',
+ entry_point='gym_minigrid.social_ai_envs:EAsocialInformationSeekingParamEnv',
+)
+
+
+# register training envs
+for cue in CUES:
+ for ja in [True, False]:
+ for intro_sec in INTRO_SEC:
+ env_name = f'{"JA" if ja else ""}{intro_sec}{cue}TrainInformationSeekingParamEnv'
+
+ register(
+ id='SocialAI-{}-v1'.format(env_name),
+ entry_point='gym_minigrid.social_ai_envs:TrainingInformationSeekingParamEnv',
+ kwargs={
+ 'intro_sec': intro_sec,
+ 'cue': cue,
+ 'ja': ja,
+ }
+ )
+
+# register training envs: heldout doors
+for cue in CUES:
+ for ja in [True, False]:
+ for intro_sec in INTRO_SEC:
+ env_name = f'{"JA" if ja else ""}{intro_sec}{cue}HeldoutDoorsTrainInformationSeekingParamEnv'
+
+ register(
+ id='SocialAI-{}-v1'.format(env_name),
+ entry_point='gym_minigrid.social_ai_envs:TrainingInformationSeekingParamEnv',
+ kwargs={
+ 'intro_sec': intro_sec,
+ 'cue': cue,
+ 'ja': ja,
+ 'heldout': "Doors",
+ }
+ )
+
+
+# register testing envs: intro_sec x cues x problems x {social, asocial} x {joint attention, no JA}
+for cue in CUES:
+ for problem in PROBLEMS:
+ for asocial in [True, False]:
+ for ja in [True, False]:
+ for intro_sec in INTRO_SEC:
+ env_name = f'{"JA" if ja else ""}{intro_sec}{cue}{problem}{"Asocial" if asocial else ""}TestInformationSeekingParamEnv'
+
+ register(
+ id='SocialAI-{}-v1'.format(env_name),
+ entry_point='gym_minigrid.social_ai_envs:TestingInformationSeekingParamEnv',
+ kwargs={
+ 'asocial': asocial,
+ 'problem': problem,
+ 'cue': cue,
+ 'ja': ja,
+ 'intro_sec': intro_sec
+ }
+ )
+
+pointing_test_set = [
+ f"SocialAI-EPointing{problem}TestInformationSeekingParamEnv-v1" for problem in PROBLEMS
+]+["SocialAI-EPointingDoorsAsocialTestInformationSeekingParamEnv-v1"]
+
+language_feedback_test_set = [
+ f"SocialAI-ELangFeedback{problem}TestInformationSeekingParamEnv-v1" for problem in PROBLEMS
+]+["SocialAI-ELangFeedbackDoorsAsocialTestInformationSeekingParamEnv-v1"]
+
+language_color_test_set = [
+ f"SocialAI-ELangColor{problem}TestInformationSeekingParamEnv-v1" for problem in PROBLEMS
+]+["SocialAI-ELangColorDoorsAsocialTestInformationSeekingParamEnv-v1"]
+
+ja_pointing_test_set = [
+ f"SocialAI-JAEPointing{problem}TestInformationSeekingParamEnv-v1" for problem in PROBLEMS
+]+["SocialAI-JAEPointingDoorsAsocialTestInformationSeekingParamEnv-v1"]
+
+ja_language_feedback_test_set = [
+ f"SocialAI-JAELangFeedback{problem}TestInformationSeekingParamEnv-v1" for problem in PROBLEMS
+]+["SocialAI-JAELangFeedbackDoorsAsocialTestInformationSeekingParamEnv-v1"]
+
+ja_language_color_test_set = [
+ f"SocialAI-JAELangColor{problem}TestInformationSeekingParamEnv-v1" for problem in PROBLEMS
+]+["SocialAI-JAELangColorDoorsAsocialTestInformationSeekingParamEnv-v1"]
diff --git a/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/langcolorcasestudyenvs.py b/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/langcolorcasestudyenvs.py
new file mode 100644
index 0000000000000000000000000000000000000000..042875adbb1839fa039c216c84987e4c381a39d3
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/langcolorcasestudyenvs.py
@@ -0,0 +1,230 @@
+from gym_minigrid.social_ai_envs.socialaiparamenv import SocialAIParamEnv
+from gym_minigrid.parametric_env import *
+from gym_minigrid.register import register
+
+import inspect, importlib
+# used for automatic registration of environments
+defined_classes = [name for name, _ in inspect.getmembers(importlib.import_module(__name__), inspect.isclass)]
+
+# LangColor case study
+
+
+# training
+class ELangColorInformationSeekingParamEnv(SocialAIParamEnv):
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ boxes_nd = tree.add_node("Boxes", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ switches_nd = tree.add_node("Switches", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=switches_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=switches_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ generators_nd = tree.add_node("Generators", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=generators_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=generators_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ levers_nd = tree.add_node("Levers", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=levers_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=levers_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ doors_nd = tree.add_node("Doors", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=doors_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=doors_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ marble_nd = tree.add_node("Marble", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=marble_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=marble_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ return tree
+
+# testing
+class ELangColorMarbleInformationSeekingParamEnv(SocialAIParamEnv):
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ marble_nd = tree.add_node("Marble", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=marble_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=marble_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ return tree
+
+
+# Joint Attention
+
+# training
+class JAELangColorInformationSeekingParamEnv(SocialAIParamEnv):
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ boxes_nd = tree.add_node("Boxes", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ switches_nd = tree.add_node("Switches", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=switches_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=switches_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ generators_nd = tree.add_node("Generators", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=generators_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=generators_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ levers_nd = tree.add_node("Levers", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=levers_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=levers_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ doors_nd = tree.add_node("Doors", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=doors_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=doors_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ marble_nd = tree.add_node("Marble", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=marble_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=marble_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ N_bo_nd = tree.add_node("JA_recursive", parent=inf_seeking_nd, type="param")
+ tree.add_node("Y", parent=N_bo_nd, type="value")
+
+ return tree
+
+
+# testing
+class JAELangColorMarbleInformationSeekingParamEnv(SocialAIParamEnv):
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ marble_nd = tree.add_node("Marble", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=marble_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=marble_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ N_bo_nd = tree.add_node("JA_recursive", parent=inf_seeking_nd, type="param")
+ tree.add_node("Y", parent=N_bo_nd, type="value")
+
+ return tree
+
+
+# automatic registration of environments
+defined_classes_ = [name for name, _ in inspect.getmembers(importlib.import_module(__name__), inspect.isclass)]
+
+envs = list(set(defined_classes_) - set(defined_classes))
+assert all([e.endswith("Env") for e in envs])
+
+for env in envs:
+ try:
+ register(
+ id='SocialAI-{}-v1'.format(env),
+ entry_point='gym_minigrid.social_ai_envs:{}'.format(env)
+ )
+ except Exception:
+ print(f"Env : {env} registration failed.")
+ exit()
+
+
+language_color_test_set = [
+ "SocialAI-ELangColorMarbleInformationSeekingParamEnv-v1",
+]
+
+ja_language_color_test_set = [
+ "SocialAI-JAELangColorMarbleInformationSeekingParamEnv-v1",
+]
diff --git a/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/langfeedbackcasestudyenvs.py b/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/langfeedbackcasestudyenvs.py
new file mode 100644
index 0000000000000000000000000000000000000000..c85f728bdab460fd6f3b683cc2f9973384c5d2eb
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/langfeedbackcasestudyenvs.py
@@ -0,0 +1,230 @@
+from gym_minigrid.social_ai_envs.socialaiparamenv import SocialAIParamEnv
+from gym_minigrid.parametric_env import *
+from gym_minigrid.register import register
+
+import inspect, importlib
+# used for automatic registration of environments
+defined_classes = [name for name, _ in inspect.getmembers(importlib.import_module(__name__), inspect.isclass)]
+
+# LangFeedback case study
+
+# training
+class ELangFeedbackInformationSeekingParamEnv(SocialAIParamEnv):
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ boxes_nd = tree.add_node("Boxes", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ switches_nd = tree.add_node("Switches", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=switches_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=switches_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ generators_nd = tree.add_node("Generators", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=generators_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=generators_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ levers_nd = tree.add_node("Levers", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=levers_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=levers_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ doors_nd = tree.add_node("Doors", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=doors_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=doors_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ marble_nd = tree.add_node("Marble", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=marble_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=marble_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ return tree
+
+# testing
+class ELangFeedbackMarbleInformationSeekingParamEnv(SocialAIParamEnv):
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ marble_nd = tree.add_node("Marble", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=marble_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=marble_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ return tree
+
+
+# Joint Attention
+
+# training
+class JAELangFeedbackInformationSeekingParamEnv(SocialAIParamEnv):
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ boxes_nd = tree.add_node("Boxes", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ switches_nd = tree.add_node("Switches", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=switches_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=switches_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ generators_nd = tree.add_node("Generators", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=generators_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=generators_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ levers_nd = tree.add_node("Levers", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=levers_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=levers_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ doors_nd = tree.add_node("Doors", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=doors_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=doors_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ marble_nd = tree.add_node("Marble", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=marble_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=marble_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ N_bo_nd = tree.add_node("JA_recursive", parent=inf_seeking_nd, type="param")
+ tree.add_node("Y", parent=N_bo_nd, type="value")
+
+ return tree
+
+
+# testing
+class JAELangFeedbackMarbleInformationSeekingParamEnv(SocialAIParamEnv):
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ marble_nd = tree.add_node("Marble", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=marble_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=marble_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ N_bo_nd = tree.add_node("JA_recursive", parent=inf_seeking_nd, type="param")
+ tree.add_node("Y", parent=N_bo_nd, type="value")
+
+ return tree
+
+
+# automatic registration of environments
+defined_classes_ = [name for name, _ in inspect.getmembers(importlib.import_module(__name__), inspect.isclass)]
+
+envs = list(set(defined_classes_) - set(defined_classes))
+assert all([e.endswith("Env") for e in envs])
+
+for env in envs:
+ try:
+ register(
+ id='SocialAI-{}-v1'.format(env),
+ entry_point='gym_minigrid.social_ai_envs:{}'.format(env)
+ )
+ except Exception:
+ print(f"Env : {env} registration failed.")
+ exit()
+
+
+language_feedback_test_set = [
+ "SocialAI-ELangFeedbackMarbleInformationSeekingParamEnv-v1",
+]
+
+
+ja_language_feedback_test_set = [
+ "SocialAI-JAELangFeedbackMarbleInformationSeekingParamEnv-v1",
+]
diff --git a/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/pointingcasestudyenvs.py b/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/pointingcasestudyenvs.py
new file mode 100644
index 0000000000000000000000000000000000000000..a55a23bdb1079bb794c41df0163b803ded4f1a43
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/social_ai_envs/case_studies_envs/pointingcasestudyenvs.py
@@ -0,0 +1,361 @@
+from gym_minigrid.social_ai_envs.socialaiparamenv import SocialAIParamEnv
+from gym_minigrid.parametric_env import *
+from gym_minigrid.register import register
+
+import inspect, importlib
+# used for automatic registration of environments
+defined_classes = [name for name, _ in inspect.getmembers(importlib.import_module(__name__), inspect.isclass)]
+
+class EAsocialInformationSeekingParamEnv(SocialAIParamEnv):
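+ # asocial control environment: every problem uses a single object (N=1) and no peer,
+ # while all three cue types are kept in the tree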
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+ tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+ tree.add_node("Pointing", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ boxes_nd = tree.add_node("Boxes", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ switches_nd = tree.add_node("Switches", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=switches_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=switches_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ generators_nd = tree.add_node("Generators", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=generators_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=generators_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ levers_nd = tree.add_node("Levers", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=levers_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=levers_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ doors_nd = tree.add_node("Doors", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=doors_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=doors_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ marble_nd = tree.add_node("Marble", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=marble_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=marble_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ return tree
+
+# Pointing case study
+
+# training
+class EPointingInformationSeekingParamEnv(SocialAIParamEnv):
+
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ # tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+ # tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+ tree.add_node("Pointing", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ boxes_nd = tree.add_node("Boxes", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ switches_nd = tree.add_node("Switches", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=switches_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=switches_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ generators_nd = tree.add_node("Generators", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=generators_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=generators_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ levers_nd = tree.add_node("Levers", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=levers_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=levers_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ doors_nd = tree.add_node("Doors", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=doors_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=doors_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ marble_nd = tree.add_node("Marble", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=marble_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=marble_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ return tree
+
+# testing
+class EPointingTestingInformationSeekingParamEnv(SocialAIParamEnv):
+
+ def __init__(self, problem, asocial, **kwargs):
+
+ self.problem = problem
+ if self.problem not in ["Boxes", "Switches", "Generators", "Levers", "Doors", "Marble"]:
+ raise ValueError(f"Problem {self.problem} undefined.")
+
+ self.asocial = asocial
+
+ super(EPointingTestingInformationSeekingParamEnv, self).__init__(**kwargs)
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Pointing", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ boxes_nd = tree.add_node(self.problem, parent=problem_nd, type="value")
+
+ N = "1" if self.asocial else "2"
+ version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ tree.add_node(N, parent=version_nd, type="value")
+
+ peer = "N" if self.asocial else "Y"
+ peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ tree.add_node(peer, parent=peer_nd, type="value")
+
+ return tree
+
+
+# Joint Attention
+
+# training
+class JAEPointingInformationSeekingParamEnv(SocialAIParamEnv):
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ # tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+ # tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+ tree.add_node("Pointing", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ boxes_nd = tree.add_node("Boxes", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ switches_nd = tree.add_node("Switches", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=switches_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=switches_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ generators_nd = tree.add_node("Generators", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=generators_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=generators_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ levers_nd = tree.add_node("Levers", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=levers_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=levers_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ doors_nd = tree.add_node("Doors", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=doors_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=doors_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ marble_nd = tree.add_node("Marble", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=marble_nd, type="param")
+ tree.add_node("1", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=marble_nd, type="param")
+ tree.add_node("N", parent=peer_nd, type="value")
+
+ N_bo_nd = tree.add_node("JA_recursive", parent=inf_seeking_nd, type="param")
+ tree.add_node("Y", parent=N_bo_nd, type="value")
+
+ return tree
+
+
+# testing
+class JAEPointingTestingInformationSeekingParamEnv(SocialAIParamEnv):
+
+ def __init__(self, problem, asocial, **kwargs):
+
+ self.problem = problem
+ if self.problem not in ["Boxes", "Switches", "Generators", "Levers", "Doors", "Marble"]:
+ raise ValueError(f"Problem {self.problem} undefined.")
+
+ self.asocial = asocial
+
+ super(JAEPointingTestingInformationSeekingParamEnv, self).__init__(**kwargs)
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Pointing", parent=cue_type_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ boxes_nd = tree.add_node(self.problem, parent=problem_nd, type="value")
+
+ N = "1" if self.asocial else "2"
+ version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ tree.add_node(N, parent=version_nd, type="value")
+
+ peer = "N" if self.asocial else "Y"
+ peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ tree.add_node(peer, parent=peer_nd, type="value")
+
+ N_bo_nd = tree.add_node("JA_recursive", parent=inf_seeking_nd, type="param")
+ tree.add_node("Y", parent=N_bo_nd, type="value")
+
+ return tree
+
+
+# automatic registration of environments
+defined_classes_ = [name for name, _ in inspect.getmembers(importlib.import_module(__name__), inspect.isclass)]
+
+envs = list(set(defined_classes_) - set(defined_classes))
+assert all([e.endswith("Env") for e in envs])
+
+for env in envs:
+ try:
+ register(
+ id='SocialAI-{}-v1'.format(env),
+ entry_point='gym_minigrid.social_ai_envs:{}'.format(env)
+ )
+ except Exception:
+ print(f"Env : {env} registration failed.")
+ exit()
+
+# register testing envs
+problems = ["Boxes", "Switches", "Generators", "Levers", "Doors", "Marble"]
+
+for problem in problems:
+ for asocial in [True, False]:
+
+ if asocial:
+ env_name = f'EPointing{problem}AsocialInformationSeekingParamEnv'
+ else:
+ env_name = f'EPointing{problem}InformationSeekingParamEnv'
+
+ print("env name:", env_name)
+
+ register(
+ id='SocialAI-{}-v1'.format(env_name),
+ entry_point='gym_minigrid.social_ai_envs:EPointingTestingInformationSeekingParamEnv',
+ kwargs={
+ 'asocial': asocial,
+ 'problem': problem,
+ }
+ )
+
+ if asocial:
+ env_name = f'JAEPointing{problem}AsocialInformationSeekingParamEnv'
+ else:
+ env_name = f'JAEPointing{problem}InformationSeekingParamEnv'
+
+ register(
+ id='SocialAI-{}-v1'.format(env_name),
+ entry_point='gym_minigrid.social_ai_envs:JAEPointingTestingInformationSeekingParamEnv',
+ kwargs={
+ 'asocial': asocial,
+ 'problem': problem,
+ }
+ )
+
+
+pointing_test_set = [
+ "SocialAI-EPointingMarbleInformationSeekingParamEnv-v1",
+]
+
+
+ja_pointing_test_set = [
+ "SocialAI-JAEPointingMarbleInformationSeekingParamEnv-v1",
+]
diff --git a/gym-minigrid/gym_minigrid/social_ai_envs/informationseekingenv.py b/gym-minigrid/gym_minigrid/social_ai_envs/informationseekingenv.py
new file mode 100644
index 0000000000000000000000000000000000000000..f91a0195e8fbe8347120afe74f631c30e13bdae7
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/social_ai_envs/informationseekingenv.py
@@ -0,0 +1,1274 @@
+import time
+import random
+
+import numpy as np
+from gym_minigrid.social_ai_envs.socialaigrammar import SocialAIGrammar, SocialAIActions, SocialAIActionSpace
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+from collections import deque
+
+def next_to(posa, posb):
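+ # True iff the two grid positions are orthogonally adjacent (Manhattan distance of 1)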
+ if type(posa) == tuple:
+ posa = np.array(posa)
+
+ if type(posb) == tuple:
+ posb = np.array(posb)
+
+ return abs(posa-posb).sum() == 1
+
+
+class Caretaker(NPC):
+ """
+ A helper NPC that gives the agent cues (pointing, color or feedback language, a demonstration, or scaffolding) about how to get the apple
+ """
+ def __init__(self, color, name, env):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+ self.npc_dir = 1 # NPC initially looks downward
+ self.npc_type = 0 # this will be put into the encoding
+
+ self.was_introduced_to = False
+ self.decoy_color_given = False
+
+ self.ate_an_apple = False
+ self.demo_over = False
+ self.demo_over_and_position_safe = False
+ self.apple_unlocked_for_agent = False
+
+ self.list_of_possible_utterances = [
+ *self.list_of_possible_utterances,
+ "Hot",
+ "Warm",
+ "Medium",
+ "Cold",
+ *COLOR_NAMES
+ ]
+
+ # target obj
+ assert self.env.problem == (self.env.parameters["Problem"] if self.env.parameters else "Apples")
+
+ if self.env.problem in ["Apples"]:
+ self.target_obj = self.env.apple
+ self.distractor_obj = None
+
+ elif self.env.problem == "Doors":
+ self.target_obj = self.env.door
+ self.distractor_obj = self.env.distractor_door
+
+ elif self.env.problem == "Levers":
+ self.target_obj = self.env.lever
+ self.distractor_obj = self.env.distractor_lever
+
+ elif self.env.problem == "Boxes":
+ self.target_obj = self.env.box
+ self.distractor_obj = self.env.distractor_box
+
+ elif self.env.problem == "Switches":
+ self.target_obj = self.env.switch
+ self.distractor_obj = self.env.distractor_switch
+
+ elif self.env.problem == "Generators":
+ self.target_obj = self.env.generator
+ self.distractor_obj = self.env.distractor_generator
+
+ elif self.env.problem in ["Marble", "Marbles"]:
+ self.target_obj = self.env.generator
+ self.distractor_obj = self.env.distractor_generator
+
+ if self.env.ja_recursive:
+ if int(self.env.parameters["N"]) == 1:
+ self.ja_decoy = self.env._rand_elem([self.target_obj])
+ else:
+ self.ja_decoy = self.env._rand_elem([self.target_obj, self.distractor_obj])
+
+ # the other object is a decoy distractor
+ self.ja_decoy_distractor = list({self.target_obj, self.distractor_obj} - {self.ja_decoy})[0]
+
+ self.decoy_point_from_loc = self.find_point_from_loc(
+ target_pos=self.ja_decoy.cur_pos,
+ distractor_pos=self.ja_decoy_distractor.cur_pos if self.ja_decoy_distractor else None
+ )
+
+ self.point_from_loc = self.find_point_from_loc()
+
+ assert self.env.grammar.contains_utterance(self.introduction_statement)
+
+ def step(self, utterance):
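+ # one NPC step: handle the introduction, then either give a cue (pointing / color / feedback language)
+ # or act out the demonstration / scaffolding behaviour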
+ reply, info = super().step()
+
+ if self.env.hidden_npc:
+ return reply, info
+
+ scaffolding = self.env.parameters.get("Scaffolding", "N") == "Y"
+ language_color = False
+ language_feedback = False
+ pointing = False
+ emulation = False
+
+ if not scaffolding:
+ cue_type = self.env.parameters["Cue_type"]
+
+ if cue_type == "Language_Color":
+ language_color = True
+ elif cue_type == "Language_Feedback":
+ language_feedback = True
+ elif cue_type == "Pointing":
+ pointing = True
+ elif cue_type == "Emulation":
+ emulation = True
+ else:
+ raise ValueError(f"Cue_type ({cue_type}) not defined.")
+ else:
+ # there are no cues if scaffolding is used (the peer gives the apples to the agent)
+ assert "Cue_type" not in self.env.parameters
+
+ # there is no additional test for joint attention (no cues are given so this wouldn't make sense)
+ assert not self.env.ja_recursive
+
+ reply, action = None, None
+ if not self.was_introduced_to:
+ # check introduction, updates was_introduced_to if needed
+ reply, action = self.handle_introduction(utterance)
+
+ assert action is None
+
+ if self.env.ja_recursive:
+ # look at the center of the room (this makes the cue giving inside and outside JA different)
+ action = self.look_at_action([self.env.current_width // 2, self.env.current_height // 2])
+ else:
+ # look at the agent
+ action = self.look_at_action(self.env.agent_pos)
+
+ if self.was_introduced_to:
+ # was introduced just now
+ if self.is_pointing():
+ action = self.stop_point
+
+ if language_color:
+ # only say the color once
+ reply = self.target_obj.color
+
+ elif self.env.ja_recursive:
+ # was not introduced
+ if language_feedback:
+ # random reply
+ reply = self.env._rand_elem([
+ "Hot",
+ "Warm",
+ "Medium",
+ "Cold"
+ ])
+
+ if language_color and not self.decoy_color_given:
+ # color of a decoy (can be the correct one)
+ reply = self.ja_decoy.color
+ self.decoy_color_given = True
+
+ if pointing:
+ # point to a decoy
+ action = self.goto_point_action(
+ point_from_loc=self.decoy_point_from_loc,
+ target_pos=self.ja_decoy.cur_pos,
+ distractor_pos=self.ja_decoy_distractor.cur_pos if self.ja_decoy_distractor else None
+ )
+
+ if self.is_pointing():
+ # if it's already pointing, turn to look at the center (to avoid looking at the wall)
+ action = self.look_at_action([self.env.current_width//2, self.env.current_height//2])
+
+
+ else:
+
+ if self.was_introduced_to and language_color:
+ # language only once at introduction
+ # reply = self.target_obj.color
+ action = self.look_at_action(self.env.agent_pos)
+
+ if self.was_introduced_to and language_feedback:
+ # closeness string
+ agent_distance_to_target = np.abs(self.target_obj.cur_pos - self.env.agent_pos).sum()
+ if agent_distance_to_target <= 1:
+ reply = "Hot"
+ elif agent_distance_to_target <= 2:
+ reply = "Warm"
+ elif agent_distance_to_target <= 5:
+ reply = "Medium"
+ else:  # distance greater than 5
+ reply = "Cold"
+
+ action = self.look_at_action(self.env.agent_pos)
+
+ # pointing
+ if self.was_introduced_to and pointing:
+ if self.env.parameters["N"] == "1":
+ distractor_pos = None
+ else:
+ distractor_pos = self.distractor_obj.cur_pos
+
+ action = self.goto_point_action(
+ point_from_loc=self.point_from_loc,
+ target_pos=self.target_obj.cur_pos,
+ distractor_pos=distractor_pos,
+ )
+
+ if self.is_pointing():
+ action = self.look_at_action(self.env.agent_pos)
+
+ # emulation or scaffolding
+ emulation_demo = self.was_introduced_to and emulation and not self.demo_over
+ scaffolding_help = self.was_introduced_to and scaffolding
+
+ # do the demonstration / unlock the apple
+ # in both of those two scenarios the NPC in essence solves the task
+ # in demonstration - it eats the apple, and reverts the env at the end
+ # in scaffolding - it doesn't eat the apple and looks at the agent
+ if emulation_demo or scaffolding_help:
+
+ if emulation_demo or (scaffolding_help and not self.apple_unlocked_for_agent):
+
+ if self.is_pointing():
+ # don't point during demonstration
+ action = self.stop_point
+
+ else:
+ # if apple unlocked go pick it up
+ if self.target_obj == self.env.switch and self.env.switch.is_on:
+ assert self.env.parameters["Problem"] == "Switches"
+ next_target_position = self.env.box.cur_pos
+
+ elif self.target_obj == self.env.generator and self.env.generator.is_pressed:
+ assert self.env.parameters["Problem"] in ["Generators", "Marbles", "Marble"]
+ next_target_position = self.env.generator_platform.cur_pos
+
+ elif self.target_obj == self.env.door and self.env.door.is_open:
+ next_target_position = self.env.apple.cur_pos
+
+ elif self.target_obj == self.env.lever and self.env.lever.is_on:
+ next_target_position = self.env.apple.cur_pos
+
+ else:
+ next_target_position = self.target_obj.cur_pos
+
+ if self.target_obj == self.env.generator and not self.env.generator.is_pressed:
+ if not self.env.generator.marble_activation:
+ # push generator
+ action = self.path_to_pos(next_target_position)
+ else:
+ # find angle
+ if self.env.marble.moving_dir is None:
+ distance = (self.env.marble.cur_pos - self.env.generator.cur_pos)
+ diff = np.sign(distance)
+
+ if sum(abs(diff)) == 1:
+ # if the agent pushed the ball during demo diff can be > 1, then it's unsolvable
+ push_pos = self.env.marble.cur_pos+diff
+ if all(self.cur_pos == push_pos):
+ next_target_position = self.env.marble.cur_pos
+ else:
+ next_target_position = push_pos
+
+ # go to loc in front of
+ # push
+ action = self.path_to_pos(next_target_position)
+
+ else:
+ # toggle all other objects
+ action = self.path_to_toggle_pos(next_target_position)
+
+ # for scaffolding check if trying to eat the apple
+ # if so, stop - apple is unlocked
+ if scaffolding_help:
+ if (
+ self.env.get_cell(*self.front_pos) == self.env.apple and
+ action == self.toggle_action
+ ):
+ # don't eat the apple
+ action = None
+ self.apple_unlocked_for_agent = True
+
+ # for emulation check if trying to toggle the eaten apple
+ # if so, stop and revert the env - demo is over
+ if emulation_demo:
+ if (
+ self.ate_an_apple and
+ self.env.get_cell(*self.front_pos) == self.env.apple and
+ action == self.toggle_action and
+ self.env.apple.eaten
+ ):
+ # trying to toggle an apple it ate
+ self.env.revert()
+ self.demo_over = True
+ action = None
+
+ # if scaffolding apple unlocked, look at the agent
+ if scaffolding_help and self.apple_unlocked_for_agent:
+ if all(self.cur_pos == self.initial_pos):
+ # if the apple is unlocked look at the agent
+ wanted_dir = self.compute_wanted_dir(self.env.agent_pos)
+ action = self.compute_turn_action(wanted_dir)
+ else:
+ # go to init pos, this removes problems in case the apple is unreachable now
+ action = self.path_to_pos(self.initial_pos)
+
+ if self.was_introduced_to and emulation and self.demo_over and not self.demo_over_and_position_safe:
+ if self.env.is_in_marble_way(self.cur_pos):
+ action = self.path_to_pos(self.find_point_from_loc())
+ else:
+ self.demo_over_and_position_safe = True
+
+ if self.demo_over_and_position_safe:
+ assert emulation or scaffolding
+ # look at the agent after demo is done
+ action = self.look_at_action(self.env.agent_pos)
+
+ if self.was_introduced_to and self.env.parameters["Scaffolding"] == "Y":
+ if "Emulation" in self.env.parameters or "Pointing" in self.env.parameters or "Language_grounding" in self.env.parameters:
+ raise ValueError(
+ "Scaffolding cannot be used with information giving (Emulation, Pointing, Language_grounding)"
+ )
+
+ eaten_before = self.env.apple.eaten
+
+ if action is not None:
+ action()
+
+ # check if the NPC ate the apple
+ eaten_after = self.env.apple.eaten
+ self.ate_an_apple = not eaten_before and eaten_after
+
+ info = self.create_info(
+ action=action,
+ utterance=reply,
+ was_introduced_to=self.was_introduced_to,
+ )
+
+ assert (reply or "no_op") in self.list_of_possible_utterances
+
+ return reply, info
+
+ def create_info(self, action, utterance, was_introduced_to):
+ info = {
+ "prim_action": action.__name__ if action is not None else "no_op",
+ "utterance": utterance or "no_op",
+ "was_introduced_to": was_introduced_to
+ }
+ return info
+
+ def is_point_from_loc(self, pos, target_pos=None, distractor_pos=None):
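+ # a position is a valid pointing location if it shares a row/column with the target, has an
+ # unoccluded line towards it, is not ambiguous w.r.t. the distractor, and does not block marbles or doors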
+
+ if target_pos is None:
+ target_pos = self.target_obj.cur_pos
+
+ if distractor_pos is None:
+ if self.distractor_obj is not None:
+ distractor_pos = self.distractor_obj.cur_pos
+ else:
+ distractor_pos = [None, None]
+
+ if self.env.is_in_marble_way(pos):
+ return False
+
+ if self.env.problem in ["Doors", "Levers"]:
+ # must not be in front of a door
+ if abs(self.env.door_current_pos - pos).sum() == 1:
+ return False
+
+ if self.env.problem in ["Doors"]:
+ if abs(self.env.distractor_current_pos - pos).sum() == 1:
+ return False
+
+ if any(pos == target_pos):
+ same_ind = np.argmax(target_pos == pos)
+
+ # is there an occlusion in the way
+ start = pos[1-same_ind]
+ end = target_pos[1-same_ind]
+ step = 1 if start <= end else -1
+ for i in np.arange(start, end, step):
+ p = pos.copy()
+ p[1-same_ind] = i
+ cell = self.env.grid.get(*p)
+
+ if cell is not None:
+ if not cell.see_behind():
+ return False
+
+ if pos[same_ind] != distractor_pos[same_ind]:
+ return True
+
+ if pos[same_ind] == distractor_pos[same_ind]:
+ # if in between
+ if distractor_pos[1-same_ind] < pos[1-same_ind] < target_pos[1-same_ind]:
+ return True
+
+ if distractor_pos[1-same_ind] > pos[1-same_ind] > target_pos[1-same_ind]:
+ return True
+ return False
+
+ def find_point_from_loc(self, target_pos=None, distractor_pos=None):
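+ # sample a free location from which the NPC can unambiguously point at the target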
+ reject_fn = lambda env, p: not self.is_point_from_loc(p, target_pos=target_pos, distractor_pos=distractor_pos)
+
+ point = self.env.find_loc(size=(self.env.wall_x, self.env.wall_y), reject_fn=reject_fn, reject_agent_pos=False)
+
+ # assert all(point < np.array([self.env.wall_x, self.env.wall_y]))
+ # assert all(point > np.array([0, 0]))
+
+ return point
+
+ def goto_point_action(self, point_from_loc, target_pos, distractor_pos):
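+ # point at the target if already standing at a valid pointing location, otherwise stop pointing and walk towards one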
+ if self.is_point_from_loc(self.cur_pos, target_pos=target_pos, distractor_pos=distractor_pos):
+ # point to a direction
+ action = self.compute_wanted_point_action(target_pos)
+
+ else:
+ # do not point if not is_point_from_loc
+ if self.is_pointing():
+ # stop pointing
+ action = self.stop_point
+
+ else:
+ # move
+ action = self.path_to_pos(point_from_loc)
+
+ return action
+
+
+class InformationSeekingEnv(MultiModalMiniGridEnv):
+ """
+ Environment in which the agent is instructed to go to a given object
+ named using an English text string
+ """
+
+ def __init__(
+ self,
+ size=10,
+ diminished_reward=True,
+ step_penalty=False,
+ knowledgeable=False,
+ max_steps=80,
+ hidden_npc=False,
+ switch_no_light=True,
+ reward_diminish_factor=0.1,
+ see_through_walls=False,
+ n_colors=None,
+ egocentric_observation=True,
+ ):
+ assert size >= 5
+ self.empty_symbol = "NA \n"
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+ self.knowledgeable = knowledgeable
+ self.hidden_npc = hidden_npc
+ self.hear_yourself = False
+ self.switch_no_light = switch_no_light
+
+ if n_colors is None:
+ self.n_colors = len(COLOR_NAMES)
+ else:
+ self.n_colors = n_colors
+
+ self.grammar = SocialAIGrammar()
+
+ self.init_done = False
+ # parameters - to be set in reset
+ self.parameters = None
+
+ self.add_npc_direction = True
+ self.add_npc_point_direction = True
+ self.add_npc_last_prim_action = True
+
+ self.reward_diminish_factor = reward_diminish_factor
+
+ self.egocentric_observation = egocentric_observation
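+ # per-cell encoding size: 3 base dims, +2 absolute-position dims if the observation is not egocentric,
+ # +1 for each enabled NPC channel (direction, pointing direction, last primitive action)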
+ self.encoding_size = 3 + 2*bool(not self.egocentric_observation) + bool(self.add_npc_direction) + bool(self.add_npc_point_direction) + bool(self.add_npc_last_prim_action)
+
+ super().__init__(
+ grid_size=size,
+ max_steps=max_steps,
+ # Set this to True for maximum speed
+ see_through_walls=see_through_walls,
+ actions=SocialAIActions, # primitive actions
+ action_space=SocialAIActionSpace,
+ add_npc_direction=self.add_npc_direction,
+ add_npc_point_direction=self.add_npc_point_direction,
+ add_npc_last_prim_action=self.add_npc_last_prim_action,
+ reward_diminish_factor=self.reward_diminish_factor,
+ )
+ self.all_npc_utterance_actions = self.caretaker.list_of_possible_utterances
+ self.prim_actions_dict = SocialAINPCActionsDict
+
+ def revert(self):
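+        # re-place the NPC and the objects (used after the caretaker's demonstration)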
+ self.grid.set(*self.caretaker.cur_pos, None)
+ self.place_npc()
+ self.put_objects_in_env(remove_objects=True)
+
+ def is_in_marble_way(self, pos):
+ target_pos = self.generator_current_pos
+
+        # generator distractor is in the same row / column as the marble and the generator
+ # if self.distractor_current_pos is not None:
+ # distractor_pos = self.distractor_current_pos
+ # else:
+ # distractor_pos = [None, None]
+
+ if self.problem in ["Marbles", "Marble"]:
+            # the location must not be in the same row or column as both the marble and the generator
+ if any((pos == target_pos) * (pos == self.marble_current_pos)):
+ # all three: marble, generator, loc are in the same row or column -> is in its way
+ return True
+
+ if int(self.parameters["N"]) > 1:
+ # is it in the way for the distractor generator
+ if any((pos == self.distractor_current_pos) * (pos == self.marble_current_pos)):
+ # all three: marble, distractor generator, loc are in the same row or column -> is in its way
+ return True
+
+ # all good
+ return False
+
+
+ def _gen_grid(self, width_, height_):
+ # Create the grid
+ self.grid = Grid(width_, height_, nb_obj_dims=self.encoding_size)
+
+ # new
+ min_w = min(9, width_)
+ min_h = min(9, height_)
+ self.current_width = self._rand_int(min_w, width_+1)
+ self.current_height = self._rand_int(min_h, height_+1)
+
+ self.wall_x = self.current_width-1
+ self.wall_y = self.current_height-1
+
+ # problem: Apples/Boxes/Switches/Generators/Marbles
+ self.problem = self.parameters["Problem"] if self.parameters else "Apples"
+ num_of_colors = self.parameters.get("Num_of_colors", None) if self.parameters else None
+ if num_of_colors is None:
+ num_of_colors = self.n_colors
+
+ # additional test for recursivness of joint attention -> cues are given outside of JA
+ self.ja_recursive = self.parameters.get("JA_recursive", False) if self.parameters else False
+
+ self.add_obstacles()
+ if self.obstacles != "No":
+ warnings.warn("InformationSeeking should no be using obstacles.")
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, self.current_width, self.current_height)
+
+ if self.problem in ["Doors", "Levers"]:
+ # Add a second wall: this is needed so that an apple cannot be seen diagonally between the wall and the door
+ self.grid.wall_rect(1, 1, self.wall_x-1, self.wall_y-1)
+
+ # apple
+ self.apple_pos = (self.current_width, self.current_height)
+
+ # box
+ locked = self.problem == "Switches"
+
+ if num_of_colors is None:
+ POSSIBLE_COLORS = COLOR_NAMES.copy()
+
+ else:
+ POSSIBLE_COLORS = COLOR_NAMES[:int(num_of_colors)].copy()
+
+ self.box_color = self._rand_elem(POSSIBLE_COLORS)
+
+ if self.problem in ["Doors", "Levers"]:
+ # door
+
+ # find the position on a wall
+ self.apple_current_pos = self.find_loc(
+ size=(self.current_width, self.current_height),
+ reject_taken_pos=False, # we will create a gap in the wall
+ reject_agent_pos=True,
+ reject_fn=lambda _, pos:
+ not (pos[0] in [0, self.wall_x] or pos[1] in [0, self.wall_y]) or # reject not on a wall
+ tuple(pos) in [
+ (0, 0),
+ (0, 1),
+ (1, 0),
+
+ (0, self.wall_y),
+ (0, self.wall_y-1),
+ (1, self.wall_y),
+
+ (self.wall_x, self.wall_y),
+ (self.wall_x-1, self.wall_y),
+ (self.wall_x, self.wall_y-1),
+
+ (self.wall_x, 0),
+ (self.wall_x, 1),
+ (self.wall_x-1, 0),
+ ]
+ )
+ self.grid.set(*self.apple_current_pos, None) # hole in the wall
+
+ # door is in front of the apple
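+            # the apple sits in the outer wall; map its wall coordinate to the
+            # adjacent interior cell (0 -> 1, wall -> wall - 1), leaving interior
+            # coordinates unchanged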
+ door_x = {
+ 0: 1,
+ self.wall_x: self.wall_x - 1,
+ }.get(self.apple_current_pos[0], self.apple_current_pos[0])
+ door_y = {
+ 0: 1,
+ self.wall_y: self.wall_y - 1,
+ }.get(self.apple_current_pos[1], self.apple_current_pos[1])
+
+ self.door_current_pos = np.array([door_x, door_y])
+ self.grid.set(*self.door_current_pos, None) # hole in the wall
+
+
+ # lever
+ if self.problem in ["Levers"]:
+ self.lever_current_pos = self.find_loc(
+ top=(2, 2),
+ size=(self.current_width-4, self.current_height-4),
+ reject_agent_pos=True,
+ reject_fn=lambda _, pos: next_to(pos, self.door_current_pos) # reject in front of the door
+ )
+
+ else:
+ # find the position for the apple/box/generator_platform
+ self.apple_current_pos = self.find_loc(size=self.apple_pos, reject_agent_pos=True)
+ assert all(self.apple_current_pos < np.array([self.current_width-1, self.current_height-1]))
+
+ # door
+ self.door_color = self._rand_elem(POSSIBLE_COLORS)
+
+ # lever
+ self.lever_color = self._rand_elem(POSSIBLE_COLORS)
+
+ # switch
+ self.switch_pos = (self.current_width, self.current_height)
+ self.switch_color = self._rand_elem(POSSIBLE_COLORS)
+ self.switch_current_pos = self.find_loc(
+ size=self.switch_pos,
+ reject_agent_pos=True,
+ reject_fn=lambda _, pos: tuple(pos) in map(tuple, [self.apple_current_pos]),
+ )
+
+ # generator
+ self.generator_pos = (self.current_width, self.current_height)
+ self.generator_color = self._rand_elem(POSSIBLE_COLORS)
+ self.generator_current_pos = self.find_loc(
+ size=self.generator_pos,
+ reject_agent_pos=True,
+ reject_fn=lambda _, pos: (
+ tuple(pos) in map(tuple, [self.apple_current_pos])
+ or
+ (self.problem in ["Marble"] and tuple(pos) in [
+ # not in corners
+ (1, 1),
+ (self.current_width-2, 1),
+ (1, self.current_height-2),
+ (self.current_width-2, self.current_height-2),
+ ])
+ or
+                # not in the same row or column as the platform
+ (self.problem in ["Marble"] and any(pos == self.apple_current_pos))
+ ),
+ )
+
+ # generator platform
+ self.generator_platform_color = self._rand_elem(POSSIBLE_COLORS)
+
+ # marbles
+ self.marble_pos = (self.current_width, self.current_height)
+ self.marble_color = self._rand_elem(POSSIBLE_COLORS)
+ self.marble_current_pos = self.find_loc(
+ size=self.marble_pos,
+ reject_agent_pos=True,
+ reject_fn=lambda _, pos: self.problem in ["Marbles", "Marble"] and (
+ tuple(pos) in map(tuple, [self.apple_current_pos, self.generator_current_pos])
+ or
+ all(pos != self.generator_current_pos) # reject if not in row or column as the generator
+ or
+ any(pos == 1) # next to a wall
+ or
+ pos[1] == self.current_height-2
+ or
+ pos[0] == self.current_width-2
+ ),
+ )
+
+ # distractor
+ if self.problem == "Boxes":
+ assert not locked
+ POSSIBLE_COLORS.remove(self.box_color)
+
+ elif self.problem == "Doors":
+ POSSIBLE_COLORS.remove(self.door_color)
+
+ elif self.problem == "Levers":
+ POSSIBLE_COLORS.remove(self.lever_color)
+
+ elif self.problem == "Switches":
+ POSSIBLE_COLORS.remove(self.switch_color)
+
+ elif self.problem in ["Generators", "Marble"]:
+ POSSIBLE_COLORS.remove(self.generator_color)
+
+ self.distractor_color = self._rand_elem(POSSIBLE_COLORS)
+ self.distractor_pos = (self.current_width, self.current_height)
+
+ # distractor reject function
+ if self.problem in ["Apples", "Boxes"]:
+ distractor_reject_fn = lambda _, pos: tuple(pos) in map(tuple, [self.apple_current_pos])
+
+ elif self.problem in ["Switches"]:
+ distractor_reject_fn = lambda _, pos: tuple(pos) in map(tuple, [self.apple_current_pos, self.switch_current_pos])
+
+ elif self.problem in ["Generators"]:
+ distractor_reject_fn = lambda _, pos: tuple(pos) in map(tuple, [self.apple_current_pos, self.generator_current_pos])
+
+ elif self.problem in ["Marble"]:
+ # problem is marbles
+ if self.parameters["N"] == "1":
+ distractor_reject_fn = lambda _, pos: tuple(pos) in map(tuple, [self.apple_current_pos, self.generator_current_pos, self.marble_current_pos])
+ else:
+ same_dim = (self.generator_current_pos == self.marble_current_pos).argmax()
+                distractor_same_dim = 1 - same_dim
+ distractor_reject_fn = lambda _, pos: tuple(pos) in map(tuple, [
+ self.apple_current_pos,
+ self.generator_current_pos,
+ self.marble_current_pos
+                ]) or pos[distractor_same_dim] != self.marble_current_pos[distractor_same_dim]
+
+ elif self.problem in ["Doors"]:
+ # reject not next to a wall
+ distractor_reject_fn = lambda _, pos: (
+ not (pos[0] in [1, self.wall_x-1] or pos[1] in [1, self.wall_y-1]) or # reject not on a wall
+ tuple(pos) in [
+ (1, 1),
+ (self.wall_x-1, self.wall_y - 1),
+ (1, self.wall_y-1),
+ (self.wall_x-1, 1),
+ tuple(self.door_current_pos)
+ ]
+ )
+
+ elif self.problem in ["Levers"]:
+ # not in front of the door
+ distractor_reject_fn = lambda _, pos: next_to(pos, self.door_current_pos) or tuple(pos) in list(map(tuple, [self.door_current_pos, self.lever_current_pos]))
+
+ else:
+ raise ValueError("Problem {} indefined.".format(self.problem))
+
+ if self.problem == "Doors":
+
+ self.distractor_current_pos = self.find_loc(
+ top=(1, 1),
+ size=(self.current_width-2, self.current_height-2),
+ reject_agent_pos=True,
+ reject_fn=distractor_reject_fn,
+ reject_taken_pos=False
+ )
+
+ if self.parameters["N"] != "1":
+ self.grid.set(*self.distractor_current_pos, None) # hole in the wall
+ else:
+ self.distractor_current_pos = self.find_loc(
+ size=self.distractor_pos,
+ reject_agent_pos=True,
+ reject_fn=distractor_reject_fn
+ )
+
+ self.put_objects_in_env()
+
+
+ # NPC
+ put_peer = self.parameters["Peer"] if self.parameters else "N"
+ assert put_peer in ["Y", "N"]
+
+ color = self._rand_elem(COLOR_NAMES)
+ self.caretaker = Caretaker(color, "Caretaker", self)
+
+ if put_peer == "Y":
+ self.place_npc()
+
+
+ # Randomize the agent's start position and orientation
+ self.place_agent(size=(self.current_width, self.current_height))
+
+ # Generate the mission string
+ self.mission = 'lets collaborate'
+
+ # Dummy beginning string
+ # self.beginning_string = "This is what you hear. \n"
+ self.beginning_string = "Conversation: \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ # used for rendering
+ self.full_conversation = self.utterance
+ self.outcome_info = None
+
+ def place_npc(self):
+ if self.problem in ["Doors"]:
+ self.place_obj(
+ self.caretaker,
+ size=(self.current_width, self.current_height),
+ reject_fn=lambda _, pos: next_to(pos, self.door_current_pos) or next_to(pos, self.distractor_current_pos)
+ )
+
+ elif self.problem in ["Levers"]:
+ self.place_obj(
+ self.caretaker,
+ size=(self.current_width, self.current_height),
+ reject_fn=lambda _, pos: next_to(pos, self.door_current_pos)
+ )
+
+ else:
+ self.place_obj(self.caretaker, size=(self.current_width, self.current_height), reject_fn=InformationSeekingEnv.is_in_marble_way)
+
+ self.caretaker.initial_pos = self.caretaker.cur_pos
+
+ def put_objects_in_env(self, remove_objects=False):
+
+ assert self.apple_current_pos is not None
+ assert self.switch_current_pos is not None
+
+ self.doors_block_set = []
+ self.levers_block_set = []
+ self.switches_block_set = []
+ self.boxes_block_set = []
+ self.generators_block_set = []
+
+ self.distractor_door = None
+ self.distractor_lever = None
+ self.distractor_box = None
+ self.distractor_switch = None
+ self.distractor_generator = None
+
+ # problem: Apples/Boxes/Switches/Generators
+        assert self.problem == (self.parameters["Problem"] if self.parameters else "Apples")
+
+ # move objects (used only in revert), not in gen_grid
+ if remove_objects:
+ # remove apple or box
+ # assert type(self.grid.get(*self.apple_current_pos)) in [Apple, LockableBox]
+ # self.grid.set(*self.apple_current_pos, None)
+
+ # remove apple (after demo it must be an apple)
+ assert type(self.grid.get(*self.apple_current_pos)) in [Apple]
+ self.grid.set(*self.apple_current_pos, None)
+
+ if self.problem in ["Doors"]:
+ # assert type(self.grid.get(*self.door_current_pos)) in [Door]
+ self.grid.set(*self.door.cur_pos, None)
+
+ elif self.problem in ["Levers"]:
+ # assert type(self.grid.get(*self.door_current_pos)) in [Door]
+ self.grid.set(*self.remote_door.cur_pos, None)
+ self.grid.set(*self.lever.cur_pos, None)
+
+ elif self.problem in ["Switches"]:
+ # remove switch
+ assert type(self.grid.get(*self.switch_current_pos)) in [Switch]
+ self.grid.set(*self.switch.cur_pos, None)
+
+ elif self.problem in ["Generators", "Marbles", "Marble"]:
+ # remove generator
+ assert type(self.grid.get(*self.generator.cur_pos)) in [AppleGenerator]
+ self.grid.set(*self.generator.cur_pos, None)
+
+ if self.problem in ["Marbles", "Marble"]:
+ # remove generator
+ assert type(self.grid.get(*self.marble.cur_pos)) in [Marble]
+ self.grid.set(*self.marble.cur_pos, None)
+
+ if self.marble.tee_uncovered:
+ self.grid.set(*self.marble.tee.cur_pos, None)
+
+ elif self.problem in ["Apples", "Boxes"]:
+ pass
+
+ else:
+ raise ValueError("Undefined problem {}".format(self.problem))
+
+ # remove distractor
+ if self.problem in ["Boxes", "Switches", "Generators", "Marbles", "Marble", "Doors", "Levers"] and self.parameters["N"] != "1":
+ assert type(self.grid.get(*self.distractor_current_pos)) in [LockableBox, Switch, AppleGenerator, Door, Lever]
+ self.grid.set(*self.distractor_current_pos, None)
+
+ # apple
+ self.apple = Apple()
+
+ # Box
+ locked = self.problem == "Switches"
+
+ self.box = LockableBox(
+ self.box_color,
+ contains=self.apple,
+ is_locked=locked,
+ block_set=self.boxes_block_set
+ )
+ self.boxes_block_set.append(self.box)
+
+ # Doors
+ self.door = Door(
+ color=self.door_color,
+ is_locked=False,
+ block_set=self.doors_block_set,
+ )
+ self.doors_block_set.append(self.door)
+
+ # Levers
+ self.remote_door = RemoteDoor(
+ color=self.door_color,
+ )
+
+ self.lever = Lever(
+ color=self.lever_color,
+ object=self.remote_door,
+ active_steps=None,
+ block_set=self.levers_block_set,
+ )
+ self.levers_block_set.append(self.lever)
+
+ # Switch
+ self.switch = Switch(
+ color=self.switch_color,
+ lockable_object=self.box,
+ locker_switch=True,
+ no_turn_off=True,
+ no_light=self.switch_no_light,
+ block_set=self.switches_block_set,
+ )
+ self.switches_block_set.append(self.switch)
+
+ # Generator
+ self.generator = AppleGenerator(
+ self.generator_color,
+ block_set=self.generators_block_set,
+ # on_push=lambda: self.put_obj_np(self.apple, self.apple_current_pos)
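+            # activating the generator places the apple on the generator platform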
+ on_push=lambda: self.grid.set(*self.apple_current_pos, self.apple),
+ marble_activation=self.problem in ["Marbles", "Marble"],
+ )
+ self.generators_block_set.append(self.generator)
+
+ self.generator_platform = GeneratorPlatform(self.generator_platform_color)
+
+ self.marble = Marble(self.marble_color, env=self)
+
+ if self.problem in ["Apples"]:
+ self.put_obj_np(self.apple, self.apple_current_pos)
+
+ elif self.problem in ["Doors"]:
+ self.put_obj_np(self.apple, self.apple_current_pos)
+ self.put_obj_np(self.door, self.door_current_pos)
+
+ elif self.problem in ["Levers"]:
+ self.put_obj_np(self.apple, self.apple_current_pos)
+ self.put_obj_np(self.remote_door, self.door_current_pos)
+ self.put_obj_np(self.lever, self.lever_current_pos)
+
+ elif self.problem in ["Boxes"]:
+ self.put_obj_np(self.box, self.apple_current_pos)
+
+ elif self.problem in ["Switches"]:
+ self.put_obj_np(self.box, self.apple_current_pos)
+ self.put_obj_np(self.switch, self.switch_current_pos)
+
+ elif self.problem in ["Generators", "Marbles", "Marble"]:
+ self.put_obj_np(self.generator, self.generator_current_pos)
+ self.put_obj_np(self.generator_platform, self.apple_current_pos)
+
+ if self.problem in ["Marbles", "Marble"]:
+ self.put_obj_np(self.marble, self.marble_current_pos)
+ else:
+ raise ValueError("Problem {} not defined. ".format(self.problem))
+
+ # Distractors
+ if self.problem not in ["Apples"]:
+
+ N = int(self.parameters["N"])
+ if N > 1:
+ assert N == 2
+
+ if self.problem == "Boxes":
+ assert not locked
+
+ self.distractor_box = LockableBox(
+ self.distractor_color,
+ is_locked=locked,
+ block_set=self.boxes_block_set,
+ )
+ self.boxes_block_set.append(self.distractor_box)
+
+ self.put_obj_np(self.distractor_box, self.distractor_current_pos)
+
+ elif self.problem == "Doors":
+ self.distractor_door = Door(
+ color=self.distractor_color,
+ is_locked=False,
+ block_set=self.doors_block_set,
+ )
+ self.doors_block_set.append(self.distractor_door)
+
+ self.put_obj_np(self.distractor_door, self.distractor_current_pos)
+
+ elif self.problem == "Levers":
+ self.distractor_lever = Lever(
+ color=self.distractor_color,
+ active_steps=None,
+ block_set=self.levers_block_set,
+ )
+ self.levers_block_set.append(self.distractor_lever)
+ self.put_obj_np(self.distractor_lever, self.distractor_current_pos)
+
+ elif self.problem == "Switches":
+ self.distractor_switch = Switch(
+ color=self.distractor_color,
+ locker_switch=True,
+ no_turn_off=True,
+ no_light=self.switch_no_light,
+ block_set=self.switches_block_set,
+ )
+ self.switches_block_set.append(self.distractor_switch)
+
+ self.put_obj_np(self.distractor_switch, self.distractor_current_pos)
+
+ elif self.problem in ["Generators", "Marbles", "Marble"]:
+ self.distractor_generator = AppleGenerator(
+ color=self.distractor_color,
+ block_set=self.generators_block_set,
+ marble_activation=self.problem in ["Marbles", "Marble"],
+ )
+ self.generators_block_set.append(self.distractor_generator)
+
+ self.put_obj_np(self.distractor_generator, self.distractor_current_pos)
+
+ else:
+ raise ValueError("Undefined N for problem {}".format(self.problem))
+
+ def reset(
+ self, *args, **kwargs
+ ):
+ # This env must be used inside the parametric env
+ if not kwargs:
+            # The only place where kwargs can be empty is during the class construction
+ # reset should be called again before using the env (paramenv does it in its constructor)
+ assert self.parameters is None
+ assert not self.init_done
+ self.init_done = True
+
+ obs = super().reset()
+ return obs
+
+ else:
+ assert self.init_done
+
+ self.parameters = dict(kwargs)
+
+ assert self.parameters is not None
+ assert len(self.parameters) > 0
+
+ obs = super().reset()
+
+ self.agent_ate_the_apple = False
+ self.agent_opened_the_box = False
+ self.agent_opened_the_door = False
+ self.agent_pulled_the_lever = False
+ self.agent_turned_on_the_switch = False
+ self.agent_pressed_the_generator = False
+ self.agent_pushed_the_marble = False
+
+ return obs
+
+ def step(self, action):
+
+ success = False
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ apple_had_been_eaten = self.apple.eaten
+ box_had_been_opened = self.box.is_open
+ door_had_been_opened = self.door.is_open
+ lever_had_been_pulled = self.lever.is_on
+ switch_had_been_turned_on = self.switch.is_on
+ generator_had_been_pressed = self.generator.is_pressed
+ marble_had_been_pushed = self.marble.was_pushed
+
+ # primitive actions
+ _, reward, done, info = super().step(p_action)
+
+ if self.problem in ["Marbles", "Marble"]:
+            # todo: create steppable objects which are stepped automatically?
+ self.marble.step()
+
+ # eaten just now by primitive actions of the agent
+ if not self.agent_ate_the_apple:
+ self.agent_ate_the_apple = self.apple.eaten and not apple_had_been_eaten
+
+ if not self.agent_opened_the_box:
+ self.agent_opened_the_box = self.box.is_open and not box_had_been_opened
+
+ if not self.agent_opened_the_door:
+ self.agent_opened_the_door = self.door.is_open and not door_had_been_opened
+
+ if not self.agent_pulled_the_lever:
+ self.agent_pulled_the_lever = self.lever.is_on and not lever_had_been_pulled
+
+ if not self.agent_turned_on_the_switch:
+ self.agent_turned_on_the_switch = self.switch.is_on and not switch_had_been_turned_on
+
+ if not self.agent_pressed_the_generator:
+ self.agent_pressed_the_generator = self.generator.is_pressed and not generator_had_been_pressed
+
+ if not self.agent_pushed_the_marble:
+ self.agent_pushed_the_marble = self.marble.was_pushed and not marble_had_been_pushed
+
+ # utterances
+ agent_spoke = not all(np.isnan(utterance_action))
+ if agent_spoke:
+ utterance = self.grammar.construct_utterance(utterance_action)
+
+ if self.hear_yourself:
+ self.utterance += "YOU: {} \n".format(utterance)
+ self.full_conversation += "YOU: {} \n".format(utterance)
+ else:
+ utterance = None
+
+ if self.parameters["Peer"] == "Y":
+ reply, npc_info = self.caretaker.step(utterance)
+ else:
+ reply = None
+ npc_info = self.caretaker.create_info(
+ action=None,
+ utterance=None,
+ was_introduced_to=False
+ )
+
+ if reply:
+ self.utterance += "{}: {} \n".format(self.caretaker.name, reply)
+ self.full_conversation += "{}: {} \n".format(self.caretaker.name, reply)
+
+ # aftermath
+ if p_action == self.actions.done:
+ done = True
+
+ elif self.agent_ate_the_apple:
+ # check that it is the agent who ate it
+ assert self.actions(p_action) == self.actions.toggle
+ assert self.get_cell(*self.front_pos) == self.apple
+
+ if self.parameters.get("Cue_type", "nan") == "Emulation":
+
+ # during emulation it can be the NPC who eats the apple, opens the box, and turns on the switch
+ if self.parameters["Scaffolding"] and self.caretaker.apple_unlocked_for_agent:
+ # if the caretaker unlocked the apple the agent gets reward upon eating it
+ reward = self._reward()
+ success = True
+
+ elif self.problem == "Apples":
+ reward = self._reward()
+ success = True
+
+ elif self.problem == "Doors" and self.agent_opened_the_door:
+ reward = self._reward()
+ success = True
+
+ elif self.problem == "Levers" and self.agent_pulled_the_lever:
+ reward = self._reward()
+ success = True
+
+ elif self.problem == "Boxes" and self.agent_opened_the_box:
+ reward = self._reward()
+ success = True
+
+ elif self.problem == "Switches" and self.agent_opened_the_box and self.agent_turned_on_the_switch:
+ reward = self._reward()
+ success = True
+
+ elif self.problem == "Generators" and self.agent_pressed_the_generator:
+ reward = self._reward()
+ success = True
+
+ elif self.problem in ["Marble"] and self.agent_pushed_the_marble:
+ reward = self._reward()
+ success = True
+
+ else:
+ reward = self._reward()
+ success = True
+
+ done = True
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ # update obs with NPC movement
+ obs = self.gen_obs(full_obs=self.full_obs)
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ if done:
+ if reward > 0:
+ self.outcome_info = "SUCCESS: agent got {} reward \n".format(np.round(reward, 1))
+ else:
+ self.outcome_info = "FAILURE: agent got {} reward \n".format(reward)
+
+ # is the npc seen by the agent
+ ag_view_npc = self.relative_coords(*self.caretaker.cur_pos)
+ if ag_view_npc is not None:
+ # in the agent's field of view
+ ag_view_npc_x, ag_view_npc_y = ag_view_npc
+
+ n_dims = obs['image'].shape[-1]
+ npc_encoding = self.caretaker.encode(n_dims)
+
+ # is it occluded
+ npc_observed = all(obs['image'][ag_view_npc_x, ag_view_npc_y] == npc_encoding)
+ else:
+ npc_observed = False
+
+ info = {**info, **{"NPC_"+k: v for k, v in npc_info.items()}}
+
+ info["NPC_observed"] = npc_observed
+ info["success"] = success
+
+ assert success == (reward > 0)
+
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ def render(self, *args, **kwargs):
+ obs = super().render(*args, **kwargs)
+ if args[0] == 'human':
+ self.window.clear_text() # erase previous text
+ self.window.set_caption(self.full_conversation)
+
+ # self.window.ax.set_title("correct color: {}".format(self.box.target_color), loc="left", fontsize=10)
+
+ if self.outcome_info:
+ color = None
+ if "SUCCESS" in self.outcome_info:
+ color = "lime"
+ elif "FAILURE" in self.outcome_info:
+ color = "red"
+ self.window.add_text(*(0.01, 0.85, self.outcome_info),
+ **{'fontsize': 15, 'color': color, 'weight': "bold"})
+
+ self.window.show_img(obs) # re-draw image to add changes to window
+ return obs
+
+
+register(
+ id='SocialAI-InformationSeeking-v0',
+ entry_point='gym_minigrid.social_ai_envs:InformationSeekingEnv'
+)
\ No newline at end of file
diff --git a/gym-minigrid/gym_minigrid/social_ai_envs/leverdoorenv.py b/gym-minigrid/gym_minigrid/social_ai_envs/leverdoorenv.py
new file mode 100644
index 0000000000000000000000000000000000000000..580b97776db4bc8bc6c064902a2279faec0ac970
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/social_ai_envs/leverdoorenv.py
@@ -0,0 +1,569 @@
+import time
+import random
+
+import numpy as np
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+from gym_minigrid.social_ai_envs.socialaigrammar import SocialAIGrammar, SocialAIActions, SocialAIActionSpace
+from collections import deque
+
+
+class Partner(NPC):
+ """
+    A collaborative NPC partner that takes the role opposite to the agent:
+    on the left side it pulls the lever that opens the door, on the right side it
+    pushes the apple generator
+ """
+ def __init__(self, color, name, env):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+ self.npc_dir = 1 # NPC initially looks downward
+ self.npc_type = 0 # this will be put into the encoding
+
+ # opposite role of the agent
+ self.npc_side = "L" if self.env.agent_side == "R" else "R"
+
+        # how many random actions at the beginning -> removes trivial solutions
+ self.random_to_go = random.randint(self.env.lever_active_steps, 10)
+
+ assert set([self.npc_side, self.env.agent_side]) == {"L", "R"}
+
+ self.was_introduced_to = False
+
+ self.ate_an_apple = False
+ self.pressed_the_lever = False
+ self.pushed_the_generator = False
+ self.toggling = False
+
+ # target obj
+        assert self.env.problem == (self.env.parameters["Problem"] if self.env.parameters else "LeverDoor")
+
+ self.target_obj = self.env.generator
+
+ assert self.env.grammar.contains_utterance(self.introduction_statement)
+
+
+ def step(self, utterance):
+ reply, info = super().step()
+
+ if self.env.hidden_npc:
+ return reply, info
+
+ reply, action = self.handle_introduction(utterance)
+
+ if self.was_introduced_to:
+
+ if self.random_to_go > 0:
+ action = random.choice([self.go_forward, self.rotate_left, self.rotate_right])
+ self.random_to_go -= 1
+
+ elif self.npc_side == "L":
+
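+                # left-side partner: walk to the lever, then watch the agent and
+                # pull the lever once the agent is next to the door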
+ if not self.pressed_the_lever:
+ # is the NPC next to the lever
+ if np.abs(self.env.lever.cur_pos - self.cur_pos).sum() > 1:
+ # go to the lever
+ action = self.path_to_pos(self.env.lever.cur_pos)
+ else:
+
+ # look at the agent
+ wanted_dir = self.compute_wanted_dir(self.env.agent_pos)
+
+ if wanted_dir != self.npc_dir and not self.toggling:
+ # turn to look at the agent
+ action = self.compute_turn_action(wanted_dir)
+ else:
+ # check if the agent is next to the door
+ if np.abs(self.env.door.cur_pos - self.env.agent_pos).sum() <= 1:
+ self.toggling = True
+ action = self.path_to_toggle_pos(self.env.lever.cur_pos)
+
+ elif self.npc_side == "R":
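+                # right-side partner: head for the generator; while the door is
+                # closed no path exists, so approach the door and wait there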
+ if not self.pushed_the_generator:
+ # go to generator and push it
+ action = self.path_to_pos(self.env.generator.cur_pos)
+
+ if action is None:
+ # the door is not open, no paths exist
+ action = self.path_to_pos(self.env.door.cur_pos)
+
+ else:
+ raise ValueError("Undefined role")
+
+ eaten_before = self.env.partner_apple.eaten
+ lever_on_before = self.env.lever.is_on
+ generator_pushed_before = self.env.generator.is_pressed
+
+ if action is not None:
+ action()
+
+ # check if the NPC ate the apple
+ if not self.ate_an_apple:
+ self.ate_an_apple = not eaten_before and self.env.partner_apple.eaten
+
+ if not self.pressed_the_lever:
+ self.pressed_the_lever = not lever_on_before and self.env.lever.is_on
+
+ if not self.pushed_the_generator:
+ self.pushed_the_generator = not generator_pushed_before and self.env.generator.is_pressed
+
+ info = {
+ "prim_action": action.__name__ if action is not None else "no_op",
+ "utterance": reply or "no_op",
+ "was_introduced_to": self.was_introduced_to
+ }
+
+ assert (reply or "no_op") in self.list_of_possible_utterances
+
+ return reply, info
+
+
+class LeverDoorEnv(MultiModalMiniGridEnv):
+ """
+    Lever-Door environment: the room is split by a fence, with an apple generator
+    behind a remote door on the right side and the lever that opens the door on the
+    left side; in the Social version the agent and a partner NPC must coordinate
+ """
+
+ def __init__(
+ self,
+ size=10,
+ diminished_reward=True,
+ step_penalty=False,
+ knowledgeable=False,
+ max_steps=80,
+ hidden_npc=False,
+ lever_active_steps=10,
+ reward_diminish_factor=0.1,
+ egocentric_observation=True,
+ ):
+ assert size >= 5
+ self.empty_symbol = "NA \n"
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+ self.knowledgeable = knowledgeable
+ self.hidden_npc = hidden_npc
+ self.hear_yourself = False
+
+ self.grammar = SocialAIGrammar()
+
+ self.init_done = False
+ # parameters - to be set in reset
+ self.parameters = None
+
+        # NPC feature flags - each enabled flag adds one dimension to the cell encoding
+ self.add_npc_direction = True
+ self.add_npc_point_direction = True
+ self.add_npc_last_prim_action = True
+ self.lever_active_steps = lever_active_steps
+
+ self.reward_diminish_factor = reward_diminish_factor
+
+ self.egocentric_observation = egocentric_observation
+ self.encoding_size = 3 + 2*bool(not self.egocentric_observation) + bool(self.add_npc_direction) + bool(self.add_npc_point_direction) + bool(self.add_npc_last_prim_action)
+
+ super().__init__(
+ grid_size=size,
+ max_steps=max_steps,
+ # Set this to True for maximum speed
+ see_through_walls=False,
+ actions=SocialAIActions, # primitive actions
+ action_space=SocialAIActionSpace,
+ add_npc_direction=self.add_npc_direction,
+ add_npc_point_direction=self.add_npc_point_direction,
+ add_npc_last_prim_action=self.add_npc_last_prim_action,
+ reward_diminish_factor=self.reward_diminish_factor,
+ )
+
+ self.all_npc_utterance_actions = Partner.get_list_of_possible_utterances()
+ self.prim_actions_dict = SocialAINPCActionsDict
+
+ def _gen_grid(self, width_, height_):
+ # Create the grid
+ self.grid = Grid(width_, height_, nb_obj_dims=self.encoding_size)
+
+ min_w = min(9, width_)
+ min_h = min(9, height_)
+ self.current_width = self._rand_int(min_w, width_+1)
+ self.current_height = self._rand_int(min_h, height_+1)
+ # print("Room size: {}x{}".format(self.current_width, self.current_height))
+
+ # previous
+ # self.current_width = self._rand_int(6, width_+1)
+ # self.current_height = self._rand_int(6, height_+1)
+
+ # original
+ # self.current_width = self._rand_int(5, width_+1)
+ # self.current_height = self._rand_int(5, height_+1)
+
+ # self.current_width = width_
+ # self.current_height = height_
+ # self.current_width = 8
+ # self.current_height = 8
+ # warnings.warn("env size fixed: {}x{}".format(self.current_width, self.current_height))
+
+ self.wall_x = self.current_width-1
+ self.wall_y = self.current_height-1
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, self.current_width, self.current_height)
+
+ self.problem = self.parameters["Problem"] if self.parameters else "LeverDoor"
+ self.version = self.parameters["Version"] if self.parameters else "Asocial"
+ self.role = self.parameters["Role"] if self.parameters else "A"
+ assert self.role in ["A", "B", "Meta"]
+
+ if self.role in ["B", "Meta"]:
+ self.agent_side = "R" # starts on the right side
+ else:
+ self.agent_side = "L" # starts on the right side
+
+ num_of_colors = self.parameters.get("Num_of_colors", None) if self.parameters else None
+
+ self.add_obstacles()
+
+ # apple
+ if num_of_colors is None:
+ POSSIBLE_COLORS = COLOR_NAMES
+
+ else:
+ POSSIBLE_COLORS = COLOR_NAMES[:int(num_of_colors)]
+
+ self.left_half_size = (self.current_width//2, self.current_height)
+ self.left_half_top = (0, 0)
+
+ self.right_half_size = (self.current_width//2, self.current_height)
+ self.right_half_top = (self.current_width - self.current_width // 2, 0)
+
+ # generator
+ self.generator_pos = (self.current_width//2, self.current_height)
+ self.generator_color = self._rand_elem(POSSIBLE_COLORS)
+
+ self.generator_current_pos = self.find_loc(
+ # on the right wall
+ top=(self.current_width-1, 1),
+ size=(1, self.current_height-2),
+ reject_agent_pos=True,
+ reject_taken_pos=False, # so that it can be placed on the wall
+ )
+
+ # create hole in the wall for the generator
+ assert type(self.grid.get(*self.generator_current_pos)) == Wall
+ self.grid.set(*self.generator_current_pos, None)
+
+ # add fence to grid
+ self.grid.vert_wall(
+ x=self.current_width//2,
+ y=1,
+ length=self.current_height - 2,
+ obj_type=Fence
+ )
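+        # the fence splits the room into a left and a right half; the agent and
+        # the partner NPC start on opposite sides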
+
+ # door in front of generator
+ self.door_current_pos = self.generator_current_pos - np.array([1, 0])
+ self.door_color = self._rand_elem(POSSIBLE_COLORS)
+
+ # lever
+ self.lever_current_pos = self.find_loc(
+ top=self.left_half_top, size=self.left_half_size, reject_agent_pos=True,
+ reject_fn=lambda _, pos: tuple(pos) in map(tuple, [
+ self.door_current_pos])
+ )
+ self.lever_color = self._rand_elem(POSSIBLE_COLORS)
+
+ # generator platform
+
+ # find the position for generator_platforms
+ self.left_apple_current_pos = self.find_loc(
+ top=self.left_half_top, size=self.left_half_size, reject_agent_pos=True,
+ reject_fn=lambda _, pos: tuple(pos) in map(tuple, [
+ self.generator_current_pos, self.door_current_pos, self.lever_current_pos])
+ )
+
+ self.right_apple_current_pos = self.find_loc(
+ top=self.right_half_top, size=self.right_half_size, reject_agent_pos=True,
+ reject_fn=lambda _, pos: tuple(pos) in map(tuple, [
+ self.generator_current_pos, self.door_current_pos, self.lever_current_pos])
+ )
+
+ assert all(self.left_apple_current_pos < np.array([self.current_width - 1, self.current_height - 1]))
+ assert all(self.right_apple_current_pos < np.array([self.current_width - 1, self.current_height - 1]))
+
+ self.agent_generator_platform_color = self._rand_elem(POSSIBLE_COLORS)
+ self.partner_generator_platform_color = self._rand_elem(POSSIBLE_COLORS)
+
+ self.put_objects_in_env()
+
+ # agent
+ if self.agent_side == "L":
+ self.place_agent(size=self.left_half_size, top=self.left_half_top)
+ else:
+ self.place_agent(size=self.right_half_size, top=self.right_half_top)
+
+ if self.version == "Social":
+ # NPC
+ self.npc_color = self._rand_elem(COLOR_NAMES)
+ self.caretaker = Partner(self.npc_color, "Partner", self)
+
+ if self.agent_side == "L":
+ self.place_obj(self.caretaker, size=self.right_half_size, top=self.right_half_top)
+ else:
+ self.place_obj(self.caretaker, size=self.left_half_size, top=self.left_half_top)
+
+ # Generate the mission string
+ self.mission = 'lets collaborate'
+
+ # Dummy beginning string
+ # self.beginning_string = "This is what you hear. \n"
+ self.beginning_string = "Conversation: \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ # used for rendering
+ self.full_conversation = self.utterance
+ self.outcome_info = None
+
+ def put_objects_in_env(self, remove_objects=False):
+
+ assert self.left_apple_current_pos is not None
+ assert self.right_apple_current_pos is not None
+ assert self.generator_current_pos is not None
+ assert self.agent_generator_platform_color is not None
+ assert self.partner_generator_platform_color is not None
+
+        assert self.problem == (self.parameters["Problem"] if self.parameters else "LeverDoor")
+
+ if remove_objects:
+ self.grid.set(*self.agent_generator_platform.cur_pos, None) # remove apple
+ self.grid.set(*self.partner_generator_platform.cur_pos, None) # remove apple
+ self.grid.set(*self.generator.cur_pos, None) # remove generator
+ self.grid.set(*self.door.cur_pos, None) # remove door
+ self.grid.set(*self.lever.cur_pos, None) # remove lever
+
+ # apple
+ self.agent_apple = Apple()
+ self.partner_apple = Apple()
+
+ def generate_apples():
+ if self.agent_side == "L":
+                self.grid.set(*self.left_apple_current_pos, self.agent_apple)
+                self.grid.set(*self.right_apple_current_pos, self.partner_apple)
+            else:
+                self.grid.set(*self.left_apple_current_pos, self.partner_apple)
+                self.grid.set(*self.right_apple_current_pos, self.agent_apple)
+
+ # Generator
+ self.generator = AppleGenerator(
+ self.generator_color,
+ on_push=generate_apples,
+ )
+
+ door_open = self.version == "Asocial"
+ self.door = RemoteDoor(color=self.door_color, is_open=door_open)
+
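+        # the lever remotely opens the door (in the Asocial version the door starts
+        # open); active_steps presumably limits how long the lever keeps it open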
+ self.lever = Lever(color=self.lever_color, object=self.door, active_steps=self.lever_active_steps)
+
+ self.agent_generator_platform = GeneratorPlatform(self.agent_generator_platform_color)
+ self.partner_generator_platform = GeneratorPlatform(self.partner_generator_platform_color)
+
+
+ self.put_obj_np(self.agent_generator_platform, self.left_apple_current_pos)
+ self.put_obj_np(self.partner_generator_platform, self.right_apple_current_pos)
+
+ self.put_obj_np(self.generator, self.generator_current_pos)
+ self.put_obj_np(self.door, self.door_current_pos)
+ self.put_obj_np(self.lever, self.lever_current_pos)
+
+
+
+ def reset(
+ self, *args, **kwargs
+ ):
+ # This env must be used inside the parametric env
+ if not kwargs:
+            # The only place where kwargs can be empty is during the class construction
+ # reset should be called again before using the env (paramenv does it in its constructor)
+ assert self.parameters is None
+ assert not self.init_done
+ self.init_done = True
+
+ obs = super().reset()
+ return obs
+
+ else:
+ assert self.init_done
+
+ self.parameters = dict(kwargs)
+
+ assert self.parameters is not None
+ assert len(self.parameters) > 0
+
+ obs = super().reset()
+
+ self.agent_ate_the_apple = False
+ self.agent_opened_the_box = False
+ self.agent_turned_on_the_switch = False
+ self.agent_pressed_the_generator = False
+
+ return obs
+
+ def step(self, action):
+ success = False
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ apple_had_been_eaten = self.agent_apple.eaten
+ generator_had_been_pressed = self.generator.is_pressed
+
+ # primitive actions
+ _, reward, done, info = super().step(p_action)
+
+ self.lever.step()
+
+ # eaten just now by primitive actions of the agent
+ if not self.agent_ate_the_apple:
+ self.agent_ate_the_apple = self.agent_apple.eaten and not apple_had_been_eaten
+
+ if not self.agent_pressed_the_generator:
+ self.agent_pressed_the_generator = self.generator.is_pressed and not generator_had_been_pressed
+
+
+ # utterances
+ agent_spoke = not all(np.isnan(utterance_action))
+ if agent_spoke:
+ utterance = self.grammar.construct_utterance(utterance_action)
+
+ if self.hear_yourself:
+ self.utterance += "YOU: {} \n".format(utterance)
+ self.full_conversation += "YOU: {} \n".format(utterance)
+ else:
+ utterance = None
+
+ if self.version == "Social":
+ reply, npc_info = self.caretaker.step(utterance)
+
+ if reply:
+ self.utterance += "{}: {} \n".format(self.caretaker.name, reply)
+ self.full_conversation += "{}: {} \n".format(self.caretaker.name, reply)
+ else:
+ npc_info = {
+ "prim_action": "no_op",
+ "utterance": "no_op",
+ "was_introduced_to": False,
+ }
+
+ # aftermath
+ if p_action == self.actions.done:
+ done = True
+
+ elif self.agent_ate_the_apple:
+ # check that it is the agent who ate it
+ assert self.actions(p_action) == self.actions.toggle
+ assert self.get_cell(*self.front_pos) == self.agent_apple
+
+ if self.version == "Asocial" or self.role in ["A", "B"]:
+ reward = self._reward()
+ success = True
+ done = True
+
+ elif self.role == "Meta":
+
+ if self.agent_side == "L":
+ reward = self._reward() / 2
+ success = True
+ done = True
+
+ elif self.agent_side == "R":
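+                    # Meta role, first phase done: collect half the reward, reset the
+                    # objects, and move the agent to the left side to play the other role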
+ reward = self._reward() / 2
+                    self.agent_ate_the_apple = False
+ self.agent_side = "L"
+ self.put_objects_in_env(remove_objects=True)
+
+ # teleport the agent and the NPC
+ self.place_agent(size=self.left_half_size, top=self.left_half_top)
+
+ self.grid.set(*self.caretaker.cur_pos, None)
+
+ self.caretaker = Partner(self.npc_color, "Partner", self)
+ self.place_obj(self.caretaker, size=self.right_half_size, top=self.right_half_top)
+
+ else:
+ raise ValueError(f"Side unknown - {self.agent_side}.")
+
+ else:
+ raise ValueError(f"Role unknown - {self.role}.")
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ # update obs with NPC movement
+ obs = self.gen_obs(full_obs=self.full_obs)
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ if done:
+ if reward > 0:
+ self.outcome_info = "SUCCESS: agent got {} reward \n".format(np.round(reward, 1))
+ else:
+ self.outcome_info = "FAILURE: agent got {} reward \n".format(reward)
+
+ if self.version == "Social":
+ # is the npc seen by the agent
+ ag_view_npc = self.relative_coords(*self.caretaker.cur_pos)
+
+ if ag_view_npc is not None:
+ # in the agent's field of view
+ ag_view_npc_x, ag_view_npc_y = ag_view_npc
+
+ n_dims = obs['image'].shape[-1]
+ npc_encoding = self.caretaker.encode(n_dims)
+
+ # is it occluded
+ npc_observed = all(obs['image'][ag_view_npc_x, ag_view_npc_y] == npc_encoding)
+ else:
+ npc_observed = False
+ else:
+ npc_observed = False
+
+ info = {**info, **{"NPC_"+k: v for k, v in npc_info.items()}}
+
+ info["NPC_observed"] = npc_observed
+ info["success"] = success
+
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ # def render(self, *args, **kwargs):
+ # obs = super().render(*args, **kwargs)
+ # self.window.clear_text() # erase previous text
+ # self.window.set_caption(self.full_conversation)
+ #
+ # # self.window.ax.set_title("correct color: {}".format(self.box.target_color), loc="left", fontsize=10)
+ #
+ # if self.outcome_info:
+ # color = None
+ # if "SUCCESS" in self.outcome_info:
+ # color = "lime"
+ # elif "FAILURE" in self.outcome_info:
+ # color = "red"
+ # self.window.add_text(*(0.01, 0.85, self.outcome_info),
+ # **{'fontsize':15, 'color':color, 'weight':"bold"})
+ #
+ # self.window.show_img(obs) # re-draw image to add changes to window
+ # return obs
+
+
+register(
+ id='SocialAI-LeverDoorEnv-v1',
+ entry_point='gym_minigrid.social_ai_envs:LeverDoorEnv'
+)
\ No newline at end of file
diff --git a/gym-minigrid/gym_minigrid/social_ai_envs/marblepassenv.py b/gym-minigrid/gym_minigrid/social_ai_envs/marblepassenv.py
new file mode 100644
index 0000000000000000000000000000000000000000..b21f95b39a220dde6f71f077d5dc067db7400cf3
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/social_ai_envs/marblepassenv.py
@@ -0,0 +1,585 @@
+import random
+import time
+
+import numpy as np
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+from gym_minigrid.social_ai_envs.socialaigrammar import SocialAIGrammar, SocialAIActions, SocialAIActionSpace
+from collections import deque
+
+
+class Partner(NPC):
+ """
+    A collaborative NPC partner that takes the side opposite to the agent and
+    pushes the marble towards the apple generator
+ """
+ def __init__(self, color, name, env):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+ self.npc_dir = 1 # NPC initially looks downward
+ self.npc_type = 0 # this will be put into the encoding
+
+ # opposite role of the agent
+ self.npc_side = "L" if self.env.agent_side == "R" else "R"
+
+        # how many random actions at the beginning -> removes trivial solutions
+ self.random_to_go = 0 # not needed as no lever is used, solution is trivial anyway
+
+ assert set([self.npc_side, self.env.agent_side]) == {"L", "R"}
+
+ self.was_introduced_to = False
+ self.pushed_the_marble = False
+ self.ate_an_apple = False
+
+ # target obj
+        assert self.env.problem == (self.env.parameters["Problem"] if self.env.parameters else "MarblePass")
+
+ self.target_obj = self.env.generator
+
+ assert self.env.grammar.contains_utterance(self.introduction_statement)
+
+ def step(self, utterance):
+
+ reply, info = super().step()
+
+ if self.env.hidden_npc:
+ return reply, info
+
+ reply, action = self.handle_introduction(utterance)
+
+ if self.was_introduced_to:
+
+ if self.random_to_go > 0:
+ action = random.choice([self.go_forward, self.rotate_left, self.rotate_right])
+ self.random_to_go -= 1
+
+ elif not self.pushed_the_marble:
+
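+                # the partner moves behind the marble (on the side away from the
+                # generator) and then steps into it to push it towards the generator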
+ if self.npc_side == "L":
+                    # push position: the cell just left of the marble (pushing from there moves it to the right)
+ push_pos = self.env.marble.cur_pos - np.array([1, 0])
+
+ if all(self.cur_pos == push_pos):
+ next_target_position = self.env.marble.cur_pos
+ else:
+ next_target_position = push_pos
+
+ # go to loc in front of marble and push
+ action = self.path_to_pos(next_target_position)
+
+ elif self.npc_side == "R":
+ if self.env.marble.cur_pos[0] == self.env.generator.cur_pos[0]:
+
+ distance = (self.env.marble.cur_pos - self.env.generator.cur_pos)
+
+ # keep only the direction and move for 1 step
+ diff = np.sign(distance)
+ # assert all(diff == np.nan_to_num(distance / np.abs(distance)))
+ if abs(diff.sum()) == 1:
+ push_pos = self.env.marble.cur_pos + diff
+
+ if all(self.cur_pos == push_pos):
+ next_target_position = self.env.marble.cur_pos
+
+ else:
+ next_target_position = push_pos
+
+ # go to loc in front of marble or push
+ action = self.path_to_pos(next_target_position)
+
+ else:
+ raise ValueError("Undefined role")
+
+ eaten_before = self.env.agent_apple.eaten
+ pushed_before = self.env.marble.is_moving
+
+ if action is not None:
+ action()
+
+ if not self.pushed_the_marble:
+ self.pushed_the_marble = not pushed_before and self.env.marble.is_moving
+
+ # check if the NPC ate the apple
+ eaten_after = self.env.agent_apple.eaten
+ self.ate_an_apple = not eaten_before and eaten_after
+
+ info = {
+ "prim_action": action.__name__ if action is not None else "no_op",
+ "utterance": reply or "no_op",
+ "was_introduced_to": self.was_introduced_to
+ }
+
+ assert (reply or "no_op") in self.list_of_possible_utterances
+
+ return reply, info
+
+
+class MarblePassEnv(MultiModalMiniGridEnv):
+ """
+    Marble-Pass environment: the room is split by a fence and a marble must be pushed
+    into an apple generator on the right side; in the Social version the agent and a
+    partner NPC on opposite sides must pass the marble across to succeed
+ """
+
+ def __init__(
+ self,
+ size=10,
+ diminished_reward=True,
+ step_penalty=False,
+ knowledgeable=False,
+ max_steps=80,
+ hidden_npc=False,
+ reward_diminish_factor=0.1,
+ egocentric_observation=True,
+ ):
+ assert size >= 5
+ self.empty_symbol = "NA \n"
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+ self.knowledgeable = knowledgeable
+ self.hidden_npc = hidden_npc
+ self.hear_yourself = False
+
+ self.grammar = SocialAIGrammar()
+
+ self.init_done = False
+ # parameters - to be set in reset
+ self.parameters = None
+
+        # NPC feature flags - each enabled flag adds one dimension to the cell encoding
+ self.add_npc_direction = True
+ self.add_npc_point_direction = True
+ self.add_npc_last_prim_action = True
+
+ self.reward_diminish_factor = reward_diminish_factor
+ self.egocentric_observation = egocentric_observation
+ self.encoding_size = 3 + 2*bool(not self.egocentric_observation) + bool(self.add_npc_direction) + bool(self.add_npc_point_direction) + bool(self.add_npc_last_prim_action)
+
+ super().__init__(
+ grid_size=size,
+ max_steps=max_steps,
+ # Set this to True for maximum speed
+ see_through_walls=False,
+ actions=SocialAIActions, # primitive actions
+ action_space=SocialAIActionSpace,
+ add_npc_direction=self.add_npc_direction,
+ add_npc_point_direction=self.add_npc_point_direction,
+ add_npc_last_prim_action=self.add_npc_last_prim_action,
+ reward_diminish_factor=self.reward_diminish_factor,
+ )
+ self.all_npc_utterance_actions = Partner.get_list_of_possible_utterances()
+ self.prim_actions_dict = SocialAINPCActionsDict
+
+ def is_in_marble_way(self, pos):
+
+ if pos[0] == self.generator_current_pos[0]: # same column as generator
+ return True
+
+ if pos[1] == self.marble_current_pos[1]: # same row as marble
+ return True
+
+ # all good
+ return False
+
+ def _gen_grid(self, width_, height_):
+ # Create the grid
+ self.grid = Grid(width_, height_, nb_obj_dims=self.encoding_size)
+
+ # new
+ min_w = min(9, width_)
+ min_h = min(9, height_)
+ self.current_width = self._rand_int(min_w, width_+1)
+ self.current_height = self._rand_int(min_h, height_+1)
+
+ self.wall_x = self.current_width-1
+ self.wall_y = self.current_height-1
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, self.current_width, self.current_height)
+
+ self.problem = self.parameters["Problem"] if self.parameters else "MarblePass"
+ self.version = self.parameters["Version"] if self.parameters else "Asocial"
+ self.role = self.parameters["Role"] if self.parameters else "A"
+ assert self.role in ["A", "B", "Meta"]
+
+ if self.role in ["B", "Meta"]:
+ self.agent_side = "R" # starts on the right side
+ else:
+ self.agent_side = "L" # starts on the right side
+
+ num_of_colors = self.parameters.get("Num_of_colors", None) if self.parameters else None
+
+ self.add_obstacles()
+
+ # apple
+ if num_of_colors is None:
+ POSSIBLE_COLORS = COLOR_NAMES
+
+ else:
+ POSSIBLE_COLORS = COLOR_NAMES[:int(num_of_colors)]
+
+ self.left_half_size = (self.current_width//2, self.current_height)
+ self.left_half_top = (0, 0)
+
+ self.right_half_size = (self.current_width//2, self.current_height)
+ self.right_half_top = (self.current_width - self.current_width // 2, 0)
+
+ # generator
+ self.generator_pos = (self.current_width//2, self.current_height)
+ self.generator_color = self._rand_elem(POSSIBLE_COLORS)
+ self.generator_current_pos = self.find_loc(
+ # on the right most column
+ top=(self.current_width-2, 0),
+ size=(1, self.current_height),
+ reject_agent_pos=True,
+ )
+
+ # marble
+ self.marble_color = self._rand_elem(POSSIBLE_COLORS)
+ if self.version == "Social":
+ self.marble_current_pos = self.find_loc(
+ top=(self.current_width//2 - 1, 1), # fence or column left of fence, not next to wall
+ size=(2, self.current_height - 2),
+ reject_agent_pos=True,
+ reject_fn=lambda _, pos: (
+ # tuple(pos) in map(tuple, [self.left_apple_current_pos, self.right_apple_current_pos, self.generator_current_pos])
+ # or
+ pos[1] == self.generator_current_pos[1] # reject if in row as the generator
+ # or
+ # pos[1] == self.right_apple_current_pos[1] # reject if in row as the partner's platform
+ # or
+ # any(pos == self.left_apple_current_pos) # reject if in row or column as the agent's platform
+ or
+ any(pos == 1) # next to a wall
+ or
+ pos[1] == self.current_height - 2
+ ),
+ )
+ else:
+ self.marble_current_pos = self.find_loc(
+ top=(self.generator_current_pos[0], 1), # fence or column left of fence, not next to wall
+ size=(1, self.current_height - 2),
+ reject_agent_pos=True,
+ reject_fn=lambda _, pos: (
+ # tuple(pos) in map(tuple, [self.left_apple_current_pos, self.right_apple_current_pos, self.generator_current_pos])
+ # or
+ pos[1] == self.generator_current_pos[1] # reject if in row as the generator
+ # or
+ # pos[1] == self.right_apple_current_pos[1] # reject if in row as the partner's platform
+ # or
+ # any(pos == self.left_apple_current_pos) # reject if in row or column as the agent's platform
+ or
+ any(pos == 1) # next to a wall
+ or
+ pos[1] == self.current_height - 2
+ ),
+ )
+
+ # add fence to grid
+ self.grid.vert_wall(
+ x=self.current_width//2,
+ y=1,
+ length=self.current_height - 2,
+ obj_type=Fence
+ )
+
+ if self.version == "Social":
+ # create hole in fence wall to make room for the marble
+ self.grid.set(self.current_width//2, self.marble_current_pos[1], None)
+
+ # generator platform
+
+ # find the position for generator_platforms
+ self.left_apple_current_pos = self.find_loc(
+ top=self.left_half_top, size=self.left_half_size, reject_agent_pos=True,
+            # reject if in the same row as the generator or the marble
+ reject_fn=lambda _, pos: pos[1] == self.generator_current_pos[1] or pos[1] == self.marble_current_pos[1],
+ )
+
+ self.right_apple_current_pos = self.find_loc(
+ top=self.right_half_top, size=self.right_half_size, reject_agent_pos=True,
+            # reject if in the same row or column as the generator, or in the same row as the marble
+ reject_fn=lambda _, pos: any(pos == self.generator_current_pos) or pos[1] == self.marble_current_pos[1],
+ )
+
+ assert all(self.left_apple_current_pos < np.array([self.current_width - 1, self.current_height - 1]))
+ assert all(self.right_apple_current_pos < np.array([self.current_width - 1, self.current_height - 1]))
+
+ self.left_generator_platform_color = self._rand_elem(POSSIBLE_COLORS)
+ self.right_generator_platform_color = self._rand_elem(POSSIBLE_COLORS)
+
+ self.put_objects_in_env()
+
+ # place agent
+ if self.agent_side == "L":
+ self.place_agent(size=self.left_half_size, top=self.left_half_top)
+ else:
+ self.place_agent(size=self.right_half_size, top=self.right_half_top)
+
+ # NPC
+ if self.version == "Social":
+ self.npc_color = self._rand_elem(COLOR_NAMES)
+ self.caretaker = Partner(self.npc_color, "Partner", self)
+
+ if self.agent_side == "L":
+ self.place_obj(self.caretaker, size=self.right_half_size, top=self.right_half_top, reject_fn=MarblePassEnv.is_in_marble_way)
+ else:
+ self.place_obj(self.caretaker, size=self.left_half_size, top=self.left_half_top, reject_fn=MarblePassEnv.is_in_marble_way)
+
+ # Generate the mission string
+ self.mission = 'lets collaborate'
+
+ # Dummy beginning string
+ # self.beginning_string = "This is what you hear. \n"
+ self.beginning_string = "Conversation: \n"
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ # used for rendering
+ self.full_conversation = self.utterance
+ self.outcome_info = None
+
+ def put_objects_in_env(self, remove_objects=False):
+
+ assert self.left_apple_current_pos is not None
+ assert self.right_apple_current_pos is not None
+ assert self.generator_current_pos is not None
+ assert self.left_generator_platform_color is not None
+ assert self.right_generator_platform_color is not None
+
+        assert self.problem == (self.parameters["Problem"] if self.parameters else "MarblePass")
+
+ if remove_objects:
+ self.grid.set(*self.left_generator_platform.cur_pos, None) # remove apple
+ self.grid.set(*self.right_generator_platform.cur_pos, None) # remove apple
+ self.grid.set(*self.generator.cur_pos, None) # remove generator
+ self.grid.set(*self.marble.cur_pos, None) # remove marble
+ self.grid.set(*self.marble_current_pos, None) # remove tee
+
+
+ # Apple
+ self.agent_apple = Apple()
+ self.partner_apple = Apple()
+
+ def generate_apples():
+ if self.agent_side == "L":
+                self.grid.set(*self.left_apple_current_pos, self.agent_apple)
+                self.grid.set(*self.right_apple_current_pos, self.partner_apple)
+            else:
+                self.grid.set(*self.left_apple_current_pos, self.partner_apple)
+                self.grid.set(*self.right_apple_current_pos, self.agent_apple)
+
+ # Generator
+ self.generator = AppleGenerator(
+ self.generator_color,
+ on_push=generate_apples,
+ marble_activation=True,
+ )
+
+ self.left_generator_platform = GeneratorPlatform(self.left_generator_platform_color)
+ self.right_generator_platform = GeneratorPlatform(self.right_generator_platform_color)
+
+ self.marble = Marble(self.marble_color, env=self)
+
+ self.put_obj_np(self.left_generator_platform, self.left_apple_current_pos)
+ self.put_obj_np(self.right_generator_platform, self.right_apple_current_pos)
+
+ self.put_obj_np(self.generator, self.generator_current_pos)
+
+ self.put_obj_np(self.marble, self.marble_current_pos)
+
+ def reset(
+ self, *args, **kwargs
+ ):
+ # This env must be used inside the parametric env
+ if not kwargs:
+            # The only place where kwargs can be empty is during the class construction
+ # reset should be called again before using the env (paramenv does it in its constructor)
+ assert self.parameters is None
+ assert not self.init_done
+ self.init_done = True
+
+ obs = super().reset()
+ return obs
+
+ else:
+ assert self.init_done
+
+ self.parameters = dict(kwargs)
+
+ assert self.parameters is not None
+ assert len(self.parameters) > 0
+
+ obs = super().reset()
+
+ self.agent_ate_the_apple = False
+ self.agent_opened_the_box = False
+ self.agent_turned_on_the_switch = False
+ self.agent_pressed_the_generator = False
+ self.agent_pushed_the_marble = False
+
+ return obs
+
+ def step(self, action):
+ success = False
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ apple_had_been_eaten = self.agent_apple.eaten
+ generator_had_been_pressed = self.generator.is_pressed
+ marble_had_been_pushed = self.marble.was_pushed
+
+ # primitive actions
+ _, reward, done, info = super().step(p_action)
+
+ self.marble.step()
+
+ # eaten just now by primitive actions of the agent
+ if not self.agent_ate_the_apple:
+ self.agent_ate_the_apple = self.agent_apple.eaten and not apple_had_been_eaten
+
+ if not self.agent_pressed_the_generator:
+ self.agent_pressed_the_generator = self.generator.is_pressed and not generator_had_been_pressed
+
+ if not self.agent_pushed_the_marble:
+ self.agent_pushed_the_marble = self.marble.was_pushed and not marble_had_been_pushed
+
+ # utterances
+ agent_spoke = not all(np.isnan(utterance_action))
+ if agent_spoke:
+ utterance = self.grammar.construct_utterance(utterance_action)
+
+ if self.hear_yourself:
+ self.utterance += "YOU: {} \n".format(utterance)
+ self.full_conversation += "YOU: {} \n".format(utterance)
+ else:
+ utterance = None
+
+ if self.version == "Social":
+ reply, npc_info = self.caretaker.step(utterance)
+
+ if reply:
+ self.utterance += "{}: {} \n".format(self.caretaker.name, reply)
+ self.full_conversation += "{}: {} \n".format(self.caretaker.name, reply)
+ else:
+ npc_info = {
+ "prim_action": "no_op",
+ "utterance": "no_op",
+ "was_introduced_to": False,
+ }
+
+ # aftermath
+ if p_action == self.actions.done:
+ done = True
+
+ if self.agent_ate_the_apple:
+ # check that it is the agent who ate it
+ assert self.actions(p_action) == self.actions.toggle
+ assert self.get_cell(*self.front_pos) == self.agent_apple
+
+ if self.version == "Asocial" or self.role in ["A", "B"]:
+ reward = self._reward()
+ success = True
+ done = True
+
+ elif self.role == "Meta":
+
+ if self.agent_side == "L":
+ reward = self._reward() / 2
+ success = True
+ done = True
+
+ elif self.agent_side == "R":
+ # revert and rotate
+ reward = self._reward() / 2
+ self.agent_ate_the_apple = False
+ self.agent_side = "L"
+ self.put_objects_in_env(remove_objects=True)
+
+ # teleport the agent and the NPC
+ self.place_agent(size=self.left_half_size, top=self.left_half_top)
+
+ self.grid.set(*self.caretaker.cur_pos, None)
+
+ self.caretaker = Partner(self.npc_color, "Partner", self)
+ self.place_obj(self.caretaker, size=self.right_half_size, top=self.right_half_top, reject_fn=MarblePassEnv.is_in_marble_way)
+ else:
+ raise ValueError(f"Side unknown - {self.agent_side}.")
+ else:
+ raise ValueError(f"Role unknown - {self.role}.")
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ # update obs with NPC movement
+ obs = self.gen_obs(full_obs=self.full_obs)
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ if done:
+ if reward > 0:
+ self.outcome_info = "SUCCESS: agent got {} reward \n".format(np.round(reward, 1))
+ else:
+ self.outcome_info = "FAILURE: agent got {} reward \n".format(reward)
+
+ if self.version == "Social":
+ # is the npc seen by the agent
+ ag_view_npc = self.relative_coords(*self.caretaker.cur_pos)
+
+ if ag_view_npc is not None:
+ # in the agent's field of view
+ ag_view_npc_x, ag_view_npc_y = ag_view_npc
+
+ n_dims = obs['image'].shape[-1]
+ npc_encoding = self.caretaker.encode(n_dims)
+
+ # is it occluded
+ npc_observed = all(obs['image'][ag_view_npc_x, ag_view_npc_y] == npc_encoding)
+ else:
+ npc_observed = False
+ else:
+ npc_observed = False
+
+ info = {**info, **{"NPC_"+k: v for k, v in npc_info.items()}}
+
+ info["NPC_observed"] = npc_observed
+ info["success"] = success
+
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ # def render(self, *args, **kwargs):
+ # obs = super().render(*args, **kwargs)
+ #
+ # self.window.clear_text() # erase previous text
+ # self.window.set_caption(self.full_conversation)
+ #
+ # # self.window.ax.set_title("correct color: {}".format(self.box.target_color), loc="left", fontsize=10)
+ #
+ # if self.outcome_info:
+ # color = None
+ # if "SUCCESS" in self.outcome_info: color = "lime"
+ # elif "FAILURE" in self.outcome_info:
+ # color = "red"
+ # self.window.add_text(*(0.01, 0.85, self.outcome_info),
+ # **{'fontsize':15, 'color':color, 'weight':"bold"})
+ #
+ # self.window.show_img(obs) # re-draw image to add changes to window
+ # return obs
+
+
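+# Illustrative usage (assumed): once registered, the environment can be created through gym, e.g.
+#   import gym; env = gym.make('SocialAI-MarblePassEnv-v1')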
+register(
+ id='SocialAI-MarblePassEnv-v1',
+ entry_point='gym_minigrid.social_ai_envs:MarblePassEnv'
+)
\ No newline at end of file
diff --git a/gym-minigrid/gym_minigrid/social_ai_envs/marblepushenv.py b/gym-minigrid/gym_minigrid/social_ai_envs/marblepushenv.py
new file mode 100644
index 0000000000000000000000000000000000000000..8599c73392b100c7c4883c32e4eb5c518e01ecc4
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/social_ai_envs/marblepushenv.py
@@ -0,0 +1,639 @@
+import random
+import time
+
+import numpy as np
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+from gym_minigrid.social_ai_envs.socialaigrammar import SocialAIGrammar, SocialAIActions, SocialAIActionSpace
+
+
+class Partner(NPC):
+ """
+    A simple NPC partner that collaborates with the agent on the marble-push task
+ """
+ def __init__(self, color, name, env):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+ self.npc_dir = 1 # NPC initially looks downward
+ # todo: this should be id == name
+ self.npc_type = 0 # this will be put into the encoding
+
+ # opposite role of the agent
+ self.npc_side = "L" if self.env.agent_side == "R" else "R"
+
+        # how many random actions to take at the beginning -> removes trivial solutions
+ self.random_to_go = random.randint(self.env.lever_active_steps, 10)
+
+ assert set([self.npc_side, self.env.agent_side]) == {"L", "R"}
+
+ self.was_introduced_to = False
+
+ self.pushed_the_marble = False
+
+ self.ate_an_apple = False
+ self.pressed_the_lever = False
+ self.toggling = False
+
+ # target obj
+
+ self.target_obj = self.env.generator
+
+ assert self.env.grammar.contains_utterance(self.introduction_statement)
+
+ def step(self, utterance):
+ reply, info = super().step()
+
+ if self.env.hidden_npc:
+ return reply, info
+
+ reply, action = self.handle_introduction(utterance)
+
+ if self.was_introduced_to:
+
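+            # Side-dependent behaviour: the left-side partner lines up to push the marble
+            # towards the generator, while the right-side partner goes to operate the lever.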
+ if self.random_to_go > 0:
+ action = random.choice([self.go_forward, self.rotate_left, self.rotate_right])
+ self.random_to_go -= 1
+
+ elif self.npc_side == "L":
+ if not self.pushed_the_marble:
+
+ push_pos = self.env.marble.cur_pos - np.array([1, 0])
+
+ # is the NPC next to the lever
+ if not all(push_pos == self.cur_pos):
+ # go to the push pos for lever
+ action = self.path_to_pos(push_pos)
+
+ else:
+ # look at the agent
+ wanted_dir = self.compute_wanted_dir(self.env.agent_pos)
+
+ if wanted_dir != self.npc_dir and not self.toggling:
+ # turn to look at the agent
+ action = self.compute_turn_action(wanted_dir)
+
+ else:
+ # looking at the agent
+ # check if agent is next to the lever
+ if np.abs(self.env.agent_pos - self.env.lever.cur_pos).sum() <= 1:
+ action = self.path_to_pos(self.env.marble.cur_pos)
+
+ elif self.npc_side == "R":
+ if not self.pressed_the_lever:
+ # is the NPC next to the lever
+ if np.abs(self.env.lever.cur_pos - self.cur_pos).sum() > 1:
+ # go to the lever
+ action = self.path_to_pos(self.env.lever.cur_pos)
+ else:
+ # look at the agent
+ wanted_dir = self.compute_wanted_dir(self.env.agent_pos)
+
+ # # look at the lever
+ # wanted_dir = self.compute_wanted_dir(self.env.lever.cur_pos)
+
+ if wanted_dir != self.npc_dir and not self.toggling:
+ # turn to look at the agent
+ action = self.compute_turn_action(wanted_dir)
+
+ elif self.toggling or all(abs(self.env.agent_pos - self.env.marble.cur_pos) == np.array([1, 0])):
+ self.toggling = True
+ action = self.path_to_toggle_pos(self.env.lever.cur_pos)
+
+ # else:
+ # # check if the marble is next to the door
+ # if np.abs(self.env.door.cur_pos - self.env.marble.cur_pos).sum() <= 1:
+ # self.toggling = True
+ # action = self.path_to_toggle_pos(self.env.lever.cur_pos)
+
+ else:
+ raise ValueError("Undefined role")
+
+ eaten_before = self.env.agent_apple.eaten
+ pushed_before = self.env.marble.is_moving
+ lever_on_before = self.env.lever.is_on
+
+ if action is not None:
+ action()
+
+ if not self.pressed_the_lever:
+ self.pressed_the_lever = not lever_on_before and self.env.lever.is_on
+
+ if not self.pushed_the_marble:
+ self.pushed_the_marble = not pushed_before and self.env.marble.is_moving
+
+ # check if the NPC ate the apple
+ eaten_after = self.env.agent_apple.eaten
+ self.ate_an_apple = not eaten_before and eaten_after
+
+ info = {
+ "prim_action": action.__name__ if action is not None else "no_op",
+ "utterance": reply or "no_op",
+ "was_introduced_to": self.was_introduced_to
+ }
+
+ assert (reply or "no_op") in self.list_of_possible_utterances
+
+ return reply, info
+
+
+class MarblePushEnv(MultiModalMiniGridEnv):
+ """
+    Collaboration environment in which the agent and an NPC partner, placed on opposite
+    sides of a fence, coordinate a lever-operated door and a marble push to generate apples
+ """
+
+ def __init__(
+ self,
+ size=10,
+ diminished_reward=True,
+ step_penalty=False,
+ knowledgeable=False,
+ max_steps=80,
+ hidden_npc=False,
+ lever_active_steps=10,
+ reward_diminish_factor=0.1,
+ egocentric_observation=True,
+ ):
+ assert size >= 5
+ self.empty_symbol = "NA \n"
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+ self.knowledgeable = knowledgeable
+ self.hidden_npc = hidden_npc
+ self.hear_yourself = False
+
+ self.grammar = SocialAIGrammar()
+
+ self.init_done = False
+ # parameters - to be set in reset
+ self.parameters = None
+
+ # encoding size should be 5
+ self.add_npc_direction = True
+ self.add_npc_point_direction = True
+ self.add_npc_last_prim_action = True
+ self.lever_active_steps = lever_active_steps
+ self.reward_diminish_factor = reward_diminish_factor
+
+ self.egocentric_observation = egocentric_observation
+ self.encoding_size = 3 + 2*bool(not self.egocentric_observation) + bool(self.add_npc_direction) + bool(self.add_npc_point_direction) + bool(self.add_npc_last_prim_action)
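+        # i.e. 3 base channels, +2 absolute-coordinate channels when observations are not
+        # egocentric, +1 channel per enabled NPC feature (direction, pointing, last primitive action)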
+
+ super().__init__(
+ grid_size=size,
+ max_steps=max_steps,
+ # Set this to True for maximum speed
+ see_through_walls=False,
+ actions=SocialAIActions, # primitive actions
+ action_space=SocialAIActionSpace,
+ add_npc_direction=self.add_npc_direction,
+ add_npc_point_direction=self.add_npc_point_direction,
+ add_npc_last_prim_action=self.add_npc_last_prim_action,
+ reward_diminish_factor=self.reward_diminish_factor,
+ )
+
+ self.all_npc_utterance_actions = Partner.get_list_of_possible_utterances()
+ self.prim_actions_dict = SocialAINPCActionsDict
+
+ def is_in_marble_way(self, pos):
+
+ if pos[0] == self.generator_current_pos[0]: # same column as generator
+ return True
+
+ if pos[1] == self.marble_current_pos[1]: # same row as marble
+ return True
+
+ # all good
+ return False
+
+ def _gen_grid(self, width_, height_):
+ # Create the grid
+ self.grid = Grid(width_, height_, nb_obj_dims=self.encoding_size)
+
+ # newer
+ min_w = min(9, width_)
+ min_h = min(9, height_)
+ self.current_width = self._rand_int(min_w, width_+1)
+ self.current_height = self._rand_int(min_h, height_+1)
+
+ # new
+ # self.current_width = self._rand_int(7, width_+1)
+ # self.current_height = self._rand_int(7, height_+1)
+ # print("Room size: {}x{}".format(self.current_width, self.current_height))
+
+ # previous
+ # self.current_width = self._rand_int(6, width_+1)
+ # self.current_height = self._rand_int(6, height_+1)
+
+ # original
+ # self.current_width = self._rand_int(5, width_+1)
+ # self.current_height = self._rand_int(5, height_+1)
+
+ # self.current_width = width_
+ # self.current_height = height_
+ # self.current_width = 8
+ # self.current_height = 8
+ # warnings.warn("env size fixed: {}x{}".format(self.current_width, self.current_height))
+
+ self.wall_x = self.current_width-1
+ self.wall_y = self.current_height-1
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, self.current_width, self.current_height)
+
+ self.problem = self.parameters["Problem"] if self.parameters else "MarblePush"
+ self.version = self.parameters["Version"] if self.parameters else "Asocial"
+ self.role = self.parameters["Role"] if self.parameters else "A"
+ assert self.role in ["A", "B", "Meta"]
+
+ if self.role in ["B", "Meta"]:
+ self.agent_side = "R" # starts on the right side
+ else:
+            self.agent_side = "L"  # starts on the left side
+
+ num_of_colors = self.parameters.get("Num_of_colors", None) if self.parameters else None
+
+ self.add_obstacles()
+
+ # apple
+ if num_of_colors is None:
+ POSSIBLE_COLORS = COLOR_NAMES
+
+ else:
+ POSSIBLE_COLORS = COLOR_NAMES[:int(num_of_colors)]
+
+ self.left_half_size = (self.current_width//2, self.current_height)
+ self.left_half_top = (0, 0)
+
+ self.right_half_size = (self.current_width//2, self.current_height)
+ self.right_half_top = (self.current_width - self.current_width // 2, 0)
+
+ # generator
+ self.generator_pos = (self.current_width//2, self.current_height)
+ self.generator_color = self._rand_elem(POSSIBLE_COLORS)
+
+ # self.generator_current_pos = self.find_loc(
+ # # on the right most column
+ # top=(self.current_width-2, 0),
+ # size=(1, self.current_height),
+ # reject_agent_pos=True,
+ # )
+ self.generator_current_pos = self.find_loc(
+ # on the right wall
+ top=(self.current_width-1, 1),
+ size=(1, self.current_height-2),
+ reject_agent_pos=True,
+ reject_taken_pos=False, # so that it can be placed on the wall
+ )
+
+ # create hole in the wall for the generator
+ assert type(self.grid.get(*self.generator_current_pos)) == Wall
+ self.grid.set(*self.generator_current_pos, None)
+
+ # door
+ self.door_current_pos = self.generator_current_pos - np.array([1, 0])
+ self.door_color = self._rand_elem(POSSIBLE_COLORS)
+
+ # marble
+ self.marble_color = self._rand_elem(POSSIBLE_COLORS)
+ self.marble_current_pos = self.find_loc(
+ top=(self.current_width//2 - 1, self.generator_current_pos[1]), # same row as the generator
+ size=(2, 1),
+ reject_agent_pos=True,
+ )
+
+ # add fence to grid
+ self.grid.vert_wall(
+ x=self.current_width//2,
+ y=1,
+ length=self.current_height - 2,
+ obj_type=Fence
+ )
+ # create hole in fence wall
+ self.grid.set(self.current_width//2, self.marble_current_pos[1], None)
+
+ # lever
+ self.lever_current_pos = self.find_loc(
+ # above or below the door
+ top=(self.door_current_pos[0], self.door_current_pos[1]-1), size=(1, 3), reject_agent_pos=True,
+ reject_fn=lambda _, pos: tuple(pos) in map(tuple, [self.generator_current_pos]) or pos[1] == self.door_current_pos[1],
+ )
+
+ self.lever_color = self._rand_elem(POSSIBLE_COLORS)
+
+ # generator platform
+
+ # find the position for generator_platforms
+ self.left_apple_current_pos = self.find_loc(
+ top=self.left_half_top, size=self.left_half_size, reject_agent_pos=True,
+ # reject if in row or column as the generator
+ reject_fn=lambda _, pos: (
+ pos[1] == self.generator_current_pos[1] or pos[1] == self.marble_current_pos[1])
+ or tuple(pos) in map(tuple, [
+ self.generator_current_pos, self.door_current_pos, self.lever_current_pos])
+ )
+
+ self.right_apple_current_pos = self.find_loc(
+ top=self.right_half_top, size=self.right_half_size, reject_agent_pos=True,
+ # reject if in row or column as the generator
+ reject_fn=lambda _, pos: (
+ any(pos == self.generator_current_pos) or pos[1] == self.marble_current_pos[1])
+ or tuple(pos) in map(tuple, [
+ self.generator_current_pos, self.door_current_pos, self.lever_current_pos])
+ )
+
+ assert all(self.left_apple_current_pos < np.array([self.current_width - 1, self.current_height - 1]))
+ assert all(self.right_apple_current_pos < np.array([self.current_width - 1, self.current_height - 1]))
+
+ # self.agent_generator_platform_color = self._rand_elem(POSSIBLE_COLORS)
+ # self.partner_generator_platform_color = self._rand_elem(POSSIBLE_COLORS)
+ self.agent_generator_platform_color = self._rand_elem(POSSIBLE_COLORS)
+ self.partner_generator_platform_color = self._rand_elem(POSSIBLE_COLORS)
+
+ self.put_objects_in_env()
+
+ # place agent
+ if self.agent_side == "L":
+ self.place_agent(size=self.left_half_size, top=self.left_half_top)
+ else:
+ self.place_agent(size=self.right_half_size, top=self.right_half_top)
+
+ if self.version == "Social":
+ # NPC
+ self.npc_color = self._rand_elem(COLOR_NAMES)
+ self.caretaker = Partner(self.npc_color, "Partner", self)
+
+ if self.agent_side == "L":
+ self.place_obj(self.caretaker, size=self.right_half_size, top=self.right_half_top, reject_fn=MarblePushEnv.is_in_marble_way)
+ else:
+ self.place_obj(self.caretaker, size=self.left_half_size, top=self.left_half_top, reject_fn=MarblePushEnv.is_in_marble_way)
+
+
+ # Generate the mission string
+ self.mission = 'lets collaborate'
+
+ # Dummy beginning string
+ # self.beginning_string = "This is what you hear. \n"
+        self.beginning_string = "Conversation: \n"  # todo: go back to "This is what you hear"?
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ # used for rendering
+ self.full_conversation = self.utterance
+ self.outcome_info = None
+
+ def put_objects_in_env(self, remove_objects=False):
+
+ assert self.left_apple_current_pos is not None
+ assert self.right_apple_current_pos is not None
+ assert self.generator_current_pos is not None
+ assert self.agent_generator_platform_color is not None
+ assert self.partner_generator_platform_color is not None
+
+ if remove_objects:
+ self.grid.set(*self.agent_generator_platform.cur_pos, None) # remove apple
+ self.grid.set(*self.partner_generator_platform.cur_pos, None) # remove apple
+ self.grid.set(*self.generator.cur_pos, None) # remove generator
+ self.grid.set(*self.marble.cur_pos, None) # remove marble
+ self.grid.set(*self.marble_current_pos, None) # remove tee
+ self.grid.set(*self.door.cur_pos, None) # remove door
+ self.grid.set(*self.lever.cur_pos, None) # remove lever
+
+
+ # apple
+ self.agent_apple = Apple()
+ self.partner_apple = Apple()
+
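+        # With marble_activation=True the generator is triggered by the marble rolling into it;
+        # its on_push callback then places one apple on each side's platform.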
+ def generate_apples():
+ if self.agent_side == "L":
+                self.grid.set(*self.left_apple_current_pos, self.agent_apple)
+                self.grid.set(*self.right_apple_current_pos, self.partner_apple)
+            else:
+                self.grid.set(*self.left_apple_current_pos, self.partner_apple)
+                self.grid.set(*self.right_apple_current_pos, self.agent_apple)
+
+
+ # Generator
+ self.generator = AppleGenerator(
+ self.generator_color,
+ on_push=generate_apples,
+ marble_activation=True,
+ )
+
+ door_open = self.version == "Asocial"
+ self.door = RemoteDoor(color=self.door_color, is_open=door_open)
+
+ self.lever = Lever(color=self.lever_color, object=self.door, active_steps=self.lever_active_steps)
+
+ self.agent_generator_platform = GeneratorPlatform(self.agent_generator_platform_color)
+ self.partner_generator_platform = GeneratorPlatform(self.partner_generator_platform_color)
+
+
+ self.marble = Marble(self.marble_color, env=self)
+
+ self.put_obj_np(self.agent_generator_platform, self.left_apple_current_pos)
+ self.put_obj_np(self.partner_generator_platform, self.right_apple_current_pos)
+
+ self.put_obj_np(self.generator, self.generator_current_pos)
+ self.put_obj_np(self.door, self.door_current_pos)
+ self.put_obj_np(self.lever, self.lever_current_pos)
+
+ self.put_obj_np(self.marble, self.marble_current_pos)
+
+ def reset(
+ self, *args, **kwargs
+ ):
+ # This env must be used inside the parametric env
+ if not kwargs:
+            # The only place where kwargs can be empty is during class construction;
+            # reset should be called again before using the env (the parametric env does this in its constructor)
+            assert self.parameters is None
+            assert not self.init_done  # is it being used inside a ParamEnv?
+ self.init_done = True
+
+ obs = super().reset()
+ return obs
+
+ else:
+ assert self.init_done
+
+ self.parameters = dict(kwargs)
+
+ assert self.parameters is not None
+ assert len(self.parameters) > 0
+
+ obs = super().reset()
+
+ self.agent_ate_the_apple = False
+ self.agent_opened_the_box = False
+ self.agent_turned_on_the_switch = False
+ self.agent_pressed_the_generator = False
+ self.agent_pushed_the_marble = False
+
+ return obs
+
+ def step(self, action):
+ success = False
+ p_action = action[0]
+ utterance_action = action[1:]
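+        # action layout: [primitive_action, template_index, thing_index];
+        # NaNs in the last two entries mean the agent says nothing this step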
+
+ apple_had_been_eaten = self.agent_apple.eaten
+ generator_had_been_pressed = self.generator.is_pressed
+ marble_had_been_pushed = self.marble.was_pushed
+
+ # primitive actions
+ _, reward, done, info = super().step(p_action)
+
+ self.marble.step()
+
+ self.lever.step()
+
+ # eaten just now by primitive actions of the agent
+ if not self.agent_ate_the_apple:
+ self.agent_ate_the_apple = self.agent_apple.eaten and not apple_had_been_eaten
+
+ if not self.agent_pressed_the_generator:
+ self.agent_pressed_the_generator = self.generator.is_pressed and not generator_had_been_pressed
+
+ if not self.agent_pushed_the_marble:
+ self.agent_pushed_the_marble = self.marble.was_pushed and not marble_had_been_pushed
+
+ # utterances
+ agent_spoke = not all(np.isnan(utterance_action))
+ if agent_spoke:
+ utterance = self.grammar.construct_utterance(utterance_action)
+
+ if self.hear_yourself:
+ self.utterance += "YOU: {} \n".format(utterance)
+ self.full_conversation += "YOU: {} \n".format(utterance)
+ else:
+ utterance = None
+
+ if self.version == "Social":
+ reply, npc_info = self.caretaker.step(utterance)
+
+ if reply:
+ self.utterance += "{}: {} \n".format(self.caretaker.name, reply)
+ self.full_conversation += "{}: {} \n".format(self.caretaker.name, reply)
+ else:
+ npc_info = {
+ "prim_action": "no_op",
+ "utterance": "no_op",
+ "was_introduced_to": False,
+ }
+
+ # aftermath
+ if p_action == self.actions.done:
+ done = True
+
+ elif self.agent_ate_the_apple:
+ # check that it is the agent who ate it
+ assert self.actions(p_action) == self.actions.toggle
+ assert self.get_cell(*self.front_pos) == self.agent_apple
+
+ if self.version == "Asocial" or self.role in ["A", "B"]:
+ reward = self._reward()
+ success = True
+ done = True
+
+ elif self.role == "Meta":
+
+ if self.agent_side == "R":
+ reward = self._reward() / 2
+ success = True
+ done = True
+
+ elif self.agent_side == "L":
+ reward = self._reward() / 2
+ self.agent_side = "R"
+                    self.agent_ate_the_apple = False
+ self.put_objects_in_env(remove_objects=True)
+
+ # teleport the agent and the NPC
+ self.place_agent(size=self.right_half_size, top=self.right_half_top)
+
+ self.grid.set(*self.caretaker.cur_pos, None)
+ self.caretaker = Partner(self.npc_color, "Partner", self)
+ self.place_obj(self.caretaker, size=self.left_half_size, top=self.left_half_top, reject_fn=MarblePushEnv.is_in_marble_way)
+ else:
+ raise ValueError(f"Side unknown - {self.agent_side}.")
+ else:
+ raise ValueError(f"Role unknown - {self.role}.")
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ # update obs with NPC movement
+ obs = self.gen_obs(full_obs=self.full_obs)
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ if done:
+ if reward > 0:
+ self.outcome_info = "SUCCESS: agent got {} reward \n".format(np.round(reward, 1))
+ else:
+ self.outcome_info = "FAILURE: agent got {} reward \n".format(reward)
+
+ if self.version == "Social":
+ # is the npc seen by the agent
+ ag_view_npc = self.relative_coords(*self.caretaker.cur_pos)
+
+ if ag_view_npc is not None:
+ # in the agent's field of view
+ ag_view_npc_x, ag_view_npc_y = ag_view_npc
+
+ n_dims = obs['image'].shape[-1]
+ npc_encoding = self.caretaker.encode(n_dims)
+
+ # is it occluded
+ npc_observed = all(obs['image'][ag_view_npc_x, ag_view_npc_y] == npc_encoding)
+ else:
+ npc_observed = False
+ else:
+ npc_observed = False
+
+ info = {**info, **{"NPC_"+k: v for k, v in npc_info.items()}}
+
+ info["NPC_observed"] = npc_observed
+ info["success"] = success
+
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ # def render(self, *args, **kwargs):
+ # obs = super().render(*args, **kwargs)
+ # self.window.clear_text() # erase previous text
+ # self.window.set_caption(self.full_conversation)
+ #
+ # # self.window.ax.set_title("correct color: {}".format(self.box.target_color), loc="left", fontsize=10)
+ #
+ # if self.outcome_info:
+ # color = None
+ # if "SUCCESS" in self.outcome_info:
+ # color = "lime"
+ # elif "FAILURE" in self.outcome_info:
+ # color = "red"
+ # self.window.add_text(*(0.01, 0.85, self.outcome_info),
+ # **{'fontsize':15, 'color':color, 'weight':"bold"})
+ #
+ # self.window.show_img(obs) # re-draw image to add changes to window
+ # return obs
+
+
+register(
+ id='SocialAI-MarblePushEnv-v1',
+ entry_point='gym_minigrid.social_ai_envs:MarblePushEnv'
+)
\ No newline at end of file
diff --git a/gym-minigrid/gym_minigrid/social_ai_envs/objectscollaborationenv.py b/gym-minigrid/gym_minigrid/social_ai_envs/objectscollaborationenv.py
new file mode 100644
index 0000000000000000000000000000000000000000..f354f516ec83790f1981d43318d68631c329405d
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/social_ai_envs/objectscollaborationenv.py
@@ -0,0 +1,869 @@
+import time
+
+import numpy as np
+from gym_minigrid.social_ai_envs.socialaigrammar import SocialAIGrammar, SocialAIActions, SocialAIActionSpace
+from gym_minigrid.minigrid import *
+from gym_minigrid.register import register
+import time
+from collections import deque
+
+
+class Partner(NPC):
+ """
+    A simple NPC partner for the objects-collaboration task
+ """
+ def __init__(self, color, name, env):
+ super().__init__(color)
+ self.name = name
+ self.env = env
+ self.npc_dir = 1 # NPC initially looks downward
+ # todo: this should be id == name
+ self.npc_type = 0 # this will be put into the encoding
+
+ self.npc_side = "L" if self.env.agent_side == "R" else "R"
+ assert {self.npc_side, self.env.agent_side} == {"L", "R"}
+
+ self.target_obj = None
+
+ self.was_introduced_to = False
+
+ self.ate_an_apple = False
+ self.demo_over = False
+ self.demo_over_and_position_safe = False
+ self.apple_unlocked_for_agent = False
+
+ self.list_of_possible_utterances = [
+ *self.list_of_possible_utterances,
+ "Hot", # change to hot -> all with small letters
+ "Warm",
+ "Medium",
+ "Cold",
+ *COLOR_NAMES
+ ]
+
+ assert self.env.grammar.contains_utterance(self.introduction_statement)
+
+ def step(self, utterance):
+
+ reply, info = super().step()
+
+ if self.env.hidden_npc:
+ return reply, info
+
+ if self.npc_side == "L":
+ # the npc waits for the agent to open one of the right boxes, and then uses the object of the same color
+ action = None
+ if self.env.chosen_left_obj is not None:
+ self.target_obj = self.env.chosen_left_obj
+
+ if type(self.target_obj) == Switch and self.target_obj.is_on:
+ next_target_position = self.env.box.cur_pos
+
+ elif type(self.target_obj) == AppleGenerator and self.target_obj.is_pressed:
+ next_target_position = self.env.left_generator_platform.cur_pos
+
+ else:
+ next_target_position = self.target_obj.cur_pos
+
+ if type(self.target_obj) == AppleGenerator and not self.target_obj.is_pressed:
+ # we have to activate the generator
+ if not self.env.generator.marble_activation:
+ # push generator
+ action = self.path_to_pos(next_target_position)
+ else:
+ # find angle
+ if self.env.marble.moving_dir is None:
+ distance = (self.env.marble.cur_pos - self.target_obj.cur_pos)
+
+ diff = np.sign(distance)
+ if sum(abs(diff)) == 1:
+ push_pos = self.env.marble.cur_pos + diff
+ if all(self.cur_pos == push_pos):
+ next_target_position = self.env.marble.cur_pos
+ else:
+ next_target_position = push_pos
+
+ # go to loc in front of
+ # push
+ action = self.path_to_pos(next_target_position)
+
+ else:
+ action = None
+
+ else:
+ # toggle all other objects
+ action = self.path_to_toggle_pos(next_target_position)
+ else:
+ action = self.turn_to_see_agent()
+
+ else:
+ if self.ate_an_apple:
+ action = self.turn_to_see_agent()
+ else:
+ # toggle the chosen box then the apple
+ if self.target_obj is None:
+ self.target_obj = self.env._rand_elem([
+ self.env.right_box1,
+ self.env.right_box2
+ ])
+
+ action = self.path_to_toggle_pos(self.target_obj.cur_pos)
+
+ if self.npc_side == "R":
+ eaten_before = self.env.right_apple.eaten
+ else:
+ eaten_before = self.env.left_apple.eaten
+
+ if action is not None:
+ action()
+
+ if not self.ate_an_apple:
+ # check if the NPC ate the apple
+ if self.npc_side == "R":
+ self.ate_an_apple = not eaten_before and self.env.right_apple.eaten
+ else:
+ self.ate_an_apple = not eaten_before and self.env.left_apple.eaten
+
+ info = {
+ "prim_action": action.__name__ if action is not None else "no_op",
+ "utterance": "no_op",
+ "was_introduced_to": self.was_introduced_to
+ }
+
+ reply = None
+
+ return reply, info
+
+ def is_point_from_loc(self, pos):
+ target_pos = self.target_obj.cur_pos
+ if self.distractor_obj is not None:
+ distractor_pos = self.distractor_obj.cur_pos
+ else:
+ distractor_pos = [None, None]
+
+ if self.env.is_in_marble_way(pos):
+ return False
+
+ if any(pos == target_pos):
+ same_ind = np.argmax(target_pos == pos)
+
+ if pos[same_ind] != distractor_pos[same_ind]:
+ return True
+
+ if pos[same_ind] == distractor_pos[same_ind]:
+ # if in between
+ if distractor_pos[1-same_ind] < pos[1-same_ind] < target_pos[1-same_ind]:
+ return True
+
+ if distractor_pos[1-same_ind] > pos[1-same_ind] > target_pos[1-same_ind]:
+ return True
+
+ return False
+
+ def find_point_from_loc(self):
+ reject_fn = lambda env, p: not self.is_point_from_loc(p)
+
+ point = self.env.find_loc(size=(self.env.wall_x, self.env.wall_y), reject_fn=reject_fn, reject_agent_pos=False)
+
+ assert all(point < np.array([self.env.wall_x, self.env.wall_y]))
+ assert all(point > np.array([0, 0]))
+
+ return point
+
+
+class ObjectsCollaborationEnv(MultiModalMiniGridEnv):
+ """
+    Collaboration environment in which the agent and an NPC partner each act on one side of
+    a fence; the box opened on the right selects, by color, which left-side object yields an apple
+ """
+
+ def __init__(
+ self,
+ size=10,
+ diminished_reward=True,
+ step_penalty=False,
+ knowledgeable=False,
+ max_steps=80,
+ hidden_npc=False,
+ switch_no_light=True,
+ reward_diminish_factor=0.1,
+ see_through_walls=False,
+ egocentric_observation=True,
+ ):
+ assert size >= 5
+ self.empty_symbol = "NA \n"
+ self.diminished_reward = diminished_reward
+ self.step_penalty = step_penalty
+ self.knowledgeable = knowledgeable
+ self.hidden_npc = hidden_npc
+ self.hear_yourself = False
+ self.switch_no_light = switch_no_light
+
+ self.grammar = SocialAIGrammar()
+
+ self.init_done = False
+ # parameters - to be set in reset
+ self.parameters = None
+
+ # encoding size should be 5
+ self.add_npc_direction = True
+ self.add_npc_point_direction = True
+ self.add_npc_last_prim_action = True
+
+ self.reward_diminish_factor = reward_diminish_factor
+
+ self.egocentric_observation = egocentric_observation
+ self.encoding_size = 3 + 2*bool(not self.egocentric_observation) + bool(self.add_npc_direction) + bool(self.add_npc_point_direction) + bool(self.add_npc_last_prim_action)
+
+ super().__init__(
+ grid_size=size,
+ max_steps=max_steps,
+ # Set this to True for maximum speed
+ see_through_walls=see_through_walls,
+ actions=SocialAIActions, # primitive actions
+ action_space=SocialAIActionSpace,
+ add_npc_direction=self.add_npc_direction,
+ add_npc_point_direction=self.add_npc_point_direction,
+ add_npc_last_prim_action=self.add_npc_last_prim_action,
+ reward_diminish_factor=self.reward_diminish_factor,
+ )
+ self.all_npc_utterance_actions = Partner.get_list_of_possible_utterances()
+ self.prim_actions_dict = SocialAINPCActionsDict
+
+ def revert(self):
+ self.put_objects_in_env(remove_objects=True)
+
+ def is_in_marble_way(self, pos):
+ target_pos = self.generator_current_pos
+
+        # generator distractor is in the same row / column as the marble and the generator
+ # if self.distractor_current_pos is not None:
+ # distractor_pos = self.distractor_current_pos
+ # else:
+ # distractor_pos = [None, None]
+
+ if self.problem in ["Marble"]:
+ # point can't be in the same row or column as both the marble and the generator
+ # all three: marble, generator, loc are in the same row or column
+ if any((pos == target_pos) * (pos == self.marble_current_pos)):
+ # all three: marble, generator, loc are in the same row or column -> is in its way
+ return True
+
+ # is it in the way for the distractor generator
+ if any((pos == self.distractor_current_pos) * (pos == self.marble_current_pos)):
+ # all three: marble, distractor generator, loc are in the same row or column -> is in its way
+ return True
+
+ # all good
+ return False
+
+ def _gen_grid(self, width_, height_):
+ # Create the grid
+ self.grid = Grid(width_, height_, nb_obj_dims=self.encoding_size)
+
+ # new
+ min_w = min(9, width_)
+ min_h = min(9, height_)
+ self.current_width = self._rand_int(min_w, width_+1)
+ self.current_height = self._rand_int(min_h, height_+1)
+
+ self.wall_x = self.current_width-1
+ self.wall_y = self.current_height-1
+
+ # Generate the surrounding walls
+ self.grid.wall_rect(0, 0, self.current_width, self.current_height)
+
+ # problem: Apples/Boxes/Switches/Generators/Marbles
+ self.problem = self.parameters["Problem"] if self.parameters else "Apples"
+ num_of_colors = self.parameters.get("Num_of_colors", None) if self.parameters else None
+ self.version = self.parameters["Version"] if self.parameters else "Asocial"
+ self.role = self.parameters["Role"] if self.parameters else "A"
+ assert self.role in ["A", "B", "Meta"]
+
+ if self.role in ["B", "Meta"]:
+ self.agent_side = "R" # starts on the right side
+ else:
+            self.agent_side = "L"  # starts on the left side
+
+ self.add_obstacles()
+
+ # apple
+
+ # box
+ locked = self.problem == "Switches"
+
+ if num_of_colors is None:
+ POSSIBLE_COLORS = COLOR_NAMES.copy()
+
+ else:
+ POSSIBLE_COLORS = COLOR_NAMES[:int(num_of_colors)].copy()
+
+ self.left_half_size = (self.current_width//2, self.current_height)
+ self.left_half_top = (0, 0)
+
+ self.right_half_size = (self.current_width//2 - 1, self.current_height)
+ self.right_half_top = (self.current_width - self.current_width // 2 + 1, 0)
+
+ # add fence to grid
+ self.grid.vert_wall(
+            x=self.current_width//2 + 1,  # one column to the right of the center
+ y=1,
+ length=self.current_height - 2,
+ obj_type=Fence
+ )
+
+ self.right_box1_color = self._rand_elem(POSSIBLE_COLORS)
+ POSSIBLE_COLORS.remove(self.right_box1_color)
+
+ self.right_box2_color = self._rand_elem(POSSIBLE_COLORS)
+
+ assert self.right_box1_color != self.right_box2_color
+
+ POSSIBLE_COLORS_LEFT = [self.right_box1_color, self.right_box2_color]
+
+ self.left_color_1 = self._rand_elem(POSSIBLE_COLORS_LEFT)
+ POSSIBLE_COLORS_LEFT.remove(self.left_color_1)
+ self.left_color_2 = self._rand_elem(POSSIBLE_COLORS_LEFT)
+
+
+ self.box_color = self.left_color_1
+ # find the position for the apple/box/generator_platform
+ self.left_apple_current_pos = self.find_loc(
+ size=self.left_half_size,
+ top=self.left_half_top,
+ reject_agent_pos=True
+ )
+
+ # right boxes
+ self.right_box1_current_pos = self.find_loc(
+ size=self.right_half_size,
+ top=self.right_half_top,
+ reject_agent_pos=True
+ )
+ self.right_box2_current_pos = self.find_loc(
+ size=self.right_half_size,
+ top=self.right_half_top,
+ reject_agent_pos=True,
+ reject_fn=lambda _, pos: tuple(pos) in map(tuple, [self.right_box1_current_pos]),
+ )
+ assert all(self.left_apple_current_pos < np.array([self.current_width - 1, self.current_height - 1]))
+
+ # switch
+ # self.switch_pos = (self.current_width, self.current_height)
+ self.switch_color = self.left_color_1
+ self.switch_current_pos = self.find_loc(
+ top=self.left_half_top,
+ size=self.left_half_size,
+ reject_agent_pos=True,
+ reject_fn=lambda _, pos: tuple(pos) in map(tuple, [self.left_apple_current_pos]),
+ )
+
+ # generator
+ # self.generator_pos = (self.current_width, self.current_height)
+ self.generator_color = self.left_color_1
+ self.generator_current_pos = self.find_loc(
+ top=self.left_half_top,
+ size=self.left_half_size,
+ reject_agent_pos=True,
+ reject_fn=lambda _, pos: (
+ tuple(pos) in map(tuple, [self.left_apple_current_pos])
+ or
+ (self.problem in ["Marbles", "Marble"] and tuple(pos) in [
+ # not in corners
+ (1, 1),
+ (self.current_width-2, 1),
+ (1, self.current_height-2),
+ (self.current_width-2, self.current_height-2),
+ ])
+ or
+                # not in the same row or column as the platform
+ (self.problem in ["Marbles", "Marble"] and any(pos == self.left_apple_current_pos))
+ ),
+ )
+
+ # generator platform
+ self.left_generator_platform_color = self._rand_elem(POSSIBLE_COLORS)
+
+ # marbles
+ # self.marble_pos = (self.current_width, self.current_height)
+ self.marble_color = self._rand_elem(POSSIBLE_COLORS)
+ self.marble_current_pos = self.find_loc(
+ top=self.left_half_top,
+ size=self.left_half_size,
+ reject_agent_pos=True,
+ reject_fn=lambda _, pos: self.problem in ["Marbles", "Marble"] and (
+ tuple(pos) in map(tuple, [self.left_apple_current_pos, self.generator_current_pos])
+ or
+ all(pos != self.generator_current_pos) # reject if not in row or column as the generator
+ or
+ any(pos == 1) # next to a wall
+ or
+ pos[1] == self.current_height-2
+ or
+ pos[0] == self.current_width-2
+ ),
+ )
+
+ self.distractor_color = self.left_color_2
+ # self.distractor_pos = (self.current_width, self.current_height)
+
+ if self.problem in ["Apples", "Boxes"]:
+ distractor_reject_fn = lambda _, pos: tuple(pos) in map(tuple, [self.left_apple_current_pos])
+
+ elif self.problem in ["Switches"]:
+ distractor_reject_fn = lambda _, pos: tuple(pos) in map(tuple, [self.left_apple_current_pos, self.switch_current_pos])
+
+ elif self.problem in ["Generators"]:
+ distractor_reject_fn = lambda _, pos: tuple(pos) in map(tuple, [self.left_apple_current_pos, self.generator_current_pos])
+
+ elif self.problem in ["Marbles", "Marble"]:
+ # problem is marbles
+ same_dim = (self.generator_current_pos == self.marble_current_pos).argmax()
+            distractor_same_dim = 1 - same_dim
+            distractor_reject_fn = lambda _, pos: tuple(pos) in map(tuple, [
+                self.left_apple_current_pos,
+                self.generator_current_pos,
+                self.marble_current_pos
+            ]) or pos[distractor_same_dim] != self.marble_current_pos[distractor_same_dim]
+ # todo: not in corners -> but it's not that important
+ # or tuple(pos) in [
+ # # not in corners
+ # (1, 1),
+ # (self.current_width-2, 1),
+ # (1, self.current_height-2),
+ # (self.current_width-2, self.current_height-2),
+ # ])
+
+ else:
+            raise ValueError("Problem {} undefined.".format(self.problem))
+
+ self.distractor_current_pos = self.find_loc(
+ top=self.left_half_top,
+ size=self.left_half_size,
+ reject_agent_pos=True,
+ # todo: reject based on problem
+ reject_fn=distractor_reject_fn
+ )
+
+ self.put_objects_in_env()
+
+ # place agent
+ if self.agent_side == "L":
+ self.place_agent(size=self.left_half_size, top=self.left_half_top)
+ else:
+ self.place_agent(size=self.right_half_size, top=self.right_half_top)
+
+ # NPC
+ if self.version == "Social":
+ self.npc_color = self._rand_elem(COLOR_NAMES)
+ self.caretaker = Partner(self.npc_color, "Partner", self)
+
+ if self.agent_side == "L":
+ self.place_obj(self.caretaker, size=self.right_half_size, top=self.right_half_top, reject_fn=ObjectsCollaborationEnv.is_in_marble_way)
+ else:
+ self.place_obj(self.caretaker, size=self.left_half_size, top=self.left_half_top, reject_fn=ObjectsCollaborationEnv.is_in_marble_way)
+
+ # Generate the mission string
+ self.mission = 'lets collaborate'
+
+ # Dummy beginning string
+ # self.beginning_string = "This is what you hear. \n"
+        self.beginning_string = "Conversation: \n"  # todo: go back to "This is what you hear"?
+ self.utterance = self.beginning_string
+
+ # utterance appended at the end of each step
+ self.utterance_history = ""
+
+ # used for rendering
+ self.full_conversation = self.utterance
+ self.outcome_info = None
+
+ def put_objects_in_env(self, remove_objects=False):
+
+ assert self.left_apple_current_pos is not None
+ assert self.right_box1_current_pos is not None
+ assert self.right_box2_current_pos is not None
+ assert self.switch_current_pos is not None
+
+ self.switches_block_set = []
+ self.boxes_block_set = []
+ self.right_boxes_block_set = []
+ self.generators_block_set = []
+
+ self.other_box = None
+ self.other_switch = None
+ self.other_generator = None
+
+ # problem: Apples/Boxes/Switches/Generators
+        assert self.problem == (self.parameters["Problem"] if self.parameters else "Apples")
+
+ # move objects (used only in revert), not in gen_grid
+ if remove_objects:
+ # remove apple or box
+ # assert type(self.grid.get(*self.apple_current_pos)) in [Apple, LockableBox]
+ # self.grid.set(*self.apple_current_pos, None)
+
+ # remove apple (after demo it must be an apple)
+ assert type(self.grid.get(*self.left_apple_current_pos)) in [Apple]
+ self.grid.set(*self.left_apple_current_pos, None)
+
+ self.grid.set(*self.right_apple_current_pos, None)
+
+ if self.problem in ["Switches"]:
+ # remove switch
+ assert type(self.grid.get(*self.switch_current_pos)) in [Switch]
+ self.grid.set(*self.switch.cur_pos, None)
+
+ elif self.problem in ["Generators", "Marbles", "Marble"]:
+ # remove generator
+ assert type(self.grid.get(*self.generator.cur_pos)) in [AppleGenerator]
+ self.grid.set(*self.generator.cur_pos, None)
+
+ if self.problem in ["Marbles", "Marble"]:
+ # remove generator
+ assert type(self.grid.get(*self.marble.cur_pos)) in [Marble]
+ self.grid.set(*self.marble.cur_pos, None)
+
+ if self.marble.tee_uncovered:
+ self.grid.set(*self.marble.tee.cur_pos, None)
+
+ elif self.problem in ["Apples", "Boxes"]:
+ pass
+
+ else:
+ raise ValueError("Undefined problem {}".format(self.problem))
+
+ # remove distractor
+ if self.problem in ["Boxes", "Switches", "Generators", "Marbles", "Marble"]:
+ assert type(self.grid.get(*self.distractor_current_pos)) in [LockableBox, Switch, AppleGenerator]
+ self.grid.set(*self.distractor_current_pos, None)
+
+ # apple
+ self.left_apple = Apple()
+ self.right_apple = Apple()
+
+ # right apple
+ self.right_box1 = LockableBox(
+ self.right_box1_color,
+ contains=self.right_apple,
+ is_locked=False,
+ block_set=self.right_boxes_block_set
+ )
+ self.right_boxes_block_set.append(self.right_box1)
+
+ # right apple
+ self.right_box2 = LockableBox(
+ self.right_box2_color,
+ contains=self.right_apple,
+ is_locked=False,
+ block_set=self.right_boxes_block_set
+ )
+ self.right_boxes_block_set.append(self.right_box2)
+
+ # Box
+ locked = self.problem == "Switches"
+
+ self.box = LockableBox(
+ self.box_color,
+ # contains=self.left_apple,
+ is_locked=locked,
+ block_set=self.boxes_block_set
+ )
+ self.boxes_block_set.append(self.box)
+
+ # Switch
+ self.switch = Switch(
+ color=self.switch_color,
+ # lockable_object=self.box,
+ locker_switch=True,
+ no_turn_off=True,
+ no_light=self.switch_no_light,
+ block_set=self.switches_block_set,
+ )
+
+ self.switches_block_set.append(self.switch)
+
+ # Generator
+ self.generator = AppleGenerator(
+ self.generator_color,
+ block_set=self.generators_block_set,
+ # on_push=lambda: self.grid.set(*self.left_apple_current_pos, self.left_apple),
+ marble_activation=self.problem in ["Marble"],
+ )
+ self.generators_block_set.append(self.generator)
+
+ self.left_generator_platform = GeneratorPlatform(self.left_generator_platform_color)
+
+ self.marble = Marble(self.marble_color, env=self)
+
+ # right side
+ self.put_obj_np(self.right_box1, self.right_box1_current_pos)
+ self.put_obj_np(self.right_box2, self.right_box2_current_pos)
+
+ self.candidate_objects=[]
+ # left side
+ if self.problem == "Apples":
+ self.put_obj_np(self.left_apple, self.left_apple_current_pos)
+ self.candidate_objects.append(self.left_apple)
+
+ elif self.problem in ["Boxes"]:
+ self.put_obj_np(self.box, self.left_apple_current_pos)
+ self.candidate_objects.append(self.box)
+
+ elif self.problem in ["Switches"]:
+ self.put_obj_np(self.box, self.left_apple_current_pos)
+ self.put_obj_np(self.switch, self.switch_current_pos)
+ self.candidate_objects.append(self.switch)
+
+ elif self.problem in ["Generators", "Marble"]:
+ self.put_obj_np(self.generator, self.generator_current_pos)
+ self.put_obj_np(self.left_generator_platform, self.left_apple_current_pos)
+ self.candidate_objects.append(self.generator)
+
+ if self.problem in ["Marble"]:
+ self.put_obj_np(self.marble, self.marble_current_pos)
+
+ else:
+ raise ValueError("Problem {} not defined. ".format(self.problem))
+
+ # Distractors
+ if self.problem == "Boxes":
+ assert not locked
+
+ self.other_box = LockableBox(
+ self.left_color_2,
+ is_locked=locked,
+ block_set=self.boxes_block_set,
+ )
+ self.boxes_block_set.append(self.other_box)
+
+ self.put_obj_np(self.other_box, self.distractor_current_pos)
+ self.candidate_objects.append(self.other_box)
+
+ elif self.problem == "Switches":
+ self.other_switch = Switch(
+ color=self.left_color_2,
+ locker_switch=True,
+ no_turn_off=True,
+ no_light=self.switch_no_light,
+ block_set=self.switches_block_set,
+ )
+ self.switches_block_set.append(self.other_switch)
+
+ self.put_obj_np(self.other_switch, self.distractor_current_pos)
+ self.candidate_objects.append(self.other_switch)
+
+ elif self.problem in ["Generators", "Marble"]:
+ self.other_generator = AppleGenerator(
+ color=self.left_color_2,
+ block_set=self.generators_block_set,
+ marble_activation=self.problem in ["Marble"],
+ )
+ self.generators_block_set.append(self.other_generator)
+
+ self.put_obj_np(self.other_generator, self.distractor_current_pos)
+ self.candidate_objects.append(self.other_generator)
+
+ def reset(
+ self, *args, **kwargs
+ ):
+ # This env must be used inside the parametric env
+ if not kwargs:
+            # The only place where kwargs can be empty is during class construction;
+            # reset should be called again before using the env (the parametric env does this in its constructor)
+ assert self.parameters is None
+ assert not self.init_done
+ self.init_done = True
+
+ obs = super().reset()
+ return obs
+
+ else:
+ assert self.init_done
+
+ self.parameters = dict(kwargs)
+
+ assert self.parameters is not None
+ assert len(self.parameters) > 0
+
+ obs = super().reset()
+
+ self.agent_ate_an_apple = False
+ self.chosen_right_box = None
+ self.chosen_left_obj = None
+
+ return obs
+
+ def step(self, action):
+ success = False
+ p_action = action[0]
+ utterance_action = action[1:]
+
+ left_apple_had_been_eaten = self.left_apple.eaten
+ right_apple_had_been_eaten = self.right_apple.eaten
+
+ # primitive actions
+ _, reward, done, info = super().step(p_action)
+
+ if self.problem in ["Marbles", "Marble"]:
+ # todo: create objects which can stepped automatically?
+ self.marble.step()
+
+ if not self.agent_ate_an_apple:
+ if self.agent_side == "L":
+ self.agent_ate_an_apple = self.left_apple.eaten and not left_apple_had_been_eaten
+ else:
+ self.agent_ate_an_apple = self.right_apple.eaten and not right_apple_had_been_eaten
+
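+        # Collaboration mechanic: opening one of the two right-side boxes selects, by color,
+        # which left-side object becomes the active one able to yield the left apple.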
+ if self.right_box1.is_open:
+ self.chosen_right_box = self.right_box1
+
+ if self.right_box2.is_open:
+ self.chosen_right_box = self.right_box2
+
+ if self.chosen_right_box is not None:
+ chosen_color = self.chosen_right_box.color
+ self.chosen_left_obj = [o for o in self.candidate_objects if o.color == chosen_color][0]
+
+ if type(self.chosen_left_obj) == LockableBox:
+ self.chosen_left_obj.contains = self.left_apple
+
+ elif type(self.chosen_left_obj) == Switch:
+ self.chosen_left_obj.lockable_object = self.box
+ self.box.contains = self.left_apple
+
+ elif type(self.chosen_left_obj) == AppleGenerator:
+                self.chosen_left_obj.on_push = lambda: self.grid.set(*self.left_apple_current_pos, self.left_apple)
+
+ else:
+ raise ValueError("Unknown target object.")
+
+ # utterances
+ agent_spoke = not all(np.isnan(utterance_action))
+ if agent_spoke:
+ utterance = self.grammar.construct_utterance(utterance_action)
+
+ if self.hear_yourself:
+ self.utterance += "YOU: {} \n".format(utterance)
+ self.full_conversation += "YOU: {} \n".format(utterance)
+ else:
+ utterance = None
+
+ if self.version == "Social":
+ reply, npc_info = self.caretaker.step(utterance)
+
+ if reply:
+ self.utterance += "{}: {} \n".format(self.caretaker.name, reply)
+ self.full_conversation += "{}: {} \n".format(self.caretaker.name, reply)
+ else:
+ npc_info = {
+ "prim_action": "no_op",
+ "utterance": "no_op",
+ "was_introduced_to": False,
+ }
+
+
+ # aftermath
+ if p_action == self.actions.done:
+ done = True
+
+ if (self.role in ["A", "B"] or self.version == "Asocial") and self.agent_ate_an_apple:
+ reward = self._reward()
+ success = True
+ done = True
+
+ elif self.role == "Meta" and self.version == "Social" and self.agent_ate_an_apple and self.caretaker.ate_an_apple:
+
+ if self.agent_side == "L":
+ reward = self._reward() / 2
+ success = True
+ done = True
+
+ else:
+ # revert and rotate
+ reward = self._reward() / 2
+ self.agent_ate_an_apple = False
+ self.caretaker.ate_an_apple = False
+ self.agent_side = "L"
+ self.put_objects_in_env(remove_objects=True)
+
+ # teleport the agent and the NPC
+ self.place_agent(size=self.left_half_size, top=self.left_half_top)
+
+ self.grid.set(*self.caretaker.cur_pos, None)
+
+ self.caretaker = Partner(self.npc_color, "Partner", self)
+ self.place_obj(self.caretaker, size=self.right_half_size, top=self.right_half_top, reject_fn=ObjectsCollaborationEnv.is_in_marble_way)
+
+ # discount
+ if self.step_penalty:
+ reward = reward - 0.01
+
+ # update obs with NPC movement
+ obs = self.gen_obs(full_obs=self.full_obs)
+
+ # fill observation with text
+ self.append_existing_utterance_to_history()
+ obs = self.add_utterance_to_observation(obs)
+ self.reset_utterance()
+
+ if done:
+ if reward > 0:
+ self.outcome_info = "SUCCESS: agent got {} reward \n".format(np.round(reward, 1))
+ else:
+ self.outcome_info = "FAILURE: agent got {} reward \n".format(reward)
+
+ if self.version == "Social":
+ # is the npc seen by the agent
+ ag_view_npc = self.relative_coords(*self.caretaker.cur_pos)
+
+ if ag_view_npc is not None:
+ # in the agent's field of view
+ ag_view_npc_x, ag_view_npc_y = ag_view_npc
+
+ n_dims = obs['image'].shape[-1]
+ npc_encoding = self.caretaker.encode(n_dims)
+
+ # is it occluded
+ npc_observed = all(obs['image'][ag_view_npc_x, ag_view_npc_y] == npc_encoding)
+ else:
+ npc_observed = False
+ else:
+ npc_observed = False
+
+ info = {**info, **{"NPC_"+k: v for k, v in npc_info.items()}}
+
+ info["NPC_observed"] = npc_observed
+ info["success"] = success
+
+ return obs, reward, done, info
+
+ def _reward(self):
+ if self.diminished_reward:
+ return super()._reward()
+ else:
+ return 1.0
+
+ # def render(self, *args, **kwargs):
+ # obs = super().render(*args, **kwargs)
+ # self.window.clear_text() # erase previous text
+ # self.window.set_caption(self.full_conversation)
+ #
+ # # self.window.ax.set_title("correct color: {}".format(self.box.target_color), loc="left", fontsize=10)
+ #
+ # if self.outcome_info:
+ # color = None
+ # if "SUCCESS" in self.outcome_info:
+ # color = "lime"
+ # elif "FAILURE" in self.outcome_info:
+ # color = "red"
+ # self.window.add_text(*(0.01, 0.85, self.outcome_info),
+ # **{'fontsize': 15, 'color': color, 'weight': "bold"})
+ #
+ # self.window.show_img(obs) # re-draw image to add changes to window
+ # return obs
+
+register(
+ id='SocialAI-ObjectsCollaboration-v0',
+ entry_point='gym_minigrid.social_ai_envs:ObjectsCollaborationEnv'
+)
\ No newline at end of file
diff --git a/gym-minigrid/gym_minigrid/social_ai_envs/socialaigrammar.py b/gym-minigrid/gym_minigrid/social_ai_envs/socialaigrammar.py
new file mode 100644
index 0000000000000000000000000000000000000000..eaa88d5a70c9bafad255087c3a4a52dce94c5632
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/social_ai_envs/socialaigrammar.py
@@ -0,0 +1,52 @@
+import gym.spaces as spaces
+from enum import IntEnum
+
+# Enumeration of possible actions
+class SocialAIActions(IntEnum):
+ # Turn left, turn right, move forward
+ left = 0
+ right = 1
+ forward = 2
+
+ # no pickup-drop
+ # # Pick up an object
+ # pickup = 3
+ # # Drop an object
+ # drop = 4
+
+ # Toggle/activate an object
+ toggle = 3
+
+ # Done completing task
+ done = 4
+
+
+class SocialAIGrammar(object):
+
+ templates = ["Where is", "Help", "Close", "How are"]
+ things = [
+ "please", "the exit", "the wall", "you", "the ceiling", "the window", "the entrance", "the closet",
+ "the drawer", "the fridge", "the floor", "the lamp", "the trash can", "the chair", "the bed", "the sofa"
+ ]
+ assert len(templates)*len(things) == 64
+    print("language complexity: {}".format(len(templates) * len(things)))
+
+ grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+
+ @classmethod
+ def get_action(cls, template, thing):
+        return [cls.templates.index(template), cls.things.index(thing)]
+
+    @classmethod
+ def construct_utterance(cls, action):
+ return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + " "
+
+ @classmethod
+ def contains_utterance(cls, utterance):
+ for t in range(len(cls.templates)):
+ for th in range(len(cls.things)):
+ if utterance == cls.construct_utterance([t, th]):
+ return True
+ return False
+
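+# Illustrative example: a full agent action is [primitive_action, template_index, thing_index], e.g.
+#   SocialAIGrammar.construct_utterance([0, 3])  -> "Where is you "
+#   SocialAIGrammar.get_action("Help", "please") -> [1, 0]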
+SocialAIActionSpace = spaces.MultiDiscrete([len(SocialAIActions),
+ *SocialAIGrammar.grammar_action_space.nvec])
diff --git a/gym-minigrid/gym_minigrid/social_ai_envs/socialaiparamenv.py b/gym-minigrid/gym_minigrid/social_ai_envs/socialaiparamenv.py
new file mode 100644
index 0000000000000000000000000000000000000000..49cb290848fa9a272cb9d4c2b579aa33ddb3c9de
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/social_ai_envs/socialaiparamenv.py
@@ -0,0 +1,430 @@
+import warnings
+from itertools import chain
+from gym_minigrid.minigrid import *
+from gym_minigrid.parametric_env import *
+from gym_minigrid.register import register
+from gym_minigrid.social_ai_envs import InformationSeekingEnv, MarblePassEnv, LeverDoorEnv, MarblePushEnv, AppleStealingEnv, ObjectsCollaborationEnv
+from gym_minigrid.social_ai_envs.socialaigrammar import SocialAIGrammar, SocialAIActions, SocialAIActionSpace
+from gym_minigrid.curriculums import *
+
+import inspect, importlib
+
+# for used for automatic registration of environments
+# used for automatic registration of environments
+
+
+class SocialAIParamEnv(gym.Env):
+ """
+ Meta-Environment containing all other environment (multi-task learning)
+ """
+
+ def __init__(
+ self,
+ size=10,
+ hidden_npc=False,
+ see_through_walls=False,
+        max_steps=80,  # previously 50; 80 may work better because of emulation
+ switch_no_light=True,
+ lever_active_steps=10,
+ curriculum=None,
+ expert_curriculum_thresholds=(0.9, 0.8),
+ expert_curriculum_average_interval=100,
+ expert_curriculum_minimum_episodes=1000,
+ n_colors=3,
+ egocentric_observation=True,
+ ):
+ if n_colors != 3:
+            warnings.warn(f"You are using {n_colors} colors instead of the usual 3.")
+
+ self.lever_active_steps = lever_active_steps
+ self.egocentric_observation = egocentric_observation
+
+ # Number of cells (width and height) in the agent view
+ self.agent_view_size = 7
+
+ # Number of object dimensions (i.e. number of channels in symbolic image)
+        # if egocentric observation is not used, absolute coordinates are added to the encoding
+ self.encoding_size = 6 + 2*bool(not egocentric_observation)
+
+ self.max_steps = max_steps
+
+ self.switch_no_light = switch_no_light
+
+ # Observations are dictionaries containing an
+ # encoding of the grid and a textual 'mission' string
+ self.observation_space = spaces.Box(
+ low=0,
+ high=255,
+ shape=(self.agent_view_size, self.agent_view_size, self.encoding_size),
+ dtype='uint8'
+ )
+ self.observation_space = spaces.Dict({
+ 'image': self.observation_space
+ })
+
+ self.hidden_npc = hidden_npc
+
+ # construct the tree
+ self.parameter_tree = self.construct_tree()
+
+ # print tree for logging purposes
+ # self.parameter_tree.print_tree()
+
+ if curriculum in ["intro_seq", "intro_seq_scaf"]:
+ print("Scaffolding Expert")
+ self.expert_curriculum_thresholds = expert_curriculum_thresholds
+ self.expert_curriculum_average_interval = expert_curriculum_average_interval
+ self.expert_curriculum_minimum_episodes = expert_curriculum_minimum_episodes
+ self.curriculum = ScaffoldingExpertCurriculum(
+ phase_thresholds=self.expert_curriculum_thresholds,
+ average_interval=self.expert_curriculum_average_interval,
+ minimum_episodes=self.expert_curriculum_minimum_episodes,
+ type=curriculum,
+ )
+
+ else:
+ self.curriculum = curriculum
+
+ self.current_env = None
+
+ self.envs = {}
+
+ if self.parameter_tree.root.label == "Env_type":
+ for env_type in self.parameter_tree.root.children:
+ if env_type.label == "Information_seeking":
+ e = InformationSeekingEnv(
+ max_steps=max_steps,
+ size=size,
+ switch_no_light=self.switch_no_light,
+ see_through_walls=see_through_walls,
+ n_colors=n_colors,
+ hidden_npc=self.hidden_npc,
+ egocentric_observation=self.egocentric_observation,
+ )
+ self.envs["Info"] = e
+
+ elif env_type.label == "Collaboration":
+ e = MarblePassEnv(max_steps=max_steps, size=size, hidden_npc=self.hidden_npc, egocentric_observation=egocentric_observation)
+ self.envs["Collaboration_Marble_Pass"] = e
+
+ e = LeverDoorEnv(max_steps=max_steps, size=size, lever_active_steps=self.lever_active_steps, hidden_npc=self.hidden_npc, egocentric_observation=egocentric_observation)
+ self.envs["Collaboration_Lever_Door"] = e
+
+ e = MarblePushEnv(max_steps=max_steps, size=size, lever_active_steps=self.lever_active_steps, hidden_npc=self.hidden_npc, egocentric_observation=egocentric_observation)
+ self.envs["Collaboration_Marble_Push"] = e
+
+ e = ObjectsCollaborationEnv(max_steps=max_steps, size=size, hidden_npc=self.hidden_npc, switch_no_light=self.switch_no_light, egocentric_observation=egocentric_observation)
+ self.envs["Collaboration_Objects"] = e
+
+ elif env_type.label == "AppleStealing":
+ e = AppleStealingEnv(max_steps=max_steps, size=size, see_through_walls=see_through_walls,
+ hidden_npc=self.hidden_npc, egocentric_observation=egocentric_observation)
+ self.envs["OthersPerceptionInference"] = e
+
+ else:
+ raise ValueError(f"Undefined env type {env_type.label}.")
+
+ else:
+ raise ValueError("Env_type should be the root node")
+
+ self.all_npc_utterance_actions = sorted(list(set(chain(*[e.all_npc_utterance_actions for e in self.envs.values()]))))
+
+ self.grammar = SocialAIGrammar()
+
+ # set up the action space
+ self.action_space = SocialAIActionSpace
+ self.actions = SocialAIActions
+ self.npc_prim_actions_dict = SocialAINPCActionsDict
+
+ # all envs must have the same grammar
+ for env in self.envs.values():
+ assert isinstance(env.grammar, type(self.grammar))
+ assert env.actions is self.actions
+ assert env.action_space is self.action_space
+
+ # suggestion: encoding size is automatically set to max?
+ assert env.encoding_size is self.encoding_size
+ assert env.observation_space == self.observation_space
+ assert env.prim_actions_dict == self.npc_prim_actions_dict
+
+ self.reset()
+
+ def draw_tree(self, ignore_labels=[], savedir="viz"):
+ self.parameter_tree.draw_tree("{}/param_tree_{}".format(savedir, self.spec.id), ignore_labels=ignore_labels)
+
+ def print_tree(self):
+ self.parameter_tree.print_tree()
+
+ def construct_tree(self):
+ tree = ParameterTree()
+
+ env_type_nd = tree.add_node("Env_type", type="param")
+
+ # Information seeking
+ inf_seeking_nd = tree.add_node("Information_seeking", parent=env_type_nd, type="value")
+
+ prag_fr_compl_nd = tree.add_node("Pragmatic_frame_complexity", parent=inf_seeking_nd, type="param")
+ tree.add_node("No", parent=prag_fr_compl_nd, type="value")
+ tree.add_node("Eye_contact", parent=prag_fr_compl_nd, type="value")
+ tree.add_node("Ask", parent=prag_fr_compl_nd, type="value")
+ tree.add_node("Ask_Eye_contact", parent=prag_fr_compl_nd, type="value")
+
+ # scaffolding
+ scaffolding_nd = tree.add_node("Scaffolding", parent=inf_seeking_nd, type="param")
+ scaffolding_N_nd = tree.add_node("N", parent=scaffolding_nd, type="value")
+ scaffolding_Y_nd = tree.add_node("Y", parent=scaffolding_nd, type="value")
+
+ cue_type_nd = tree.add_node("Cue_type", parent=scaffolding_N_nd, type="param")
+ tree.add_node("Language_Color", parent=cue_type_nd, type="value")
+ tree.add_node("Language_Feedback", parent=cue_type_nd, type="value")
+ tree.add_node("Pointing", parent=cue_type_nd, type="value")
+ tree.add_node("Emulation", parent=cue_type_nd, type="value")
+
+
+ N_bo_nd = tree.add_node("N", parent=inf_seeking_nd, type="param")
+ tree.add_node("2", parent=N_bo_nd, type="value")
+ tree.add_node("1", parent=N_bo_nd, type="value")
+
+ problem_nd = tree.add_node("Problem", parent=inf_seeking_nd, type="param")
+
+ doors_nd = tree.add_node("Doors", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=doors_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=doors_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ boxes_nd = tree.add_node("Boxes", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=boxes_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=boxes_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ switches_nd = tree.add_node("Switches", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=switches_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=switches_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ generators_nd = tree.add_node("Generators", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=generators_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=generators_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ levers_nd = tree.add_node("Levers", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=levers_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=levers_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ doors_nd = tree.add_node("Marble", parent=problem_nd, type="value")
+ version_nd = tree.add_node("N", parent=doors_nd, type="param")
+ tree.add_node("2", parent=version_nd, type="value")
+ peer_nd = tree.add_node("Peer", parent=doors_nd, type="param")
+ tree.add_node("Y", parent=peer_nd, type="value")
+
+ # Collaboration
+ collab_nd = tree.add_node("Collaboration", parent=env_type_nd, type="value")
+
+ colab_type_nd = tree.add_node("Problem", parent=collab_nd, type="param")
+
+ problem_nd = tree.add_node("Boxes", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+ role_nd = tree.add_node("Version", parent=problem_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+
+ problem_nd = tree.add_node("Switches", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+ role_nd = tree.add_node("Version", parent=problem_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+
+ problem_nd = tree.add_node("Generators", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+ role_nd = tree.add_node("Version", parent=problem_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+
+ problem_nd = tree.add_node("Marble", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+ role_nd = tree.add_node("Version", parent=problem_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+
+ problem_nd = tree.add_node("MarblePass", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+ role_nd = tree.add_node("Version", parent=problem_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+ tree.add_node("Asocial", parent=role_nd, type="value")
+
+ problem_nd = tree.add_node("MarblePush", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+ role_nd = tree.add_node("Version", parent=problem_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+
+ problem_nd = tree.add_node("LeverDoor", parent=colab_type_nd, type="value")
+ role_nd = tree.add_node("Role", parent=problem_nd, type="param")
+ tree.add_node("A", parent=role_nd, type="value")
+ tree.add_node("B", parent=role_nd, type="value")
+ role_nd = tree.add_node("Version", parent=problem_nd, type="param")
+ tree.add_node("Social", parent=role_nd, type="value")
+
+ # Perspective taking
+ collab_nd = tree.add_node("AppleStealing", parent=env_type_nd, type="value")
+
+ role_nd = tree.add_node("Version", parent=collab_nd, type="param")
+ tree.add_node("Asocial", parent=role_nd, type="value")
+ social_nd = tree.add_node("Social", parent=role_nd, type="value")
+
+ move_nd = tree.add_node("NPC_movement", parent=social_nd, type="param")
+ tree.add_node("Walking", parent=move_nd, type="value")
+ tree.add_node("Rotating", parent=move_nd, type="value")
+
+ obstacles_nd = tree.add_node("Obstacles", parent=collab_nd, type="param")
+ tree.add_node("No", parent=obstacles_nd, type="value")
+ tree.add_node("A_bit", parent=obstacles_nd, type="value")
+ tree.add_node("Medium", parent=obstacles_nd, type="value")
+ tree.add_node("A_lot", parent=obstacles_nd, type="value")
+
+ return tree
+
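+    # Illustrative note (not part of the original source): sample_env_params() returns a
+    # dict mapping parameter nodes to their sampled value nodes; construct_env_from_params()
+    # below keys off the node labels. A sampled set with labels roughly like
+    #   {'Env_type': 'Collaboration', 'Problem': 'LeverDoor', 'Role': 'A', 'Version': 'Social'}
+    # would select self.envs["Collaboration_Lever_Door"] and pass the labels as reset kwargs.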
+ def construct_env_from_params(self, params):
+ params_labels = {k.label: v.label for k, v in params.items()}
+ if params_labels['Env_type'] == "Collaboration":
+
+ if params_labels["Problem"] == "MarblePass":
+ env = self.envs["Collaboration_Marble_Pass"]
+
+ elif params_labels["Problem"] == "LeverDoor":
+ env = self.envs["Collaboration_Lever_Door"]
+
+ elif params_labels["Problem"] == "MarblePush":
+ env = self.envs["Collaboration_Marble_Push"]
+
+ elif params_labels["Problem"] in ["Boxes", "Switches", "Generators", "Marble"]:
+ env = self.envs["Collaboration_Objects"]
+
+ else:
+ raise ValueError("params badly defined.")
+
+ elif params_labels['Env_type'] == "Information_seeking":
+ env = self.envs["Info"]
+
+ elif params_labels['Env_type'] == "AppleStealing":
+ env = self.envs["OthersPerceptionInference"]
+
+ else:
+ raise ValueError("params badly defined.")
+
+ reset_kwargs = params_labels
+
+ return env, reset_kwargs
+
+ def reset(self, with_info=False):
+ # select a new social environment at random, for each new episode
+
+ old_window = None
+ if self.current_env: # a previous env exists, save old window
+ old_window = self.current_env.window
+
+ self.current_params = self.parameter_tree.sample_env_params(ACL=self.curriculum)
+
+ self.current_env, reset_kwargs = self.construct_env_from_params(self.current_params)
+        assert reset_kwargs != {}
+ assert reset_kwargs is not None
+
+ # print("Sampled parameters:")
+ # for k, v in reset_kwargs.items():
+ # print(f'\t{k}:{v}')
+
+ if with_info:
+ obs, info = self.current_env.reset_with_info(**reset_kwargs)
+ else:
+ obs = self.current_env.reset(**reset_kwargs)
+
+ # carry on window if this env is not the first
+ if old_window:
+ self.current_env.window = old_window
+
+ if with_info:
+ return obs, info
+ else:
+ return obs
+
+ def reset_with_info(self):
+ return self.reset(with_info=True)
+
+
+ def seed(self, seed=1337):
+ # Seed the random number generator
+ for env in self.envs.values():
+ env.seed(seed)
+
+ return [seed]
+
+ def set_curriculum_parameters(self, params):
+ if self.curriculum is not None:
+ self.curriculum.set_parameters(params)
+
+ def step(self, action):
+ assert self.current_env
+ assert self.current_env.parameters is not None
+
+ obs, reward, done, info = self.current_env.step(action)
+
+ info["parameters"] = self.current_params
+
+ if done:
+ if info["success"]:
+ # self.current_env.outcome_info = "SUCCESS: agent got {} reward \n".format(np.round(reward, 1))
+ self.current_env.outcome_info = "SUCCESS\n"
+ else:
+ self.current_env.outcome_info = "FAILURE\n"
+
+ if self.curriculum is not None:
+ for k, v in self.curriculum.get_info().items():
+ info["curriculum_info_"+k] = v
+
+ return obs, reward, done, info
+
+
+ @property
+ def window(self):
+ assert self.current_env
+ return self.current_env.window
+
+ @window.setter
+ def window(self, value):
+ self.current_env.window = value
+
+ def render(self, *args, **kwargs):
+ assert self.current_env
+ return self.current_env.render(*args, **kwargs)
+
+ @property
+ def step_count(self):
+ return self.current_env.step_count
+
+ def get_mission(self):
+ return self.current_env.get_mission()
+
+
+defined_classes_ = [name for name, _ in inspect.getmembers(importlib.import_module(__name__), inspect.isclass)]
+
+envs = list(set(defined_classes_) - set(defined_classes))
+assert all([e.endswith("Env") for e in envs])
+
+for env in envs:
+ register(
+ id='SocialAI-{}-v1'.format(env),
+ entry_point='gym_minigrid.social_ai_envs:{}'.format(env)
+ )
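+
+# Illustrative usage sketch (assuming the standard gym API; the exact id depends on the
+# environment class name, following the 'SocialAI-{ClassName}-v1' pattern used above):
+#   import gym
+#   env = gym.make('SocialAI-<ClassName>-v1')
+#   obs = env.reset()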
diff --git a/gym-minigrid/gym_minigrid/window.py b/gym-minigrid/gym_minigrid/window.py
new file mode 100644
index 0000000000000000000000000000000000000000..92cada4935115f890fb16e62648e2ed677663aca
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/window.py
@@ -0,0 +1,137 @@
+import sys
+import numpy as np
+
+# Only ask users to install matplotlib if they actually need it
+try:
+ import matplotlib.pyplot as plt
+except:
+ print('To display the environment in a window, please install matplotlib, eg:')
+ print('pip3 install --user matplotlib')
+ sys.exit(-1)
+
+class Window:
+ """
+ Window to draw a gridworld instance using Matplotlib
+ """
+
+ def __init__(self, title, figsize=(3, 3)):
+ self.fig = None
+
+ self.imshow_obj = None
+
+ # Create the figure and axes
+ self.fig, self.ax = plt.subplots(
+ # figsize=(10, 5),
+ figsize=figsize,
+ )
+
+ # Show the env name in the window title
+ self.fig.canvas.set_window_title(title)
+
+ # Turn off x/y axis numbering/ticks
+ self.ax.xaxis.set_ticks_position('none')
+ self.ax.yaxis.set_ticks_position('none')
+ _ = self.ax.set_xticklabels([])
+ _ = self.ax.set_yticklabels([])
+
+ # list of text handles
+ self.txt_handles = []
+
+ # Flag indicating the window was closed
+ self.closed = False
+
+ def close_handler(evt):
+ self.closed = True
+
+ self.fig.canvas.mpl_connect('close_event', close_handler)
+
+ def show_img(self, img):
+ """
+ Show an image or update the image being shown
+ """
+
+ # Show the first image of the environment
+ if self.imshow_obj is None:
+ self.imshow_obj = self.ax.imshow(img, interpolation='bilinear')
+
+ self.imshow_obj.set_data(img)
+ self.fig.canvas.draw()
+
+ # Let matplotlib process UI events
+ # This is needed for interactive mode to work properly
+ # plt.pause(0.001)
+
+ def set_caption(self, text, relevant_set=None):
+ """
+ Set/update the caption text below the image
+ """
+
+ # plt.xlabel(text)
+ # text = "All utterances:\n\n"+text
+ lines = text.split("\n")
+
+ if len(lines) > 8:
+ lines = ["..."]+lines[-8:]
+
+ text = "\n".join(lines)
+
+ if hasattr(self, "caption"):
+ self.caption.set_text(text)
+ else:
+ # self.caption = plt.text(400, 250, text, ha="left",wrap=True)
+ self.caption = plt.text(330, 250, text, ha="left", wrap=True)
+
+ if relevant_set is not None:
+ # if a line in the text has one of these strings it will be put in the relevant set
+
+ relevant_lines = ["Relevant utterances:\n"] + [
+ l for l in text.rsplit("\n") if any([r in l for r in relevant_set])
+ ] + ["\n"]
+ relevant_text = "\n".join(relevant_lines)
+
+
+ if hasattr(self, "relevant_caption"):
+ self.relevant_caption.set_text(relevant_text)
+ else:
+ self.relevant_caption = plt.text(-200, 250, relevant_text, ha="left")
+
+
+ def reg_key_handler(self, key_handler):
+ """
+ Register a keyboard event handler
+ """
+
+ # Keyboard handler
+ self.fig.canvas.mpl_connect('key_press_event', key_handler)
+
+ def show(self, block=True):
+ """
+ Show the window, and start an event loop
+ """
+
+ # If not blocking, trigger interactive mode
+ if not block:
+ plt.ion()
+
+ # Show the plot
+        # In non-interactive mode, this enters the matplotlib event loop
+ # In interactive mode, this call does not block
+ plt.show()
+
+ def close(self):
+ """
+ Close the window
+ """
+
+ plt.close()
+
+ def add_text(self, *args, **kwargs):
+
+ kwargs['transform'] = self.ax.transAxes
+ self.txt_handles.append(self.ax.text(*args, **kwargs))
+
+ def clear_text(self):
+
+ if len(self.txt_handles) > 0:
+ while len(self.txt_handles) > 0:
+ self.txt_handles.pop().remove()
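+
+# Illustrative usage sketch (assuming `img` is an RGB numpy array, e.g. from env.render('rgb_array')):
+#   window = Window('gym_minigrid')
+#   window.show_img(img)
+#   window.show(block=False)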
diff --git a/gym-minigrid/gym_minigrid/wrappers.py b/gym-minigrid/gym_minigrid/wrappers.py
new file mode 100644
index 0000000000000000000000000000000000000000..1c0708616f08a260c1419d7f0d81636066307a8d
--- /dev/null
+++ b/gym-minigrid/gym_minigrid/wrappers.py
@@ -0,0 +1,357 @@
+import math
+import operator
+from functools import reduce
+
+import numpy as np
+import gym
+from gym import error, spaces, utils
+from .minigrid import OBJECT_TO_IDX, COLOR_TO_IDX, STATE_TO_IDX
+
+class ReseedWrapper(gym.core.Wrapper):
+ """
+ Wrapper to always regenerate an environment with the same set of seeds.
+ This can be used to force an environment to always keep the same
+ configuration when reset.
+ """
+
+ def __init__(self, env, seeds=[0], seed_idx=0):
+ self.seeds = list(seeds)
+ self.seed_idx = seed_idx
+ super().__init__(env)
+
+ def reset(self, **kwargs):
+ seed = self.seeds[self.seed_idx]
+ self.seed_idx = (self.seed_idx + 1) % len(self.seeds)
+ self.env.seed(seed)
+ return self.env.reset(**kwargs)
+
+ def step(self, action):
+ obs, reward, done, info = self.env.step(action)
+ return obs, reward, done, info
+
+class ActionBonus(gym.core.Wrapper):
+ """
+ Wrapper which adds an exploration bonus.
+ This is a reward to encourage exploration of less
+ visited (state,action) pairs.
+ """
+
+ def __init__(self, env):
+ super().__init__(env)
+ self.counts = {}
+
+ def step(self, action):
+ obs, reward, done, info = self.env.step(action)
+
+ env = self.unwrapped
+ tup = (tuple(env.agent_pos), env.agent_dir, action)
+
+ # Get the count for this (s,a) pair
+ pre_count = 0
+ if tup in self.counts:
+ pre_count = self.counts[tup]
+
+ # Update the count for this (s,a) pair
+ new_count = pre_count + 1
+ self.counts[tup] = new_count
+
+ bonus = 1 / math.sqrt(new_count)
+ reward += bonus
+
+ return obs, reward, done, info
+
+ def reset(self, **kwargs):
+ return self.env.reset(**kwargs)
+
+class StateBonus(gym.core.Wrapper):
+ """
+ Adds an exploration bonus based on which positions
+ are visited on the grid.
+ """
+
+ def __init__(self, env):
+ super().__init__(env)
+ self.counts = {}
+
+ def step(self, action):
+ obs, reward, done, info = self.env.step(action)
+
+ # Tuple based on which we index the counts
+ # We use the position after an update
+ env = self.unwrapped
+ tup = (tuple(env.agent_pos))
+
+ # Get the count for this key
+ pre_count = 0
+ if tup in self.counts:
+ pre_count = self.counts[tup]
+
+ # Update the count for this key
+ new_count = pre_count + 1
+ self.counts[tup] = new_count
+
+ bonus = 1 / math.sqrt(new_count)
+ reward += bonus
+
+ return obs, reward, done, info
+
+ def reset(self, **kwargs):
+ return self.env.reset(**kwargs)
+
+class ImgObsWrapper(gym.core.ObservationWrapper):
+ """
+ Use the image as the only observation output, no language/mission.
+ """
+
+ def __init__(self, env):
+ super().__init__(env)
+ self.observation_space = env.observation_space.spaces['image']
+
+ def observation(self, obs):
+ return obs['image']
+
+class OneHotPartialObsWrapper(gym.core.ObservationWrapper):
+ """
+ Wrapper to get a one-hot encoding of a partially observable
+ agent view as observation.
+ """
+
+ def __init__(self, env, tile_size=8):
+ super().__init__(env)
+
+ self.tile_size = tile_size
+
+ obs_shape = env.observation_space['image'].shape
+
+ # Number of bits per cell
+ num_bits = len(OBJECT_TO_IDX) + len(COLOR_TO_IDX) + len(STATE_TO_IDX)
+
+ self.observation_space.spaces["image"] = spaces.Box(
+ low=0,
+ high=255,
+ shape=(obs_shape[0], obs_shape[1], num_bits),
+ dtype='uint8'
+ )
+
+ def observation(self, obs):
+ img = obs['image']
+ out = np.zeros(self.observation_space.spaces['image'].shape, dtype='uint8')
+
+ for i in range(img.shape[0]):
+ for j in range(img.shape[1]):
+ type = img[i, j, 0]
+ color = img[i, j, 1]
+ state = img[i, j, 2]
+
+ out[i, j, type] = 1
+ out[i, j, len(OBJECT_TO_IDX) + color] = 1
+ out[i, j, len(OBJECT_TO_IDX) + len(COLOR_TO_IDX) + state] = 1
+
+ return {
+ 'mission': obs['mission'],
+ 'image': out
+ }
+
+class RGBImgObsWrapper(gym.core.ObservationWrapper):
+ """
+    Wrapper to use a fully observable RGB image as the only observation output,
+    no language/mission. This can be used to have the agent solve the
+    gridworld in pixel space.
+ """
+
+ def __init__(self, env, tile_size=8):
+ super().__init__(env)
+
+ self.tile_size = tile_size
+
+ self.observation_space.spaces['image'] = spaces.Box(
+ low=0,
+ high=255,
+ shape=(self.env.width * tile_size, self.env.height * tile_size, 3),
+ dtype='uint8'
+ )
+
+ def observation(self, obs):
+ env = self.unwrapped
+
+ rgb_img = env.render(
+ mode='rgb_array',
+ highlight=False,
+ tile_size=self.tile_size
+ )
+
+ return {
+ 'mission': obs['mission'],
+ 'image': rgb_img
+ }
+
+
+class RGBImgPartialObsWrapper(gym.core.ObservationWrapper):
+ """
+    Wrapper to use a partially observable RGB image as the only observation output.
+    This can be used to have the agent solve the gridworld in pixel space.
+ """
+
+ def __init__(self, env, tile_size=8):
+ super().__init__(env)
+
+ self.tile_size = tile_size
+
+ obs_shape = env.observation_space.spaces['image'].shape
+ self.observation_space.spaces['image'] = spaces.Box(
+ low=0,
+ high=255,
+ shape=(obs_shape[0] * tile_size, obs_shape[1] * tile_size, 3),
+ dtype='uint8'
+ )
+
+ def observation(self, obs):
+ env = self.unwrapped
+
+ rgb_img_partial = env.get_obs_render(
+ obs['image'],
+ tile_size=self.tile_size
+ )
+
+ return {
+ 'mission': obs['mission'],
+ 'image': rgb_img_partial
+ }
+
+class FullyObsWrapper(gym.core.ObservationWrapper):
+ """
+ Fully observable gridworld using a compact grid encoding
+ """
+
+ def __init__(self, env):
+ super().__init__(env)
+
+ self.observation_space.spaces["image"] = spaces.Box(
+ low=0,
+ high=255,
+ shape=(self.env.width, self.env.height, 3), # number of cells
+ dtype='uint8'
+ )
+
+ def observation(self, obs):
+ env = self.unwrapped
+ full_grid = env.grid.encode()
+ full_grid[env.agent_pos[0]][env.agent_pos[1]] = np.array([
+ OBJECT_TO_IDX['agent'],
+ COLOR_TO_IDX['red'],
+ env.agent_dir
+ ])
+
+ return {
+ 'mission': obs['mission'],
+ 'image': full_grid
+ }
+
+class FlatObsWrapper(gym.core.ObservationWrapper):
+ """
+ Encode mission strings using a one-hot scheme,
+ and combine these with observed images into one flat array
+ """
+
+ def __init__(self, env, maxStrLen=96):
+ super().__init__(env)
+
+ self.maxStrLen = maxStrLen
+ self.numCharCodes = 27
+
+ imgSpace = env.observation_space.spaces['image']
+ imgSize = reduce(operator.mul, imgSpace.shape, 1)
+
+ self.observation_space = spaces.Box(
+ low=0,
+ high=255,
+ shape=(imgSize + self.numCharCodes * self.maxStrLen,),
+ dtype='uint8'
+ )
+
+ self.cachedStr = None
+ self.cachedArray = None
+
+ def observation(self, obs):
+ image = obs['image']
+ mission = obs['mission']
+
+ # Cache the last-encoded mission string
+ if mission != self.cachedStr:
+ assert len(mission) <= self.maxStrLen, 'mission string too long ({} chars)'.format(len(mission))
+ mission = mission.lower()
+
+ strArray = np.zeros(shape=(self.maxStrLen, self.numCharCodes), dtype='float32')
+
+ for idx, ch in enumerate(mission):
+ if ch >= 'a' and ch <= 'z':
+ chNo = ord(ch) - ord('a')
+ elif ch == ' ':
+ chNo = ord('z') - ord('a') + 1
+ assert chNo < self.numCharCodes, '%s : %d' % (ch, chNo)
+ strArray[idx, chNo] = 1
+
+ self.cachedStr = mission
+ self.cachedArray = strArray
+
+ obs = np.concatenate((image.flatten(), self.cachedArray.flatten()))
+
+ return obs
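+
+# Illustrative sizing (assuming the default 7x7x3 partial view and maxStrLen=96):
+# flat observation length = 7*7*3 + 27*96 = 147 + 2592 = 2739 entries.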
+
+class ViewSizeWrapper(gym.core.Wrapper):
+ """
+    Wrapper to customize the agent's field of view size.
+ This cannot be used with fully observable wrappers.
+ """
+
+ def __init__(self, env, agent_view_size=7):
+ super().__init__(env)
+
+ assert agent_view_size % 2 == 1
+ assert agent_view_size >= 3
+
+ # Override default view size
+ env.unwrapped.agent_view_size = agent_view_size
+
+ # Compute observation space with specified view size
+ observation_space = gym.spaces.Box(
+ low=0,
+ high=255,
+ shape=(agent_view_size, agent_view_size, 3),
+ dtype='uint8'
+ )
+
+ # Override the environment's observation space
+ self.observation_space = spaces.Dict({
+ 'image': observation_space
+ })
+
+ def reset(self, **kwargs):
+ return self.env.reset(**kwargs)
+
+ def step(self, action):
+ return self.env.step(action)
+
+from .minigrid import Goal
+class DirectionObsWrapper(gym.core.ObservationWrapper):
+ """
+    Provides the slope/angular direction to the goal along with the observations,
+    computed as (y2 - y1) / (x2 - x1).
+    type = {slope, angle}
+ """
+ def __init__(self, env,type='slope'):
+ super().__init__(env)
+ self.goal_position = None
+ self.type = type
+
+ def reset(self):
+ obs = self.env.reset()
+ if not self.goal_position:
+ self.goal_position = [x for x,y in enumerate(self.grid.grid) if isinstance(y,(Goal) ) ]
+ if len(self.goal_position) >= 1: # in case there are multiple goals , needs to be handled for other env types
+ self.goal_position = (int(self.goal_position[0]/self.height) , self.goal_position[0]%self.width)
+ return obs
+
+ def observation(self, obs):
+ slope = np.divide( self.goal_position[1] - self.agent_pos[1] , self.goal_position[0] - self.agent_pos[0])
+ obs['goal_direction'] = np.arctan( slope ) if self.type == 'angle' else slope
+ return obs
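+
+# Illustrative composition sketch (wrappers chain like any gym wrapper; names as defined above):
+#   env = gym.make('MiniGrid-DoorKey-5x5-v0')
+#   env = RGBImgPartialObsWrapper(env)  # pixel rendering of the partial view
+#   env = ImgObsWrapper(env)            # keep only the 'image' entry
+#   obs = env.reset()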
diff --git a/gym-minigrid/manual_control.py b/gym-minigrid/manual_control.py
new file mode 100755
index 0000000000000000000000000000000000000000..b0745707fc12872c52a96f82eaf1ab1f204f9a40
--- /dev/null
+++ b/gym-minigrid/manual_control.py
@@ -0,0 +1,115 @@
+#!/usr/bin/env python3
+raise DeprecationWarning("Use the one in ./scripts")
+
+import time
+import argparse
+import numpy as np
+import gym
+import gym_minigrid
+from gym_minigrid.wrappers import *
+from gym_minigrid.window import Window
+
+def redraw(img):
+ if not args.agent_view:
+ img = env.render('rgb_array', tile_size=args.tile_size)
+
+ window.show_img(img)
+
+def reset():
+ if args.seed != -1:
+ env.seed(args.seed)
+
+ obs = env.reset()
+
+ if hasattr(env, 'mission'):
+ print('Mission: %s' % env.mission)
+ window.set_caption(env.mission)
+
+ redraw(obs)
+
+def step(action):
+ obs, reward, done, info = env.step(action)
+ print('step=%s, reward=%.2f' % (env.step_count, reward))
+
+ if done:
+ print('done!')
+ reset()
+ else:
+ redraw(obs)
+
+def key_handler(event):
+ print('pressed', event.key)
+
+ if event.key == 'escape':
+ window.close()
+ return
+
+ if event.key == 'backspace':
+ reset()
+ return
+
+ if event.key == 'left':
+ step(env.actions.left)
+ return
+ if event.key == 'right':
+ step(env.actions.right)
+ return
+ if event.key == 'up':
+ step(env.actions.forward)
+ return
+
+ # Spacebar
+ if event.key == ' ':
+ step(env.actions.toggle)
+ return
+ if event.key == 'pageup':
+ step(env.actions.pickup)
+ return
+ if event.key == 'pagedown':
+ step(env.actions.drop)
+ return
+
+ if event.key == 'enter':
+ step(env.actions.done)
+ return
+
+parser = argparse.ArgumentParser()
+parser.add_argument(
+ "--env",
+ help="gym environment to load",
+ default='MiniGrid-MultiRoom-N6-v0'
+)
+parser.add_argument(
+ "--seed",
+ type=int,
+ help="random seed to generate the environment with",
+ default=-1
+)
+parser.add_argument(
+ "--tile_size",
+ type=int,
+ help="size at which to render tiles",
+ default=32
+)
+parser.add_argument(
+ '--agent_view',
+ default=False,
+ help="draw the agent sees (partially observable view)",
+ action='store_true'
+)
+
+args = parser.parse_args()
+
+env = gym.make(args.env)
+
+if args.agent_view:
+ env = RGBImgPartialObsWrapper(env)
+ env = ImgObsWrapper(env)
+
+window = Window('gym_minigrid - ' + args.env)
+window.reg_key_handler(key_handler)
+
+reset()
+
+# Blocking event loop
+window.show(block=True)
diff --git a/gym-minigrid/run_tests.py b/gym-minigrid/run_tests.py
new file mode 100755
index 0000000000000000000000000000000000000000..434fb7b15fddafc4ed9f523c877d53e047a62e7d
--- /dev/null
+++ b/gym-minigrid/run_tests.py
@@ -0,0 +1,156 @@
+#!/usr/bin/env python3
+
+import random
+import numpy as np
+import gym
+from gym_minigrid.register import env_list
+from gym_minigrid.minigrid import Grid, OBJECT_TO_IDX
+
+# Test specifically importing a specific environment
+from gym_minigrid.envs import DoorKeyEnv
+
+# Test importing wrappers
+from gym_minigrid.wrappers import *
+
+##############################################################################
+
+print('%d environments registered' % len(env_list))
+
+for env_idx, env_name in enumerate(env_list):
+ print('testing {} ({}/{})'.format(env_name, env_idx+1, len(env_list)))
+
+ # Load the gym environment
+ env = gym.make(env_name)
+ env.max_steps = min(env.max_steps, 200)
+ env.reset()
+ env.render('rgb_array')
+
+ # Verify that the same seed always produces the same environment
+ for i in range(0, 5):
+ seed = 1337 + i
+ env.seed(seed)
+ grid1 = env.grid
+ env.seed(seed)
+ grid2 = env.grid
+ assert grid1 == grid2
+
+ env.reset()
+
+ # Run for a few episodes
+ num_episodes = 0
+ while num_episodes < 5:
+ # Pick a random action
+ action = random.randint(0, env.action_space.n - 1)
+
+ obs, reward, done, info = env.step(action)
+
+ # Validate the agent position
+ assert env.agent_pos[0] < env.width
+ assert env.agent_pos[1] < env.height
+
+ # Test observation encode/decode roundtrip
+ img = obs['image']
+ grid, vis_mask = Grid.decode(img)
+ img2 = grid.encode(vis_mask=vis_mask)
+ assert np.array_equal(img, img2)
+
+ # Test the env to string function
+ str(env)
+
+ # Check that the reward is within the specified range
+ assert reward >= env.reward_range[0], reward
+ assert reward <= env.reward_range[1], reward
+
+ if done:
+ num_episodes += 1
+ env.reset()
+
+ env.render('rgb_array')
+
+ # Test the close method
+ env.close()
+
+ env = gym.make(env_name)
+ env = ReseedWrapper(env)
+ for _ in range(10):
+ env.reset()
+ env.step(0)
+ env.close()
+
+ env = gym.make(env_name)
+ env = ImgObsWrapper(env)
+ env.reset()
+ env.step(0)
+ env.close()
+
+ # Test the fully observable wrapper
+ env = gym.make(env_name)
+ env = FullyObsWrapper(env)
+ env.reset()
+ obs, _, _, _ = env.step(0)
+ assert obs['image'].shape == env.observation_space.spaces['image'].shape
+ env.close()
+
+ # RGB image observation wrapper
+ env = gym.make(env_name)
+ env = RGBImgPartialObsWrapper(env)
+ env.reset()
+ obs, _, _, _ = env.step(0)
+ assert obs['image'].mean() > 0
+ env.close()
+
+ env = gym.make(env_name)
+ env = FlatObsWrapper(env)
+ env.reset()
+ env.step(0)
+ env.close()
+
+ env = gym.make(env_name)
+ env = ViewSizeWrapper(env, 5)
+ env.reset()
+ env.step(0)
+ env.close()
+
+ # Test the wrappers return proper observation spaces.
+ wrappers = [
+ RGBImgObsWrapper,
+ RGBImgPartialObsWrapper,
+ OneHotPartialObsWrapper
+ ]
+ for wrapper in wrappers:
+ env = wrapper(gym.make(env_name))
+ obs_space, wrapper_name = env.observation_space, wrapper.__name__
+ assert isinstance(
+ obs_space, spaces.Dict
+ ), "Observation space for {0} is not a Dict: {1}.".format(
+ wrapper_name, obs_space
+ )
+ # This should not fail either
+ ImgObsWrapper(env)
+ env.reset()
+ env.step(0)
+ env.close()
+
+##############################################################################
+
+print('testing agent_sees method')
+env = gym.make('MiniGrid-DoorKey-6x6-v0')
+goal_pos = (env.grid.width - 2, env.grid.height - 2)
+
+# Test the "in" operator on grid objects
+assert ('green', 'goal') in env.grid
+assert ('blue', 'key') not in env.grid
+
+# Test the env.agent_sees() function
+env.reset()
+for i in range(0, 500):
+ action = random.randint(0, env.action_space.n - 1)
+ obs, reward, done, info = env.step(action)
+
+ grid, _ = Grid.decode(obs['image'])
+ goal_visible = ('green', 'goal') in grid
+
+ agent_sees_goal = env.agent_sees(*goal_pos)
+ assert agent_sees_goal == goal_visible
+ if done:
+ env.reset()
diff --git a/gym-minigrid/setup.py b/gym-minigrid/setup.py
new file mode 100644
index 0000000000000000000000000000000000000000..ad120f7c645b5ae43616c76455cbbc16bd17fc0e
--- /dev/null
+++ b/gym-minigrid/setup.py
@@ -0,0 +1,15 @@
+from setuptools import setup
+
+setup(
+ name='gym_minigrid',
+ version='1.0.1',
+ keywords='memory, environment, agent, rl, openaigym, openai-gym, gym',
+ url='https://github.com/maximecb/gym-minigrid',
+ description='Minimalistic gridworld package for OpenAI Gym',
+ packages=['gym_minigrid', 'gym_minigrid.envs'],
+ install_requires=[
+ # 'gym>=0.9.6',
+ 'gym==0.18.0',
+ 'numpy==1.17.0'
+ ]
+)
diff --git a/hp_tuning_agent.txt b/hp_tuning_agent.txt
new file mode 100644
index 0000000000000000000000000000000000000000..a611f95005b1aed6d9c75ec525f69a2738698ff4
--- /dev/null
+++ b/hp_tuning_agent.txt
@@ -0,0 +1,433 @@
+## PPO
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-05 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.01 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0001 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.0004 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.01 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1000 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 40 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 20 --*env MiniGrid-TalkItOutPolite-8x8-v0
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-CoinThief-8x8-v0 --env_args few_actions True
+--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 --save-interval 100 --log-interval 100 --dialogue --multi-modal-babyai11-agent --exploration-bonus --exploration-bonus-type rnd --clipped-rewards --arch original_endpool_res --*optim-eps 1e-07 --*entropy-coef 0.0005 --*intrinsic-reward-learning-rate 0.001 --*intrinsic-reward-epsilon 0.0001 --*intrinsic-reward-max-grad-norm 1 --*env MiniGrid-TalkItOutPolite-8x8-v0
diff --git a/llm_data/in_context_asocial_box.txt b/llm_data/in_context_asocial_box.txt
new file mode 100644
index 0000000000000000000000000000000000000000..1490203145f31442575597c1a66b2198d8f25e9e
--- /dev/null
+++ b/llm_data/in_context_asocial_box.txt
@@ -0,0 +1,89 @@
+New episode.
+Obs : 3 steps in front of you there is a closed green lockablebox
+Act : move forward
+Obs : 2 steps in front of you there is a closed green lockablebox
+Act : move forward
+Obs : Right in front of you there is a closed green lockablebox
+Act : toggle
+Obs : Right in front of you there is a red apple
+Act : toggle
+Success!
+New episode.
+Obs : 2 steps to the right there is a closed green lockablebox
+Act : turn right
+Obs : 2 steps in front of you there is a closed green lockablebox
+Act : move forward
+Obs : Right in front of you there is a closed green lockablebox
+Act : toggle
+Obs : Right in front of you there is a red apple
+Act : toggle
+Success!
+New episode.
+Obs : 2 steps to the left there is a closed green lockablebox
+Act : turn left
+Obs : 2 steps in front of you there is a closed green lockablebox
+Act : move forward
+Obs : Right in front of you there is a closed green lockablebox
+Act : toggle
+Obs : Right in front of you there is a red apple
+Act : toggle
+Success!
+New episode.
+Obs : 2 steps in front of you and 2 steps to the right there is a closed blue lockablebox
+Act : move forward
+Obs : 1 steps in front of you and 2 steps to the right there is a closed blue lockablebox
+Act : move forward
+Obs : 2 steps to the right there is a closed blue lockablebox
+Act : turn right
+Obs : 2 steps in front of you there is a closed blue lockablebox
+Act : move forward
+Obs : Right in front of you there is a closed blue lockablebox
+Act : toggle
+Obs : Right in front of you there is a red apple
+Act : toggle
+Success!
+New episode.
+Obs : 2 steps in front of you and 1 steps to the left there is a closed green lockablebox
+Act : move forward
+Obs : 1 steps in front of you and 1 steps to the left there is a closed green lockablebox
+Act : move forward
+Obs : Just to the left of you there is a closed green lockablebox
+Act : turn left
+Obs : Right in front of you there is a closed green lockablebox
+Act : toggle
+Obs : Right in front of you there is a red apple
+Act : toggle
+Success!
+New episode.
+Obs :
+Act : turn right
+Obs : 1 steps in front of you and 1 steps to the right there is a closed blue lockablebox
+Act : move forward
+Obs : Just to the right of you there is a closed blue lockablebox
+Act : turn right
+Obs : Right in front of you there is a closed blue lockablebox
+Act : toggle
+Obs : Right in front of you there is a red apple
+Act : toggle
+Success!
+New episode.
+turn right
+Obs :
+Act : turn right
+Obs :
+Act : turn right
+Obs :
+Act : move forward
+Obs :
+Act : turn right
+Obs : 4 steps in front of you there is a closed green lockablebox
+Act : move forward
+Obs : 3 steps in front of you there is a closed green lockablebox
+Act : move forward
+Obs : 2 steps in front of you there is a closed green lockablebox
+Act : move forward
+Obs : Right in front of you there is a closed green lockablebox
+Act : toggle
+Obs : Right in front of you there is a red apple
+Act : toggle
+Success!
diff --git a/llm_data/in_context_color_boxes.txt b/llm_data/in_context_color_boxes.txt
new file mode 100644
index 0000000000000000000000000000000000000000..86585d7c3388894fce5e8060900382aebe47790a
--- /dev/null
+++ b/llm_data/in_context_color_boxes.txt
@@ -0,0 +1,157 @@
+New episode.
+Obs : 3 steps in front of you there is a closed green lockablebox
+Caretaker: green
+Act : move forward
+Obs : 2 steps in front of you there is a closed green lockablebox
+Caretaker: green
+Act : move forward
+Obs : Right in front of you there is a closed green lockablebox
+Caretaker: green
+Act : toggle
+Obs : Right in front of you there is a red apple
+Caretaker: green
+Act : toggle
+Success!
+New episode.
+Obs : 2 steps to the right there is a closed green lockablebox
+Caretaker: green
+Act : turn right
+Obs : 2 steps in front of you there is a closed green lockablebox
+Caretaker: green
+Act : move forward
+Obs : Right in front of you there is a closed green lockablebox
+Caretaker: green
+Act : toggle
+Obs : Right in front of you there is a red apple
+Caretaker: green
+Act : toggle
+Success!
+New episode.
+Obs : 2 steps to the left there is a closed green lockablebox
+Caretaker: green
+Act : turn left
+Obs : 2 steps in front of you there is a closed green lockablebox
+Caretaker: green
+Act : move forward
+Obs : Right in front of you there is a closed green lockablebox
+Caretaker: green
+Act : toggle
+Obs : Right in front of you there is a red apple
+Caretaker: green
+Act : toggle
+Success!
+New episode.
+Obs : 2 steps in front of you and 2 steps to the right there is a closed blue lockablebox
+Caretaker: blue
+Act : move forward
+Obs : 1 steps in front of you and 2 steps to the right there is a closed blue lockablebox
+Caretaker: blue
+Act : move forward
+Obs : 2 steps to the right there is a closed blue lockablebox
+Caretaker: blue
+Act : turn right
+Obs : 2 steps in front of you there is a closed blue lockablebox
+Caretaker: blue
+Act : move forward
+Obs : Right in front of you there is a closed blue lockablebox
+Caretaker: blue
+Act : toggle
+Obs : Right in front of you there is a red apple
+Caretaker: blue
+Act : toggle
+Success!
+New episode.
+Obs : 2 steps in front of you and 1 steps to the left there is a closed green lockablebox
+Caretaker: green
+Act : move forward
+Obs : 1 steps in front of you and 1 steps to the left there is a closed green lockablebox
+Caretaker: green
+Act : move forward
+Obs : Just to the left of you there is a closed green lockablebox
+Caretaker: green
+Act : turn left
+Obs : Right in front of you there is a closed green lockablebox
+Caretaker: green
+Act : toggle
+Obs : Right in front of you there is a red apple
+Caretaker: green
+Act : toggle
+Success!
+New episode.
+Obs :
+Caretaker: blue
+Act : turn right
+Obs : 1 steps in front of you and 1 steps to the right there is a closed blue lockablebox
+Caretaker: blue
+Act : move forward
+Obs : Just to the right of you there is a closed blue lockablebox
+Caretaker: blue
+Act : turn right
+Obs : Right in front of you there is a closed blue lockablebox
+Caretaker: blue
+Act : toggle
+Obs : Right in front of you there is a red apple
+Caretaker: blue
+Act : toggle
+Success!
+# New episode.
+# turn right
+# Obs :
+# Caretaker: green
+# Act : turn right
+# Obs :
+# Caretaker: green
+# Act : turn right
+# Obs :
+# Caretaker: green
+# Act : move forward
+# Obs :
+# Caretaker: green
+# Act : move forward
+# Obs : 3 steps in front of you there is a closed green lockablebox
+# Caretaker: green
+# Act : move forward
+# Obs : 2 steps in front of you there is a closed green lockablebox
+# Caretaker: green
+# Act : move forward
+# Obs : Right in front of you there is a closed green lockablebox
+# Caretaker: green
+# Act : toggle
+# Obs : Right in front of you there is a red apple
+# Caretaker: green
+# Act : toggle
+# Success!
+# Obs : Right in front of you there is a closed brown lockablebox
+# Act : move forward
+New episode.
+Obs : 3 steps in front of you and 3 steps to the left there is a friendly yellow peer. It is looking towards you.
+1 steps to the right there is a closed brown lockablebox
+Caretaker: brown
+Act : turn right
+Obs : Right in front of you there is a closed brown lockablebox
+Caretaker: brown
+Act : toggle
+Obs : Right in front of you there is a red apple
+Caretaker: green
+Act : toggle
+Success!
+New episode.
+Obs : 1 steps to the right there is a closed blue lockablebox
+Right in front of you there is a closed brown lockablebox
+Caretaker: brown
+Act : toggle
+Obs : 1 steps to the right there is a closed blue lockablebox
+Right in front of you there is a red apple
+Caretaker: green
+Act : toggle
+Success!
+New episode.
+Obs : 1 steps to the right there is a closed blue lockablebox
+Right in front of you there is a closed brown lockablebox
+Caretaker: brown
+Act : toggle
+Obs : 1 steps to the right there is a closed blue lockablebox
+Right in front of you there is a red apple
+Caretaker: green
+Act : toggle
+Success!
diff --git a/llm_data/in_context_examples/in_context_SocialAI-ColorLLMCSParamEnv-v1_2023_09_18_17_24_24/episodes.pkl b/llm_data/in_context_examples/in_context_SocialAI-ColorLLMCSParamEnv-v1_2023_09_18_17_24_24/episodes.pkl
new file mode 100644
index 0000000000000000000000000000000000000000..132bf14e86efc4d4c7b68f4b9119c020adec3164
Binary files /dev/null and b/llm_data/in_context_examples/in_context_SocialAI-ColorLLMCSParamEnv-v1_2023_09_18_17_24_24/episodes.pkl differ
diff --git a/llm_data/in_context_examples/in_context_SocialAI-ELangColorBoxesTestInformationSeekingParamEnv-v1_2023_08_01_16_15_09/episodes.pkl b/llm_data/in_context_examples/in_context_SocialAI-ELangColorBoxesTestInformationSeekingParamEnv-v1_2023_08_01_16_15_09/episodes.pkl
new file mode 100644
index 0000000000000000000000000000000000000000..482c0e338367d0aeb4ddc43602ac727276f77052
Binary files /dev/null and b/llm_data/in_context_examples/in_context_SocialAI-ELangColorBoxesTestInformationSeekingParamEnv-v1_2023_08_01_16_15_09/episodes.pkl differ
diff --git a/llm_data/in_context_examples/in_context_asocialbox_SocialAI-AsocialBoxInformationSeekingParamEnv-v1_2023_07_19_19_28_48/episodes.pkl b/llm_data/in_context_examples/in_context_asocialbox_SocialAI-AsocialBoxInformationSeekingParamEnv-v1_2023_07_19_19_28_48/episodes.pkl
new file mode 100644
index 0000000000000000000000000000000000000000..5e7a71049572962b8650be024006bbfbb188ef6c
Binary files /dev/null and b/llm_data/in_context_examples/in_context_asocialbox_SocialAI-AsocialBoxInformationSeekingParamEnv-v1_2023_07_19_19_28_48/episodes.pkl differ
diff --git a/llm_data/in_context_examples/in_context_colorbox_SocialAI-ColorBoxesLLMCSParamEnv-v1_2023_07_20_13_11_54/episodes.pkl b/llm_data/in_context_examples/in_context_colorbox_SocialAI-ColorBoxesLLMCSParamEnv-v1_2023_07_20_13_11_54/episodes.pkl
new file mode 100644
index 0000000000000000000000000000000000000000..3893e461acd63de71ed50d84a800651214dc67d5
Binary files /dev/null and b/llm_data/in_context_examples/in_context_colorbox_SocialAI-ColorBoxesLLMCSParamEnv-v1_2023_07_20_13_11_54/episodes.pkl differ
diff --git a/models/__init__.py b/models/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..5a5f086fe4cdfbd9fe144023291f6d32f7b0e58c
--- /dev/null
+++ b/models/__init__.py
@@ -0,0 +1,10 @@
+from .ac import *
+from .refac import *
+from .blindtalkmultiheadedac import *
+from .multiheadedac import *
+from .randtalkmultiheadedac import *
+from .mm_memory_multiheadedac import *
+from .dialogue_memory_multiheadedac import *
+from .babyai11 import *
+from .multiheadedbabyai11 import *
+from .multimodalbabyai11 import *
diff --git a/models/ac.py b/models/ac.py
new file mode 100644
index 0000000000000000000000000000000000000000..8a344a6f6cfb287beca410ae405bad6f19646b5c
--- /dev/null
+++ b/models/ac.py
@@ -0,0 +1,146 @@
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.distributions.categorical import Categorical
+import torch_ac
+from utils.other import init_params
+
+class ACModel(nn.Module, torch_ac.RecurrentACModel):
+ def __init__(self, obs_space, action_space, use_memory=False, use_text=False, use_dialogue=False, input_size=3):
+ super().__init__()
+
+ # store config
+ self.config = locals()
+
+ # Decide which components are enabled
+ self.use_text = use_text
+ self.use_memory = use_memory
+ self.env_action_space = action_space
+ self.model_raw_action_space = action_space
+ self.input_size = input_size
+
+ if use_dialogue:
+ raise NotImplementedError("This model does not support dialogue inputs yet")
+
+ # Define image embedding
+ self.image_conv = nn.Sequential(
+ nn.Conv2d(self.input_size, 16, (2, 2)),
+ nn.ReLU(),
+ nn.MaxPool2d((2, 2)),
+ nn.Conv2d(16, 32, (2, 2)),
+ nn.ReLU(),
+ nn.Conv2d(32, 64, (2, 2)),
+ nn.ReLU()
+ )
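+        # For an n x m input, the conv stack above maps the spatial dims as:
+        # 2x2 conv -> n-1, 2x2 max-pool -> (n-1)//2, two more 2x2 convs -> (n-1)//2 - 2,
+        # with 64 output channels, which gives the flattened embedding size below.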
+ n = obs_space["image"][0]
+ m = obs_space["image"][1]
+ self.image_embedding_size = ((n-1)//2-2)*((m-1)//2-2)*64
+
+ # Define memory
+ if self.use_memory:
+ self.memory_rnn = nn.LSTMCell(self.image_embedding_size, self.semi_memory_size)
+
+ # Define text embedding
+ if self.use_text:
+ self.word_embedding_size = 32
+ self.word_embedding = nn.Embedding(obs_space["text"], self.word_embedding_size)
+ self.text_embedding_size = 128
+ self.text_rnn = nn.GRU(self.word_embedding_size, self.text_embedding_size, batch_first=True)
+
+ # Resize image embedding
+ self.embedding_size = self.semi_memory_size
+ if self.use_text:
+ self.embedding_size += self.text_embedding_size
+
+ # Define actor's model
+ self.actor = nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, action_space.nvec[0])
+ )
+
+ # Define critic's model
+ self.critic = nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, 1)
+ )
+
+ # Initialize parameters correctly
+ self.apply(init_params)
+
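+    # The recurrent memory is the concatenation of the LSTM hidden and cell states,
+    # so memory_size is twice the per-state (semi_memory) size.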
+ @property
+ def memory_size(self):
+ return 2*self.semi_memory_size
+
+ @property
+ def semi_memory_size(self):
+ return self.image_embedding_size
+
+ def forward(self, obs, memory, return_embeddings=False):
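+        # obs.image arrives as N x H x W x C; the two transposes yield N x C x H x W for the conv stack.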
+ x = obs.image.transpose(1, 3).transpose(2, 3)
+ x = self.image_conv(x)
+ x = x.reshape(x.shape[0], -1)
+
+ if self.use_memory:
+ hidden = (memory[:, :self.semi_memory_size], memory[:, self.semi_memory_size:])
+ hidden = self.memory_rnn(x, hidden)
+ embedding = hidden[0]
+ memory = torch.cat(hidden, dim=1)
+ else:
+ embedding = x
+
+ if self.use_text:
+ embed_text = self._get_embed_text(obs.text)
+ embedding = torch.cat((embedding, embed_text), dim=1)
+
+ x = self.actor(embedding)
+ dist = Categorical(logits=F.log_softmax(x, dim=1))
+
+ x = self.critic(embedding)
+ value = x.squeeze(1)
+
+ if return_embeddings:
+ return [dist], value, memory, None
+ else:
+ return [dist], value, memory
+
+ # def sample_action(self, dist):
+ # return dist.sample()
+ #
+ # def calculate_log_probs(self, dist, action):
+ # return dist.log_prob(action)
+
+ def calculate_action_gradient_masks(self, action):
+ """Always train"""
+ mask = torch.ones_like(action).detach()
+ assert action.shape == mask.shape
+
+ return mask
+
+ def sample_action(self, dist):
+ return torch.stack([d.sample() for d in dist], dim=1)
+
+ def calculate_log_probs(self, dist, action):
+ return torch.stack([d.log_prob(action[:, i]) for i, d in enumerate(dist)], dim=1)
+
+ def calculate_action_masks(self, action):
+ mask = torch.ones_like(action)
+ assert action.shape == mask.shape
+
+ return mask
+
+ def construct_final_action(self, action):
+ return action
+
+ def _get_embed_text(self, text):
+ _, hidden = self.text_rnn(self.word_embedding(text))
+ return hidden[-1]
+
+ def get_config_dict(self):
+ del self.config['__class__']
+ self.config['self'] = str(self.config['self'])
+ self.config['action_space'] = self.config['action_space'].nvec.tolist()
+ return self.config
+
diff --git a/models/babyai11.py b/models/babyai11.py
new file mode 100644
index 0000000000000000000000000000000000000000..1d5df7cebc0491bb9a23eef54a9ce3906104df2e
--- /dev/null
+++ b/models/babyai11.py
@@ -0,0 +1,354 @@
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.autograd import Variable
+from torch.distributions.categorical import Categorical
+from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
+from utils.babyai_utils.supervised_losses import required_heads
+import torch_ac
+
+
+
+# From https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/blob/master/model.py
+def initialize_parameters(m):
+ classname = m.__class__.__name__
+ if classname.find('Linear') != -1:
+ m.weight.data.normal_(0, 1)
+ m.weight.data *= 1 / torch.sqrt(m.weight.data.pow(2).sum(1, keepdim=True))
+ if m.bias is not None:
+ m.bias.data.fill_(0)
+
+
+# Inspired by FiLMedBlock from https://arxiv.org/abs/1709.07871
+class FiLM(nn.Module):
+ def __init__(self, in_features, out_features, in_channels, imm_channels):
+ super().__init__()
+ self.conv1 = nn.Conv2d(
+ in_channels=in_channels, out_channels=imm_channels,
+ kernel_size=(3, 3), padding=1)
+ self.bn1 = nn.BatchNorm2d(imm_channels)
+ self.conv2 = nn.Conv2d(
+ in_channels=imm_channels, out_channels=out_features,
+ kernel_size=(3, 3), padding=1)
+ self.bn2 = nn.BatchNorm2d(out_features)
+
+ self.weight = nn.Linear(in_features, out_features)
+ self.bias = nn.Linear(in_features, out_features)
+
+ self.apply(initialize_parameters)
+
+ def forward(self, x, y):
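+        # y is the conditioning (instruction) embedding; it is projected into a per-channel
+        # scale (weight) and shift (bias) that modulate the convolutional features (FiLM).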
+ x = F.relu(self.bn1(self.conv1(x)))
+ x = self.conv2(x)
+ weight = self.weight(y).unsqueeze(2).unsqueeze(3)
+ bias = self.bias(y).unsqueeze(2).unsqueeze(3)
+ out = x * weight + bias
+ return F.relu(self.bn2(out))
+
+
+class ImageBOWEmbedding(nn.Module):
+ def __init__(self, space, embedding_dim):
+ super().__init__()
+ self.max_value = max(space)
+ self.space = space
+ self.embedding_dim = embedding_dim
+ self.embedding = nn.Embedding(len(self.space) * self.max_value, embedding_dim)
+ self.apply(initialize_parameters)
+
+ def forward(self, inputs):
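+        # Shift each input channel's discrete values into its own slice of the embedding table,
+        # embed them, and sum over channels (a bag-of-words over the cell's attributes),
+        # returning an N x D x H x W feature map.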
+ offsets = torch.Tensor([x * self.max_value for x in range(self.space[-1])]).to(inputs.device)
+ inputs = (inputs + offsets[None, :, None, None]).long()
+ return self.embedding(inputs).sum(1).permute(0, 3, 1, 2)
+
+# Note: what BabyAI calls "instr" is what we call "text".
+
+#class ACModel(nn.Module, babyai.rl.RecurrentACModel):
+class Baby11ACModel(nn.Module, torch_ac.RecurrentACModel):
+ def __init__(self, obs_space, action_space,
+ image_dim=128, memory_dim=128, instr_dim=128,
+ use_instr=False, lang_model="gru", use_memory=False,
+ arch="bow_endpool_res", aux_info=None):
+ super().__init__()
+
+ # store config
+ self.config = locals()
+
+ endpool = 'endpool' in arch
+ use_bow = 'bow' in arch
+ pixel = 'pixel' in arch
+ self.res = 'res' in arch
+
+ # Decide which components are enabled
+ self.use_instr = use_instr
+ self.use_memory = use_memory
+ self.arch = arch
+ self.lang_model = lang_model
+ self.aux_info = aux_info
+ self.env_action_space = action_space
+ self.model_raw_action_space = action_space
+ if self.res and image_dim != 128:
+ raise ValueError(f"image_dim is {image_dim}, expected 128")
+ self.image_dim = image_dim
+ self.memory_dim = memory_dim
+ self.instr_dim = instr_dim
+
+ self.obs_space = obs_space
+ # transform given 3d obs_space into what babyai11 baseline uses, i.e. 1d embedding size
+ n = obs_space["image"][0]
+ m = obs_space["image"][1]
+ nb_img_channels = self.obs_space['image'][2]
+ self.obs_space = ((n-1)//2-2)*((m-1)//2-2)*64
+
+ for part in self.arch.split('_'):
+ if part not in ['original', 'bow', 'pixels', 'endpool', 'res']:
+ raise ValueError("Incorrect architecture name: {}".format(self.arch))
+
+ # if not self.use_instr:
+ # raise ValueError("FiLM architecture can be used when instructions are enabled")
+ self.image_conv = nn.Sequential(*[
+ *([ImageBOWEmbedding(obs_space['image'], 128)] if use_bow else []),
+ *([nn.Conv2d(
+ in_channels=nb_img_channels, out_channels=128, kernel_size=(8, 8),
+ stride=8, padding=0)] if pixel else []),
+ nn.Conv2d(
+ in_channels=128 if use_bow or pixel else nb_img_channels, out_channels=128,
+ kernel_size=(3, 3) if endpool else (2, 2), stride=1, padding=1),
+ nn.BatchNorm2d(128),
+ nn.ReLU(),
+ *([] if endpool else [nn.MaxPool2d(kernel_size=(2, 2), stride=2)]),
+ nn.Conv2d(in_channels=128, out_channels=128, kernel_size=(3, 3), padding=1),
+ nn.BatchNorm2d(128),
+ nn.ReLU(),
+ *([] if endpool else [nn.MaxPool2d(kernel_size=(2, 2), stride=2)])
+ ])
+ self.film_pool = nn.MaxPool2d(kernel_size=(7, 7) if endpool else (2, 2), stride=2)
+
+ # Define instruction embedding
+ if self.use_instr:
+ if self.lang_model in ['gru', 'bigru', 'attgru']:
+ #self.word_embedding = nn.Embedding(obs_space["instr"], self.instr_dim)
+ self.word_embedding = nn.Embedding(obs_space["text"], self.instr_dim)
+ if self.lang_model in ['gru', 'bigru', 'attgru']:
+ gru_dim = self.instr_dim
+ if self.lang_model in ['bigru', 'attgru']:
+ gru_dim //= 2
+ self.instr_rnn = nn.GRU(
+ self.instr_dim, gru_dim, batch_first=True,
+ bidirectional=(self.lang_model in ['bigru', 'attgru']))
+ self.final_instr_dim = self.instr_dim
+ else:
+ kernel_dim = 64
+ kernel_sizes = [3, 4]
+ self.instr_convs = nn.ModuleList([
+ nn.Conv2d(1, kernel_dim, (K, self.instr_dim)) for K in kernel_sizes])
+ self.final_instr_dim = kernel_dim * len(kernel_sizes)
+
+ if self.lang_model == 'attgru':
+ self.memory2key = nn.Linear(self.memory_size, self.final_instr_dim)
+
+ num_module = 2
+ self.controllers = []
+ for ni in range(num_module):
+ mod = FiLM(
+ in_features=self.final_instr_dim,
+ out_features=128 if ni < num_module-1 else self.image_dim,
+ in_channels=128, imm_channels=128)
+ self.controllers.append(mod)
+ self.add_module('FiLM_' + str(ni), mod)
+
+ # Define memory and resize image embedding
+ self.embedding_size = self.image_dim
+ if self.use_memory:
+ self.memory_rnn = nn.LSTMCell(self.image_dim, self.memory_dim)
+ self.embedding_size = self.semi_memory_size
+
+ # Define actor's model
+ self.actor = nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, action_space.nvec[0])
+ )
+
+ # Define critic's model
+ self.critic = nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, 1)
+ )
+
+ # Initialize parameters correctly
+ self.apply(initialize_parameters)
+
+ # Define head for extra info
+ if self.aux_info:
+ self.extra_heads = None
+ self.add_heads()
+
+ def add_heads(self):
+ '''
+        When using auxiliary tasks, the environment yields at each step some binary, continuous, or multiclass
+        information that the agent needs to predict. This function adds extra heads to the model that output
+        the predictions, one head per piece of extra information (the head type depends on the type of that
+        information).
+ '''
+ self.extra_heads = nn.ModuleDict()
+ for info in self.aux_info:
+ if required_heads[info] == 'binary':
+ self.extra_heads[info] = nn.Linear(self.embedding_size, 1)
+ elif required_heads[info].startswith('multiclass'):
+ n_classes = int(required_heads[info].split('multiclass')[-1])
+ self.extra_heads[info] = nn.Linear(self.embedding_size, n_classes)
+ elif required_heads[info].startswith('continuous'):
+ if required_heads[info].endswith('01'):
+ self.extra_heads[info] = nn.Sequential(nn.Linear(self.embedding_size, 1), nn.Sigmoid())
+ else:
+                    raise ValueError('Only continuous01 is implemented')
+ else:
+ raise ValueError('Type not supported')
+ # initializing these parameters independently is done in order to have consistency of results when using
+ # supervised-loss-coef = 0 and when not using any extra binary information
+ self.extra_heads[info].apply(initialize_parameters)
+
+ def add_extra_heads_if_necessary(self, aux_info):
+ '''
+        This function allows taking a pre-trained model that was trained without aux_info, adding aux_info
+        heads to it, and still being able to finetune it.
+ '''
+ try:
+ if not hasattr(self, 'aux_info') or not set(self.aux_info) == set(aux_info):
+ self.aux_info = aux_info
+ self.add_heads()
+ except Exception:
+ raise ValueError('Could not add extra heads')
+
+ @property
+ def memory_size(self):
+ return 2 * self.semi_memory_size
+
+ @property
+ def semi_memory_size(self):
+ return self.memory_dim
+
+ def forward(self, obs, memory, instr_embedding=None):
+ if self.use_instr and instr_embedding is None:
+ #instr_embedding = self._get_instr_embedding(obs.instr)
+ instr_embedding = self._get_instr_embedding(obs.text)
+ if self.use_instr and self.lang_model == "attgru":
+ # outputs: B x L x D
+ # memory: B x M
+ #mask = (obs.instr != 0).float()
+ mask = (obs.text != 0).float()
+ # The mask tensor has the same length as obs.instr, and
+ # thus can be both shorter and longer than instr_embedding.
+ # It can be longer if instr_embedding is computed
+ # for a subbatch of obs.instr.
+ # It can be shorter if obs.instr is a subbatch of
+ # the batch that instr_embeddings was computed for.
+ # Here, we make sure that mask and instr_embeddings
+ # have equal length along dimension 1.
+ mask = mask[:, :instr_embedding.shape[1]]
+ instr_embedding = instr_embedding[:, :mask.shape[1]]
+
+ keys = self.memory2key(memory)
+ pre_softmax = (keys[:, None, :] * instr_embedding).sum(2) + 1000 * mask
+ attention = F.softmax(pre_softmax, dim=1)
+ instr_embedding = (instr_embedding * attention[:, :, None]).sum(1)
+
+ x = torch.transpose(torch.transpose(obs.image, 1, 3), 2, 3)
+
+ if 'pixel' in self.arch:
+ x /= 256.0
+ x = self.image_conv(x)
+ if self.use_instr:
+ for controller in self.controllers:
+ out = controller(x, instr_embedding)
+ if self.res:
+ out += x
+ x = out
+ x = F.relu(self.film_pool(x))
+ x = x.reshape(x.shape[0], -1)
+
+ if self.use_memory:
+ hidden = (memory[:, :self.semi_memory_size], memory[:, self.semi_memory_size:])
+ hidden = self.memory_rnn(x, hidden)
+ embedding = hidden[0]
+ memory = torch.cat(hidden, dim=1)
+ else:
+ embedding = x
+
+ if hasattr(self, 'aux_info') and self.aux_info:
+ extra_predictions = {info: self.extra_heads[info](embedding) for info in self.extra_heads}
+ else:
+ extra_predictions = dict()
+
+ x = self.actor(embedding)
+ dist = Categorical(logits=F.log_softmax(x, dim=1))
+
+ x = self.critic(embedding)
+ value = x.squeeze(1)
+
+ #return {'dist': dist, 'value': value, 'memory': memory, 'extra_predictions': extra_predictions}
+ return [dist], value, memory
+
+ def _get_instr_embedding(self, instr):
+ lengths = (instr != 0).sum(1).long()
+ if self.lang_model == 'gru':
+ out, _ = self.instr_rnn(self.word_embedding(instr))
+ hidden = out[range(len(lengths)), lengths-1, :]
+ return hidden
+
+ elif self.lang_model in ['bigru', 'attgru']:
+ masks = (instr != 0).float()
+
+ if lengths.shape[0] > 1:
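+                # pack_padded_sequence expects sequences sorted by decreasing length;
+                # iperm_idx is the inverse permutation used to restore the original order below.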
+ seq_lengths, perm_idx = lengths.sort(0, descending=True)
+ iperm_idx = torch.LongTensor(perm_idx.shape).fill_(0)
+ if instr.is_cuda: iperm_idx = iperm_idx.cuda()
+ for i, v in enumerate(perm_idx):
+ iperm_idx[v.data] = i
+
+ inputs = self.word_embedding(instr)
+ inputs = inputs[perm_idx]
+
+ inputs = pack_padded_sequence(inputs, seq_lengths.data.cpu().numpy(), batch_first=True)
+
+ outputs, final_states = self.instr_rnn(inputs)
+ else:
+ instr = instr[:, 0:lengths[0]]
+ outputs, final_states = self.instr_rnn(self.word_embedding(instr))
+ iperm_idx = None
+ final_states = final_states.transpose(0, 1).contiguous()
+ final_states = final_states.view(final_states.shape[0], -1)
+ if iperm_idx is not None:
+ outputs, _ = pad_packed_sequence(outputs, batch_first=True)
+ outputs = outputs[iperm_idx]
+ final_states = final_states[iperm_idx]
+
+ return outputs if self.lang_model == 'attgru' else final_states
+
+ else:
+ ValueError("Undefined instruction architecture: {}".format(self.use_instr))
+
+ # add action sampling to fit our interaction pipeline
+ def sample_action(self, dist):
+ return torch.stack([d.sample() for d in dist], dim=1)
+
+ # add construct final action to fit our interaction pipeline
+ def construct_final_action(self, action):
+ return action
+
+ # add calculate log probs to fit our interaction pipeline
+ def calculate_log_probs(self, dist, action):
+ return torch.stack([d.log_prob(action[:, i]) for i, d in enumerate(dist)], dim=1)
+
+ # add calculate action masks to fit our interaction pipeline
+ def calculate_action_masks(self, action):
+ mask = torch.ones_like(action)
+ assert action.shape == mask.shape
+ return mask
+
+ def get_config_dict(self):
+ del self.config['__class__']
+ self.config['self'] = str(self.config['self'])
+ self.config['action_space'] = self.config['action_space'].nvec.tolist()
+ return self.config
diff --git a/models/blindtalkmultiheadedac.py b/models/blindtalkmultiheadedac.py
new file mode 100644
index 0000000000000000000000000000000000000000..e233947250525e5321142255d04ca047bc9bf571
--- /dev/null
+++ b/models/blindtalkmultiheadedac.py
@@ -0,0 +1,181 @@
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.distributions.categorical import Categorical
+import torch_ac
+
+from utils.other import init_params
+
+
+
+
+class BlindTalkingMultiHeadedACModel(nn.Module, torch_ac.RecurrentACModel):
+ def __init__(self, obs_space, action_space, use_memory=False, use_text=False, use_dialogue=False):
+ super().__init__()
+
+ # Decide which components are enabled
+ self.use_text = use_text
+ self.use_dialogue = use_dialogue
+ self.use_memory = use_memory
+
+ # multi dim
+ if action_space.shape == ():
+ raise ValueError("The action space is not multi modal. Use ACModel instead.")
+
+ self.n_primitive_actions = action_space.nvec[0] + 1 # for talk
+ self.talk_action = int(self.n_primitive_actions) - 1
+
+ self.n_utterance_actions = action_space.nvec[1:]
+
+ # in this model the talking is just finding one right thing to say
+ self.utterance_actions_params = [
+ torch.nn.Parameter(torch.ones(n)) for n in self.n_utterance_actions
+ ]
+ for i, p in enumerate(self.utterance_actions_params):
+ self.register_parameter(
+ name="utterance_p_{}".format(i),
+ param=p
+ )
+
+ # Define image embedding
+ self.image_conv = nn.Sequential(
+ nn.Conv2d(3, 16, (2, 2)),
+ nn.ReLU(),
+ nn.MaxPool2d((2, 2)),
+ nn.Conv2d(16, 32, (2, 2)),
+ nn.ReLU(),
+ nn.Conv2d(32, 64, (2, 2)),
+ nn.ReLU()
+ )
+ n = obs_space["image"][0]
+ m = obs_space["image"][1]
+ self.image_embedding_size = ((n-1)//2-2)*((m-1)//2-2)*64
+
+ # Define memory
+ if self.use_memory:
+ self.memory_rnn = nn.LSTMCell(self.image_embedding_size, self.semi_memory_size)
+
+ if self.use_text or self.use_dialogue:
+ self.word_embedding_size = 32
+ self.word_embedding = nn.Embedding(obs_space["text"], self.word_embedding_size)
+
+ # Define text embedding
+ if self.use_text:
+ self.text_embedding_size = 128
+ self.text_rnn = nn.GRU(self.word_embedding_size, self.text_embedding_size, batch_first=True)
+
+ # Define dialogue embedding
+ if self.use_dialogue:
+ self.dialogue_embedding_size = 128
+ self.dialogue_rnn = nn.GRU(self.word_embedding_size, self.dialogue_embedding_size, batch_first=True)
+
+ # Resize image embedding
+ self.embedding_size = self.semi_memory_size
+
+ if self.use_text:
+ self.embedding_size += self.text_embedding_size
+
+ if self.use_dialogue:
+ self.embedding_size += self.dialogue_embedding_size
+
+ # Define actor's model
+ self.actor = nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, self.n_primitive_actions)
+ )
+
+ # Define critic's model
+ self.critic = nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, 1)
+ )
+
+ # Initialize parameters correctly
+ self.apply(init_params)
+
+ @property
+ def memory_size(self):
+ return 2*self.semi_memory_size
+
+ @property
+ def semi_memory_size(self):
+ return self.image_embedding_size
+
+ def forward(self, obs, memory):
+ x = obs.image.transpose(1, 3).transpose(2, 3)
+ x = self.image_conv(x)
+
+ batch_size = x.shape[0]
+ x = x.reshape(batch_size, -1)
+
+ if self.use_memory:
+ hidden = (memory[:, :self.semi_memory_size], memory[:, self.semi_memory_size:])
+ hidden = self.memory_rnn(x, hidden)
+ embedding = hidden[0]
+ memory = torch.cat(hidden, dim=1)
+ else:
+ embedding = x
+
+ if self.use_text:
+ embed_text = self._get_embed_text(obs.text)
+ embedding = torch.cat((embedding, embed_text), dim=1)
+
+ if self.use_dialogue:
+ embed_dial = self._get_embed_dialogue(obs.dialogue)
+ embedding = torch.cat((embedding, embed_dial), dim=1)
+
+ x = self.actor(embedding)
+        primitive_actions_dist = Categorical(logits=F.log_softmax(x, dim=1))
+
+ x = self.critic(embedding)
+ value = x.squeeze(1)
+
+        # construct utterance action distributions; in this model they are context-independent
+        # learned parameters, not conditioned on the observation
+ utterance_actions_dists = [Categorical(logits=p.repeat(batch_size, 1)) for p in self.utterance_actions_params]
+ # print("utterance params argmax: ", list(map(lambda x: int(x.argmax()), self.utterance_actions_params)))
+ # print("utterance params", self.utterance_actions_params)
+
+        dist = [primitive_actions_dist] + utterance_actions_dists
+
+ return dist, value, memory
+
+ def sample_action(self, dist):
+ return torch.stack([d.sample() for d in dist], dim=1)
+
+ def calculate_log_probs(self, dist, action):
+ return torch.stack([d.log_prob(action[:, i]) for i, d in enumerate(dist)], dim=1)
+
+ def calculate_action_masks(self, action):
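+        # The primitive-action head is always trained; the utterance heads only receive
+        # gradient on steps where the "talk" action was chosen.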
+ talk_mask = action[:, 0] == self.talk_action
+ mask = torch.stack(
+ (torch.ones_like(talk_mask), talk_mask, talk_mask),
+ dim=1).detach()
+
+ assert action.shape == mask.shape
+
+ return mask
+ # return torch.ones_like(mask).detach()
+
+ def construct_final_action(self, action):
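+        # NaN marks the components the environment should ignore: if the primitive action
+        # is not "talk", the utterance components are masked out, and vice versa.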
+ act_mask = action[:, 0] != self.n_primitive_actions - 1
+
+ nan_mask = np.array([
+ np.array([1, np.nan, np.nan]) if t else np.array([np.nan, 1, 1]) for t in act_mask
+ ])
+
+ action = nan_mask*action
+
+ return action
+
+ def _get_embed_text(self, text):
+ _, hidden = self.text_rnn(self.word_embedding(text))
+ return hidden[-1]
+
+ def _get_embed_dialogue(self, dial):
+ _, hidden = self.dialogue_rnn(self.word_embedding(dial))
+ return hidden[-1]
+
+
diff --git a/models/dialogue_memory_multiheadedac.py b/models/dialogue_memory_multiheadedac.py
new file mode 100644
index 0000000000000000000000000000000000000000..7c053f49218115f745b967b44cf769fef7a0ae6c
--- /dev/null
+++ b/models/dialogue_memory_multiheadedac.py
@@ -0,0 +1,170 @@
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.distributions.categorical import Categorical
+import torch_ac
+
+
+from utils.other import init_params
+
+
+class DialogueMemoryMultiHeadedACModel(nn.Module, torch_ac.RecurrentACModel):
+ def __init__(self, obs_space, action_space, use_memory=False, use_text=False, use_dialogue=False):
+ super().__init__()
+
+ # Decide which components are enabled
+ self.use_text = use_text
+ self.use_dialogue = use_dialogue
+ self.use_memory = use_memory
+
+ if not self.use_memory:
+ raise ValueError("You should not be using this model. Use MultiHeadedACModel instead")
+
+ if not self.use_dialogue:
+ raise ValueError("You should not be using this model. Use ACModel instead")
+
+ if self.use_text:
+ raise ValueError("You should not use text but dialogue.")
+
+ # multi dim
+ if action_space.shape == ():
+ raise ValueError("The action space is not multi modal. Use ACModel instead.")
+
+ self.n_primitive_actions = action_space.nvec[0] + 1 # for talk
+ self.talk_action = int(self.n_primitive_actions) - 1
+
+ self.n_utterance_actions = action_space.nvec[1:]
+
+ # Define image embedding
+ self.image_conv = nn.Sequential(
+ nn.Conv2d(3, 16, (2, 2)),
+ nn.ReLU(),
+ nn.MaxPool2d((2, 2)),
+ nn.Conv2d(16, 32, (2, 2)),
+ nn.ReLU(),
+ nn.Conv2d(32, 64, (2, 2)),
+ nn.ReLU()
+ )
+ n = obs_space["image"][0]
+ m = obs_space["image"][1]
+ self.image_embedding_size = ((n-1)//2-2)*((m-1)//2-2)*64
+
+ if self.use_text or self.use_dialogue:
+ self.word_embedding_size = 32
+ self.word_embedding = nn.Embedding(obs_space["text"], self.word_embedding_size)
+
+ # Define text embedding
+ if self.use_text:
+ self.text_embedding_size = 128
+ self.text_rnn = nn.GRU(self.word_embedding_size, self.text_embedding_size, batch_first=True)
+
+ # Define dialogue embedding
+ if self.use_dialogue:
+ self.dialogue_embedding_size = 128
+ self.dialogue_rnn = nn.GRU(self.word_embedding_size, self.dialogue_embedding_size, batch_first=True)
+
+ # Resize image embedding
+ self.embedding_size = self.image_embedding_size
+
+ if self.use_text:
+ self.embedding_size += self.text_embedding_size
+
+ if self.use_dialogue:
+ self.embedding_size += self.dialogue_embedding_size
+
+ # Define actor's model
+ self.actor = nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, self.n_primitive_actions)
+ )
+ self.talker = nn.ModuleList([
+ nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, n)
+ ) for n in self.n_utterance_actions])
+
+ # Define critic's model
+ self.critic = nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, 1)
+ )
+
+ # Initialize parameters correctly
+ self.apply(init_params)
+
+ @property
+ def memory_size(self):
+ return self.dialogue_embedding_size
+
+ def forward(self, obs, memory):
+ x = obs.image.transpose(1, 3).transpose(2, 3)
+ x = self.image_conv(x)
+
+ batch_size = x.shape[0]
+ x = x.reshape(batch_size, -1)
+
+ embedding = x
+
+ if self.use_text:
+ embed_text = self._get_embed_text(obs.text)
+ embedding = torch.cat((embedding, embed_text), dim=1)
+
+ if self.use_dialogue:
+ embed_dial, memory = self._get_embed_dialogue(obs.dialogue, memory)
+ embedding = torch.cat((embedding, embed_dial), dim=1)
+
+ x = self.actor(embedding)
+ primitive_actions_dist = Categorical(logits=F.log_softmax(x, dim=1))
+
+ x = self.critic(embedding)
+ value = x.squeeze(1)
+ utterance_actions_dists = [
+ Categorical(logits=F.log_softmax(
+ tal(embedding),
+ dim=1,
+ )) for tal in self.talker
+ ]
+
+ dist = [primitive_actions_dist] + utterance_actions_dists
+
+ return dist, value, memory
+
+ def sample_action(self, dist):
+ return torch.stack([d.sample() for d in dist], dim=1)
+
+ def calculate_log_probs(self, dist, action):
+ return torch.stack([d.log_prob(action[:, i]) for i, d in enumerate(dist)], dim=1)
+
+ def calculate_action_masks(self, action):
+ talk_mask = action[:, 0] == self.talk_action
+ mask = torch.stack(
+ (torch.ones_like(talk_mask), talk_mask, talk_mask),
+ dim=1).detach()
+
+ assert action.shape == mask.shape
+
+ return mask
+
+ def construct_final_action(self, action):
+ act_mask = action[:, 0] != self.n_primitive_actions - 1
+
+ nan_mask = np.array([
+ np.array([1, np.nan, np.nan]) if t else np.array([np.nan, 1, 1]) for t in act_mask
+ ])
+
+ action = nan_mask*action
+
+ return action
+
+ def _get_embed_text(self, text):
+ _, hidden = self.text_rnn(self.word_embedding(text))
+
+ return hidden[-1]
+
+ def _get_embed_dialogue(self, dial, memory):
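+        # The full dialogue is re-encoded from scratch each step; the final GRU hidden state
+        # is returned both as the embedding and as the new memory (the incoming memory is unused).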
+        _, hidden = self.dialogue_rnn(self.word_embedding(dial))
+ return hidden[-1], hidden[-1]
diff --git a/models/mm_memory_multiheadedac.py b/models/mm_memory_multiheadedac.py
new file mode 100644
index 0000000000000000000000000000000000000000..34d34f1c4cb5309ee68e670b04485de158f65c52
--- /dev/null
+++ b/models/mm_memory_multiheadedac.py
@@ -0,0 +1,179 @@
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.distributions.categorical import Categorical
+import torch_ac
+
+
+from utils.other import init_params
+
+
+class MMMemoryMultiHeadedACModel(nn.Module, torch_ac.RecurrentACModel):
+ def __init__(self, obs_space, action_space, use_memory=False, use_text=False, use_dialogue=False):
+ super().__init__()
+
+ # Decide which components are enabled
+ self.use_text = use_text
+ self.use_dialogue = use_dialogue
+ self.use_memory = use_memory
+
+ if not self.use_memory:
+ raise ValueError("You should not be using this model. Use MultiHeadedACModel instead")
+
+ if self.use_text:
+ raise ValueError("You should not use text but dialogue.")
+
+ # multi dim
+ if action_space.shape == ():
+ raise ValueError("The action space is not multi modal. Use ACModel instead.")
+
+ self.n_primitive_actions = action_space.nvec[0] + 1 # for talk
+ self.talk_action = int(self.n_primitive_actions) - 1
+
+ self.n_utterance_actions = action_space.nvec[1:]
+
+ # Define image embedding
+ self.image_conv = nn.Sequential(
+ nn.Conv2d(3, 16, (2, 2)),
+ nn.ReLU(),
+ nn.MaxPool2d((2, 2)),
+ nn.Conv2d(16, 32, (2, 2)),
+ nn.ReLU(),
+ nn.Conv2d(32, 64, (2, 2)),
+ nn.ReLU()
+ )
+ n = obs_space["image"][0]
+ m = obs_space["image"][1]
+ self.image_embedding_size = ((n-1)//2-2)*((m-1)//2-2)*64
+
+ if self.use_text or self.use_dialogue:
+ self.word_embedding_size = 32
+ self.word_embedding = nn.Embedding(obs_space["text"], self.word_embedding_size)
+
+ # Define text embedding
+ if self.use_text:
+ self.text_embedding_size = 128
+ self.text_rnn = nn.GRU(self.word_embedding_size, self.text_embedding_size, batch_first=True)
+
+ # Define dialogue embedding
+ if self.use_dialogue:
+ self.dialogue_embedding_size = 128
+ self.dialogue_rnn = nn.GRU(self.word_embedding_size, self.dialogue_embedding_size, batch_first=True)
+
+ # Resize image embedding
+ self.embedding_size = self.image_embedding_size
+
+ if self.use_text:
+ self.embedding_size += self.text_embedding_size
+
+ if self.use_dialogue:
+ self.embedding_size += self.dialogue_embedding_size
+
+ if self.use_memory:
+ self.memory_rnn = nn.LSTMCell(self.embedding_size, self.embedding_size)
+
+ # Define actor's model
+ self.actor = nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, self.n_primitive_actions)
+ )
+ self.talker = nn.ModuleList([
+ nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, n)
+ ) for n in self.n_utterance_actions])
+
+ # Define critic's model
+ self.critic = nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, 1)
+ )
+
+ # Initialize parameters correctly
+ self.apply(init_params)
+
+ @property
+ def memory_size(self):
+ return 2*self.semi_memory_size
+
+ @property
+ def semi_memory_size(self):
+ return self.embedding_size
+
+ def forward(self, obs, memory):
+ x = obs.image.transpose(1, 3).transpose(2, 3)
+ x = self.image_conv(x)
+
+ batch_size = x.shape[0]
+ x = x.reshape(batch_size, -1)
+
+ embedding = x
+
+ if self.use_text:
+ embed_text = self._get_embed_text(obs.text)
+ embedding = torch.cat((embedding, embed_text), dim=1)
+
+ if self.use_dialogue:
+ embed_dial = self._get_embed_dialogue(obs.dialogue)
+ embedding = torch.cat((embedding, embed_dial), dim=1)
+
+ if self.use_memory:
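+            # Here the LSTM memory operates on the joint image + dialogue embedding,
+            # rather than on the image embedding alone as in the other models.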
+ hidden = (memory[:, :self.semi_memory_size], memory[:, self.semi_memory_size:])
+ hidden = self.memory_rnn(embedding, hidden)
+ embedding = hidden[0]
+ memory = torch.cat(hidden, dim=1)
+
+ x = self.actor(embedding)
+ primitive_actions_dist = Categorical(logits=F.log_softmax(x, dim=1))
+
+ x = self.critic(embedding)
+ value = x.squeeze(1)
+ utterance_actions_dists = [
+ Categorical(logits=F.log_softmax(
+ tal(embedding),
+ dim=1,
+ )) for tal in self.talker
+ ]
+
+ dist = [primitive_actions_dist] + utterance_actions_dists
+
+ return dist, value, memory
+
+ def sample_action(self, dist):
+ return torch.stack([d.sample() for d in dist], dim=1)
+
+ def calculate_log_probs(self, dist, action):
+ return torch.stack([d.log_prob(action[:, i]) for i, d in enumerate(dist)], dim=1)
+
+ def calculate_action_masks(self, action):
+ talk_mask = action[:, 0] == self.talk_action
+ mask = torch.stack(
+ (torch.ones_like(talk_mask), talk_mask, talk_mask),
+ dim=1).detach()
+
+ assert action.shape == mask.shape
+
+ return mask
+
+ def construct_final_action(self, action):
+ act_mask = action[:, 0] != self.n_primitive_actions - 1
+
+ nan_mask = np.array([
+ np.array([1, np.nan, np.nan]) if t else np.array([np.nan, 1, 1]) for t in act_mask
+ ])
+
+ action = nan_mask*action
+
+ return action
+
+ def _get_embed_text(self, text):
+ _, hidden = self.text_rnn(self.word_embedding(text))
+ return hidden[-1]
+
+ def _get_embed_dialogue(self, dial):
+ _, hidden = self.dialogue_rnn(self.word_embedding(dial))
+ return hidden[-1]
diff --git a/models/multiheadedac.py b/models/multiheadedac.py
new file mode 100644
index 0000000000000000000000000000000000000000..ecb7525a4de977fbaaf153f3183b95810ddb2803
--- /dev/null
+++ b/models/multiheadedac.py
@@ -0,0 +1,198 @@
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.distributions.categorical import Categorical
+import torch_ac
+from gym import spaces  # needed for model_raw_action_space below
+
+
+from utils.other import init_params
+
+
+
+
+
+class MultiHeadedACModel(nn.Module, torch_ac.RecurrentACModel):
+ def __init__(self, obs_space, action_space, use_memory=False, use_text=False, use_dialogue=False):
+ super().__init__()
+
+ # store config
+ self.config = locals()
+
+ # Decide which components are enabled
+ self.use_text = use_text
+ self.use_dialogue = use_dialogue
+ self.use_memory = use_memory
+
+ if self.use_text:
+ raise ValueError("You should not use text but dialogue. --text is cheating.")
+
+ # multi dim
+ if action_space.shape == ():
+ raise ValueError("The action space is not multi modal. Use ACModel instead.")
+
+ self.n_primitive_actions = action_space.nvec[0] + 1 # for talk
+ self.talk_action = int(self.n_primitive_actions) - 1
+
+ self.n_utterance_actions = action_space.nvec[1:]
+
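+        # The model's raw action space adds one extra primitive action ("talk") on top of
+        # the environment's action space; it gates whether the utterance heads are used.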
+ self.env_action_space = action_space
+ self.model_raw_action_space = spaces.MultiDiscrete([self.n_primitive_actions, *self.n_utterance_actions])
+
+ # Define image embedding
+ self.image_conv = nn.Sequential(
+ nn.Conv2d(3, 16, (2, 2)),
+ nn.ReLU(),
+ nn.MaxPool2d((2, 2)),
+ nn.Conv2d(16, 32, (2, 2)),
+ nn.ReLU(),
+ nn.Conv2d(32, 64, (2, 2)),
+ nn.ReLU()
+ )
+ n = obs_space["image"][0]
+ m = obs_space["image"][1]
+ self.image_embedding_size = ((n-1)//2-2)*((m-1)//2-2)*64
+
+ # Define memory
+ if self.use_memory:
+ self.memory_rnn = nn.LSTMCell(self.image_embedding_size, self.semi_memory_size)
+
+ if self.use_text or self.use_dialogue:
+ self.word_embedding_size = 32
+ self.word_embedding = nn.Embedding(obs_space["text"], self.word_embedding_size)
+
+ # Define text embedding
+ if self.use_text:
+ self.text_embedding_size = 128
+ self.text_rnn = nn.GRU(self.word_embedding_size, self.text_embedding_size, batch_first=True)
+
+ # Define dialogue embedding
+ if self.use_dialogue:
+ self.dialogue_embedding_size = 128
+ self.dialogue_rnn = nn.GRU(self.word_embedding_size, self.dialogue_embedding_size, batch_first=True)
+
+ # Resize image embedding
+ self.embedding_size = self.semi_memory_size
+
+ if self.use_text:
+ self.embedding_size += self.text_embedding_size
+
+ if self.use_dialogue:
+ self.embedding_size += self.dialogue_embedding_size
+
+ # Define actor's model
+ self.actor = nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, self.n_primitive_actions)
+ )
+ self.talker = nn.ModuleList([
+ nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, n)
+ ) for n in self.n_utterance_actions])
+
+ # Define critic's model
+ self.critic = nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, 1)
+ )
+
+ # Initialize parameters correctly
+ self.apply(init_params)
+
+ @property
+ def memory_size(self):
+ return 2*self.semi_memory_size
+
+ @property
+ def semi_memory_size(self):
+ return self.image_embedding_size
+
+ def forward(self, obs, memory):
+ x = obs.image.transpose(1, 3).transpose(2, 3)
+ x = self.image_conv(x)
+
+ batch_size = x.shape[0]
+ x = x.reshape(batch_size, -1)
+
+ if self.use_memory:
+ hidden = (memory[:, :self.semi_memory_size], memory[:, self.semi_memory_size:])
+ hidden = self.memory_rnn(x, hidden)
+ embedding = hidden[0]
+ memory = torch.cat(hidden, dim=1)
+ else:
+ embedding = x
+
+ if self.use_text:
+ embed_text = self._get_embed_text(obs.text)
+ embedding = torch.cat((embedding, embed_text), dim=1)
+
+ if self.use_dialogue:
+ if not hasattr(obs, "utterance_history"):
+ raise ValueError("The environment need's to be updated to 'utterance' and 'utterance_history' keys'")
+
+ embed_dial = self._get_embed_dialogue(obs.utterance_history)
+
+ embedding = torch.cat((embedding, embed_dial), dim=1)
+
+ x = self.actor(embedding)
+ primitive_actions_dist = Categorical(logits=F.log_softmax(x, dim=1))
+
+ x = self.critic(embedding)
+ value = x.squeeze(1)
+ utterance_actions_dists = [
+ Categorical(logits=F.log_softmax(
+ tal(embedding),
+ dim=1,
+ )) for tal in self.talker
+ ]
+
+ dist = [primitive_actions_dist] + utterance_actions_dists
+
+ return dist, value, memory
+
+ def sample_action(self, dist):
+ return torch.stack([d.sample() for d in dist], dim=1)
+
+ def calculate_log_probs(self, dist, action):
+ return torch.stack([d.log_prob(action[:, i]) for i, d in enumerate(dist)], dim=1)
+
+ def calculate_action_masks(self, action):
+ talk_mask = action[:, 0] == self.talk_action
+ mask = torch.stack(
+ (torch.ones_like(talk_mask), talk_mask, talk_mask),
+ dim=1).detach()
+
+ assert action.shape == mask.shape
+
+ return mask
+
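+ # Illustrative example of construct_final_action (hypothetical values, assuming
+ # n_primitive_actions = 7 so that the extra "talk" action has index 6):
+ #   raw action [2, 1, 3] -> [ 2., nan, nan]  (primitive move, utterance heads ignored)
+ #   raw action [6, 1, 3] -> [nan,  1.,  3.]  (talk action, utterance heads kept)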
+ def construct_final_action(self, action):
+ act_mask = action[:, 0] != self.n_primitive_actions - 1
+
+ nan_mask = np.array([
+ np.array([1, np.nan, np.nan]) if t else np.array([np.nan, 1, 1]) for t in act_mask
+ ])
+
+ action = nan_mask*action
+
+ return action
+
+ def _get_embed_text(self, text):
+ _, hidden = self.text_rnn(self.word_embedding(text))
+ return hidden[-1]
+
+ def _get_embed_dialogue(self, dial):
+ _, hidden = self.dialogue_rnn(self.word_embedding(dial))
+ return hidden[-1]
+
+ def get_config_dict(self):
+ del self.config['__class__']
+ self.config['self'] = str(self.config['self'])
+ self.config['action_space'] = self.config['action_space'].nvec.tolist()
+ return self.config
+
+
diff --git a/models/multiheadedbabyai11.py b/models/multiheadedbabyai11.py
new file mode 100644
index 0000000000000000000000000000000000000000..9db104efa34e5ae023608d50965226fb46eed006
--- /dev/null
+++ b/models/multiheadedbabyai11.py
@@ -0,0 +1,419 @@
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.autograd import Variable
+from torch.distributions.categorical import Categorical
+from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
+from utils.babyai_utils.supervised_losses import required_heads
+import torch_ac
+import gym.spaces as spaces
+
+
+
+
+
+# From https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/blob/master/model.py
+def initialize_parameters(m):
+ classname = m.__class__.__name__
+ if classname.find('Linear') != -1:
+ m.weight.data.normal_(0, 1)
+ m.weight.data *= 1 / torch.sqrt(m.weight.data.pow(2).sum(1, keepdim=True))
+ if m.bias is not None:
+ m.bias.data.fill_(0)
+
+
+# Inspired by FiLMedBlock from https://arxiv.org/abs/1709.07871
+class FiLM(nn.Module):
+ def __init__(self, in_features, out_features, in_channels, imm_channels):
+ super().__init__()
+ self.conv1 = nn.Conv2d(
+ in_channels=in_channels, out_channels=imm_channels,
+ kernel_size=(3, 3), padding=1)
+ self.bn1 = nn.BatchNorm2d(imm_channels)
+ self.conv2 = nn.Conv2d(
+ in_channels=imm_channels, out_channels=out_features,
+ kernel_size=(3, 3), padding=1)
+ self.bn2 = nn.BatchNorm2d(out_features)
+
+ self.weight = nn.Linear(in_features, out_features)
+ self.bias = nn.Linear(in_features, out_features)
+
+ self.apply(initialize_parameters)
+
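+ # FiLM modulation: the conditioning vector y (here the dialogue embedding) produces a
+ # per-channel scale and shift applied to the convolutional features of x,
+ # i.e. out = ReLU(BN(conv(x) * weight(y) + bias(y))).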
+ def forward(self, x, y):
+ x = F.relu(self.bn1(self.conv1(x)))
+ x = self.conv2(x)
+ weight = self.weight(y).unsqueeze(2).unsqueeze(3)
+ bias = self.bias(y).unsqueeze(2).unsqueeze(3)
+ out = x * weight + bias
+ return F.relu(self.bn2(out))
+
+
+class ImageBOWEmbedding(nn.Module):
+ def __init__(self, max_value, embedding_dim):
+ super().__init__()
+ self.max_value = max_value
+ self.embedding_dim = embedding_dim
+ self.embedding = nn.Embedding(3 * max_value, embedding_dim)
+ self.apply(initialize_parameters)
+
+ def forward(self, inputs):
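+ # Each of the 3 channels of the MiniGrid encoding (object type, color, state) is shifted
+ # into its own embedding-index range via the offsets below; the per-channel embeddings
+ # are then summed into a single "bag-of-words" image embedding.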
+ offsets = torch.Tensor([0, self.max_value, 2 * self.max_value]).to(inputs.device)
+ inputs = (inputs + offsets[None, :, None, None]).long()
+ return self.embedding(inputs).sum(1).permute(0, 3, 1, 2)
+
+# Note: what the BabyAI codebase calls "instr" is what we call "text" here.
+# Adapted from BabyAI's ACModel (babyai.rl.RecurrentACModel).
+class MultiHeadedBaby11ACModel(nn.Module, torch_ac.RecurrentACModel):
+ def __init__(self, obs_space, action_space,
+ image_dim=128, memory_dim=128, text_dim=128, dialog_dim=128,
+ use_text=False, use_dialogue=False, use_current_dialogue_only=False, lang_model="gru", use_memory=False,
+ arch="bow_endpool_res", aux_info=None):
+ super().__init__()
+
+ # store config
+ self.config = locals()
+
+ if use_current_dialogue_only:
+ raise NotImplementedError("current dialogue only")
+
+ # multi dim
+ if action_space.shape == ():
+ raise ValueError("The action space is not multi modal. Use ACModel instead.")
+
+ if use_text: # for now we do not consider goal conditioned policies
+ raise ValueError("You should not use text but dialogue. --text is cheating.")
+
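+ # The arch string (default "bow_endpool_res") toggles the following options:
+ #   'bow'     - bag-of-words image embedding (ImageBOWEmbedding)
+ #   'endpool' - pool only at the end of the CNN instead of between convolutions
+ #   'pixel'   - raw pixel input (adds an 8x8 stride-8 conv and /256 scaling)
+ #   'res'     - residual connections around the FiLM controllers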
+ endpool = 'endpool' in arch
+ use_bow = 'bow' in arch
+ pixel = 'pixel' in arch
+ self.res = 'res' in arch
+
+ # Decide which components are enabled
+ self.use_text = use_text
+ self.use_dialogue = use_dialogue
+ self.use_memory = use_memory
+ self.arch = arch
+ self.lang_model = lang_model
+ self.aux_info = aux_info
+ if self.res and image_dim != 128:
+ raise ValueError(f"image_dim is {image_dim}, expected 128")
+ self.image_dim = image_dim
+ self.memory_dim = memory_dim
+ self.text_dim = text_dim
+ self.dialog_dim = dialog_dim
+
+
+ self.n_primitive_actions = action_space.nvec[0] + 1 # for talk
+ self.talk_action = int(self.n_primitive_actions) - 1
+ self.n_utterance_actions = action_space.nvec[1:]
+
+ self.env_action_space = action_space
+ self.model_raw_action_space = spaces.MultiDiscrete([self.n_primitive_actions, *self.n_utterance_actions])
+
+ self.obs_space = obs_space
+ # transform given 3d obs_space into what babyai11 baseline uses, i.e. 1d embedding size
+ n = obs_space["image"][0]
+ m = obs_space["image"][1]
+ self.obs_space = ((n-1)//2-2)*((m-1)//2-2)*64
+
+ for part in self.arch.split('_'):
+ if part not in ['original', 'bow', 'pixels', 'endpool', 'res']:
+ raise ValueError("Incorrect architecture name: {}".format(self.arch))
+
+ # if not self.use_text:
+ # raise ValueError("FiLM architecture can only be used when instructions are enabled")
+ self.image_conv = nn.Sequential(*[
+ *([ImageBOWEmbedding(obs_space['image'], 128)] if use_bow else []),
+ *([nn.Conv2d(
+ in_channels=3, out_channels=128, kernel_size=(8, 8),
+ stride=8, padding=0)] if pixel else []),
+ nn.Conv2d(
+ in_channels=128 if use_bow or pixel else 3, out_channels=128,
+ kernel_size=(3, 3) if endpool else (2, 2), stride=1, padding=1),
+ nn.BatchNorm2d(128),
+ nn.ReLU(),
+ *([] if endpool else [nn.MaxPool2d(kernel_size=(2, 2), stride=2)]),
+ nn.Conv2d(in_channels=128, out_channels=128, kernel_size=(3, 3), padding=1),
+ nn.BatchNorm2d(128),
+ nn.ReLU(),
+ *([] if endpool else [nn.MaxPool2d(kernel_size=(2, 2), stride=2)])
+ ])
+ self.film_pool = nn.MaxPool2d(kernel_size=(7, 7) if endpool else (2, 2), stride=2)
+
+ # Define DIALOGUE embedding
+ if self.use_dialogue:
+ if self.lang_model in ['gru', 'bigru', 'attgru']:
+ #self.word_embedding = nn.Embedding(obs_space["instr"], self.dialog_dim)
+ self.word_embedding = nn.Embedding(obs_space["text"], self.dialog_dim)
+ if self.lang_model in ['gru', 'bigru', 'attgru']:
+ gru_dim = self.dialog_dim
+ if self.lang_model in ['bigru', 'attgru']:
+ gru_dim //= 2
+ self.dialog_rnn = nn.GRU(
+ self.dialog_dim, gru_dim, batch_first=True,
+ bidirectional=(self.lang_model in ['bigru', 'attgru']))
+ self.final_dialog_dim = self.dialog_dim
+ else:
+ kernel_dim = 64
+ kernel_sizes = [3, 4]
+ self.dialog_convs = nn.ModuleList([
+ nn.Conv2d(1, kernel_dim, (K, self.dialog_dim)) for K in kernel_sizes])
+ self.final_dialog_dim = kernel_dim * len(kernel_sizes)
+
+ if self.lang_model == 'attgru':
+ self.memory2key = nn.Linear(self.memory_size, self.final_dialog_dim)
+
+ num_module = 2
+ self.controllers = []
+ for ni in range(num_module):
+ mod = FiLM(
+ in_features=self.final_dialog_dim,
+ out_features=128 if ni < num_module-1 else self.image_dim,
+ in_channels=128, imm_channels=128)
+ self.controllers.append(mod)
+ self.add_module('FiLM_' + str(ni), mod)
+
+ # Define memory and resize image embedding
+ self.embedding_size = self.image_dim
+ if self.use_memory:
+ self.memory_rnn = nn.LSTMCell(self.image_dim, self.memory_dim)
+ self.embedding_size = self.semi_memory_size
+
+ # Define actor's model
+ self.actor = nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, self.n_primitive_actions)
+ )
+
+ self.talker = nn.ModuleList([
+ nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, n)
+ ) for n in self.n_utterance_actions])
+
+ # Define critic's model
+ self.critic = nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, 1)
+ )
+
+ # Initialize parameters correctly
+ self.apply(initialize_parameters)
+
+ # Define head for extra info
+ if self.aux_info:
+ self.extra_heads = None
+ self.add_heads()
+
+ def add_heads(self):
+ '''
+ When using auxiliary tasks, the environment yields at each step some binary, continuous, or multiclass
+ information that the agent has to predict. This function adds extra heads to the model that output
+ those predictions, one head per piece of extra information (the head type depends on the information type).
+ '''
+ self.extra_heads = nn.ModuleDict()
+ for info in self.aux_info:
+ if required_heads[info] == 'binary':
+ self.extra_heads[info] = nn.Linear(self.embedding_size, 1)
+ elif required_heads[info].startswith('multiclass'):
+ n_classes = int(required_heads[info].split('multiclass')[-1])
+ self.extra_heads[info] = nn.Linear(self.embedding_size, n_classes)
+ elif required_heads[info].startswith('continuous'):
+ if required_heads[info].endswith('01'):
+ self.extra_heads[info] = nn.Sequential(nn.Linear(self.embedding_size, 1), nn.Sigmoid())
+ else:
+ raise ValueError('Only continuous01 is implemented')
+ else:
+ raise ValueError('Type not supported')
+ # initializing these parameters independently is done in order to have consistency of results when using
+ # supervised-loss-coef = 0 and when not using any extra binary information
+ self.extra_heads[info].apply(initialize_parameters)
+
+ def add_extra_heads_if_necessary(self, aux_info):
+ '''
+ This function makes it possible to take a pre-trained model that was saved without aux_info,
+ add aux_info heads to it, and still fine-tune it.
+ '''
+ try:
+ if not hasattr(self, 'aux_info') or not set(self.aux_info) == set(aux_info):
+ self.aux_info = aux_info
+ self.add_heads()
+ except Exception:
+ raise ValueError('Could not add extra heads')
+
+ @property
+ def memory_size(self):
+ return 2 * self.semi_memory_size
+
+ @property
+ def semi_memory_size(self):
+ return self.memory_dim
+
+ def forward(self, obs, memory, dialog_embedding=None):
+ if self.use_dialogue and dialog_embedding is None:
+ #instr_embedding = self._get_instr_embedding(obs.instr)
+
+ if not hasattr(obs, "utterance_history"):
+ raise ValueError("The environment need's to be updated to 'utterance' and 'utterance_history' keys'")
+
+ dialog_embedding = self._get_dialog_embedding(obs.utterance_history)
+ if self.use_dialogue and self.lang_model == "attgru":
+ # outputs: B x L x D
+ # memory: B x M
+ #mask = (obs.instr != 0).float()
+ mask = (obs.utterance_history != 0).float()
+ # The mask tensor has the same length as obs.utterance_history, and
+ # thus can be both shorter and longer than dialog_embedding.
+ # It can be longer if dialog_embedding is computed
+ # for a subbatch of obs.utterance_history.
+ # It can be shorter if obs.utterance_history is a subbatch of
+ # the batch that dialog_embedding was computed for.
+ # Here, we make sure that mask and dialog_embedding
+ # have equal length along dimension 1.
+ mask = mask[:, :dialog_embedding.shape[1]]
+ dialog_embedding = dialog_embedding[:, :mask.shape[1]]
+
+ keys = self.memory2key(memory)
+ pre_softmax = (keys[:, None, :] * dialog_embedding).sum(2) + 1000 * mask
+ attention = F.softmax(pre_softmax, dim=1)
+ dialog_embedding = (dialog_embedding * attention[:, :, None]).sum(1)
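+ # Note: adding 1000 * mask boosts the scores of real (non-padding) tokens so that,
+ # after the softmax, padding positions receive a negligible attention weight.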
+
+ x = torch.transpose(torch.transpose(obs.image, 1, 3), 2, 3)
+
+ if 'pixel' in self.arch:
+ x /= 256.0
+ x = self.image_conv(x)
+ if self.use_dialogue:
+ for controller in self.controllers:
+ out = controller(x, dialog_embedding)
+ if self.res:
+ out += x
+ x = out
+ x = F.relu(self.film_pool(x))
+ x = x.reshape(x.shape[0], -1)
+
+ if self.use_memory:
+ hidden = (memory[:, :self.semi_memory_size], memory[:, self.semi_memory_size:])
+ hidden = self.memory_rnn(x, hidden)
+ embedding = hidden[0]
+ memory = torch.cat(hidden, dim=1)
+ else:
+ embedding = x
+
+ if hasattr(self, 'aux_info') and self.aux_info:
+ extra_predictions = {info: self.extra_heads[info](embedding) for info in self.extra_heads}
+ else:
+ extra_predictions = dict()
+
+ # x = self.actor(embedding)
+ # dist = Categorical(logits=F.log_softmax(x, dim=1))
+ x = self.actor(embedding)
+ primitive_actions_dist = Categorical(logits=F.log_softmax(x, dim=1))
+
+ x = self.critic(embedding)
+ value = x.squeeze(1)
+ utterance_actions_dists = [
+ Categorical(logits=F.log_softmax(
+ tal(embedding),
+ dim=1,
+ )) for tal in self.talker
+ ]
+
+ dist = [primitive_actions_dist] + utterance_actions_dists
+ #return {'dist': dist, 'value': value, 'memory': memory, 'extra_predictions': extra_predictions}
+ return dist, value, memory
+
+ def _get_dialog_embedding(self, dialog):
+ lengths = (dialog != 0).sum(1).long()
+ if self.lang_model == 'gru':
+ out, _ = self.dialog_rnn(self.word_embedding(dialog))
+ hidden = out[range(len(lengths)), lengths-1, :]
+ return hidden
+
+ elif self.lang_model in ['bigru', 'attgru']:
+ masks = (dialog != 0).float()
+
+ if lengths.shape[0] > 1:
+ seq_lengths, perm_idx = lengths.sort(0, descending=True)
+ iperm_idx = torch.LongTensor(perm_idx.shape).fill_(0)
+ if dialog.is_cuda: iperm_idx = iperm_idx.cuda()
+ for i, v in enumerate(perm_idx):
+ iperm_idx[v.data] = i
+
+ inputs = self.word_embedding(dialog)
+ inputs = inputs[perm_idx]
+
+ inputs = pack_padded_sequence(inputs, seq_lengths.data.cpu().numpy(), batch_first=True)
+
+ outputs, final_states = self.dialog_rnn(inputs)
+ else:
+ dialog = dialog[:, 0:lengths[0]]
+ outputs, final_states = self.dialog_rnn(self.word_embedding(dialog))
+ iperm_idx = None
+ final_states = final_states.transpose(0, 1).contiguous()
+ final_states = final_states.view(final_states.shape[0], -1)
+ if iperm_idx is not None:
+ outputs, _ = pad_packed_sequence(outputs, batch_first=True)
+ outputs = outputs[iperm_idx]
+ final_states = final_states[iperm_idx]
+
+ return outputs if self.lang_model == 'attgru' else final_states
+
+ else:
+ ValueError("Undefined dialoguction architecture: {}".format(self.use_dialogue))
+
+ # add action sampling to fit our interaction pipeline
+ ## BabyAI wraps the distributions in a nested list: [[Categorical(logits: [16, 8]), Categorical(logits: [16, 2]), Categorical(logits: [16, 2])]]
+ ## our multi-headed AC returns a flat list: [Categorical(logits: [16, 8]), Categorical(logits: [16, 2]), Categorical(logits: [16, 2])]
+ def sample_action(self, dist):
+ return torch.stack([d.sample() for d in dist], dim=1)
+
+ # add construct final action to fit our interaction pipeline
+ def construct_final_action(self, action):
+ act_mask = action[:, 0] != self.n_primitive_actions - 1
+
+ nan_mask = np.array([
+ np.array([1, np.nan, np.nan]) if t else np.array([np.nan, 1, 1]) for t in act_mask
+ ])
+
+ action = nan_mask*action
+
+ return action
+
+ # add calculate log probs to fit our interaction pipeline
+ def calculate_log_probs(self, dist, action):
+ return torch.stack([d.log_prob(action[:, i]) for i, d in enumerate(dist)], dim=1)
+
+ # add calculate action masks to fit our interaction pipeline
+ def calculate_action_masks(self, action):
+ talk_mask = action[:, 0] == self.talk_action
+ mask = torch.stack(
+ (torch.ones_like(talk_mask), talk_mask, talk_mask),
+ dim=1).detach()
+ assert action.shape == mask.shape
+
+ return mask
+ # def calculate_action_masks(self, action):
+ # mask = torch.ones_like(action)
+ # assert action.shape == mask.shape
+ # return mask
+
+ def get_config_dict(self):
+ del self.config['__class__']
+ self.config['self'] = str(self.config['self'])
+ self.config['action_space'] = self.config['action_space'].nvec.tolist()
+ return self.config
diff --git a/models/multimodalbabyai11.py b/models/multimodalbabyai11.py
new file mode 100644
index 0000000000000000000000000000000000000000..96afdf943b65428633ea91949fc0e04b892a0894
--- /dev/null
+++ b/models/multimodalbabyai11.py
@@ -0,0 +1,471 @@
+import numpy as np
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.autograd import Variable
+from torch.distributions.categorical import Categorical
+from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
+
+import torch_ac
+
+from utils.babyai_utils.supervised_losses import required_heads
+import gym.spaces as spaces
+
+
+
+
+def safe_relu(x):
+ return torch.maximum(x, torch.zeros_like(x))
+
+# From https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/blob/master/model.py
+def initialize_parameters(m):
+ classname = m.__class__.__name__
+ if classname.find('Linear') != -1:
+ m.weight.data.normal_(0, 1)
+ m.weight.data *= 1 / torch.sqrt(m.weight.data.pow(2).sum(1, keepdim=True))
+ if m.bias is not None:
+ m.bias.data.fill_(0)
+
+
+# Inspired by FiLMedBlock from https://arxiv.org/abs/1709.07871
+class FiLM(nn.Module):
+ def __init__(self, in_features, out_features, in_channels, imm_channels):
+ super().__init__()
+ self.conv1 = nn.Conv2d(
+ in_channels=in_channels, out_channels=imm_channels,
+ kernel_size=(3, 3), padding=1)
+ self.bn1 = nn.BatchNorm2d(imm_channels)
+ self.conv2 = nn.Conv2d(
+ in_channels=imm_channels, out_channels=out_features,
+ kernel_size=(3, 3), padding=1)
+ self.bn2 = nn.BatchNorm2d(out_features)
+
+ self.weight = nn.Linear(in_features, out_features)
+ self.bias = nn.Linear(in_features, out_features)
+
+ self.apply(initialize_parameters)
+
+ def forward(self, x, y):
+ x = F.relu(self.bn1(self.conv1(x)))
+ x = self.conv2(x)
+ weight = self.weight(y).unsqueeze(2).unsqueeze(3)
+ bias = self.bias(y).unsqueeze(2).unsqueeze(3)
+ out = x * weight + bias
+
+ # return F.relu(self.bn2(out)) # this causes an error in the new version of pytorch -> replaced by safe_relu
+ return safe_relu(self.bn2(out))
+
+class ImageBOWEmbedding(nn.Module):
+ def __init__(self, space, embedding_dim):
+ super().__init__()
+ # self.max_value = max(space)
+ self.max_value = 255 # 255, because the "no_point" marker is encoded as 255
+ self.space = space
+ self.embedding_dim = embedding_dim
+ self.embedding = nn.Embedding(self.space[-1] * self.max_value, embedding_dim)
+ self.apply(initialize_parameters)
+
+ def forward(self, inputs):
+ offsets = torch.Tensor([x * self.max_value for x in range(self.space[-1])]).to(inputs.device)
+ inputs = (inputs + offsets[None, :, None, None]).long()
+ return self.embedding(inputs).sum(1).permute(0, 3, 1, 2)
+
+# Note: what the BabyAI codebase calls "instr" is what we call "text" here.
+# Adapted from BabyAI's ACModel (babyai.rl.RecurrentACModel).
+class MultiModalBaby11ACModel(nn.Module, torch_ac.RecurrentACModel):
+ def __init__(self, obs_space, action_space,
+ image_dim=128, memory_dim=128, text_dim=128, dialog_dim=128,
+ use_text=False, use_dialogue=False, use_current_dialogue_only=False, lang_model="gru", use_memory=False,
+ arch="bow_endpool_res", aux_info=None, num_films=2):
+ super().__init__()
+
+ # store config
+ self.config = locals()
+
+ # multi dim
+ if action_space.shape == ():
+ raise ValueError("The action space is not multi modal. Use ACModel instead.")
+
+ if use_text: # for now we do not consider goal conditioned policies
+ raise ValueError("You should not use text but dialogue. --text is cheating.")
+
+ endpool = 'endpool' in arch
+ use_bow = 'bow' in arch
+ pixel = 'pixel' in arch
+ self.res = 'res' in arch
+
+ # Decide which components are enabled
+ self.use_text = use_text
+ self.use_dialogue = use_dialogue
+ self.use_current_dialogue_only = use_current_dialogue_only
+ self.use_memory = use_memory
+ self.arch = arch
+ self.lang_model = lang_model
+ self.aux_info = aux_info
+ if self.res and image_dim != 128:
+ raise ValueError(f"image_dim is {image_dim}, expected 128")
+ self.image_dim = image_dim
+ self.memory_dim = memory_dim
+ self.text_dim = text_dim
+ self.dialog_dim = dialog_dim
+
+ self.num_module = num_films
+ self.n_primitive_actions = action_space.nvec[0] + 1 # +1 for the extra "no move" (move switch) action
+ self.move_switch_action = int(self.n_primitive_actions) - 1
+
+ self.n_utterance_actions = np.concatenate(([2], action_space.nvec[1:])) # prepend a binary speak / don't speak switch head
+ self.talk_switch_subhead = 0
+
+ self.env_action_space = action_space
+ self.model_raw_action_space = spaces.MultiDiscrete([self.n_primitive_actions, *self.n_utterance_actions])
+
+ self.obs_space = obs_space
+
+ # transform given 3d obs_space into what babyai11 baseline uses, i.e. 1d embedding size
+ n = obs_space["image"][0]
+ m = obs_space["image"][1]
+ nb_img_channels = self.obs_space['image'][2]
+ self.obs_space = ((n-1)//2-2)*((m-1)//2-2)*64
+
+ for part in self.arch.split('_'):
+ if part not in ['original', 'bow', 'pixels', 'endpool', 'res']:
+ raise ValueError("Incorrect architecture name: {}".format(self.arch))
+
+ # if not self.use_text:
+ # raise ValueError("FiLM architecture can only be used when instructions are enabled")
+ self.image_conv = nn.Sequential(*[
+ *([ImageBOWEmbedding(obs_space['image'], 128)] if use_bow else []),
+ *([nn.Conv2d(
+ in_channels=nb_img_channels, out_channels=128, kernel_size=(8, 8),
+ stride=8, padding=0)] if pixel else []),
+ nn.Conv2d(
+ in_channels=128 if use_bow or pixel else nb_img_channels, out_channels=128,
+ kernel_size=(3, 3) if endpool else (2, 2), stride=1, padding=1),
+ nn.BatchNorm2d(128),
+ nn.ReLU(),
+ *([] if endpool else [nn.MaxPool2d(kernel_size=(2, 2), stride=2)]),
+ nn.Conv2d(in_channels=128, out_channels=128, kernel_size=(3, 3), padding=1),
+ nn.BatchNorm2d(128),
+ nn.ReLU(),
+ *([] if endpool else [nn.MaxPool2d(kernel_size=(2, 2), stride=2)])
+ ])
+ self.film_pool = nn.MaxPool2d(kernel_size=(7, 7) if endpool else (2, 2), stride=2)
+
+ # Define DIALOGUE embedding
+ if self.use_dialogue or self.use_current_dialogue_only:
+ if self.lang_model in ['gru', 'bigru', 'attgru']:
+ #self.word_embedding = nn.Embedding(obs_space["instr"], self.dialog_dim)
+ self.word_embedding = nn.Embedding(obs_space["text"], self.dialog_dim)
+ if self.lang_model in ['gru', 'bigru', 'attgru']:
+ gru_dim = self.dialog_dim
+ if self.lang_model in ['bigru', 'attgru']:
+ gru_dim //= 2
+ self.dialog_rnn = nn.GRU(
+ self.dialog_dim, gru_dim, batch_first=True,
+ bidirectional=(self.lang_model in ['bigru', 'attgru']))
+ self.final_dialog_dim = self.dialog_dim
+ else:
+ kernel_dim = 64
+ kernel_sizes = [3, 4]
+ self.dialog_convs = nn.ModuleList([
+ nn.Conv2d(1, kernel_dim, (K, self.dialog_dim)) for K in kernel_sizes])
+ self.final_dialog_dim = kernel_dim * len(kernel_sizes)
+
+ if self.lang_model == 'attgru':
+ self.memory2key = nn.Linear(self.memory_size, self.final_dialog_dim)
+
+ self.controllers = []
+ for ni in range(self.num_module):
+ mod = FiLM(
+ in_features=self.final_dialog_dim,
+ out_features=128 if ni < self.num_module-1 else self.image_dim,
+ in_channels=128, imm_channels=128)
+ self.controllers.append(mod)
+ self.add_module('FiLM_' + str(ni), mod)
+
+ # Define memory and resize image embedding
+ self.embedding_size = self.image_dim
+ if self.use_memory:
+ self.memory_rnn = nn.LSTMCell(self.image_dim, self.memory_dim)
+ self.embedding_size = self.semi_memory_size
+
+ # Define actor's model
+ self.actor = nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, self.n_primitive_actions)
+ )
+
+ self.talker = nn.ModuleList([
+ nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, n)
+ ) for n in self.n_utterance_actions])
+
+ # Define critic's model
+ self.critic = nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, 1)
+ )
+
+ # Initialize parameters correctly
+ self.apply(initialize_parameters)
+
+ # Define head for extra info
+ if self.aux_info:
+ self.extra_heads = None
+ self.add_heads()
+
+ def add_heads(self):
+ '''
+ When using auxiliary tasks, the environment yields at each step some binary, continuous, or multiclass
+ information that the agent has to predict. This function adds extra heads to the model that output
+ those predictions, one head per piece of extra information (the head type depends on the information type).
+ '''
+ self.extra_heads = nn.ModuleDict()
+ for info in self.aux_info:
+ if required_heads[info] == 'binary':
+ self.extra_heads[info] = nn.Linear(self.embedding_size, 1)
+ elif required_heads[info].startswith('multiclass'):
+ n_classes = int(required_heads[info].split('multiclass')[-1])
+ self.extra_heads[info] = nn.Linear(self.embedding_size, n_classes)
+ elif required_heads[info].startswith('continuous'):
+ if required_heads[info].endswith('01'):
+ self.extra_heads[info] = nn.Sequential(nn.Linear(self.embedding_size, 1), nn.Sigmoid())
+ else:
+ raise ValueError('Only continuous01 is implemented')
+ else:
+ raise ValueError('Type not supported')
+ # initializing these parameters independently is done in order to have consistency of results when using
+ # supervised-loss-coef = 0 and when not using any extra binary information
+ self.extra_heads[info].apply(initialize_parameters)
+
+ def add_extra_heads_if_necessary(self, aux_info):
+ '''
+ This function makes it possible to take a pre-trained model that was saved without aux_info,
+ add aux_info heads to it, and still fine-tune it.
+ '''
+ try:
+ if not hasattr(self, 'aux_info') or not set(self.aux_info) == set(aux_info):
+ self.aux_info = aux_info
+ self.add_heads()
+ except Exception:
+ raise ValueError('Could not add extra heads')
+
+ @property
+ def memory_size(self):
+ return 2 * self.semi_memory_size
+
+ @property
+ def semi_memory_size(self):
+ return self.memory_dim
+
+ def forward(self, obs, memory, dialog_embedding=None, return_embeddings=False):
+ if self.use_dialogue and dialog_embedding is None:
+ if not hasattr(obs, "utterance_history"):
+ raise ValueError("The environment need's to be updated to 'utterance' and 'utterance_history' keys'")
+
+ dialog_embedding = self._get_dialog_embedding(obs.utterance_history)
+
+ elif self.use_current_dialogue_only and dialog_embedding is None:
+ if not hasattr(obs, "utterance"):
+ raise ValueError("The environment need's to be updated to 'utterance' and 'utterance_history' keys'")
+
+ dialog_embedding = self._get_dialog_embedding(obs.utterance)
+
+ if (self.use_dialogue or self.use_current_dialogue_only) and self.lang_model == "attgru":
+ # outputs: B x L x D
+ # memory: B x M
+ #mask = (obs.instr != 0).float()
+ mask = (obs.utterance_history != 0).float()
+ # The mask tensor has the same length as obs.utterance_history, and
+ # thus can be both shorter and longer than dialog_embedding.
+ # It can be longer if dialog_embedding is computed
+ # for a subbatch of obs.utterance_history.
+ # It can be shorter if obs.utterance_history is a subbatch of
+ # the batch that dialog_embedding was computed for.
+ # Here, we make sure that mask and dialog_embedding
+ # have equal length along dimension 1.
+ mask = mask[:, :dialog_embedding.shape[1]]
+ dialog_embedding = dialog_embedding[:, :mask.shape[1]]
+
+ keys = self.memory2key(memory)
+ pre_softmax = (keys[:, None, :] * dialog_embedding).sum(2) + 1000 * mask
+ attention = F.softmax(pre_softmax, dim=1)
+ dialog_embedding = (dialog_embedding * attention[:, :, None]).sum(1)
+
+ x = torch.transpose(torch.transpose(obs.image, 1, 3), 2, 3)
+
+ if 'pixel' in self.arch:
+ x /= 256.0
+ x = self.image_conv(x)
+ if (self.use_dialogue or self.use_current_dialogue_only):
+ for controller in self.controllers:
+ out = controller(x, dialog_embedding)
+ if self.res:
+ out += x
+ x = out
+ x = F.relu(self.film_pool(x))
+ x = x.reshape(x.shape[0], -1)
+
+ if self.use_memory:
+ hidden = (memory[:, :self.semi_memory_size], memory[:, self.semi_memory_size:])
+ hidden = self.memory_rnn(x, hidden)
+ embedding = hidden[0]
+ memory = torch.cat(hidden, dim=1)
+ else:
+ embedding = x
+
+ if hasattr(self, 'aux_info') and self.aux_info:
+ extra_predictions = {info: self.extra_heads[info](embedding) for info in self.extra_heads}
+ else:
+ extra_predictions = dict()
+
+ # x = self.actor(embedding)
+ # dist = Categorical(logits=F.log_softmax(x, dim=1))
+ x = self.actor(embedding)
+ primitive_actions_dist = Categorical(logits=F.log_softmax(x, dim=1))
+
+ x = self.critic(embedding)
+ value = x.squeeze(1)
+ utterance_actions_dists = [
+ Categorical(logits=F.log_softmax(
+ tal(embedding),
+ dim=1,
+ )) for tal in self.talker
+ ]
+
+ dist = [primitive_actions_dist] + utterance_actions_dists
+ #return {'dist': dist, 'value': value, 'memory': memory, 'extra_predictions': extra_predictions}
+
+ if return_embeddings:
+ return dist, value, memory, embedding
+ else:
+ return dist, value, memory
+
+ def _get_dialog_embedding(self, dialog):
+ lengths = (dialog != 0).sum(1).long()
+ if self.lang_model == 'gru':
+ out, _ = self.dialog_rnn(self.word_embedding(dialog))
+ hidden = out[range(len(lengths)), lengths-1, :]
+ return hidden
+
+ elif self.lang_model in ['bigru', 'attgru']:
+ masks = (dialog != 0).float()
+
+ if lengths.shape[0] > 1:
+ seq_lengths, perm_idx = lengths.sort(0, descending=True)
+ iperm_idx = torch.LongTensor(perm_idx.shape).fill_(0)
+ if dialog.is_cuda: iperm_idx = iperm_idx.cuda()
+ for i, v in enumerate(perm_idx):
+ iperm_idx[v.data] = i
+
+ inputs = self.word_embedding(dialog)
+ inputs = inputs[perm_idx]
+
+ inputs = pack_padded_sequence(inputs, seq_lengths.data.cpu().numpy(), batch_first=True)
+
+ outputs, final_states = self.dialog_rnn(inputs)
+ else:
+ dialog = dialog[:, 0:lengths[0]]
+ outputs, final_states = self.dialog_rnn(self.word_embedding(dialog))
+ iperm_idx = None
+ final_states = final_states.transpose(0, 1).contiguous()
+ final_states = final_states.view(final_states.shape[0], -1)
+ if iperm_idx is not None:
+ outputs, _ = pad_packed_sequence(outputs, batch_first=True)
+ outputs = outputs[iperm_idx]
+ final_states = final_states[iperm_idx]
+
+ return outputs if self.lang_model == 'attgru' else final_states
+
+ else:
+ ValueError("Undefined lang_model architecture: {}".format(self.lang_model))
+
+ # add action sampling to fit our interaction pipeline
+ ## BabyAI wraps the distributions in a nested list: [[Categorical(logits: [16, 8]), Categorical(logits: [16, 2]), Categorical(logits: [16, 2])]]
+ ## our multi-headed AC returns a flat list: [Categorical(logits: [16, 8]), Categorical(logits: [16, 2]), Categorical(logits: [16, 2])]
+
+ def det_action(self, dist):
+ return torch.stack([d.probs.argmax(dim=-1) for d in dist], dim=1)
+
+ def sample_action(self, dist):
+ return torch.stack([d.sample() for d in dist], dim=1)
+
+
+ def is_raw_action_speaking(self, action):
+ is_speaking = action[:, 1:][:, self.talk_switch_subhead] == 1 # talking heads are [1:]
+ return is_speaking
+
+ def no_speak_to_speak_action(self, action):
+ action[:, 1] = 1 # set speaking action to speak (1)
+
+ assert all(self.is_raw_action_speaking(action))
+
+ return action
+
+ def raw_action_to_act_speak_mask(self, action):
+ """
+ Defines how the final action to be sent to the environment is computed
+ Does NOT define how gradients are propagated, see calculate_action_gradient_masks() for that
+ """
+
+ assert action.shape[-1] == 4
+ assert self.model_raw_action_space.shape[0] == action.shape[-1]
+
+ act_mask = action[:, 0] != self.move_switch_action # acting head is [0]
+ # speak_mask = action[:, 1:][:, self.talk_switch_subhead] == 1 # talking heads are [1:]
+ speak_mask = self.is_raw_action_speaking(action)
+ return act_mask, speak_mask
+
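+ # Illustrative example of construct_final_action (hypothetical values, assuming the raw
+ # action is [move, speak_switch, template, word] with n_primitive_actions = 7, so the
+ # "no move" switch has index 6, and speak_switch = 1 means "speak"):
+ #   raw action [2, 0, 1, 3] -> [ 2., nan, nan]  (agent moves, does not speak)
+ #   raw action [6, 1, 1, 3] -> [nan,  1.,  3.]  (agent speaks template 1, word 3, does not move)
+ #   raw action [2, 1, 1, 3] -> [ 2.,  1.,  3.]  (agent both moves and speaks)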
+ def construct_final_action(self, action):
+ act_mask, speak_mask = self.raw_action_to_act_speak_mask(action)
+
+ nan_mask = np.stack((act_mask, speak_mask, speak_mask), axis=1).astype(float)
+ nan_mask[nan_mask == 0] = np.nan
+
+ assert self.talk_switch_subhead == 0
+ final_action = action[:, [True, False, True, True]] # we drop the talk_switch_subhead
+ final_action = nan_mask*final_action
+
+ assert self.env_action_space.shape[0] == final_action.shape[-1]
+
+ return final_action
+
+ # add calculate log probs to fit our interaction pipeline
+ def calculate_log_probs(self, dist, action):
+ return torch.stack([d.log_prob(action[:, i]) for i, d in enumerate(dist)], dim=1)
+
+ # add calculate action masks to fit our interaction pipeline
+ def calculate_action_gradient_masks(self, action):
+ """
+ Defines how the gradients are propagated.
+ Moving head is always trained.
+ Speak switch is always trained.
+ Grammar heads are trained only when speak switch is ON
+ """
+ _, speak_mask = self.raw_action_to_act_speak_mask(action)
+
+ mask = torch.stack(
+ (
+ torch.ones_like(speak_mask), # always train
+ torch.ones_like(speak_mask), # always train
+ speak_mask, # train only when speaking
+ speak_mask, # train only when speaking
+ ), dim=1).detach()
+ assert action.shape == mask.shape
+
+ return mask
+
+ def get_config_dict(self):
+ del self.config['__class__']
+ self.config['self'] = str(self.config['self'])
+ self.config['action_space'] = self.config['action_space'].nvec.tolist()
+ return self.config
diff --git a/models/randtalkmultiheadedac.py b/models/randtalkmultiheadedac.py
new file mode 100644
index 0000000000000000000000000000000000000000..f23a07583b6ef27f38ef6ec7f38eeada9ad9e765
--- /dev/null
+++ b/models/randtalkmultiheadedac.py
@@ -0,0 +1,169 @@
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.distributions.categorical import Categorical
+import torch_ac
+import gym.spaces as spaces
+
+
+from utils.other import init_params
+
+
+
+
+class RandomTalkingMultiHeadedACModel(nn.Module, torch_ac.RecurrentACModel):
+ def __init__(self, obs_space, action_space, use_memory=False, use_text=False, use_dialogue=False):
+ super().__init__()
+
+ # Decide which components are enabled
+ self.use_text = use_text
+ self.use_dialogue = use_dialogue
+ self.use_memory = use_memory
+
+ # multi dim
+ if action_space.shape == ():
+ raise ValueError("The action space is not multi modal. Use ACModel instead.")
+
+ self.n_primitive_actions = action_space.nvec[0] + 1
+ self.talk_action = int(self.n_primitive_actions) - 1
+ self.n_utterance_actions = action_space.nvec[1:]
+ self.env_action_space = action_space
+ self.model_raw_action_space = spaces.MultiDiscrete([self.n_primitive_actions, *self.n_utterance_actions])
+
+ # Define image embedding
+ self.image_conv = nn.Sequential(
+ nn.Conv2d(3, 16, (2, 2)),
+ nn.ReLU(),
+ nn.MaxPool2d((2, 2)),
+ nn.Conv2d(16, 32, (2, 2)),
+ nn.ReLU(),
+ nn.Conv2d(32, 64, (2, 2)),
+ nn.ReLU()
+ )
+ n = obs_space["image"][0]
+ m = obs_space["image"][1]
+ self.image_embedding_size = ((n-1)//2-2)*((m-1)//2-2)*64
+
+ # Define memory
+ if self.use_memory:
+ self.memory_rnn = nn.LSTMCell(self.image_embedding_size, self.semi_memory_size)
+
+ if self.use_text or self.use_dialogue:
+ self.word_embedding_size = 32
+ self.word_embedding = nn.Embedding(obs_space["text"], self.word_embedding_size)
+
+ # Define text embedding
+ if self.use_text:
+ self.text_embedding_size = 128
+ self.text_rnn = nn.GRU(self.word_embedding_size, self.text_embedding_size, batch_first=True)
+
+ # Define dialogue embedding
+ if self.use_dialogue:
+ self.dialogue_embedding_size = 128
+ self.dialogue_rnn = nn.GRU(self.word_embedding_size, self.dialogue_embedding_size, batch_first=True)
+
+ # Resize image embedding
+ self.embedding_size = self.semi_memory_size
+
+ if self.use_text:
+ self.embedding_size += self.text_embedding_size
+
+ if self.use_dialogue:
+ self.embedding_size += self.dialogue_embedding_size
+
+ # Define actor's model
+ self.actor = nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, self.n_primitive_actions)
+ )
+
+ # Define critic's model
+ self.critic = nn.Sequential(
+ nn.Linear(self.embedding_size, 64),
+ nn.Tanh(),
+ nn.Linear(64, 1)
+ )
+
+
+ # Initialize parameters correctly
+ self.apply(init_params)
+
+ @property
+ def memory_size(self):
+ return 2*self.semi_memory_size
+
+ @property
+ def semi_memory_size(self):
+ return self.image_embedding_size
+
+ def forward(self, obs, memory):
+ x = obs.image.transpose(1, 3).transpose(2, 3)
+ x = self.image_conv(x)
+
+ batch_size = x.shape[0]
+ x = x.reshape(batch_size, -1)
+
+ if self.use_memory:
+ hidden = (memory[:, :self.semi_memory_size], memory[:, self.semi_memory_size:])
+ hidden = self.memory_rnn(x, hidden)
+ embedding = hidden[0]
+ memory = torch.cat(hidden, dim=1)
+ else:
+ embedding = x
+
+ if self.use_text:
+ embed_text = self._get_embed_text(obs.text)
+ embedding = torch.cat((embedding, embed_text), dim=1)
+
+ if self.use_dialogue:
+ embed_dial = self._get_embed_dialogue(obs.dialogue)
+ embedding = torch.cat((embedding, embed_dial), dim=1)
+
+ x = self.actor(embedding)
+ primitive_actions_dist = Categorical(logits=F.log_softmax(x, dim=1))
+
+ x = self.critic(embedding)
+ value = x.squeeze(1)
+
+ # construct utterance action distributions; for this model they are random (uniform logits, not trained)
+ utterance_actions_dists = [Categorical(logits=torch.ones((batch_size, n), requires_grad=False)) for n in self.n_utterance_actions]
+
+ dist = [primitive_actions_dist] + utterance_actions_dists
+
+ return dist, value, memory
+
+ def sample_action(self, dist):
+ return torch.stack([d.sample() for d in dist], dim=1)
+
+ def calculate_log_probs(self, dist, action):
+ return torch.stack([d.log_prob(action[:, i]) for i, d in enumerate(dist)], dim=1)
+
+ def calculate_action_masks(self, action):
+ talk_mask = action[:, 0] == self.talk_action
+ mask = torch.stack(
+ (torch.ones_like(talk_mask), talk_mask, talk_mask),
+ dim=1).detach()
+
+ assert action.shape == mask.shape
+
+ return mask
+
+ def construct_final_action(self, action):
+ act_mask = action[:, 0] != self.n_primitive_actions - 1
+
+ nan_mask = np.array([
+ np.array([1, np.nan, np.nan]) if t else np.array([np.nan, 1, 1]) for t in act_mask
+ ])
+
+ action = nan_mask*action
+
+ return action
+
+ def _get_embed_text(self, text):
+ _, hidden = self.text_rnn(self.word_embedding(text))
+ return hidden[-1]
+
+ def _get_embed_dialogue(self, dial):
+ _, hidden = self.dialogue_rnn(self.word_embedding(dial))
+ return hidden[-1]
\ No newline at end of file
diff --git a/models/refac.py b/models/refac.py
new file mode 100644
index 0000000000000000000000000000000000000000..26f285ac8f5c115bcca085f7d258a3ebcab06f22
--- /dev/null
+++ b/models/refac.py
@@ -0,0 +1,141 @@
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.distributions.categorical import Categorical
+import torch_ac
+from utils.other import init_params
+
+class RefACModel(nn.Module, torch_ac.RecurrentACModel):
+ def __init__(self, obs_space, action_space, use_memory=False, use_text=False, use_dialogue=False, input_size=3):
+ super().__init__()
+
+ # store config
+ self.config = locals()
+
+ # Decide which components are enabled
+ self.use_text = use_text
+ self.use_memory = use_memory
+ self.env_action_space = action_space
+ self.model_raw_action_space = action_space
+ self.input_size = input_size
+
+ if use_dialogue:
+ raise NotImplementedError("This model does not support dialogue inputs yet")
+
+ # Define image embedding
+ self.image_conv = nn.Sequential(
+ nn.Conv2d(self.input_size, 32, (3, 3), stride=2, padding=1),
+ nn.ELU(),
+ nn.Conv2d(32, 32, (3, 3), stride=2, padding=1),
+ nn.ELU(),
+ nn.Conv2d(32, 32, (3, 3), stride=2, padding=1),
+ nn.ELU()
+ )
+ n = obs_space["image"][0]
+ m = obs_space["image"][1]
+ # self.image_embedding_size = ((n-1)//2-2)*((m-1)//2-2)*64
+
+ # Define memory
+ assert self.use_memory
+ if self.use_memory:
+ assert self.semi_memory_size == 256
+ # image gets flattened by 3 consecutive convolutions
+ self.memory_rnn = nn.LSTMCell(32, self.semi_memory_size)
+
+ # Define text embedding
+ assert not self.use_text
+ if self.use_text:
+ self.word_embedding_size = 32
+ self.word_embedding = nn.Embedding(obs_space["text"], self.word_embedding_size)
+ self.text_embedding_size = 128
+ self.text_rnn = nn.GRU(self.word_embedding_size, self.text_embedding_size, batch_first=True)
+
+ # Resize image embedding
+ self.embedding_size = self.semi_memory_size
+ if self.use_text:
+ self.embedding_size += self.text_embedding_size
+
+ # Define actor's model
+ self.actor = nn.Sequential(nn.Linear(self.embedding_size, action_space.nvec[0]))
+
+ # Define critic's model
+ self.critic = nn.Sequential(nn.Linear(self.embedding_size, 1))
+
+ # Initialize parameters correctly
+ self.apply(init_params)
+
+ @property
+ def memory_size(self):
+ return 2*self.semi_memory_size
+
+ @property
+ def semi_memory_size(self):
+ return 256
+
+ def forward(self, obs, memory, return_embeddings=False):
+ x = obs.image.transpose(1, 3).transpose(2, 3)
+ x = self.image_conv(x)
+ x = x.reshape(x.shape[0], -1)
+
+ if self.use_memory:
+ hidden = (memory[:, :self.semi_memory_size], memory[:, self.semi_memory_size:])
+ hidden = self.memory_rnn(x, hidden)
+ embedding = hidden[0]
+ memory = torch.cat(hidden, dim=1)
+ else:
+ embedding = x
+
+ if self.use_text:
+ embed_text = self._get_embed_text(obs.text)
+ embedding = torch.cat((embedding, embed_text), dim=1)
+
+ x = self.actor(embedding)
+ dist = Categorical(logits=F.log_softmax(x, dim=1))
+
+ x = self.critic(embedding)
+ value = x.squeeze(1)
+
+ if return_embeddings:
+ return [dist], value, memory, None
+ else:
+ return [dist], value, memory
+
+ # def sample_action(self, dist):
+ # return dist.sample()
+ #
+ # def calculate_log_probs(self, dist, action):
+ # return dist.log_prob(action)
+
+ def calculate_action_gradient_masks(self, action):
+ """Always train"""
+ mask = torch.ones_like(action).detach()
+ assert action.shape == mask.shape
+
+ return mask
+
+ def sample_action(self, dist):
+ return torch.stack([d.sample() for d in dist], dim=1)
+
+ def calculate_log_probs(self, dist, action):
+ return torch.stack([d.log_prob(action[:, i]) for i, d in enumerate(dist)], dim=1)
+
+ def calculate_action_masks(self, action):
+ mask = torch.ones_like(action)
+ assert action.shape == mask.shape
+
+ return mask
+
+ def construct_final_action(self, action):
+ return action
+
+ def _get_embed_text(self, text):
+ _, hidden = self.text_rnn(self.word_embedding(text))
+ return hidden[-1]
+
+ def get_config_dict(self):
+ del self.config['__class__']
+ self.config['self'] = str(self.config['self'])
+ self.config['action_space'] = self.config['action_space'].nvec.tolist()
+ return self.config
+
diff --git a/n_tokens.py b/n_tokens.py
new file mode 100644
index 0000000000000000000000000000000000000000..e2ea71b8a12455549f5348018252a6fd034c5b96
--- /dev/null
+++ b/n_tokens.py
@@ -0,0 +1,60 @@
+
+def estimate_price(num_of_episodes, in_context_len, tokens_per_step, n_steps, last_n, model, feed_episode_history):
+ max_context_size = 10e10
+ price_per_1k_big_context = 0
+
+ if model == "text-ada-001":
+ price_per_1k = 0.0004
+ elif model == "text-curie-001":
+ price_per_1k = 0.003
+ elif model == "text-davinci-003":
+ price_per_1k = 0.02
+ elif model == "gpt-3.5-turbo-0301":
+ price_per_1k = 0.0015
+ max_context_size = 4000
+ price_per_1k_big_context = 0.003
+ elif model == "gpt-3.5-turbo-instruct-0914":
+ price_per_1k = 0.0015
+ max_context_size = 4000
+ price_per_1k_big_context = 0.003
+ elif model == "gpt-3.5-turbo-0613":
+ price_per_1k = 0.0015
+ max_context_size = 4000
+ price_per_1k_big_context = 0.003
+ elif model == "gpt-4-0314":
+ price_per_1k = 0.03
+ max_context_size = 8000
+ price_per_1k_big_context = 0.06
+ elif model == "gpt-4-0613":
+ price_per_1k = 0.03
+ max_context_size = 8000
+ price_per_1k_big_context = 0.06
+
+ else:
+ print(f"Price for model {model} not found.")
+ price_per_1k = 0
+
+ # check if the required context size is bigger than the default (4k for gpt-3.5; 8k for gpt-4) and update the price accordingly
+ if (
+ feed_episode_history and in_context_len + n_steps*tokens_per_step > max_context_size
+ ) or (
+ not feed_episode_history and in_context_len + last_n*tokens_per_step > max_context_size
+ ):
+ # context is bigger, update the price
+ assert "gpt-4" in model or "gpt-3.5" in model
+ price_per_1k = price_per_1k_big_context
+
+ if feed_episode_history:
+ total_tokens = num_of_episodes*(in_context_len + tokens_per_step*sum(range(n_steps)))
+
+ else:
+ total_tokens = num_of_episodes*n_steps*(in_context_len + last_n*tokens_per_step)
+
+ price = (total_tokens/1000)*price_per_1k
+ return total_tokens, price
+
+
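+# Worked example (hypothetical values, matching the call below): 100 episodes of 50 steps,
+# a 1000-token in-context prompt, 60 tokens per step, and the last 3 steps fed back each turn:
+#   total_tokens = 100 * 50 * (1000 + 3*60) = 5,900,000
+#   price        = (5,900,000 / 1000) * 0.0015 = $8.85  (gpt-3.5 rate, context under 4k)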
+if __name__ == "__main__":
+ total_tokens, price = estimate_price()
+ print("tokens:", total_tokens)
+ print("price:", price)
\ No newline at end of file
diff --git a/param_tree_demo.py b/param_tree_demo.py
new file mode 100644
index 0000000000000000000000000000000000000000..a352f6faf4e28f74eff6103c1ad7726d8135f2be
--- /dev/null
+++ b/param_tree_demo.py
@@ -0,0 +1,234 @@
+import streamlit as st
+import copy
+import streamlit.components.v1 as components
+import streamlit.caching as caching
+import time
+import argparse
+import numpy as np
+import gym
+import gym_minigrid
+from gym_minigrid.wrappers import *
+from gym_minigrid.window import Window
+import matplotlib.pyplot as plt
+from gym_minigrid.social_ai_envs.socialaigrammar import SocialAIGrammar, SocialAIActions, SocialAIActionSpace
+
+default_params = {
+ "Pointing": 0,
+ "Emulation": 1,
+ "Language_grounding": 2,
+ "Pragmatic_frame_complexity": 1,
+}
+
+class InteractiveACL:
+
+ def choose(self, node, chosen_parameters):
+
+ options = [n.label for n in node.children]
+
+ box_name = f'{node.label} ({node.id})'
+ ret = st.sidebar.selectbox(
+ box_name,
+ options,
+ index=default_params.get(node.label, 0)
+ )
+
+ for ind, (c, c_lab) in enumerate(zip(node.children, options)):
+ if c_lab == ret:
+ return c
+
+ def get_info(self):
+ return {}
+
+@st.cache(allow_output_mutation=True, suppress_st_warning=True)
+def load_env():
+ env = gym.make("SocialAI-SocialAIParamEnv-v1")
+ env.curriculum=InteractiveACL()
+
+ return env
+
+
+
+
+st.title("SocialAI interactive demo")
+
+
+env = load_env()
+
+st.subheader("Primitive actions")
+
+# moving buttons
+columns = st.columns([1]*(len(SocialAIActions)+1))
+action_names = [a.name for a in list(SocialAIActions)] + ["no_op"]
+# keys = ["Left arrow", "Right arrow", "Up arrow", "t", "q", "Shift"]
+keys = ["a", "d", "w", "t", "q", "Shift"]
+
+# actions = [st.button(a.name) for a in list(SocialAIActions)] + [st.button("none")]
+actions = []
+for a_name, col, key in zip(action_names, columns, keys):
+ with col:
+ actions.append(st.button(a_name+f" ({key})", help=f"Shortcut: {key}"))
+
+
+st.subheader("Speaking actions")
+# talking buttons
+t, w, b = st.columns([1, 1, 1])
+
+changes = [False, False]
+
+with t:
+ templ = st.selectbox("Template", options=SocialAIGrammar.templates, index=1)
+with w:
+ word = st.selectbox("Word", options=SocialAIGrammar.things, index=0)
+
+speak = st.button("Speak (s)", help="Shortcut s")
+
+# utterance change detection
+utt_changed = False
+
+if "template" in st.session_state:
+ utt_changed = st.session_state.template != templ
+
+if "word" in st.session_state:
+ utt_changed = utt_changed or st.session_state.word != word
+
+st.session_state["template"] = templ
+st.session_state["word"] = word
+
+st.sidebar.subheader("Select the parameters:")
+
+play = st.button("Play (Enter)", help="Generate the env. Shortcut: Enter")
+
+components.html(
+ """
+
+""",
+ height=0,
+ width=0,
+)
+
+# no action
+done_ind = len(actions) - 2
+actions[done_ind] = False
+
+# was agent controlled
+no_action = not any(actions) and not speak
+
+done = False
+info = None
+
+if not no_action or play or utt_changed:
+ # agent is controlled
+ if any(actions):
+ p_act = np.argmax(actions)
+ if p_act == len(actions) - 1:
+ p_act = np.nan
+
+ action = [p_act, np.nan, np.nan]
+
+ elif speak:
+ templ_ind = SocialAIGrammar.templates.index(templ)
+ word_ind = SocialAIGrammar.things.index(word)
+ action = [np.nan, templ_ind, word_ind]
+
+ else:
+ action = None
+
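+# The action is a 3-slot vector [primitive_action, template_index, word_index];
+# np.nan marks the heads that are not used this step (matching the format of the
+# NaN-masked actions produced by the models' construct_final_action).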
+ if action:
+ obs, reward, done, info = env.step(action)
+
+ env.render(mode='human')
+ st.pyplot(env.window.fig)
+
+
+# if done or no_action:
+if done or (no_action and not play and not utt_changed):
+ env.reset()
+
+else:
+ env.parameter_tree.sample_env_params(ACL=env.curriculum)
+
+
+with st.expander("Parametric tree", True):
+ # draw tree
+ current_param_labels = env.current_env.parameters if env.current_env.parameters else {}
+ folded_nodes = [
+ "Information_seeking",
+ "Collaboration",
+ "OthersPerceptionInference"
+ ]
+ # print(current_param_labels["Env_type"])
+ folded_nodes.remove(current_param_labels["Env_type"])
+ env.parameter_tree.draw_tree(
+ filename="viz/streamlit_temp_tree",
+ ignore_labels=["Num_of_colors"],
+ selected_parameters=current_param_labels,
+ folded_nodes=folded_nodes,
+ # save=False
+ )
+ # st.graphviz_chart(env.parameter_tree.tree)
+ st.image("viz/streamlit_temp_tree.png")
+
+# if not no_action or play or utt_changed:
+# # agent is controlled
+# if any(actions):
+# p_act = np.argmax(actions)
+# if p_act == len(actions) - 1:
+# p_act = np.nan
+#
+# action = [p_act, np.nan, np.nan]
+#
+# elif speak:
+# templ_ind = SocialAIGrammar.templates.index(templ)
+# word_ind = SocialAIGrammar.things.index(word)
+# action = [np.nan, templ_ind, word_ind]
+#
+# else:
+# action = None
+#
+# if action:
+# obs, reward, done, info = env.step(action)
+#
+# env.render(mode='human')
+# st.pyplot(env.window.fig)
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000000000000000000000000000000000000..fec65e0c3acd6876c7fe012dbb4b4ddfaa47c520
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,14 @@
+tensorboardX>=1.6
+numpy>=1.3
+jedi>=0.17
+ipython
+imageio==2.9.0
+matplotlib==3.3.4
+graphviz==0.19.1
+termcolor
+astar==0.93
+scikit-learn==0.24.2
+pandas==1.1.5
+transformers==4.25.1
+openai
+cchardet==2.1.7
\ No newline at end of file
diff --git a/run.txt b/run.txt
new file mode 100644
index 0000000000000000000000000000000000000000..0c574fdf42fdd53355b61d8c5a3a24f79e32ae45
--- /dev/null
+++ b/run.txt
@@ -0,0 +1,24 @@
+# squeue -u utu57ed -i 1 -o "%.18i %.9P %.130j %.8u %.2t %.10M %.6D %R"
+
+
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 1000000 --model FPS_testing_c32 --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 40 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 1000000 --model FPS_testing_c40 --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 40 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 1000000 --model FPS_testing_c40 --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+
+# 1 seed - 1 gpu
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 1000000 --model FPS_testing_c32_1s1g --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 40 --gpus_per_seed 1 --seeds_per_launch 1 --frames 1000000 --model FPS_testing_c40_1s1g --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 64 --gpus_per_seed 1 --seeds_per_launch 1 --frames 1000000 --model FPS_testing_c64_1s1g --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 1000000 --model FPS_testing_c80_1s1g --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+
+
+# testing
+#--slurm_conf jz_short_gpu_chained_a100_4h --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 2000000 --model testing_Adversarial_2M_PPO_CB_asoc --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 0 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-AsocialAppleStealingObst_NoParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name RoleReversalTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+
+
+# --slurm_conf jz_short_gpu_chained_a100 --nb_seeds 4 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model test_Feedback_CB_heldout_doors_20M --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-ELangFeedbackHeldoutDoorsTrainInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name LangFeedbackTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type lang --exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+# --slurm_conf jz_short_gpu_chained_a100 --nb_seeds 4 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model test_Color_CB_heldout_doors --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-ELangColorHeldoutDoorsTrainInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name LangColorTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type lang --exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model test_Imitation_PPO_CB --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-EEmulationNoDistrInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --test-set-name NoDistrEmulationTestSet --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 0.25 50 --exploration-bonus-tanh 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model test_Imitation_PPO_CB --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-EEmulationNoDistrInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --test-set-name NoDistrEmulationTestSet --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 0.5 50 --exploration-bonus-tanh 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model test_Imitation_PPO_CB --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-EEmulationNoDistrInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --test-set-name NoDistrEmulationTestSet --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6
diff --git a/run_NeurIPS.txt b/run_NeurIPS.txt
new file mode 100644
index 0000000000000000000000000000000000000000..911741901eae10910720e1a10e2afa3baaadbb05
--- /dev/null
+++ b/run_NeurIPS.txt
@@ -0,0 +1,106 @@
+# Experiments for NeurIPS
+# Make sure you modify campain_launcher.py to fit your cluster configuration.
+# Uncomment each line you want to run, then launch "python3 campain_launcher.py run_NeurIPS.txt" on your slurm cluster.
+#
+#
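+# A minimal parsing sketch (an assumption about how campain_launcher.py consumes this file,
+# not a description of its actual code): comment lines are skipped and every remaining line
+# is treated as one experiment's argument string, e.g.
+#
+#   import shlex
+#   with open("run_NeurIPS.txt") as f:
+#       runs = [shlex.split(l) for l in f if l.strip() and not l.lstrip().startswith("#")]
+#   # each entry in `runs` is one experiment; launcher-only flags (--slurm_conf, --nb_seeds, ...)
+#   # are assumed to become slurm directives, and the remaining flags are assumed to be
+#   # forwarded to the training script
+#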
+# NeurIPS Polite
+# PPO + explo bonus
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_TalkItOutPolite_BONUS_NoLiar -cs --algo ppo --*env MiniGrid-TalkItOutNoLiarPolite-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*ppo-hp-tuning --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 7 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_TalkItOutPolite_BONUS -cs --algo ppo --*env MiniGrid-TalkItOutPolite-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*ppo-hp-tuning --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 7 50 --*exploration-bonus-tanh 0.6
+# PPO
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_TalkItOutPolite_NO_BONUS_NoLiar -cs --algo ppo --*env MiniGrid-TalkItOutNoLiarPolite-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*ppo-hp-tuning
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_TalkItOutPolite_NO_BONUS -cs --algo ppo --*env MiniGrid-TalkItOutPolite-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*ppo-hp-tuning
+# unsocial
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_TalkItOutPolite_NoSocial_NO_BONUS_NoLiar -cs --algo ppo --*env MiniGrid-TalkItOutNoLiarPolite-8x8-v0 --*env_args hidden_npc True --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*ppo-hp-tuning
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_TalkItOutPolite_NoSocial_NO_BONUS -cs --algo ppo --*env MiniGrid-TalkItOutPolite-8x8-v0 --*env_args hidden_npc True --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*ppo-hp-tuning
+# PPO + RND
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_TalkItOutPolite_RND_NoLiar -cs --algo ppo --*env MiniGrid-TalkItOutNoLiarPolite-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --*exploration-bonus-type rnd --clipped-rewards
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_TalkItOutPolite_RND -cs --algo ppo --*env MiniGrid-TalkItOutPolite-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --*exploration-bonus-type rnd --clipped-rewards
+# PPO + RIDE
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_TalkItOutPolite_RIDE_NoLiar -cs --algo ppo --*env MiniGrid-TalkItOutNoLiarPolite-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --*exploration-bonus-type ride --clipped-rewards
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_TalkItOutPolite_RIDE -cs --algo ppo --*env MiniGrid-TalkItOutPolite-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --*exploration-bonus-type ride --clipped-rewards
+#
+#
+# NeurIPS ShowME
+# PPO
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_ShowMe_NO_BONUS_ABL --compact-save --algo ppo --*env MiniGrid-ShowMeNoSocial-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*ppo-hp-tuning
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_ShowMe_NO_BONUS --compact-save --algo ppo --*env MiniGrid-ShowMe-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*ppo-hp-tuning
+# PPO + explo bonus
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_ShowMe_BONUS_ABL_ --compact-save --algo ppo --*env MiniGrid-ShowMeNoSocial-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*ppo-hp-tuning --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 3 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_ShowMe_BONUS --compact-save --algo ppo --*env MiniGrid-ShowMe-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*ppo-hp-tuning --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 3 50 --*exploration-bonus-tanh 0.6
+# unsocial
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_ShowMe_NoSocial_NO_BONUS_ABL --compact-save --algo ppo --*env MiniGrid-ShowMeNoSocial-8x8-v0 --*env_args hidden_npc True --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*ppo-hp-tuning
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_ShowMe_NoSocial_NO_BONUS --compact-save --algo ppo --*env MiniGrid-ShowMe-8x8-v0 --*env_args hidden_npc True --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*ppo-hp-tuning
+# PPO + RND
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_ShowMe_RND_ABL_ --compact-save --algo ppo --*env MiniGrid-ShowMeNoSocial-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --*exploration-bonus-type rnd --clipped-rewards
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_ShowMe_RND --compact-save --algo ppo --*env MiniGrid-ShowMe-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --*exploration-bonus-type rnd --clipped-rewards
+# PPO + RIDE
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_ShowMe_RIDE_ABL_ --compact-save --algo ppo --*env MiniGrid-ShowMeNoSocial-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --*exploration-bonus-type ride --clipped-rewards
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_ShowMe_RIDE --compact-save --algo ppo --*env MiniGrid-ShowMe-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --*exploration-bonus-type ride --clipped-rewards
+#
+#
+# NeurIPS Help (Exiter role)
+# PPO
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_Help_NO_BONUS --compact-save --algo ppo --*env MiniGrid-Exiter-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 5000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*ppo-hp-tuning
+# PPO + explo bonus
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_Help_BONUS --compact-save --algo ppo --*env MiniGrid-Exiter-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 5000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*ppo-hp-tuning --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 3 50 --*exploration-bonus-tanh 0.6
+# unsocial
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_Help_NoSocial_NO_BONUS --compact-save --algo ppo --*env MiniGrid-Exiter-8x8-v0 --*env_args hidden_npc True --dialogue --save-interval 100 --log-interval 100 --frames 5000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*ppo-hp-tuning
+# PPO + RND
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_Help_RND --compact-save --algo ppo --*env MiniGrid-Exiter-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 5000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --*exploration-bonus-type rnd --clipped-rewards
+# PPO + RIDE
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_Help_RIDE --compact-save --algo ppo --*env MiniGrid-Exiter-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 5000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --*exploration-bonus-type ride --clipped-rewards
+#
+# DiverseExit
+# PPO
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_DiverseExit_NO_BONUS --compact-save --algo ppo --*env MiniGrid-DiverseExit-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*ppo-hp-tuning
+# PPO + explo bonus
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_DiverseExit_BONUS --compact-save --algo ppo --*env MiniGrid-DiverseExit-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*ppo-hp-tuning --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 20 50 --*exploration-bonus-tanh 0.6
+# unsocial
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_DiverseExit_NoSocial_NO_BONUS --compact-save --algo ppo --*env MiniGrid-DiverseExit-8x8-v0 --*env_args hidden_npc True --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*ppo-hp-tuning
+# PPO + RND
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_DiverseExit_RND --compact-save --algo ppo --*env MiniGrid-DiverseExit-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --*exploration-bonus-type rnd --clipped-rewards
+# PPO + RIDE
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model NeurIPS_DiverseExit_RIDE --compact-save --algo ppo --*env MiniGrid-DiverseExit-8x8-v0 --dialogue --save-interval 100 --log-interval 100 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --*exploration-bonus-type ride --clipped-rewards
+#
+#
+# NeurIPS CoinThief
+# PPO
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model coinThief --algo ppo -cs --env MiniGrid-CoinThief-8x8-v0 --frames 30000000 --dialogue --save-interval 100 --log-interval 100 --multi-modal-babyai11-agent --arch original_endpool_res --ppo-hp-tuning --*env_args few_actions True npc_view_size 5 npc_look_around True
+# PPO + explo bonus
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model coinThief --algo ppo -cs --env MiniGrid-CoinThief-8x8-v0 --frames 30000000 --dialogue --save-interval 100 --log-interval 100 --multi-modal-babyai11-agent --arch original_endpool_res --ppo-hp-tuning --*env_args few_actions True npc_view_size 5 npc_look_around True --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --*exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+# PPO + RND
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model coinThief --algo ppo -cs --env MiniGrid-CoinThief-8x8-v0 --frames 30000000 --dialogue --save-interval 100 --log-interval 100 --multi-modal-babyai11-agent --arch original_endpool_res --custom-ppo-2 --*env_args few_actions True npc_view_size 5 npc_look_around True --exploration-bonus --*exploration-bonus-type rnd --clipped-rewards
+# PPO + RIDE
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model coinThief --algo ppo -cs --env MiniGrid-CoinThief-8x8-v0 --frames 30000000 --dialogue --save-interval 100 --log-interval 100 --multi-modal-babyai11-agent --arch original_endpool_res --custom-ppo-2 --*env_args few_actions True npc_view_size 5 npc_look_around True --exploration-bonus --*exploration-bonus-type ride --clipped-rewards
+# unsocial PPO
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model coinThief --algo ppo -cs --env MiniGrid-CoinThief-8x8-v0 --frames 30000000 --dialogue --save-interval 100 --log-interval 100 --multi-modal-babyai11-agent --arch original_endpool_res --ppo-hp-tuning --*env_args few_actions True hidden_npc True npc_view_size 5 npc_look_around True
+# PPO on easy version - visible coin tags
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model coinThief --algo ppo -cs --env MiniGrid-CoinThief-8x8-v0 --frames 30000000 --dialogue --save-interval 100 --log-interval 100 --multi-modal-babyai11-agent --arch original_endpool_res --ppo-hp-tuning --*env_args few_actions True tag_visible_coins True npc_view_size 5 npc_look_around True
+# PPO + explo bonus on easy version - visible coin tags
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model coinThief --algo ppo -cs --env MiniGrid-CoinThief-8x8-v0 --frames 30000000 --dialogue --save-interval 100 --log-interval 100 --multi-modal-babyai11-agent --arch original_endpool_res --ppo-hp-tuning --*env_args few_actions True tag_visible_coins True npc_view_size 5 npc_look_around True --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --*exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+#
+#
+# NeurIPS Dance
+# PPO
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model dance --algo ppo -cs --env MiniGrid-DanceWithOneNPC-8x8-v0 --frames 30000000 --dialogue --save-interval 100 --log-interval 100 --multi-modal-babyai11-agent --arch original_endpool_res --ppo-hp-tuning --*env_args few_actions True dance_len 3
+# PPO + explo bonus
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model dance --algo ppo -cs --env MiniGrid-DanceWithOneNPC-8x8-v0 --frames 30000000 --dialogue --save-interval 100 --log-interval 100 --multi-modal-babyai11-agent --arch original_endpool_res --ppo-hp-tuning --*env_args few_actions True dance_len 3 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --*exploration-bonus-params 3 50 --exploration-bonus-tanh 0.6
+# PPO + RND
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model dance --algo ppo -cs --env MiniGrid-DanceWithOneNPC-8x8-v0 --frames 30000000 --dialogue --save-interval 100 --log-interval 100 --multi-modal-babyai11-agent --arch original_endpool_res --custom-ppo-2 --*env_args few_actions True dance_len 3 --exploration-bonus --*exploration-bonus-type rnd --clipped-rewards
+# unsocial PPO
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model dance --algo ppo -cs --env MiniGrid-DanceWithOneNPC-8x8-v0 --frames 30000000 --dialogue --save-interval 100 --log-interval 100 --multi-modal-babyai11-agent --arch original_endpool_res --ppo-hp-tuning --*env_args hidden_npc True few_actions True dance_len 3
+# PPO + RIDE
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model dance --algo ppo -cs --env MiniGrid-DanceWithOneNPC-8x8-v0 --frames 30000000 --dialogue --save-interval 100 --log-interval 100 --multi-modal-babyai11-agent --arch original_endpool_res --custom-ppo-2 --*env_args few_actions True dance_len 3 --exploration-bonus --*exploration-bonus-type ride --clipped-rewards
+#
+#
+# NeurIPS SocialEnv
+### PPO
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model socialEnv --algo ppo -cs --env MiniGrid-SocialEnv-8x8-v0 --frames 30000000 --dialogue --save-interval 100 --log-interval 100 --multi-modal-babyai11-agent --arch original_endpool_res --ppo-hp-tuning
+### PPO + explo tests
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model socialEnv --algo ppo -cs --env MiniGrid-SocialEnv-8x8-v0 --frames 30000000 --dialogue --save-interval 100 --log-interval 100 --multi-modal-babyai11-agent --arch original_endpool_res --ppo-hp-tuning --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --*exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+### PPO + RND
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model socialEnv --algo ppo -cs --env MiniGrid-SocialEnv-8x8-v0 --frames 30000000 --dialogue --save-interval 100 --log-interval 100 --multi-modal-babyai11-agent --arch original_endpool_res --custom-ppo-2 --exploration-bonus --*exploration-bonus-type rnd --clipped-rewards
+### PPO + RIDE
+#--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model socialEnv --algo ppo -cs --env MiniGrid-SocialEnv-8x8-v0 --frames 30000000 --dialogue --save-interval 100 --log-interval 100 --multi-modal-babyai11-agent --arch original_endpool_res --custom-ppo-2 --exploration-bonus --*exploration-bonus-type ride --clipped-rewards
+#
diff --git a/run_SAI_GS_case_studies_GS.txt b/run_SAI_GS_case_studies_GS.txt
new file mode 100644
index 0000000000000000000000000000000000000000..45955b3f87ef0609eaf8046d001cfc5509de636b
--- /dev/null
+++ b/run_SAI_GS_case_studies_GS.txt
@@ -0,0 +1,103 @@
+# no exploration bonus
+
+# phase 1
+# lr [1e-4, 1e-5]
+# recurrence [5, 10, 20, 80]
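+#
+# The phase-1 grid above (2 learning rates x 4 recurrence values) can be expanded with a
+# small sketch like the following (illustrative only, not how this file was generated):
+#
+#   from itertools import product
+#   for lr, rec in product(["1e-4", "1e-5"], [5, 10, 20, 80]):
+#       print(f"--*lr {lr} --*recurrence {rec}")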
+
+# vs rnd ride reference
+
+# # rec 5
+# --slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+# --slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-5 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#
+# # rec 10
+# --slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+# --slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-5 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#
+# # rec 20
+# --slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 20 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+# --slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 20 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-5 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+
+# reference
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --*custom-ppo-ride-reference --test-set-name SocialAIGSTestSet
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --*custom-ppo-ride-reference --test-set-name SocialAIGSTestSet
+
+# best from phase 1, with expl bonuses
+# --slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+
+## CB
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_CB --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_CB --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1.5 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_CB --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_CB --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --*exploration-bonus-tanh 0.8
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_CB --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1.5 50 --*exploration-bonus-tanh 0.8
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_CB --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --*exploration-bonus-tanh 0.8
+#
+## CBL
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 1 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 1.5 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 2 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 1 50 --*exploration-bonus-tanh 0.8
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 1.5 50 --*exploration-bonus-tanh 0.8
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 2 50 --*exploration-bonus-tanh 0.8
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 5 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 10 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 50 50 --*exploration-bonus-tanh 0.6
+#
+## RND
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_RND --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-coef 1.0
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_RND --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-coef 0.5
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_RND --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-coef 0.1
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_RND --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-coef 0.05
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_RND --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-coef 0.01
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_RND --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-coef 0.005
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_RND --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-coef 0.001
+#
+# RIDE
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_RIDE --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-coef 1.0
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_RIDE --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-coef 0.5
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_RIDE --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-coef 0.1
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_RIDE --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-coef 0.05
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_RIDE --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-coef 0.01
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_RIDE --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-coef 0.005
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_GS_PPO_RIDE --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAIGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-coef 0.001
+#
+# other envs: ones with more problems and with emulation
+
+# best: lr 1e-5, rec 10
+
+# vs rnd ride reference
+
+## rec 5
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_CGS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-CuesGridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAICuesGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_CGS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-CuesGridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-5 --entropy-coef 0.00001 --test-set-name SocialAICuesGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#
+## rec 10
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_CGS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-CuesGridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAICuesGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_CGS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-CuesGridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-5 --entropy-coef 0.00001 --test-set-name SocialAICuesGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#
+## rec 20
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_CGS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-CuesGridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 20 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name SocialAICuesGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_CGS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-CuesGridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 20 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-5 --entropy-coef 0.00001 --test-set-name SocialAICuesGSTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#
+## reference
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_CGS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-CuesGridSearchParamEnv-v1 --*custom-ppo-ride-reference --test-set-name SocialAICuesGSTestSet
+
+
+# Emulation
+# vs rnd ride reference
+
+# rec 5
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_EGS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-EmulationGridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_EGS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-EmulationGridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-5 --entropy-coef 0.00001 --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#
+## rec 10
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_EGS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-EmulationGridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_EGS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-EmulationGridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-5 --entropy-coef 0.00001 --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#
+## rec 20
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_EGS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-EmulationGridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 20 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_EGS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-EmulationGridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 20 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-5 --entropy-coef 0.00001 --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#
+## reference
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_EGS_PPO --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-EmulationGridSearchParamEnv-v1 --*custom-ppo-ride-reference
diff --git a/run_SAI_case_studies.txt b/run_SAI_case_studies.txt
new file mode 100644
index 0000000000000000000000000000000000000000..c9fae9493e1ac6dbe11007ad2a9de70b882be5e9
--- /dev/null
+++ b/run_SAI_case_studies.txt
@@ -0,0 +1,31 @@
+# dummy formats case study
+# check max_steps, model, formats test set, intervals, etc.
+#--slurm_conf jz_short_gpu_chained --nb_seeds 1 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model dummy_cs_formats_CBL_A_rec_5 --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-ALangFeedbackTrainFormatsCSParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --*test-set-name AFormatsTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 1 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model dummy_cs_formats_CBL_E_rec_5 --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-ELangFeedbackTrainFormatsCSParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --*test-set-name EFormatsTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 1 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model dummy_cs_formats_CBL_AE_rec_5 --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-AELangFeedbackTrainFormatsCSParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --*test-set-name AEFormatsTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+
+# # to run: no CBL
+# --slurm_conf jz_short_gpu_chained --nb_seeds 1 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model dummy_cs_jz_formats_A_rec_5 --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-ALangFeedbackTrainFormatsCSParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --*test-set-name AFormatsTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+# --slurm_conf jz_short_gpu_chained --nb_seeds 1 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model dummy_cs_jz_formats_E_rec_5 --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-ELangFeedbackTrainFormatsCSParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --*test-set-name EFormatsTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+# --slurm_conf jz_short_gpu_chained --nb_seeds 1 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model dummy_cs_jz_formats_AE_rec_5 --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-AELangFeedbackTrainFormatsCSParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --*test-set-name AEFormatsTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+
+# classic exp
+#--slurm_conf jz_short_gpu_chained --nb_seeds 1 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 30000000 --model dummy_cs_NEW_Pointing_sm_CB --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-EPointingTrainInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name PointingTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 1 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 15000000 --model dummy_cs_NEW_Color_CBL --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-ELangColorTrainInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name LangColorTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type lang --exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 1 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 10000000 --model dummy_cs_NEW2_Feedback_CBL --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-ELangFeedbackTrainInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name LangFeedbackTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type lang --exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 1 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 30000000 --model dummy_cs_NEW2_Pointing_sm_CB --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-EPointingTrainInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name PointingTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6
+# --slurm_conf jz_short_gpu_chained --nb_seeds 1 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 30000000 --model dummy_cs_NEW2_Pointing_sm_CB_very_small --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-EPointingTrainInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name PointingTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 0.25 50 --exploration-bonus-tanh 0.6
+
+# Emulation
+# rec 10
+#--slurm_conf jz_short_gpu_chained --nb_seeds 1 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model dummy_cs_emulation_no_distr_rec_10 --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-EEmulationNoDistrInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --test-set-name NoDistrEmulationTestSet
+#--slurm_conf jz_short_gpu_chained --nb_seeds 1 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 40000000 --model dummy_cs_emulation_distr_rec_10 --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-EEmulationDistrInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --test-set-name DistrEmulationTestSet --continue-train auto
+
+# rec 5
+#--slurm_conf jz_short_gpu_chained --nb_seeds 1 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model dummy_cs_emulation_no_distr_rec_5 --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-EEmulationNoDistrInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --test-set-name NoDistrEmulationTestSet
+#--slurm_conf jz_short_gpu_chained --nb_seeds 1 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 40000000 --model dummy_cs_emulation_distr_rec_5 --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-EEmulationDistrInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --test-set-name DistrEmulationTestSet --continue-train auto
+
+
+# CB
+#--slurm_conf jz_short_gpu_chained --nb_seeds 1 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model dummy_cs_emulation_no_distr_rec_5_CB --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-EEmulationNoDistrInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --test-set-name NoDistrEmulationTestSet --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --exploration-bonus-params 0.25 50 --exploration-bonus-tanh 0.6
+# --slurm_conf jz_short_gpu_chained --nb_seeds 1 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model dummy_cs_emulation_no_distr_rec_5_CB --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-EEmulationNoDistrInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --test-set-name NoDistrEmulationTestSet --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6
\ No newline at end of file
diff --git a/run_SAI_final_case_studies.txt b/run_SAI_final_case_studies.txt
new file mode 100644
index 0000000000000000000000000000000000000000..a5c28c2c236e4155a0d6cf0a25d08ee483bb4814
--- /dev/null
+++ b/run_SAI_final_case_studies.txt
@@ -0,0 +1,84 @@
+######################
+## Scaffolding + Formats
+######################
+
+
+#--slurm_conf jz_short_gpu_chained_a100 --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 50000000 --model formats_50M_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-AELangFeedbackTrainFormatsCSParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name AEFormatsTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type lang --exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained_a100 --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 50000000 --model scaffolding_50M_no_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-AELangFeedbackTrainFormatsCSParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name AEFormatsTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained_a100 --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 50000000 --model scaffolding_50M_acl_4 --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-AELangFeedbackTrainScaffoldingCSParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name AEFormatsTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --acl --*acl-type intro_seq --acl-thresholds 0.90 0.90 0.90 0.90 --acl-average-interval 500 --acl-minimum-episodes 1000
+#--slurm_conf jz_short_gpu_chained_a100 --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 50000000 --model scaffolding_50M_acl_8 --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-AELangFeedbackTrainScaffoldingCSParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name AEFormatsTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --acl --*acl-type intro_seq_scaf --acl-thresholds 0.90 0.90 0.90 0.90 --acl-average-interval 500 --acl-minimum-episodes 1000
+
+###############
+## Pointing
+###############
+# --slurm_conf jz_short_gpu_chained_a100 --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 50000000 --model Pointing_CB_heldout_doors --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-EPointingHeldoutDoorsTrainInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name PointingTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 0.25 50 --exploration-bonus-tanh 0.6
+
+###############
+## Feedback
+###############
+# --slurm_conf jz_short_gpu_chained_a100 --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model Feedback_CB_heldout_doors_20M --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-ELangFeedbackHeldoutDoorsTrainInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name LangFeedbackTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type lang --exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+
+###############
+## Color
+###############
+# --slurm_conf jz_short_gpu_chained_a100 --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model Color_CB_heldout_doors --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-ELangColorHeldoutDoorsTrainInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name LangColorTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type lang --exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+
+###############
+## Joint attention
+###############
+# JA - Color
+# --slurm_conf jz_short_gpu_chained_a100 --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model JA_Color_CB_heldout_doors --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-JAELangColorHeldoutDoorsTrainInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name JALangColorTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type lang --exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+
+
+###############
+## Imitation
+###############
+# --slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model Imitation_PPO_CB --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-EEmulationNoDistrInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --test-set-name NoDistrEmulationTestSet --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 0.25 50 --exploration-bonus-tanh 0.6
+# --slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model Imitation_PPO_CB --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-EEmulationNoDistrInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --test-set-name NoDistrEmulationTestSet --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 0.5 50 --exploration-bonus-tanh 0.6
+# --slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 20000000 --model Imitation_PPO_CB --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-EEmulationNoDistrInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --test-set-name NoDistrEmulationTestSet --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6
+
+##################
+## Role Reversal
+##################
+
+## SINGLE
+##################
+
+# pretrain
+# --slurm_conf jz_short_gpu_chained_a100 --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 4000000 --model RR_single_CB_marble_pass_B_exp --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 0 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-MarblePassBCollaborationParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name RoleReversalTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+# --slurm_conf jz_short_gpu_chained_a100 --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 4000000 --model RR_single_CB_marble_pass_asoc_contr --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 0 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-AsocialMarbleCollaborationParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name RoleReversalTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+
+# finetune
+# --slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 1000000 --model RR_ft_single_CB_marble_pass_A_soc_exp --algo ppo --dialogue --save-interval 1 --log-interval 1 --test-interval 0 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-MarblePassACollaborationParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name RoleReversalTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6 --finetune-train storage/02-01_RR_single_CB_marble_pass_B_exp
+# --slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 1000000 --model RR_ft_single_CB_marble_pass_A_asoc_contr --algo ppo --dialogue --save-interval 1 --log-interval 1 --test-interval 0 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-MarblePassACollaborationParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name RoleReversalTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6 --finetune-train storage/02-01_RR_single_CB_marble_pass_asoc_contr
+
+## GROUP
+##################
+
+# pretrain
+# --slurm_conf jz_short_gpu_chained_a100 --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 50000000 --model RR_group_CB_marble_pass_B_exp --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-RoleReversalGroupExperimentalCollaborationParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name RoleReversalTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+# --slurm_conf jz_short_gpu_chained_a100 --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 50000000 --model RR_group_CB_marble_pass_asoc_contr --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-RoleReversalGroupControlCollaborationParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name RoleReversalTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+
+# finetune
+# --slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 500000 --model RR_ft_group_20M_CB_marble_pass_A_soc_exp --algo ppo --dialogue --save-interval 1 --log-interval 1 --test-interval 0 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-MarblePassACollaborationParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name RoleReversalTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6 --finetune-train storage/02-01_RR_group_CB_marble_pass_B_exp
+# --slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 500000 --model RR_ft_group_20M_CB_marble_pass_A_asoc_contr --algo ppo --dialogue --save-interval 1 --log-interval 1 --test-interval 0 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-MarblePassACollaborationParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name RoleReversalTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6 --finetune-train storage/02-01_RR_group_CB_marble_pass_asoc_contr
+
+# finetune - 50M
+# --slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 500000 --model RR_ft_group_50M_CB_marble_pass_A_soc_exp --algo ppo --dialogue --save-interval 1 --log-interval 1 --test-interval 0 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-MarblePassACollaborationParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name RoleReversalTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6 --finetune-train storage/03-01_RR_group_CB_marble_pass_B_exp
+# --slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 500000 --model RR_ft_group_50M_CB_marble_pass_A_asoc_contr --algo ppo --dialogue --save-interval 1 --log-interval 1 --test-interval 0 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-MarblePassACollaborationParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name RoleReversalTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6 --finetune-train storage/03-01_RR_group_CB_marble_pass_asoc_contr
+
+##################
+## Adversarial type - AppleStealing
+##################
+
+# --slurm_conf jz_short_gpu_chained_a100_4h --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 2000000 --model Adversarial_2M_PPO_CB --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 0 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-AppleStealingObst_NoParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name RoleReversalTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+# --slurm_conf jz_short_gpu_chained_a100_4h --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 2000000 --model Adversarial_2M_PPO_CB_hidden_npc --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 0 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-AppleStealingObst_NoParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name RoleReversalTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6 --env-args hidden_npc True
+# --slurm_conf jz_short_gpu_chained_a100_4h --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 2000000 --model Adversarial_2M_PPO_CB_asoc --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 0 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-AsocialAppleStealingObst_NoParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name RoleReversalTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+
+##################
+# Adversarial type - AppleStealing - more stumps
+##################
+
+# --slurm_conf jz_short_gpu_chained_a100_4h --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 5000000 --model Adversarial_5M_Stumps_PPO_CB --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 0 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-AppleStealingObst_MediumParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name RoleReversalTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+# --slurm_conf jz_short_gpu_chained_a100_4h --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 5000000 --model Adversarial_5M_Stumps_PPO_CB_hidden_npc --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 0 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-AppleStealingObst_MediumParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name RoleReversalTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6 --env-args hidden_npc True
+# --slurm_conf jz_short_gpu_chained_a100_4h --nb_seeds 8 --cpu_cores_per_seed 80 --gpus_per_seed 1 --seeds_per_launch 1 --frames 5000000 --model Adversarial_5M_Stumps_PPO_CB_asoc --algo ppo --dialogue --save-interval 10 --log-interval 10 --test-interval 0 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-AsocialAppleStealingObst_MediumParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-4 --entropy-coef 0.00001 --test-set-name RoleReversalTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
diff --git a/run_SAI_pilot_case_studies.txt b/run_SAI_pilot_case_studies.txt
new file mode 100644
index 0000000000000000000000000000000000000000..ef9b9a102c639b14f1989d32ac69b6b36ef2fb65
--- /dev/null
+++ b/run_SAI_pilot_case_studies.txt
@@ -0,0 +1,62 @@
+
+## Pointing Case study
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_Pointing_CS_PPO_no --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-EPointingInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name PointingTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_Pointing_CS_PPO_CB --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-EPointingInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name PointingTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_Pointing_CS_PPO_RND --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-EPointingInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name PointingTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-coef 0.005
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_Pointing_CS_PPO_RIDE --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-EPointingInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name PointingTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-coef 0.01
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_Pointing_CS_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-EPointingInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name PointingTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 10 50 --*exploration-bonus-tanh 0.6
+#
+## Lang Color Case study
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_LangColor_CS_PPO_no --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-ELangColorInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name LangColorTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_LangColor_CS_PPO_CB --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-ELangColorInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name LangColorTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_LangColor_CS_PPO_RND --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-ELangColorInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name LangColorTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-coef 0.005
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_LangColor_CS_PPO_RIDE --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-ELangColorInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name LangColorTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-coef 0.01
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_LangColor_CS_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-ELangColorInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name LangColorTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 10 50 --*exploration-bonus-tanh 0.6
+
+# 3 and 5 colors - CBL
+# --slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_LangColor_CS_5C_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-ELangColorInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name LangColorTestSet --env-args see_through_walls False n_colors 5 --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 10 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_LangColor_CS_3C_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-ELangColorInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name LangColorTestSet --env-args see_through_walls False n_colors 3 --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 10 50 --*exploration-bonus-tanh 0.6
+
+# Lang Feedback Case study - 20M
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 20000000 --model SAI_LangFeedback_CS_PPO_no --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-ELangFeedbackInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name LangFeedbackTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 20000000 --model SAI_LangFeedback_CS_PPO_CB --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-ELangFeedbackInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name LangFeedbackTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 20000000 --model SAI_LangFeedback_CS_PPO_RND --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-ELangFeedbackInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name LangFeedbackTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-coef 0.005
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 20000000 --model SAI_LangFeedback_CS_PPO_RIDE --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-ELangFeedbackInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name LangFeedbackTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-coef 0.01
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 20000000 --model SAI_LangFeedback_CS_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-ELangFeedbackInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name LangFeedbackTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 10 50 --*exploration-bonus-tanh 0.6
+
+
+## Joint attention experiments
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_JA_Pointing_CS_PPO_CB_less_ --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-JAEPointingInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name JAPointingTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_JA_LangColor_CS_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-JAELangColorInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name JALangColorTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 10 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 40000000 --model SAI_JA_LangFeedback_CS_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-JAELangFeedbackInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name JALangFeedbackTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 10 50 --*exploration-bonus-tanh 0.6
+#
+## 3 and 5 colors - CBL
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_JA_LangColor_CS_5C_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-JAELangColorInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name JALangColorTestSet --env-args see_through_walls False n_colors 5 --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 10 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_cpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_JA_LangColor_CS_3C_PPO_CBL_cpu --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-JAELangColorInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name JALangColorTestSet --env-args see_through_walls False n_colors 3 --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 10 50 --*exploration-bonus-tanh 0.6
+
+## Imitation
+## rec 5
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_ImitationDistr_CS_PPO_CB --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-EEmulationDistrInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name DistrEmulationTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_ImitationNoDistr_CS_PPO_CB --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-EEmulationNoDistrInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name NoDistrEmulationTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --*exploration-bonus-tanh 0.6
+## rec 10
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_ImitationDistr_CS_PPO_CB --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-EEmulationDistrInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name DistrEmulationTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_ImitationNoDistr_CS_PPO_CB --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-EEmulationNoDistrInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name NoDistrEmulationTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --*exploration-bonus-tanh 0.6
+## Imitation - less
+## rec 5
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_ImitationDistr_CS_PPO_CB_small --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-EEmulationDistrInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name DistrEmulationTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_ImitationNoDistr_CS_PPO_CB_small --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-EEmulationNoDistrInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name NoDistrEmulationTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --*exploration-bonus-tanh 0.6
+## rec 10
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_ImitationDistr_CS_PPO_CB_small --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-EEmulationDistrInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name DistrEmulationTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model SAI_ImitationNoDistr_CS_PPO_CB_small --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-EEmulationNoDistrInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name NoDistrEmulationTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --*exploration-bonus-tanh 0.6
+
+## Formats - CBL
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 40000000 --model SAI_LangFeedback_CS_F_NO_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-NLangFeedbackInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name NLangFeedbackTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 10 50 --*exploration-bonus-tanh 0.6
+##--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 40000000 --model SAI_LangFeedback_CS_EYE_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-ELangFeedbackInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name ELangFeedbackTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 10 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 40000000 --model SAI_LangFeedback_CS_F_ASK_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-ALangFeedbackInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name ALangFeedbackTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 10 50 --*exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 40000000 --model SAI_LangFeedback_CS_F_ASK_EYE_PPO_CBL --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-AELangFeedbackInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name AELangFeedbackTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type lang --*exploration-bonus-params 10 50 --*exploration-bonus-tanh 0.6
+
+## Formats - NO
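+# (the same four LangFeedback environment variants, trained without any exploration bonus)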
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 40000000 --model SAI_LangFeedback_CS_F_NO_PPO_NO --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-NLangFeedbackInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name NLangFeedbackTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+##--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 40000000 --model SAI_LangFeedback_CS_EYE_PPO_NO --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-ELangFeedbackInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name ELangFeedbackTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 40000000 --model SAI_LangFeedback_CS_F_ASK_PPO_NO --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-ALangFeedbackInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name ALangFeedbackTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 40000000 --model SAI_LangFeedback_CS_F_ASK_EYE_PPO_NO --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-AELangFeedbackInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --test-set-name AELangFeedbackTestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
\ No newline at end of file
diff --git a/run_bAI_GS_ppo.txt b/run_bAI_GS_ppo.txt
new file mode 100644
index 0000000000000000000000000000000000000000..d869766fd4e76e290a36104a914ade22f539910b
--- /dev/null
+++ b/run_bAI_GS_ppo.txt
@@ -0,0 +1,46 @@
+# no exploration bonus
+
+# GS phase one - recurrence and LR
+# lr [1e-4, 5e-5, 1e-5]
+# recurrence [5, 10, 20, 80]
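+# (the 12 commented-out configurations below enumerate the full 3 lr x 4 recurrence grid; all other flags are held fixed)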
+
+## rec 5
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 5e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#
+## rec 10
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 5e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#
+## rec 20
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 20 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 20 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 5e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 20 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#
+## rec 80
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 80 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 80 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 5e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 80 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+
+# best: lr 1e-5, rec 10
+
+# phase two
+# ACL --acl-thresholds [0.75, 0.8, 0.9]
+# ACL --acl-average-interval [100, 500, 1000]
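+# (the 9 configurations below cover the full 3 thresholds x 3 intervals grid, with recurrence 10 and lr 1e-5 taken from the phase-one best)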
+
+# acl-average-interval 100
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 100000000 --model bAI_GS_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --*acl --*acl-thresholds 0.75 --*acl-average-interval 100 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 100000000 --model bAI_GS_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --*acl --*acl-thresholds 0.8 --*acl-average-interval 100 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 100000000 --model bAI_GS_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --*acl --*acl-thresholds 0.9 --*acl-average-interval 100 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+
+# acl-average-interval 500
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 100000000 --model bAI_GS_acl --algo ppo --dialogue --save-interval 500 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --*acl --*acl-thresholds 0.75 --*acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 100000000 --model bAI_GS_acl --algo ppo --dialogue --save-interval 500 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --*acl --*acl-thresholds 0.8 --*acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 100000000 --model bAI_GS_acl --algo ppo --dialogue --save-interval 500 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --*acl --*acl-thresholds 0.9 --*acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+
+# acl-average-interval 1000
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 100000000 --model bAI_GS_acl --algo ppo --dialogue --save-interval 1000 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --*acl --*acl-thresholds 0.75 --*acl-average-interval 1000 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 100000000 --model bAI_GS_acl --algo ppo --dialogue --save-interval 1000 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --*acl --*acl-thresholds 0.8 --*acl-average-interval 1000 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 100000000 --model bAI_GS_acl --algo ppo --dialogue --save-interval 1000 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --*acl --*acl-thresholds 0.9 --*acl-average-interval 1000 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
\ No newline at end of file
diff --git a/run_bAI_GS_ppo_cb.txt b/run_bAI_GS_ppo_cb.txt
new file mode 100644
index 0000000000000000000000000000000000000000..751a05efd965697092e824d8ab3a9023ad0c6b80
--- /dev/null
+++ b/run_bAI_GS_ppo_cb.txt
@@ -0,0 +1,11 @@
+# Phase one
+# param [1, 1.5, 2], tanh [0.6, 0.8]
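+# (6 configurations below: the first value of --*exploration-bonus-params takes 1, 1.5 or 2 while the second stays at 50, crossed with --*exploration-bonus-tanh 0.6 or 0.8)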
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 100000000 --model bAI_cb_GS_param_tanh --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --*exploration-bonus-tanh 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 100000000 --model bAI_cb_GS_param_tanh --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1.5 50 --*exploration-bonus-tanh 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 100000000 --model bAI_cb_GS_param_tanh --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --*exploration-bonus-tanh 0.6
+
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 100000000 --model bAI_cb_GS_param_tanh --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --*exploration-bonus-tanh 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 100000000 --model bAI_cb_GS_param_tanh --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1.5 50 --*exploration-bonus-tanh 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 1 --frames 100000000 --model bAI_cb_GS_param_tanh --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --*exploration-bonus-tanh 0.8
+
+# phase two ACL
\ No newline at end of file
diff --git a/run_bAI_GS_ppo_ride.txt b/run_bAI_GS_ppo_ride.txt
new file mode 100644
index 0000000000000000000000000000000000000000..747287a384e45dcc845918072959c93043ccd56c
--- /dev/null
+++ b/run_bAI_GS_ppo_ride.txt
@@ -0,0 +1,42 @@
+# Phase one
+# intrinsic reward coefficient (--intrinsic-reward-coef)
+# 1.0, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001
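+# (phase one sweeps --*intrinsic-reward-coef over the values above with the ride exploration bonus; ACL is fixed at threshold 0.75, interval 500)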
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 1.0
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.5
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.1
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.05
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.01
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.005
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.001
+
+
+# phase two ACL
+# 0.5, 0.3, 0.1, 0.05, 0.01, 0.005, 0.001
+# ACL --acl-thresholds [0.75, 0.8, 0.9]
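+# (each ACL threshold below gets its own block sweeping the coefficients listed above; the averaging interval stays at 500)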
+
+# acl-thresholds 0.75
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.5
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.3
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.1
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.05
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.01
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.005
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.001
+
+# acl-thresholds 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.8 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.5
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.8 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.3
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.8 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.1
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.8 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.05
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.8 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.01
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.8 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.005
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.8 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.001
+
+# acl-thresholds 0.9
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.9 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.5
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.9 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.3
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.9 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.1
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.9 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.05
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.9 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.01
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.9 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.005
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_ride_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.9 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type ride --*intrinsic-reward-coef 0.001
\ No newline at end of file
diff --git a/run_bAI_GS_ppo_rnd.txt b/run_bAI_GS_ppo_rnd.txt
new file mode 100644
index 0000000000000000000000000000000000000000..10162129e0b48e644174f8f224d64e5a509b5a13
--- /dev/null
+++ b/run_bAI_GS_ppo_rnd.txt
@@ -0,0 +1,41 @@
+# Phase one
+# intrinsic reward coefficient (--intrinsic-reward-coef)
+# 1.0, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001
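+# (phase one sweeps --*intrinsic-reward-coef over the values above, here with the rnd exploration bonus; ACL fixed at threshold 0.75, interval 500)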
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_param_tanh --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 1.0
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_param_tanh --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.5
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_param_tanh --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.1
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_param_tanh --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.05
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_param_tanh --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.01
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_param_tanh --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.005
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_param_tanh --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.001
+
+# phase two ACL
+# ACL --acl-thresholds [0.75, 0.8, 0.9, 0.95]
+# intrinsic reward coefficient (--intrinsic-reward-coef) [0.01, 0.05, 0.1, 0.3]
+
+## acl 75
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.3
+##--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.1
+##--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.05
+##--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.01
+#
+## acl 80
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.8 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.3
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.8 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.1
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.8 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.05
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.8 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.01
+#
+## acl 90
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.9 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.3
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.9 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.1
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.9 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.05
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.9 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.01
+#
+## acl 95
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.95 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.3
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.95 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.1
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.95 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.05
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_GS_coef_acl --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.95 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.01
+
+# test resetting RND at ACL phase change (--*reset-rnd-ride-at-phase)
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_rnd_reset_at_phase --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --*acl-thresholds 0.9 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type rnd --*intrinsic-reward-coef 0.01 --*reset-rnd-ride-at-phase
\ No newline at end of file
diff --git a/run_bAI_GS_ppo_soc_inf.txt b/run_bAI_GS_ppo_soc_inf.txt
new file mode 100644
index 0000000000000000000000000000000000000000..9eddd16cf27a4b4dd154757f314783e1418a7500
--- /dev/null
+++ b/run_bAI_GS_ppo_soc_inf.txt
@@ -0,0 +1,15 @@
+# Phase one
+# intrinsic reward loss coef (--intrinsic-reward-loss-coef): 0.1, 1, 10, 100
+# intrinsic reward coef (--intrinsic-reward-coef): 0.5, 0.3, 0.1, 0.01
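+# (the configs below sweep 4 loss coefs x 2 reward coefs = 8 runs; they use the smaller reward coefs 0.05 and 0.03, cf. the model name bAI_soc_inf_GS_small_coefs)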
+
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_soc_inf_GS_small_coefs --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 0.05
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_soc_inf_GS_small_coefs --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 0.03
+
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_soc_inf_GS_small_coefs --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 0.05
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_soc_inf_GS_small_coefs --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 0.03
+
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_soc_inf_GS_small_coefs --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 0.05
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_soc_inf_GS_small_coefs --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 0.03
+
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_soc_inf_GS_small_coefs --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 0.05
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 100000000 --model bAI_soc_inf_GS_small_coefs --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 1e-5 --entropy-coef 0.00001 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64 --exploration-bonus --exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 0.03
\ No newline at end of file
diff --git a/run_ppo_cb_cell_gs.txt b/run_ppo_cb_cell_gs.txt
new file mode 100644
index 0000000000000000000000000000000000000000..33eaf518a00000c9601d1f37b6e1b12071bb3601
--- /dev/null
+++ b/run_ppo_cb_cell_gs.txt
@@ -0,0 +1,141 @@
+# We selected the parameters:
+# PPO (code 543): optim-eps 1e-05, lr 0.0001, entropy-coef 0.001
+# PPO configs: 543, 544, 555, 843, 844
+# CB bonus (first exploration-bonus-param): 1, 2, 5, 10, 20, 50, 100  # should it be 1-10?
+# 35 combinations (5 PPO configs x 7 CB values)
+#
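+# (reading of the numeric codes, inferred from the configs below: the three digits appear to be the
+#  negative exponents of optim-eps, lr and entropy-coef, e.g. 543 -> optim-eps 1e-05, lr 1e-04,
+#  entropy-coef 1e-03, and 843 -> optim-eps 1e-08, lr 1e-04, entropy-coef 1e-03)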
+### Emulation Marble
+## 543
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 5 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 20 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 50 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 100 50 --exploration-bonus-tanh 0.6
+## 544
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 5 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 20 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 50 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 100 50 --exploration-bonus-tanh 0.6
+## 555
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 5 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 20 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 50 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 100 50 --exploration-bonus-tanh 0.6
+## 843
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 5 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 20 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 50 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 100 50 --exploration-bonus-tanh 0.6
+## 844
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 5 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 20 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 50 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_emu_marble --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 100 50 --exploration-bonus-tanh 0.6
+#
+### Language Switches
+## 543
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 5 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 20 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 50 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 100 50 --exploration-bonus-tanh 0.6
+## 544
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 5 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 20 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 50 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 100 50 --exploration-bonus-tanh 0.6
+## 555
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 5 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 20 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 50 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 100 50 --exploration-bonus-tanh 0.6
+## 843
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 5 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 20 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 50 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 100 50 --exploration-bonus-tanh 0.6
+## 844
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 5 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 10 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 20 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 50 50 --exploration-bonus-tanh 0.6
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_cb_cell_grid_search_ask_eye_lang_switches --algo pcs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 100 50 --exploration-bonus-tanh 0.6
+
+
+# CB grid search
+# Ask pointing boxes
+
+# 545
+# 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 0.5 50 --*exploration-bonus-tanh 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --*exploration-bonus-tanh 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --*exploration-bonus-tanh 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 4 50 --*exploration-bonus-tanh 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 8 50 --*exploration-bonus-tanh 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 10 50 --*exploration-bonus-tanh 0.6
+# 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 0.5 50 --*exploration-bonus-tanh 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --*exploration-bonus-tanh 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --*exploration-bonus-tanh 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 4 50 --*exploration-bonus-tanh 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 8 50 --*exploration-bonus-tanh 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 10 50 --*exploration-bonus-tanh 0.8
+
+# 845
+# 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 0.5 50 --*exploration-bonus-tanh 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --*exploration-bonus-tanh 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --*exploration-bonus-tanh 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 4 50 --*exploration-bonus-tanh 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 8 50 --*exploration-bonus-tanh 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 10 50 --*exploration-bonus-tanh 0.6
+# 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 0.5 50 --*exploration-bonus-tanh 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --*exploration-bonus-tanh 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --*exploration-bonus-tanh 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 4 50 --*exploration-bonus-tanh 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 8 50 --*exploration-bonus-tanh 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 10 50 --*exploration-bonus-tanh 0.8
+
+# 544
+# 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 0.5 50 --*exploration-bonus-tanh 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --*exploration-bonus-tanh 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --*exploration-bonus-tanh 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 4 50 --*exploration-bonus-tanh 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 8 50 --*exploration-bonus-tanh 0.6
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 10 50 --*exploration-bonus-tanh 0.6
+# 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 0.5 50 --*exploration-bonus-tanh 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 1 50 --*exploration-bonus-tanh 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 2 50 --*exploration-bonus-tanh 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 4 50 --*exploration-bonus-tanh 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 8 50 --*exploration-bonus-tanh 0.8
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_cb_cell_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type cell --*exploration-bonus-params 10 50 --*exploration-bonus-tanh 0.8
\ No newline at end of file
diff --git a/run_ppo_gs.txt b/run_ppo_gs.txt
new file mode 100644
index 0000000000000000000000000000000000000000..1818a3e129db8c4082879c4bddff35c16f4d0da0
--- /dev/null
+++ b/run_ppo_gs.txt
@@ -0,0 +1,155 @@
+## Emulation Marble
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.001 --*entropy-coef 0.001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.000001 --*entropy-coef 0.001
+#
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.000001 --*entropy-coef 0.0001
+#
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.000001 --*entropy-coef 0.00001
+#
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.001 --*entropy-coef 0.001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.00001 --*entropy-coef 0.001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.000001 --*entropy-coef 0.001
+#
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.00001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.000001 --*entropy-coef 0.0001
+#
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.00001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_emu_marble --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactEmulationMarbleInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.000001 --*entropy-coef 0.00001
+#
+## Language Switches
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.001 --*entropy-coef 0.001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.000001 --*entropy-coef 0.001
+#
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.000001 --*entropy-coef 0.0001
+#
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.000001 --*entropy-coef 0.00001
+#
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.001 --*entropy-coef 0.001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.00001 --*entropy-coef 0.001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.000001 --*entropy-coef 0.001
+#
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.00001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.000001 --*entropy-coef 0.0001
+#
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.00001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_eye_lang_switches --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskEyeContactLanguageSwitchesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.000001 --*entropy-coef 0.00001
+#
+## NoIntro Pointing Boxes
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.001 --*entropy-coef 0.001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.000001 --*entropy-coef 0.001
+#
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.000001 --*entropy-coef 0.0001
+#
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.000001 --*entropy-coef 0.00001
+#
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.001 --*entropy-coef 0.001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.00001 --*entropy-coef 0.001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.000001 --*entropy-coef 0.001
+#
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.00001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.000001 --*entropy-coef 0.0001
+#
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.00001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_no_intro_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.000001 --*entropy-coef 0.00001
+
+# Ask Pointing Boxes
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.001 --*entropy-coef 0.001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.000001 --*entropy-coef 0.001
+#
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.000001 --*entropy-coef 0.0001
+#
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.00001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.000001 --*entropy-coef 0.00001
+#
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.001 --*entropy-coef 0.001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.00001 --*entropy-coef 0.001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.000001 --*entropy-coef 0.001
+#
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.00001 --*entropy-coef 0.0001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.000001 --*entropy-coef 0.0001
+#
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.00001 --*entropy-coef 0.00001
+#--slurm_conf jz_short_2gpus_chained --nb_seeds 4 --frames 50000000 --model PPO_raw_grid_search_ask_point_boxes --algo ppo -cs --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.000001 --*entropy-coef 0.00001
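+# A recurring convention in these launch lines is the `--*` prefix (e.g. `--*lr`, `--*optim-eps`, `--*entropy-coef`).
+# It presumably marks the arguments that the grid search varies, so a launcher can fold them into the run label and
+# strip the marker before forwarding plain arguments to the training script. That reading is an assumption, and the
+# helper below is only a sketch of it, not the repository's launcher.
+```python
+import shlex
+
+def split_starred_args(launch_line: str):
+    """Hypothetical helper: collect `--*` (grid-searched) arguments into a run
+    label and return the cleaned argument list."""
+    tokens = shlex.split(launch_line)
+    clean, label_parts = [], []
+    i = 0
+    while i < len(tokens):
+        tok = tokens[i]
+        if tok.startswith("--*"):
+            name = tok[3:]
+            value = tokens[i + 1] if i + 1 < len(tokens) and not tokens[i + 1].startswith("--") else ""
+            label_parts.append(f"{name}_{value}")
+            clean.append("--" + name)  # forward the argument without the star
+        else:
+            clean.append(tok)
+        i += 1
+    return "-".join(label_parts), clean
+
+label, args = split_starred_args("--algo ppo --*lr 0.0001 --*entropy-coef 0.001 --recurrence 5")
+print(label)  # lr_0.0001-entropy-coef_0.001
+print(args)   # ['--algo', 'ppo', '--lr', '0.0001', '--entropy-coef', '0.001', '--recurrence', '5']
+```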
+
+
+# GS phase one: recurrence and LR
+# lr: [1e-4, 5e-5, 1e-5]
+# recurrence: [5, 10, 20, 80]
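+# The twelve launch lines below cover this full grid (4 recurrence values x 3 learning rates). A short generator for
+# such a grid is sketched here for illustration; the base command is abridged and the surrounding launcher is assumed.
+```python
+from itertools import product
+
+# Abridged base command; the trailing "..." stands for the remaining fixed arguments.
+BASE = ("--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 80000000 "
+        "--model bAI_GS_rec_lr --algo ppo --dialogue ...")
+
+for recurrence, lr in product([5, 10, 20, 80], ["1e-4", "5e-5", "1e-5"]):
+    print(f"{BASE} --*recurrence {recurrence} --*lr {lr}")
+```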
+
+# rec 5
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 2 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 2 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 5e-5 --entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 2 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-5 --entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+
+# rec 10
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 2 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 2 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 5e-5 --entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 2 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 10 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-5 --entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+
+# rec 20
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 2 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 20 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 2 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 20 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 5e-5 --entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 2 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 20 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-5 --entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+
+# rec 80
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 2 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 80 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-4 --entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 2 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 80 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 5e-5 --entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 32 --gpus_per_seed 1 --seeds_per_launch 2 --frames 80000000 --model bAI_GS_rec_lr --algo ppo --dialogue --save-interval 100 --log-interval 100 --test-interval 1000 --frames-per-proc 40 --multi-modal-babyai11-agent --*env SocialAI-SocialAIParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --*recurrence 80 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --*lr 1e-5 --entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --exploration-bonus-type cell --exploration-bonus-params 1 50 --exploration-bonus-tanh 0.6 --acl --acl-thresholds 0.75 --acl-average-interval 500 --acl-minimum-episodes 1000 --test-set-name SocialAITestSet --env-args see_through_walls False --arch bow_endpool_res --bAI-lang-model attgru --memory-dim 2048 --procs 64
+
+
+# What to search next (ACL):
+# --acl-thresholds [0.75, 0.8, 0.9]
+# --acl-average-interval [100, 500, 1000]
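+# A hedged reading of the ACL flags used in the launch lines above: the curriculum is assumed to advance one phase once
+# at least --acl-minimum-episodes episodes have been collected in the current phase and the success rate averaged over
+# the last --acl-average-interval episodes exceeds --acl-thresholds. The sketch below only illustrates that
+# interpretation; it is not the repository's implementation.
+```python
+from collections import deque
+
+class ThresholdCurriculum:
+    """Illustrative threshold-based curriculum progression."""
+
+    def __init__(self, threshold=0.75, average_interval=500, minimum_episodes=1000):
+        self.threshold = threshold
+        self.minimum_episodes = minimum_episodes
+        self.recent = deque(maxlen=average_interval)  # rolling window of episode outcomes
+        self.episodes_in_phase = 0
+        self.phase = 0
+
+    def record_episode(self, success: bool) -> int:
+        self.recent.append(float(success))
+        self.episodes_in_phase += 1
+        window_full = len(self.recent) == self.recent.maxlen
+        mean_success = sum(self.recent) / len(self.recent)
+        if (self.episodes_in_phase >= self.minimum_episodes
+                and window_full
+                and mean_success >= self.threshold):
+            self.phase += 1  # move on to the next curriculum phase
+            self.recent.clear()
+            self.episodes_in_phase = 0
+        return self.phase
+```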
+
+
diff --git a/run_ppo_ride_gs.txt b/run_ppo_ride_gs.txt
new file mode 100644
index 0000000000000000000000000000000000000000..5b5182880280e361bad89e68c633cc0210df007a
--- /dev/null
+++ b/run_ppo_ride_gs.txt
@@ -0,0 +1,32 @@
+# Selected parameter sets from the PPO grid search:
+# PPO: 543, 845, 544
+
+# CB grid search
+# Ask pointing boxes
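+# For context on the `ride` exploration bonus swept below: RIDE (Raileanu & Rocktäschel, 2020) rewards actions whose
+# effect changes a learned state embedding, with the embedding trained through forward and inverse dynamics losses.
+# The sketch below is generic and illustrative, not this repository's implementation; the episodic visit-count scaling
+# used by the full method is omitted, and the way --intrinsic-reward-loss-coef is applied is an assumption.
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+obs_dim, emb_dim, n_actions = 147, 64, 7  # illustrative sizes
+
+embed = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))
+forward_model = nn.Linear(emb_dim + n_actions, emb_dim)
+inverse_model = nn.Linear(2 * emb_dim, n_actions)
+opt = torch.optim.Adam(
+    list(embed.parameters()) + list(forward_model.parameters()) + list(inverse_model.parameters()), lr=1e-4
+)
+
+def ride_step(obs, action, next_obs, intrinsic_reward_loss_coef=0.1):
+    """Update the dynamics models and return the impact-driven bonus per transition."""
+    phi, phi_next = embed(obs), embed(next_obs)
+    a_onehot = F.one_hot(action, n_actions).float()
+    fwd_loss = F.mse_loss(forward_model(torch.cat([phi, a_onehot], dim=-1)), phi_next.detach())
+    inv_loss = F.cross_entropy(inverse_model(torch.cat([phi, phi_next], dim=-1)), action)
+    loss = intrinsic_reward_loss_coef * (fwd_loss + inv_loss)
+    opt.zero_grad()
+    loss.backward()
+    opt.step()
+    with torch.no_grad():
+        return (embed(next_obs) - embed(obs)).norm(dim=-1)  # large when the embedding changes a lot
+
+bonus = ride_step(torch.randn(32, obs_dim), torch.randint(0, n_actions, (32,)), torch.randn(32, obs_dim))
+```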
+
+# 545
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 1.0
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 0.5
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 0.1
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 0.05
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 0.01
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 0.005
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 0.001
+
+# 845
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 1.0
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 0.5
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 0.1
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 0.05
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 0.01
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 0.005
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 0.001
+
+# 544
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 1.0
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 0.5
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 0.1
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 0.05
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 0.01
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 0.005
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_ride_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --*exploration-bonus-type ride --*intrinsic-reward-loss-coef 0.001
diff --git a/run_ppo_rnd_gs.txt b/run_ppo_rnd_gs.txt
new file mode 100644
index 0000000000000000000000000000000000000000..2f161b7301420811857f6fd9ff7a8c310fae4f66
--- /dev/null
+++ b/run_ppo_rnd_gs.txt
@@ -0,0 +1,34 @@
+# Selected parameter sets from the PPO grid search:
+# PPO: 543, 845, 544
+
+# Ask pointing boxes
+
+# TODO: check the custom PPO-RND reference configuration (are rewards clipped, etc.?):
+# --custom-ppo-rnd-reference --exploration-bonus --exploration-bonus-type rnd
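+# For context on the `rnd` bonus swept below: Random Network Distillation scores novelty as a trained predictor's error
+# against a fixed, randomly initialised target network. The sketch below is generic and illustrative, not this
+# repository's implementation; sizes and the way --intrinsic-reward-loss-coef is applied are assumptions.
+```python
+import torch
+import torch.nn as nn
+
+obs_dim, feat_dim = 147, 128  # illustrative sizes
+
+target = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
+predictor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
+for p in target.parameters():
+    p.requires_grad_(False)  # the target network stays fixed
+
+opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)
+
+def rnd_step(obs, intrinsic_reward_loss_coef=0.1):
+    """Update the predictor and return the per-state novelty bonus."""
+    with torch.no_grad():
+        target_feat = target(obs)
+    error = (predictor(obs) - target_feat).pow(2).mean(dim=-1)
+    loss = intrinsic_reward_loss_coef * error.mean()
+    opt.zero_grad()
+    loss.backward()
+    opt.step()
+    return error.detach()  # larger for observations the predictor has not fit yet
+
+bonus = rnd_step(torch.randn(32, obs_dim))
+```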
+
+# 545
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 1.0
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 0.5
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 0.1
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 0.05
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 0.01
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 0.005
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 0.001
+
+# 845
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 1.0
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 0.5
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 0.1
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 0.05
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 0.01
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 0.005
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 0.001
+
+# 544
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 1.0
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 0.5
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 0.1
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 0.05
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 0.01
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 0.005
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_rnd_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --*exploration-bonus-type rnd --*intrinsic-reward-loss-coef 0.001
diff --git a/run_ppo_soc_inf_gs.txt b/run_ppo_soc_inf_gs.txt
new file mode 100644
index 0000000000000000000000000000000000000000..06dc19c69c607dd08b446c5e5fc7114cad6a5c06
--- /dev/null
+++ b/run_ppo_soc_inf_gs.txt
@@ -0,0 +1,53 @@
+# Selected parameter sets from the PPO grid search:
+# PPO: 543, 845, 544
+
+# soc inf grid search
+# Ask pointing boxes
+
+# 545
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 0.1
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 0.1
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 5 --*intrinsic-reward-coef 0.1
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 0.1
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 30 --*intrinsic-reward-coef 0.1
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 0.1
+
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 1
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 1
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 5 --*intrinsic-reward-coef 1
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 1
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 30 --*intrinsic-reward-coef 1
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 1
+
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 5
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 5
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 5 --*intrinsic-reward-coef 5
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 5
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 30 --*intrinsic-reward-coef 5
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 5
+
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 10
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 10
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 5 --*intrinsic-reward-coef 10
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 10
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 30 --*intrinsic-reward-coef 10
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 10
+
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 30
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 30
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 5 --*intrinsic-reward-coef 30
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 30
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 30 --*intrinsic-reward-coef 30
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 30
+
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 100
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 100
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 5 --*intrinsic-reward-coef 100
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 100
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 30 --*intrinsic-reward-coef 100
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 100
+
+## 845
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-08 --*lr 0.0001 --*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 0.1
+## 544
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 --model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 --*entropy-coef 0.0001 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-type soc_inf --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 0.1
\ No newline at end of file
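For reference, the 30 launch lines in the `ask_point_boxes` block above are the Cartesian product of the two swept bonus coefficients (`--*intrinsic-reward-loss-coef` × `--*intrinsic-reward-coef`) appended to one fixed base command. Below is a minimal, hypothetical sketch of how such a file could be regenerated; it is not part of this patch, and the output file name and the way these `.txt` files are consumed by the launcher are assumptions:

```python
# Hypothetical generator (not part of this patch): rebuilds the grid-search
# launch lines shown above as a product of the two swept coefficients.

# Flags shared by every run in the ask_point_boxes grid search, copied from the lines above.
BASE = (
    "--slurm_conf jz_short_gpu_chained --nb_seeds 4 --frames 60000000 "
    "--model PPO_soc_inf_grid_search_ask_point_boxes --algo ppo --dialogue "
    "--save-interval 100 --log-interval 100 --frames-per-proc 40 "
    "--multi-modal-babyai11-agent --arch original_endpool_res "
    "--env SocialAI-AskPointingBoxesInformationSeekingParamEnv-v1 "
    "--clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 "
    "--max-grad-norm 0.5 --epochs 4 --*optim-eps 1e-05 --*lr 0.0001 "
    "--*entropy-coef 0.00001 --exploration-bonus --episodic-exploration-bonus "
    "--*exploration-bonus-type soc_inf"
)

# Swept values, read off the blocks above (outer loop: reward coef, inner loop: loss coef).
LOSS_COEFS = [0.1, 1, 5, 10, 30, 100]
REWARD_COEFS = [1, 5, 10, 30, 100]

lines = []
for reward in REWARD_COEFS:
    for loss in LOSS_COEFS:
        lines.append(
            f"{BASE} --*intrinsic-reward-loss-coef {loss} --*intrinsic-reward-coef {reward}"
        )
    lines.append("")  # blank line between blocks, as in the file above

# Assumed output path; the real campaign file name may differ.
with open("run_soc_inf_ask_point_boxes.txt", "w") as f:
    f.write("\n".join(lines).rstrip("\n") + "\n")
```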
diff --git a/run_soc_inf_gs.txt b/run_soc_inf_gs.txt
new file mode 100644
index 0000000000000000000000000000000000000000..9720bb54d06c8addf8873ff4492ae8d9ffac7885
--- /dev/null
+++ b/run_soc_inf_gs.txt
@@ -0,0 +1,108 @@
+# soc inf
+#--slurm_conf jz_long_2gpus --nb_seeds 4 --model Social_influence_experiments --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --multi-modal-babyai11-agent --arch original_endpool_res --ppo-hp-tuning --*env SocialAI-DummyParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards
+#--slurm_conf jz_long_2gpus --nb_seeds 4 --model Social_influence_experiments --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --multi-modal-babyai11-agent --arch original_endpool_res --ppo-hp-tuning --*env SocialAI-DummyParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 10 --*max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 0.1 --*optim-eps 1e-05 --*epochs 4 --*lr 0.0001
+
+# no bonus
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search_no_bonus --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --optim-eps 1e-05 --epochs 4 --*lr 0.01
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search_no_bonus --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --optim-eps 1e-05 --epochs 4 --*lr 0.001
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search_no_bonus --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --optim-eps 1e-05 --epochs 4 --*lr 0.0005
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search_no_bonus --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --optim-eps 1e-05 --epochs 4 --*lr 0.0001
+
+# grid search
+
+# loss coef = 0.1
+
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 1 --optim-eps 1e-05 --epochs 4 --*lr 0.01
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 1 --optim-eps 1e-05 --epochs 4 --*lr 0.001
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 1 --optim-eps 1e-05 --epochs 4 --*lr 0.0005
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 1 --optim-eps 1e-05 --epochs 4 --*lr 0.0001
+
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 5 --optim-eps 1e-05 --epochs 4 --*lr 0.01
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 5 --optim-eps 1e-05 --epochs 4 --*lr 0.001
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 5 --optim-eps 1e-05 --epochs 4 --*lr 0.0005
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 5 --optim-eps 1e-05 --epochs 4 --*lr 0.0001
+
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 10 --optim-eps 1e-05 --epochs 4 --*lr 0.01
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 10 --optim-eps 1e-05 --epochs 4 --*lr 0.001
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 10 --optim-eps 1e-05 --epochs 4 --*lr 0.0005
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 10 --optim-eps 1e-05 --epochs 4 --*lr 0.0001
+
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 100 --optim-eps 1e-05 --epochs 4 --*lr 0.01
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 100 --optim-eps 1e-05 --epochs 4 --*lr 0.001
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 100 --optim-eps 1e-05 --epochs 4 --*lr 0.0005
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 100 --optim-eps 1e-05 --epochs 4 --*lr 0.0001
+
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 1000 --optim-eps 1e-05 --epochs 4 --*lr 0.01
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 1000 --optim-eps 1e-05 --epochs 4 --*lr 0.001
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 1000 --optim-eps 1e-05 --epochs 4 --*lr 0.0005
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 0.1 --*intrinsic-reward-coef 1000 --optim-eps 1e-05 --epochs 4 --*lr 0.0001
+
+## loss coef = 1
+
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 1 --optim-eps 1e-05 --epochs 4 --*lr 0.01
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 1 --optim-eps 1e-05 --epochs 4 --*lr 0.001
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 1 --optim-eps 1e-05 --epochs 4 --*lr 0.0005
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 1 --optim-eps 1e-05 --epochs 4 --*lr 0.0001
+
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 5 --optim-eps 1e-05 --epochs 4 --*lr 0.01
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 5 --optim-eps 1e-05 --epochs 4 --*lr 0.001
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 5 --optim-eps 1e-05 --epochs 4 --*lr 0.0005
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 5 --optim-eps 1e-05 --epochs 4 --*lr 0.0001
+
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 10 --optim-eps 1e-05 --epochs 4 --*lr 0.01
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 10 --optim-eps 1e-05 --epochs 4 --*lr 0.001
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 10 --optim-eps 1e-05 --epochs 4 --*lr 0.0005
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 10 --optim-eps 1e-05 --epochs 4 --*lr 0.0001
+
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 100 --optim-eps 1e-05 --epochs 4 --*lr 0.01
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 100 --optim-eps 1e-05 --epochs 4 --*lr 0.001
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 100 --optim-eps 1e-05 --epochs 4 --*lr 0.0005
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 100 --optim-eps 1e-05 --epochs 4 --*lr 0.0001
+
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 1000 --optim-eps 1e-05 --epochs 4 --*lr 0.01
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 1000 --optim-eps 1e-05 --epochs 4 --*lr 0.001
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 1000 --optim-eps 1e-05 --epochs 4 --*lr 0.0005
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 1 --*intrinsic-reward-coef 1000 --optim-eps 1e-05 --epochs 4 --*lr 0.0001
+
+## loss coef = 10
+
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 1 --optim-eps 1e-05 --epochs 4 --*lr 0.01
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 1 --optim-eps 1e-05 --epochs 4 --*lr 0.001
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 1 --optim-eps 1e-05 --epochs 4 --*lr 0.0005
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 1 --optim-eps 1e-05 --epochs 4 --*lr 0.0001
+
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 10 --optim-eps 1e-05 --epochs 4 --*lr 0.01
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 10 --optim-eps 1e-05 --epochs 4 --*lr 0.001
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 10 --optim-eps 1e-05 --epochs 4 --*lr 0.0005
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 10 --optim-eps 1e-05 --epochs 4 --*lr 0.0001
+
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 100 --optim-eps 1e-05 --epochs 4 --*lr 0.01
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 100 --optim-eps 1e-05 --epochs 4 --*lr 0.001
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 100 --optim-eps 1e-05 --epochs 4 --*lr 0.0005
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 100 --optim-eps 1e-05 --epochs 4 --*lr 0.0001
+
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 1000 --optim-eps 1e-05 --epochs 4 --*lr 0.01
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 1000 --optim-eps 1e-05 --epochs 4 --*lr 0.001
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 1000 --optim-eps 1e-05 --epochs 4 --*lr 0.0005
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 10 --*intrinsic-reward-coef 1000 --optim-eps 1e-05 --epochs 4 --*lr 0.0001
+
+# loss coef = 100
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 1 --optim-eps 1e-05 --epochs 4 --*lr 0.01
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 1 --optim-eps 1e-05 --epochs 4 --*lr 0.001
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 1 --optim-eps 1e-05 --epochs 4 --*lr 0.0005
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 1 --optim-eps 1e-05 --epochs 4 --*lr 0.0001
+
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 10 --optim-eps 1e-05 --epochs 4 --*lr 0.01
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 10 --optim-eps 1e-05 --epochs 4 --*lr 0.001
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 10 --optim-eps 1e-05 --epochs 4 --*lr 0.0005
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 10 --optim-eps 1e-05 --epochs 4 --*lr 0.0001
+
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 100 --optim-eps 1e-05 --epochs 4 --*lr 0.01
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 100 --optim-eps 1e-05 --epochs 4 --*lr 0.001
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 100 --optim-eps 1e-05 --epochs 4 --*lr 0.0005
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 100 --optim-eps 1e-05 --epochs 4 --*lr 0.0001
+
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 1000 --optim-eps 1e-05 --epochs 4 --*lr 0.01
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 1000 --optim-eps 1e-05 --epochs 4 --*lr 0.001
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 1000 --optim-eps 1e-05 --epochs 4 --*lr 0.0005
+--slurm_conf jz_super_short_gpu --nb_seeds 4 --model Social_influence_Boxes_Pointing_grid_search --algo ppo -cs --frames 100000000 --dialogue --save-interval 100 --log-interval 100 --frames-per-proc 40 --multi-modal-babyai11-agent --arch original_endpool_res --*env SocialAI-EyeContactBoxes2PointingInformationSeekingParamEnv-v1 --exploration-bonus --*exploration-bonus-type soc_inf --clipped-rewards --entropy-coef 0.01 --optim-eps 1e-5 --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --*intrinsic-reward-loss-coef 100 --*intrinsic-reward-coef 1000 --optim-eps 1e-05 --epochs 4 --*lr 0.0001
\ No newline at end of file
diff --git a/run_test_rnd_ride.txt b/run_test_rnd_ride.txt
new file mode 100644
index 0000000000000000000000000000000000000000..ec3922ba53b5496993b701dc10947f52022fc8d2
--- /dev/null
+++ b/run_test_rnd_ride.txt
@@ -0,0 +1,80 @@
+## squeue -u utu57ed -i 1 -o "%.18i %.9P %.130j %.8u %.2t %.10M %.6D %R"
+##
+### RND
+#--slurm_conf jz_medium_gpu --nb_seeds 8 --model test_rnd_mask --compact-save --algo ppo --*env MiniGrid-MultiRoom-N2-S4-v0 --save-interval 10 --frames 50000000 --arch original_endpool_res --custom-ppo-rnd-reference --exploration-bonus --exploration-bonus-type rnd
+#--slurm_conf jz_medium_gpu --nb_seeds 8 --model test_rnd_mask --compact-save --algo ppo --*env MiniGrid-MultiRoom-N4-S5-v0 --save-interval 10 --frames 50000000 --arch original_endpool_res --custom-ppo-rnd-reference --exploration-bonus --exploration-bonus-type rnd
+#--slurm_conf jz_medium_gpu --nb_seeds 8 --model test_rnd_mask --compact-save --algo ppo --*env MiniGrid-MultiRoomNoisyTV-N7-S4-v0 --save-interval 10 --frames 50000000 --arch original_endpool_res --custom-ppo-rnd-reference --exploration-bonus --exploration-bonus-type rnd
+#--slurm_conf jz_medium_gpu --nb_seeds 8 --model test_rnd_mask --compact-save --algo ppo --*env MiniGrid-KeyCorridorS3R3-v0 --save-interval 10 --frames 50000000 --arch original_endpool_res --custom-ppo-rnd-reference --exploration-bonus --exploration-bonus-type rnd
+### Basic
+#--slurm_conf jz_medium_gpu --nb_seeds 8 --model test_raw --compact-save --algo ppo --*env MiniGrid-MultiRoom-N2-S4-v0 --save-interval 10 --frames 50000000 --arch original_endpool_res --custom-ppo-rnd-reference
+#--slurm_conf jz_medium_gpu --nb_seeds 8 --model test_raw --compact-save --algo ppo --*env MiniGrid-MultiRoom-N4-S5-v0 --save-interval 10 --frames 50000000 --arch original_endpool_res --custom-ppo-rnd-reference
+#--slurm_conf jz_medium_gpu --nb_seeds 8 --model test_raw --compact-save --algo ppo --*env MiniGrid-MultiRoom-N7-S4-v0 --save-interval 10 --frames 50000000 --arch original_endpool_res --custom-ppo-rnd-reference
+#--slurm_conf jz_medium_gpu --nb_seeds 8 --model test_raw --compact-save --algo ppo --*env MiniGrid-MultiRoomNoisyTV-N7-S4-v0 --save-interval 10 --frames 50000000 --arch original_endpool_res --custom-ppo-rnd-reference
+#--slurm_conf jz_medium_gpu --nb_seeds 8 --model test_raw --compact-save --algo ppo --*env MiniGrid-KeyCorridorS3R3-v0 --save-interval 10 --frames 50000000 --arch original_endpool_res --custom-ppo-rnd-reference
+### RIDE
+#--slurm_conf jz_medium_gpu --nb_seeds 8 --model test_ride --compact-save --algo ppo --*env MiniGrid-MultiRoom-N2-S4-v0 --save-interval 10 --frames 50000000 --arch original_endpool_res --custom-ppo-ride-reference --exploration-bonus --exploration-bonus-type ride
+#--slurm_conf jz_medium_gpu --nb_seeds 8 --model test_ride --compact-save --algo ppo --*env MiniGrid-MultiRoom-N4-S5-v0 --save-interval 10 --frames 50000000 --arch original_endpool_res --custom-ppo-ride-reference --exploration-bonus --exploration-bonus-type ride
+#--slurm_conf jz_medium_gpu --nb_seeds 8 --model test_ride --compact-save --algo ppo --*env MiniGrid-MultiRoom-N7-S4-v0 --save-interval 10 --frames 50000000 --arch original_endpool_res --custom-ppo-ride-reference --exploration-bonus --exploration-bonus-type ride
+#--slurm_conf jz_medium_gpu --nb_seeds 8 --model test_ride --compact-save --algo ppo --*env MiniGrid-MultiRoomNoisyTV-N7-S4-v0 --save-interval 10 --frames 50000000 --arch original_endpool_res --custom-ppo-ride-reference --exploration-bonus --exploration-bonus-type ride
+#--slurm_conf jz_medium_gpu --nb_seeds 8 --model test_ride --compact-save --algo ppo --*env MiniGrid-KeyCorridorS3R3-v0 --save-interval 10 --frames 50000000 --arch original_endpool_res --custom-ppo-ride-reference --exploration-bonus --exploration-bonus-type ride
+#
+
+# old
+#--slurm_conf jz_medium_gpu --nb_seeds 8 --model ref_rnd --compact-save --algo ppo --*env MiniGrid-MultiRoom-N7-S4-v0 --save-interval 10 --frames 50000000 --arch original_endpool_res --*custom-ppo-rnd --exploration-bonus --exploration-bonus-type rnd
+#--slurm_conf jz_medium_gpu --nb_seeds 8 --model ref_ride --compact-save --algo ppo --*env MiniGrid-MultiRoom-N7-S4-v0 --save-interval 10 --frames 50000000 --arch original_endpool_res --*custom-ppo-ride --exploration-bonus --exploration-bonus-type ride
+#--slurm_conf jz_medium_gpu --nb_seeds 8 --model kc_ref_rnd --compact-save --algo ppo --*env MiniGrid-KeyCorridorS3R3-v0 --save-interval 10 --frames 50000000 --arch original_endpool_res --*custom-ppo-rnd --exploration-bonus --exploration-bonus-type rnd
+#--slurm_conf jz_medium_gpu --nb_seeds 8 --model kc_ref_ride --compact-save --algo ppo --*env MiniGrid-KeyCorridorS3R3-v0 --save-interval 10 --frames 50000000 --arch original_endpool_res --*custom-ppo-ride --exploration-bonus --exploration-bonus-type ride
+
+# MultiRoom N7 S4
+# with vs ride parameters
+
+## memory and Ref model
+## no
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 16 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 60000000 --model new_ref_no --ride-ref-preprocessor --compact-save --algo ppo --*env MiniGrid-MultiRoom-N7-S4-v0 --save-interval 10 --ride-ref-agent --*custom-ppo-rnd-reference
+## rnd
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 16 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 60000000 --model new_ref_rnd --ride-ref-preprocessor --compact-save --algo ppo --*env MiniGrid-MultiRoom-N7-S4-v0 --save-interval 10 --ride-ref-agent --*custom-ppo-rnd-reference --exploration-bonus --*exploration-bonus-type rnd
+## ride
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 16 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 60000000 --model new_ref_ride --ride-ref-preprocessor --compact-save --algo ppo --*env MiniGrid-MultiRoom-N7-S4-v0 --save-interval 10 --ride-ref-agent --*custom-ppo-ride-reference --exploration-bonus --*exploration-bonus-type ride
+##
+#
+## with vs ride parameters: key corridor
+## no
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 16 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 60000000 --model new_kc_ref_no --ride-ref-preprocessor --compact-save --algo ppo --*env MiniGrid-KeyCorridorS3R3-v0 --save-interval 10 --ride-ref-agent --*custom-ppo-rnd-reference
+## rnd
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 16 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 60000000 --model new_kc_ref_rnd --ride-ref-preprocessor --compact-save --algo ppo --*env MiniGrid-KeyCorridorS3R3-v0 --save-interval 10 --ride-ref-agent --*custom-ppo-rnd-reference --exploration-bonus --*exploration-bonus-type rnd
+## ride
+#--slurm_conf jz_short_gpu_chained --nb_seeds 8 --cpu_cores_per_seed 16 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 60000000 --model new_kc_ref_ride --ride-ref-preprocessor --compact-save --algo ppo --*env MiniGrid-KeyCorridorS3R3-v0 --save-interval 10 --ride-ref-agent --*custom-ppo-ride-reference --exploration-bonus --*exploration-bonus-type ride
+#
+## conclusion: it doesn't work with our params, we have to use the reference params
+## NOTE: recurrence was also 1 here because these are their envs; for our envs we have to change the recurrence
+
+# testing
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 16 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 60000000 --model testing_ref_rnd_agent --compact-save --algo ppo --*env MiniGrid-MultiRoom-N7-S4-v0 --save-interval 10 --ride-ref-agent --*custom-ppo-rnd-reference --exploration-bonus --*exploration-bonus-type rnd
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 16 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 60000000 --model testing_ref_rnd_preproc --ride-ref-preprocessor --compact-save --algo ppo --*env MiniGrid-MultiRoom-N7-S4-v0 --save-interval 10 --*custom-ppo-rnd-reference --exploration-bonus --*exploration-bonus-type rnd
+
+# testing
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 16 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 60000000 --model testing_kc_ref_ride_agent --compact-save --algo ppo --*env MiniGrid-KeyCorridorS3R3-v0 --save-interval 10 --ride-ref-agent --*custom-ppo-ride-reference --exploration-bonus --*exploration-bonus-type ride
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 16 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 60000000 --model testing_kc_ref_ride_preproc --ride-ref-preprocessor --compact-save --algo ppo --*env MiniGrid-KeyCorridorS3R3-v0 --save-interval 10 --*custom-ppo-ride-reference --exploration-bonus --*exploration-bonus-type ride
+
+
+# Ref model
+# no
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 16 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 60000000 --model agent_REF_no --compact-save --algo ppo --*env MiniGrid-MultiRoom-N7-S4-v0 --save-interval 10 --ride-ref-agent --*custom-ppo-rnd-reference
+# rnd
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 16 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 60000000 --model agent_REF_rnd --compact-save --algo ppo --*env MiniGrid-MultiRoom-N7-S4-v0 --save-interval 10 --ride-ref-agent --*custom-ppo-rnd-reference --exploration-bonus --*exploration-bonus-type rnd
+# ride
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 16 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 60000000 --model agent_REF_ride --compact-save --algo ppo --*env MiniGrid-MultiRoom-N7-S4-v0 --save-interval 10 --ride-ref-agent --*custom-ppo-ride-reference --exploration-bonus --*exploration-bonus-type ride
+
+##
+#
+## with vs ride parameters: key corridor
+## no
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 16 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 60000000 --model agent_kc_REF_no --compact-save --algo ppo --*env MiniGrid-KeyCorridorS3R3-v0 --save-interval 10 --ride-ref-agent --*custom-ppo-rnd-reference
+## rnd
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 16 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 60000000 --model agent_kc_REF_rnd --compact-save --algo ppo --*env MiniGrid-KeyCorridorS3R3-v0 --save-interval 10 --ride-ref-agent --*custom-ppo-rnd-reference --exploration-bonus --*exploration-bonus-type rnd
+## ride
+--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 16 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 60000000 --model agent_kc_REF_ride --compact-save --algo ppo --*env MiniGrid-KeyCorridorS3R3-v0 --save-interval 10 --ride-ref-agent --*custom-ppo-ride-reference --exploration-bonus --*exploration-bonus-type ride
+## ride small reward
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 16 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 60000000 --model agent_kc_ref_ride_small_rew --compact-save --algo ppo --*env MiniGrid-KeyCorridorS3R3-v0 --save-interval 10 --ride-ref-agent --*custom-ppo-ride-reference --exploration-bonus --*exploration-bonus-type ride --intrinsic-reward-coef 0.05
+## rnd small reward
+#--slurm_conf jz_short_gpu_chained --nb_seeds 4 --cpu_cores_per_seed 16 --gpus_per_seed 0.5 --seeds_per_launch 2 --frames 60000000 --model agent_kc_ref_rnd_small_rew --compact-save --algo ppo --*env MiniGrid-KeyCorridorS3R3-v0 --save-interval 10 --ride-ref-agent --*custom-ppo-rnd-reference --exploration-bonus --*exploration-bonus-type rnd --intrinsic-reward-coef 0.05
diff --git a/run_vigil.txt b/run_vigil.txt
new file mode 100644
index 0000000000000000000000000000000000000000..14132570994b8bd0e97453fba413c6551ec9354f
--- /dev/null
+++ b/run_vigil.txt
@@ -0,0 +1,52 @@
+## squeue -u utu57ed -i 1 -o "%.18i %.9P %.130j %.8u %.2t %.10M %.6D %R"
+#
+#
+## Basic
+--slurm_conf jz_medium_2gpus_32g --nb_seeds 16 --model RERUN_WizardGuide_lang64_mm_baby_short_rec --algo ppo --*env MiniGrid-TalkItOutNoLiar-8x8-v0 --dialogue --save-interval 10 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-params 5 50
+--slurm_conf jz_medium_2gpus_32g --nb_seeds 16 --model RERUN_WizardTwoGuides_lang64_mm_baby_short_rec --algo ppo --*env MiniGrid-TalkItOut-8x8-v0 --dialogue --save-interval 10 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-params 5 50
+#
+#
+## DEAF
+--slurm_conf jz_medium_2gpus_32g --nb_seeds 16 --model RERUN_WizardGuide_lang64_deaf_no_explo_mm_baby_short_rec --algo ppo --*env MiniGrid-TalkItOutNoLiar-8x8-v0 --save-interval 10 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2
+--slurm_conf jz_medium_2gpus_32g --nb_seeds 16 --model RERUN_WizardTwoGuides_lang64_deaf_no_explo_mm_baby_short_rec --algo ppo --*env MiniGrid-TalkItOut-8x8-v0 --save-interval 10 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2
+#
+#
+#
+### No exploration bonus
+--slurm_conf jz_medium_2gpus_32g --nb_seeds 16 --model RERUN_WizardGuide_lang64_no_explo_mm_baby_short_rec --algo ppo --*env MiniGrid-TalkItOutNoLiar-8x8-v0 --dialogue --save-interval 10 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2
+--slurm_conf jz_medium_2gpus_32g --nb_seeds 16 --model RERUN_WizardTwoGuides_lang64_no_explo_mm_baby_short_rec --algo ppo --*env MiniGrid-TalkItOut-8x8-v0 --dialogue --save-interval 10 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2
+#
+#
+#
+### BOW
+--slurm_conf jz_medium_2gpus_32g --nb_seeds 16 --model RERUN_WizardGuide_lang64_bow_mm_baby_short_rec --algo ppo --*env MiniGrid-TalkItOutNoLiar-8x8-v0 --dialogue --save-interval 10 --frames 30000000 --*multi-modal-babyai11-agent --*arch bow_endpool_res --*custom-ppo-2 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-params 5 50
+--slurm_conf jz_medium_2gpus_32g --nb_seeds 16 --model RERUN_WizardTwoGuides_lang64_bow_mm_baby_short_rec --algo ppo --*env MiniGrid-TalkItOut-8x8-v0 --dialogue --save-interval 10 --frames 30000000 --*multi-modal-babyai11-agent --*arch bow_endpool_res --*custom-ppo-2 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-params 5 50
+#
+#
+#
+### No memory
+--slurm_conf jz_medium_2gpus_32g --nb_seeds 16 --model RERUN_WizardGuide_lang64_no_mem_mm_baby_short_rec --algo ppo --*env MiniGrid-TalkItOutNoLiar-8x8-v0 --dialogue --save-interval 10 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-3 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-params 5 50
+--slurm_conf jz_medium_2gpus_32g --nb_seeds 16 --model RERUN_WizardTwoGuides_lang64_no_mem_mm_baby_short_rec --algo ppo --*env MiniGrid-TalkItOut-8x8-v0 --dialogue --save-interval 10 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-3 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-params 5 50
+#
+#
+##
+### Bigru
+--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model RERUN_WizardGuide_lang64_bigru_mm_baby_short_rec --algo ppo --*env MiniGrid-TalkItOutNoLiar-8x8-v0 --dialogue --save-interval 10 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-params 5 50 --bAI-lang-model bigru
+--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model RERUN_WizardTwoGuides_lang64_bigru_mm_baby_short_rec --algo ppo --*env MiniGrid-TalkItOut-8x8-v0 --dialogue --save-interval 10 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-params 5 50 --bAI-lang-model bigru
+### Attgru
+--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model RERUN_WizardGuide_lang64_attgru_mm_baby_short_rec --algo ppo --*env MiniGrid-TalkItOutNoLiar-8x8-v0 --dialogue --save-interval 10 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-params 5 50 --bAI-lang-model attgru
+--slurm_conf jz_long_2gpus_32g --nb_seeds 16 --model RERUN_WizardTwoGuides_lang64_attgru_mm_baby_short_rec --algo ppo --*env MiniGrid-TalkItOut-8x8-v0 --dialogue --save-interval 10 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-params 5 50 --bAI-lang-model attgru
+#
+#
+#
+### Nameless
+--slurm_conf jz_medium_2gpus_32g --nb_seeds 16 --model RERUN_WizardGuide_lang64_nameless_mm_baby_short_rec --algo ppo --*env MiniGrid-TalkItOutNoLiarNameless-8x8-v0 --dialogue --save-interval 10 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-params 5 50
+--slurm_conf jz_medium_2gpus_32g --nb_seeds 16 --model RERUN_WizardTwoGuides_lang64_nameless_mm_baby_short_rec --algo ppo --*env MiniGrid-TalkItOutNameless-8x8-v0 --dialogue --save-interval 10 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-params 5 50
+#
+### Nameless no memory
+--slurm_conf jz_medium_2gpus --nb_seeds 16 --model RERUN_WizardGuide_lang64_nameless_no_mem_mm_baby_short_rec --algo ppo --*env MiniGrid-TalkItOutNoLiarNameless-8x8-v0 --dialogue --save-interval 10 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-3 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-params 5 50
+--slurm_conf jz_medium_2gpus --nb_seeds 16 --model RERUN_WizardTwoGuides_lang64_nameless_no_mem_mm_baby_short_rec --algo ppo --*env MiniGrid-TalkItOutNameless-8x8-v0 --dialogue --save-interval 10 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-3 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-params 5 50
+#
+#
+### Current dialogue Only
+--slurm_conf jz_medium_2gpus_32g --nb_seeds 16 --model RERUN_WizardGuide_lang64_curr_dial_only_mm_baby_short_rec --algo ppo --*env MiniGrid-TalkItOutNoLiar-8x8-v0 --current-dialogue-only --save-interval 10 --frames 30000000 --*multi-modal-babyai11-agent --*arch original_endpool_res --*custom-ppo-2 --exploration-bonus --episodic-exploration-bonus --*exploration-bonus-params 5 50
\ No newline at end of file
diff --git a/scripts/LLM_test.py b/scripts/LLM_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..dd4a693709b856ef7f521d9538b7e5f0932c8d79
--- /dev/null
+++ b/scripts/LLM_test.py
@@ -0,0 +1,926 @@
+import argparse
+import json
+import requests
+import time
+import warnings
+from n_tokens import estimate_price
+import pickle
+
+import numpy as np
+import torch
+from pathlib import Path
+
+from utils.babyai_utils.baby_agent import load_agent
+from utils import *
+from models import *
+import re
+import random
+import subprocess
+import os
+
+from matplotlib import pyplot as plt
+
+from gym_minigrid.wrappers import *
+from gym_minigrid.window import Window
+from datetime import datetime
+
+from imageio import mimsave
+
+def new_episode_marker():
+ return "New episode.\n"
+
+def success_marker():
+ return "Success!\n"
+
+def failure_marker():
+ return "Failure!\n"
+
+def action_query():
+ return "Act :"
+
+def get_parsed_action(text_action):
+ """
+ Parses the text generated by a model and extracts the action
+ """
+
+ if "move forward" in text_action:
+ return "move forward"
+
+ elif "done" in text_action:
+ return "done"
+
+ elif "turn left" in text_action:
+ return "turn left"
+
+ elif "turn right" in text_action:
+ return "turn right"
+
+ elif "toggle" in text_action:
+ return "toggle"
+
+ elif "no_op" in text_action:
+ return "no_op"
+ else:
+ warnings.warn(f"Undefined action {text_action}")
+ return "no_op"
+
+
+def action_to_prompt_action_text(action):
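+    """
+    Maps a raw env action ([primitive_action, nan, nan]) to the textual
+    "Act : <action>" prompt line; an all-nan action is rendered as "no_op"
+    """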
+ if np.allclose(action, [int(env.actions.forward), np.nan, np.nan], equal_nan=True):
+ # 2
+ text_action = "move forward"
+
+ elif np.allclose(action, [int(env.actions.left), np.nan, np.nan], equal_nan=True):
+ # 0
+ text_action = "turn left"
+
+ elif np.allclose(action, [int(env.actions.right), np.nan, np.nan], equal_nan=True):
+ # 1
+ text_action = "turn right"
+
+ elif np.allclose(action, [int(env.actions.toggle), np.nan, np.nan], equal_nan=True):
+ # 3
+ text_action = "toggle"
+
+ elif np.allclose(action, [int(env.actions.done), np.nan, np.nan], equal_nan=True):
+ # 4
+ text_action = "done"
+
+ elif np.allclose(action, [np.nan, np.nan, np.nan], equal_nan=True):
+ text_action = "no_op"
+
+ else:
+ warnings.warn(f"Undefined action {action}")
+ return "no_op"
+
+ return f"{action_query()} {text_action}\n"
+
+
+
+def text_action_to_action(text_action):
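+    """
+    Inverse of action_to_prompt_action_text: maps a parsed textual action
+    (e.g. "turn left") back to the raw env action format [primitive_action, nan, nan]
+    """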
+
+ # text_action = get_parsed_action(text_action)
+
+ if "move forward" == text_action:
+ action = [int(env.actions.forward), np.nan, np.nan]
+
+ elif "turn left" == text_action:
+ action = [int(env.actions.left), np.nan, np.nan]
+
+ elif "turn right" == text_action:
+ action = [int(env.actions.right), np.nan, np.nan]
+
+ elif "toggle" == text_action:
+ action = [int(env.actions.toggle), np.nan, np.nan]
+
+ elif "done" == text_action:
+ action = [int(env.actions.done), np.nan, np.nan]
+
+ elif "no_op" == text_action:
+ action = [np.nan, np.nan, np.nan]
+
+ return action
+
+
+def prompt_preprocessor(llm_prompt):
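+    """
+    Cleans a raw prompt: drops comment ("#") and "Conversation" header lines,
+    keeps only the caretaker's location from peer descriptions, and rewrites
+    "Caretaker:" utterances as "Caretaker says: "
+    """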
+ # remove peer observations
+ lines = llm_prompt.split("\n")
+ new_lines = []
+ for line in lines:
+ if line.startswith("#"):
+ continue
+
+ elif line.startswith("Conversation"):
+ continue
+
+ elif "peer" in line:
+ caretaker = True
+ if caretaker:
+ # show only the location of the caretaker
+
+ # this is very ugly, todo: refactor this
+ assert "there is a" in line
+ start_index = line.index('there is a') + 11
+ new_line = line[:start_index] + 'caretaker'
+
+ new_lines.append(new_line)
+
+ else:
+ # no caretaker at all
+ if line.startswith("Obs :") and "peer" in line:
+ # remove only the peer descriptions
+ line = "Obs :"
+ new_lines.append(line)
+ else:
+ assert "peer" in line
+
+ elif "Caretaker:" in line:
+ line = line.replace("Caretaker:", "Caretaker says: ")
+ new_lines.append(line)
+
+ else:
+ new_lines.append(line)
+
+ return "\n".join(new_lines)
+
+def generate_text_obs(obs, info):
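+    """
+    Builds the "Obs : ..." block of the prompt from the textual scene
+    description (obs_to_text) and the utterance history, if any
+    """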
+
+ text_observation = obs_to_text(info)
+
+ llm_prompt = "Obs : "
+ llm_prompt += "".join(text_observation)
+
+ # add utterances
+ if obs["utterance_history"] != "Conversation: \n":
+ utt_hist = obs['utterance_history']
+ utt_hist = utt_hist.replace("Conversation: \n","")
+ llm_prompt += utt_hist
+
+ return llm_prompt
+
+def obs_to_text(info):
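+    """
+    Converts the agent's egocentric grid observation into a list of natural
+    language descriptions: the carried object, (optionally) nearby walls, and
+    every visible object/NPC with its relative position and state
+    """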
+ image, vis_mask = info["image"], info["vis_mask"]
+ carrying = info["carrying"]
+ agent_pos_vx, agent_pos_vy = info["agent_pos_vx"], info["agent_pos_vy"]
+ npc_actions_dict = info["npc_actions_dict"]
+
+ # (OBJECT_TO_IDX[self.type], COLOR_TO_IDX[self.color], state)
+ # State, 0: open, 1: closed, 2: locked
+ IDX_TO_COLOR = dict(zip(COLOR_TO_IDX.values(), COLOR_TO_IDX.keys()))
+ IDX_TO_OBJECT = dict(zip(OBJECT_TO_IDX.values(), OBJECT_TO_IDX.keys()))
+
+ list_textual_descriptions = []
+
+ if carrying is not None:
+ list_textual_descriptions.append("You carry a {} {}".format(carrying.color, carrying.type))
+
+ # agent_pos_vx, agent_pos_vy = self.get_view_coords(self.agent_pos[0], self.agent_pos[1])
+
+ view_field_dictionary = dict()
+
+ for i in range(image.shape[0]):
+ for j in range(image.shape[1]):
+ if image[i][j][0] != 0 and image[i][j][0] != 1 and image[i][j][0] != 2:
+                if i not in view_field_dictionary:
+                    view_field_dictionary[i] = dict()
+                view_field_dictionary[i][j] = image[i][j]
+
+ # Find the wall if any
+    # We describe a wall only if there are no objects between the agent and the wall in a straight line
+
+ # Find wall in front
+ add_wall_descr = False
+ if add_wall_descr:
+ j = agent_pos_vy - 1
+ object_seen = False
+ while j >= 0 and not object_seen:
+ if image[agent_pos_vx][j][0] != 0 and image[agent_pos_vx][j][0] != 1:
+ if image[agent_pos_vx][j][0] == 2:
+ list_textual_descriptions.append(
+ f"A wall is {agent_pos_vy - j} steps in front of you. \n") # forward
+ object_seen = True
+ else:
+ object_seen = True
+ j -= 1
+ # Find wall left
+ i = agent_pos_vx - 1
+ object_seen = False
+ while i >= 0 and not object_seen:
+ if image[i][agent_pos_vy][0] != 0 and image[i][agent_pos_vy][0] != 1:
+ if image[i][agent_pos_vy][0] == 2:
+ list_textual_descriptions.append(
+ f"A wall is {agent_pos_vx - i} steps to the left. \n") # left
+ object_seen = True
+ else:
+ object_seen = True
+ i -= 1
+ # Find wall right
+ i = agent_pos_vx + 1
+ object_seen = False
+ while i < image.shape[0] and not object_seen:
+ if image[i][agent_pos_vy][0] != 0 and image[i][agent_pos_vy][0] != 1:
+ if image[i][agent_pos_vy][0] == 2:
+ list_textual_descriptions.append(
+ f"A wall is {i - agent_pos_vx} steps to the right. \n") # right
+ object_seen = True
+ else:
+ object_seen = True
+ i += 1
+
+ # list_textual_descriptions.append("You see the following objects: ")
+ # returns the position of seen objects relative to you
+ for i in view_field_dictionary.keys():
+ for j in view_field_dictionary[i].keys():
+ if i != agent_pos_vx or j != agent_pos_vy:
+ object = view_field_dictionary[i][j]
+
+ # # don't show npc
+ # if IDX_TO_OBJECT[object[0]] == "npc":
+ # continue
+
+ front_dist = agent_pos_vy - j
+ left_right_dist = i - agent_pos_vx
+
+ loc_descr = ""
+ if front_dist == 1 and left_right_dist == 0:
+ loc_descr += "Right in front of you "
+
+ elif left_right_dist == 1 and front_dist == 0:
+ loc_descr += "Just to the right of you"
+
+ elif left_right_dist == -1 and front_dist == 0:
+ loc_descr += "Just to the left of you"
+
+ else:
+ front_str = str(front_dist) + " steps in front of you " if front_dist > 0 else ""
+
+ loc_descr += front_str
+
+ suff = "s" if abs(left_right_dist) > 0 else ""
+ and_ = "and" if loc_descr != "" else ""
+
+ if left_right_dist < 0:
+ left_right_str = f"{and_} {-left_right_dist} step{suff} to the left"
+ loc_descr += left_right_str
+
+ elif left_right_dist > 0:
+ left_right_str = f"{and_} {left_right_dist} step{suff} to the right"
+ loc_descr += left_right_str
+
+ else:
+ left_right_str = ""
+ loc_descr += left_right_str
+
+ loc_descr += f" there is a "
+
+ obj_type = IDX_TO_OBJECT[object[0]]
+ if obj_type == "npc":
+ IDX_TO_STATE = {0: 'friendly', 1: 'antagonistic'}
+
+ description = f"{IDX_TO_STATE[object[2]]} {IDX_TO_COLOR[object[1]]} peer. "
+
+ # gaze
+ gaze_dir = {
+ 0: "towards you",
+ 1: "to the left of you",
+ 2: "in the same direction as you",
+ 3: "to the right of you",
+ }
+ description += f"It is looking {gaze_dir[object[3]]}. "
+
+ # point
+ point_dir = {
+ 0: "towards you",
+ 1: "to the left of you",
+ 2: "in the same direction as you",
+ 3: "to the right of you",
+ }
+
+ if object[4] != 255:
+ description += f"It is pointing {point_dir[object[4]]}. "
+
+ # last action
+ last_action = {v: k for k, v in npc_actions_dict.items()}[object[5]]
+
+ last_action = {
+ "go_forward": "foward",
+ "rotate_left": "turn left",
+ "rotate_right": "turn right",
+ "toggle_action": "toggle",
+ "point_stop_point": "stop pointing",
+ "point_E": "",
+ "point_S": "",
+ "point_W": "",
+ "point_N": "",
+ "stop_point": "stop pointing",
+ "no_op": ""
+ }[last_action]
+
+ if last_action not in ["no_op", ""]:
+                        description += f"Its last action is {last_action}. "
+
+ elif obj_type in ["switch", "apple", "generatorplatform", "marble", "marbletee", "fence"]:
+ # todo: this assumes that Switch.no_light == True
+ description = f"{IDX_TO_COLOR[object[1]]} {IDX_TO_OBJECT[object[0]]} "
+ assert object[2:].mean() == 0
+
+ elif obj_type == "lockablebox":
+ IDX_TO_STATE = {0: 'open', 1: 'closed', 2: 'locked'}
+ description = f"{IDX_TO_STATE[object[2]]} {IDX_TO_COLOR[object[1]]} {IDX_TO_OBJECT[object[0]]} "
+ assert object[3:].mean() == 0
+
+ elif obj_type == "applegenerator":
+ IDX_TO_STATE = {1: 'square', 2: 'round'}
+ description = f"{IDX_TO_STATE[object[2]]} {IDX_TO_COLOR[object[1]]} {IDX_TO_OBJECT[object[0]]} "
+ assert object[3:].mean() == 0
+
+ elif obj_type == "remotedoor":
+ IDX_TO_STATE = {0: 'open', 1: 'closed'}
+ description = f"{IDX_TO_STATE[object[2]]} {IDX_TO_COLOR[object[1]]} {IDX_TO_OBJECT[object[0]]} "
+ assert object[3:].mean() == 0
+
+ elif obj_type == "door":
+ IDX_TO_STATE = {0: 'open', 1: 'closed', 2: 'locked'}
+ description = f"{IDX_TO_STATE[object[2]]} {IDX_TO_COLOR[object[1]]} {IDX_TO_OBJECT[object[0]]} "
+ assert object[3:].mean() == 0
+
+ elif obj_type == "lever":
+ IDX_TO_STATE = {1: 'activated', 0: 'unactivated'}
+ if object[3] == 255:
+ countdown_txt = ""
+ else:
+ countdown_txt = f"with {object[3]} timesteps left. "
+
+ description = f"{IDX_TO_STATE[object[2]]} {IDX_TO_COLOR[object[1]]} {IDX_TO_OBJECT[object[0]]} {countdown_txt}"
+
+ assert object[4:].mean() == 0
+ else:
+ raise ValueError(f"Undefined object type {obj_type}")
+
+ full_destr = loc_descr + description + "\n"
+
+ list_textual_descriptions.append(full_destr)
+
+ if len(list_textual_descriptions) == 0:
+ list_textual_descriptions.append("\n")
+
+ return list_textual_descriptions
+
+def plt_2_rgb(env):
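+    """
+    Renders the env's current matplotlib figure as a (height, width, 3) uint8 RGB array
+    """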
+ # data = np.frombuffer(env.window.fig.canvas.tostring_rgb(), dtype=np.uint8)
+ # data = data.reshape(env.window.fig.canvas.get_width_height()[::-1] + (3,))
+
+ width, height = env.window.fig.get_size_inches() * env.window.fig.get_dpi()
+ data = np.fromstring(env.window.fig.canvas.tostring_rgb(), dtype='uint8').reshape(int(height), int(width), 3)
+ return data
+
+
+def reset(env):
+ env.reset()
+ # a dirty trick just to get obs and info
+ return env.step([np.nan, np.nan, np.nan])
+ # return step("no_op")
+
+def generate(text_input, model):
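+    """
+    Queries the selected backend with the prompt and returns its text output:
+    scripted baselines (dummy / interactive / random), OpenAI chat or completion
+    models, the HuggingFace inference API (gpt2_large, api_bloom), or a local
+    bloom_560m model
+    """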
+ # return "(a) move forward"
+ if model == "dummy":
+ print("dummy action forward")
+ return "move forward"
+
+ elif model == "interactive":
+ return input("Enter action:")
+
+ elif model == "random":
+ print("random agent")
+
+ print("PROMPT:")
+ print(text_input)
+ return random.choice([
+ "move forward",
+ "turn left",
+ "turn right",
+ "toggle",
+ ])
+
+ elif model in ["gpt-3.5-turbo-0301", "gpt-3.5-turbo-0613", "gpt-4-0613", "gpt-4-0314"]:
+ while True:
+ try:
+ c = openai.ChatCompletion.create(
+ model=model,
+ messages=[
+ # {"role": "system", "content": ""},
+ # {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
+ # {"role": "user", "content": "Continue the following text in the most logical way.\n"+text_input}
+
+ # {"role": "system", "content":
+ # "You are an agent and can use the following actions: 'move forward', 'toggle', 'turn left', 'turn right', 'done'."
+ # # "The caretaker will say the color of the box which you should open. Turn until you find this box and toggle it when it is right in front of it."
+ # # "Then an apple will appear and you can toggle it to succeed."
+ # },
+ {"role": "user", "content": text_input}
+ ],
+ max_tokens=3,
+ n=1,
+ temperature=0.0,
+ request_timeout=30,
+ )
+ break
+ except Exception as e:
+ print(e)
+ print("Pausing")
+ time.sleep(10)
+ continue
+ print("->LLM generation: ", c['choices'][0]['message']['content'])
+ return c['choices'][0]['message']['content']
+
+ elif re.match(r"text-.*-\d{3}", model) or model in ["gpt-3.5-turbo-instruct-0914"]:
+ while True:
+ try:
+ response = openai.Completion.create(
+ model=model,
+ prompt=text_input,
+ # temperature=0.7,
+ temperature=0.0,
+ max_tokens=3,
+ top_p=1,
+ frequency_penalty=0,
+ presence_penalty=0,
+ timeout=30
+ )
+ break
+
+ except Exception as e:
+ print(e)
+ print("Pausing")
+ time.sleep(10)
+ continue
+
+ choices = response["choices"]
+ assert len(choices) == 1
+ return choices[0]["text"].strip().lower() # remove newline from the end
+
+ elif model in ["gpt2_large", "api_bloom"]:
+ # HF_TOKEN = os.getenv("HF_TOKEN")
+ if model == "gpt2_large":
+ API_URL = "https://api-inference.huggingface.co/models/gpt2-large"
+
+ elif model == "api_bloom":
+ API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
+
+ else:
+ raise ValueError(f"Undefined model {model}.")
+
+ headers = {"Authorization": f"Bearer {HF_TOKEN}"}
+
+ def query(text_prompt, n_tokens=3):
+
+ input = text_prompt
+
+            # make n_tokens requests and append the output each time - one request generates one token
+
+ for _ in range(n_tokens):
+ # prepare request
+ payload = {
+ "inputs": input,
+ "parameters": {
+ "do_sample": False,
+ 'temperature': 0,
+ 'wait_for_model': True,
+ # "max_length": 500, # for gpt2
+ # "max_new_tokens": 250 # fot gpt2-xl
+ },
+ }
+ data = json.dumps(payload)
+
+ # request
+ response = requests.request("POST", API_URL, headers=headers, data=data)
+ response_json = json.loads(response.content.decode("utf-8"))
+
+ if type(response_json) is list and len(response_json) == 1:
+ # generated_text contains the input + the response
+ response_full_text = response_json[0]['generated_text']
+
+ # we use this as the next input
+ input = response_full_text
+
+ else:
+ print("Invalid request to huggingface api")
+ from IPython import embed; embed()
+
+ # remove the prompt from the beginning
+ assert response_full_text.startswith(text_prompt)
+ response_text = response_full_text[len(text_prompt):]
+
+ return response_text
+
+ response = query(text_input).strip().lower()
+ return response
+
+ elif model in ["bloom_560m"]:
+ # from transformers import BloomForCausalLM
+ # from transformers import BloomTokenizerFast
+ #
+ # tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m", cache_dir=".cache/huggingface/")
+ # model = BloomForCausalLM.from_pretrained("bigscience/bloom-560m", cache_dir=".cache/huggingface/")
+
+ inputs = hf_tokenizer(text_input, return_tensors="pt")
+        # generate 3 more tokens
+ result_length = inputs['input_ids'].shape[-1]+3
+ full_output = hf_tokenizer.decode(hf_model.generate(inputs["input_ids"], max_length=result_length)[0])
+
+ assert full_output.startswith(text_input)
+ response = full_output[len(text_input):]
+
+ response = response.strip().lower()
+
+ return response
+
+ else:
+ raise ValueError("Unknown model.")
+
+
+def estimate_tokens_selenium(prompt):
+ # selenium is used because python3.9 is needed for tiktoken
+
+ from selenium import webdriver
+ from selenium.webdriver.common.by import By
+ from selenium.webdriver.support.ui import WebDriverWait
+ from selenium.webdriver.support import expected_conditions as EC
+ import time
+
+ # Initialize the WebDriver instance
+ options = webdriver.ChromeOptions()
+ options.add_argument('headless')
+
+ # set up the driver
+ driver = webdriver.Chrome(options=options)
+
+ # Navigate to the website
+ driver.get('https://platform.openai.com/tokenizer')
+
+ text_input = driver.find_element(By.XPATH, '/html/body/div[1]/div[1]/div/div[2]/div[3]/textarea')
+ text_input.clear()
+ text_input.send_keys(prompt)
+
+ n_tokens = 0
+ while n_tokens == 0:
+ time.sleep(1)
+ # Wait for the response to be loaded
+ wait = WebDriverWait(driver, 10)
+ response = wait.until(
+ EC.presence_of_element_located((By.CSS_SELECTOR, 'div.tokenizer-stat:nth-child(1) > div:nth-child(2)')))
+ n_tokens = int(response.text.replace(",", ""))
+
+
+ # Close the WebDriver instance
+ driver.quit()
+ return n_tokens
+
+
+def load_in_context_examples(in_context_episodes):
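+    """
+    Concatenates the recorded in-context episodes into a single prompt prefix:
+    a "New episode." marker per episode, then for each step the action taken
+    (from step 1 on) followed by the textual observation, and finally a
+    success or failure marker
+    """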
+ in_context_examples = ""
+ print(f'Loading {len(in_context_episodes)} examples.')
+ for episode_data in in_context_episodes:
+
+ in_context_examples += new_episode_marker()
+
+ for step_i, step_data in enumerate(episode_data):
+
+ action = step_data["action"]
+ info = step_data["info"]
+ obs = step_data["obs"]
+ reward = step_data["reward"]
+ done = step_data["done"]
+
+ if step_i == 0:
+ # step 0 is the initial state of the environment
+ assert action is None
+ prompt_action_text = ""
+
+ else:
+ prompt_action_text = action_to_prompt_action_text(action)
+
+ text_obs = generate_text_obs(obs, info)
+ step_text = prompt_preprocessor(prompt_action_text + text_obs)
+
+ in_context_examples += step_text
+
+ if done:
+ if reward > 0:
+ in_context_examples += success_marker()
+ else:
+ in_context_examples += failure_marker()
+
+ else:
+                # in all envs the reward is given at the end
+                # in the initial step the reward is None
+ assert reward == 0 or (step_i == 0 and reward is None)
+
+ print("-------------------------- IN CONTEXT EXAMPLES --------------------------")
+ print(in_context_examples)
+ print("-------------------------------------------------------------------------")
+
+ return in_context_examples
+
+
+if __name__ == "__main__":
+
+ # Parse arguments
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--model", required=False,
+ help="text-ada-001")
+ parser.add_argument("--seed", type=int, default=0,
+ help="Seed of the first episode. The seed for the following episodes will be used in order: seed, seed + 1, ... seed + (n_episodes-1) (default: 0)")
+ parser.add_argument("--max-steps", type=int, default=15,
+ help="max num of steps")
+ parser.add_argument("--shift", type=int, default=0,
+ help="number of times the environment is reset at the beginning (default: 0)")
+ parser.add_argument("--argmax", action="store_true", default=False,
+ help="select the action with highest probability (default: False)")
+ parser.add_argument("--pause", type=float, default=0.5,
+ help="pause duration between two consequent actions of the agent (default: 0.5)")
+ parser.add_argument("--env-name", type=str,
+ default="SocialAI-AsocialBoxInformationSeekingParamEnv-v1",
+ # default="SocialAI-ColorBoxesLLMCSParamEnv-v1",
+ required=False,
+ help="env name")
+ parser.add_argument("--in-context-path", type=str,
+ # old
+ # default='llm_data/in_context_asocial_box.txt'
+ # default='llm_data/in_context_color_boxes.txt',
+ # new
+ # asocial box
+ default='llm_data/in_context_examples/in_context_asocialbox_SocialAI-AsocialBoxInformationSeekingParamEnv-v1_2023_07_19_19_28_48/episodes.pkl',
+ # colorbox
+ # default='llm_data/in_context_examples/in_context_colorbox_SocialAI-ColorBoxesLLMCSParamEnv-v1_2023_07_20_13_11_54/episodes.pkl',
+ required=False,
+ help="path to in context examples")
+ parser.add_argument("--episodes", type=int, default=10,
+ help="number of episodes to visualize")
+ parser.add_argument("--env-args", nargs='*', default=None)
+ parser.add_argument("--agent_view", default=False, help="draw the agent sees (partially observable view)", action='store_true' )
+ parser.add_argument("--tile_size", type=int, help="size at which to render tiles", default=32 )
+ parser.add_argument("--mask-unobserved", default=False, help="mask cells that are not observed by the agent", action='store_true' )
+ parser.add_argument("--log", type=str, default="llm_log/episodes_log", help="log from the run", required=False)
+ parser.add_argument("--feed-full-ep", default=False, help="weather to append the whole episode to the prompt", action='store_true')
+ parser.add_argument("--last-n", type=int, help="how many last steps to provide in observation (if not feed-full-ep)", default=3)
+ parser.add_argument("--skip-check", default=False, help="Don't estimate the price.", action="store_true")
+
+ args = parser.parse_args()
+
+ # Set seed for all randomness sources
+
+ seed(args.seed)
+
+ model = args.model
+
+
+ in_context_examples_path = args.in_context_path
+
+ # test for paper: remove later
+ if "asocialbox" in in_context_examples_path:
+ assert args.env_name == "SocialAI-AsocialBoxInformationSeekingParamEnv-v1"
+ elif "colorbox" in in_context_examples_path:
+ assert args.env_name == "SocialAI-ColorBoxesLLMCSParamEnv-v1"
+
+
+ print("env name:", args.env_name)
+ print("examples:", in_context_examples_path)
+ print("model:", args.model)
+
+ # datetime
+ now = datetime.now()
+ datetime_string = now.strftime("%d_%m_%Y_%H:%M:%S")
+ print(datetime_string)
+
+ # log filenames
+
+ log_folder = args.log+"_"+datetime_string+"/"
+ os.mkdir(log_folder)
+ evaluation_log_filename = log_folder+"evaluation_log.json"
+ prompt_log_filename = log_folder + "prompt_log.txt"
+ ep_h_log_filename = log_folder+"episode_history_query.txt"
+ gif_savename = log_folder + "demo.gif"
+
+
+ env_args = env_args_str_to_dict(args.env_args)
+ env = make_env(args.env_name, args.seed, env_args)
+
+ # env = gym.make(args.env_name, **env_args)
+ print(f"Environment {args.env_name} and args: {env_args_str_to_dict(args.env_args)}\n")
+
+ # Define agent
+ print("Agent loaded\n")
+
+ # prepare models
+ model_instance = None
+
+ if "text" in args.model or "gpt-3" in args.model or "gpt-4" in args.model:
+ import openai
+ openai.api_key = os.getenv("OPENAI_API_KEY")
+
+ elif args.model in ["gpt2_large", "api_bloom"]:
+ HF_TOKEN = os.getenv("HF_TOKEN")
+
+ elif args.model in ["bloom_560m"]:
+ from transformers import BloomForCausalLM
+ from transformers import BloomTokenizerFast
+
+ hf_tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m", cache_dir=".cache/huggingface/")
+ hf_model = BloomForCausalLM.from_pretrained("bigscience/bloom-560m", cache_dir=".cache/huggingface/")
+
+ elif args.model in ["bloom"]:
+ from transformers import BloomForCausalLM
+ from transformers import BloomTokenizerFast
+
+ hf_tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom", cache_dir=".cache/huggingface/")
+ hf_model = BloomForCausalLM.from_pretrained("bigscience/bloom", cache_dir=".cache/huggingface/")
+
+ model_instance = (hf_tokenizer, hf_model)
+
+ with open(in_context_examples_path, "rb") as f:
+ in_context_episodes = pickle.load(f)
+
+ in_context_examples = load_in_context_examples(in_context_episodes)
+
+ with open(prompt_log_filename, "a+") as f:
+ f.write(datetime_string)
+
+ with open(ep_h_log_filename, "a+") as f:
+ f.write(datetime_string)
+
+ full_episode_history = args.feed_full_ep
+    last_n = args.last_n
+
+ if full_episode_history:
+ print("Full episode history.")
+ else:
+ print(f"Last {args.last_n} steps.")
+
+    if not args.skip_check and args.model not in ["dummy", "random", "interactive"]:
+ print(f"Estimating price for model {args.model}.")
+ in_context_n_tokens = estimate_tokens_selenium(in_context_examples)
+
+ n_in_context_steps = sum([len(ep) for ep in in_context_episodes])
+ tokens_per_step = in_context_n_tokens / n_in_context_steps
+
+ _, price = estimate_price(
+ num_of_episodes=args.episodes,
+ in_context_len=in_context_n_tokens,
+ tokens_per_step=tokens_per_step,
+ n_steps=args.max_steps,
+ last_n=last_n,
+ model=args.model,
+ feed_episode_history=full_episode_history
+ )
+ input(f"You will spend: {price} dollars. ok?")
+
+ # prepare frames list to save to gif
+ frames = []
+
+ assert args.max_steps <= 20
+
+ success_rates = []
+ # episodes start
+ for episode in range(args.episodes):
+ print("Episode:", episode)
+ episode_history_text = new_episode_marker()
+
+ success = False
+ episode_seed = args.seed + episode
+ env = make_env(args.env_name, episode_seed, env_args)
+
+ with open(prompt_log_filename, "a+") as f:
+ f.write("\n\n")
+
+ observations = []
+ actions = []
+ for i in range(int(args.max_steps)):
+
+ if i == 0:
+ obs, reward, done, info = reset(env)
+ prompt_action_text = ""
+
+ else:
+ with open(prompt_log_filename, "a+") as f:
+ f.write("\nnew prompt: -----------------------------------\n")
+ f.write(llm_prompt)
+
+                # query the model
+ generation = generate(llm_prompt, args.model)
+
+ # parse the action
+ text_action = get_parsed_action(generation)
+
+ # get the raw action
+ action = text_action_to_action(text_action)
+
+ # execute the action
+ obs, reward, done, info = env.step(action)
+
+ prompt_action_text = f"{action_query()} {text_action}\n"
+
+ assert action_to_prompt_action_text(action) == prompt_action_text
+
+ actions.append(prompt_action_text)
+
+ text_obs = generate_text_obs(obs, info)
+ observations.append(text_obs)
+
+ step_text = prompt_preprocessor(prompt_action_text + text_obs)
+ print("Step text:")
+ print(step_text)
+
+ episode_history_text += step_text # append to history of this episode
+
+ if full_episode_history:
+ # feed full episode history
+ llm_prompt = in_context_examples + episode_history_text + action_query()
+
+ else:
+ n = min(last_n, len(observations))
+ obs = observations[-n:]
+ act = (actions + [action_query()])[-n:]
+
+ episode_text = "".join([o+a for o, a in zip(obs, act)])
+
+ llm_prompt = in_context_examples + new_episode_marker() + episode_text
+
+ llm_prompt = prompt_preprocessor(llm_prompt)
+
+ # save the image
+ env.render(mode="human")
+ rgb_img = plt_2_rgb(env)
+ frames.append(rgb_img)
+
+ if env.current_env.box.blocked and not env.current_env.box.is_open:
+ # target box is blocked -> apple can't be obtained
+ # break to save compute
+ break
+
+ if done:
+ # quadruple last frame to pause between episodes
+ for i in range(3):
+ same_img = np.copy(rgb_img)
+ # toggle a pixel between frames to avoid cropping when going from gif to mp4
+ same_img[0, 0, 2] = 0 if (i % 2) == 0 else 255
+ frames.append(same_img)
+
+ if reward > 0:
+ print("Success!")
+
+
+ episode_history_text += success_marker()
+ success = True
+ else:
+ episode_history_text += failure_marker()
+
+ with open(ep_h_log_filename, "a+") as f:
+ f.write("\nnew prompt: -----------------------------------\n")
+ f.write(episode_history_text)
+
+ break
+
+ else:
+ with open(ep_h_log_filename, "a+") as f:
+ f.write("\nnew prompt: -----------------------------------\n")
+ f.write(episode_history_text)
+
+ print(f"{'Success' if success else 'Failure'}")
+ success_rates.append(success)
+
+ mean_success_rate = np.mean(success_rates)
+ print("Success rate:", mean_success_rate)
+ print(f"Saving gif to {gif_savename}.")
+ mimsave(gif_savename, frames, duration=args.pause)
+
+ print("Done.")
+
+ log_data_dict = vars(args)
+ log_data_dict["success_rates"] = success_rates
+ log_data_dict["mean_success_rate"] = mean_success_rate
+
+ print("Evaluation log: ", evaluation_log_filename)
+ with open(evaluation_log_filename, "w") as f:
+ f.write(json.dumps(log_data_dict))
diff --git a/scripts/LLM_test_old.py b/scripts/LLM_test_old.py
new file mode 100644
index 0000000000000000000000000000000000000000..a16900195b616ffa992295c07638c29312812af6
--- /dev/null
+++ b/scripts/LLM_test_old.py
@@ -0,0 +1,586 @@
+# python -m scripts.LLM_test --gif test_GPT_boxes --episodes 1 --max-steps 8 --model text-davinci-003 --env-args size 6 --env-name SocialAI-ColorBoxesLLMCSParamEnv-v1 --in-context-path llm_data/in_context_color_boxes.txt
+# python -m scripts.LLM_test --gif test_GPT_asoc --episodes 1 --max-steps 8 --model text-ada-001 --env-args size 6 --env-name SocialAI-AsocialBoxInformationSeekingParamEnv-v1 --in-context-path llm_data/in_context_asocial_box.txt --feed-full-ep
+
+# python -m scripts.LLM_test --gif test_GPT_boxes --episodes 1 --max-steps 8 --model bloom_560m --env-args size 6 --env-name SocialAI-ColorBoxesLLMCSParamEnv-v1 --in-context-path llm_data/in_context_color_boxes.txt
+# python -m scripts.LLM_test --gif test_GPT_asoc --episodes 1 --max-steps 8 --model bloom_560m --env-args size 6 --env-name SocialAI-AsocialBoxInformationSeekingParamEnv-v1 --in-context-path llm_data/in_context_asocial_box.txt --feed-full-ep
+
+## bloom 560m
+# boxes
+# python -m scripts.LLM_test --log llm_log/bloom_560m_boxes_no_hist --gif evaluation --episodes 20 --max-steps 10 --model bloom_560m --env-args size 6 --env-name SocialAI-ColorBoxesLLMCSParamEnv-v1 --in-context-path llm_data/in_context_color_boxes.txt
+
+# asocial
+# python -m scripts.LLM_test --log llm_log/bloom_560m_asocial_no_hist --gif evaluation --episodes 20 --max-steps 10 --model bloom_560m --env-args size 6 --env-name SocialAI-AsocialBoxInformationSeekingParamEnv-v1 --in-context-path llm_data/in_context_asocial_box.txt
+
+# random
+# python -m scripts.LLM_test --log llm_log/random_boxes --gif evaluation --episodes 20 --max-steps 10 --model random --env-args size 6 --env-name SocialAI-ColorBoxesLLMCSParamEnv-v1 --in-context-path llm_data/in_context_color_boxes.txt
+
+import argparse
+import json
+import requests
+import time
+import warnings
+from n_tokens import estimate_price
+
+import numpy as np
+import torch
+from pathlib import Path
+
+from utils.babyai_utils.baby_agent import load_agent
+from utils import *
+from models import *
+import subprocess
+import os
+
+from matplotlib import pyplot as plt
+
+from gym_minigrid.wrappers import *
+from gym_minigrid.window import Window
+from datetime import datetime
+
+from imageio import mimsave
+
+def prompt_preprocessor(llm_prompt):
+ # remove peer observations
+ lines = llm_prompt.split("\n")
+ new_lines = []
+ for line in lines:
+ if line.startswith("#"):
+ continue
+
+ elif line.startswith("Conversation"):
+ continue
+
+ elif "peer" in line:
+ caretaker = True
+ if caretaker:
+ # show only the location of the caretaker
+
+ # this is very ugly, todo: refactor this
+ assert "there is a" in line
+ start_index = line.index('there is a') + 11
+ new_line = line[:start_index] + 'caretaker'
+
+ new_lines.append(new_line)
+
+ else:
+ # no caretaker at all
+ if line.startswith("Obs :") and "peer" in line:
+ # remove only the peer descriptions
+ line = "Obs :"
+ new_lines.append(line)
+ else:
+ assert "peer" in line
+
+ elif "Caretaker:" in line:
+ # line = line.replace("Caretaker:", "Caretaker says: '") + "'"
+ new_lines.append(line)
+
+ else:
+ new_lines.append(line)
+
+ return "\n".join(new_lines)
+
+
+# Parse arguments
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--model", required=False,
+ help="text-ada-001")
+parser.add_argument("--seed", type=int, default=0,
+ help="Seed of the first episode. The seed for the following episodes will be used in order: seed, seed + 1, ... seed + (n_episodes-1) (default: 0)")
+parser.add_argument("--max-steps", type=int, default=5,
+ help="max num of steps")
+parser.add_argument("--shift", type=int, default=0,
+ help="number of times the environment is reset at the beginning (default: 0)")
+parser.add_argument("--argmax", action="store_true", default=False,
+ help="select the action with highest probability (default: False)")
+parser.add_argument("--pause", type=float, default=0.5,
+ help="pause duration between two consequent actions of the agent (default: 0.5)")
+parser.add_argument("--env-name", type=str,
+ # default="SocialAI-ELangColorBoxesTestInformationSeekingParamEnv-v1",
+ # default="SocialAI-AsocialBoxInformationSeekingParamEnv-v1",
+ default="SocialAI-ColorBoxesLLMCSParamEnv-v1",
+ required=False,
+ help="env name")
+parser.add_argument("--in-context-path", type=str,
+ # default='llm_data/short_in_context_boxes.txt'
+ # default='llm_data/in_context_asocial_box.txt'
+ default='llm_data/in_context_color_boxes.txt',
+ required=False,
+ help="path to in context examples")
+parser.add_argument("--gif", type=str, default="visualization",
+ help="store output as gif with the given filename", required=False)
+parser.add_argument("--episodes", type=int, default=1,
+ help="number of episodes to visualize")
+parser.add_argument("--env-args", nargs='*', default=None)
+parser.add_argument("--agent_view", default=False, help="draw the agent sees (partially observable view)", action='store_true' )
+parser.add_argument("--tile_size", type=int, help="size at which to render tiles", default=32 )
+parser.add_argument("--mask-unobserved", default=False, help="mask cells that are not observed by the agent", action='store_true' )
+parser.add_argument("--log", type=str, default="llm_log/episodes_log", help="log from the run", required=False)
+parser.add_argument("--feed-full-ep", default=False, help="weather to append the whole episode to the prompt", action='store_true')
+parser.add_argument("--skip-check", default=False, help="Don't estimate the price.", action="store_true")
+
+args = parser.parse_args()
+
+# Set seed for all randomness sources
+
+seed(args.seed)
+
+model = args.model
+
+
+in_context_examples_path = args.in_context_path
+
+print("env name:", args.env_name)
+print("examples:", in_context_examples_path)
+print("model:", args.model)
+
+# datetime
+now = datetime.now()
+datetime_string = now.strftime("%d_%m_%Y_%H:%M:%S")
+print(datetime_string)
+
+# log filenames
+
+log_folder = args.log+"_"+datetime_string+"/"
+os.mkdir(log_folder)
+evaluation_log_filename = log_folder+"evaluation_log.json"
+prompt_log_filename = log_folder + "prompt_log.txt"
+ep_h_log_filename = log_folder+"episode_history_query.txt"
+gif_savename = log_folder + args.gif + ".gif"
+
+assert "viz" not in gif_savename # don't use viz anymore
+
+
+env_args = env_args_str_to_dict(args.env_args)
+env = make_env(args.env_name, args.seed, env_args)
+
+# env = gym.make(args.env_name, **env_args)
+print(f"Environment {args.env_name} and args: {env_args_str_to_dict(args.env_args)}\n")
+
+# Define agent (here the LLM itself acts as the agent; no policy network is loaded)
+print("Agent loaded\n")
+
+# prepare models
+
+if args.model in ["text-davinci-003", "text-ada-001", "gpt-3.5-turbo-0301"]:
+ import openai
+ openai.api_key = os.getenv("OPENAI_API_KEY")
+
+elif args.model in ["gpt2_large", "api_bloom"]:
+ HF_TOKEN = os.getenv("HF_TOKEN")
+
+elif args.model in ["bloom_560m"]:
+ from transformers import BloomForCausalLM
+ from transformers import BloomTokenizerFast
+
+ hf_tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m", cache_dir=".cache/huggingface/")
+ hf_model = BloomForCausalLM.from_pretrained("bigscience/bloom-560m", cache_dir=".cache/huggingface/")
+
+elif args.model in ["bloom"]:
+ from transformers import BloomForCausalLM
+ from transformers import BloomTokenizerFast
+
+ hf_tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom", cache_dir=".cache/huggingface/")
+ hf_model = BloomForCausalLM.from_pretrained("bigscience/bloom", cache_dir=".cache/huggingface/")
+
+
+def plt_2_rgb(env):
+ # data = np.frombuffer(env.window.fig.canvas.tostring_rgb(), dtype=np.uint8)
+ # data = data.reshape(env.window.fig.canvas.get_width_height()[::-1] + (3,))
+
+ width, height = env.window.fig.get_size_inches() * env.window.fig.get_dpi()
+ # np.frombuffer avoids the DeprecationWarning np.fromstring raises on recent numpy
+ data = np.frombuffer(env.window.fig.canvas.tostring_rgb(), dtype='uint8').reshape(int(height), int(width), 3)
+ return data
+
+def generate(text_input, model):
+ # return "(a) move forward"
+ if model == "dummy":
+ print("dummy action forward")
+ return "move forward"
+
+ elif model == "random":
+ print("random agent")
+ return random.choice([
+ "move forward",
+ "turn left",
+ "turn right",
+ "toggle",
+ ])
+
+ elif model in ["gpt-3.5-turbo-0301"]:
+ while True:
+ try:
+ c = openai.ChatCompletion.create(
+ model=model,
+ messages=[
+ # {"role": "system", "content": ""},
+ # {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
+ # {"role": "user", "content": "Continue the following text in the most logical way.\n"+text_input}
+ {"role": "user", "content": text_input}
+ ],
+ max_tokens=3,
+ n=1,
+ temperature=0,
+ request_timeout=30,
+ )
+ break
+ except Exception as e:
+ print(e)
+ print("Pausing")
+ time.sleep(10)
+ continue
+ print("generation: ", c['choices'][0]['message']['content'])
+ return c['choices'][0]['message']['content']
+
+ elif model in ["text-davinci-003", "text-ada-001"]:
+ while True:
+ try:
+ response = openai.Completion.create(
+ model=model,
+ prompt=text_input,
+ # temperature=0.7,
+ temperature=0.0,
+ max_tokens=3,
+ top_p=1,
+ frequency_penalty=0,
+ presence_penalty=0,
+ timeout=30
+ )
+ break
+
+ except Exception as e:
+ print(e)
+ print("Pausing")
+ time.sleep(10)
+ continue
+
+ choices = response["choices"]
+ assert len(choices) == 1
+ return choices[0]["text"].strip().lower() # remove newline from the end
+
+ elif model in ["gpt2_large", "api_bloom"]:
+ # HF_TOKEN = os.getenv("HF_TOKEN")
+ if model == "gpt2_large":
+ API_URL = "https://api-inference.huggingface.co/models/gpt2-large"
+
+ elif model == "api_bloom":
+ API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
+
+ else:
+ raise ValueError(f"Undefined model {model}.")
+
+ headers = {"Authorization": f"Bearer {HF_TOKEN}"}
+
+ def query(text_prompt, n_tokens=3):
+
+ input = text_prompt
+
+ # make n_tokens request and append the output each time - one request generates one token
+
+ for _ in range(n_tokens):
+ # prepare request
+ payload = {
+ "inputs": input,
+ "parameters": {
+ "do_sample": False,
+ 'temperature': 0,
+ 'wait_for_model': True,
+ # "max_length": 500, # for gpt2
+ # "max_new_tokens": 250 # fot gpt2-xl
+ },
+ }
+ data = json.dumps(payload)
+
+ # request
+ response = requests.request("POST", API_URL, headers=headers, data=data)
+ response_json = json.loads(response.content.decode("utf-8"))
+
+ if type(response_json) is list and len(response_json) == 1:
+ # generated_text contains the input + the response
+ response_full_text = response_json[0]['generated_text']
+
+ # we use this as the next input
+ input = response_full_text
+
+ else:
+ print("Invalid request to huggingface api")
+ from IPython import embed; embed()
+
+ # remove the prompt from the beginning
+ assert response_full_text.startswith(text_prompt)
+ response_text = response_full_text[len(text_prompt):]
+
+ return response_text
+
+ response = query(text_input).strip().lower()
+ return response
+
+ elif model in ["bloom_560m"]:
+ # from transformers import BloomForCausalLM
+ # from transformers import BloomTokenizerFast
+ #
+ # tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m", cache_dir=".cache/huggingface/")
+ # model = BloomForCausalLM.from_pretrained("bigscience/bloom-560m", cache_dir=".cache/huggingface/")
+
+ inputs = hf_tokenizer(text_input, return_tensors="pt")
+ # generate 3 new tokens on top of the prompt
+ result_length = inputs['input_ids'].shape[-1]+3
+ full_output = hf_tokenizer.decode(hf_model.generate(inputs["input_ids"], max_length=result_length)[0])
+
+ assert full_output.startswith(text_input)
+ response = full_output[len(text_input):]
+
+ response = response.strip().lower()
+
+ return response
+
+ else:
+ raise ValueError("Unknown model.")
+
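+# Illustrative: get_parsed_action("(a) move forward.") -> "move forward";
+# anything unrecognised falls back to "no_op" with a warning.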
+def get_parsed_action(text_action):
+ if "move forward" in text_action:
+ return "move forward"
+
+ elif "turn left" in text_action:
+ return "turn left"
+
+ elif "turn right" in text_action:
+ return "turn right"
+
+ elif "toggle" in text_action:
+ return "toggle"
+
+ elif "no_op" in text_action:
+ return "no_op"
+ else:
+ warnings.warn(f"Undefined action {text_action}")
+ return "no_op"
+
+
+def step(text_action):
+ text_action = get_parsed_action(text_action)
+
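+ # The env takes a 3-component action; judging from its use here and in the
+ # manual-control scripts, the components appear to be
+ # [primitive action, utterance template index, utterance word index],
+ # with np.nan meaning "do nothing" for that slot (assumption, not verified
+ # against the SocialAI environment source).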
+ if "move forward" == text_action:
+ action = [int(env.actions.forward), np.nan, np.nan]
+
+ elif "turn left" == text_action:
+ action = [int(env.actions.left), np.nan, np.nan]
+
+ elif "turn right" == text_action:
+ action = [int(env.actions.right), np.nan, np.nan]
+
+ elif "toggle" == text_action:
+ action = [int(env.actions.toggle), np.nan, np.nan]
+
+ elif "no_op" == text_action:
+ action = [np.nan, np.nan, np.nan]
+
+ # if text_action.startswith("a"):
+ # action = [int(env.actions.forward), np.nan, np.nan]
+ #
+ # elif text_action.startswith("b"):
+ # action = [int(env.actions.left), np.nan, np.nan]
+ #
+ # elif text_action.startswith("c"):
+ # action = [int(env.actions.right), np.nan, np.nan]
+ #
+ # elif text_action.startswith("d"):
+ # action = [int(env.actions.toggle), np.nan, np.nan]
+ #
+ # elif text_action.startswith("e"):
+ # action = [np.nan, np.nan, np.nan]
+ #
+ # else:
+ # print("Unknown action.")
+
+ obs, reward, done, info = env.step(action)
+
+ return obs, reward, done, info
+
+
+
+def reset(env):
+ env.reset()
+ # a dirty trick just to get obs and info
+ return step("no_op")
+
+
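+# Illustrative shape of a text observation built below:
+#   "Obs : <joined info['descriptions']>" followed by any utterances from
+#   obs['utterance_history'], with the leading "Conversation: \n" header stripped.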
+def generate_text_obs(obs, info):
+ llm_prompt = "Obs : "
+ llm_prompt += "".join(info["descriptions"])
+ if obs["utterance_history"] != "Conversation: \n":
+ utt_hist = obs['utterance_history']
+ utt_hist = utt_hist.replace("Conversation: \n","")
+ llm_prompt += utt_hist
+
+ return llm_prompt
+
+
+def action_query():
+ # llm_prompt = ""
+ # llm_prompt += "Your possible actions are:\n"
+ # llm_prompt += "(a) move forward\n"
+ # llm_prompt += "(b) turn left\n"
+ # llm_prompt += "(c) turn right\n"
+ # llm_prompt += "(d) toggle\n"
+ # llm_prompt += "(e) no_op\n"
+ # llm_prompt += "Your next action is: ("
+ llm_prompt = "Act :"
+ return llm_prompt
+
+# load in-context examples
+with open(in_context_examples_path, "r") as f:
+ in_context_examples = f.read()
+
+with open(prompt_log_filename, "a+") as f:
+ f.write(datetime_string)
+
+with open(ep_h_log_filename, "a+") as f:
+ f.write(datetime_string)
+
+feed_episode_history = args.feed_full_ep
+
+# asocial-box preset (overridden by the color-boxes preset below)
+in_context_n_tokens = 800
+ep_obs_len = 50 * 3
+
+# color-boxes preset (active)
+in_context_n_tokens = 1434
+# ep_obs_len = 70
+
+# feed only current obs
+if feed_episode_history:
+ ep_obs_len = 50
+
+else:
+ # last_n = 1
+ # last_n = 2
+ last_n = 3
+ ep_obs_len = 50 * last_n
+
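+# Note: in_context_n_tokens and ep_obs_len above are rough token-count figures used
+# only by estimate_price and the confirmation prompt below; the actual prompts are
+# assembled in the episode loop.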
+_, price = estimate_price(
+ num_of_episodes=args.episodes,
+ in_context_len=in_context_n_tokens,
+ ep_obs_len=ep_obs_len,
+ n_steps=args.max_steps,
+ model=args.model,
+ feed_episode_history=feed_episode_history
+)
+if not args.skip_check:
+ input(f"You will spend: {price} dollars. (in context: {in_context_n_tokens} obs: {ep_obs_len}), ok?")
+
+# prepare frames list to save to gif
+frames = []
+
+assert args.max_steps <= 20
+
+success_rates = []
+# episodes start
+for episode in range(args.episodes):
+ print("Episode:", episode)
+ new_episode_text = "New episode.\n"
+ episode_history_text = new_episode_text
+
+ success = False
+ episode_seed = args.seed + episode
+ env = make_env(args.env_name, episode_seed, env_args)
+
+ with open(prompt_log_filename, "a+") as f:
+ f.write("\n\n")
+
+ observations = []
+ actions = []
+ for i in range(int(args.max_steps)):
+ if i == 0:
+ obs, reward, done, info = reset(env)
+ action_text = ""
+
+ else:
+ with open(prompt_log_filename, "a+") as f:
+ f.write("\nnew prompt: -----------------------------------\n")
+ f.write(llm_prompt)
+
+ text_action = generate(llm_prompt, args.model)
+ obs, reward, done, info = step(text_action)
+ action_text = f"Act : {get_parsed_action(text_action)}\n"
+ actions.append(action_text)
+
+ print(action_text)
+
+ text_obs = generate_text_obs(obs, info)
+ observations.append(text_obs)
+ print(prompt_preprocessor(text_obs))
+
+ # feed the full episode history
+ episode_history_text += prompt_preprocessor(action_text + text_obs) # append to history of this episode
+
+ if feed_episode_history:
+ # feed full episode history
+ llm_prompt = in_context_examples + episode_history_text + action_query()
+
+ else:
+ n = min(last_n, len(observations))
+ obs = observations[-n:]
+ act = (actions + [action_query()])[-n:]
+
+ episode_text = "".join([o+a for o,a in zip(obs, act)])
+
+ llm_prompt = in_context_examples + new_episode_text + episode_text
+
+ llm_prompt = prompt_preprocessor(llm_prompt)
+
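+ # Prompt layout at this point: with --feed-full-ep it is
+ #   <in-context examples> + <full episode history> + "Act :",
+ # otherwise it is
+ #   <in-context examples> + "New episode.\n" + the last `last_n` Obs/Act pairs,
+ # ending with "Act :" so that the LLM completes the next action.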
+
+ # save the image
+ env.render(mode="human")
+ rgb_img = plt_2_rgb(env)
+ frames.append(rgb_img)
+
+ if env.current_env.box.blocked and not env.current_env.box.is_open:
+ # target box is blocked -> apple can't be obtained
+ # break to save compute
+ break
+
+ if done:
+ # quadruple last frame to pause between episodes
+ for i in range(3):
+ same_img = np.copy(rgb_img)
+ # toggle a pixel between frames to avoid cropping when going from gif to mp4
+ same_img[0, 0, 2] = 0 if (i % 2) == 0 else 255
+ frames.append(same_img)
+
+ if reward > 0:
+ print("Success!")
+ episode_history_text += "Success!\n"
+ success = True
+ else:
+ episode_history_text += "Failure!\n"
+
+ with open(ep_h_log_filename, "a+") as f:
+ f.write("\nnew prompt: -----------------------------------\n")
+ f.write(episode_history_text)
+
+ break
+
+ else:
+ with open(ep_h_log_filename, "a+") as f:
+ f.write("\nnew prompt: -----------------------------------\n")
+ f.write(episode_history_text)
+
+ print(f"{'Success' if success else 'Failure'}")
+ success_rates.append(success)
+
+mean_success_rate = np.mean(success_rates)
+print("Success rate:", mean_success_rate)
+print(f"Saving gif to {gif_savename}.")
+mimsave(gif_savename, frames, duration=args.pause)
+
+print("Done.")
+
+log_data_dict = vars(args)
+log_data_dict["success_rates"] = success_rates
+log_data_dict["mean_success_rate"] = mean_success_rate
+
+print("Evaluation log: ", evaluation_log_filename)
+with open(evaluation_log_filename, "w") as f:
+ f.write(json.dumps(log_data_dict))
diff --git a/scripts/create_LLM_examples.py b/scripts/create_LLM_examples.py
new file mode 100755
index 0000000000000000000000000000000000000000..cfc20bc650a5162474b288a27d0db2ae0280a5fc
--- /dev/null
+++ b/scripts/create_LLM_examples.py
@@ -0,0 +1,271 @@
+#!/usr/bin/env python3
+import argparse
+from gym_minigrid.window import Window
+from utils import *
+import gym
+import pickle
+from datetime import datetime
+
+episodes = []
+record = [False]
+
+
+def update_caption_with_recording_indicator():
+ new_caption = f"Recoding {'ON' if record[0] else 'OFF'}\n------------------\n\n" + window.caption.get_text()
+ window.set_caption(new_caption)
+
+def redraw(img):
+ if not args.agent_view:
+ img = env.render('rgb_array', tile_size=args.tile_size, mask_unobserved=args.mask_unobserved)
+
+ # add the recording indicator to the caption
+ update_caption_with_recording_indicator()
+
+ window.show_img(img)
+
+def start_recording():
+ record[0] = True
+ print("Recording started")
+
+ episodes[-1][-1]["record"]=True
+
+def reset():
+ episodes.append([])
+ obs, info = env.reset_with_info()
+ record[0] = False
+ redraw(obs)
+
+ episodes[-1].append(
+ {
+ "action": None,
+ "info": info,
+ "obs": obs,
+ "reward": None,
+ "done": None,
+ "record": record[0],
+ }
+ )
+
+
+def step(action):
+ if type(action) == np.ndarray:
+ obs, reward, done, info = env.step(action)
+ else:
+ action = [int(action), np.nan, np.nan]
+ obs, reward, done, info = env.step(action)
+
+ episodes[-1].append(
+ {
+ "action": action,
+ "info": info,
+ "obs": obs,
+ "reward": reward,
+ "done": done,
+ "record": record[0],
+ }
+ )
+ redraw(obs)
+
+ if done:
+ print('done!')
+ print('Reward=%.2f' % (reward))
+
+ # reset and add initial state to episodes
+ reset()
+
+ else:
+ print('\nStep=%s' % (env.step_count))
+
+
+ # filter steps without recording
+ episodes_to_save = [[s for s in ep if s["record"]] for ep in episodes]
+ episodes_to_save = [ep for ep in episodes_to_save if len(ep) > 0]
+
+ # set first recording step to be as if it was just reset (the real first step)
+ for ep_to_save in episodes_to_save:
+ ep_to_save[0]["action"]=None
+ ep_to_save[0]["reward"]=None
+ ep_to_save[0]["done"]=None
+
+
+ # pickle the recorded episodes
+ dump_pickle = Path(output_dir) / "episodes.pkl"
+ print(f"Saving {len(episodes_to_save)} episodes ({[len(e) for e in episodes_to_save]}) to : {dump_pickle}")
+
+ with open(dump_pickle, 'wb') as f:
+ pickle.dump(episodes_to_save, f)
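+ # episodes.pkl format: a list of recorded episodes, each a list of step dicts with
+ # keys "action", "info", "obs", "reward", "done" and "record"; the first step of
+ # every saved episode has action/reward/done reset to None (see above).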
+
+
+def key_handler(event):
+
+ print('pressed', event.key)
+
+ if event.key == 'r':
+ start_recording()
+ return
+
+ if event.key == 'escape':
+ window.close()
+ return
+
+ if event.key == 's':
+ reset()
+ return
+
+ if event.key == 'tab':
+ step(np.array([np.nan, np.nan, np.nan]))
+ return
+
+ if event.key == 'shift':
+ step(np.array([np.nan, np.nan, np.nan]))
+ return
+
+ if event.key == 'left':
+ step(env.actions.left)
+ return
+ if event.key == 'right':
+ step(env.actions.right)
+ return
+ if event.key == 'up':
+ step(env.actions.forward)
+ return
+ if event.key == 't':
+ step(env.actions.speak)
+ return
+
+ if event.key == '1':
+ step(np.array([np.nan, 0, 0]))
+ return
+ if event.key == '2':
+ step(np.array([np.nan, 0, 1]))
+ return
+ if event.key == '3':
+ step(np.array([np.nan, 1, 0]))
+ return
+ if event.key == '4':
+ step(np.array([np.nan, 1, 1]))
+ return
+ if event.key == '5':
+ step(np.array([np.nan, 2, 2]))
+ return
+ if event.key == '6':
+ step(np.array([np.nan, 1, 2]))
+ return
+ if event.key == '7':
+ step(np.array([np.nan, 2, 1]))
+ return
+ if event.key == '8':
+ step(np.array([np.nan, 1, 3]))
+ return
+ if event.key == 'p':
+ step(np.array([np.nan, 3, 3]))
+ return
+
+ # Spacebar
+ if event.key == ' ':
+ step(env.actions.toggle)
+ return
+ if event.key == '9':
+ step(env.actions.pickup)
+ return
+ if event.key == '0':
+ step(env.actions.drop)
+ return
+
+ if event.key == 'enter':
+ step(env.actions.done)
+ return
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+
+ parser.add_argument(
+ "--env",
+ help="gym environment to load",
+ # default="SocialAI-AsocialBoxInformationSeekingParamEnv-v1",
+ # default="SocialAI-ColorBoxesLLMCSParamEnv-v1",
+ default="SocialAI-ColorLLMCSParamEnv-v1",
+ )
+ parser.add_argument(
+ "--seed",
+ type=int,
+ help="random seed to generate the environment with",
+ default=-1
+ )
+ parser.add_argument(
+ "--tile_size",
+ type=int,
+ help="size at which to render tiles",
+ default=32
+ )
+ parser.add_argument(
+ '--agent_view',
+ default=False,
+ help="draw the agent sees (partially observable view)",
+ action='store_true'
+ )
+ parser.add_argument(
+ '--mask-unobserved',
+ default=False,
+ help="mask cells that are not observed by the agent",
+ action='store_true'
+ )
+ parser.add_argument(
+ '--save-dir',
+ default="./llm_data/in_context_examples/",
+ help="file where to save episodes",
+ )
+ parser.add_argument(
+ '--load',
+ default=None,
+ help="Load in context examples to append to",
+ )
+ parser.add_argument(
+ '--name',
+ default="in_context",
+ help="additional name tag for the episodes",
+ )
+ parser.add_argument(
+ '--draw-tree',
+ action="store_true",
+ help="Draw the sampling treee",
+ )
+
+ # Put all env-related arguments after --env-args, e.g. --env-args nb_foo 1 is_bar True
+ parser.add_argument("--env-args", nargs='*', default=None)
+
+ args = parser.parse_args()
+
+ env = gym.make(args.env, **env_args_str_to_dict(args.env_args))
+
+ timestamp = datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
+ output_dir = Path(args.save_dir) / f"{args.name}_{args.env}_{timestamp}"
+ os.makedirs(output_dir, exist_ok=True)
+
+ if args.load:
+ with open(args.load, 'rb') as f:
+ episodes = pickle.load(f)
+
+ if args.draw_tree:
+ # draw tree
+ env.parameter_tree.draw_tree(
+ filename=output_dir / f"/{args.env}_raw_tree",
+ ignore_labels=["Num_of_colors"],
+ )
+
+ if args.seed >= 0:
+ env.seed(args.seed)
+
+ window = Window('gym_minigrid - ' + args.env, figsize=(6, 4))
+ window.reg_key_handler(key_handler)
+ env.window = window
+
+ reset()
+ # # a trick to make the first image appear right away
+ # # this action is not saved
+ # obs, _, _, _ = env.step(np.array([np.nan, np.nan, np.nan]))
+ # redraw(obs)
+
+ # Blocking event loop
+ window.show(block=True)
diff --git a/scripts/create_LLM_examples_old.py b/scripts/create_LLM_examples_old.py
new file mode 100755
index 0000000000000000000000000000000000000000..ab8ab4a11bf738e78f9dc9744f4f6c20e2346fe9
--- /dev/null
+++ b/scripts/create_LLM_examples_old.py
@@ -0,0 +1,313 @@
+#!/usr/bin/env python3
+
+import time
+import argparse
+import numpy as np
+import gym
+import gym_minigrid
+from gym_minigrid.wrappers import *
+from gym_minigrid.window import Window
+from utils import *
+from models import MultiModalBaby11ACModel
+from collections import Counter
+import torch_ac
+import json
+from termcolor import colored, COLORS
+
+from functools import partial
+from tkinter import *
+
+from torch.distributions import Categorical
+
+inter_acl = False
+draw_tree = True
+
+def redraw(img):
+ if not args.agent_view:
+ img = env.render('rgb_array', tile_size=args.tile_size, mask_unobserved=args.mask_unobserved)
+
+ window.show_img(img)
+
+def reset():
+ # if args.seed != -1:
+ # env.seed(args.seed)
+
+ obs = env.reset()
+
+ if hasattr(env, 'mission'):
+ print('Mission: %s' % env.mission)
+ window.set_caption(env.mission)
+
+ redraw(obs)
+
+
+tot_bonus = [0]
+
+prev = {
+ "prev_obs": None,
+ "prev_info": {},
+}
+shortened_obj_names = {
+ 'lockablebox' : 'loc_box',
+ 'applegenerator' : 'app_gen',
+ 'generatorplatform': 'gen_pl',
+ 'marbletee' : 'tee',
+ 'remotedoor' : 'rem_door',
+}
+
+IDX_TO_OBJECT = {v: shortened_obj_names.get(k, k) for k, v in OBJECT_TO_IDX.items()}
+# no duplicates
+assert len(IDX_TO_OBJECT) == len(OBJECT_TO_IDX)
+
+IDX_TO_COLOR = {v: k for k, v in COLOR_TO_IDX.items()}
+assert len(IDX_TO_COLOR) == len(COLOR_TO_IDX)
+
+
+# def to_string(enc):
+# s = "{:<8} {} {} {} {} {:3} {:3} {}\t".format(
+# IDX_TO_OBJECT.get(enc[0], enc[0]), # obj
+# *enc[1:3], # x, y
+# IDX_TO_COLOR.get(enc[3], enc[3])[:1].upper(), # color
+# *enc[4:] #
+# )
+#
+# if IDX_TO_OBJECT.get(enc[0], enc[0]) == "unseen":
+# pass
+# # s = colored(s, "on_grey")
+#
+# elif IDX_TO_OBJECT.get(enc[0], enc[0]) != "empty":
+# col = IDX_TO_COLOR.get(enc[3], enc[3])
+# if col in COLORS:
+# s = colored(s, col)
+#
+# return s
+
+
+def step(action):
+ if type(action) == np.ndarray:
+ obs, reward, done, info = env.step(action)
+ else:
+ action = [int(action), np.nan, np.nan]
+ obs, reward, done, info = env.step(action)
+
+
+ redraw(obs)
+
+ if done:
+ print('done!')
+ print('Reward=%.2f' % (reward))
+ print('Exploration_bonus=%.2f' % (tot_bonus[0]))
+ tot_bonus[0] = 0
+
+ with open(output_file, "a") as f:
+ if reward > 0:
+ f.write("Success!\n")
+ f.write("New episode.\n")
+
+ reset()
+
+ else:
+ print('\nStep=%s' % (env.step_count))
+
+ # print to screen
+ print("Obs : ", end="")
+ print("".join(info["descriptions"]), end="")
+ if obs["utterance_history"] != "Conversation: \n":
+ print(obs['utterance_history'])
+ print("Act : ", end="")
+
+ # write to file
+ with open(output_file, "a") as f:
+ f.write("Obs : ")
+ f.write("".join(info["descriptions"]))
+ if obs["utterance_history"] != "Conversation: \n":
+ f.write(obs['utterance_history'])
+ # f.write("Your possible actions are:\n")
+ # f.write("(a) move forward\n")
+ # f.write("(b) turn left\n")
+ # f.write("(c) turn right\n")
+ # f.write("(d) toggle\n")
+ # f.write("(e) no_op\n")
+ f.write("Act : ")
+
+ print('Full reward (undiminished)=%.2f' % (reward))
+
+
+def key_handler(event):
+
+ # if hasattr(event.canvas, "_event_loop") and event.canvas._event_loop.isRunning():
+ # return
+
+ print('pressed', event.key)
+
+ action_dict = {
+ "up": "a) move forward",
+ "left": "b) turn left",
+ "right": "c) turn right",
+ " ": "d) toggle",
+ "shift": "e) no_op",
+ }
+ action_dict = {
+ "up": "move forward",
+ "left": "turn left",
+ "right": "turn right",
+ " ": "toggle",
+ "shift": "no_op",
+ }
+
+ if event.key in action_dict:
+ your_action = action_dict[event.key]
+
+ with open(output_file, "a") as f:
+ f.write("{}\n".format(your_action))
+
+ if event.key == 'escape':
+ window.close()
+ return
+
+ if event.key == 'r':
+ reset()
+ return
+
+ if event.key == 'tab':
+ step(np.array([np.nan, np.nan, np.nan]))
+ return
+
+ if event.key == 'shift':
+ step(np.array([np.nan, np.nan, np.nan]))
+ return
+
+ if event.key == 'left':
+ step(env.actions.left)
+ return
+ if event.key == 'right':
+ step(env.actions.right)
+ return
+ if event.key == 'up':
+ step(env.actions.forward)
+ return
+ if event.key == 't':
+ step(env.actions.speak)
+ return
+
+ if event.key == '1':
+ step(np.array([np.nan, 0, 0]))
+ return
+ if event.key == '2':
+ step(np.array([np.nan, 0, 1]))
+ return
+ if event.key == '3':
+ step(np.array([np.nan, 1, 0]))
+ return
+ if event.key == '4':
+ step(np.array([np.nan, 1, 1]))
+ return
+ if event.key == '5':
+ step(np.array([np.nan, 2, 2]))
+ return
+ if event.key == '6':
+ step(np.array([np.nan, 1, 2]))
+ return
+ if event.key == '7':
+ step(np.array([np.nan, 2, 1]))
+ return
+ if event.key == '8':
+ step(np.array([np.nan, 1, 3]))
+ return
+ if event.key == 'p':
+ step(np.array([np.nan, 3, 3]))
+ return
+
+ # Spacebar
+ if event.key == ' ':
+ step(env.actions.toggle)
+ return
+ if event.key == '9':
+ step(env.actions.pickup)
+ return
+ if event.key == '0':
+ step(env.actions.drop)
+ return
+
+ if event.key == 'enter':
+ step(env.actions.done)
+ return
+
+parser = argparse.ArgumentParser()
+parser.add_argument(
+ "--env",
+ help="gym environment to load",
+ # default="SocialAI-AsocialBoxInformationSeekingParamEnv-v1",
+ default="SocialAI-ColorBoxesLLMCSParamEnv-v1",
+)
+parser.add_argument(
+ "--seed",
+ type=int,
+ help="random seed to generate the environment with",
+ default=-1
+)
+parser.add_argument(
+ "--tile_size",
+ type=int,
+ help="size at which to render tiles",
+ default=32
+)
+parser.add_argument(
+ '--agent_view',
+ default=False,
+ help="draw the agent sees (partially observable view)",
+ action='store_true'
+)
+parser.add_argument(
+ '--print_grid',
+ default=False,
+ help="print the grid with symbols",
+ action='store_true'
+)
+parser.add_argument(
+ '--calc-bonus',
+ default=False,
+ help="calculate explo bonus",
+ action='store_true'
+)
+parser.add_argument(
+ '--mask-unobserved',
+ default=False,
+ help="mask cells that are not observed by the agent",
+ action='store_true'
+)
+parser.add_argument(
+ '--output-file',
+ default="./llm_data/in_context_color_test.txt",
+ help="file where to save episodes",
+)
+
+
+# Put all env-related arguments after --env-args, e.g. --env-args nb_foo 1 is_bar True
+parser.add_argument("--env-args", nargs='*', default=None)
+
+args = parser.parse_args()
+
+output_file=args.output_file
+
+env = gym.make(args.env, **env_args_str_to_dict(args.env_args))
+
+if draw_tree:
+ # draw tree
+ env.parameter_tree.draw_tree(
+ filename="viz/SocialAIParam/{}_raw_tree".format(args.env),
+ ignore_labels=["Num_of_colors"],
+ )
+
+if args.seed >= 0:
+ env.seed(args.seed)
+
+with open(output_file, "a") as f:
+ f.write("New episode.\n")
+
+window = Window('gym_minigrid - ' + args.env, figsize=(4, 4))
+window.reg_key_handler(key_handler)
+env.window = window
+
+# Blocking event loop
+window.show(block=True)
diff --git a/scripts/draw_tree.py b/scripts/draw_tree.py
new file mode 100755
index 0000000000000000000000000000000000000000..88321a86f3810b1527a6ff81c49b18197b07b5e2
--- /dev/null
+++ b/scripts/draw_tree.py
@@ -0,0 +1,52 @@
+#!/usr/bin/env python3
+
+import argparse
+from utils import *
+
+parser = argparse.ArgumentParser()
+parser.add_argument(
+ "--env",
+ help="gym environment to load",
+ default='SocialAI-DrawingEnv-v1',
+)
+parser.add_argument(
+ "--seed",
+ type=int,
+ help="random seed to generate the environment with",
+ default=-1
+)
+parser.add_argument(
+ "--tile_size",
+ type=int,
+ help="size at which to render tiles",
+ default=32
+)
+
+# Put all env-related arguments after --env-args, e.g. --env-args nb_foo 1 is_bar True
+parser.add_argument("--env-args", nargs='*', default=None)
+
+
+args = parser.parse_args()
+
+
+env = gym.make(args.env, **env_args_str_to_dict(args.env_args))
+
+# draw tree
+env.parameter_tree.draw_tree(
+ filename="viz/SocialAIParam/{}_raw_tree".format(args.env),
+ ignore_labels=["Num_of_colors"],
+ folded_nodes=["Collaboration", "AppleStealing"],
+ label_parser={
+ "AppleStealing": "Adversarial",
+ "Pragmatic_frame_complexity": "Introductory_sequence",
+},
+ selected_parameters={
+ "Env_type": "Information_seeking",
+ "Pragmatic_frame_complexity": "Eye_contact",
+ "Peer_help": "N",
+ "Cue_type": "Pointing",
+ "Problem": "Doors",
+ "N": "1",
+ "Peer": "N",
+ }
+)
diff --git a/scripts/evaluate.py b/scripts/evaluate.py
new file mode 100644
index 0000000000000000000000000000000000000000..3f73731b1cc98fd6d7f0b61c5f89589ddeab0ac9
--- /dev/null
+++ b/scripts/evaluate.py
@@ -0,0 +1,358 @@
+import argparse
+import matplotlib.pyplot as plt
+import json
+import time
+import numpy as np
+import torch
+from pathlib import Path
+
+from utils.babyai_utils.baby_agent import load_agent
+from utils.storage import get_status
+from utils.env import make_env
+from utils.other import seed
+from utils.storage import get_model_dir
+from models import *
+
+from scipy import stats
+print("Wrong script. This is from VIGIL")
+exit()
+
+start = time.time()
+
+# Parse arguments
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--seed", type=int, default=0,
+ help="random seed (default: 0)")
+parser.add_argument("--random-agent", action="store_true", default=False,
+ help="random actions")
+parser.add_argument("--argmax", action="store_true", default=False,
+ help="select the action with highest probability (default: False)")
+parser.add_argument("--episodes", type=int, default=1000,
+ help="number of episodes to test")
+parser.add_argument("--test-p", type=float, default=0.05,
+ help="p value")
+parser.add_argument("--n-seeds", type=int, default=16,
+ help="number of episodes to test")
+parser.add_argument("--subsample-step", type=int, default=1,
+ help="subsample step")
+parser.add_argument("--start-step", type=int, default=1,
+ help="at which step to start the curves")
+
+args = parser.parse_args()
+
+# Set seed for all randomness sources
+
+seed(args.seed)
+
+assert args.seed == 1
+assert not args.argmax
+# assert args.num_frames == 28000000
+# assert args.episodes == 1000
+
+test_p = args.test_p
+n_seeds = args.n_seeds
+subsample_step = args.subsample_step
+start_step = args.start_step
+
+# Set device
+
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+print(f"Device: {device}\n")
+
+# what to load
+models_to_evaluate = [
+ "25-03_RERUN_WizardGuide_lang64_mm_baby_short_rec_env_MiniGrid-TalkItOutNoLiar-8x8-v0_multi-modal-babyai11-agent_arch_original_endpool_res_custom-ppo-2_exploration-bonus-params_5_50",
+ "25-03_RERUN_WizardTwoGuides_lang64_mm_baby_short_rec_env_MiniGrid-TalkItOut-8x8-v0_multi-modal-babyai11-agent_arch_original_endpool_res_custom-ppo-2_exploration-bonus-params_5_50"
+]
+print("evaluating models: ", models_to_evaluate)
+
+# what to put in the legend
+label_parser_dict = {
+ "RERUN_WizardGuide_lang64_no_explo": "Abl-MH-BabyAI",
+ "RERUN_WizardTwoGuides_lang64_no_explo": "MH-BabyAI",
+
+ "RERUN_WizardGuide_lang64_mm_baby_short_rec_env": "Abl-MH-BabyAI-ExpBonus",
+ "RERUN_WizardTwoGuides_lang64_mm_baby_short_rec_env": "MH-BabyAI-ExpBonus",
+
+ "RERUN_WizardGuide_lang64_deaf_no_explo": "Abl-Deaf-MH-BabyAI",
+ "RERUN_WizardTwoGuides_lang64_deaf_no_explo": "Deaf-MH-BabyAI",
+
+ "RERUN_WizardGuide_lang64_bow": "Abl-MH-BabyAI-ExpBonus-BOW",
+ "RERUN_WizardTwoGuides_lang64_bow": "MH-BabyAI-ExpBonus-BOW",
+
+ "RERUN_WizardGuide_lang64_no_mem": "Abl-MH-BabyAI-ExpBonus-no-mem",
+ "RERUN_WizardTwoGuides_lang64_no_mem": "MH-BabyAI-ExpBonus-no-mem",
+
+ "RERUN_WizardGuide_lang64_bigru": "Abl-MH-BabyAI-ExpBonus-bigru",
+ "RERUN_WizardTwoGuides_lang64_bigru": "MH-BabyAI-ExpBonus-bigru",
+
+ "RERUN_WizardGuide_lang64_attgru": "Abl-MH-BabyAI-ExpBonus-attgru",
+ "RERUN_WizardTwoGuides_lang64_attgru": "MH-BabyAI-ExpBonus-attgru",
+
+ "RERUN_WizardGuide_lang64_curr_dial": "Abl-MH-BabyAI-ExpBonus-current-dialogue",
+ "RERUN_WizardTwoGuides_lang64_curr_dial": "MH-BabyAI-ExpBonus-current-dialogue",
+
+ "RERUN_WizardTwoGuides_lang64_mm_baby_short_rec_100M": "MH-BabyAI-ExpBonus-100M"
+}
+
+# how do to stat tests
+compare = {
+ "MH-BabyAI-ExpBonus": "Abl-MH-BabyAI-ExpBonus",
+}
+
+COLORS = ["red", "blue", "green", "black", "purpule", "brown", "orange", "gray"]
+label_color_dict = {l: c for l, c in zip(label_parser_dict.values(), COLORS)}
+
+
+test_set_check_path = Path("test_set_check_{}_nep_{}.json".format(args.seed, args.episodes))
+
+def calc_perf_for_seed(i, model_name, num_frames, seed, argmax, episodes, random_agent=False):
+ print("seed {}".format(i))
+ model = Path(model_name) / str(i)
+ model_dir = get_model_dir(model)
+
+ if test_set_check_path.exists():
+ with open(test_set_check_path, "r") as f:
+ check_loaded = json.load(f)
+ print("check loaded")
+ else:
+ print("check not loaded")
+ check_loaded = None
+
+ # Load environment
+ with open(model_dir+"/config.json") as f:
+ conf = json.load(f)
+
+ env_name = conf["env"]
+
+ env = make_env(env_name, seed)
+ print("Environment loaded\n")
+
+ # load agent
+ agent = load_agent(env, model_dir, argmax, num_frames)
+ status = get_status(model_dir, num_frames)
+ assert status["num_frames"] == num_frames
+ print("Agent loaded\n")
+
+ check = {}
+
+ seed_rewards = []
+ for episode in range(episodes):
+ print("[{}/{}]: ".format(episode, episodes), end="", flush=True)
+
+ obs = env.reset()
+
+ # check that test episodes are identical across seeds, using the obs image sum as a fingerprint
+ if episode in check:
+ assert check[episode] == int(obs['image'].sum())
+ else:
+ check[episode] = int(obs['image'].sum())
+
+ if check_loaded is not None:
+ assert check[episode] == int(obs['image'].sum())
+
+ while True:
+ if random_agent:
+ action = agent.get_random_action(obs)
+ else:
+ action = agent.get_action(obs)
+
+ obs, reward, done, _ = env.step(action)
+ print(".", end="", flush=True)
+ agent.analyze_feedback(reward, done)
+
+ if done:
+ seed_rewards.append(reward)
+ break
+
+ print()
+
+ seed_rewards = np.array(seed_rewards)
+ seed_success_rates = seed_rewards > 0
+
+ if not test_set_check_path.exists():
+ with open(test_set_check_path, "w") as f:
+ json.dump(check, f)
+ print("check saved")
+
+ print("seed success rate:", seed_success_rates.mean())
+ print("seed reward:", seed_rewards.mean())
+
+ return seed_rewards.mean(), seed_success_rates.mean()
+
+
+
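+# Collect the checkpoint steps (status_<N> files) shared by all seeds of a model,
+# asserting that every seed was saved at the same set of steps.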
+def get_available_steps(model):
+ model_dir = Path(get_model_dir(model))
+ per_seed_available_steps = {}
+ for seed_dir in model_dir.glob("*"):
+ per_seed_available_steps[seed_dir] = sorted([
+ int(str(p.with_suffix("")).split("status_")[-1])
+ for p in seed_dir.glob("status_*")
+ ])
+
+ num_steps = min([len(steps) for steps in per_seed_available_steps.values()])
+
+ steps = list(per_seed_available_steps.values())[0][:num_steps]
+
+ for available_steps in per_seed_available_steps.values():
+ s_steps = available_steps[:num_steps]
+ assert steps == s_steps
+
+ return steps
+
+def plot_with_shade(subplot_nb, ax, x, y, err, color, shade_color, label,
+ legend=False, leg_size=30, leg_loc='best', title=None,
+ ylim=[0, 100], xlim=[0, 40], leg_args={}, leg_linewidth=8.0, linewidth=7.0, ticksize=30,
+ zorder=None, xlabel='perf', ylabel='env steps', smooth_factor=1000):
+ # plt.rcParams.update({'font.size': 15})
+ ax.locator_params(axis='x', nbins=6)
+ ax.locator_params(axis='y', nbins=5)
+ ax.tick_params(axis='both', which='major', labelsize=ticksize)
+
+ # smoothing
+ def smooth(x_, n=50):
+ return np.array([x_[max(i - n, 0):i + 1].mean() for i in range(len(x_))])
+
+ if smooth_factor > 0:
+ y = smooth(y, n=smooth_factor)
+ err = smooth(err, n=smooth_factor)
+
+ ax.plot(x, y, color=color, label=label, linewidth=linewidth, zorder=zorder)
+ ax.fill_between(x, y - err, y + err, color=shade_color, alpha=0.2)
+ if legend:
+ leg = ax.legend(loc=leg_loc, fontsize=leg_size, **leg_args) # 34
+ for legobj in leg.legendHandles:
+ legobj.set_linewidth(leg_linewidth)
+ ax.set_xlabel(xlabel, fontsize=30)
+ if subplot_nb == 0:
+ ax.set_ylabel(ylabel, fontsize=30)
+ ax.set_xlim(xmin=xlim[0], xmax=xlim[1])
+ ax.set_ylim(bottom=ylim[0], top=ylim[1])
+ if title:
+ ax.set_title(title, fontsize=22)
+
+
+def label_parser(label, label_parser_dict):
+ if sum([1 for k, v in label_parser_dict.items() if k in label]) != 1:
+ print("ERROR")
+ print(label)
+ exit()
+
+ for k, v in label_parser_dict.items():
+ if k in label: return v
+
+ return label
+
+
+f, ax = plt.subplots(1, 1, figsize=(10.0, 6.0))
+ax = [ax]
+
+performances = {}
+per_seed_performances = {}
+stds = {}
+
+
+label_parser_dict_reverse = {v: k for k, v in label_parser_dict.items()}
+assert len(label_parser_dict_reverse) == len(label_parser_dict)
+
+label_to_model = {}
+# evaluate and draw curves
+for model in models_to_evaluate:
+ label = label_parser(model, label_parser_dict)
+ label_to_model[label] = model
+
+ color = label_color_dict[label]
+ performances[label] = []
+ per_seed_performances[label] = []
+ stds[label] = []
+
+ steps = get_available_steps(model)
+ steps = steps[::subsample_step]
+ steps = [s for s in steps if s > start_step]
+
+ print("steps:", steps)
+
+ for step in steps:
+ results = []
+ for s in range(n_seeds):
+ results.append(calc_perf_for_seed(
+ s,
+ model_name=model,
+ num_frames=step,
+ seed=args.seed,
+ argmax=args.argmax,
+ episodes=args.episodes,
+ ))
+
+ rewards, success_rates = zip(*results)
+ rewards = np.array(rewards)
+ success_rates = np.array(success_rates)
+ per_seed_performances[label].append(success_rates)
+ performances[label].append(success_rates.mean())
+ stds[label].append(success_rates.std())
+
+ means = np.array(performances[label])
+ err = np.array(stds[label])
+ label = label_parser(str(model), label_parser_dict)
+ max_steps = np.max(steps)
+ min_steps = np.min(steps)
+ min_y = 0.0
+ max_y = 1.0
+ ylabel = "performance"
+ smooth_factor = 0
+
+ plot_with_shade(0, ax[0], steps, means, err, color, color, label,
+ legend=True, xlim=[min_steps, max_steps], ylim=[min_y, max_y],
+ leg_size=20, xlabel="Env steps (millions)", ylabel=ylabel, linewidth=5.0, smooth_factor=smooth_factor)
+
+assert len(label_to_model) == len(models_to_evaluate)
+
+
+def get_compatible_steps(model1, model2, subsample_step):
+ steps_1 = get_available_steps(model1)[::subsample_step]
+ steps_2 = get_available_steps(model2)[::subsample_step]
+
+ min_steps = min(len(steps_1), len(steps_2))
+ steps_1 = steps_1[:min_steps]
+ steps_2 = steps_2[:min_steps]
+ assert steps_1 == steps_2
+
+ return steps_1
+
+
+# stat tests
+for k, v in compare.items():
+ dist_1_steps = per_seed_performances[k]
+ dist_2_steps = per_seed_performances[v]
+
+ model_k = label_to_model[k]
+ model_v = label_to_model[v]
+ steps = get_compatible_steps(model_k, model_v, subsample_step)
+ steps = [s for s in steps if s > start_step]
+
+ for step, dist_1, dist_2 in zip(steps, dist_1_steps, dist_2_steps):
+ assert len(dist_1) == n_seeds
+ assert len(dist_2) == n_seeds
+
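+ # equal_var=False makes this Welch's t-test, i.e. no equal-variance assumption
+ # between the two per-seed success-rate distributions.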
+ p = stats.ttest_ind(
+ dist_1,
+ dist_2,
+ equal_var=False
+ ).pvalue
+
+ if np.isnan(p):
+ from IPython import embed; embed()
+
+ if p < test_p:
+ plt.scatter(step, 0.8, color=label_color_dict[k], s=50, marker="x")
+
+ print("{} (m:{}) <---> {} (m:{}) = p: {} result: {}".format(
+ k, np.mean(dist_1), v, np.mean(dist_2), p,
+ "Distributions different(p={})".format(test_p) if p < test_p else "Distributions same(p={})".format(test_p)
+ ))
+ print()
+
+f.savefig('graphics/test.png')
+f.savefig('graphics/test.svg')
diff --git a/scripts/evaluate_new.py b/scripts/evaluate_new.py
new file mode 100644
index 0000000000000000000000000000000000000000..7a3379ebb76a1706d7573da6612935270a4c0c63
--- /dev/null
+++ b/scripts/evaluate_new.py
@@ -0,0 +1,409 @@
+import argparse
+import os
+import matplotlib.pyplot as plt
+import json
+import time
+import numpy as np
+import torch
+from pathlib import Path
+
+from utils.babyai_utils.baby_agent import load_agent
+from utils.storage import get_status
+from utils.env import make_env
+from utils.other import seed
+from utils.storage import get_model_dir
+from models import *
+from utils.env import env_args_str_to_dict
+import gym
+from termcolor import cprint
+
+os.makedirs("./evaluation", exist_ok=True)
+
+start = time.time()
+
+# Parse arguments
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--test-set-seed", type=int, default=0,
+ help="random seed (default: 0)")
+parser.add_argument("--random-agent", action="store_true", default=False,
+ help="random actions")
+parser.add_argument("--quiet", "-q", action="store_true", default=False,
+ help="quiet")
+parser.add_argument("--eval-env", type=str, default=None,
+ help="env to evaluate on")
+parser.add_argument("--model-to-evaluate", type=str, default=None,
+ help="model to evaluate")
+parser.add_argument("--model-label", type=str, default=None,
+ help="model to evaluate")
+parser.add_argument("--max-steps", type=int, default=None,
+ help="max num of steps")
+parser.add_argument("--argmax", action="store_true", default=False,
+ help="select the action with highest probability (default: False)")
+parser.add_argument("--episodes", type=int, default=1000,
+ help="number of episodes to test")
+parser.add_argument("--test-p", type=float, default=0.05,
+ help="p value")
+parser.add_argument("--n-seeds", type=int, default=8,
+ help="number of episodes to test")
+parser.add_argument("--subsample-step", type=int, default=1,
+ help="subsample step")
+parser.add_argument("--start-step", type=int, default=1,
+ help="at which step to start the curves")
+parser.add_argument("--env_args", nargs='*', default=None)
+
+args = parser.parse_args()
+
+# Set seed for all randomness sources
+
+seed(args.test_set_seed)
+
+assert args.test_set_seed == 1 # turn on for testing
+# assert not args.argmax
+
+# assert args.num_frames == 28000000
+# assert args.episodes == 1000
+
+test_p = args.test_p
+n_seeds = args.n_seeds
+assert n_seeds in [16, 8, 4]
+cprint("n seeds: {}".format(n_seeds), "red")
+subsample_step = args.subsample_step
+start_step = args.start_step
+
+# Set device
+def qprint(*a, **kwargs):
+ if not args.quiet:
+ print(*a, **kwargs)
+
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+qprint(f"Device: {device}\n")
+
+# what to load
+if args.model_to_evaluate is None:
+ models_to_evaluate = [
+ "19-05_500K_HELP_env_MiniGrid-Exiter-8x8-v0_multi-modal-babyai11-agent_arch_original_endpool_res_custom-ppo-2"
+ ]
+ label_parser_dict = {
+ "19-05_500K_HELP_env_MiniGrid-Exiter-8x8-v0_multi-modal-babyai11-agent_arch_original_endpool_res_custom-ppo-2": "Exiter_EB",
+ }
+else:
+ model_name = args.model_to_evaluate.replace("./storage/", "").replace("storage/", "")
+ models_to_evaluate = [
+ model_name
+ ]
+ if args.model_label:
+ label_parser_dict = {
+ model_name: args.model_label,
+ }
+ else:
+ label_parser_dict = {
+ model_name: model_name,
+ }
+ qprint("evaluating models: ", models_to_evaluate)
+
+
+# how do to stat tests
+compare = {
+ # "MH-BabyAI-ExpBonus": "Abl-MH-BabyAI-ExpBonus",
+}
+
+COLORS = ["red", "blue", "green", "black", "purpule", "brown", "orange", "gray"]
+label_color_dict = {l: c for l, c in zip(label_parser_dict.values(), COLORS)}
+
+
+test_set_check_path = Path("test_set_check_{}_nep_{}.json".format(args.test_set_seed, args.episodes))
+
+def calc_perf_for_seed(i, model_name, seed, argmax, episodes, random_agent=False, num_frames=None):
+ qprint("seed {}".format(i))
+ model = Path(model_name) / str(i)
+ model_dir = get_model_dir(model)
+
+ if test_set_check_path.exists():
+ with open(test_set_check_path, "r") as f:
+ check_loaded = json.load(f)
+ qprint("check loaded")
+ else:
+ qprint("check not loaded")
+ check_loaded = None
+
+ # Load environment
+ with open(model_dir+"/config.json") as f:
+ conf = json.load(f)
+
+ if args.eval_env is None:
+ qprint("evaluating on the original env")
+ env_name = conf["env"]
+ else:
+ qprint("evaluating on a different env")
+ env_name = args.eval_env
+
+ env = gym.make(env_name, **env_args_str_to_dict(args.env_args))
+ qprint("Environment loaded\n")
+
+ # load agent
+ agent = load_agent(env, model_dir, argmax)
+ status = get_status(model_dir)
+ qprint("Agent loaded at {} steps.".format(status.get("num_frames", -1)))
+
+ check = {}
+
+ seed_rewards = []
+ seed_sr = []
+ for episode in range(episodes):
+ qprint("[{}/{}]: ".format(episode, episodes), end="", flush=True)
+
+ obs = env.reset()
+
+ # check that test episodes are identical across seeds, using the obs image sum as a fingerprint
+ if episode in check:
+ assert check[episode] == int(obs['image'].sum())
+ else:
+ check[episode] = int(obs['image'].sum())
+
+ if check_loaded is not None:
+ assert check[episode] == int(obs['image'].sum())
+ i = 0
+ tot_reward = 0
+ while True:
+ i+=1
+ if random_agent:
+ action = agent.get_random_action(obs)
+ else:
+ action = agent.get_action(obs)
+
+ obs, reward, done, info = env.step(action)
+ if reward:
+ qprint("*", end="", flush=True)
+ else:
+ qprint(".", end="", flush=True)
+
+ agent.analyze_feedback(reward, done)
+
+ tot_reward += reward
+
+ if done:
+ seed_rewards.append(tot_reward)
+ seed_sr.append(info["success"])
+ break
+
+ if args.max_steps is not None:
+ if i > args.max_steps:
+ seed_rewards.append(tot_reward)
+ seed_sr.append(info["success"])
+ break
+
+ qprint()
+
+ seed_rewards = np.array(seed_rewards)
+ seed_success_rates = np.array(seed_sr)
+
+ if not test_set_check_path.exists():
+ with open(test_set_check_path, "w") as f:
+ json.dump(check, f)
+ qprint("check saved")
+
+ qprint("seed success rate:", seed_success_rates.mean())
+ qprint("seed reward:", seed_rewards.mean())
+
+ return seed_rewards.mean(), seed_success_rates.mean()
+
+
+def get_available_steps(model):
+ model_dir = Path(get_model_dir(model))
+ per_seed_available_steps = {}
+ for seed_dir in model_dir.glob("*"):
+ per_seed_available_steps[seed_dir] = sorted([
+ int(str(p.with_suffix("")).split("status_")[-1])
+ for p in seed_dir.glob("status_*")
+ ])
+
+ num_steps = min([len(steps) for steps in per_seed_available_steps.values()])
+
+ steps = list(per_seed_available_steps.values())[0][:num_steps]
+
+ for available_steps in per_seed_available_steps.values():
+ s_steps = available_steps[:num_steps]
+ assert steps == s_steps
+
+ return steps
+
+def plot_with_shade(subplot_nb, ax, x, y, err, color, shade_color, label,
+ legend=False, leg_size=30, leg_loc='best', title=None,
+ ylim=[0, 100], xlim=[0, 40], leg_args={}, leg_linewidth=8.0, linewidth=7.0, ticksize=30,
+ zorder=None, xlabel='perf', ylabel='env steps', smooth_factor=1000):
+ # plt.rcParams.update({'font.size': 15})
+ ax.locator_params(axis='x', nbins=6)
+ ax.locator_params(axis='y', nbins=5)
+ ax.tick_params(axis='both', which='major', labelsize=ticksize)
+
+ # smoothing
+ def smooth(x_, n=50):
+ return np.array([x_[max(i - n, 0):i + 1].mean() for i in range(len(x_))])
+
+ if smooth_factor > 0:
+ y = smooth(y, n=smooth_factor)
+ err = smooth(err, n=smooth_factor)
+
+ ax.plot(x, y, color=color, label=label, linewidth=linewidth, zorder=zorder)
+ ax.fill_between(x, y - err, y + err, color=shade_color, alpha=0.2)
+ if legend:
+ leg = ax.legend(loc=leg_loc, fontsize=leg_size, **leg_args) # 34
+ for legobj in leg.legendHandles:
+ legobj.set_linewidth(leg_linewidth)
+ ax.set_xlabel(xlabel, fontsize=30)
+ if subplot_nb == 0:
+ ax.set_ylabel(ylabel, fontsize=30)
+ ax.set_xlim(xmin=xlim[0], xmax=xlim[1])
+ ax.set_ylim(bottom=ylim[0], top=ylim[1])
+ if title:
+ ax.set_title(title, fontsize=22)
+
+
+def label_parser(label, label_parser_dict):
+ if sum([1 for k, v in label_parser_dict.items() if k in label]) != 1:
+ qprint("ERROR")
+ qprint(label)
+ exit()
+
+ for k, v in label_parser_dict.items():
+ if k in label: return v
+
+ return label
+
+
+f, ax = plt.subplots(1, 1, figsize=(10.0, 6.0))
+ax = [ax]
+
+performances = {}
+per_seed_performances = {}
+stds = {}
+
+
+label_parser_dict_reverse = {v: k for k, v in label_parser_dict.items()}
+assert len(label_parser_dict_reverse) == len(label_parser_dict)
+
+label_to_model = {}
+# evaluate and draw curves
+for model in models_to_evaluate:
+ label = label_parser(model, label_parser_dict)
+ label_to_model[label] = model
+
+ color = label_color_dict[label]
+ performances[label] = []
+ per_seed_performances[label] = []
+ stds[label] = []
+
+ final_perf = True
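+ # final_perf is hard-coded to True: each seed is evaluated once with num_frames=None
+ # (presumably the latest checkpoint); the else branch below, which sweeps all
+ # checkpoints to draw curves, is kept but unused.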
+
+ if final_perf:
+
+ results = []
+ for s in range(n_seeds):
+ results.append(calc_perf_for_seed(
+ s,
+ model_name=model,
+ num_frames=None,
+ seed=args.test_set_seed,
+ argmax=args.argmax,
+ episodes=args.episodes,
+ ))
+ rewards, success_rates = zip(*results)
+ # dump per seed performance
+ np.save("./evaluation/{}".format(label), success_rates)
+ rewards = np.array(rewards)
+ success_rates = np.array(success_rates)
+ success_rate_mean = success_rates.mean()
+ success_rate_std = success_rates.std()
+
+ label = label_parser(str(model), label_parser_dict)
+ cprint("{}: {} +- std {}".format(label, success_rate_mean, succes_rate_std), "red")
+
+ else:
+ steps = get_available_steps(model)
+ steps = steps[::subsample_step]
+ steps = [s for s in steps if s > start_step]
+ qprint("steps:", steps)
+
+ for step in steps:
+ results = []
+ for s in range(n_seeds):
+ results.append(calc_perf_for_seed(
+ s,
+ model_name=model,
+ num_frames=step,
+ seed=args.test_set_seed,
+ argmax=args.argmax,
+ episodes=args.episodes,
+ ))
+
+ rewards, success_rates = zip(*results)
+ rewards = np.array(rewards)
+ success_rates = np.array(success_rates)
+ per_seed_performances[label].append(success_rates)
+ performances[label].append(success_rates.mean())
+ stds[label].append(success_rates.std())
+
+ means = np.array(performances[label])
+ err = np.array(stds[label])
+ label = label_parser(str(model), label_parser_dict)
+ max_steps = np.max(steps)
+ min_steps = np.min(steps)
+ min_y = 0.0
+ max_y = 1.0
+ ylabel = "performance"
+ smooth_factor = 0
+
+ plot_with_shade(0, ax[0], steps, means, err, color, color, label,
+ legend=True, xlim=[min_steps, max_steps], ylim=[min_y, max_y],
+ leg_size=20, xlabel="Env steps (millions)", ylabel=ylabel, linewidth=5.0, smooth_factor=smooth_factor)
+
+assert len(label_to_model) == len(models_to_evaluate)
+
+
+def get_compatible_steps(model1, model2, subsample_step):
+ steps_1 = get_available_steps(model1)[::subsample_step]
+ steps_2 = get_available_steps(model2)[::subsample_step]
+
+ min_steps = min(len(steps_1), len(steps_2))
+ steps_1 = steps_1[:min_steps]
+ steps_2 = steps_2[:min_steps]
+ assert steps_1 == steps_2
+
+ return steps_1
+
+
+# # stat tests
+# for k, v in compare.items():
+# dist_1_steps = per_seed_performances[k]
+# dist_2_steps = per_seed_performances[v]
+#
+# model_k = label_to_model[k]
+# model_v = label_to_model[v]
+# steps = get_compatible_steps(model_k, model_v, subsample_step)
+# steps = [s for s in steps if s > start_step]
+#
+# for step, dist_1, dist_2 in zip(steps, dist_1_steps, dist_2_steps):
+# assert len(dist_1) == n_seeds
+# assert len(dist_2) == n_seeds
+#
+# p = stats.ttest_ind(
+# dist_1,
+# dist_2,
+# equal_var=False
+# ).pvalue
+#
+# if np.isnan(p):
+# from IPython import embed; embed()
+#
+# if p < test_p:
+# plt.scatter(step, 0.8, color=label_color_dict[k], s=50, marker="x")
+#
+# print("{} (m:{}) <---> {} (m:{}) = p: {} result: {}".format(
+# k, np.mean(dist_1), v, np.mean(dist_2), p,
+# "Distributions different(p={})".format(test_p) if p < test_p else "Distributions same(p={})".format(test_p)
+# ))
+# print()
+#
+# f.savefig('graphics/test.png')
+# f.savefig('graphics/test.svg')
diff --git a/scripts/evaluate_old.py b/scripts/evaluate_old.py
new file mode 100644
index 0000000000000000000000000000000000000000..8ce0febe8791dbc3e4a21d934f259519dd60a3c0
--- /dev/null
+++ b/scripts/evaluate_old.py
@@ -0,0 +1,120 @@
+import argparse
+import time
+import torch
+from torch_ac.utils.penv import ParallelEnv
+
+import utils
+from models import ACModel, RandomTalkingMultiHeadedACModel
+
+
+# Parse arguments
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--env", required=True,
+ help="name of the environment (REQUIRED)")
+parser.add_argument("--model", required=True,
+ help="name of the trained model (REQUIRED)")
+parser.add_argument("--episodes", type=int, default=100,
+ help="number of episodes of evaluation (default: 100)")
+parser.add_argument("--seed", type=int, default=0,
+ help="random seed (default: 0)")
+parser.add_argument("--procs", type=int, default=16,
+ help="number of processes (default: 16)")
+parser.add_argument("--argmax", action="store_true", default=False,
+ help="action with highest probability is selected")
+parser.add_argument("--worst-episodes-to-show", type=int, default=10,
+ help="how many worst episodes to show")
+parser.add_argument("--memory", action="store_true", default=False,
+ help="add a LSTM to the model")
+parser.add_argument("--text", action="store_true", default=False,
+ help="add a GRU to the model")
+parser.add_argument("--dialogue", action="store_true", default=False,
+ help="add a GRU to the model")
+parser.add_argument("--multi-headed-agent", action="store_true", default=False,
+ help="add a talking head")
+args = parser.parse_args()
+
+# Set seed for all randomness sources
+
+utils.seed(args.seed)
+
+# Set device
+
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+print(f"Device: {device}\n")
+
+# Load environments
+
+envs = []
+for i in range(args.procs):
+ env = utils.make_env(args.env, args.seed + 10000 * i)
+ envs.append(env)
+env = ParallelEnv(envs)
+print("Environments loaded\n")
+
+# Load agent
+
+model_dir = utils.get_model_dir(args.model)
+agent = utils.Agent(env.observation_space, env.action_space, model_dir,
+ device=device, argmax=args.argmax, num_envs=args.procs,
+ use_memory=args.memory, use_text=args.text, use_dialogue=args.dialogue,
+ agent_class=RandomTalkingMultiHeadedACModel if args.multi_headed_agent else ACModel
+ )
+print("Agent loaded\n")
+
+# Initialize logs
+
+logs = {"num_frames_per_episode": [], "return_per_episode": []}
+
+# Run agent
+
+start_time = time.time()
+
+obss = env.reset()
+
+log_done_counter = 0
+log_episode_return = torch.zeros(args.procs, device=device)
+log_episode_num_frames = torch.zeros(args.procs, device=device)
+
+while log_done_counter < args.episodes:
+ actions = agent.get_actions(obss)
+ obss, rewards, dones, _ = env.step(actions)
+ agent.analyze_feedbacks(rewards, dones)
+
+ log_episode_return += torch.tensor(rewards, device=device, dtype=torch.float)
+ log_episode_num_frames += torch.ones(args.procs, device=device)
+
+ for i, done in enumerate(dones):
+ if done:
+ log_done_counter += 1
+ logs["return_per_episode"].append(log_episode_return[i].item())
+ logs["num_frames_per_episode"].append(log_episode_num_frames[i].item())
+
+ mask = 1 - torch.tensor(dones, device=device, dtype=torch.float)
+ log_episode_return *= mask
+ log_episode_num_frames *= mask
+
+end_time = time.time()
+
+# Print logs
+
+num_frames = sum(logs["num_frames_per_episode"])
+fps = num_frames/(end_time - start_time)
+duration = int(end_time - start_time)
+return_per_episode = utils.synthesize(logs["return_per_episode"])
+num_frames_per_episode = utils.synthesize(logs["num_frames_per_episode"])
+
+print("F {} | FPS {:.0f} | D {} | R:μσmM {:.2f} {:.2f} {:.2f} {:.2f} | F:μσmM {:.1f} {:.1f} {} {}"
+ .format(num_frames, fps, duration,
+ *return_per_episode.values(),
+ *num_frames_per_episode.values()))
+
+# Print worst episodes
+
+n = args.worst_episodes_to_show
+if n > 0:
+ print("\n{} worst episodes:".format(n))
+
+ indexes = sorted(range(len(logs["return_per_episode"])), key=lambda k: logs["return_per_episode"][k])
+ for i in indexes[:n]:
+ print("- episode {}: R={}, F={}".format(i, logs["return_per_episode"][i], logs["num_frames_per_episode"][i]))
diff --git a/scripts/generate_hp_tuning_script.py b/scripts/generate_hp_tuning_script.py
new file mode 100644
index 0000000000000000000000000000000000000000..062d7ad728e41003ead55a905427f986126bf84b
--- /dev/null
+++ b/scripts/generate_hp_tuning_script.py
@@ -0,0 +1,61 @@
+import sys
+import itertools
+
+if __name__ == '__main__':
+ '''
+ Generate scripts to perform grid search on agents's hyperparameters.
+
+ Defines the values to test for each hyperparameter.
+ '''
+
+ tuning_dict = {
+ # "PPO": {
+ # "frames-per-proc": [20, 40, 80],
+ # "lr": [1e-4, 1e-3],
+ # "entropy-coef": [0, 0.01, 0.05],
+ # "recurrence": [5, 10],
+ # "epochs": [4, 8, 12],
+ # "batch-size": [640, 1280, 2560],
+ # "env": ["MiniGrid-CoinThief-8x8-v0 --env_args few_actions True", "MiniGrid-TalkItOutPolite-8x8-v0"]
+ # },
+ "PPO-RND": {
+ # rnd and ride
+ "optim-eps": [1e-5, 1e-7],
+ "entropy-coef": [0.01, 0.0001, 0.0005],
+ "intrinsic-reward-learning-rate": [0.0001, 0.0004, 0.001],
+ # "intrinsic-reward-coef": [0.1],
+ # "intrinisc-reward-momentum": [0],
+ "intrinsic-reward-epsilon": [0.01, 0.001, 0.0001],
+ # "intrinsic-reward-alpha": [0.99],
+ "intrinsic-reward-max-grad-norm": [1000, 40, 20, 1],
+ # rnd
+ # "intrinsic-reward-loss-coef": [0.1],
+ "env": ["MiniGrid-CoinThief-8x8-v0 --env_args few_actions True", "MiniGrid-TalkItOutPolite-8x8-v0"]
+ }
+ }
+
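+ # itertools.product enumerates every combination of the values above, writing one
+ # launch line per combination (active PPO-RND grid: 2*3*3*3*4*2 = 432 lines).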
+ with open("hp_tuning_agent.txt", 'w') as f:
+ for agent in tuning_dict:
+ f.write('## {}\n'.format(agent))
+ current_agent_parameters = list(tuning_dict[agent].keys())
+ current_agent_hyperparams = tuning_dict[agent].values()
+ for point in itertools.product(*current_agent_hyperparams):
+ current_arguments = ''
+ for i in range(len(current_agent_parameters)):
+ current_arguments += ' --*' + current_agent_parameters[i]
+ current_arguments += ' ' + str(point[i]) if point[i] is not None else ''
+
+ if agent == "PPO-RND":
+ f.write(
+ '--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 '
+ '--save-interval 100 --log-interval 100 '
+ '--dialogue --multi-modal-babyai11-agent '
+ '--exploration-bonus --exploration-bonus-type rnd --clipped-rewards '
+ '--arch original_endpool_res {}\n'.format(current_arguments))
+ else:
+ f.write(
+ '--slurm_conf jz_short_2gpus_32g --nb_seeds 8 --model PPO_RND_tuning --algo ppo -cs --frames 10000000 '
+ '--save-interval 100 --log-interval 100 '
+ '--dialogue --multi-modal-babyai11-agent '
+ '--arch original_endpool_res {}\n'.format(current_arguments))
+
diff --git a/scripts/manual_control.py b/scripts/manual_control.py
new file mode 100755
index 0000000000000000000000000000000000000000..00b1715252b71c43a1888aee43964c9a1e535d2b
--- /dev/null
+++ b/scripts/manual_control.py
@@ -0,0 +1,450 @@
+#!/usr/bin/env python3
+
+import time
+import argparse
+import numpy as np
+import gym
+import gym_minigrid
+from gym_minigrid.wrappers import *
+from gym_minigrid.window import Window
+from utils import *
+from models import MultiModalBaby11ACModel
+from collections import Counter
+import torch_ac
+import json
+from termcolor import colored, COLORS
+
+from functools import partial
+from tkinter import *
+
+from torch.distributions import Categorical
+
+inter_acl = False
+draw_tree = True
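+# inter_acl: when True, a Tkinter pop-up lets you manually choose nodes of the environment's parameter tree (used as the curriculum).
+# draw_tree: when True, the sampled parameter tree is drawn and dumped to viz/SocialAIParam/ on reset and at episode end.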
+
+class InteractiveACL:
+
+ def choose(self, node):
+
+ def pop_up(options):
+ pop_data = {}
+
+ def setVar(value):
+ pop_data["var"] = value
+ root.destroy()
+
+ root = Tk()
+ root.title(node.label)
+ root.geometry('600x{}'.format(50*len(options)))
+
+ for i, o in enumerate(options):
+ fn = partial(setVar, value=i)
+ Button(root, text='{}'.format(o), command=fn).pack()
+
+ root.mainloop()
+
+ return pop_data["var"]
+
+ chosen_ind = pop_up([n.label for n in node.children])
+
+ ch = node.children[chosen_ind]
+
+ return ch
+
+
+if inter_acl:
+ interactive_acl = InteractiveACL()
+else:
+ interactive_acl = None
+
+
+def redraw(img):
+ if not args.agent_view:
+ img = env.render('human', tile_size=args.tile_size, mask_unobserved=args.mask_unobserved)
+
+ window.show_img(img)
+
+def reset():
+ # if args.seed != -1:
+ # env.seed(args.seed)
+
+ obs = env.reset()
+
+ if hasattr(env, 'mission'):
+ print('Mission: %s' % env.mission)
+ window.set_caption(env.mission)
+
+ redraw(obs)
+
+ if draw_tree:
+ # draw tree
+ params = env.current_env.parameters
+ env.parameter_tree.draw_tree(
+ filename="viz/SocialAIParam/parameters_{}_{}".format(params["Env_type"], hash(str(params))),
+ ignore_labels=["Num_of_colors"],
+ selected_parameters=params
+ )
+
+ with open('viz/SocialAIParam/parameters_{}_{}.json'.format(params["Env_type"], hash(str(params))), 'w') as fp:
+ json.dump(params, fp)
+
+
+tot_bonus = [0]
+
+prev = {
+ "prev_obs": None,
+ "prev_info": {},
+}
+shortened_obj_names = {
+ 'lockablebox' : 'loc_box',
+ 'applegenerator' : 'app_gen',
+ 'generatorplatform': 'gen_pl',
+ 'marbletee' : 'tee',
+ 'remotedoor' : 'rem_door',
+}
+
+IDX_TO_OBJECT = {v: shortened_obj_names.get(k, k) for k, v in OBJECT_TO_IDX.items()}
+# no duplicates
+assert len(IDX_TO_OBJECT) == len(OBJECT_TO_IDX)
+
+IDX_TO_COLOR = {v: k for k, v in COLOR_TO_IDX.items()}
+assert len(IDX_TO_COLOR) == len(COLOR_TO_IDX)
+
+
+def to_string(enc):
+ s = "{:<8} {} {} {} {} {:3} {:3} {}\t".format(
+ IDX_TO_OBJECT.get(enc[0], enc[0]), # obj
+ *enc[1:3], # x, y
+ IDX_TO_COLOR.get(enc[3], enc[3])[:1].upper(), # color
+ *enc[4:] #
+ )
+
+ if IDX_TO_OBJECT.get(enc[0], enc[0]) == "unseen":
+ pass
+ # s = colored(s, "on_grey")
+
+ elif IDX_TO_OBJECT.get(enc[0], enc[0]) != "empty":
+ col = IDX_TO_COLOR.get(enc[3], enc[3])
+ if col in COLORS:
+ s = colored(s, col)
+
+ return s
+
+
+def step(action):
+ if type(action) == np.ndarray:
+ obs, reward, done, info = env.step(action)
+ else:
+ action = [int(action), np.nan, np.nan]
+ obs, reward, done, info = env.step(action)
+
+ print('\nStep=%s' % (env.step_count))
+
+ # print("".join(info["descriptions"]))
+ print(obs['utterance_history'])
+ print("")
+ # print("Your possible actions are:")
+ # print("a) move forward")
+ # print("b) turn left")
+ # print("c) turn right")
+ # print("d) toggle")
+ # print("e) no_op")
+ # print("Your next action is: ")
+
+ if args.print_grid:
+ grid = obs['image'].transpose((1, 0, 2))
+ for row_i, row in enumerate(grid):
+
+ # if row_i == 0:
+ # for _ in row:
+ # print(to_string(["OBJECT", "X", "Y", "C", "-", "---", "---", "-"]), end="")
+ # # print("{:<8} {} {} {} {:2} {:2} {} {}\t".format("Object", "X", "Y", "C", "", "", "", ""), end="")
+ # print(end="\n")
+
+ for col_i, enc in enumerate(row):
+ print(str(enc), end=" | ")
+ # if row_i == len(grid) - 1 and col_i == len(row) // 2:
+ # # gent
+ # print(to_string(["^^^^^^", "^", "^", "^", "^", "^^^", "^^^", "^"]), end="")
+ # else:
+ # print(to_string(enc), end="")
+ print(end="\n")
+
+ if not args.agent_view:
+
+ nvec = algo.acmodel.model_raw_action_space.nvec
+
+ raw_action = (
+ 5 if np.isnan(action[0]) else 1, # speak switch
+ 0 if np.isnan(action[1]) else 1, # speak switch
+ 0 if np.isnan(action[1]) else action[1], # template
+ 0 if np.isnan(action[2]) else action[2], # word
+ )
+
+
+ dist = []
+ for a, n in zip(raw_action, nvec):
+ logits = torch.ones(n)[None, :]
+ logits[0][int(a)] *= 10
+
+ d = Categorical(logits=logits)
+ dist.append(d)
+ if args.calc_bonus:
+ bonus = algo.calculate_exploration_bonus(
+ obs=[obs],
+ embeddings=torch.zeros([1,128]),
+ done=[done],
+ prev_obs=[prev["prev_obs"]],
+ prev_info=[prev["prev_info"]],
+ agent_actions=torch.tensor([raw_action]),
+ dist=dist,
+ i_step=0,
+ )
+
+ else:
+ bonus = [0]
+
+ prev["prev_obs"] = obs
+ prev["prev_info"] = info
+
+ tot_bonus[0] = tot_bonus[0]+bonus[0]
+ print('expl_bonus_step=%.2f' % (bonus[0]))
+ print('tot_bonus=%.2f' % (tot_bonus[0]))
+
+ if done:
+ for v in algo.visitation_counter.values():
+ v[0] = Counter()
+
+        print('Full reward (undiminished)=%.2f' % (reward))
+
+ redraw(obs)
+
+ if done:
+ print('done!')
+ print('Reward=%.2f' % (reward))
+ print('Exploration_bonus=%.2f' % (tot_bonus[0]))
+ tot_bonus[0] = 0
+
+ if draw_tree:
+ # draw tree
+ params = env.current_env.parameters
+ env.parameter_tree.draw_tree(
+ filename="viz/SocialAIParam/parameters_{}_{}".format(params["Env_type"], hash(str(params))),
+ ignore_labels=[],
+ selected_parameters=params,
+ )
+
+ with open('viz/SocialAIParam/parameters_{}_{}.json'.format(params["Env_type"], hash(str(params))),
+ 'w') as fp:
+ json.dump(params, fp)
+
+ reset()
+
+def key_handler(event):
+
+ # if hasattr(event.canvas, "_event_loop") and event.canvas._event_loop.isRunning():
+ # return
+
+ print('pressed', event.key)
+
+ action_dict = {
+ "up": "a) move forward",
+ "left": "b) turn left",
+ "right": "c) turn right",
+ " ": "d) toggle",
+ "shift": "e) no_op",
+ }
+
+ if event.key in action_dict:
+ your_action = action_dict[event.key]
+ print("Your next action is: {}".format(your_action))
+
+ if event.key == 'escape':
+ window.close()
+ return
+
+ if event.key == 'r':
+ reset()
+ return
+
+ if event.key == 'tab':
+ step(np.array([np.nan, np.nan, np.nan]))
+ return
+
+ if event.key == 'shift':
+ step(np.array([np.nan, np.nan, np.nan]))
+ return
+
+ if event.key == 'left':
+ step(env.actions.left)
+ return
+ if event.key == 'right':
+ step(env.actions.right)
+ return
+ if event.key == 'up':
+ step(env.actions.forward)
+ return
+ if event.key == 't':
+ step(env.actions.speak)
+ return
+
+ if event.key == '1':
+ step(np.array([np.nan, 0, 0]))
+ return
+ if event.key == '2':
+ step(np.array([np.nan, 0, 1]))
+ return
+ if event.key == '3':
+ step(np.array([np.nan, 1, 0]))
+ return
+ if event.key == '4':
+ step(np.array([np.nan, 1, 1]))
+ return
+ if event.key == '5':
+ step(np.array([np.nan, 2, 2]))
+ return
+ if event.key == '6':
+ step(np.array([np.nan, 1, 2]))
+ return
+ if event.key == '7':
+ step(np.array([np.nan, 2, 1]))
+ return
+ if event.key == '8':
+ step(np.array([np.nan, 1, 3]))
+ return
+ if event.key == 'p':
+ step(np.array([np.nan, 3, 3]))
+ return
+
+ # Spacebar
+ if event.key == ' ':
+ step(env.actions.toggle)
+ return
+ if event.key == '9':
+ step(env.actions.pickup)
+ return
+ if event.key == '0':
+ step(env.actions.drop)
+ return
+
+ if event.key == 'enter':
+ step(env.actions.done)
+ return
+
+parser = argparse.ArgumentParser()
+parser.add_argument(
+ "--env",
+ help="gym environment to load",
+ default='SocialAI-ELangColorBoxesTestInformationSeekingParamEnv-v1',
+)
+parser.add_argument(
+ "--seed",
+ type=int,
+ help="random seed to generate the environment with",
+ default=-1
+)
+parser.add_argument(
+ "--tile_size",
+ type=int,
+ help="size at which to render tiles",
+ default=32
+)
+parser.add_argument(
+ '--agent_view',
+ default=False,
+    help="draw what the agent sees (partially observable view)",
+ action='store_true'
+)
+parser.add_argument(
+ '--print_grid',
+ default=False,
+ help="print the grid with symbols",
+ action='store_true'
+)
+parser.add_argument(
+ '--calc-bonus',
+ default=False,
+    help="calculate the exploration bonus",
+ action='store_true'
+)
+parser.add_argument(
+ '--mask-unobserved',
+ default=False,
+ help="mask cells that are not observed by the agent",
+ action='store_true'
+)
+
+# Put all env related arguments after --env-args, e.g. --env-args nb_foo 1 is_bar True
+parser.add_argument("--env-args", nargs='*', default=None)
+
+parser.add_argument("--exploration-bonus", action="store_true", default=False,
+ help="Use a count based exploration bonus")
+parser.add_argument("--exploration-bonus-type", nargs="+", default=["lang"],
+ help="modality on which to use the bonus (lang/grid/cell)")
+parser.add_argument("--exploration-bonus-params", nargs="+", type=float, default=(30., 50.), # lang
+ help="parameters for a count based exploration bonus (C, M)")
+# parser.add_argument("--exploration-bonus-params", nargs="+", type=float, default=(3, 50.), # cell
+# help="parameters for a count based exploration bonus (C, M)")
+# parser.add_argument("--exploration-bonus-params", nargs="+", type=float, default=(1.5, 50.), # grid
+# help="parameters for a count based exploration bonus (C, M)")
+parser.add_argument("--exploration-bonus-tanh", nargs="+", type=float, default=None,
+ help="tanh expl bonus scale, None means no tanh")
+parser.add_argument("--intrinsic-reward-coef", type=float, default=0.1,
+                    help="intrinsic reward coefficient")
+
+args = parser.parse_args()
+
+if interactive_acl:
+ env = gym.make(args.env, curriculum=interactive_acl, **env_args_str_to_dict(args.env_args))
+else:
+ env = gym.make(args.env, **env_args_str_to_dict(args.env_args))
+
+if draw_tree:
+ # draw tree
+ env.parameter_tree.draw_tree(
+ filename="viz/SocialAIParam/{}_raw_tree".format(args.env),
+ ignore_labels=["Num_of_colors"],
+ )
+
+
+# if hasattr(env, "draw_tree"):
+# env.draw_tree(ignore_labels=["Num_of_colors"])
+
+# if hasattr(env, "print_tree"):
+# env.print_tree()
+
+if args.seed >= 0:
+ env.seed(args.seed)
+
+# dummy algo instance, used only to enable exploration bonus calculation
+algo = torch_ac.PPOAlgo(
+ envs=[env],
+ acmodel=MultiModalBaby11ACModel(
+ obs_space=utils.get_obss_preprocessor(
+ obs_space=env.observation_space,
+ text=False,
+ dialogue_current=False,
+ dialogue_history=True,
+ )[0],
+ action_space=env.action_space,
+ ),
+ exploration_bonus=True,
+ exploration_bonus_tanh=args.exploration_bonus_tanh,
+ exploration_bonus_type=args.exploration_bonus_type,
+ exploration_bonus_params=args.exploration_bonus_params,
+ expert_exploration_bonus=False,
+ episodic_exploration_bonus=True,
+ intrinsic_reward_coef=args.intrinsic_reward_coef,
+ num_frames_per_proc=40,
+)
+
+# if args.agent_view:
+# env = RGBImgPartialObsWrapper(env)
+# env = ImgObsWrapper(env)
+
+window = Window('gym_minigrid - ' + args.env, figsize=(4, 4))
+window.reg_key_handler(key_handler)
+env.window = window
+
+# Blocking event loop
+window.show(block=True)
diff --git a/scripts/tensorboard_aggregator.py b/scripts/tensorboard_aggregator.py
new file mode 100644
index 0000000000000000000000000000000000000000..d4002c231bce5dfc1a61108241d64053f687e220
--- /dev/null
+++ b/scripts/tensorboard_aggregator.py
@@ -0,0 +1,141 @@
+import os
+import sys
+import argparse
+import shutil
+from collections import defaultdict
+
+import numpy as np
+import tensorflow as tf
+from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
+
+
+def tabulate_events(exp_path):
+
+ seeds = [s for s in os.listdir(exp_path) if "combined" not in s]
+ summary_iterators = [EventAccumulator(os.path.join(exp_path, dname)).Reload() for dname in seeds]
+
+ tags = summary_iterators[0].Tags()['scalars']
+ for it in summary_iterators:
+ assert it.Tags()['scalars'] == tags
+
+ out = defaultdict(list)
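+    # For each tag, collect the per-step values across all seeds; the assert below checks that the seeds logged at the same steps.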
+ for tag in tags:
+ for events in zip(*[acc.Scalars(tag) for acc in summary_iterators]):
+ assert len(set(e.step for e in events)) == 1
+
+ out[tag].append([e.value for e in events])
+
+ return out
+
+
+def create_histogram_summary(tag, values, bins=1000):
+ # Convert to a numpy array
+ values = np.array(values)
+
+ # Create histogram using numpy
+ counts, bin_edges = np.histogram(values, bins=bins)
+
+ # Fill fields of histogram proto
+ hist = tf.HistogramProto()
+ hist.min = float(np.min(values))
+ hist.max = float(np.max(values))
+ hist.num = int(np.prod(values.shape))
+ hist.sum = float(np.sum(values))
+ hist.sum_squares = float(np.sum(values**2))
+
+ # Requires equal number as bins, where the first goes from -DBL_MAX to bin_edges[1]
+ # See https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/summary.proto#L30
+ # Thus, we drop the start of the first bin
+ bin_edges = bin_edges[1:]
+
+ # Add bin edges and counts
+ for edge in bin_edges:
+ hist.bucket_limit.append(edge)
+ for c in counts:
+ hist.bucket.append(c)
+
+ # Create and write Summary
+ return tf.Summary.Value(tag=tag, histo=hist)
+
+
+def create_parsed_histogram_summary(tag, values, bins=1000):
+    # Values are expected to already be array-like; no conversion is done here.
+
+ # Create histogram using numpy
+ counts, bin_edges = np.histogram(values, bins=bins)
+
+ # Fill fields of histogram proto
+ hist = tf.HistogramProto()
+ hist.min = float(np.min(values))
+ hist.max = float(np.max(values))
+ hist.num = int(np.prod(values.shape))
+ hist.sum = float(np.sum(values))
+ hist.sum_squares = float(np.sum(values**2))
+
+ # Requires equal number as bins, where the first goes from -DBL_MAX to bin_edges[1]
+ # See https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/summary.proto#L30
+ # Thus, we drop the start of the first bin
+ bin_edges = bin_edges[1:]
+
+ # Add bin edges and counts
+ for edge in bin_edges:
+ hist.bucket_limit.append(edge)
+ for c in counts:
+ hist.bucket.append(c)
+
+ # Create and write Summary
+ return tf.Summary.Value(tag=tag, histo=hist)
+
+
+def write_combined_events(exp_path, d_combined, dname='combined', mean_var_tags=()):
+
+ fpath = os.path.join(exp_path, dname)
+ if os.path.isdir(fpath):
+ shutil.rmtree(fpath)
+ assert not os.path.isdir(fpath)
+
+ writer = tf.summary.FileWriter(fpath)
+
+
+ tags, values = zip(*d_combined.items())
+
+ cap = min([len(v) for v in values])
+ values = [v[:cap] for v in values]
+
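+    # Mean and variance across seeds (last axis) for every tag at every logging step; the averaged "frames" tag provides the global step.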
+ timestep_mean = np.array(values).mean(axis=-1)
+ timestep_var = np.array(values).var(axis=-1)
+ timesteps = timestep_mean[tags.index("frames")]
+
+ for tag, means, vars in zip(tags, timestep_mean, timestep_var):
+ for i, mean, var in zip(timesteps, means, vars):
+ summary = tf.Summary(value=[tf.Summary.Value(tag=tag, simple_value=mean)])
+ writer.add_summary(summary, global_step=i)
+ writer.flush()
+
+ if tag in mean_var_tags:
+ values = np.array([mean - var, mean, mean + var])
+
+ summary = tf.Summary(value=[
+ create_histogram_summary(tag=tag+"_var", values=values)
+ ])
+ writer.add_summary(summary, global_step=i)
+ writer.flush()
+
+
+if __name__ == "__main__":
+ if len(sys.argv) > 1:
+ dpath = sys.argv[1]
+ else:
+ raise ValueError("Specify dir")
+
+ parser = argparse.ArgumentParser()
+
+ parser.add_argument('--experiments', nargs='+', help='experiment directories to aggregate', required=True)
+
+ parser.add_argument('--mean-var-tags', nargs='+', help='tags to create mean-var histograms from', required=False, default=["return_mean"])
+
+ args = parser.parse_args()
+
+ for exp_path in args.experiments:
+ d = tabulate_events(exp_path)
+ write_combined_events(exp_path, d, mean_var_tags=args.mean_var_tags)
\ No newline at end of file
diff --git a/scripts/train.py b/scripts/train.py
new file mode 100644
index 0000000000000000000000000000000000000000..b9de0484a673ab0ffbdcbe64b7a10d85a340202d
--- /dev/null
+++ b/scripts/train.py
@@ -0,0 +1,956 @@
+import argparse
+import random
+import warnings
+import numpy as np
+import time
+import datetime
+import torch
+
+import gym_minigrid.social_ai_envs
+import torch_ac
+import sys
+import json
+import utils
+from pathlib import Path
+from distutils.dir_util import copy_tree
+from utils.env import env_args_str_to_dict
+from models import *
+
+
+# Parse arguments
+
+parser = argparse.ArgumentParser()
+
+## General parameters
+parser.add_argument("--algo", required=True,
+ help="algorithm to use: ppo (REQUIRED)")
+parser.add_argument("--env", required=True,
+ help="name of the environment to train on (REQUIRED)")
+parser.add_argument("--model", default=None,
+ help="name of the model (default: {ENV}_{ALGO}_{TIME})")
+parser.add_argument("--seed", type=int, default=1,
+ help="random seed (default: 1)")
+parser.add_argument("--log-interval", type=int, default=10,
+ help="number of updates between two logs (default: 10)")
+parser.add_argument("--save-interval", type=int, default=10,
+ help="number of updates between two saves (default: 10, 0 means no saving)")
+parser.add_argument("--procs", type=int, default=16,
+ help="number of processes (default: 16)")
+parser.add_argument("--frames", type=int, default=10**7,
+ help="number of frames of training (default: 1e7)")
+parser.add_argument("--continue-train", default=None,
+ help="path to the model to finetune", type=str)
+parser.add_argument("--finetune-train", default=None,
+ help="path to the model to finetune", type=str)
+parser.add_argument("--compact-save", "-cs", action="store_true", default=False,
+ help="Keep only last model save")
+parser.add_argument("--lr-schedule-end-frames", type=int, default=0,
+                    help="Learning rate is decreased linearly from --lr to --lr-end over --lr-schedule-end-frames frames (default: 0 - no schedule)")
+parser.add_argument("--lr-end", type=float, default=0,
+ help="the final lr that will be reached at 'lr-schedule-end-frames' (default = 0)")
+
+## Periodic test parameters
+parser.add_argument("--test-set-name", required=False,
+                    help="name of the test set to evaluate on periodically", default="SocialAITestSet")
+# parser.add_argument("--test-env", required=False,
+# help="name of the environment to test on, default use the train env")
+# parser.add_argument("--no-test", "-nt", action="store_true", default=False,
+# help="don't perform periodic testing")
+parser.add_argument("--test-seed", type=int, default=0,
+ help="random seed (default: 0)")
+parser.add_argument("--test-episodes", type=int, default=50,
+ help="number of episodes to test")
+parser.add_argument("--test-interval", type=int, default=-1,
+ help="number of updates between two tests (default: -1, no testing)")
+parser.add_argument("--test-env-args", nargs='*', default="like_train_no_acl")
+
+## Parameters for main algorithm
+parser.add_argument("--acl", action="store_true", default=False,
+                    help="use automatic curriculum learning (ACL)")
+parser.add_argument("--acl-type", type=str, default=None,
+ help="acl type")
+parser.add_argument("--acl-thresholds", nargs="+", type=float, default=(0.75, 0.75),
+ help="per phase thresholds for expert CL")
+parser.add_argument("--acl-minimum-episodes", type=int, default=1000,
+                    help="Never go to the second phase before this many episodes.")
+parser.add_argument("--acl-average-interval", type=int, default=500,
+                    help="Average the performance estimate over this many most recent episodes")
+parser.add_argument("--epochs", type=int, default=4,
+ help="number of epochs for PPO (default: 4)")
+parser.add_argument("--exploration-bonus", action="store_true", default=False,
+ help="Use a count based exploration bonus")
+parser.add_argument("--exploration-bonus-type", nargs="+", default=["lang"],
+ help="modality on which to use the bonus (lang/grid)")
+parser.add_argument("--exploration-bonus-params", nargs="+", type=float, default=(30., 50.),
+ help="parameters for a count based exploration bonus (C, M)")
+parser.add_argument("--exploration-bonus-tanh", nargs="+", type=float, default=None,
+ help="tanh expl bonus scale, None means no tanh")
+parser.add_argument("--expert-exploration-bonus", action="store_true", default=False,
+ help="Use an expert exploration bonus")
+parser.add_argument("--episodic-exploration-bonus", action="store_true", default=False,
+ help="Use the exploration bonus in a episodic setting")
+parser.add_argument("--batch-size", type=int, default=256,
+ help="batch size for PPO (default: 256)")
+parser.add_argument("--frames-per-proc", type=int, default=None,
+ help="number of frames per process before update (default: 5 for A2C and 128 for PPO)")
+parser.add_argument("--discount", type=float, default=0.99,
+ help="discount factor (default: 0.99)")
+parser.add_argument("--lr", type=float, default=0.001,
+ help="learning rate (default: 0.001)")
+parser.add_argument("--gae-lambda", type=float, default=0.99,
+ help="lambda coefficient in GAE formula (default: 0.99, 1 means no gae)")
+parser.add_argument("--entropy-coef", type=float, default=0.01,
+ help="entropy term coefficient (default: 0.01)")
+parser.add_argument("--value-loss-coef", type=float, default=0.5,
+ help="value loss term coefficient (default: 0.5)")
+parser.add_argument("--max-grad-norm", type=float, default=0.5,
+ help="maximum norm of gradient (default: 0.5)")
+parser.add_argument("--optim-eps", type=float, default=1e-8,
+ help="Adam and RMSprop optimizer epsilon (default: 1e-8)")
+parser.add_argument("--optim-alpha", type=float, default=0.99,
+ help="RMSprop optimizer alpha (default: 0.99)")
+parser.add_argument("--clip-eps", type=float, default=0.2,
+ help="clipping epsilon for PPO (default: 0.2)")
+parser.add_argument("--recurrence", type=int, default=1,
+                    help="number of timesteps the gradient is backpropagated over (default: 1). If > 1, an LSTM is added to the model to provide memory.")
+parser.add_argument("--text", action="store_true", default=False,
+ help="add a GRU to the model to handle text input")
+parser.add_argument("--dialogue", action="store_true", default=False,
+ help="add a GRU to the model to handle the history of dialogue input")
+parser.add_argument("--current-dialogue-only", action="store_true", default=False,
+ help="add a GRU to the model to handle only the current dialogue input")
+parser.add_argument("--multi-headed-agent", action="store_true", default=False,
+ help="add a talking head")
+parser.add_argument("--babyai11_agent", action="store_true", default=False,
+ help="use the babyAI 1.1 agent architecture")
+parser.add_argument("--multi-headed-babyai11-agent", action="store_true", default=False,
+ help="use the multi headed babyAI 1.1 agent architecture")
+parser.add_argument("--custom-ppo", action="store_true", default=False,
+ help="use BabyAI original PPO hyperparameters")
+parser.add_argument("--custom-ppo-2", action="store_true", default=False,
+ help="use BabyAI original PPO hyperparameters but with smaller memory")
+parser.add_argument("--custom-ppo-3", action="store_true", default=False,
+ help="use BabyAI original PPO hyperparameters but with no memory")
+parser.add_argument("--custom-ppo-rnd", action="store_true", default=False,
+ help="rnd reconstruct")
+parser.add_argument("--custom-ppo-rnd-reference", action="store_true", default=False,
+                    help="rnd reference")
+parser.add_argument("--custom-ppo-ride", action="store_true", default=False,
+                    help="ride reconstruct")
+parser.add_argument("--custom-ppo-ride-reference", action="store_true", default=False,
+                    help="ride reference")
+parser.add_argument("--ppo-hp-tuning", action="store_true", default=False,
+ help="use PPO hyperparameters selected from our HP tuning")
+parser.add_argument("--multi-modal-babyai11-agent", action="store_true", default=False,
+ help="use the multi headed babyAI 1.1 agent architecture")
+
+# ride ref
+parser.add_argument("--ride-ref-agent", action="store_true", default=False,
+ help="Model from the ride paper")
+parser.add_argument("--ride-ref-preprocessor", action="store_true", default=False,
+ help="use ride reference preprocessor (3D images)")
+
+parser.add_argument("--bAI-lang-model", help="lang model type for babyAI models", default="gru")
+parser.add_argument("--memory-dim", type=int, help="memory dim (128 is small, 2048 is big)", default=128)
+parser.add_argument("--clipped-rewards", action="store_true", default=False,
+                    help="clip the rewards")
+parser.add_argument("--intrinsic-reward-epochs", type=int, default=0,
+ help="")
+parser.add_argument("--balance-moa-training", action="store_true", default=False,
+ help="balance moa training to handle class imbalance.")
+parser.add_argument("--moa-memory-dim", type=int, help="memory dim (default=128)", default=128)
+
+# rnd + ride
+parser.add_argument("--intrinsic-reward-coef", type=float, default=0.1,
+ help="")
+parser.add_argument("--intrinsic-reward-learning-rate", type=float, default=0.0001,
+ help="")
+parser.add_argument("--intrinsic-reward-momentum", type=float, default=0,
+ help="")
+parser.add_argument("--intrinsic-reward-epsilon", type=float, default=0.01,
+ help="")
+parser.add_argument("--intrinsic-reward-alpha", type=float, default=0.99,
+ help="")
+parser.add_argument("--intrinsic-reward-max-grad-norm", type=float, default=40,
+ help="")
+# rnd + soc_inf
+parser.add_argument("--intrinsic-reward-loss-coef", type=float, default=0.1,
+ help="")
+# ride
+parser.add_argument("--intrinsic-reward-forward-loss-coef", type=float, default=10,
+ help="")
+parser.add_argument("--intrinsic-reward-inverse-loss-coef", type=float, default=0.1,
+ help="")
+
+parser.add_argument("--reset-rnd-ride-at-phase", action="store_true", default=False,
+ help="expert knowledge resets rnd ride at acl phase change")
+
+# babyAI1.1 related
+parser.add_argument("--arch", default="original_endpool_res",
+ help="image embedding architecture")
+parser.add_argument("--num-films", type=int, default=2,
+ help="")
+
+# Put all env related arguments after --env-args, e.g. --env-args nb_foo 1 is_bar True
+parser.add_argument("--env-args", nargs='*', default=None)
+
+args = parser.parse_args()
+
+if args.compact_save:
+    print("--compact-save is deprecated and has no effect.")
+
+if args.save_interval != args.log_interval:
+ print(f"save_interval ({args.save_interval}) and log_interval ({args.log_interval}) are not the same. This is not ideal for train continuation.")
+
+if args.seed == -1:
+ args.seed = np.random.randint(424242)
+
+if args.custom_ppo:
+ print("babyAI's ppo config")
+
+ assert not args.custom_ppo_2
+ assert not args.custom_ppo_3
+ args.frames_per_proc = 40
+ args.lr = 1e-4
+ args.gae_lambda = 0.99
+ args.recurrence = 20
+ args.optim_eps = 1e-05
+ args.clip_eps = 0.2
+ args.batch_size = 1280
+
+elif args.custom_ppo_2:
+ print("babyAI's ppo config with smaller memory")
+
+ assert not args.custom_ppo
+ assert not args.custom_ppo_3
+ args.frames_per_proc = 40
+ args.lr = 1e-4
+ args.gae_lambda = 0.99
+ args.recurrence = 10
+ args.optim_eps = 1e-05
+ args.clip_eps = 0.2
+ args.batch_size = 1280
+
+elif args.custom_ppo_3:
+ print("babyAI's ppo config with no memory")
+
+ assert not args.custom_ppo
+ assert not args.custom_ppo_2
+ args.frames_per_proc = 40
+ args.lr = 1e-4
+ args.gae_lambda = 0.99
+ args.recurrence = 1
+ args.optim_eps = 1e-05
+ args.clip_eps = 0.2
+ args.batch_size = 1280
+
+elif args.custom_ppo_rnd:
+ print("RND reconstruct")
+
+ assert not args.custom_ppo
+ assert not args.custom_ppo_2
+ assert not args.custom_ppo_3
+ args.frames_per_proc = 40
+ args.lr = 1e-4
+ args.recurrence = 1
+ # args.recurrence = 5 # use 5 for SocialAI envs
+ args.batch_size = 640
+ args.epochs = 4
+
+ # args.optim_eps = 1e-05
+ # args.entropy_coef = 0.0001
+ args.clipped_rewards = True
+
+elif args.custom_ppo_ride:
+ print("RIDE reconstruct")
+
+ assert not args.custom_ppo
+ assert not args.custom_ppo_2
+ assert not args.custom_ppo_3
+ assert not args.custom_ppo_rnd
+
+ args.frames_per_proc = 40
+ args.lr = 1e-4
+ args.recurrence = 1
+ # args.recurrence = 5 # use 5 for SocialAI envs
+ args.batch_size = 640
+ args.epochs = 4
+
+ # args.optim_eps = 1e-05
+ # args.entropy_coef = 0.0005
+ args.clipped_rewards = True
+
+elif args.custom_ppo_rnd_reference:
+    print("RND reference")
+
+ assert not args.custom_ppo
+ assert not args.custom_ppo_2
+ assert not args.custom_ppo_3
+
+ args.frames_per_proc = 128 # 128 for PPO
+ args.lr = 1e-4
+ args.recurrence = 64
+
+ args.gae_lambda = 0.99
+ args.batch_size = 1280
+ args.epochs = 4
+
+ args.optim_eps = 1e-05
+ args.clip_eps = 0.2
+ args.entropy_coef = 0.0001
+ args.clipped_rewards = True
+
+
+elif args.custom_ppo_ride_reference:
+ print("RIDE reference")
+
+ assert not args.custom_ppo
+ assert not args.custom_ppo_2
+ assert not args.custom_ppo_3
+ assert not args.custom_ppo_rnd
+
+ args.frames_per_proc = 128 # 128 for PPO
+ args.lr = 1e-4
+ args.recurrence = 64
+
+ args.gae_lambda = 0.99
+ args.batch_size = 1280
+ args.epochs = 4
+
+ args.optim_eps = 1e-05
+ args.clip_eps = 0.2
+ args.entropy_coef = 0.0005
+ args.clipped_rewards = True
+
+elif args.ppo_hp_tuning:
+
+ args.frames_per_proc = 40
+ args.lr = 1e-4
+ args.recurrence = 5
+ args.batch_size = 640
+ args.epochs = 4
+
+if args.env not in [
+ "MiniGrid-KeyCorridorS3R3-v0",
+ "MiniGrid-MultiRoom-N2-S4-v0",
+ "MiniGrid-MultiRoom-N4-S5-v0",
+ "MiniGrid-MultiRoom-N7-S4-v0",
+ "MiniGrid-MultiRoomNoisyTV-N7-S4-v0"
+]:
+ if args.recurrence <= 1:
+ print("You are using recurrence {} with {} env. This is probably unintentional.".format(args.recurrence, args.env))
+ # warnings.warn("You are using recurrence {} with {} env. This is probably unintentional.".format(args.recurrence, args.env))
+
+
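+# An LSTM memory is used whenever gradients are backpropagated over more than one timestep (see --recurrence).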
+args.mem = args.recurrence > 1
+
+# Set run dir
+date = datetime.datetime.now().strftime("%y-%m-%d-%H-%M-%S")
+default_model_name = f"{args.env}_{args.algo}_seed{args.seed}_{date}"
+
+model_name = args.model or default_model_name
+model_dir = utils.get_model_dir(model_name)
+
+if Path(model_dir).exists() and args.continue_train is None:
+ raise ValueError(f"Dir {model_dir} already exists and continue train is None.")
+
+# Load loggers and Tensorboard writer
+txt_logger = utils.get_txt_logger(model_dir)
+csv_file, csv_logger = utils.get_csv_logger(model_dir)
+
+
+# Log command and all script arguments
+txt_logger.info("{}\n".format(" ".join(sys.argv)))
+txt_logger.info("{}\n".format(args))
+
+# Set seed for all randomness sources
+utils.seed(args.seed)
+
+# Set device
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+txt_logger.info(f"Device: {device}\n")
+
+# Create env_args dict
+env_args = env_args_str_to_dict(args.env_args)
+
+if args.acl:
+ # expert_acl = "three_stage_expert"
+ expert_acl = args.acl_type
+ print(f"Using curriculum: {expert_acl}.")
+else:
+ expert_acl = None
+
+env_args_no_acl = env_args.copy()
+env_args["curriculum"] = expert_acl
+env_args["expert_curriculum_thresholds"] = args.acl_thresholds
+env_args["expert_curriculum_average_interval"] = args.acl_average_interval
+env_args["expert_curriculum_minimum_episodes"] = args.acl_minimum_episodes
+env_args["egocentric_observation"] = True
+
+# test env args
+if not args.test_env_args:
+ test_env_args = {}
+elif args.test_env_args == "like_train_no_acl":
+ test_env_args = env_args_no_acl
+elif args.test_env_args == "like_train":
+ test_env_args = env_args
+else:
+ test_env_args = env_args_str_to_dict(args.test_env_args)
+
+
+if "SocialAI-" not in args.env:
+ env_args = {}
+ test_env_args = {}
+
+print("train_env_args:", env_args)
+print("test_env_args:", test_env_args)
+
+# Load train environments
+
+envs = []
+for i in range(args.procs):
+ envs.append(utils.make_env(args.env, args.seed + 10000 * i, env_args=env_args))
+
+txt_logger.info("Environments loaded\n")
+
+if args.continue_train and args.finetune_train:
+ raise ValueError(f"Continue path ({args.continue_train}) and finetune path ({args.finetune_train}) can't both be set.")
+
+# Load training status
+if args.continue_train:
+ if args.continue_train == "auto":
+ status_continue_path = Path(model_dir)
+ args.continue_train = status_continue_path # just in case
+ else:
+ status_continue_path = Path(args.continue_train)
+
+ if status_continue_path.is_dir():
+ # if dir, assume experiment dir so append the seed
+ # status_continue_path = Path(status_continue_path) / str(args.seed)
+ status_continue_path = utils.get_status_path(status_continue_path)
+
+ else:
+ if not status_continue_path.is_file():
+ raise ValueError(f"{status_continue_path} is not a file")
+
+ if "status" not in status_continue_path.name:
+            raise UserWarning(f"{status_continue_path} does not contain 'status'; is this the correct file?")
+
+ status = utils.load_status(status_continue_path)
+
+ txt_logger.info("Training status loaded\n")
+ txt_logger.info(f"{model_name} continued from {status_continue_path}")
+
+ # copy everything from model_dir to backup_dir
+ assert Path(status_continue_path).is_file()
+
+elif args.finetune_train:
+
+ status_finetune_path = Path(args.finetune_train)
+
+ if status_finetune_path.is_dir():
+ # if dir, assume experiment dir so append the seed
+ status_finetune_seed_path = Path(status_finetune_path) / str(args.seed)
+ if status_finetune_seed_path.exists():
+ # if a seed folder exists assume that you use that one
+ status_finetune_path = utils.get_status_path(status_finetune_seed_path)
+
+ else:
+ # if not assume that no seed folder exists
+ status_finetune_path = utils.get_status_path(status_finetune_path)
+
+ else:
+ if not status_finetune_path.is_file():
+            raise ValueError(f"{status_finetune_path} is neither a directory nor a file")
+
+ if "status" not in status_finetune_path.name:
+            raise UserWarning(f"{status_finetune_path} does not contain 'status'; is this the correct file?")
+
+ status = utils.load_status(status_finetune_path)
+
+ txt_logger.info("Training status loaded\n")
+ txt_logger.info(f"{model_name} finetuning from {status_finetune_path}")
+
+ # copy everything from model_dir to backup_dir
+ assert Path(status_finetune_path).is_file()
+
+ # reset parameters for finetuning
+ status["num_frames"] = 0
+ status["update"] = 0
+ del status["optimizer_state"]
+ del status["lr_scheduler_state"]
+ del status["env_args"]
+
+else:
+ status = {"num_frames": 0, "update": 0}
+
+# Parameter sanity checks
+if args.dialogue and args.current_dialogue_only:
+ raise ValueError("Either use dialogue or current-dialogue-only")
+
+if not args.dialogue and not args.current_dialogue_only:
+ warnings.warn("Not using dialogue")
+
+if args.text:
+ raise ValueError("Text should not be used. Use dialogue instead.")
+
+
+# Load observations preprocessor
+obs_space, preprocess_obss = utils.get_obss_preprocessor(
+ obs_space=envs[0].observation_space,
+ text=args.text,
+ dialogue_current=args.current_dialogue_only,
+ dialogue_history=args.dialogue,
+ custom_image_preprocessor=utils.ride_ref_image_preprocessor if args.ride_ref_preprocessor else None,
+ custom_image_space_preprocessor=utils.ride_ref_image_space_preprocessor if args.ride_ref_preprocessor else None,
+)
+
+if args.continue_train is not None or args.finetune_train is not None:
+ assert "vocab" in status
+ preprocess_obss.vocab.load_vocab(status["vocab"])
+ txt_logger.info("Observations preprocessor loaded")
+
+if args.exploration_bonus:
+ if args.expert_exploration_bonus:
+ warnings.warn("You are using expert exploration bonus.")
+
+# Load model
+assert sum(map(int, [
+ args.multi_modal_babyai11_agent,
+ args.multi_headed_babyai11_agent,
+ args.babyai11_agent,
+ args.multi_headed_agent,
+])) <= 1
+
+if args.multi_modal_babyai11_agent:
+ acmodel = MultiModalBaby11ACModel(
+ obs_space=obs_space,
+ action_space=envs[0].action_space,
+ arch=args.arch,
+ use_text=args.text,
+ use_dialogue=args.dialogue,
+ use_current_dialogue_only=args.current_dialogue_only,
+ use_memory=args.mem,
+ lang_model=args.bAI_lang_model,
+ memory_dim=args.memory_dim,
+ num_films=args.num_films
+ )
+elif args.ride_ref_agent:
+ assert args.mem
+ assert not args.text
+ assert not args.dialogue
+
+ acmodel = RefACModel(
+ obs_space=obs_space,
+ action_space=envs[0].action_space,
+ use_memory=args.mem,
+ use_text=args.text,
+ use_dialogue=args.dialogue,
+ input_size=obs_space['image'][-1],
+ )
+ if args.current_dialogue_only: raise NotImplementedError("current dialogue only")
+
+else:
+ acmodel = ACModel(
+ obs_space=obs_space,
+ action_space=envs[0].action_space,
+ use_memory=args.mem,
+ use_text=args.text,
+ use_dialogue=args.dialogue,
+ input_size=obs_space['image'][-1],
+ )
+ if args.current_dialogue_only: raise NotImplementedError("current dialogue only")
+
+# if args.continue_train is not None:
+# assert "model_state" in status
+# acmodel.load_state_dict(status["model_state"])
+
+acmodel.to(device)
+txt_logger.info("Model loaded\n")
+txt_logger.info("{}\n".format(acmodel))
+
+# Load algo
+assert args.algo == "ppo"
+algo = torch_ac.PPOAlgo(
+ envs=envs,
+ acmodel=acmodel,
+ device=device,
+ num_frames_per_proc=args.frames_per_proc,
+ discount=args.discount,
+ lr=args.lr,
+ gae_lambda=args.gae_lambda,
+ entropy_coef=args.entropy_coef,
+ value_loss_coef=args.value_loss_coef,
+ max_grad_norm=args.max_grad_norm,
+ recurrence=args.recurrence,
+ adam_eps=args.optim_eps,
+ clip_eps=args.clip_eps,
+ epochs=args.epochs,
+ batch_size=args.batch_size,
+ preprocess_obss=preprocess_obss,
+ exploration_bonus=args.exploration_bonus,
+ exploration_bonus_tanh=args.exploration_bonus_tanh,
+ exploration_bonus_type=args.exploration_bonus_type,
+ exploration_bonus_params=args.exploration_bonus_params,
+ expert_exploration_bonus=args.expert_exploration_bonus,
+ episodic_exploration_bonus=args.episodic_exploration_bonus,
+ clipped_rewards=args.clipped_rewards,
+ # for rnd, ride, and social influence
+ intrinsic_reward_coef=args.intrinsic_reward_coef,
+ # for rnd and ride
+ intrinsic_reward_epochs=args.intrinsic_reward_epochs,
+ intrinsic_reward_learning_rate=args.intrinsic_reward_learning_rate,
+ intrinsic_reward_momentum=args.intrinsic_reward_momentum,
+ intrinsic_reward_epsilon=args.intrinsic_reward_epsilon,
+ intrinsic_reward_alpha=args.intrinsic_reward_alpha,
+ intrinsic_reward_max_grad_norm=args.intrinsic_reward_max_grad_norm,
+ # for rnd and social influence
+ intrinsic_reward_loss_coef=args.intrinsic_reward_loss_coef,
+ # for ride
+ intrinsic_reward_forward_loss_coef=args.intrinsic_reward_forward_loss_coef,
+ intrinsic_reward_inverse_loss_coef=args.intrinsic_reward_inverse_loss_coef,
+ # for social influence
+ balance_moa_training=args.balance_moa_training,
+ moa_memory_dim=args.moa_memory_dim,
+ lr_schedule_end_frames=args.lr_schedule_end_frames,
+ end_lr=args.lr_end,
+ reset_rnd_ride_at_phase=args.reset_rnd_ride_at_phase,
+)
+
+if args.continue_train or args.finetune_train:
+ algo.load_status_dict(status)
+ # txt_logger.info(f"Model + Algo loaded from {args.continue_train or args.finetune_train}\n")
+ if args.continue_train:
+ txt_logger.info(f"Model + Algo loaded from {status_continue_path} \n")
+ elif args.finetune_train:
+ txt_logger.info(f"Model + Algo loaded from {status_finetune_path} \n")
+
+
+# todo: make nicer
+# Set and load test environment
+if args.test_set_name:
+ if args.test_set_name == "SocialAITestSet":
+ # "SocialAI-AskEyeContactLanguageBoxesInformationSeekingParamEnv-v1",
+ # "SocialAI-NoIntroPointingBoxesInformationSeekingParamEnv-v1"
+ test_env_names = [
+ "SocialAI-TestLanguageColorBoxesInformationSeekingEnv-v1",
+ "SocialAI-TestLanguageFeedbackBoxesInformationSeekingEnv-v1",
+ "SocialAI-TestPointingBoxesInformationSeekingEnv-v1",
+ "SocialAI-TestEmulationBoxesInformationSeekingEnv-v1",
+ "SocialAI-TestLanguageColorSwitchesInformationSeekingEnv-v1",
+ "SocialAI-TestLanguageFeedbackSwitchesInformationSeekingEnv-v1",
+ "SocialAI-TestPointingSwitchesInformationSeekingEnv-v1",
+ "SocialAI-TestEmulationSwitchesInformationSeekingEnv-v1",
+ "SocialAI-TestLanguageColorMarbleInformationSeekingEnv-v1",
+ "SocialAI-TestLanguageFeedbackMarbleInformationSeekingEnv-v1",
+ "SocialAI-TestPointingMarbleInformationSeekingEnv-v1",
+ "SocialAI-TestEmulationMarbleInformationSeekingEnv-v1",
+ "SocialAI-TestLanguageColorGeneratorsInformationSeekingEnv-v1",
+ "SocialAI-TestLanguageFeedbackGeneratorsInformationSeekingEnv-v1",
+ "SocialAI-TestPointingGeneratorsInformationSeekingEnv-v1",
+ "SocialAI-TestEmulationGeneratorsInformationSeekingEnv-v1",
+ "SocialAI-TestLanguageColorLeversInformationSeekingEnv-v1",
+ "SocialAI-TestLanguageFeedbackLeversInformationSeekingEnv-v1",
+ "SocialAI-TestPointingLeversInformationSeekingEnv-v1",
+ "SocialAI-TestEmulationLeversInformationSeekingEnv-v1",
+ "SocialAI-TestLanguageColorDoorsInformationSeekingEnv-v1",
+ "SocialAI-TestLanguageFeedbackDoorsInformationSeekingEnv-v1",
+ "SocialAI-TestPointingDoorsInformationSeekingEnv-v1",
+ "SocialAI-TestEmulationDoorsInformationSeekingEnv-v1",
+
+ "SocialAI-TestLeverDoorCollaborationEnv-v1",
+ "SocialAI-TestMarblePushCollaborationEnv-v1",
+ "SocialAI-TestMarblePassCollaborationEnv-v1",
+
+ "SocialAI-TestAppleStealingPerspectiveTakingEnv-v1"
+ ]
+ elif args.test_set_name == "SocialAIGSTestSet":
+ test_env_names = [
+ "SocialAI-GridSearchParamEnv-v1",
+ "SocialAI-GridSearchPointingParamEnv-v1",
+ "SocialAI-GridSearchLangColorParamEnv-v1",
+ "SocialAI-GridSearchLangFeedbackParamEnv-v1",
+ ]
+ elif args.test_set_name == "SocialAICuesGSTestSet":
+ test_env_names = [
+ "SocialAI-CuesGridSearchParamEnv-v1",
+ "SocialAI-CuesGridSearchPointingParamEnv-v1",
+ "SocialAI-CuesGridSearchLangColorParamEnv-v1",
+ "SocialAI-CuesGridSearchLangFeedbackParamEnv-v1",
+ ]
+ elif args.test_set_name == "BoxesPointingTestSet":
+ test_env_names = [
+ "SocialAI-TestPointingBoxesInformationSeekingParamEnv-v1",
+ ]
+ elif args.test_set_name == "PointingTestSet":
+ test_env_names = gym_minigrid.social_ai_envs.pointing_test_set
+ elif args.test_set_name == "LangColorTestSet":
+ test_env_names = gym_minigrid.social_ai_envs.language_color_test_set
+ elif args.test_set_name == "LangFeedbackTestSet":
+ test_env_names = gym_minigrid.social_ai_envs.language_feedback_test_set
+ # joint attention
+ elif args.test_set_name == "JAPointingTestSet":
+ test_env_names = gym_minigrid.social_ai_envs.ja_pointing_test_set
+ elif args.test_set_name == "JALangColorTestSet":
+ test_env_names = gym_minigrid.social_ai_envs.ja_language_color_test_set
+ elif args.test_set_name == "JALangFeedbackTestSet":
+ test_env_names = gym_minigrid.social_ai_envs.ja_language_feedback_test_set
+ # emulation
+ elif args.test_set_name == "DistrEmulationTestSet":
+ test_env_names = gym_minigrid.social_ai_envs.distr_emulation_test_set
+ elif args.test_set_name == "NoDistrEmulationTestSet":
+ test_env_names = gym_minigrid.social_ai_envs.no_distr_emulation_test_set
+ # formats
+ elif args.test_set_name == "NFormatsTestSet":
+ test_env_names = gym_minigrid.social_ai_envs.N_formats_test_set
+ elif args.test_set_name == "EFormatsTestSet":
+ test_env_names = gym_minigrid.social_ai_envs.E_formats_test_set
+ elif args.test_set_name == "AFormatsTestSet":
+ test_env_names = gym_minigrid.social_ai_envs.A_formats_test_set
+ elif args.test_set_name == "AEFormatsTestSet":
+ test_env_names = gym_minigrid.social_ai_envs.AE_formats_test_set
+
+ elif args.test_set_name == "RoleReversalTestSet":
+ test_env_names = gym_minigrid.social_ai_envs.role_reversal_test_set
+
+ else:
+ raise ValueError("Undefined test set name.")
+
+
+else:
+ test_env_names = [args.env]
+
+# test_envs = []
+testers = []
+if args.test_interval > 0:
+ for test_env_name in test_env_names:
+ make_env_args = {
+ "env_key": test_env_name,
+ "seed": args.test_seed,
+ "env_args": test_env_args,
+ }
+ testers.append(utils.Tester(
+ make_env_args, args.test_seed, args.test_episodes, model_dir, acmodel, preprocess_obss, device)
+ )
+
+ # test_env = utils.make_env(test_env_name, args.test_seed, env_args=test_env_args)
+ # test_envs.append(test_env)
+
+ # init tester
+ # testers.append(utils.Tester(test_env, args.test_seed, args.test_episodes, model_dir, acmodel, preprocess_obss, device))
+
+if args.continue_train:
+ for tester in testers:
+ tester.load()
+
+
+# Save config
+env_args_ = {k: v.__repr__() if k == "curriculum" else v for k, v in env_args.items()}
+test_env_args_ = {k: v.__repr__() if k == "curriculum" else v for k, v in test_env_args.items()}
+config_dict = {
+ "seed": args.seed,
+ "env": args.env,
+ "env_args": env_args_,
+ "test_seed": args.test_seed,
+ "test_env": args.test_set_name,
+ "test_env_args": test_env_args_
+}
+config_dict.update(algo.get_config_dict())
+config_dict.update(acmodel.get_config_dict())
+with open(model_dir+'/config.json', 'w') as fp:
+ json.dump(config_dict, fp)
+
+
+# Train model
+
+num_frames = status["num_frames"]
+update = status["update"]
+start_time = time.time()
+
+log_add_headers = num_frames == 0 or not args.continue_train
+
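+# Besides the rolling save every --save-interval updates, a separate snapshot tagged with the current frame count is kept every 5M frames.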
+long_term_save_interval = 5000000
+
+if args.continue_train:
+ # set next long term save interval
+ next_long_term_save = (1 + num_frames // long_term_save_interval) * long_term_save_interval
+
+else:
+ next_long_term_save = 0 # for long term logging
+
+
+while num_frames < args.frames:
+ # Update model parameters
+
+ update_start_time = time.time()
+ # print("current_seed_pre_train:", np.random.get_state()[1][0])
+ exps, logs1 = algo.collect_experiences()
+ logs2 = algo.update_parameters(exps)
+ logs = {**logs1, **logs2}
+ update_end_time = time.time()
+
+ num_frames += logs["num_frames"]
+ update += 1
+
+ NPC_intro = np.mean(logs["NPC_introduced_to"])
+
+ # Print logs
+
+ if update % args.log_interval == 0:
+ fps = logs["num_frames"]/(update_end_time - update_start_time)
+ duration = int(time.time() - start_time)
+ return_per_episode = utils.synthesize(logs["return_per_episode"])
+ extrinsic_return_per_episode = utils.synthesize(logs["extrinsic_return_per_episode"])
+ exploration_bonus_per_episode = utils.synthesize(logs["exploration_bonus_per_episode"])
+ success_rate = utils.synthesize(logs["success_rate_per_episode"])
+ curriculum_max_success_rate = utils.synthesize(logs["curriculum_max_mean_perf_per_episode"])
+ curriculum_param = utils.synthesize(logs["curriculum_param_per_episode"])
+ rreturn_per_episode = utils.synthesize(logs["reshaped_return_per_episode"])
+ num_frames_per_episode = utils.synthesize(logs["num_frames_per_episode"])
+
+ # intrinsic_reward_perf = utils.synthesize(logs["intr_reward_perf"])
+ # intrinsic_reward_perf_ = utils.synthesize(logs["intr_reward_perf_"])
+
+ intrinsic_reward_perf = logs["intr_reward_perf"]
+ intrinsic_reward_perf_ = logs["intr_reward_perf_"]
+
+ lr_ = logs["lr"]
+
+ time_now = int(datetime.datetime.now().strftime("%d%m%Y%H%M%S"))
+
+ header = ["update", "frames", "FPS", "duration", "time"]
+ data = [update, num_frames, fps, duration, time_now]
+ data_to_print = [update, num_frames, fps, duration, time_now]
+
+ header += ["success_rate_" + key for key in success_rate.keys()]
+ data += success_rate.values()
+ data_to_print += success_rate.values()
+
+ header += ["curriculum_max_success_rate_" + key for key in curriculum_max_success_rate.keys()]
+ data += curriculum_max_success_rate.values()
+ if args.acl:
+ data_to_print += curriculum_max_success_rate.values()
+
+ header += ["curriculum_param_" + key for key in curriculum_param.keys()]
+ data += curriculum_param.values()
+ if args.acl:
+ data_to_print += curriculum_param.values()
+
+ header += ["extrinsic_return_" + key for key in extrinsic_return_per_episode.keys()]
+ data += extrinsic_return_per_episode.values()
+ data_to_print += extrinsic_return_per_episode.values()
+
+ # turn on
+ header += ["exploration_bonus_" + key for key in exploration_bonus_per_episode.keys()]
+ data += exploration_bonus_per_episode.values()
+ data_to_print += exploration_bonus_per_episode.values()
+
+ header += ["rreturn_" + key for key in rreturn_per_episode.keys()]
+ data += rreturn_per_episode.values()
+ data_to_print += rreturn_per_episode.values()
+
+
+ header += ["intrinsic_reward_perf_"]
+ data += [intrinsic_reward_perf]
+ # data_to_print += [intrinsic_reward_perf]
+
+ header += ["intrinsic_reward_perf2_"]
+ data += [intrinsic_reward_perf_]
+ # data_to_print += [intrinsic_reward_perf_]
+
+ # header += ["num_frames_" + key for key in num_frames_per_episode.keys()]
+ # data += num_frames_per_episode.values()
+
+ header += ["NPC_intro"]
+ data += [NPC_intro]
+ data_to_print += [NPC_intro]
+
+ header += ["lr"]
+ data += [lr_]
+ data_to_print += [lr_]
+
+ # header += ["entropy", "value", "policy_loss", "value_loss", "grad_norm"]
+ # data += [logs["entropy"], logs["value"], logs["policy_loss"], logs["value_loss"], logs["grad_norm"]]
+
+ # curr_history_len = len(algo.env.envs[0].curriculum.performance_history)
+ # header += ["curr_history_len"]
+ # data += [curr_history_len]
+
+ txt_logger.info("".join([
+ "U {} | F {:06} | FPS {:04.0f} | D {} | T {} ",
+ "| SR:μσmM {:.2f} {:.1f} {:.1f} {:.1f} ",
+ "| CurMaxSR:μσmM {:.2f} {:.1f} {:.1f} {:.1f} " if args.acl else "",
+ "| CurPhase:μσmM {:.2f} {:.1f} {:.1f} {:.1f} " if args.acl else "",
+ "| ExR:μσmM {:.2f} {:.1f} {:.1f} {:.1f} ",
+ "| InR:μσmM {:.2f} {:.1f} {:.1f} {:.1f} ",
+ "| rR:μσmM {:.6f} {:.1f} {:.1f} {:.1f} ",
+ # "| irp:μσmM {:.6f} {:.2f} {:.2f} {:.2f} ",
+ # "| irp_:μσmM {:.6f} {:.2f} {:.2f} {:.2f} ",
+ # "| F:μσmM {:.1f} {:.1f} {} {} ",
+ "| NPC_intro: {:.3f}",
+ "| lr: {:.5f}",
+ # "| cur_his_len: {:.5f}" if args.acl else "",
+ # "| H {:.3f} | V {:.3f} | pL {:.3f} | vL {:.3f} | ∇ {:.3f}"
+ ]).format(*data_to_print))
+
+ header += ["return_" + key for key in return_per_episode.keys()]
+ data += return_per_episode.values()
+
+ if log_add_headers:
+ csv_logger.writerow(header)
+ log_add_headers = False
+ csv_logger.writerow(data)
+ csv_file.flush()
+
+ # Save status
+ long_term_save = False
+ if num_frames >= next_long_term_save:
+ next_long_term_save += long_term_save_interval
+ long_term_save = True
+
+ if (args.save_interval > 0 and update % args.save_interval == 0) or long_term_save:
+        # continuing training works best when save_interval == log_interval; the csv is cleaner without redundancies
+ status = {"num_frames": num_frames, "update": update}
+
+ algo_status = algo.get_status_dict()
+ status = {**status, **algo_status}
+
+ if hasattr(preprocess_obss, "vocab"):
+ status["vocab"] = preprocess_obss.vocab.vocab
+ status["env_args"] = env_args
+
+ if long_term_save:
+ utils.save_status(status, model_dir, num_frames=num_frames)
+ utils.save_model(acmodel, model_dir, num_frames=num_frames)
+ txt_logger.info("Status and Model saved for {} frames".format(num_frames))
+
+ else:
+ utils.save_status(status, model_dir)
+ utils.save_model(acmodel, model_dir)
+ txt_logger.info("Status and Model saved")
+
+ if args.test_interval > 0 and (update % args.test_interval == 0 or update == 1):
+ txt_logger.info(f"Testing at update {update}.")
+ test_success_rates = []
+ for tester in testers:
+ mean_success_rate, mean_rewards = tester.test_agent(num_frames)
+ test_success_rates.append(mean_success_rate)
+ txt_logger.info(f"\t{tester.envs[0].spec.id} -> {mean_success_rate} (SR)")
+ tester.dump()
+
+ if len(testers):
+ txt_logger.info(f"Test set SR: {np.array(test_success_rates).mean()}")
+
+
+# save at the end
+status = {"num_frames": num_frames, "update": update}
+algo_status = algo.get_status_dict()
+status = {**status, **algo_status}
+
+if hasattr(preprocess_obss, "vocab"):
+ status["vocab"] = preprocess_obss.vocab.vocab
+ status["env_args"] = env_args
+
+utils.save_status(status, model_dir)
+utils.save_model(acmodel, model_dir)
+txt_logger.info("Status and Model saved at the end")
diff --git a/scripts/visualize.py b/scripts/visualize.py
new file mode 100644
index 0000000000000000000000000000000000000000..f643eda716d4d1f24f38cc5517f7685b7569850c
--- /dev/null
+++ b/scripts/visualize.py
@@ -0,0 +1,159 @@
+import argparse
+import json
+import time
+import numpy as np
+import torch
+from pathlib import Path
+
+from utils.babyai_utils.baby_agent import load_agent
+from utils.env import make_env
+from utils.other import seed
+from utils.storage import get_model_dir
+from utils.storage import get_status
+from models import *
+import subprocess
+
+# Parse arguments
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--model", required=True,
+ help="name of the trained model (REQUIRED)")
+parser.add_argument("--seed", type=int, default=0,
+ help="random seed (default: 0)")
+parser.add_argument("--max-steps", type=int, default=None,
+ help="max num of steps")
+parser.add_argument("--shift", type=int, default=0,
+ help="number of times the environment is reset at the beginning (default: 0)")
+parser.add_argument("--argmax", action="store_true", default=False,
+ help="select the action with highest probability (default: False)")
+parser.add_argument("--pause", type=float, default=0.5,
+                    help="pause duration between two consecutive actions of the agent (default: 0.5)")
+parser.add_argument("--env-name", type=str, default=None, required=True,
+ help="env name")
+parser.add_argument("--gif", type=str, default=None,
+ help="store output as gif with the given filename", required=True)
+parser.add_argument("--episodes", type=int, default=10,
+ help="number of episodes to visualize")
+
+args = parser.parse_args()
+
+# Set seed for all randomness sources
+
+seed(args.seed)
+
+save = args.gif
+if save:
+ savename = args.gif
+ if savename == "model_id":
+ savename = args.model.replace('storage/', '')
+ savename = savename.replace('/','_')
+ savename += '_{}'.format(args.seed)
+
+
+
+
+# Set device
+
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+print(f"Device: {device}\n")
+
+# Load environment
+
+if str(args.model).startswith("./storage/"):
+ args.model = args.model.replace("./storage/", "")
+
+if str(args.model).startswith("storage/"):
+ args.model = args.model.replace("storage/", "")
+
+with open(Path("./storage") / args.model / "config.json") as f:
+ conf = json.load(f)
+
+if args.env_name is None:
+ # load env_args from status
+ env_args = {}
+    if "env_args" not in conf:
+ env_args = get_status(get_model_dir(args.model), None)['env_args']
+ else:
+ env_args = conf["env_args"]
+
+ env = make_env(args.env_name, args.seed, env_args=env_args)
+else:
+ env_name = args.env_name
+ env = make_env(args.env_name, args.seed)
+
+for _ in range(args.shift):
+ env.reset()
+print("Environment loaded\n")
+
+# Define agent
+model_dir = get_model_dir(args.model)
+num_frames = None
+agent = load_agent(env, model_dir, args.argmax, num_frames)
+
+print("Agent loaded\n")
+
+# Run the agent
+
+if save:
+ from imageio import mimsave
+ old_frames = []
+ frames = []
+
+# Create a window to view the environment
+env.render(mode='human')
+
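+# Grab the current matplotlib canvas (the environment's render window) as an RGB array so that frames can be stored for the gif.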
+def plt_2_rgb(env):
+ data = np.frombuffer(env.window.fig.canvas.tostring_rgb(), dtype=np.uint8)
+ data = data.reshape(env.window.fig.canvas.get_width_height()[::-1] + (3,))
+ return data
+
+
+for episode in range(args.episodes):
+ print("episode:", episode)
+ obs = env.reset()
+
+ env.render(mode='human')
+ if save:
+ frames.append(plt_2_rgb(env))
+
+ i = 0
+ while True:
+ i += 1
+
+ action = agent.get_action(obs)
+ obs, reward, done, _ = env.step(action)
+ agent.analyze_feedback(reward, done)
+ env.render(mode='human')
+
+ if save:
+ img = plt_2_rgb(env)
+ frames.append(img)
+ if done:
+ # quadruple last frame to pause between episodes
+ for i in range(3):
+ same_img = np.copy(img)
+ # toggle a pixel between frames to avoid cropping when going from gif to mp4
+ same_img[0,0,2] = 0 if (i % 2) == 0 else 255
+ frames.append(same_img)
+
+ if done or env.window.closed:
+ break
+
+ if args.max_steps is not None:
+ if i > args.max_steps:
+ break
+
+
+ if env.window.closed:
+ break
+
+if save:
+ # from IPython import embed; embed()
+ print(f"Saving to {savename} ", end="")
+ mimsave(savename + '.gif', frames, duration=args.pause)
+ # Reduce gif size
+ # bashCommand = "gifsicle -O3 --colors 32 -o {}.gif {}.gif".format(savename, savename)
+ # process = subprocess.run(bashCommand.split(), stdout=subprocess.PIPE)
+
+
+ print("Done.")
diff --git a/stester.py b/stester.py
new file mode 100644
index 0000000000000000000000000000000000000000..51dcc639e850db9c0a6a160af87bbd9266f7a40c
--- /dev/null
+++ b/stester.py
@@ -0,0 +1,83 @@
+import os
+import numpy as np
+import re
+from pathlib import Path
+from collections import defaultdict
+from scipy import stats
+
+experiments = Path("./results_1000/")
+
+results_dict = {}
+
+def label_parser(label):
+ label_parser_dict = {
+ "VIGIL4_WizardGuide_lang64_no_explo": "ABL_MH-BabyAI",
+ "VIGIL4_WizardTwoGuides_lang64_no_explo": "FULL_MH-BabyAI",
+
+ "VIGIL4_WizardGuide_lang64_mm": "ABL_MH-BabyAI-ExpBonus",
+ "VIGIL4_WizardTwoGuides_lang64_mm": "FULL_MH-BabyAI-ExpBonus",
+
+ "VIGIL4_WizardGuide_lang64_deaf_no_explo": "ABL_Deaf-MH-BabyAI",
+ "VIGIL4_WizardTwoGuides_lang64_deaf_no_explo": "FULL_Deaf-MH-BabyAI",
+
+ "VIGIL4_WizardGuide_lang64_bow": "ABL_MH-BabyAI-BOW",
+ "VIGIL4_WizardTwoGuides_lang64_bow": "FULL_MH-BabyAI-BOW",
+
+ "VIGIL4_WizardGuide_lang64_no_mem": "ABL_MH-BabyAI-no-mem",
+ "VIGIL4_WizardTwoGuides_lang64_no_mem": "FULL_MH-BabyAI-no-mem",
+
+ "VIGIL5_WizardGuide_lang64_bigru": "ABL_MH-BabyAI-bigru",
+ "VIGIL5_WizardTwoGuides_lang64_bigru": "FULL_MH-BabyAI-bigru",
+
+ "VIGIL5_WizardGuide_lang64_attgru": "ABL_MH-BabyAI-attgru",
+ "VIGIL5_WizardTwoGuides_lang64_attgru": "FULL_MH-BabyAI-attgru",
+
+ "VIGIL4_WizardGuide_lang64_curr_dial": "ABL_MH-BabyAI-current-dialogue",
+ "VIGIL4_WizardTwoGuides_lang64_curr_dial": "FULL_MH-BabyAI-current-dialogue",
+
+ "random_WizardGuide": "ABL_Random-agent",
+ "random_WizardTwoGuides": "FULL_Random-agent",
+ }
+ if sum([1 for k, v in label_parser_dict.items() if k in label]) != 1:
+ print("ERROR")
+ print(label)
+ exit()
+
+ for k, v in label_parser_dict.items():
+ if k in label: return v
+
+ return label
+
+for experiment_out_file in experiments.iterdir():
+ results_dict[label_parser(str(experiment_out_file))] = []
+ with open(experiment_out_file) as f:
+ for line in f:
+ if "seed success rate" in line:
+                seed_success_rate = float(re.search(r'[0-9]\.[0-9]*', line).group())
+ results_dict[label_parser(str(experiment_out_file))].append(seed_success_rate)
+
+assert set([len(v) for v in results_dict.values()]) == set([16])
+
+test_p = 0.05
+compare = {
+ "ABL_MH-BabyAI-ExpBonus": "ABL_MH-BabyAI",
+ "ABL_MH-BabyAI": "ABL_Deaf-MH-BabyAI",
+ "ABL_Deaf-MH-BabyAI": "ABL_Random-agent",
+ "FULL_MH-BabyAI-ExpBonus": "FULL_MH-BabyAI",
+ "FULL_MH-BabyAI": "FULL_Deaf-MH-BabyAI",
+ "FULL_Deaf-MH-BabyAI": "FULL_Random-agent",
+}
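+# Welch's t-test (unequal variances, equal_var=False) on the per-seed success rates of each pair of conditions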
+for k, v in compare.items():
+ p = stats.ttest_ind(
+ results_dict[k],
+ results_dict[v],
+ equal_var=False
+ ).pvalue
+ if np.isnan(p):
+ from IPython import embed; embed()
+    print("{} (m:{}) <---> {} (m:{}) = p: {} result: {}".format(
+        k, np.mean(results_dict[k]), v, np.mean(results_dict[v]), p,
+        "significantly different (p < {})".format(test_p) if p < test_p else "not significantly different (p >= {})".format(test_p)
+    ))
+ print()
+# from IPython import embed; embed()
\ No newline at end of file
diff --git a/svg2png.sh b/svg2png.sh
new file mode 100644
index 0000000000000000000000000000000000000000..65102e1bc9519e63731da01d089ef570f1a09abe
--- /dev/null
+++ b/svg2png.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+
+# Check if a directory path is provided as an argument
+if [ -z "$1" ]; then
+ echo "Please provide a directory path as an argument"
+ exit 1
+fi
+
+# Check if the provided directory path exists
+if [ ! -d "$1" ]; then
+ echo "The provided directory path does not exist"
+ exit 1
+fi
+
+# Convert SVG files to PNGs using ImageMagick
+for file in "$1"/*.svg; do
+ if [ -f "$file" ]; then
+ filename="${file%.*}"
+ convert "$file" "${filename}.png"
+ echo "Converted $file to ${filename}.png"
+ fi
+done
diff --git a/time_profiling.sh b/time_profiling.sh
new file mode 100644
index 0000000000000000000000000000000000000000..57d62dc233ce127e2c89cbfacb0c0cfd3e74f203
--- /dev/null
+++ b/time_profiling.sh
@@ -0,0 +1,3 @@
+#!/bin/bash
+rm -rf storage/test && python -m cProfile -o graphics/train.prof scripts/train.py --model test --seed 1 --algo ppo --dialogue --save-interval 100 --log-interval 1 --test-interval 0 --frames-per-proc 40 --multi-modal-babyai11-agent --env SocialAI-GridSearchParamEnv-v1 --clipped-rewards --batch-size 640 --clip-eps 0.2 --recurrence 5 --max-grad-norm 0.5 --epochs 4 --optim-eps 1e-05 --lr 0.0001 --entropy-coef 0.00001 --env-args see_through_walls True --arch original_endpool_res --env-args max_steps 80 --frames 12800
+#snakeviz train.prof
diff --git a/torch-ac/.gitignore b/torch-ac/.gitignore
new file mode 100644
index 0000000000000000000000000000000000000000..fb1b2f3a27757483bfcbc84312071ec7ebde7066
--- /dev/null
+++ b/torch-ac/.gitignore
@@ -0,0 +1,4 @@
+*__pycache__
+*egg-info
+build/*
+dist/*
\ No newline at end of file
diff --git a/torch-ac/LICENSE b/torch-ac/LICENSE
new file mode 100644
index 0000000000000000000000000000000000000000..346a2315807f6ee37d0b4e1daae48655b19af9d4
--- /dev/null
+++ b/torch-ac/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2019 Lucas Willems
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/torch-ac/README.md b/torch-ac/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..48e671d448fef521201574452049dd3e76b952b7
--- /dev/null
+++ b/torch-ac/README.md
@@ -0,0 +1,122 @@
+# PyTorch Actor-Critic deep reinforcement learning algorithms: A2C and PPO
+
+The `torch_ac` package contains the PyTorch implementation of two Actor-Critic deep reinforcement learning algorithms:
+
+- [Synchronous A3C (A2C)](https://arxiv.org/pdf/1602.01783.pdf)
+- [Proximal Policy Optimization (PPO)](https://arxiv.org/pdf/1707.06347.pdf)
+
+**Note:** An example of use of this package is given in the [`rl-starter-files` repository](https://github.com/lcswillems/rl-starter-files). More details below.
+
+## Features
+
+- **Recurrent policies**
+- Reward shaping
+- Handle observation spaces that are tensors or _dicts of tensors_
+- Handle _discrete_ action spaces
+- Observation preprocessing
+- Multiprocessing
+- CUDA
+
+## Installation
+
+```bash
+pip3 install torch-ac
+```
+
+**Note:** If you want to modify the `torch-ac` algorithms, you will instead need to install a cloned version, i.e.:
+```
+git clone https://github.com/lcswillems/torch-ac.git
+cd torch-ac
+pip3 install -e .
+```
+
+## Package components overview
+
+A brief overview of the components of the package:
+
+- `torch_ac.A2CAlgo` and `torch_ac.PPOAlgo` classes for A2C and PPO algorithms
+- `torch_ac.ACModel` and `torch_ac.RecurrentACModel` abstract classes for non-recurrent and recurrent actor-critic models
+- `torch_ac.DictList` class for making dictionaries of lists list-indexable and hence batch-friendly
+
+## Package components details
+
+The most important components of the package are detailed below.
+
+`torch_ac.A2CAlgo` and `torch_ac.PPOAlgo` provide the following methods:
+- `__init__`, which may take, among other parameters:
+  - an `acmodel` actor-critic model, i.e. an instance of a class inheriting from either `torch_ac.ACModel` or `torch_ac.RecurrentACModel`.
+  - a `preprocess_obss` function that transforms a list of observations into a list-indexable object `X` (e.g. a PyTorch tensor). The default `preprocess_obss` function converts observations into a PyTorch tensor.
+  - a `reshape_reward` function that takes an observation `obs`, the action `action` taken, the reward `reward` received and the terminal status `done` as inputs and returns a new reward. By default, the reward is not reshaped (a sketch is given after this list).
+  - a `recurrence` number specifying over how many timesteps the gradient is backpropagated. This number is only taken into account if a recurrent model is used and **must divide** the `num_frames_per_proc` parameter and, for PPO, the `batch_size` parameter.
+- `collect_experiences`, which runs the parallel environments, collects rollouts and returns experiences along with logs.
+- `update_parameters`, which updates the model parameters from the collected experiences and returns logs.
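+
+For illustration, a minimal `reshape_reward` could simply rescale the raw environment reward (a hypothetical sketch; the scaling factor is arbitrary and any shaping logic with this signature works):
+
+```python
+def reshape_reward(obs, action, reward, done):
+    # hypothetical example: rescale the raw environment reward
+    return 10 * reward
+
+# passed to the algorithm via the `reshape_reward` keyword, e.g.:
+# algo = torch_ac.PPOAlgo(envs, acmodel, device, reshape_reward=reshape_reward)
+```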
+
+`torch_ac.ACModel` has 2 abstract methods:
+- `__init__`, which takes an `observation_space` and an `action_space` as parameters.
+- `forward`, which takes N preprocessed observations `obs` as input and returns a PyTorch distribution `dist` and a tensor of values `value`. The tensor of values **must be** of size N, not N x 1.
+
+`torch_ac.RecurrentACModel` has 3 abstract methods (a minimal sketch follows the list):
+- `__init__`, which takes the same parameters as `torch_ac.ACModel`.
+- `forward`, which takes the same inputs as `torch_ac.ACModel`, along with a tensor of N memories `memory` of size N x M, where M is the size of one memory. It returns the same outputs as `torch_ac.ACModel`, plus a tensor of N memories `memory`.
+- `memory_size`, which returns the size M of one memory.
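+
+For concreteness, here is a minimal sketch of a recurrent model satisfying this interface (a hypothetical example assuming flat tensor observations and a discrete action space; real models would typically embed images and/or text first):
+
+```python
+import torch
+import torch.nn as nn
+import torch_ac
+from torch.distributions import Categorical
+
+class MinimalRecurrentACModel(nn.Module, torch_ac.RecurrentACModel):
+    def __init__(self, obs_size, n_actions, hidden_size=64):
+        super().__init__()
+        self.hidden_size = hidden_size
+        self.embedding = nn.Linear(obs_size, hidden_size)
+        self.memory_rnn = nn.LSTMCell(hidden_size, hidden_size)
+        self.actor = nn.Linear(hidden_size, n_actions)
+        self.critic = nn.Linear(hidden_size, 1)
+
+    @property
+    def memory_size(self):
+        # M: hidden and cell states are concatenated into a single memory vector
+        return 2 * self.hidden_size
+
+    def forward(self, obs, memory):
+        x = torch.relu(self.embedding(obs))
+        # split the memory vector back into (hidden state, cell state)
+        hidden = (memory[:, :self.hidden_size], memory[:, self.hidden_size:])
+        hidden = self.memory_rnn(x, hidden)
+        memory = torch.cat(hidden, dim=1)
+        dist = Categorical(logits=self.actor(hidden[0]))
+        value = self.critic(hidden[0]).squeeze(1)  # size N, not N x 1
+        return dist, value, memory
+```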
+
+**Note:** The `preprocess_obss` function must return a list-indexable object (e.g. a PyTorch tensor). If your observations are dictionaries, your `preprocess_obss` function may first convert a list of dictionaries into a dictionary of lists and then make it list-indexable using the `torch_ac.DictList` class as follows:
+
+```python
+>>> d = DictList({"a": [[1, 2], [3, 4]], "b": [[5], [6]]})
+>>> d.a
+[[1, 2], [3, 4]]
+>>> d[0]
+DictList({"a": [1, 2], "b": [5]})
+```
+
+**Note:** If you use an RNN, you will need to set `batch_first` to `True`.
+
+## Examples
+
+Examples of use of the package components are given in the [`rl-starter-files` repository](https://github.com/lcswillems/rl-starter-files).
+
+### Example of use of `torch_ac.A2CAlgo` and `torch_ac.PPOAlgo`
+
+```python
+...
+
+algo = torch_ac.PPOAlgo(envs, acmodel, device, args.frames_per_proc, args.discount, args.lr, args.gae_lambda,
+ args.entropy_coef, args.value_loss_coef, args.max_grad_norm, args.recurrence,
+ args.optim_eps, args.clip_eps, args.epochs, args.batch_size, preprocess_obss)
+
+...
+
+exps, logs1 = algo.collect_experiences()
+logs2 = algo.update_parameters(exps)
+```
+
+More details [here](https://github.com/lcswillems/rl-starter-files/blob/master/scripts/train.py).
+
+### Example of use of `torch_ac.DictList`
+
+```python
+torch_ac.DictList({
+ "image": preprocess_images([obs["image"] for obs in obss], device=device),
+ "text": preprocess_texts([obs["mission"] for obs in obss], vocab, device=device)
+})
+```
+
+More details [here](https://github.com/lcswillems/rl-starter-files/blob/master/utils/format.py).
+
+### Example of implementation of `torch_ac.RecurrentACModel`
+
+```python
+class ACModel(nn.Module, torch_ac.RecurrentACModel):
+ ...
+
+ def forward(self, obs, memory):
+ ...
+
+ return dist, value, memory
+```
+
+More details [here](https://github.com/lcswillems/rl-starter-files/blob/master/model.py).
+
+### Examples of `preprocess_obss` functions
+
+More details [here](https://github.com/lcswillems/rl-starter-files/blob/master/utils/format.py).
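+
+For a self-contained illustration, a `preprocess_obss` function for dict observations containing only an "image" key might look like this (a hypothetical sketch; the linked file shows the full versions used with `gym-minigrid`, including text preprocessing):
+
+```python
+import numpy
+import torch
+import torch_ac
+
+def preprocess_obss(obss, device=None):
+    # stack the per-environment images into one batch tensor and wrap it in a DictList
+    images = numpy.array([obs["image"] for obs in obss])
+    return torch_ac.DictList({
+        "image": torch.tensor(images, device=device, dtype=torch.float)
+    })
+```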
diff --git a/torch-ac/setup.py b/torch-ac/setup.py
new file mode 100644
index 0000000000000000000000000000000000000000..06635dfc2129344771f2b27ac774e45bacc3ab7e
--- /dev/null
+++ b/torch-ac/setup.py
@@ -0,0 +1,14 @@
+from setuptools import setup, find_packages
+
+setup(
+ name="torch_ac",
+ version="1.1.0",
+ keywords="reinforcement learning, actor-critic, a2c, ppo, multi-processes, gpu",
+ packages=find_packages(),
+ install_requires=[
+ "numpy==1.17.0",
+ #"torch>=1.10.2"
+ "torch==1.10.2"
+ #"torch==1.10.2+cu102"
+ ]
+)
diff --git a/torch-ac/torch_ac/__init__.py b/torch-ac/torch_ac/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..a0bc3f758b1fea075f6030bfa769075600af0353
--- /dev/null
+++ b/torch-ac/torch_ac/__init__.py
@@ -0,0 +1,3 @@
+from torch_ac.algos import A2CAlgo, PPOAlgo
+from torch_ac.model import ACModel, RecurrentACModel
+from torch_ac.utils import DictList
\ No newline at end of file
diff --git a/torch-ac/torch_ac/algos/__init__.py b/torch-ac/torch_ac/algos/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..0d6fa88c9723f478b3a2859e03049ab9fa1a0a68
--- /dev/null
+++ b/torch-ac/torch_ac/algos/__init__.py
@@ -0,0 +1,2 @@
+from torch_ac.algos.a2c import A2CAlgo
+from torch_ac.algos.ppo import PPOAlgo
\ No newline at end of file
diff --git a/torch-ac/torch_ac/algos/a2c.py b/torch-ac/torch_ac/algos/a2c.py
new file mode 100644
index 0000000000000000000000000000000000000000..fc720cd54ce95cd90cac40827c697810427e85bd
--- /dev/null
+++ b/torch-ac/torch_ac/algos/a2c.py
@@ -0,0 +1,110 @@
+import numpy
+import torch
+import torch.nn.functional as F
+
+from torch_ac.algos.base import BaseAlgo
+
+class A2CAlgo(BaseAlgo):
+ """The Advantage Actor-Critic algorithm."""
+
+ def __init__(self, envs, acmodel, device=None, num_frames_per_proc=None, discount=0.99, lr=0.01, gae_lambda=0.95,
+ entropy_coef=0.01, value_loss_coef=0.5, max_grad_norm=0.5, recurrence=4,
+ rmsprop_alpha=0.99, rmsprop_eps=1e-8, preprocess_obss=None, reshape_reward=None):
+ num_frames_per_proc = num_frames_per_proc or 8
+
+ super().__init__(envs, acmodel, device, num_frames_per_proc, discount, lr, gae_lambda, entropy_coef,
+ value_loss_coef, max_grad_norm, recurrence, preprocess_obss, reshape_reward)
+ self.optimizer = torch.optim.RMSprop(self.acmodel.parameters(), lr,
+ alpha=rmsprop_alpha, eps=rmsprop_eps)
+ raise NotImplementedError("This needs to be refactored to work with mm actions")
+
+ def update_parameters(self, exps):
+ # Compute starting indexes
+
+ inds = self._get_starting_indexes()
+
+ # Initialize update values
+
+ update_entropy = 0
+ update_value = 0
+ update_policy_loss = 0
+ update_value_loss = 0
+ update_loss = 0
+
+ # Initialize memory
+
+ if self.acmodel.recurrent:
+ memory = exps.memory[inds]
+
+ for i in range(self.recurrence):
+ # Create a sub-batch of experience
+
+ sb = exps[inds + i]
+
+ # Compute loss
+
+ if self.acmodel.recurrent:
+ dist, value, memory = self.acmodel(sb.obs, memory * sb.mask)
+ else:
+ dist, value = self.acmodel(sb.obs)
+
+ entropy = dist.entropy().mean()
+
+ policy_loss = -(dist.log_prob(sb.action) * sb.advantage).mean()
+
+ value_loss = (value - sb.returnn).pow(2).mean()
+
+ loss = policy_loss - self.entropy_coef * entropy + self.value_loss_coef * value_loss
+
+ # Update batch values
+
+ update_entropy += entropy.item()
+ update_value += value.mean().item()
+ update_policy_loss += policy_loss.item()
+ update_value_loss += value_loss.item()
+ update_loss += loss
+
+ # Update update values
+
+ update_entropy /= self.recurrence
+ update_value /= self.recurrence
+ update_policy_loss /= self.recurrence
+ update_value_loss /= self.recurrence
+ update_loss /= self.recurrence
+
+ # Update actor-critic
+
+ self.optimizer.zero_grad()
+ update_loss.backward()
+ update_grad_norm = sum(p.grad.data.norm(2) ** 2 for p in self.acmodel.parameters()) ** 0.5
+ torch.nn.utils.clip_grad_norm_(self.acmodel.parameters(), self.max_grad_norm)
+ self.optimizer.step()
+
+ # Log some values
+
+ logs = {
+ "entropy": update_entropy,
+ "value": update_value,
+ "policy_loss": update_policy_loss,
+ "value_loss": update_value_loss,
+ "grad_norm": update_grad_norm
+ }
+
+ return logs
+
+ def _get_starting_indexes(self):
+ """Gives the indexes of the observations given to the model and the
+ experiences used to compute the loss at first.
+
+ The indexes are the integers from 0 to `self.num_frames` with a step of
+ `self.recurrence`. If the model is not recurrent, they are all the
+ integers from 0 to `self.num_frames`.
+
+ Returns
+ -------
+        starting_indexes : numpy.ndarray of int
+            the indexes of the experiences to be used at first
+ """
+
+ starting_indexes = numpy.arange(0, self.num_frames, self.recurrence)
+ return starting_indexes
diff --git a/torch-ac/torch_ac/algos/base.py b/torch-ac/torch_ac/algos/base.py
new file mode 100644
index 0000000000000000000000000000000000000000..0e1c1ef232e54b3def6515cf94559d0fff70f3a6
--- /dev/null
+++ b/torch-ac/torch_ac/algos/base.py
@@ -0,0 +1,972 @@
+from abc import ABC, abstractmethod
+import numpy as np
+import torch
+
+from torch_ac.format import default_preprocess_obss
+from torch_ac.utils import DictList, ParallelEnv
+from torch_ac.intrinsic_reward_models import *
+
+from collections import Counter
+
+
+class BaseAlgo(ABC):
+ """The base class for RL algorithms."""
+
+ def __init__(self,
+ envs,
+ acmodel,
+ device,
+ num_frames_per_proc,
+ discount,
+ lr,
+ gae_lambda,
+ entropy_coef,
+ value_loss_coef,
+ max_grad_norm,
+ recurrence,
+ preprocess_obss,
+ reshape_reward,
+ exploration_bonus=False,
+ exploration_bonus_params=None,
+ exploration_bonus_tanh=None,
+ expert_exploration_bonus=False,
+ exploration_bonus_type="lang",
+ episodic_exploration_bonus=True,
+ utterance_moa_net=True, # used for social influence
+ clipped_rewards=False,
+ # default is set to fit RND
+ intrinsic_reward_loss_coef=0.1, # also used for social influence
+ intrinsic_reward_coef=0.1, # also used for social influence
+ intrinsic_reward_learning_rate=0.0001,
+ intrinsic_reward_momentum=0,
+ intrinsic_reward_epsilon=0.01,
+ intrinsic_reward_alpha=0.99,
+ intrinsic_reward_max_grad_norm=40,
+ intrinsic_reward_forward_loss_coef=10,
+ intrinsic_reward_inverse_loss_coef=0.1,
+ reset_rnd_ride_at_phase=False,
+ # social_influence
+ balance_moa_training=False,
+ moa_memory_dim=128,
+ ):
+ """
+ Initializes a `BaseAlgo` instance.
+
+ Parameters:
+ ----------
+ envs : list
+ a list of environments that will be run in parallel
+        acmodel : torch.nn.Module
+ the model
+ num_frames_per_proc : int
+ the number of frames collected by every process for an update
+ discount : float
+ the discount for future rewards
+ lr : float
+ the learning rate for optimizers
+ gae_lambda : float
+ the lambda coefficient in the GAE formula
+ ([Schulman et al., 2015](https://arxiv.org/abs/1506.02438))
+ entropy_coef : float
+ the weight of the entropy cost in the final objective
+ value_loss_coef : float
+ the weight of the value loss in the final objective
+ max_grad_norm : float
+ gradient will be clipped to be at most this value
+ recurrence : int
+ the number of steps the gradient is propagated back in time
+ preprocess_obss : function
+ a function that takes observations returned by the environment
+ and converts them into the format that the model can handle
+ reshape_reward : function
+ a function that shapes the reward, takes an
+ (observation, action, reward, done) tuple as an input
+ """
+
+ # Store parameters
+
+ self.env = ParallelEnv(envs)
+ self.acmodel = acmodel
+ self.device = device
+ self.num_frames_per_proc = num_frames_per_proc
+ self.discount = discount
+ self.lr = lr
+ self.gae_lambda = gae_lambda
+ self.entropy_coef = entropy_coef
+ self.value_loss_coef = value_loss_coef
+ self.max_grad_norm = max_grad_norm
+ self.recurrence = recurrence
+ self.preprocess_obss = preprocess_obss or default_preprocess_obss
+ self.reshape_reward = reshape_reward
+ self.exploration_bonus = exploration_bonus
+ self.expert_exploration_bonus = expert_exploration_bonus
+ self.exploration_bonus_type = exploration_bonus_type
+ self.episodic_exploration_bonus = episodic_exploration_bonus
+ self.clipped_rewards = clipped_rewards
+ self.update_epoch = 0
+ self.utterance_moa_net = utterance_moa_net # todo: as parameter
+
+ self.reset_rnd_ride_at_phase = reset_rnd_ride_at_phase
+ self.was_reset = False
+
+ # Control parameters
+
+ assert self.acmodel.recurrent or self.recurrence == 1
+ assert self.num_frames_per_proc % self.recurrence == 0
+
+ # Configure acmodel
+
+ self.acmodel.to(self.device)
+ self.acmodel.train()
+
+ # Store helpers values
+
+ self.num_procs = len(envs)
+ self.num_frames = self.num_frames_per_proc * self.num_procs
+
+ # Initialize experience values
+
+ shape = (self.num_frames_per_proc, self.num_procs)
+
+ self.obs = self.env.reset()
+ self.obss = [None]*(shape[0])
+
+ self.info = [{}]*(shape[0])
+ self.infos = [None]*(shape[0])
+ if self.acmodel.recurrent:
+ self.memory = torch.zeros(shape[1], self.acmodel.memory_size, device=self.device)
+ self.memories = torch.zeros(*shape, self.acmodel.memory_size, device=self.device)
+ self.mask = torch.ones(shape[1], device=self.device)
+ self.masks = torch.zeros(*shape, device=self.device)
+ self.next_masks = torch.zeros(*shape, device=self.device)
+
+ self.values = torch.zeros(*shape, device=self.device)
+ self.next_values = torch.zeros(*shape, device=self.device)
+ self.rewards = torch.zeros(*shape, device=self.device)
+ self.extrinsic_rewards = torch.zeros(*shape, device=self.device)
+ self.advantages = torch.zeros(*shape, device=self.device)
+
+ # as_shape = self.env.envs[0].action_space.shape
+ as_shape = self.acmodel.model_raw_action_space.shape
+ self.actions = torch.zeros(*(shape+as_shape), device=self.device, dtype=torch.int)
+ self.log_probs = torch.zeros(*(shape+as_shape), device=self.device)
+
+ # Initialize log values
+
+ self.log_episode_return = torch.zeros(self.num_procs, device=self.device)
+ self.log_episode_extrinsic_return = torch.zeros(self.num_procs, device=self.device)
+ self.log_episode_exploration_bonus = torch.zeros(self.num_procs, device=self.device)
+ self.log_episode_success_rate = torch.zeros(self.num_procs, device=self.device)
+ self.log_episode_curriculum_mean_perf = torch.zeros(self.num_procs, device=self.device)
+ self.log_episode_reshaped_return = torch.zeros(self.num_procs, device=self.device)
+ self.log_episode_num_frames = torch.zeros(self.num_procs, device=self.device)
+ self.log_episode_mission_string_observed = torch.zeros(self.num_procs, device=self.device)
+ self.log_episode_NPC_introduced_to = np.zeros(self.num_procs).astype(bool)
+ self.log_episode_curriculum_param = torch.zeros(self.num_procs, device=self.device)
+
+ self.intrinsic_reward_loss_coef = intrinsic_reward_loss_coef
+ self.intrinsic_reward_coef = intrinsic_reward_coef
+ self.intrinsic_reward_learning_rate = intrinsic_reward_learning_rate
+ self.intrinsic_reward_momentum = intrinsic_reward_momentum
+ self.intrinsic_reward_epsilon = intrinsic_reward_epsilon
+ self.intrinsic_reward_alpha = intrinsic_reward_alpha
+ self.intrinsic_reward_max_grad_norm = intrinsic_reward_max_grad_norm
+ self.intrinsic_reward_forward_loss_coef = intrinsic_reward_forward_loss_coef
+ self.intrinsic_reward_inverse_loss_coef = intrinsic_reward_inverse_loss_coef
+ self.balance_moa_training = balance_moa_training
+ self.moa_memory_dim = moa_memory_dim
+
+ self.log_done_counter = 0
+ self.log_return = [0] * self.num_procs
+ self.log_extrinsic_return = [0] * self.num_procs
+ self.log_exploration_bonus = [0] * self.num_procs
+ self.log_success_rate = [0] * self.num_procs
+ self.log_curriculum_max_mean_perf = [0] * self.num_procs
+ self.log_curriculum_param = [0] * self.num_procs
+ self.log_reshaped_return = [0] * self.num_procs
+ self.log_num_frames = [0] * self.num_procs
+ self.log_mission_string_observed = [0] * self.num_procs
+ self.log_NPC_introduced_to = [False] * self.num_procs
+ self.images_counter = [Counter() for _ in range(self.num_procs)]
+
+ if self.exploration_bonus:
+ self.visitation_counter = {}
+ self.exploration_bonus_params = {}
+ self.exploration_bonus_tanh = {}
+
+ for i, bonus_type in enumerate(self.exploration_bonus_type):
+ if bonus_type == "rnd":
+ assert not self.episodic_exploration_bonus
+ self.init_rnd_networks_and_optimizer()
+
+ elif bonus_type == "ride":
+ self.init_ride_networks_and_optimizer()
+
+
+ elif bonus_type == "soc_inf":
+
+ # npc actions
+ self.fn_name_to_npc_prim_act = self.env.envs[0].npc_prim_actions_dict
+
+ self.num_npc_prim_actions = len(self.fn_name_to_npc_prim_act)
+
+ self.npc_utterance_to_id = {a: i for i, a in enumerate(self.env.envs[0].all_npc_utterance_actions)}
+ self.num_npc_utterance_actions = len(self.npc_utterance_to_id)
+
+ if self.utterance_moa_net:
+ self.num_npc_all_actions = self.num_npc_prim_actions * self.num_npc_utterance_actions
+ else:
+ self.num_npc_all_actions = self.num_npc_prim_actions
+
+ # construct possible agent_action's list
+ self.all_possible_agent_actions, self.act_to_ind_dict = self.construct_all_possible_agent_actions()
+ self.agent_actions_tiled_all = None
+
+ im_shape = self.env.observation_space['image'].shape
+
+ embedding_size = self.acmodel.semi_memory_size
+
+ input_size = embedding_size \
+ + self.num_npc_prim_actions \
+ + self.acmodel.model_raw_action_space.nvec[0] \
+ + self.acmodel.model_raw_action_space.nvec[2] \
+ + self.acmodel.model_raw_action_space.nvec[3]
+
+ if self.utterance_moa_net:
+ input_size += self.num_npc_utterance_actions # todo: feed as index or as text?
+
+ self.moa_net = LSTMMoaNet(
+ input_size=input_size,
+ num_npc_prim_actions=self.num_npc_prim_actions,
+ num_npc_utterance_actions=self.num_npc_utterance_actions if self.utterance_moa_net else None,
+ acmodel=self.acmodel,
+ memory_dim=self.moa_memory_dim
+ ).to(device=self.device)
+
+ # memory
+ assert shape == (self.num_frames_per_proc, self.num_procs)
+ self.moa_memory = torch.zeros(shape[1], self.moa_net.memory_size, device=self.device)
+ self.moa_memories = torch.zeros(*shape, self.moa_net.memory_size, device=self.device)
+
+ elif bonus_type in ["cell", "grid", "lang"]:
+ if self.episodic_exploration_bonus:
+ self.visitation_counter[bonus_type] = [Counter() for _ in range(self.num_procs)]
+ else:
+ self.visitation_counter[bonus_type] = Counter()
+
+ if exploration_bonus_params:
+ self.exploration_bonus_params[bonus_type] = exploration_bonus_params[2*i:2*i+2]
+ else:
+ self.exploration_bonus_params[bonus_type] = (100, 50.)
+
+ if exploration_bonus_tanh is None:
+ self.exploration_bonus_tanh[bonus_type] = None
+ else:
+ self.exploration_bonus_tanh[bonus_type] = exploration_bonus_tanh[i]
+ else:
+ raise ValueError(f"bonus type: {bonus_type} unknown.")
+
+ def load_status_dict(self, status):
+
+ self.acmodel.load_state_dict(status["model_state"])
+
+ if hasattr(self.env, "curriculum") and self.env.curriculum is not None:
+ self.env.curriculum.load_status_dict(status)
+ self.env.broadcast_curriculum_parameters(self.env.curriculum.get_parameters())
+
+ # self.optimizer.load_state_dict(status["optimizer_state"])
+
+ if self.exploration_bonus:
+ for i, bonus_type in enumerate(self.exploration_bonus_type):
+
+ if bonus_type == "rnd":
+ self.random_target_network.load_state_dict(status["random_target_network"])
+ self.predictor_network.load_state_dict(status["predictor_network"])
+ self.intrinsic_reward_optimizer.load_state_dict(status["intrinsic_reward_optimizer"])
+
+ elif bonus_type == "ride":
+ self.forward_dynamics_model.load_state_dict(status["forward_dynamics_model"])
+ self.inverse_dynamics_model.load_state_dict(status["inverse_dynamics_model"])
+ self.state_embedding_model.load_state_dict(status["state_embedding_model"])
+
+ self.state_embedding_optimizer.load_state_dict(status["state_embedding_optimizer"])
+ self.inverse_dynamics_optimizer.load_state_dict(status["inverse_dynamics_optimizer"])
+ self.forward_dynamics_optimizer.load_state_dict(status["forward_dynamics_optimizer"])
+
+ elif bonus_type == "soc_inf":
+ self.moa_net.load_state_dict(status["moa_net"])
+
+ def get_status_dict(self):
+
+ algo_status_dict = {
+ "model_state": self.acmodel.state_dict(),
+ }
+
+ if hasattr(self.env, "curriculum") and self.env.curriculum is not None:
+ algo_status_dict = {
+ **algo_status_dict,
+ **self.env.curriculum.get_status_dict()
+ }
+
+ if self.exploration_bonus:
+ for i, bonus_type in enumerate(self.exploration_bonus_type):
+
+ if bonus_type == "rnd":
+ algo_status_dict["random_target_network"] = self.random_target_network.state_dict()
+ algo_status_dict["predictor_network"] = self.predictor_network.state_dict()
+ algo_status_dict["intrinsic_reward_optimizer"] = self.intrinsic_reward_optimizer.state_dict()
+
+ elif bonus_type == "ride":
+ algo_status_dict["forward_dynamics_model"] = self.forward_dynamics_model.state_dict()
+ algo_status_dict["inverse_dynamics_model"] = self.inverse_dynamics_model.state_dict()
+ algo_status_dict["state_embedding_model"] = self.state_embedding_model.state_dict()
+
+ algo_status_dict["state_embedding_optimizer"] = self.state_embedding_optimizer.state_dict()
+ algo_status_dict["inverse_dynamics_optimizer"] = self.inverse_dynamics_optimizer.state_dict()
+ algo_status_dict["forward_dynamics_optimizer"] = self.forward_dynamics_optimizer.state_dict()
+
+ elif bonus_type == "soc_inf":
+ algo_status_dict["moa_net"] = self.moa_net.state_dict()
+
+ return algo_status_dict
+
+ def construct_all_possible_agent_actions(self):
+
+ if self.acmodel is None:
+ raise ValueError("This should be called after the model has been set")
+
+ # add non-speaking actions
+
+        # a non-speaking action looks like (?, 0, 0, 0)
+        # the last two zeros would normally mean the first template and first word, but here they are to be
+        # ignored because of the second 0 (which means do not speak)
+ non_speaking_action_subspace = (self.acmodel.model_raw_action_space.nvec[0], 1, 1, 1)
+ non_speaking_actions = np.array(list(np.ndindex(non_speaking_action_subspace)))
+
+ # add speaking actions
+ speaking_action_subspace = (
+ self.acmodel.model_raw_action_space.nvec[0],
+ 1, # one action,
+ self.acmodel.model_raw_action_space.nvec[2],
+ self.acmodel.model_raw_action_space.nvec[3],
+ )
+
+ speaking_actions = np.array(list(np.ndindex(speaking_action_subspace)))
+ speaking_actions = self.acmodel.no_speak_to_speak_action(speaking_actions)
+
+ # all actions
+ all_possible_agent_actions = np.concatenate([non_speaking_actions, speaking_actions])
+
+ # create the action -> index dict
+ act_to_ind_dict = {tuple(act): ind for ind, act in enumerate(all_possible_agent_actions)}
+
+ # map other non-speaking actions to the (?, 0, 0, 0), ex. (3, 0, 4, 12) -> (3, 0, 0, 0)
+ other_non_speaking_action_subspace = (
+ self.acmodel.model_raw_action_space.nvec[0],
+ 1,
+ self.acmodel.model_raw_action_space.nvec[2],
+ self.acmodel.model_raw_action_space.nvec[3]
+ )
+ for action in np.ndindex(other_non_speaking_action_subspace):
+ assert action[1] == 0 # non-speaking
+ act_to_ind_dict[tuple(action)] = act_to_ind_dict[(action[0], 0, 0, 0)]
+
+ return all_possible_agent_actions, act_to_ind_dict
+
+ def step_to_n_frames(self, step):
+ return step * self.num_frames_per_proc * self.num_procs
+
+ def calculate_exploration_bonus(self, obs=None, done=None, prev_obs=None, info=None, prev_info=None, agent_actions=None, dist=None,
+ i_step=None, embeddings=None):
+
+ def state_hashes(observation, exploration_bonus_type):
+ if exploration_bonus_type == "lang":
+ hashes = [observation['utterance']]
+ assert len(hashes) == 1
+ elif exploration_bonus_type == "cell":
+ # for all new cells
+ im = observation["image"]
+ hashes = np.unique(im.reshape(-1, im.shape[-1]), axis=0)
+ hashes = np.apply_along_axis(lambda a: a.data.tobytes(), 1, hashes)
+
+ elif exploration_bonus_type == "grid":
+ # for seeing new grid configurations
+ im = observation["image"]
+ hashes = [im.data.tobytes()]
+ assert len(hashes) == 1
+ else:
+                raise ValueError(f"Unknown exploration bonus type {exploration_bonus_type}")
+
+ return hashes
+
+ total_bonus = [0]*len(obs)
+ for bonus_type in self.exploration_bonus_type:
+ if bonus_type == "rnd":
+ # -- [unroll_length x batch_size x height x width x channels] == [1, n_proc, 7, 7, 4]
+ batch = torch.tensor(np.array([[o['image'] for o in obs]])).to(self.device)
+
+ with torch.no_grad():
+ random_embedding = self.random_target_network(batch).reshape(len(obs), 128)
+ predicted_embedding = self.predictor_network(batch).reshape(len(obs), 128)
+ intrinsic_rewards = torch.norm(predicted_embedding.detach() - random_embedding.detach(), dim=1, p=2)
+ intrinsic_reward_coef = self.intrinsic_reward_coef
+ intrinsic_rewards *= intrinsic_reward_coef
+
+ # is this the best way? should we somehow extract the next_state?
+ bonus = [0.0 if d else float(r) for d, r in zip(done, intrinsic_rewards)]
+
+ elif bonus_type == "ride":
+ with torch.no_grad():
+ _obs = torch.tensor(np.array([[o['image'] for o in prev_obs]])).to(self.device)
+ _next_obs = torch.tensor(np.array([[o['image'] for o in obs]])).to(self.device)
+
+ # counts - number of times a state was seen during the SAME episode -> can be computed here
+ count_rewards = torch.tensor([1/np.sqrt(self.images_counter[p_i][np.array(o.to("cpu")).tobytes()]) for p_i, o in enumerate(_next_obs[0])]).to(self.device)
+ assert not any(torch.isinf(count_rewards))
+
+ state_emb = self.state_embedding_model(_obs.to(device=self.device)).reshape(len(obs), 128)
+ next_state_emb = self.state_embedding_model(_next_obs.to(device=self.device)).reshape(len(obs), 128)
+
+ control_rewards = torch.norm(next_state_emb - state_emb, dim=1, p=2)
+
+ intrinsic_rewards = self.intrinsic_reward_coef*(count_rewards * control_rewards)
+
+ # is this the best way? should we somehow extract the next_state?
+ bonus = [0.0 if d else float(r) for d, r in zip(done, intrinsic_rewards)]
+
+ elif bonus_type == "soc_inf":
+ if prev_info == [{}] * len(prev_info):
+ # this is the first step, info is not given during reset
+
+ # first step in the episode no influence can be estimated as there is no previous action
+ # todo: padd with zeros, and estimate anyway?
+ bonus = [0.0 for _ in done]
+ else:
+ # social influence
+ n_procs = len(obs)
+
+ _prev_NPC_prim_actions = torch.tensor(
+ [self.fn_name_to_npc_prim_act[o["NPC_prim_action"]] for o in prev_info]
+ ).to(self.device)
+
+ # todo: what is the best way to feed utt action?
+ _prev_NPC_utt_actions = torch.tensor(
+ [self.npc_utterance_to_id[o["NPC_utterance"]] for o in prev_info]
+ ).to(self.device)
+
+ # new
+ # calculate counterfactuals
+ npc_previous_prim_actions_all = _prev_NPC_prim_actions.repeat(len(self.all_possible_agent_actions)) # [A_ag*n_procs, ...]
+ npc_previous_utt_actions_all = _prev_NPC_utt_actions.repeat(len(self.all_possible_agent_actions)) # [A_ag*n_procs, ...]
+
+ # agent actions tiled
+ if self.agent_actions_tiled_all is not None:
+ agent_actions_tiled_all = self.agent_actions_tiled_all
+
+ else:
+ # only first time, we can't do it in init because we need len(im_obs)
+ agent_actions_tiled_all = []
+ for pot_agent_action in self.all_possible_agent_actions:
+ pot_agent_action_tiled = torch.from_numpy(np.tile(pot_agent_action, (n_procs, 1))) # [n_procs,...]
+ agent_actions_tiled_all.append(pot_agent_action_tiled.to(self.device))
+
+ agent_actions_tiled_all = torch.concat(agent_actions_tiled_all) # [A_ag*n_procs,....]
+
+ self.agent_actions_tiled_all = agent_actions_tiled_all
+
+ with torch.no_grad():
+ # todo: move this tiling above?
+ masked_memory = self.moa_memory * self.mask.unsqueeze(1)
+ masked_memory_tiled_all = masked_memory.repeat([len(self.all_possible_agent_actions), 1])
+ embedding_tiled_all = embeddings.repeat([len(self.all_possible_agent_actions), 1])
+
+ # use current memory for every action
+
+ counterfactuals_logits, moa_memory = self.moa_net(
+ embeddings=embedding_tiled_all,
+ # observations=observations_all,
+ npc_previous_prim_actions=npc_previous_prim_actions_all,
+ npc_previous_utterance_actions=npc_previous_utt_actions_all if self.utterance_moa_net else None,
+ agent_actions=agent_actions_tiled_all,
+ memory=masked_memory_tiled_all
+ ) # logits : [A_ag * n_procs, A_npc]
+
+ counterfactuals_logits = counterfactuals_logits.reshape(
+ [len(self.all_possible_agent_actions), n_procs, self.num_npc_all_actions])
+
+ counterfactuals_logits = counterfactuals_logits.swapaxes(0, 1) # [n_procs, A_ag, A_npc]
+
+ assert counterfactuals_logits.shape == (len(obs), len(self.all_possible_agent_actions), self.num_npc_all_actions)
+
+ # compute npc logits p(A_npc|A_ag, s)
+
+ # note: ex (5,0,5,2) is mapped to (5,0,0,0), todo: is this ok everywhere?
+ agent_action_indices = [self.act_to_ind_dict[tuple(act.cpu().numpy())] for act in agent_actions]
+ # ~ p(a_npc| a_ag, ...)
+
+ predicted_logits = torch.stack([ctr[ind] for ctr, ind in zip(counterfactuals_logits, agent_action_indices)])
+
+ assert i_step is not None
+ self.moa_memories[i_step] = self.moa_memory
+
+ # only save for the actions actually taken
+ self.moa_memory = moa_memory[agent_action_indices]
+
+ assert predicted_logits.shape == (len(obs), self.num_npc_all_actions)
+
+ predicted_probs = torch.softmax(predicted_logits, dim=1) # use exp_softmax or something?
+
+
+ # compute marginal npc logits p(A_npc|s) = sum( p(A_NPC|A_ag,s), for every A_ag )
+ # compute agent logits for all possible agent actions
+ per_non_speaking_action_log_probs = dist[0].logits + dist[1].logits[:, :1]
+
+ per_speaking_action_log_probs = []
+ for p in range(n_procs):
+
+ log_probs_for_proc_p = [d.logits[p].cpu().numpy() for d in dist]
+
+ # speaking actions
+ speaking_log_probs = log_probs_for_proc_p
+ speaking_log_probs[1] = speaking_log_probs[1][1:] # only the speak action
+
+ # sum everybody with everybody
+ out = np.add.outer(speaking_log_probs[0], speaking_log_probs[1]).reshape(-1)
+ out = np.add.outer(out, speaking_log_probs[2]).reshape(-1)
+ out = np.add.outer(out, speaking_log_probs[3]).reshape(-1)
+ per_speaking_action_log_probs_proc_p = out
+
+ per_speaking_action_log_probs.append(per_speaking_action_log_probs_proc_p)
+
+ per_speaking_action_log_probs = np.stack(per_speaking_action_log_probs)
+
+ agent_log_probs = torch.concat([
+ per_non_speaking_action_log_probs,
+ torch.tensor(per_speaking_action_log_probs).to(device=self.device),
+ ], dim=1)
+
+ # assert
+ for p in range(n_procs):
+ log_probs_for_proc_p = [d.logits[p].cpu().numpy() for d in dist]
+
+ assert torch.abs(agent_log_probs[p][self.act_to_ind_dict[(0, 1, 3, 1)]] - sum([p[a] for p, a in list(zip(log_probs_for_proc_p, (0, 1, 3, 1)))])) < 1e-5
+ assert torch.abs(agent_log_probs[p][self.act_to_ind_dict[(0, 1, 1, 10)]] - sum([p[a] for p, a in list(zip(log_probs_for_proc_p, (0, 1, 1, 10)))])) < 1e-5
+
+
+ agent_probs = agent_log_probs.exp()
+
+ counterfactuals_probs = counterfactuals_logits.softmax(dim=-1) # [n_procs, A_ag, A_npc]
+ counterfactuals_perm = counterfactuals_probs.permute(0, 2, 1) # [n_procs, A_npc, A_agent]
+
+ # compute marginal distributions
+ marginals = (counterfactuals_perm * agent_probs[:, None, :]).sum(-1)
+
+ # this already sums to one, so the following normalization is not needed
+ marginal_probs = marginals / marginals.sum(1, keepdims=True) # sum over npc_actions
+ assert marginal_probs.shape == (n_procs, self.num_npc_all_actions) # [batch, A_npc]
+
+ KL_loss = (predicted_probs * (predicted_probs.log() - marginal_probs.log())).sum(axis=-1)
+
+
+ intrinsic_rewards = self.intrinsic_reward_coef * KL_loss
+
+ # is the NPC observed in the image that is fed as input in this step
+ # (returned by the previous step() call )
+ NPC_observed = torch.tensor([pi["NPC_observed"] for pi in prev_info]).to(self.device)
+
+ intrinsic_rewards = intrinsic_rewards * NPC_observed
+
+ bonus = [0.0 if d else float(r) for d, r in zip(done, intrinsic_rewards)]
+
+ elif bonus_type in ["cell", "grid", "lang"]:
+ C, M = self.exploration_bonus_params[bonus_type]
+ C_ = C / self.num_frames_per_proc
+
+ if self.expert_exploration_bonus:
+ # expert
+ raise DeprecationWarning("Deprecated exploration bonus type")
+
+ elif self.episodic_exploration_bonus:
+
+ hashes = [state_hashes(o, bonus_type) for o in obs]
+ bonus = [
+ 0 if d else # no bonus if done
+ np.sum([
+ C_ / ((self.visitation_counter[bonus_type][i_p][h] + 1) ** M) for h in hs
+ ])
+ for i_p, (hs, d) in enumerate(zip(hashes, done))
+ ]
+
+ # update the counters
+ for i_p, (o, d, hs) in enumerate(zip(obs, done, hashes)):
+ if not d:
+ for h in hs:
+ self.visitation_counter[bonus_type][i_p][h] += 1
+
+ else:
+ raise DeprecationWarning("Use episodic exploration bonus.")
+ # non-episodic exploration bonus
+
+ bonus = [
+ 0 if d else # no bonus if done
+ np.sum([
+                            C_ / ((self.visitation_counter[bonus_type][h] + 1) ** M) for h in state_hashes(o, bonus_type)
+ ]) for o, d in zip(obs, done)
+ ]
+
+ # update the counters
+ for o, d in zip(obs, done):
+ if not d:
+                            for h in state_hashes(o, bonus_type):
+ self.visitation_counter[bonus_type][h] += 1
+
+ if self.exploration_bonus_tanh[bonus_type] is not None:
+ bonus = [np.tanh(b)*self.exploration_bonus_tanh[bonus_type] for b in bonus]
+ else:
+ raise ValueError(f"Unknown exploration bonus type {bonus_type}")
+
+ assert len(total_bonus) == len(bonus)
+ total_bonus = [tb+b for tb, b in zip(total_bonus, bonus)]
+
+ return total_bonus
+
+ def collect_experiences(self):
+ """Collects rollouts and computes advantages.
+
+ Runs several environments concurrently. The next actions are computed
+ in a batch mode for all environments at the same time. The rollouts
+ and advantages from all environments are concatenated together.
+
+ Returns
+ -------
+ exps : DictList
+            Contains actions, rewards, advantages, etc. as attributes.
+ Each attribute, e.g. `exps.reward` has a shape
+ (self.num_frames_per_proc * num_envs, ...). k-th block
+ of consecutive `self.num_frames_per_proc` frames contains
+ data obtained from the k-th environment. Be careful not to mix
+ data from different environments!
+ logs : dict
+ Useful stats about the training process, including the average
+ reward, policy loss, value loss, etc.
+ """
+
+ for i_step in range(self.num_frames_per_proc):
+ # Do one agent-environment interaction
+ preprocessed_obs = self.preprocess_obss(self.obs, device=self.device)
+ with torch.no_grad():
+
+ if self.acmodel.recurrent:
+ dist, value, memory, policy_embedding = self.acmodel(preprocessed_obs, self.memory * self.mask.unsqueeze(1), return_embeddings=True)
+ else:
+ dist, value, policy_embedding = self.acmodel(preprocessed_obs, return_embeddings=True)
+
+ action = self.acmodel.sample_action(dist)
+
+ obs, reward, done, info = self.env.step(
+ self.acmodel.construct_final_action(
+ action.cpu().numpy()
+ )
+ )
+
+ if hasattr(self.env, "curriculum") and self.env.curriculum is not None:
+ curriculum_params = self.env.curriculum.update_parameters({
+ "obs": obs,
+ "reward": reward,
+ "done": done,
+ "info": info,
+ })
+ # broadcast new parameters to all parallel environments
+ self.env.broadcast_curriculum_parameters(curriculum_params)
+
+ if self.reset_rnd_ride_at_phase and curriculum_params['phase'] == 2 and not self.was_reset:
+ self.was_reset = True
+ assert not self.episodic_exploration_bonus
+
+ for i, bonus_type in enumerate(self.exploration_bonus_type):
+ if bonus_type == "rnd":
+ self.init_rnd_networks_and_optimizer()
+
+ elif bonus_type == "ride":
+ self.init_ride_networks_and_optimizer()
+
+ for p_i, o in enumerate(obs):
+ self.images_counter[p_i][o['image'].tobytes()] += 1
+
+ extrinsic_reward = reward
+ exploration_bonus = (0,) * len(reward)
+
+ if self.exploration_bonus:
+ bonus = self.calculate_exploration_bonus(
+ obs=obs, done=done, prev_obs=self.obs, info=info, prev_info=self.info, agent_actions=action, dist=dist,
+ i_step=i_step, embeddings=policy_embedding,
+ )
+ exploration_bonus = bonus
+ reward = [r + b for r, b in zip(reward, bonus)]
+
+ if self.clipped_rewards:
+ # this should not be used with classic count-based rewards as they often,
+ # when combined with extr. rew go past 1.0
+ reward = list(map(float, torch.clamp(torch.tensor(reward), -1, 1)))
+
+ # Update experiences values
+ self.obss[i_step] = self.obs
+ self.obs = obs
+ self.infos[i_step] = info # info of this step is the current info
+ self.info = info # save as previous info
+
+ if self.acmodel.recurrent:
+ self.memories[i_step] = self.memory
+ self.memory = memory
+ self.masks[i_step] = self.mask
+ self.mask = 1 - torch.tensor(done, device=self.device, dtype=torch.float)
+
+ self.actions[i_step] = action
+ self.values[i_step] = value
+
+ if self.reshape_reward is not None:
+ self.rewards[i_step] = torch.tensor([
+ self.reshape_reward(obs_, action_, reward_, done_)
+ for obs_, action_, reward_, done_ in zip(obs, action, reward, done)
+ ], device=self.device)
+ else:
+ self.rewards[i_step] = torch.tensor(reward, device=self.device)
+
+ self.log_probs[i_step] = self.acmodel.calculate_log_probs(dist, action)
+
+ # Update log values
+
+ self.log_episode_return += torch.tensor(reward, device=self.device, dtype=torch.float)
+ self.log_episode_extrinsic_return += torch.tensor(extrinsic_reward, device=self.device, dtype=torch.float)
+ self.log_episode_exploration_bonus += torch.tensor(exploration_bonus, device=self.device, dtype=torch.float)
+ self.log_episode_success_rate = torch.tensor([i["success"] for i in info]).float().to(self.device)
+ self.log_episode_curriculum_mean_perf = torch.tensor([i.get("curriculum_info_max_mean_perf", 0) for i in info]).float().to(self.device)
+ self.log_episode_reshaped_return += self.rewards[i_step]
+ self.log_episode_num_frames += torch.ones(self.num_procs, device=self.device)
+ self.log_episode_curriculum_param = torch.tensor([i.get("curriculum_info_param", 0.0) for i in info]).float().to(self.device)
+ # self.log_episode_curriculum_param = torch.tensor([i.get("curriculum_info_mean_perf", 0.0) for i in info]).float().to(self.device)
+ assert self.log_episode_curriculum_param.var() == 0
+
+ log_episode_NPC_introduced_to_current = np.array([i.get('NPC_was_introduced_to', False) for i in info])
+ assert all((self.log_episode_NPC_introduced_to | log_episode_NPC_introduced_to_current) == log_episode_NPC_introduced_to_current)
+
+ self.log_episode_NPC_introduced_to = self.log_episode_NPC_introduced_to | log_episode_NPC_introduced_to_current
+
+ self.log_episode_mission_string_observed += torch.tensor([
+ float(m in o.get("utterance", ''))
+ for m, o in zip(self.env.get_mission(), self.obs)
+ ], device=self.device, dtype=torch.float)
+
+ for p, done_ in enumerate(done):
+ if done_:
+ self.log_mission_string_observed.append(
+ torch.clamp(self.log_episode_mission_string_observed[p], 0, 1).item()
+ )
+ self.log_done_counter += 1
+ self.log_return.append(self.log_episode_return[p].item())
+ self.log_extrinsic_return.append(self.log_episode_extrinsic_return[p].item())
+ self.log_exploration_bonus.append(self.log_episode_exploration_bonus[p].item())
+ self.log_success_rate.append(self.log_episode_success_rate[p].item())
+ self.log_curriculum_max_mean_perf.append(self.log_episode_curriculum_mean_perf[p].item())
+ self.log_reshaped_return.append(self.log_episode_reshaped_return[p].item())
+ self.log_num_frames.append(self.log_episode_num_frames[p].item())
+ self.log_curriculum_param.append(self.log_episode_curriculum_param[p].item())
+ if self.episodic_exploration_bonus:
+ for v in self.visitation_counter.values():
+ v[p] = Counter()
+ self.images_counter[p] = Counter()
+ self.log_NPC_introduced_to.append(self.log_episode_NPC_introduced_to[p])
+ # print("log history:", self.log_success_rate)
+ # print("log history len:", len(self.log_success_rate)-16)
+
+ self.log_episode_mission_string_observed *= self.mask
+ self.log_episode_return *= self.mask
+ self.log_episode_extrinsic_return *= self.mask
+ self.log_episode_exploration_bonus *= self.mask
+ self.log_episode_success_rate *= self.mask
+ self.log_episode_curriculum_mean_perf *= self.mask
+ self.log_episode_reshaped_return *= self.mask
+ self.log_episode_num_frames *= self.mask
+ self.log_episode_NPC_introduced_to *= self.mask.cpu().numpy().astype(bool)
+ self.log_episode_curriculum_param *= self.mask
+ # Add advantage and return to experiences
+
+ preprocessed_obs = self.preprocess_obss(self.obs, device=self.device)
+ with torch.no_grad():
+ if self.acmodel.recurrent:
+ _, next_value, _ = self.acmodel(preprocessed_obs, self.memory * self.mask.unsqueeze(1))
+ else:
+ _, next_value = self.acmodel(preprocessed_obs)
+ for f in reversed(range(self.num_frames_per_proc)):
+ next_mask = self.masks[f+1] if f < self.num_frames_per_proc - 1 else self.mask
+ next_value = self.values[f+1] if f < self.num_frames_per_proc - 1 else next_value
+ next_advantage = self.advantages[f+1] if f < self.num_frames_per_proc - 1 else 0
+
+ self.next_masks[f] = next_mask
+ self.next_values[f] = next_value
+
+ delta = self.rewards[f] + self.discount * next_value * next_mask - self.values[f]
+ self.advantages[f] = delta + self.discount * self.gae_lambda * next_advantage * next_mask
+
+ # Define experiences:
+ # the whole experience is the concatenation of the experience
+ # of each process.
+ # In comments below:
+ # - T is self.num_frames_per_proc,
+ # - P is self.num_procs,
+ # - D is the dimensionality.
+
+ exps = DictList()
+ exps.obs = [self.obss[f][p]
+ for p in range(self.num_procs)
+ for f in range(self.num_frames_per_proc)]
+
+ exps.infos = np.array([self.infos[f][p]
+ for p in range(self.num_procs)
+ for f in range(self.num_frames_per_proc)])
+
+ # obs: (p1 (f1,f2,f3) ; p2 (f1,f2,f3); p3 (f1,f2,f3)
+
+ if self.acmodel.recurrent:
+ # T x P x D -> P x T x D -> (P * T) x D
+ exps.memory = self.memories.transpose(0, 1).reshape(-1, *self.memories.shape[2:])
+ # T x P -> P x T -> (P * T) x 1
+ exps.mask = self.masks.transpose(0, 1).reshape(-1).unsqueeze(1)
+ exps.next_mask = self.next_masks.transpose(0, 1).reshape(-1).unsqueeze(1)
+
+ if self.exploration_bonus and "soc_inf" in self.exploration_bonus_type:
+ exps.moa_memory = self.moa_memories.transpose(0, 1).reshape(-1, *self.moa_memories.shape[2:])
+
+ # for all tensors below, T x P -> P x T -> P * T
+
+ exps.action = self.actions.transpose(0, 1).reshape((-1, self.actions.shape[-1]))
+ exps.log_prob = self.log_probs.transpose(0, 1).reshape((-1, self.actions.shape[-1]))
+
+ exps.value = self.values.transpose(0, 1).reshape(-1)
+ exps.next_value = self.next_values.transpose(0, 1).reshape(-1)
+ exps.reward = self.rewards.transpose(0, 1).reshape(-1)
+ exps.advantage = self.advantages.transpose(0, 1).reshape(-1)
+ exps.returnn = exps.value + exps.advantage
+
+ # Preprocess experiences
+
+ exps.obs = self.preprocess_obss(exps.obs, device=self.device)
+
+ # Log some values
+
+ keep = max(self.log_done_counter, self.num_procs)
+
+ flat_actions = self.actions.reshape(-1, self.actions.shape[-1])
+ action_modalities = {
+ "action_modality_{}".format(m): flat_actions[:, m].cpu().numpy() for m in range(self.actions.shape[-1])
+ }
+
+ if not self.exploration_bonus:
+ assert self.log_return == self.log_extrinsic_return
+
+ logs = {
+ "return_per_episode": self.log_return[-keep:],
+ "mission_string_observed": self.log_mission_string_observed[-keep:],
+ "extrinsic_return_per_episode": self.log_extrinsic_return[-keep:],
+ "exploration_bonus_per_episode": self.log_exploration_bonus[-keep:],
+ "success_rate_per_episode": self.log_success_rate[-keep:],
+ "curriculum_max_mean_perf_per_episode": self.log_curriculum_max_mean_perf[-keep:],
+ "curriculum_param_per_episode": self.log_curriculum_param[-keep:],
+ "reshaped_return_per_episode": self.log_reshaped_return[-keep:],
+ "num_frames_per_episode": self.log_num_frames[-keep:],
+ "num_frames": self.num_frames,
+ "NPC_introduced_to": self.log_NPC_introduced_to[-keep:],
+ **action_modalities
+ }
+
+ self.log_done_counter = 0
+ self.log_return = self.log_return[-self.num_procs:]
+ self.log_extrinsic_return = self.log_extrinsic_return[-self.num_procs:]
+ self.log_exploration_bonus = self.log_exploration_bonus[-self.num_procs:]
+ self.log_reshaped_return = self.log_reshaped_return[-self.num_procs:]
+ self.log_num_frames = self.log_num_frames[-self.num_procs:]
+
+ return exps, logs
+
+ def compute_advantages_and_returnn(self, exps):
+ """
+ This function can be used for algorithms which reuse old data (not online RL) to
+ recompute non episodic intrinsic rewards on old experience.
+ This method is not used in PPO training.
+
+ Example usage from update_parameters
+ advs, retnn = self.compute_advantages_and_returnn(exps)
+
+ # if you want to do a sanity check
+ assert torch.equal(exps.advantage, advs)
+ assert torch.equal(exps.returnn, retnn)
+
+ exps.advantages, exps.returnn = advs, retnn
+ """
+ shape = (self.num_frames_per_proc, self.num_procs)
+ advs = torch.zeros(*shape, device=self.device)
+
+ rewards = exps.reward.reshape(self.num_procs, self.num_frames_per_proc).transpose(0, 1)
+ values = exps.value.reshape(self.num_procs, self.num_frames_per_proc).transpose(0, 1)
+ next_values = exps.next_value.reshape(self.num_procs, self.num_frames_per_proc).transpose(0, 1)
+ next_masks = exps.next_mask.reshape(self.num_procs, self.num_frames_per_proc).transpose(0, 1)
+
+ for f in reversed(range(self.num_frames_per_proc)):
+ next_advantage = advs[f+1] if f < self.num_frames_per_proc - 1 else 0
+
+ delta = rewards[f] + self.discount * next_values[f] * next_masks[f] - values[f]
+ advs[f] = delta + self.discount * self.gae_lambda * next_advantage * next_masks[f]
+
+ advantage = advs.transpose(0, 1).reshape(-1)
+ returnn = exps.value + advantage
+ return advantage, returnn
+
+ @abstractmethod
+ def update_parameters(self):
+ pass
+
+ def init_rnd_networks_and_optimizer(self):
+ self.random_target_network = MinigridStateEmbeddingNet(self.env.observation_space['image'].shape).to(
+ device=self.device)
+ self.predictor_network = MinigridStateEmbeddingNet(self.env.observation_space['image'].shape).to(device=self.device)
+
+ self.intrinsic_reward_optimizer = torch.optim.RMSprop(
+ self.predictor_network.parameters(),
+ lr=self.intrinsic_reward_learning_rate,
+ momentum=self.intrinsic_reward_momentum,
+ eps=self.intrinsic_reward_epsilon,
+ alpha=self.intrinsic_reward_alpha,
+ )
+
+ def init_ride_networks_and_optimizer(self):
+ self.state_embedding_model = MinigridStateEmbeddingNet(self.env.observation_space['image'].shape).to(
+ device=self.device)
+ # linquistic actions
+ # n_actions = self.acmodel.model_raw_action_space.nvec.prod
+
+ # we only use primitive actions for ride
+ n_actions = self.acmodel.model_raw_action_space.nvec[0]
+
+ self.forward_dynamics_model = MinigridForwardDynamicsNet(n_actions).to(device=self.device)
+ self.inverse_dynamics_model = MinigridInverseDynamicsNet(n_actions).to(device=self.device)
+
+ self.state_embedding_optimizer = torch.optim.RMSprop(
+ self.state_embedding_model.parameters(),
+ lr=self.intrinsic_reward_learning_rate,
+ momentum=self.intrinsic_reward_momentum,
+ eps=self.intrinsic_reward_epsilon,
+ alpha=self.intrinsic_reward_alpha)
+
+ self.inverse_dynamics_optimizer = torch.optim.RMSprop(
+ self.inverse_dynamics_model.parameters(),
+ lr=self.intrinsic_reward_learning_rate,
+ momentum=self.intrinsic_reward_momentum,
+ eps=self.intrinsic_reward_epsilon,
+ alpha=self.intrinsic_reward_alpha)
+
+ self.forward_dynamics_optimizer = torch.optim.RMSprop(
+ self.forward_dynamics_model.parameters(),
+ lr=self.intrinsic_reward_learning_rate,
+ momentum=self.intrinsic_reward_momentum,
+ eps=self.intrinsic_reward_epsilon,
+ alpha=self.intrinsic_reward_alpha)
diff --git a/torch-ac/torch_ac/algos/ppo.py b/torch-ac/torch_ac/algos/ppo.py
new file mode 100644
index 0000000000000000000000000000000000000000..9ba3c0f92b4792fbe31ab09fe0408bfd4b2ebee7
--- /dev/null
+++ b/torch-ac/torch_ac/algos/ppo.py
@@ -0,0 +1,494 @@
+import numpy
+import torch
+import torch.nn.functional as F
+from torch_ac.intrinsic_reward_models import compute_forward_dynamics_loss, compute_inverse_dynamics_loss
+from sklearn.metrics import f1_score
+
+from torch_ac.algos.base import BaseAlgo
+
+def compute_balance_mask(target, n_classes):
+ if target.float().var() == 0:
+ # all the same class, don't train at all
+ return torch.zeros_like(target).detach()
+
+ # compute the balance mask
+ per_class_n = torch.bincount(target, minlength=n_classes)
+
+ # number of times the least common class (that appeared) appeared
+ n_for_each_class = per_class_n[torch.nonzero(per_class_n)].min()
+
+ # undersample other classes
+ per_class_n = n_for_each_class # sample each class that many times
+
+ balanced_indexes_ = []
+
+ for c in range(n_classes):
+ c_indexes = torch.where(target == c)[0]
+ if len(c_indexes) == 0:
+ continue
+
+ # c_sampled_indexes = c_indexes[torch.randint(len(c_indexes), (per_class_n,))]
+ c_sampled_indexes = c_indexes[torch.randperm(len(c_indexes))[:per_class_n]]
+ balanced_indexes_.append(c_sampled_indexes)
+
+ balanced_indexes = torch.concat(balanced_indexes_)
+ balance_mask = torch.zeros_like(target)
+ balance_mask[balanced_indexes] = 1.0
+
+ return balance_mask.detach()
+
+
+class PPOAlgo(BaseAlgo):
+ """The Proximal Policy Optimization algorithm
+    ([Schulman et al., 2017](https://arxiv.org/abs/1707.06347))."""
+
+ def __init__(self, envs, acmodel, device=None, num_frames_per_proc=None, discount=0.99, lr=0.001, gae_lambda=0.95,
+ entropy_coef=0.01, value_loss_coef=0.5, max_grad_norm=0.5, recurrence=4,
+ adam_eps=1e-5, clip_eps=0.2, epochs=4, batch_size=256, preprocess_obss=None,
+ reshape_reward=None, exploration_bonus=False, exploration_bonus_params=None,
+ expert_exploration_bonus=False, episodic_exploration_bonus=True, exploration_bonus_type="lang",
+ exploration_bonus_tanh=None, clipped_rewards=False, intrinsic_reward_epochs=0,
+ # default is set to fit RND
+ intrinsic_reward_coef=0.1,
+ intrinsic_reward_learning_rate=0.0001,
+ intrinsic_reward_momentum=0,
+ intrinsic_reward_epsilon=0.01,
+ intrinsic_reward_alpha=0.99,
+ intrinsic_reward_max_grad_norm=40,
+ intrinsic_reward_loss_coef=0.1,
+ intrinsic_reward_forward_loss_coef=10,
+ intrinsic_reward_inverse_loss_coef=0.1,
+ reset_rnd_ride_at_phase=False,
+ balance_moa_training=False,
+ moa_memory_dim=128,
+ schedule_lr=False,
+ lr_schedule_end_frames=0,
+ end_lr=0.0,
+ ):
+ num_frames_per_proc = num_frames_per_proc or 128
+
+ # save config
+ self.config = locals()
+
+ super().__init__(
+ envs=envs,
+ acmodel=acmodel,
+ device=device,
+ num_frames_per_proc=num_frames_per_proc,
+ discount=discount,
+ lr=lr,
+ gae_lambda=gae_lambda,
+ entropy_coef=entropy_coef,
+ value_loss_coef=value_loss_coef,
+ max_grad_norm=max_grad_norm,
+ recurrence=recurrence,
+ preprocess_obss=preprocess_obss,
+ reshape_reward=reshape_reward,
+ exploration_bonus=exploration_bonus,
+ expert_exploration_bonus=expert_exploration_bonus,
+ episodic_exploration_bonus=episodic_exploration_bonus,
+ exploration_bonus_params=exploration_bonus_params,
+ exploration_bonus_tanh=exploration_bonus_tanh,
+ exploration_bonus_type=exploration_bonus_type,
+ clipped_rewards=clipped_rewards,
+ intrinsic_reward_loss_coef=intrinsic_reward_loss_coef,
+ intrinsic_reward_coef=intrinsic_reward_coef,
+ intrinsic_reward_learning_rate=intrinsic_reward_learning_rate,
+ intrinsic_reward_momentum=intrinsic_reward_momentum,
+ intrinsic_reward_epsilon=intrinsic_reward_epsilon,
+ intrinsic_reward_alpha=intrinsic_reward_alpha,
+ intrinsic_reward_max_grad_norm=intrinsic_reward_max_grad_norm,
+ intrinsic_reward_forward_loss_coef=intrinsic_reward_forward_loss_coef,
+ intrinsic_reward_inverse_loss_coef=intrinsic_reward_inverse_loss_coef,
+ balance_moa_training=balance_moa_training,
+ moa_memory_dim=moa_memory_dim,
+ reset_rnd_ride_at_phase=reset_rnd_ride_at_phase,
+ )
+
+ self.clip_eps = clip_eps
+ self.epochs = epochs
+ self.intrinsic_reward_epochs = intrinsic_reward_epochs
+ self.batch_size = batch_size
+
+ assert self.batch_size % self.recurrence == 0
+
+ if self.exploration_bonus and "soc_inf" in self.exploration_bonus_type:
+ adam_params = list(dict.fromkeys(list(self.acmodel.parameters()) + list(self.moa_net.parameters())))
+ self.optimizer = torch.optim.Adam(adam_params, lr, eps=adam_eps)
+
+ else:
+ self.optimizer = torch.optim.Adam(self.acmodel.parameters(), lr, eps=adam_eps)
+
+ self.schedule_lr = schedule_lr
+
+ self.lr_schedule_end_frames = lr_schedule_end_frames
+
+ assert end_lr <= lr
+ def lr_lambda(step):
+ if self.lr_schedule_end_frames == 0:
+ # no schedule
+ return 1
+
+ end_factor = end_lr/lr
+ final_diminished_factor = 1-end_factor
+ n_frames = self.step_to_n_frames(step)
+ return 1 - (min(n_frames, self.lr_schedule_end_frames) / self.lr_schedule_end_frames) * final_diminished_factor
+
+ self.lr_scheduler = torch.optim.lr_scheduler.LambdaLR(self.optimizer, lr_lambda)
+
+ self.batch_num = 0
+
+ def load_status_dict(self, status):
+ super().load_status_dict(status)
+
+ if "optimizer_state" in status:
+ self.optimizer.load_state_dict(status["optimizer_state"])
+
+ if "lr_scheduler_state" in status:
+ self.lr_scheduler.load_state_dict(status["lr_scheduler_state"])
+
+ def get_status_dict(self):
+
+ status_dict = super().get_status_dict()
+
+ status_dict["optimizer_state"] = self.optimizer.state_dict()
+
+ status_dict["lr_scheduler_state"] = self.lr_scheduler.state_dict()
+
+ return status_dict
+
+ def update_parameters(self, exps):
+ # Collect experiences
+
+ self.acmodel.train()
+
+ self.update_epoch += 1
+
+ intr_rew_perf = torch.tensor(0.0)
+ intr_rew_perf_ = 0.0
+
+ social_influence = False
+
+ if self.exploration_bonus:
+ if "rnd" in self.exploration_bonus_type:
+ imgs = exps.obs.image.reshape(
+ self.num_procs, self.num_frames_per_proc, *exps.obs.image.shape[1:]
+ ).transpose(0, 1)
+ mask = exps.mask.reshape(
+ self.num_procs, self.num_frames_per_proc, 1,
+ ).transpose(0, 1)
+
+ self.random_target_network.train()
+ self.predictor_network.train()
+
+ random_embedding = self.random_target_network(imgs).reshape(self.num_frames_per_proc, self.num_procs, 128)
+ predicted_embedding = self.predictor_network(imgs).reshape(self.num_frames_per_proc, self.num_procs, 128)
+ intr_rew_loss = self.intrinsic_reward_loss_coef * compute_forward_dynamics_loss(mask*predicted_embedding, mask*random_embedding.detach())
+
+ # update the intr rew models
+ self.intrinsic_reward_optimizer.zero_grad()
+ intr_rew_loss.backward()
+ torch.nn.utils.clip_grad_norm_(self.predictor_network.parameters(), self.intrinsic_reward_max_grad_norm)
+ self.intrinsic_reward_optimizer.step()
+
+ intr_rew_perf = intr_rew_loss
+
+ elif "ride" in self.exploration_bonus_type:
+ imgs = exps.obs.image.reshape(
+ self.num_procs, self.num_frames_per_proc, *exps.obs.image.shape[1:]
+ ).transpose(0, 1)
+
+ mask = exps.mask.reshape(
+ self.num_procs, self.num_frames_per_proc
+ ).transpose(0, 1).to(torch.int64)
+
+ # we only take the first (primitive) action
+ action = exps.action[:, 0].reshape(
+ self.num_procs, self.num_frames_per_proc
+ ).transpose(0, 1).to(torch.int64)
+
+ _mask = mask[:-1]
+ _obs = imgs[:-1]
+ _actions = action[:-1]
+ _next_obs = imgs[1:]
+
+ self.state_embedding_model.train()
+ self.forward_dynamics_model.train()
+ self.inverse_dynamics_model.train()
+
+ state_emb = self.state_embedding_model(_obs.to(device=self.device))
+ next_state_emb = self.state_embedding_model(_next_obs.to(device=self.device))
+
+ pred_next_state_emb = self.forward_dynamics_model(state_emb, _actions.to(device=self.device))
+
+ pred_actions = self.inverse_dynamics_model(state_emb, next_state_emb)
+
+ forward_dynamics_loss = self.intrinsic_reward_forward_loss_coef * \
+ compute_forward_dynamics_loss(_mask[:,:,None]*pred_next_state_emb, _mask[:,:,None]*next_state_emb)
+
+ inverse_dynamics_loss = self.intrinsic_reward_inverse_loss_coef * \
+ compute_inverse_dynamics_loss(_mask[:,:,None]*pred_actions, _mask*_actions)
+
+ # update the intr rew models
+ self.state_embedding_optimizer.zero_grad()
+ self.forward_dynamics_optimizer.zero_grad()
+ self.inverse_dynamics_optimizer.zero_grad()
+
+ intr_rew_loss = forward_dynamics_loss + inverse_dynamics_loss
+ intr_rew_loss.backward()
+
+ torch.nn.utils.clip_grad_norm_(self.state_embedding_model.parameters(), self.intrinsic_reward_max_grad_norm)
+ torch.nn.utils.clip_grad_norm_(self.forward_dynamics_model.parameters(), self.intrinsic_reward_max_grad_norm)
+ torch.nn.utils.clip_grad_norm_(self.inverse_dynamics_model.parameters(), self.intrinsic_reward_max_grad_norm)
+
+ self.state_embedding_optimizer.step()
+ self.forward_dynamics_optimizer.step()
+ self.inverse_dynamics_optimizer.step()
+
+ intr_rew_perf = intr_rew_loss
+
+ elif "soc_inf" in self.exploration_bonus_type:
+
+ # trained together with the policy
+ social_influence = True
+ self.moa_net.train()
+ if self.intrinsic_reward_epochs > 0:
+ raise DeprecationWarning(f"Moa must be trained with the agent. intrinsic_reward_epochs must be 0 but is {self.intrinsic_reward_epochs}")
+
+ for _ in range(self.epochs):
+ # Initialize log values
+
+ log_entropies = []
+ log_values = []
+ log_policy_losses = []
+ log_value_losses = []
+ log_grad_norms = []
+ log_lrs = []
+
+ for inds in self._get_batches_starting_indexes():
+ # Initialize batch values
+
+ batch_entropy = 0
+ batch_value = 0
+ batch_policy_loss = 0
+ batch_value_loss = 0
+ batch_loss = 0
+
+ # intr reward metrics
+ batch_intr_rew_loss = 0
+ batch_intr_rew_acc = 0
+ batch_intr_rew_f1 = 0
+
+ # Initialize memory
+
+ if self.acmodel.recurrent:
+ memory = exps.memory[inds]
+
+ if social_influence:
+ # Initialize moa memory
+ moa_memory = exps.moa_memory[inds]
+ prev_npc_prim_action = None
+
+ for i in range(self.recurrence):
+ # Create a sub-batch of experience
+ sb = exps[inds + i]
+
+ # Compute loss
+ if self.acmodel.recurrent:
+ dist, value, memory, policy_embeddings = self.acmodel(sb.obs, memory * sb.mask, return_embeddings=True)
+ else:
+ dist, value, policy_embeddings = self.acmodel(sb.obs, return_embeddings=True)
+
+ losses = []
+
+ for head_i, d in enumerate(dist):
+ action_masks = self.acmodel.calculate_action_gradient_masks(sb.action).type(sb.log_prob.type())
+
+ entropy = (d.entropy() * action_masks[:, head_i]).mean()
+ ratio = torch.exp(d.log_prob(sb.action[:, head_i]) - sb.log_prob[:, head_i])
+ surr1 = ratio * sb.advantage
+ surr2 = torch.clamp(ratio, 1.0 - self.clip_eps, 1.0 + self.clip_eps) * sb.advantage
+ policy_loss = (
+ -torch.min(surr1, surr2) * action_masks[:, head_i]
+ ).mean()
+
+ value_clipped = sb.value + torch.clamp(value - sb.value, -self.clip_eps, self.clip_eps)
+ surr1 = (value - sb.returnn).pow(2)
+ surr2 = (value_clipped - sb.returnn).pow(2)
+ value_loss = (
+ torch.max(surr1, surr2) * action_masks[:, head_i]
+ ).mean()
+
+ head_loss = policy_loss - self.entropy_coef * entropy + self.value_loss_coef * value_loss
+ losses.append(head_loss)
+
+ if social_influence:
+ # moa loss
+ imgs = sb.obs.image
+ mask = sb.mask.to(torch.int64)
+ # we only take the first (primitive) action
+ agent_action = sb.action.to(torch.int64)
+ infos = numpy.array(sb.infos)
+ npc_prim_action = torch.tensor(
+ numpy.array([self.fn_name_to_npc_prim_act[info["NPC_prim_action"]] for info in infos]))
+ npc_utt_action = torch.tensor(
+ numpy.array([self.npc_utterance_to_id[info["NPC_utterance"]] for info in infos]))
+
+ assert infos.shape == imgs.shape[:1] == agent_action.shape[:1] # [bs]
+
+ if i == 0:
+ prev_npc_prim_action = npc_prim_action
+ prev_npc_utt_action = npc_utt_action
+
+ else:
+ # compute loss and train moa net
+ if self.utterance_moa_net:
+ # transform to long logits
+ target = npc_prim_action.detach().to(self.device) * self.num_npc_utterance_actions + npc_utt_action.detach().to(self.device)
+ else:
+ target = npc_prim_action.detach().to(self.device)
+
+ if self.balance_moa_training:
+ balance_mask = compute_balance_mask(target, n_classes=self.num_npc_all_actions)
+ else:
+ balance_mask = torch.ones_like(target)
+
+ moa_predictions_logs, moa_memory = self.moa_net(
+ embeddings=policy_embeddings,
+ npc_previous_prim_actions=prev_npc_prim_action.detach().to(self.device),
+ npc_previous_utterance_actions=prev_npc_utt_action.detach().to(self.device) if self.utterance_moa_net else None,
+ agent_actions=agent_action.detach().to(self.device),
+ memory=moa_memory * mask,
+ )
+
+ # moa_predictions_logs = moa_predictions_logs.reshape([*prev_shape, -1]) # is this needed
+
+ # loss
+ ce_loss = torch.nn.CrossEntropyLoss(reduction='none')
+ intr_rew_loss = (
+ balance_mask * mask * ce_loss(
+ input=moa_predictions_logs,
+ target=target,
+ )).mean() * self.intrinsic_reward_loss_coef
+
+ preds = moa_predictions_logs.detach().argmax(dim=-1)
+ intr_rew_f1 = f1_score(
+ y_pred=preds.detach().cpu().numpy(),
+ y_true=target.detach().cpu().numpy(),
+ average="macro"
+ )
+
+ intr_rew_acc = (
+ torch.argmax(moa_predictions_logs.to(self.device), dim=-1) == target
+ ).to(float).mean()
+
+ batch_intr_rew_loss += intr_rew_loss
+ batch_intr_rew_acc += intr_rew_acc.detach().cpu().numpy().mean()
+ batch_intr_rew_f1 += intr_rew_f1
+
+ losses.append(intr_rew_loss) # trained with the policy optimizer
+
+ loss = torch.stack(losses).mean()
+
+ # Update batch values
+ batch_entropy += entropy.item()
+ batch_value += value.mean().item()
+ batch_policy_loss += policy_loss.item()
+ batch_value_loss += value_loss.item()
+ batch_loss += loss
+
+ # Update memories for next epoch
+ # assert self.acmodel.recurrent == (self.recurrence > 1)
+ if self.acmodel.recurrent and i < self.recurrence - 1:
+ exps.memory[inds + i + 1] = memory.detach()
+
+ if social_influence and i < self.recurrence - 1:
+ exps.moa_memory[inds + i + 1] = moa_memory.detach()
+
+
+ # Update batch values
+ batch_entropy /= self.recurrence
+ batch_value /= self.recurrence
+ batch_policy_loss /= self.recurrence
+ batch_value_loss /= self.recurrence
+ batch_loss /= self.recurrence
+
+ # Update actor-critic
+ self.optimizer.zero_grad()
+ batch_loss.backward()
+ grad_norm = sum(p.grad.data.norm(2).item() ** 2 for p in self.acmodel.parameters()) ** 0.5
+ torch.nn.utils.clip_grad_norm_(self.acmodel.parameters(), self.max_grad_norm)
+ self.optimizer.step()
+
+ self.lr_scheduler.step()
+
+ if social_influence:
+ # recurrence-1 because we skipped the first step
+ batch_intr_rew_loss /= self.recurrence - 1
+ batch_intr_rew_acc /= self.recurrence - 1
+ batch_intr_rew_f1 /= self.recurrence - 1
+
+ intr_rew_perf_ = batch_intr_rew_f1
+ intr_rew_perf = batch_intr_rew_acc
+
+ # Update log values
+
+ log_entropies.append(batch_entropy)
+ log_values.append(batch_value)
+ log_policy_losses.append(batch_policy_loss)
+ log_value_losses.append(batch_value_loss)
+ log_grad_norms.append(grad_norm)
+ log_lrs.append(self.optimizer.param_groups[0]['lr'])
+
+ # Log some values
+
+ logs = {
+ "entropy": numpy.mean(log_entropies),
+ "value": numpy.mean(log_values),
+ "policy_loss": numpy.mean(log_policy_losses),
+ "value_loss": numpy.mean(log_value_losses),
+ "grad_norm": numpy.mean(log_grad_norms),
+ "intr_reward_perf": intr_rew_perf,
+ "intr_reward_perf_": intr_rew_perf_,
+ "lr": numpy.mean(log_lrs),
+ }
+
+ return logs
+
+ def _get_batches_starting_indexes(self):
+ """Gives, for each batch, the indexes of the observations given to
+ the model and the experiences used to compute the loss at first.
+
+ First, the indexes are the integers from 0 to `self.num_frames` with a step of
+ `self.recurrence`, shifted by `self.recurrence//2` every other call to get
+ more diverse batches. Then, the indexes are split into the different batches.
+
+ Returns
+ -------
+ batches_starting_indexes : list of list of int
+ the indexes of the experiences to be used at first for each batch
+ """
+
+ indexes = numpy.arange(0, self.num_frames, self.recurrence)
+ indexes = numpy.random.permutation(indexes)
+
+ # Shift starting indexes by self.recurrence//2 half the time
+ if self.batch_num % 2 == 1:
+ indexes = indexes[(indexes + self.recurrence) % self.num_frames_per_proc != 0]
+ indexes += self.recurrence // 2
+ self.batch_num += 1
+
+ num_indexes = self.batch_size // self.recurrence
+ batches_starting_indexes = [indexes[i:i+num_indexes] for i in range(0, len(indexes), num_indexes)]
+
+ return batches_starting_indexes
+
+ def get_config_dict(self):
+
+ del self.config['envs']
+ del self.config['acmodel']
+ del self.config['__class__']
+ del self.config['self']
+ del self.config['preprocess_obss']
+ del self.config['device']
+ return self.config
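
As a quick illustration of the class-balancing logic defined at the top of this file, the sketch below (not part of the patch, assuming the patched package is importable as `torch_ac`) shows how `compute_balance_mask` undersamples over-represented classes so that every class present in the MOA target contributes the same number of samples to the loss:

```python
import torch
from torch_ac.algos.ppo import compute_balance_mask

# 3 classes, heavily skewed towards class 0
target = torch.tensor([0, 0, 0, 0, 1, 2, 2])
mask = compute_balance_mask(target, n_classes=3)

# the least common present class (1) appears once, so one sample is kept per class
assert int(mask.sum()) == 3
print(mask)  # e.g. tensor([1, 0, 0, 0, 1, 0, 1]) -- the kept indexes are random
```
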
diff --git a/torch-ac/torch_ac/format.py b/torch-ac/torch_ac/format.py
new file mode 100644
index 0000000000000000000000000000000000000000..b42ebe5f3ba302c7c662a14b8b0963079d8a42b5
--- /dev/null
+++ b/torch-ac/torch_ac/format.py
@@ -0,0 +1,4 @@
+import torch
+
+def default_preprocess_obss(obss, device=None):
+ return torch.tensor(obss, device=device)
\ No newline at end of file
diff --git a/torch-ac/torch_ac/intrinsic_reward_models.py b/torch-ac/torch_ac/intrinsic_reward_models.py
new file mode 100644
index 0000000000000000000000000000000000000000..dc9dbd5ac0570dffe716f7a7294c2060eacd9180
--- /dev/null
+++ b/torch-ac/torch_ac/intrinsic_reward_models.py
@@ -0,0 +1,212 @@
+from torch import nn
+import torch
+from torch.nn import functional as F
+
+
+def init(module, weight_init, bias_init, gain=1):
+ weight_init(module.weight.data, gain=gain)
+ bias_init(module.bias.data)
+ return module
+
+class MinigridInverseDynamicsNet(nn.Module):
+ def __init__(self, num_actions):
+ super(MinigridInverseDynamicsNet, self).__init__()
+ self.num_actions = num_actions
+
+ init_ = lambda m: init(m, nn.init.orthogonal_, lambda x: nn.init.
+ constant_(x, 0), nn.init.calculate_gain('relu'))
+ self.inverse_dynamics = nn.Sequential(
+ init_(nn.Linear(2 * 128, 256)),
+ nn.ReLU(),
+ )
+
+ init_ = lambda m: init(m, nn.init.orthogonal_,
+ lambda x: nn.init.constant_(x, 0))
+ self.id_out = init_(nn.Linear(256, self.num_actions))
+
+ def forward(self, state_embedding, next_state_embedding):
+ inputs = torch.cat((state_embedding, next_state_embedding), dim=2)
+ action_logits = self.id_out(self.inverse_dynamics(inputs))
+ return action_logits
+
+class MinigridForwardDynamicsNet(nn.Module):
+ def __init__(self, num_actions):
+ super(MinigridForwardDynamicsNet, self).__init__()
+ self.num_actions = num_actions
+
+ init_ = lambda m: init(m, nn.init.orthogonal_, lambda x: nn.init.
+ constant_(x, 0), nn.init.calculate_gain('relu'))
+
+ self.forward_dynamics = nn.Sequential(
+ init_(nn.Linear(128 + self.num_actions, 256)),
+ nn.ReLU(),
+ )
+
+ init_ = lambda m: init(m, nn.init.orthogonal_,
+ lambda x: nn.init.constant_(x, 0))
+
+ self.fd_out = init_(nn.Linear(256, 128))
+
+ def forward(self, state_embedding, action):
+ action_one_hot = F.one_hot(action, num_classes=self.num_actions).float()
+ inputs = torch.cat((state_embedding, action_one_hot), dim=2)
+ next_state_emb = self.fd_out(self.forward_dynamics(inputs))
+ return next_state_emb
+
+
+class MinigridStateEmbeddingNet(nn.Module):
+ def __init__(self, observation_shape):
+ super(MinigridStateEmbeddingNet, self).__init__()
+ self.observation_shape = observation_shape
+
+ init_ = lambda m: init(m, nn.init.orthogonal_, lambda x: nn.init.
+ constant_(x, 0), nn.init.calculate_gain('relu'))
+
+ self.feat_extract = nn.Sequential(
+ init_(nn.Conv2d(in_channels=self.observation_shape[2], out_channels=32, kernel_size=(3, 3),
+ stride=2, padding=1)),
+ nn.ELU(),
+ init_(nn.Conv2d(in_channels=32, out_channels=32, kernel_size=(3, 3), stride=2, padding=1)),
+ nn.ELU(),
+ init_(nn.Conv2d(in_channels=32, out_channels=128, kernel_size=(3, 3), stride=2, padding=1)),
+ nn.ELU(),
+ )
+
+ def forward(self, inputs):
+ # -- [unroll_length x batch_size x height x width x channels]
+ x = inputs
+ T, B, *_ = x.shape
+
+ # -- [unroll_length*batch_size x height x width x channels]
+ x = torch.flatten(x, 0, 1) # Merge time and batch.
+
+ x = x.float() / 255.0
+
+ # -- [unroll_length*batch_size x channels x width x height]
+ x = x.transpose(1, 3)
+ x = self.feat_extract(x)
+
+ state_embedding = x.view(T, B, -1)
+
+ return state_embedding
+
+def compute_forward_dynamics_loss(pred_next_emb, next_emb):
+ forward_dynamics_loss = torch.norm(pred_next_emb - next_emb, dim=2, p=2)
+ return torch.sum(torch.mean(forward_dynamics_loss, dim=1))
+
+def compute_inverse_dynamics_loss(pred_actions, true_actions):
+ inverse_dynamics_loss = F.nll_loss(
+ F.log_softmax(torch.flatten(pred_actions, 0, 1), dim=-1),
+ target=torch.flatten(true_actions, 0, 1),
+ reduction='none')
+ inverse_dynamics_loss = inverse_dynamics_loss.view_as(true_actions)
+ return torch.sum(torch.mean(inverse_dynamics_loss, dim=1))
+
+class LSTMMoaNet(nn.Module):
+ def __init__(self, input_size, num_npc_prim_actions, acmodel, num_npc_utterance_actions=None, memory_dim=128):
+ super(LSTMMoaNet, self).__init__()
+ self.num_npc_prim_actions = num_npc_prim_actions
+ self.num_npc_utterance_actions = num_npc_utterance_actions
+ self.utterance_moa = num_npc_utterance_actions is not None
+ self.input_size = input_size
+ self.acmodel = acmodel
+
+
+ init_ = lambda m: init(m, nn.init.orthogonal_, lambda x: nn.init.
+ constant_(x, 0), nn.init.calculate_gain('relu'))
+
+ self.hidden_size = 128 # 256 in the original paper
+ self.forward_dynamics = nn.Sequential(
+ init_(nn.Linear(self.input_size, self.hidden_size)),
+ nn.ReLU(),
+ )
+
+ self.memory_dim = memory_dim
+ self.memory_rnn = nn.LSTMCell(self.hidden_size, self.memory_dim)
+ self.embedding_size = self.semi_memory_size
+
+ init_ = lambda m: init(m, nn.init.orthogonal_,
+ lambda x: nn.init.constant_(x, 0))
+
+ self.fd_out_prim = init_(nn.Linear(self.embedding_size, self.num_npc_prim_actions))
+
+ if self.utterance_moa:
+ self.fd_out_utt = init_(nn.Linear(self.embedding_size, self.num_npc_utterance_actions))
+
+ @property
+ def memory_size(self):
+ return 2 * self.semi_memory_size
+
+ @property
+ def semi_memory_size(self):
+ return self.memory_dim
+
+ def forward(self, embeddings, npc_previous_prim_actions, agent_actions, memory, npc_previous_utterance_actions=None):
+
+
+ npc_previous_prim_actions_OH = F.one_hot(npc_previous_prim_actions, self.num_npc_prim_actions)
+
+ if self.utterance_moa:
+ npc_previous_utterance_actions_OH = F.one_hot(
+ npc_previous_utterance_actions,
+ self.num_npc_utterance_actions
+ )
+
+ # is_agent_speaking = self.acmodel.is_raw_action_speaking(agent_action[None, :])
+ # assert len(is_agent_speaking) == 1
+ # is_agent_speaking = is_agent_speaking[0]
+ # encode the agent's action
+
+ is_agent_speaking = self.acmodel.is_raw_action_speaking(agent_actions)
+
+ # prim_action_OH_ = prim_action_OH[None, :].repeat([len(npc_previous_actions_OH), 1])
+ # template_OH_ = template_OH[None, :].repeat([len(npc_previous_actions_OH), 1])
+ # word_OH_ = word_OH[None, :].repeat([len(npc_previous_actions_OH), 1])
+
+ prim_action_OH = F.one_hot(agent_actions[:, 0], self.acmodel.model_raw_action_space.nvec[0])
+ template_OH = F.one_hot(agent_actions[:, 2], self.acmodel.model_raw_action_space.nvec[2])
+ word_OH = F.one_hot(agent_actions[:, 3], self.acmodel.model_raw_action_space.nvec[3])
+
+ # if not speaking make the templates 0
+ template_OH = template_OH * is_agent_speaking[:, None]
+ word_OH = word_OH * is_agent_speaking[:, None]
+
+ if self.utterance_moa:
+ inputs = torch.cat((
+ embeddings, # obs
+ npc_previous_prim_actions_OH, # npc
+ npc_previous_utterance_actions_OH,
+ prim_action_OH, template_OH, word_OH # agent
+ ), dim=1).float()
+
+ else:
+ inputs = torch.cat((
+ embeddings, # obs
+ npc_previous_prim_actions_OH, # npc
+ prim_action_OH, template_OH, word_OH # agent
+ ), dim=1).float()
+
+ outs_1 = self.forward_dynamics(inputs)
+
+ # LSTM
+ hidden = (memory[:, :self.semi_memory_size], memory[:, self.semi_memory_size:])
+ hidden = self.memory_rnn(outs_1, hidden)
+
+ embedding = hidden[0]
+
+ memory = torch.cat(hidden, dim=1)
+
+ outs_prim = self.fd_out_prim(embedding)
+
+ if self.num_npc_utterance_actions:
+ outs_utt = self.fd_out_utt(embedding)
+
+ # cartesian product
+ # outs = torch.bmm(outs_prim.unsqueeze(2), outs_utt.unsqueeze(1)).reshape(-1, self.num_npc_prim_actions*self.num_npc_utterance_actions)
+
+ # outer sum
+ outs = (outs_prim[..., None] + outs_utt[..., None, :]).reshape(-1, self.num_npc_prim_actions*self.num_npc_utterance_actions)
+ else:
+ outs = outs_prim
+
+ return outs, memory
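
The dynamics heads above are the building blocks of the RIDE bonus: the forward model predicts the next state embedding from the current embedding and the action, while the inverse model predicts the action from two consecutive embeddings. A minimal sketch on random data (not part of the patch), following the `[T, B, ...]` layout used throughout this file; the sizes are illustrative:

```python
import torch
from torch_ac.intrinsic_reward_models import (
    MinigridStateEmbeddingNet, MinigridForwardDynamicsNet, MinigridInverseDynamicsNet,
    compute_forward_dynamics_loss, compute_inverse_dynamics_loss)

T, B, n_actions = 8, 4, 7
embed = MinigridStateEmbeddingNet((7, 7, 3))
fwd = MinigridForwardDynamicsNet(n_actions)
inv = MinigridInverseDynamicsNet(n_actions)

obs = torch.randint(0, 11, (T + 1, B, 7, 7, 3)).float()   # T+1 observations
actions = torch.randint(0, n_actions, (T, B))              # primitive actions only

state_emb = embed(obs[:-1])        # [T, B, 128]
next_state_emb = embed(obs[1:])    # [T, B, 128]

fwd_loss = compute_forward_dynamics_loss(fwd(state_emb, actions), next_state_emb)
inv_loss = compute_inverse_dynamics_loss(inv(state_emb, next_state_emb), actions)
(fwd_loss + inv_loss).backward()
```
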
diff --git a/torch-ac/torch_ac/model.py b/torch-ac/torch_ac/model.py
new file mode 100644
index 0000000000000000000000000000000000000000..6a6351e9e581dca3ce3c6b164c7e3ec291c20cf7
--- /dev/null
+++ b/torch-ac/torch_ac/model.py
@@ -0,0 +1,26 @@
+from abc import abstractmethod, abstractproperty
+import torch.nn as nn
+import torch.nn.functional as F
+
+class ACModel:
+ recurrent = False
+
+ @abstractmethod
+ def __init__(self, obs_space, action_space):
+ pass
+
+ @abstractmethod
+ def forward(self, obs):
+ pass
+
+class RecurrentACModel(ACModel):
+ recurrent = True
+
+ @abstractmethod
+ def forward(self, obs, memory):
+ pass
+
+ @property
+ @abstractmethod
+ def memory_size(self):
+ pass
\ No newline at end of file
diff --git a/torch-ac/torch_ac/utils/__init__.py b/torch-ac/torch_ac/utils/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..56c042970a9c3306139e2f1e9db0b79cacc7760f
--- /dev/null
+++ b/torch-ac/torch_ac/utils/__init__.py
@@ -0,0 +1,2 @@
+from torch_ac.utils.dictlist import DictList
+from torch_ac.utils.penv import ParallelEnv
\ No newline at end of file
diff --git a/torch-ac/torch_ac/utils/dictlist.py b/torch-ac/torch_ac/utils/dictlist.py
new file mode 100644
index 0000000000000000000000000000000000000000..326be592677553fc012f29e2ad8069e40c5e40e5
--- /dev/null
+++ b/torch-ac/torch_ac/utils/dictlist.py
@@ -0,0 +1,24 @@
+class DictList(dict):
+ """A dictionnary of lists of same size. Dictionnary items can be
+ accessed using `.` notation and list items using `[]` notation.
+
+ Example:
+ >>> d = DictList({"a": [[1, 2], [3, 4]], "b": [[5], [6]]})
+ >>> d.a
+ [[1, 2], [3, 4]]
+ >>> d[0]
+ DictList({"a": [1, 2], "b": [5]})
+ """
+
+ __getattr__ = dict.__getitem__
+ __setattr__ = dict.__setitem__
+
+ def __len__(self):
+ return len(next(iter(dict.values(self))))
+
+ def __getitem__(self, index):
+ return DictList({key: value[index] for key, value in dict.items(self)})
+
+ def __setitem__(self, index, d):
+ for key, value in d.items():
+ dict.__getitem__(self, key)[index] = value
\ No newline at end of file
diff --git a/torch-ac/torch_ac/utils/penv.py b/torch-ac/torch_ac/utils/penv.py
new file mode 100644
index 0000000000000000000000000000000000000000..e92891cb2138265e8b8135f1fc444529aefde0e5
--- /dev/null
+++ b/torch-ac/torch_ac/utils/penv.py
@@ -0,0 +1,74 @@
+from multiprocessing import Process, Pipe
+import gym
+
+def worker(conn, env):
+ while True:
+ cmd, data = conn.recv()
+ if cmd == "step":
+ obs, reward, done, info = env.step(data)
+ if done:
+ obs = env.reset()
+ conn.send((obs, reward, done, info))
+ elif cmd == "set_curriculum_parameters":
+ env.set_curriculum_parameters(data)
+ conn.send(None)
+ elif cmd == "reset":
+ obs = env.reset()
+ conn.send(obs)
+ elif cmd == "get_mission":
+ ks = env.get_mission()
+ conn.send(ks)
+ else:
+ raise NotImplementedError
+
+class ParallelEnv(gym.Env):
+ """A concurrent execution of environments in multiple processes."""
+
+ def __init__(self, envs):
+ assert len(envs) >= 1, "No environment given."
+
+ self.envs = envs
+ self.observation_space = self.envs[0].observation_space
+ self.action_space = self.envs[0].action_space
+
+ if hasattr(self.envs[0], "curriculum"):
+ self.curriculum = self.envs[0].curriculum
+
+ self.locals = []
+ for env in self.envs[1:]:
+ local, remote = Pipe()
+ self.locals.append(local)
+ p = Process(target=worker, args=(remote, env))
+ p.daemon = True
+ p.start()
+ remote.close()
+
+ def broadcast_curriculum_parameters(self, data):
+ # broadcast curriculum_data to every worker
+ for local in self.locals:
+ local.send(("set_curriculum_parameters", data))
+ results = [self.envs[0].set_curriculum_parameters(data)] + [local.recv() for local in self.locals]
+
+ def get_mission(self):
+ for local in self.locals:
+ local.send(("get_mission", None))
+ results = [self.envs[0].get_mission()] + [local.recv() for local in self.locals]
+ return results
+
+ def reset(self):
+ for local in self.locals:
+ local.send(("reset", None))
+ results = [self.envs[0].reset()] + [local.recv() for local in self.locals]
+ return results
+
+ def step(self, actions):
+ for local, action in zip(self.locals, actions[1:]):
+ local.send(("step", action))
+ obs, reward, done, info = self.envs[0].step(actions[0])
+ if done:
+ obs = self.envs[0].reset()
+ results = zip(*[(obs, reward, done, info)] + [local.recv() for local in self.locals])
+ return results
+
+ def render(self):
+ raise NotImplementedError
\ No newline at end of file
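
`ParallelEnv` keeps the first environment in the main process and drives the remaining ones through worker processes connected by pipes, so `reset` and `step` take and return one entry per process. A hedged usage sketch (not part of the patch; the environment id is a placeholder, and it assumes `gym-minigrid` registers its environments on import):

```python
import gym
import gym_minigrid  # registers the MiniGrid environments
from torch_ac.utils.penv import ParallelEnv

if __name__ == "__main__":  # guard needed on platforms that spawn worker processes
    envs = [gym.make("MiniGrid-Empty-5x5-v0") for _ in range(4)]
    penv = ParallelEnv(envs)

    obss = penv.reset()                                   # one observation per process
    actions = [env.action_space.sample() for env in envs]
    obss, rewards, dones, infos = penv.step(actions)      # one entry per process each
```
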
diff --git a/utils/__init__.py b/utils/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..39fb50d015dcc46d508fd0b8079818b6097f3738
--- /dev/null
+++ b/utils/__init__.py
@@ -0,0 +1,6 @@
+from .agent import *
+from .env import *
+from .format import *
+from .other import *
+from .storage import *
+from .tester import *
diff --git a/utils/agent.py b/utils/agent.py
new file mode 100644
index 0000000000000000000000000000000000000000..6cd19035d9f5ca4acf6d9067937579ff61278ab5
--- /dev/null
+++ b/utils/agent.py
@@ -0,0 +1,63 @@
+import torch
+
+import utils
+from models import *
+
+
+class Agent:
+ """An agent.
+
+ It is able:
+ - to choose an action given an observation,
+ - to analyze the feedback (i.e. reward and done state) of its action."""
+
+ def __init__(self, obs_space, action_space, model_dir,
+ device=None, argmax=False, num_envs=1, use_memory=False, use_text=False, use_dialogue=False, agent_class=ACModel):
+ obs_space, self.preprocess_obss = utils.get_obss_preprocessor(obs_space)
+
+ self.acmodel = agent_class(obs_space, action_space, use_memory=use_memory, use_text=use_text, use_dialogue=use_dialogue)
+
+ self.device = device
+ self.argmax = argmax
+ self.num_envs = num_envs
+
+ if self.acmodel.recurrent:
+ self.memories = torch.zeros(self.num_envs, self.acmodel.memory_size, device=self.device)
+
+ self.acmodel.load_state_dict(utils.get_model_state(model_dir))
+ self.acmodel.to(self.device)
+ self.acmodel.eval()
+ if hasattr(self.preprocess_obss, "vocab"):
+ self.preprocess_obss.vocab.load_vocab(utils.get_vocab(model_dir))
+
+ def get_actions(self, obss):
+ preprocessed_obss = self.preprocess_obss(obss, device=self.device)
+
+ with torch.no_grad():
+ if self.acmodel.recurrent:
+ dist, _, self.memories = self.acmodel(preprocessed_obss, self.memories)
+ else:
+ dist, _ = self.acmodel(preprocessed_obss)
+
+ if isinstance(dist, torch.distributions.Distribution):
+ if self.argmax:
+ actions = dist.probs.max(1, keepdim=True)[1]
+ else:
+ actions = dist.sample()
+ else:
+ if self.argmax:
+ actions = torch.stack([d.probs.max(1)[1] for d in dist], dim=1)
+ else:
+ actions = torch.stack([d.sample() for d in dist], dim=1)
+ return self.acmodel.construct_final_action(actions.cpu().numpy())
+
+ def get_action(self, obs):
+ return self.get_actions([obs])[0]
+
+ def analyze_feedbacks(self, rewards, dones):
+ if self.acmodel.recurrent:
+ masks = 1 - torch.tensor(dones, dtype=torch.float, device=self.device).unsqueeze(1)
+ self.memories *= masks
+
+ def analyze_feedback(self, reward, done):
+ return self.analyze_feedbacks([reward], [done])
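
`Agent` is the inference-time wrapper used by the evaluation and visualization scripts: it loads the model state from a storage directory, samples (or argmaxes) actions from the policy distribution(s), and clears its recurrent memory whenever an episode ends. A hedged sketch of the evaluation loop follows; the environment id, model directory, and flags are placeholders and must match how the model was actually trained:

```python
import utils
from utils.agent import Agent

env = utils.make_env("MiniGrid-DoorKey-5x5-v0", seed=0)
agent = Agent(env.observation_space, env.action_space,
              model_dir=utils.get_model_dir("DoorKey"),   # expects a trained model here
              argmax=True, use_memory=True)

obs, done = env.reset(), False
while not done:
    action = agent.get_action(obs)
    obs, reward, done, _ = env.step(action)
    agent.analyze_feedback(reward, done)   # resets the memory when the episode ends
```
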
diff --git a/utils/babyai_utils/baby_agent.py b/utils/babyai_utils/baby_agent.py
new file mode 100644
index 0000000000000000000000000000000000000000..a0846b0896fd73b407de104a3a68022417c686cb
--- /dev/null
+++ b/utils/babyai_utils/baby_agent.py
@@ -0,0 +1,116 @@
+from abc import ABC, abstractmethod
+import json
+import torch
+from .. import utils
+#from random import Random
+
+
+class Agent(ABC):
+ """An abstraction of the behavior of an agent. The agent is able:
+ - to choose an action given an observation,
+ - to analyze the feedback (i.e. reward and done state) of its action."""
+
+ def on_reset(self):
+ pass
+
+ @abstractmethod
+ def get_action(self, obs):
+ """Propose an action based on observation.
+
+ Returns a dict, with an 'action' entry containing the proposed action,
+ and optionally other entries containing auxiliary information
+ (e.g. value function).
+
+ """
+ pass
+
+ @abstractmethod
+ def analyze_feedback(self, reward, done):
+ pass
+
+
+class ModelAgent(Agent):
+ """A model-based agent. This agent behaves using a model."""
+
+ def __init__(self, model_dir, obss_preprocessor, argmax, num_frames=None):
+ if obss_preprocessor is None:
+ assert isinstance(model_dir, str)
+ obss_preprocessor = utils.ObssPreprocessor(model_dir, num_frames)
+ self.obss_preprocessor = obss_preprocessor
+ if isinstance(model_dir, str):
+ self.model = utils.load_model(model_dir, num_frames)
+ if torch.cuda.is_available():
+ self.model.cuda()
+ else:
+ self.model = model_dir
+ self.device = next(self.model.parameters()).device
+ self.argmax = argmax
+ self.memory = None
+
+ def random_act_batch(self, many_obs):
+ if self.memory is None:
+ self.memory = torch.zeros(
+ len(many_obs), self.model.memory_size, device=self.device)
+ elif self.memory.shape[0] != len(many_obs):
+ raise ValueError("stick to one batch size for the lifetime of an agent")
+ preprocessed_obs = self.obss_preprocessor(many_obs, device=self.device)
+
+ with torch.no_grad():
+ raw_action = self.model.model_raw_action_space.sample()
+ action = self.model.construct_final_action(raw_action[None, :])
+
+ return action[0]
+
+ def act_batch(self, many_obs):
+ if self.memory is None:
+ self.memory = torch.zeros(
+ len(many_obs), self.model.memory_size, device=self.device)
+ elif self.memory.shape[0] != len(many_obs):
+ raise ValueError("stick to one batch size for the lifetime of an agent")
+ preprocessed_obs = self.obss_preprocessor(many_obs, device=self.device)
+
+ with torch.no_grad():
+ dist, value, self.memory = self.model(preprocessed_obs, self.memory)
+ if self.argmax:
+ action = torch.stack([d.probs.argmax() for d in dist])[None, :]
+ else:
+ action = self.model.sample_action(dist)
+
+ action = self.model.construct_final_action(action.cpu().numpy())
+
+ return action[0]
+
+ def get_action(self, obs):
+ return self.act_batch([obs])
+
+ def get_random_action(self, obs):
+ return self.random_act_batch([obs])
+
+ def analyze_feedback(self, reward, done):
+ if isinstance(done, tuple):
+ for i in range(len(done)):
+ if done[i]:
+ self.memory[i, :] *= 0.
+ else:
+ self.memory *= (1 - done)
+
+def load_agent(env, model_name, argmax=False, num_frames=None):
+ # env_name needs to be specified for demo agents
+ if model_name is not None:
+
+ with open(model_name + "/config.json") as f:
+ conf = json.load(f)
+ text = conf['use_text']
+ curr_dial = conf.get('use_current_dialogue_only', False)
+ dial_hist = conf['use_dialogue']
+
+ _, preprocess_obss = utils.get_obss_preprocessor(
+ obs_space=env.observation_space,
+ text=text,
+ dialogue_current=curr_dial,
+ dialogue_history=dial_hist
+ )
+ vocab = utils.get_status(model_name, num_frames)["vocab"]
+ preprocess_obss.vocab.load_vocab(vocab)
+ print("loaded vocabulary:", vocab.keys())
+ return ModelAgent(model_name, preprocess_obss, argmax, num_frames)
diff --git a/utils/babyai_utils/supervised_losses.py b/utils/babyai_utils/supervised_losses.py
new file mode 100644
index 0000000000000000000000000000000000000000..2ed52c2c500549adc4f9b85fd590390b870eec7b
--- /dev/null
+++ b/utils/babyai_utils/supervised_losses.py
@@ -0,0 +1,177 @@
+import torch
+
+import torch.nn.functional as F
+import numpy
+from torch_ac.utils import DictList
+
+# dictionary that defines what head is required for each extra info used for auxiliary supervision
+required_heads = {'seen_state': 'binary',
+ 'see_door': 'binary',
+ 'see_obj': 'binary',
+ 'obj_in_instr': 'binary',
+ 'in_front_of_what': 'multiclass9', # multi-class classifier with 9 possible classes
+ 'visit_proportion': 'continuous01', # continuous regressor with outputs in [0, 1]
+ 'bot_action': 'binary'
+ }
+
+class ExtraInfoCollector:
+ '''
+ This class, used in rl.algos.base, connects the extra information coming from the environment with the
+ corresponding predictions made by the specific heads in the model. It transforms both so that they are
+ easy to use when evaluating losses.
+ '''
+ def __init__(self, aux_info, shape, device):
+ self.aux_info = aux_info
+ self.shape = shape
+ self.device = device
+
+ self.collected_info = dict()
+ self.extra_predictions = dict()
+ for info in self.aux_info:
+ self.collected_info[info] = torch.zeros(*shape, device=self.device)
+ if required_heads[info] == 'binary' or required_heads[info].startswith('continuous'):
+ # we predict one number only
+ self.extra_predictions[info] = torch.zeros(*shape, 1, device=self.device)
+ elif required_heads[info].startswith('multiclass'):
+ # means that this is a multi-class classification and we need to predict the whole proba distr
+ n_classes = int(required_heads[info].replace('multiclass', ''))
+ self.extra_predictions[info] = torch.zeros(*shape, n_classes, device=self.device)
+ else:
+ raise ValueError("{} not supported".format(required_heads[info]))
+
+ def process(self, env_info):
+ # env_info is now a tuple of dicts
+ env_info = [{k: v for k, v in dic.items() if k in self.aux_info} for dic in env_info]
+ env_info = {k: [env_info[_][k] for _ in range(len(env_info))] for k in env_info[0].keys()}
+ # env_info is now a dict of lists
+ return env_info
+
+ def fill_dictionaries(self, index, env_info, extra_predictions):
+ for info in self.aux_info:
+ dtype = torch.long if required_heads[info].startswith('multiclass') else torch.float
+ self.collected_info[info][index] = torch.tensor(env_info[info], dtype=dtype, device=self.device)
+ self.extra_predictions[info][index] = extra_predictions[info]
+
+ def end_collection(self, exps):
+ collected_info = dict()
+ extra_predictions = dict()
+ for info in self.aux_info:
+ # T x P -> P x T -> P * T
+ collected_info[info] = self.collected_info[info].transpose(0, 1).reshape(-1)
+ if required_heads[info] == 'binary' or required_heads[info].startswith('continuous'):
+ # T x P x 1 -> P x T x 1 -> P * T
+ extra_predictions[info] = self.extra_predictions[info].transpose(0, 1).reshape(-1)
+ elif type(required_heads[info]) == int:
+ # T x P x k -> P x T x k -> (P * T) x k
+ k = required_heads[info] # number of classes
+ extra_predictions[info] = self.extra_predictions[info].transpose(0, 1).reshape(-1, k)
+ # convert the dicts to DictLists, and add them to the exps DictList.
+ exps.collected_info = DictList(collected_info)
+ exps.extra_predictions = DictList(extra_predictions)
+
+ return exps
+
+
+class SupervisedLossUpdater:
+ '''
+ This class, used by PPO, allows the evaluation of the supervised loss when using extra information from the
+ environment. It also handles logging accuracies/L2 distances/etc...
+ '''
+ def __init__(self, aux_info, supervised_loss_coef, recurrence, device):
+ self.aux_info = aux_info
+ self.supervised_loss_coef = supervised_loss_coef
+ self.recurrence = recurrence
+ self.device = device
+
+ self.log_supervised_losses = []
+ self.log_supervised_accuracies = []
+ self.log_supervised_L2_losses = []
+ self.log_supervised_prevalences = []
+
+ self.batch_supervised_loss = 0
+ self.batch_supervised_accuracy = 0
+ self.batch_supervised_L2_loss = 0
+ self.batch_supervised_prevalence = 0
+
+ def init_epoch(self):
+ self.log_supervised_losses = []
+ self.log_supervised_accuracies = []
+ self.log_supervised_L2_losses = []
+ self.log_supervised_prevalences = []
+
+ def init_batch(self):
+ self.batch_supervised_loss = 0
+ self.batch_supervised_accuracy = 0
+ self.batch_supervised_L2_loss = 0
+ self.batch_supervised_prevalence = 0
+
+ def eval_subbatch(self, extra_predictions, sb):
+ supervised_loss = torch.tensor(0., device=self.device)
+ supervised_accuracy = torch.tensor(0., device=self.device)
+ supervised_L2_loss = torch.tensor(0., device=self.device)
+ supervised_prevalence = torch.tensor(0., device=self.device)
+
+ binary_classification_tasks = 0
+ classification_tasks = 0
+ regression_tasks = 0
+
+ for pos, info in enumerate(self.aux_info):
+ coef = self.supervised_loss_coef[pos]
+ pred = extra_predictions[info]
+ target = dict.__getitem__(sb.collected_info, info)
+ if required_heads[info] == 'binary':
+ binary_classification_tasks += 1
+ classification_tasks += 1
+ supervised_loss += coef * F.binary_cross_entropy_with_logits(pred.reshape(-1), target)
+ supervised_accuracy += ((pred.reshape(-1) > 0).float() == target).float().mean()
+ supervised_prevalence += target.mean()
+ elif required_heads[info].startswith('continuous'):
+ regression_tasks += 1
+ mse = F.mse_loss(pred.reshape(-1), target)
+ supervised_loss += coef * mse
+ supervised_L2_loss += mse
+ elif required_heads[info].startswith('multiclass'):
+ classification_tasks += 1
+ supervised_accuracy += (pred.argmax(1).float() == target).float().mean()
+ supervised_loss += coef * F.cross_entropy(pred, target.long())
+ else:
+ raise ValueError("{} not supported".format(required_heads[info]))
+ if binary_classification_tasks > 0:
+ supervised_prevalence /= binary_classification_tasks
+ else:
+ supervised_prevalence = torch.tensor(-1)
+ if classification_tasks > 0:
+ supervised_accuracy /= classification_tasks
+ else:
+ supervised_accuracy = torch.tensor(-1)
+ if regression_tasks > 0:
+ supervised_L2_loss /= regression_tasks
+ else:
+ supervised_L2_loss = torch.tensor(-1)
+
+ self.batch_supervised_loss += supervised_loss.item()
+ self.batch_supervised_accuracy += supervised_accuracy.item()
+ self.batch_supervised_L2_loss += supervised_L2_loss.item()
+ self.batch_supervised_prevalence += supervised_prevalence.item()
+
+ return supervised_loss
+
+ def update_batch_values(self):
+ self.batch_supervised_loss /= self.recurrence
+ self.batch_supervised_accuracy /= self.recurrence
+ self.batch_supervised_L2_loss /= self.recurrence
+ self.batch_supervised_prevalence /= self.recurrence
+
+ def update_epoch_logs(self):
+ self.log_supervised_losses.append(self.batch_supervised_loss)
+ self.log_supervised_accuracies.append(self.batch_supervised_accuracy)
+ self.log_supervised_L2_losses.append(self.batch_supervised_L2_loss)
+ self.log_supervised_prevalences.append(self.batch_supervised_prevalence)
+
+ def end_training(self, logs):
+ logs["supervised_loss"] = numpy.mean(self.log_supervised_losses)
+ logs["supervised_accuracy"] = numpy.mean(self.log_supervised_accuracies)
+ logs["supervised_L2_loss"] = numpy.mean(self.log_supervised_L2_losses)
+ logs["supervised_prevalence"] = numpy.mean(self.log_supervised_prevalences)
+
+ return logs
diff --git a/utils/env.py b/utils/env.py
new file mode 100644
index 0000000000000000000000000000000000000000..4f4fd5f0f995c733a789727bd869e4a8e4239829
--- /dev/null
+++ b/utils/env.py
@@ -0,0 +1,16 @@
+import gym
+import gym_minigrid
+
+
+def make_env(env_key, seed=None, env_args={}):
+ env = gym.make(env_key, **env_args)
+ env.seed(seed)
+ return env
+
+
+def env_args_str_to_dict(env_args_str):
+ if not env_args_str:
+ return {}
+ keys = env_args_str[::2] # Every even element is a key
+ vals = env_args_str[1::2] # Every odd element is a value
+ return dict(zip(keys, [eval(v) for v in vals]))
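
`env_args_str_to_dict` turns a flat key/value list (as typically passed on the command line) into keyword arguments for `gym.make`, evaluating every value with `eval`. For example (the argument names below are illustrative):

```python
from utils.env import env_args_str_to_dict

env_args_str_to_dict(["size", "8", "max_steps", "50"])
# -> {'size': 8, 'max_steps': 50}
```
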
diff --git a/utils/format.py b/utils/format.py
new file mode 100644
index 0000000000000000000000000000000000000000..dcf9d0741e618b484d7e18ebfccf7180c633cda4
--- /dev/null
+++ b/utils/format.py
@@ -0,0 +1,144 @@
+import os
+import json
+import numpy
+import re
+import torch
+import torch_ac
+import gym
+
+import utils
+
+
+def get_obss_preprocessor(obs_space, text=None, dialogue_current=None, dialogue_history=None, custom_image_preprocessor=None, custom_image_space_preprocessor=None):
+ # Check if obs_space is an image space
+ if isinstance(obs_space, gym.spaces.Box):
+ obs_space = {"image": obs_space.shape}
+
+ def preprocess_obss(obss, device=None):
+ assert custom_image_preprocessor is None
+ return torch_ac.DictList({
+ "image": preprocess_images(obss, device=device)
+ })
+
+ # Check if it is a MiniGrid observation space
+ elif isinstance(obs_space, gym.spaces.Dict) and list(obs_space.spaces.keys()) == ["image"]:
+
+ assert (custom_image_preprocessor is None) == (custom_image_space_preprocessor is None)
+
+ image_obs_space = obs_space.spaces["image"].shape
+
+ if custom_image_preprocessor:
+ image_obs_space = custom_image_space_preprocessor(image_obs_space)
+
+ obs_space = {"image": image_obs_space, "text": 100}
+
+ # must be specified in this case
+ if text is None:
+ raise ValueError("text argument must be specified.")
+ if dialogue_current is None:
+ raise ValueError("dialogue current argument must be specified.")
+ if dialogue_history is None:
+ raise ValueError("dialogue history argument must be specified.")
+
+ vocab = Vocabulary(obs_space["text"])
+ def preprocess_obss(obss, device=None):
+ if custom_image_preprocessor is None:
+ D = {
+ "image": preprocess_images([obs["image"] for obs in obss], device=device)
+ }
+ else:
+ D = {
+ "image": custom_image_preprocessor([obs["image"] for obs in obss], device=device)
+ }
+
+ if dialogue_current:
+ D["utterance"] = preprocess_texts([obs["utterance"] for obs in obss], vocab, device=device)
+
+ if dialogue_history:
+ D["utterance_history"] = preprocess_texts([obs["utterance_history"] for obs in obss], vocab, device=device)
+
+ if text:
+ D["text"] = preprocess_texts([obs["mission"] for obs in obss], vocab, device=device)
+
+
+ return torch_ac.DictList(D)
+
+ preprocess_obss.vocab = vocab
+
+ else:
+ raise ValueError("Unknown observation space: " + str(obs_space))
+
+ return obs_space, preprocess_obss
+
+def ride_ref_image_space_preprocessor(image_space):
+ return image_space
+
+def ride_ref_image_preprocessor(images, device=None):
+ # PyTorch quirk: building the tensor is very slow if the input is not first converted to a numpy array
+
+ images = numpy.array(images)
+
+ # grid dimensions
+ size = images.shape[1]
+ assert size == images.shape[2]
+
+ # assert that channels 1 and 2 are absolute coordinates
+ # assert images[:,:,:,1].max() <= size
+ # assert images[:,:,:,2].max() <= size
+ assert images[:,:,:,1].min() >= 0
+ assert images[:,:,:,2].min() >= 0
+ #
+ # # 0, 1, 2 -> door state
+ # assert all([e in set([0, 1, 2]) for e in numpy.unique(images[:, :, :, 4].reshape(-1))])
+ #
+ # only keep the (obj id, colors, state) -> multiply others by 0
+ # print(images[:, :, :, 1].max())
+
+ images[:, :, :, 1] *= 0
+ images[:, :, :, 2] *= 0
+
+ assert images.shape[-1] == 5
+
+ return torch.tensor(images, device=device, dtype=torch.float)
+
+def preprocess_images(images, device=None):
+ # PyTorch quirk: building the tensor is very slow if the input is not first converted to a numpy array
+ images = numpy.array(images)
+ return torch.tensor(images, device=device, dtype=torch.float)
+
+
+def preprocess_texts(texts, vocab, device=None):
+ var_indexed_texts = []
+ max_text_len = 0
+
+ for text in texts:
+ tokens = re.findall("([a-z]+)", text.lower())
+ var_indexed_text = numpy.array([vocab[token] for token in tokens])
+ var_indexed_texts.append(var_indexed_text)
+ max_text_len = max(len(var_indexed_text), max_text_len)
+
+ indexed_texts = numpy.zeros((len(texts), max_text_len))
+
+ for i, indexed_text in enumerate(var_indexed_texts):
+ indexed_texts[i, :len(indexed_text)] = indexed_text
+
+ return torch.tensor(indexed_texts, device=device, dtype=torch.long)
+
+
+class Vocabulary:
+ """A mapping from tokens to ids with a capacity of `max_size` words.
+ It can be saved in a `vocab.json` file."""
+
+ def __init__(self, max_size):
+ self.max_size = max_size
+ self.vocab = {}
+
+ def load_vocab(self, vocab):
+ self.vocab = vocab
+
+ def __getitem__(self, token):
+ if not token in self.vocab.keys():
+ if len(self.vocab) >= self.max_size:
+ raise ValueError("Maximum vocabulary capacity reached")
+ self.vocab[token] = len(self.vocab) + 1
+ return self.vocab[token]
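
The text pipeline above tokenizes with a lowercase `[a-z]+` regex, grows the vocabulary on the fly (ids start at 1, with 0 used for padding), and right-pads every batch to its longest sentence. A small self-contained example (not part of the patch), assuming `utils.format` is importable:

```python
from utils.format import Vocabulary, preprocess_texts

vocab = Vocabulary(max_size=100)
batch = preprocess_texts(["go to the red door", "open the door"], vocab)
print(batch)
# tensor([[1, 2, 3, 4, 5],
#         [6, 3, 5, 0, 0]])
```
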
diff --git a/utils/multimodalutils.py b/utils/multimodalutils.py
new file mode 100644
index 0000000000000000000000000000000000000000..2fbde04f41f6be2652bc59004a8ef17871a82768
--- /dev/null
+++ b/utils/multimodalutils.py
@@ -0,0 +1,17 @@
+from enum import IntEnum
+import numpy as np
+import gym.spaces as spaces
+import torch
+
+raise DeprecationWarning("Do not use this. The grammar is defined in the env class; SocialAIGrammar is in socialaigrammar.py")
+
+# class Grammar(object):
+#
+# templates = ["Where is ", "Who is"]
+# things = ["me", "exit", "you", "him", "task"]
+#
+# grammar_action_space = spaces.MultiDiscrete([len(templates), len(things)])
+#
+# @classmethod
+# def construct_utterance(cls, action):
+# return cls.templates[int(action[0])] + " " + cls.things[int(action[1])] + ". "
diff --git a/utils/other.py b/utils/other.py
new file mode 100644
index 0000000000000000000000000000000000000000..7c05f793ef3b2aa11af255a34f85047f90ae96dd
--- /dev/null
+++ b/utils/other.py
@@ -0,0 +1,33 @@
+import random
+import numpy
+import torch
+import collections
+
+
+def seed(seed):
+ random.seed(seed)
+ numpy.random.seed(seed)
+ torch.manual_seed(seed)
+ if torch.cuda.is_available():
+ torch.cuda.manual_seed_all(seed)
+
+
+def synthesize(array):
+ d = collections.OrderedDict()
+ d["mean"] = numpy.mean(array)
+ d["std"] = numpy.std(array)
+ d["min"] = numpy.amin(array)
+ d["max"] = numpy.amax(array)
+ return d
+
+# Function from https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/blob/master/model.py
+def init_params(m):
+ classname = m.__class__.__name__
+ if classname.find("Linear") != -1:
+ m.weight.data.normal_(0, 1)
+ m.weight.data *= 1 / torch.sqrt(m.weight.data.pow(2).sum(1, keepdim=True))
+ if m.bias is not None:
+ m.bias.data.fill_(0)
+
+
+
diff --git a/utils/storage.py b/utils/storage.py
new file mode 100644
index 0000000000000000000000000000000000000000..5a816844413364665a6d7a9f7186e2e7ca21464e
--- /dev/null
+++ b/utils/storage.py
@@ -0,0 +1,100 @@
+import csv
+import os
+import torch
+import logging
+import sys
+from pathlib import Path
+
+import utils
+
+
+def create_folders_if_necessary(path):
+ dirname = os.path.dirname(path)
+ if not os.path.isdir(dirname):
+ os.makedirs(dirname)
+
+
+def get_storage_dir():
+ if "RL_STORAGE" in os.environ:
+ return os.environ["RL_STORAGE"]
+ return "storage"
+
+
+def get_model_dir(model_name):
+ return os.path.join(get_storage_dir(), model_name)
+
+
+def get_status_path(model_dir, num_frames=None):
+ if num_frames:
+ return os.path.join(model_dir, "status_{}.pt".format(num_frames))
+ return os.path.join(model_dir, "status.pt")
+
+def get_model_path(model_dir, num_frames=None):
+ if num_frames:
+ return os.path.join(model_dir, "model_{}.pt".format(num_frames))
+ return os.path.join(model_dir, "model.pt")
+
+
+def load_status(status_path):
+ return torch.load(status_path, map_location=torch.device("cuda" if torch.cuda.is_available() else "cpu"))
+
+
+def get_status(model_dir, num_frames=None):
+ path = get_status_path(model_dir, num_frames)
+ # return torch.load(path, map_location=torch.device("cuda" if torch.cuda.is_available() else "cpu"))
+ return load_status(path)
+
+
+
+def save_status(status, model_dir, num_frames=None):
+ path = get_status_path(model_dir, num_frames)
+ utils.create_folders_if_necessary(path)
+ torch.save(status, path)
+
+def save_model(model, model_dir, num_frames=None):
+ path = get_model_path(model_dir, num_frames)
+ utils.create_folders_if_necessary(path)
+ torch.save(model, path)
+
+def load_model(model_name, raise_not_found=True):
+ path = get_model_path(model_name)
+ try:
+ if torch.cuda.is_available():
+ model = torch.load(path)
+ else:
+ model = torch.load(path, map_location=torch.device("cpu"))
+ model.eval()
+ return model
+ except FileNotFoundError:
+ if raise_not_found:
+ raise FileNotFoundError("No model found at {}".format(path))
+
+def get_vocab(model_dir):
+ return get_status(model_dir)["vocab"]
+
+
+def get_model_state(model_dir):
+ return get_status(model_dir)["model_state"]
+
+
+def get_txt_logger(model_dir):
+ path = os.path.join(model_dir, "log.txt")
+ utils.create_folders_if_necessary(path)
+
+ logging.basicConfig(
+ level=logging.INFO,
+ format="%(message)s",
+ handlers=[
+ logging.FileHandler(filename=path),
+ logging.StreamHandler(sys.stdout)
+ ]
+ )
+
+ return logging.getLogger()
+
+
+def get_csv_logger(model_dir):
+ csv_path = os.path.join(model_dir, "log.csv")
+ utils.create_folders_if_necessary(csv_path)
+ csv_file = open(csv_path, "a")
+ return csv_file, csv.writer(csv_file)
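
All artifacts live under a single storage root (the `RL_STORAGE` environment variable, falling back to `storage/`), with optional per-frame snapshots of the status file. A short illustration of the path helpers (the model name is a placeholder):

```python
from utils.storage import get_model_dir, get_status_path

model_dir = get_model_dir("DoorKey")            # e.g. 'storage/DoorKey'
get_status_path(model_dir)                      # 'storage/DoorKey/status.pt'
get_status_path(model_dir, num_frames=100000)   # 'storage/DoorKey/status_100000.pt'
```
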
diff --git a/utils/tester.py b/utils/tester.py
new file mode 100644
index 0000000000000000000000000000000000000000..a3b74742f32f1b20893747c9f69af20c43b0fba0
--- /dev/null
+++ b/utils/tester.py
@@ -0,0 +1,137 @@
+import numpy as np
+import utils
+import os
+import pickle
+import torch
+
+class AgentWrap:
+ """ Handles action selection without gradient updates for proper testing """
+
+ def __init__(self, acmodel, preprocess_obss, device, num_envs=1, argmax=False):
+
+ self.preprocess_obss = preprocess_obss
+ self.acmodel = acmodel
+
+ self.device = device
+ self.argmax = argmax
+ self.num_envs = num_envs
+
+ if self.acmodel.recurrent:
+ self.memories = torch.zeros(self.num_envs, self.acmodel.memory_size, device=self.device)
+
+ def get_actions(self, obss):
+ preprocessed_obss = self.preprocess_obss(obss, device=self.device)
+
+ with torch.no_grad():
+ if self.acmodel.recurrent:
+ dist, _, self.memories = self.acmodel(preprocessed_obss, self.memories)
+ else:
+ dist, _ = self.acmodel(preprocessed_obss)
+
+ if isinstance(dist, torch.distributions.Distribution):
+ if self.argmax:
+ actions = dist.probs.max(1, keepdim=True)[1]
+ else:
+ actions = dist.sample()
+ else:
+ if self.argmax:
+ actions = torch.stack([d.probs.max(1)[1] for d in dist], dim=1)
+ else:
+ actions = torch.stack([d.sample() for d in dist], dim=1)
+ return self.acmodel.construct_final_action(actions.cpu().numpy())
+
+ def get_action(self, obs):
+ return self.get_actions([obs])[0]
+
+ def analyze_feedbacks(self, rewards, dones):
+ if self.acmodel.recurrent:
+ masks = 1 - torch.tensor(dones, dtype=torch.float, device=self.device).unsqueeze(1)
+ self.memories *= masks
+
+ def analyze_feedback(self, reward, done):
+ return self.analyze_feedbacks([reward], [done])
+
+
+class Tester:
+
+ def __init__(self, env_args, seed, episodes, save_path, acmodel, preprocess_obss, device):
+
+ self.envs = [utils.make_env(
+ **env_args
+ ) for _ in range(episodes)]
+ self.seed = seed
+ self.episodes = episodes
+ self.ep_counter = 0
+ self.savefile = save_path + "/testing_{}.pkl".format(self.envs[0].spec.id)
+ print("Testing log: ", self.savefile)
+
+ self.stats_dict = {"test_rewards": [], "test_success_rates": [], "test_step_nb": []}
+ self.agent = AgentWrap(acmodel, preprocess_obss, device)
+
+ def test_agent(self, num_frames):
+ self.agent.acmodel.eval()
+
+ # set seed
+ # self.env.seed(self.seed)
+ # save test time (nb training steps)
+ self.stats_dict['test_step_nb'].append(num_frames)
+
+ rewards = []
+ success_rates = []
+
+
+ # cols = []
+ # s = "-".join([e.current_env.marble.color for e in self.envs])
+ # print("s:", s)
+
+ for episode in range(self.episodes):
+ # self.envs[episode].seed(self.seed)
+ self.envs[episode].seed(episode)
+ # print("current_seed", np.random.get_state()[1][0])
+
+ obs = self.envs[episode].reset()
+
+ # cols.append(self.envs[episode].current_env.marble.color)
+ # cols.append(str(self.envs[episode].current_env.marble.cur_pos))
+
+ done = False
+ while not done:
+ action = self.agent.get_action(obs)
+
+ obs, reward, done, info = self.envs[episode].step(action)
+ self.agent.analyze_feedback(reward, done)
+
+ if done:
+ rewards.append(reward)
+ success_rates.append(info['success'])
+ break
+
+ # from hashlib import md5
+ # hash_string = "-".join(cols).encode()
+
+ # print('hs:', hash_string[:20])
+ # print("hash test envs:", md5(hash_string).hexdigest())
+
+ mean_rewards = np.array(rewards).mean()
+ mean_success_rates = np.array(success_rates).mean()
+
+ self.stats_dict["test_rewards"].append(mean_rewards)
+ self.stats_dict["test_success_rates"].append(mean_success_rates)
+
+ self.agent.acmodel.train()
+ return mean_success_rates, mean_rewards
+
+ def load(self):
+ if os.path.isfile(self.savefile):
+ with open(self.savefile, 'rb') as f:
+ stats_dict_loaded = pickle.load(f)
+
+ for k, v in stats_dict_loaded.items():
+ self.stats_dict[k] = v
+ else:
+ raise ValueError(f"Save file {self.savefile} doesn't exist.")
+
+ def dump(self):
+ with open(self.savefile, 'wb') as f:
+ pickle.dump(self.stats_dict, f)
+
diff --git a/visualizer.sh b/visualizer.sh
new file mode 100644
index 0000000000000000000000000000000000000000..49685fd4f28447f13d5c83b48644bf85693e4449
--- /dev/null
+++ b/visualizer.sh
@@ -0,0 +1,25 @@
+python -m scripts.visualize \
+--model 13-03_VIGIL4_WizardGuide_lang64_mm_baby_short_rec_env_MiniGrid-GoToDoorTalkHardSesameWizardGuideLang64-8x8-v0_multi-modal-babyai11-agent_arch_original_endpool_res_custom-ppo-2_exploration-bonus-params_5_50/0 \
+--episodes 3 --seed=5 --gif graphics/gifs/MH-BabyAI-EB-Ablation --pause 0.2
+python -m scripts.visualize \
+--model 13-03_VIGIL4_WizardGuide_lang64_mm_baby_short_rec_env_MiniGrid-GoToDoorTalkHardSesameWizardGuideLang64-8x8-v0_multi-modal-babyai11-agent_arch_original_endpool_res_custom-ppo-2_exploration-bonus-params_5_50/0 \
+--episodes 3 --seed=5 --gif graphics/gifs/MH-BabyAI-EB-Ablation-Deterministic --pause 0.2 --argmax
+python -m scripts.visualize \
+--model 13-03_VIGIL4_WizardTwoGuides_lang64_mm_baby_short_rec_env_MiniGrid-GoToDoorTalkHardSesameNPCGuidesLang64-8x8-v0_multi-modal-babyai11-agent_arch_original_endpool_res_custom-ppo-2_exploration-bonus-params_5_50/0 \
+--episodes 3 --seed=5 --gif graphics/gifs/MH-BabyAI-EB-Original --pause 0.2
+python -m scripts.visualize \
+--model 13-03_VIGIL4_WizardTwoGuides_lang64_mm_baby_short_rec_env_MiniGrid-GoToDoorTalkHardSesameNPCGuidesLang64-8x8-v0_multi-modal-babyai11-agent_arch_original_endpool_res_custom-ppo-2_exploration-bonus-params_5_50/0 \
+--episodes 3 --seed=5 --gif graphics/gifs/MH-BabyAI-EB-Original-Deterministic --pause 0.2 --argmax
+# no explo
+python -m scripts.visualize \
+--model 13-03_VIGIL4_WizardGuide_lang64_no_explo_mm_baby_short_rec_env_MiniGrid-GoToDoorTalkHardSesameWizardGuideLang64-8x8-v0_multi-modal-babyai11-agent_arch_original_endpool_res_custom-ppo-2/0 \
+--episodes 3 --seed=5 --gif graphics/gifs/MH-BabyAI-Ablation --pause 0.2
+python -m scripts.visualize \
+--model 13-03_VIGIL4_WizardGuide_lang64_no_explo_mm_baby_short_rec_env_MiniGrid-GoToDoorTalkHardSesameWizardGuideLang64-8x8-v0_multi-modal-babyai11-agent_arch_original_endpool_res_custom-ppo-2/0 \
+--episodes 3 --seed=5 --gif graphics/gifs/MH-BabyAI-Ablation-Deterministic --pause 0.2 --argmax
+python -m scripts.visualize \
+--model 13-03_VIGIL4_WizardTwoGuides_lang64_no_explo_mm_baby_short_rec_env_MiniGrid-GoToDoorTalkHardSesameNPCGuidesLang64-8x8-v0_multi-modal-babyai11-agent_arch_original_endpool_res_custom-ppo-2/0 \
+--episodes 3 --seed=5 --gif graphics/gifs/MH-BabyAI-Original --pause 0.2
+python -m scripts.visualize \
+--model 13-03_VIGIL4_WizardTwoGuides_lang64_no_explo_mm_baby_short_rec_env_MiniGrid-GoToDoorTalkHardSesameNPCGuidesLang64-8x8-v0_multi-modal-babyai11-agent_arch_original_endpool_res_custom-ppo-2/0 \
+--episodes 3 --seed=5 --gif graphics/gifs/MH-BabyAI-Original-Deterministic --pause 0.2 --argmax
diff --git a/web_demo/Dockerfile b/web_demo/Dockerfile
new file mode 100644
index 0000000000000000000000000000000000000000..ec09ce76073e90f6b26c1a614e1c2a049564d2fc
--- /dev/null
+++ b/web_demo/Dockerfile
@@ -0,0 +1,22 @@
+FROM python:3.7
+
+WORKDIR /code
+
+# Install graphviz
+RUN apt-get update && \
+ apt-get install -y graphviz && \
+ apt-get clean && \
+ rm -rf /var/lib/apt/lists/*
+
+COPY . .
+
+RUN pip install --upgrade -r web_demo/requirements.txt
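+# Install the local gym-minigrid package in editable mode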
+RUN pip install -e gym-minigrid
+
+#EXPOSE 7860
+
+CMD ["python", "web_demo/app.py"]
+
+
+# docker build -t sai_demo -f web_demo/Dockerfile .
+# docker run -p 7860:7860 sai_demo
diff --git a/web_demo/app.py b/web_demo/app.py
new file mode 100644
index 0000000000000000000000000000000000000000..4cba1499040b6b5c41c628bb28cd2d27fe4f19de
--- /dev/null
+++ b/web_demo/app.py
@@ -0,0 +1,183 @@
+from flask import Flask, render_template, request, session, redirect, url_for, send_from_directory, jsonify
+from PIL import Image
+import io
+import base64
+import time
+
+import gym
+import gym_minigrid
+import numpy as np
+from gym_minigrid.window import Window
+
+import os
+
+app = Flask(__name__)
+
+env_types = ["Information_seeking", "Collaboration", "AppleStealing"]
+
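+# Maps the human-readable labels shown in the demo's dropdown to registered environment ids.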
+env_label_to_env_name = {
+ "Full SocialAI environment": "SocialAI-SocialAIParamEnv-v1", # all
+ "Pointing (Train)": "SocialAI-EPointingHeldoutDoorsTrainInformationSeekingParamEnv-v1", # Pointing Train
+ "Pointing (Test)": "SocialAI-EPointingBoxesTestInformationSeekingParamEnv-v1", # Pointing Test
+ "Role Reversal Single Role B (Pretrain - experimental)": "SocialAI-MarblePassBCollaborationParamEnv-v1",
+ "Role Reversal Single Asocial (Pretrain - control)": "SocialAI-AsocialMarbleCollaborationParamEnv-v1",
+ "Role Reversal Group Role B (Pretrain - experimental)": "SocialAI-RoleReversalGroupExperimentalCollaborationParamEnv-v1",
+ "Role Reversal Group Asocial (Pretrain - control)": "SocialAI-RoleReversalGroupControlCollaborationParamEnv-v1",
+ "Role Reversal Role A (Finetune - test)": "SocialAI-MarblePassACollaborationParamEnv-v1",
+ "Imitation (Train)": "SocialAI-EEmulationNoDistrInformationSeekingParamEnv-v1",
+ "Imitation (Test)": "SocialAI-EEmulationNoDistrDoorsInformationSeekingParamEnv-v1",
+ "Language Color (Train)": "SocialAI-ELangColorHeldoutDoorsTrainInformationSeekingParamEnv-v1",
+ "Language Color (Test)": "SocialAI-ELangColorDoorsTestInformationSeekingParamEnv-v1",
+ "Language Feedback (Train)": "SocialAI-ELangFeedbackHeldoutDoorsTrainInformationSeekingParamEnv-v1",
+ "Language Feedback (Test)": "SocialAI-ELangFeedbackDoorsTestInformationSeekingParamEnv-v1",
+ "Joint Attention Language Color (Train)": "SocialAI-ELangColorHeldoutDoorsTrainInformationSeekingParamEnv-v1",
+ "Joint Attention Language Color (Test)": "SocialAI-ELangColorDoorsTestInformationSeekingParamEnv-v1",
+ "Apple stealing": "SocialAI-AppleStealingObst_NoParamEnv-v1",
+ "Apple stealing (Occlusions)": "SocialAI-AppleStealingObst_MediumParamEnv-v1",
+ "AsocialBox (textworld)": "SocialAI-AsocialBoxInformationSeekingParamEnv-v1",
+ "ColorBoxes (textworld)": "SocialAI-ColorBoxesLLMCSParamEnv-v1",
+ "Scaffolding (train - scaf_8: Phase 1)": "SocialAI-AELangFeedbackTrainScaffoldingCSParamEnv-v1",
+ "Scaffolding/Formats (test)":"SocialAI-AELangFeedbackTrainFormatsCSParamEnv-v1",
+}
+
+# env = gym.make(args.env, **env_args_str_to_dict(args.env_args))
+global env_name
+global env_label
+env_label = list(env_label_to_env_name.keys())[0]
+env_name = env_label_to_env_name[env_label]
+
+global mask_unobserved
+mask_unobserved = False
+
+env = gym.make(env_name)
+
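+# Re-draws the parameter sampling tree: the currently sampled parameters are highlighted
+# and the branches of the other environment types are folded.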
+def update_tree():
+ selected_parameters = env.current_env.parameters
+ selected_env_type = selected_parameters["Env_type"]
+
+ assert selected_env_type in env_types, f"Env_type {selected_env_type} not in {env_types}"
+
+ folded_nodes = [e for e in env_types if e != selected_env_type]
+
+ env.parameter_tree.draw_tree(
+ filename="./web_demo/static/current_tree",
+ ignore_labels=["Num_of_colors"],
+ selected_parameters=selected_parameters,
+ folded_nodes=folded_nodes
+ )
+
+update_tree()
+
+
+def np_img_to_base64(np_image):
+ image = Image.fromarray(np_image)
+ img_io = io.BytesIO()
+ image.save(img_io, 'JPEG', quality=70)
+ img_io.seek(0)
+ return base64.b64encode(img_io.getvalue()).decode('utf-8')
+
+
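+# Truncates long conversations so the speech bubble stays readable: the first line is
+# kept, followed by "...." and the last 8 lines.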
+def format_bubble_text(text):
+ lines = text.split("\n")
+
+ if len(lines) > 10:
+ # Keep the first line, add "....", and then append the last 8 lines
+ lines = [lines[0], "...."] + lines[-8:]
+
+ return "\n".join(lines)
+
+
+@app.route('/set_env', methods=['POST'])
+def set_env():
+ global env_name # Declare the variable as global to modify it
+ global env_label # Declare the variable as global to modify it
+ env_label = request.form.get('env_label') # Get the selected env_name from the form
+
+ env_name = env_label_to_env_name[env_label]
+
+ global env # Declare the env variable as global to modify it
+ env = gym.make(env_name) # Initialize the environment with the new name
+ update_tree() # Update the tree for the new environment
+ return redirect(url_for('index')) # Redirect back to the main page
+
+
+@app.route('/set_mask_unobserved', methods=['POST'])
+def set_mask_unobserved():
+ global mask_unobserved
+ mask_unobserved_value = request.form.get('mask_unobserved')
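+    # Note: bool() is truthy for any non-empty string, so this assumes the checkbox
+    # value is only present in the form when the box is ticked.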
+ mask_unobserved = bool(mask_unobserved_value)
+
+ return redirect(url_for('index'))
+
+
+
+@app.route('/update_image', methods=['POST'])
+def update_image():
+ action_name = request.form.get('action')
+
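+    # Actions are encoded as [primitive_action, template_index, word_index];
+    # np.nan marks the components that are unused for the chosen action.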
+ if action_name == 'done':
+ # reset the env and update the tree image
+ obs = env.reset()
+ update_tree()
+
+ else:
+ if action_name == "speak":
+ action_template = request.form.get('template')
+ action_word = request.form.get('word')
+
+ temp_ind, word_ind = env.grammar.get_action(action_template, action_word)
+ action = [np.nan, temp_ind, word_ind]
+
+ elif action_name == 'left':
+ action = [int(env.actions.left), np.nan, np.nan]
+ elif action_name == 'right':
+ action = [int(env.actions.right), np.nan, np.nan]
+ elif action_name == 'forward':
+ action = [int(env.actions.forward), np.nan, np.nan]
+ elif action_name == 'toggle':
+ action = [int(env.actions.toggle), np.nan, np.nan]
+ elif action_name == 'noop':
+ action = [np.nan, np.nan, np.nan]
+ else:
+ action = [np.nan, np.nan, np.nan]
+
+ obs, reward, done, info = env.step(action)
+
+ image = env.render('rgb_array', tile_size=32, mask_unobserved=mask_unobserved)
+ image_data = np_img_to_base64(image)
+
+ bubble_text = format_bubble_text(env.current_env.full_conversation)
+
+ return jsonify({'image_data': image_data, "bubble_text": bubble_text})
+
+
+@app.route('/', methods=['GET', 'POST'])
+def index():
+ image = env.render('rgb_array', tile_size=32, mask_unobserved=mask_unobserved)
+ image_data = np_img_to_base64(image)
+
+ bubble_text = format_bubble_text(env.current_env.full_conversation)
+
+ available_env_labels = env_label_to_env_name.keys()
+
+ grammar_templates = env.grammar.templates
+ grammar_words = env.grammar.things
+
+ return render_template(
+ 'index.html',
+ image_data=image_data,
+ bubble_text=bubble_text,
+ mask_unobserved=mask_unobserved,
+ timestamp=time.time(),
+ available_env_labels=available_env_labels,
+ current_env_label=env_label,
+ grammar_templates=grammar_templates,
+ grammar_words=grammar_words,
+ )
+
+
+if __name__ == '__main__':
+ app.run(host='0.0.0.0', port=7860, debug=True)
diff --git a/web_demo/requirements.txt b/web_demo/requirements.txt
new file mode 100644
index 0000000000000000000000000000000000000000..f770087e17316df0fe53356e52583c54ba806e73
--- /dev/null
+++ b/web_demo/requirements.txt
@@ -0,0 +1,6 @@
+flask==2.2.5
+pillow==9.5.0
+astar==0.93
+termcolor==2.3.0
+matplotlib==3.5.3
+graphviz==0.20.1
\ No newline at end of file
diff --git a/web_demo/static/current_tree b/web_demo/static/current_tree
new file mode 100644
index 0000000000000000000000000000000000000000..e3b4835b1d2733be3b3883454a5afcea1a7adf98
--- /dev/null
+++ b/web_demo/static/current_tree
@@ -0,0 +1,152 @@
+digraph unix {
+ size="30,100"
+ node [color=lightblue3 fontcolor=black fontsize=18 shape=box style=filled]
+ node_1 [label=Env_type type=parameter]
+ node [color=grey95 fontcolor=gray70 fontsize=18 shape=ellipse style=filled]
+ node_2 [label=Information_seeking type=value]
+ node [color=white fontcolor=gray70 fontsize=30 shape=none]
+ node_2_fold [label="..." type=value]
+ node_2 -> node_2_fold [color=gray70]
+ node_1 -> node_2
+ node [color=lightblue2 fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_50 [label=Collaboration type=value]
+ node_1 -> node_50
+ node [color=grey95 fontcolor=gray70 fontsize=18 shape=ellipse style=filled]
+ node_95 [label=AppleStealing type=value]
+ node [color=white fontcolor=gray70 fontsize=30 shape=none]
+ node_95_fold [label="..." type=value]
+ node_95 -> node_95_fold [color=gray70]
+ node_1 -> node_95
+ node [color=lightblue3 fontcolor=black fontsize=18 shape=box style=filled]
+ node_51 [label=Problem type=parameter]
+ node_50 -> node_51
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_52 [label=Boxes type=value]
+ node_51 -> node_52
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_58 [label=Switches type=value]
+ node_51 -> node_58
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_64 [label=Generators type=value]
+ node_51 -> node_64
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_70 [label=Marble type=value]
+ node_51 -> node_70
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_76 [label=MarblePass type=value]
+ node_51 -> node_76
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_83 [label=MarblePush type=value]
+ node_51 -> node_83
+ node [color=lightblue2 fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_89 [label=LeverDoor type=value]
+ node_51 -> node_89
+ node [color=grey60 fontcolor=black fontsize=18 shape=box style=filled]
+ node_53 [label=Role type=parameter]
+ node_52 -> node_53
+ node [color=grey60 fontcolor=black fontsize=18 shape=box style=filled]
+ node_56 [label=Version type=parameter]
+ node_52 -> node_56
+ node [color=grey60 fontcolor=black fontsize=18 shape=box style=filled]
+ node_59 [label=Role type=parameter]
+ node_58 -> node_59
+ node [color=grey60 fontcolor=black fontsize=18 shape=box style=filled]
+ node_62 [label=Version type=parameter]
+ node_58 -> node_62
+ node [color=grey60 fontcolor=black fontsize=18 shape=box style=filled]
+ node_65 [label=Role type=parameter]
+ node_64 -> node_65
+ node [color=grey60 fontcolor=black fontsize=18 shape=box style=filled]
+ node_68 [label=Version type=parameter]
+ node_64 -> node_68
+ node [color=grey60 fontcolor=black fontsize=18 shape=box style=filled]
+ node_71 [label=Role type=parameter]
+ node_70 -> node_71
+ node [color=grey60 fontcolor=black fontsize=18 shape=box style=filled]
+ node_74 [label=Version type=parameter]
+ node_70 -> node_74
+ node [color=grey60 fontcolor=black fontsize=18 shape=box style=filled]
+ node_77 [label=Role type=parameter]
+ node_76 -> node_77
+ node [color=grey60 fontcolor=black fontsize=18 shape=box style=filled]
+ node_80 [label=Version type=parameter]
+ node_76 -> node_80
+ node [color=grey60 fontcolor=black fontsize=18 shape=box style=filled]
+ node_84 [label=Role type=parameter]
+ node_83 -> node_84
+ node [color=grey60 fontcolor=black fontsize=18 shape=box style=filled]
+ node_87 [label=Version type=parameter]
+ node_83 -> node_87
+ node [color=lightblue3 fontcolor=black fontsize=18 shape=box style=filled]
+ node_90 [label=Role type=parameter]
+ node_89 -> node_90
+ node [color=lightblue3 fontcolor=black fontsize=18 shape=box style=filled]
+ node_93 [label=Version type=parameter]
+ node_89 -> node_93
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_54 [label=A type=value]
+ node_53 -> node_54
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_55 [label=B type=value]
+ node_53 -> node_55
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_57 [label=Social type=value]
+ node_56 -> node_57
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_60 [label=A type=value]
+ node_59 -> node_60
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_61 [label=B type=value]
+ node_59 -> node_61
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_63 [label=Social type=value]
+ node_62 -> node_63
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_66 [label=A type=value]
+ node_65 -> node_66
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_67 [label=B type=value]
+ node_65 -> node_67
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_69 [label=Social type=value]
+ node_68 -> node_69
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_72 [label=A type=value]
+ node_71 -> node_72
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_73 [label=B type=value]
+ node_71 -> node_73
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_75 [label=Social type=value]
+ node_74 -> node_75
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_78 [label=A type=value]
+ node_77 -> node_78
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_79 [label=B type=value]
+ node_77 -> node_79
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_81 [label=Social type=value]
+ node_80 -> node_81
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_82 [label=Asocial type=value]
+ node_80 -> node_82
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_85 [label=A type=value]
+ node_84 -> node_85
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_86 [label=B type=value]
+ node_84 -> node_86
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_88 [label=Social type=value]
+ node_87 -> node_88
+ node [color=lightgray fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_91 [label=A type=value]
+ node_90 -> node_91
+ node [color=lightblue2 fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_92 [label=B type=value]
+ node_90 -> node_92
+ node [color=lightblue2 fontcolor=black fontsize=18 shape=ellipse style=filled]
+ node_94 [label=Social type=value]
+ node_93 -> node_94
+}
diff --git a/web_demo/static/current_tree.svg b/web_demo/static/current_tree.svg
new file mode 100644
index 0000000000000000000000000000000000000000..5452e0e7420d181e66956a6665150b9e59154654
--- /dev/null
+++ b/web_demo/static/current_tree.svg
@@ -0,0 +1,607 @@
diff --git a/web_demo/static/results.png b/web_demo/static/results.png
new file mode 100644
index 0000000000000000000000000000000000000000..c450a2daba31fb3f9b59672738c782025afc3615
Binary files /dev/null and b/web_demo/static/results.png differ
diff --git a/web_demo/templates/index.html b/web_demo/templates/index.html
new file mode 100644
index 0000000000000000000000000000000000000000..0573b9a5ed3bcf0c994c1958b59ac54fbd8c6350
--- /dev/null
+++ b/web_demo/templates/index.html
@@ -0,0 +1,309 @@
+
+
+
+ SocialAI School Demo
+
+
+
+
+
+
+ Select Environment:
+
+
+
+
+ Mask unobserved cells:
+
+
+
+
+
+ This is the sampling tree. The currently sampled parameters are highlighted in blue.
+
+
+
+
+
+ This is the environment.
+
+
+ You can control the agent using the following actions: