DROID Annotations
This repo contains additional annotation data for the DROID dataset which we completed after the initial dataset release.
Concretely, it contains the following information:
- the full language annotations for the DROID dataset (3x annotations for 95+% of the 75k successful demonstrations)
- higher-accuracy camera calibrations (for 36k episodes)
Language Annotations
The released RLDS dataset only contains a subset of the language labels. Here, we provide a single json file mapping each episode ID to three natural language annotations, covering 75k successful episodes in DROID (95% of all DROID success episodes).
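As a minimal sketch, the annotations can be loaded and looked up by episode ID as below (the file path is a placeholder for the annotation json in this repo; episode IDs are explained under Accessing Annotation Data):
import json
with open("<path/to/language_annotations.json>", "r") as f:  # placeholder path
    language_annotations = json.load(f)
episode_id = "<some episode ID>"  # see Accessing Annotation Data below
print(language_annotations[episode_id])  # the episode's natural language annotations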
Camera Calibrations
The original DROID release included noisy camera calibrations. In post-hoc processing, we generated more accurate extrinsic camera calibration parameters for a subset of the DROID episodes. Concretely, we provide the following three calibration files:
- cam2base_extrinsics.json: Contains ~36k entries with either the left or right camera calibrated with respect to the base.
- cam2cam_extrinsics.json: Contains ~90k entries with cam2cam relative poses and camera parameters for all of DROID.
- cam2base_extrinsic_superset.json: Contains ~24k unique entries, ~48k poses in total, for both the left and right cameras calibrated with respect to the base.
These files map each episode's unique ID (see Accessing Annotation Data below) to a dictionary containing metadata (e.g., detection quality metrics; see Appendix G of the paper) as well as a map from camera ID to extrinsics values. Each extrinsics value is a 6-element list of floats giving the translation and rotation. It can easily be converted into a homogeneous pose matrix:
import numpy as np
from scipy.spatial.transform import Rotation as R
# Assume extrinsics is that 6-element list
pos = extrinsics[0:3]
rot_mat = R.from_euler("xyz", extrinsics[3:6]).as_matrix()
# Make homogeneous transformation matrix
cam_to_target_extrinsics_matrix = np.eye(4)
cam_to_target_extrinsics_matrix[:3, :3] = rot_mat
cam_to_target_extrinsics_matrix[:3, 3] = pos
This represents a transformation matrix from the camera's frame to the target frame. Inverting it gives the transformation from the target frame to the camera frame (which is usually what you want, e.g., if you need to project a point in the robot frame into the camera frame).
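As a rough sketch, the raw 6-element extrinsics for one camera in one episode can be pulled out of cam2base_extrinsics.json as below; the exact key layout of the per-episode dictionary is assumed here, so adjust to what you find in the file:
import json
with open("cam2base_extrinsics.json", "r") as f:
    cam2base = json.load(f)
episode_entry = cam2base["<some episode ID>"]  # metadata plus per-camera extrinsics
extrinsics = episode_entry["<camera serial>"]  # assumed key; 6-element [x, y, z, rx, ry, rz]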
As the raw DROID video files were recorded on Zed cameras and saved in SVO format, they contain camera intrinsics which can be used in conjunction with the above. For convenience, we have extracted and saved all of these to intrinsics.json (~72k entries). This json has the following format:
<episode ID>:
    <external camera 1's serial>: [fx, cx, fy, cy for camera 1]
    <external camera 2's serial>: [fx, cx, fy, cy for camera 2]
    <wrist camera's serial>: [fx, cx, fy, cy for wrist camera]
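For example, reading the intrinsics for one camera in one episode might look like the following sketch (episode ID and camera serial are placeholders):
import json
with open("intrinsics.json", "r") as f:
    intrinsics_by_episode = json.load(f)
intrinsics = intrinsics_by_episode["<some episode ID>"]["<camera serial>"]  # [fx, cx, fy, cy]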
One can thus convert the list for a particular camera into a 3x3 intrinsics matrix via the following:
import numpy as np
# Assume intrinsics is that 4-element list
fx, cx, fy, cy = intrinsics
intrinsics_matrix = np.array([
    [fx, 0, cx],
    [0, fy, cy],
    [0, 0, 1],
])
Note that the intrinsics tend to not change much between episodes, but using the specific values corresponding to a particular episode tends to give the best results.
Example Calibration Use Case
Using the calibration information, one can project points in the robot's frame into pixel coordinates for the cameras. We will demonstrate how to map the robot gripper position to pixel coordinates for the external cameras with extrinsics in cam2base_extrinsics.json; see CalibrationExample.ipynb for the full code.
gripper_position_base = <Homogeneous gripper position in the base frame, as obtained from the TFDS episode. Shape 4 x 1>
cam_to_base_extrinsics_matrix = <extrinsics matrix for some camera>
intrinsics_matrix = <intrinsics matrix for that same camera>
# Invert to get transform from base to camera frame
base_to_cam_extrinsics_matrix = np.linalg.inv(cam_to_base_extrinsics_matrix)
# Transform gripper position to camera frame, then remove homogeneous component
robot_gripper_position_cam = base_to_cam_extrinsics_matrix @ gripper_position_base
robot_gripper_position_cam = robot_gripper_position_cam[:3] # Now 3 x 1
# Project into pixel coordinates
pixel_positions = intrinsics_matrix @ robot_gripper_position_cam
pixel_positions = pixel_positions[:2] / pixel_positions[2]  # Shape 2 x 1. Done!
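If you also have the matching image for that camera and timestep (e.g., decoded from the SVO or taken from the RLDS dataset), you can overlay the projected point to sanity-check the calibration. A rough sketch follows; `image` is an assumed variable, and if the image was resized relative to the resolution the intrinsics correspond to, scale the pixel coordinates accordingly:
import matplotlib.pyplot as plt
u, v = pixel_positions.flatten()  # pixel coordinates from above
plt.imshow(image)  # `image` is assumed to be the matching camera frame
plt.scatter([u], [v], c="r", marker="x")
plt.show()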
Accessing Annotation Data
All annotations are stored in json files which you can download from this repository.
To access the respective annotation for a particular episode, you can compute the episode's ID.
It corresponds to the file name metadata_<episode_id>.json
in every episode's folder in the raw DROID data.
To extract episode IDs and corresponding file paths, you can use the code below.
You can then, e.g., match the file path against the one stored in the RLDS dataset if you want to attach language annotations / camera extrinsics to RLDS episodes.
import tensorflow as tf
episode_paths = tf.io.gfile.glob("gs://gresearch/robotics/droid_raw/1.0.1/*/success/*/*/metadata_*.json")
for p in episode_paths:
    # File names look like metadata_<episode_id>.json; strip the path, prefix, and extension to get the ID
    episode_id = p[:-5].split("/")[-1].split("_")[-1]
As using the above annotations requires these episode IDs (but the TFDS dataset only contains paths), we have included episode_id_to_path.json for convenience. The code snippet below loads this json, then builds the mapping from episode paths to IDs.
import json
episode_id_to_path_path = "<path/to/episode_id_to_path.json>"
with open(episode_id_to_path_path, "r") as f:
    episode_id_to_path = json.load(f)
episode_path_to_id = {v: k for k, v in episode_id_to_path.items()}
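With this mapping in hand, you can, for example, take an episode path, look up its ID, and then index into any of the annotation files above. A short sketch (the path format is a placeholder, and the language annotation json is assumed to have been loaded as shown earlier):
episode_id = episode_path_to_id["<episode path as stored in the mapping>"]
annotations = language_annotations[episode_id]  # or any of the calibration jsons above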