Garbage In, Garbage Out: The Case For Better Robot Data Understanding

Published October 26, 2025

Low quality robot data → Poor robot performance.

Robot data collection is expensive, requiring hundreds of human expert teleoperation hours.

At the same time, collecting high quality robot data is difficult, even for a highly skilled teleoperator. For example, idle trajectories may occur when the teleoperator pauses, or poor lighting might reduce visual clarity.

While the precise definition of what constitutes a high quality training example is a complicated question (e.g. does a dark video increase policy resilience or reduce performance?), a few quality indicators can provide an insightful snapshot into your dataset. Data understanding is the first step to data improvement.

In this article, we introduce a lightweight open-source toolkit for finding low quality examples in the Open X-Embodiment datasets and show that when 20% of training examples are low quality, policy training loss degrades by 30%.

Want to know how good your dataset is?

score_lerobot_episodes

Try out the tool here: https://github.com/RoboticsData/score_lerobot_episodes


Robot Data Defined

We ran experiments on the Open X-Embodiment datasets - a large collection of over 60 robot manipulation datasets gathered through teleoperation on robots such as the Franka and UR5, including datasets such as DROID.

Overview of the Open X-Embodiment dataset

A dataset is represented as a collection of $N$ episodes (or trajectories): $\mathcal{D} = \{ \mathcal{E}_1, \ldots, \mathcal{E}_N \}$

Each episode consists of timestamped frames: $\mathcal{E}_i = \{ (o_{i,t}, s_{i,t}, a_{i,t}) \}_{t=1}^{T_i}$

where $o_{i,t}$ is the observation, $s_{i,t}$ is the robot state, and $a_{i,t}$ is the action at time $t$.

Observation space

Each observation $o_{i,t}$ includes two synchronized RGB camera streams:

  1. An over-the-shoulder view showing the full scene and robot.
  2. A wrist-mounted camera providing a close-up view of the robot’s end-effector.

State space

$s_{i,t} = [q_{i,t}, g_{i,t}] \in \mathbb{R}^{J+1}$ where $q_{i,t} \in \mathbb{R}^J$ represents the robot’s joint angles at time $t$, and $g_{i,t} \in \mathbb{R}$ represents the gripper position (e.g., open or closed).

Action space

$a_{i,t} = [q'_{i,t}, g'_{i,t}] \in \mathbb{R}^{C+1}$ where $q'_{i,t}$ are the motor commands and $g'_{i,t}$ is the gripper command applied by the teleoperator.

Visual Scoring

We evaluate each episode for visual clarity indicators such as lighting and blur. This allows us to ensure that only high-quality visual data is retained for training.

Our process involves the following steps (a minimal sketch in code follows the list):

  1. Uniformly sample 10 frames per episode - This provides a representative snapshot of the visual quality throughout the episode without the computational expense of analyzing every frame.
  2. Compute a per-episode aggregate visual score - For each sampled frame $I$, we compute a penalty based on blur and brightness, penalizing frames that are too dark or too blurry. The penalty is averaged across the episode to produce an aggregate visual score.
  3. Remove episodes that score below a threshold - This final filtering step discards any episode whose aggregate visual score falls below a pre-defined quality threshold, ensuring only visually high-quality data is retained for training.
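The sketch below is not the toolkit's actual API: `frame_penalty_fn` stands in for the blur and brightness penalties described in the next subsections, and the 0.8 score threshold is an illustrative assumption.

```python
import numpy as np

def sample_frames(frames, k=10):
    """Uniformly sample k frames across the episode."""
    idx = np.linspace(0, len(frames) - 1, num=min(k, len(frames))).astype(int)
    return [frames[i] for i in idx]

def episode_visual_score(frames, frame_penalty_fn, k=10):
    """Average per-frame penalties over the sampled frames; higher score = better."""
    penalties = [frame_penalty_fn(f) for f in sample_frames(frames, k)]
    return 1.0 - float(np.mean(penalties))

def filter_episodes(episodes, frame_penalty_fn, threshold=0.8):
    """Discard episodes whose aggregate visual score falls below the threshold."""
    return [ep for ep in episodes
            if episode_visual_score(ep["frames"], frame_penalty_fn) >= threshold]
```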

Blur

We estimate blur using the variance of the Laplacian. This metric is used because it measures the magnitude of high-frequency content in an image, which is drastically reduced by blur. The high-frequency components correspond to sharp edges and details; a sharper image will have a higher variance of the Laplacian.

The Laplace operator, $\Delta$, for a 2D image $I(x, y)$ is defined as the sum of its second partial derivatives: $\Delta I = \frac{\partial^2 I}{\partial x^2} + \frac{\partial^2 I}{\partial y^2}$. In discrete form, it is often approximated by convolution with a kernel, e.g., the $3 \times 3$ kernel $\begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix}$, so that $\Delta I$ is the discrete convolution of this kernel with the frame $I$.

The metric used for blur is the variance of the Laplacian, $\mathrm{Var}(\Delta I)$, calculated across all pixels in the frame: $\mathrm{Var}(\Delta I) = \frac{1}{HW} \sum_{x=1}^{W} \sum_{y=1}^{H} \left( \Delta I(x, y) - \mu_{\Delta I} \right)^2$ where $W \times H$ is the frame size, and $\mu_{\Delta I} = \frac{1}{HW} \sum_{x=1}^{W} \sum_{y=1}^{H} \Delta I(x, y)$ is the mean of the Laplacian values over the frame.

We compute a blur penalty by scaling the variance, applying a threshold, and returning the proportion of blurred frames across the episode.
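A minimal sketch of this check using OpenCV is shown below; the variance threshold of 100 is a common sharpness heuristic and an assumption here, not necessarily the toolkit's default.

```python
import cv2
import numpy as np

def laplacian_variance(frame_bgr):
    """Sharpness proxy: variance of the Laplacian of the grayscale frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def blur_penalty(frames, variance_threshold=100.0):
    """Proportion of frames whose Laplacian variance falls below the threshold."""
    blurred = [laplacian_variance(f) < variance_threshold for f in frames]
    return float(np.mean(blurred))
```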

Here is an example of a blurry episode discovered by our tool:

Example of motion and autofocus blur

Lighting

We use mean brightness as a proxy for good lighting. If the mean intensity $\mu$ of the grayscale image is below 50 (on a scale of 0 to 255), we assign a linearly proportional penalty:

$\text{Penalty} = \max\left(0.0, \frac{50.0 - \mu}{50.0}\right)$

Note that we only penalize images that are too dark; overexposure is not penalized since it rarely occurs as a failure mode, and we observe that penalizing it can result in an excessive rate of false positives.
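In code, the lighting penalty above reduces to a few lines. This is a sketch, assuming BGR frames as produced by OpenCV:

```python
import cv2

def brightness_penalty(frame_bgr, dark_threshold=50.0):
    """Linear penalty for frames with mean grayscale intensity below 50; brighter frames get 0."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    mu = float(gray.mean())
    return max(0.0, (dark_threshold - mu) / dark_threshold)
```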

The following example illustrates the effects of our visual scoring function on the droid_100 dataset. The video that received a low score was one that was captured in a dark room.

Low lighting score due to dimly lit room

In many cases, the wrist camera suffers from a noticeable drop in visual quality compared to the overhead view due to occlusions, shadows and focus issues.

Dataset: Droid Episode 53 (wrist view)
Dataset: Droid Episode 53 (overhead view)

Motion Scoring

In addition to visual observations, each episode contains motion data representing the robot’s internal state and control commands over time. This includes joint positions, velocities, and actions, which specify the robot’s kinematic behavior.

We introduce scoring functions that classify the quality of motion for a given episode, covering categories such as collision, path efficiency, and idle time.

Collision

We aim to detect potential physical collisions by analyzing spikes in the robot's joint-space acceleration. The idea is that when a robot physically impacts an object, it experiences abrupt changes in joint motion—typically seen as sudden decelerations upon contact, and rapid accelerations as it moves away.

Our collision scoring function estimates an acceleration threshold for each joint, based on the median absolute acceleration over time. A spike is detected when any joint's acceleration at a given timestamp exceeds its respective threshold.

  • Joint acceleration proxy:
    $a_t = \dfrac{q_{t+1} - 2q_t + q_{t-1}}{(t_{t+1} - t_t)^2}$

  • Per-joint robust threshold:
    $\theta_j = 15 \times \mathrm{median}_t\left(|a_{j,t}|\right)$, with $\text{spike\_ratio}$ defined as the fraction of timestamps at which $|a_{j,t}| > \theta_j$ for any joint $j$.

  • Collision score (goodness):
    $1 - \text{spike\_ratio}$

The function returns a score between 0 and 1 where a low-score indicates high likelihood of collision.
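A sketch of the collision score under the definitions above, assuming a `(T, J)` array of joint positions and roughly uniform timestamps:

```python
import numpy as np

def collision_score(q, timestamps, spike_factor=15.0):
    """Goodness score in [0, 1]; low values indicate likely collisions.

    q: (T, J) joint positions, timestamps: (T,) seconds.
    """
    dt = np.diff(timestamps).mean()                       # assume roughly uniform sampling
    acc = np.diff(q, n=2, axis=0) / dt**2                 # second finite difference per joint
    thresholds = spike_factor * np.median(np.abs(acc), axis=0) + 1e-8  # per-joint robust threshold
    spikes = (np.abs(acc) > thresholds).any(axis=1)       # spike if any joint exceeds its threshold
    return 1.0 - float(spikes.mean())
```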

Path Efficiency

We also score path efficiency by calculating how close the motion of the robot is to a straight line. This is motivated by the fact that expert trajectories are often the most direct path between two points. A path with excessive meandering or corrections suggests poor human control or indecision, which introduces noise and complexity that hampers policy training and results in inefficient robot behavior.

Path length:
$L = \sum_t \lVert q_{t+1} - q_t \rVert_2$

Straight-line distance:
$D = \lVert q_{\text{end}} - q_{\text{start}} \rVert_2$

Path efficiency score:
$\text{path\_eff} = \begin{cases} \mathrm{clip}\left(\dfrac{D}{L}, 0, 1\right), & L \ge 10^{-6} \\ 0, & \text{otherwise} \end{cases}$

The final score is the ratio of the straight-line distance in joint space to the actual path length traveled.
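A sketch of this score over a `(T, J)` array of joint positions:

```python
import numpy as np

def path_efficiency(q):
    """Ratio of straight-line joint-space distance to actual path length, clipped to [0, 1]."""
    path_length = np.linalg.norm(np.diff(q, axis=0), axis=1).sum()
    if path_length < 1e-6:
        return 0.0
    straight_line = np.linalg.norm(q[-1] - q[0])
    return float(np.clip(straight_line / path_length, 0.0, 1.0))
```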

Example of meandering path from Droid (note path after picking up object)

There may be situations where the direct path between two points isn't the ideal trajectory. For example, manipulating an object or maneuvering past obstacles in the scene might result in high amounts of nonlinear local perturbations or the robot's joint configuration may not permit a direct path. In these cases, path efficiency may not be a useful metric and can be excluded from the scoring criteria.

Actuator Saturation

When the expected action $a_t$ doesn't result in the next state $q_{t+1}$, we use this as a proxy for actuator saturation. Too much actuator saturation can indicate that the robot is lifting heavier loads than it is equipped for, is running into resistance within the scene, or has faulty motors. To reduce robot wear and tear over time, we want to prevent the policy from inheriting this behavior from the training data.

Specifically, we check how often $|a_t - q_{t+1}| > \text{threshold\_deg}$ for any joint (default threshold: 7 degrees).

To downweight minor transient divergence caused by slippage or inertia, we impose a non-linearity on the saturation ratio:

$s_{\text{sat}} = \exp(-4 \cdot \text{saturation\_ratio})$
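A sketch of the saturation score, assuming actions and states are both expressed as joint positions in degrees:

```python
import numpy as np

def saturation_score(actions, q, threshold_deg=7.0):
    """Exponentially downweighted score for how often commands overshoot the reached state.

    actions: (T, J) commanded joint positions, q: (T, J) measured joint positions.
    """
    gap = np.abs(actions[:-1] - q[1:])             # compare a_t with the state it produced, q_{t+1}
    saturated = (gap > threshold_deg).any(axis=1)  # saturated if any joint misses by > threshold
    saturation_ratio = saturated.mean()
    return float(np.exp(-4.0 * saturation_ratio))
```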

Idle Time

We score the robot’s motion based on how idle it is throughout the task. Excessive idling is undesirable because periods of near-zero velocity are often the result of distraction, indecision, or taking a break rather than intentional action to complete the task. These stationary periods slow down learning and result in undesirable robot behavior.

Specifically, we calculate the proportion of time the robot's velocity lies below a certain threshold during its motion. We first calculate the joint-space velocity magnitude $\lVert \mathbf{v}_t \rVert$ for each time step $t$: $\lVert \mathbf{v}_t \rVert = \sqrt{\sum_{j=1}^{J} \left( \frac{\Delta q_j}{\Delta t} \right)^2}$

The raw score is $1.0$ minus the proportion of steps where $\lVert \mathbf{v}_t \rVert$ is below a threshold (default: $0.1$).
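A sketch of the idle score, assuming a fixed control period `dt`:

```python
import numpy as np

def idle_score(q, dt, velocity_threshold=0.1):
    """1.0 minus the fraction of steps whose joint-space speed is below the threshold."""
    velocities = np.diff(q, axis=0) / dt        # per-joint velocities, shape (T-1, J)
    speed = np.linalg.norm(velocities, axis=1)  # joint-space velocity magnitude per step
    idle_ratio = (speed < velocity_threshold).mean()
    return 1.0 - float(idle_ratio)
```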

We also note that filtering idle intervals improves performance in openpi.

Example of idle time discovered by our tool

Idle time may sometimes be useful to attenuate inertia or allow the dynamics of the environment to settle to rest before the next action. In these cases, idle time is repeated and predictable rather than incidental.

We are currently in the process of launching a tool that accounts for these cases so if your use case exhibits such behavior, reach out to us!


Corruption Experiment

We evaluate the effect of noisy examples on robot learning by artificially corrupting the data with the sources of noise our tool aims to identify. Specifically, for 20% of episodes, we introduce the following realistic corruptions (sketched in code after the list):

  • Visual Noise: We artificially darken frames and apply an unsharp mask or motion blur to a random subset of frames in the episode.
  • Motion Noise: We introduce short, random idle periods (setting joint velocities to zero for 0.5-1.0 seconds) and insert single-step acceleration spikes (random, large joint command outliers) to simulate teleoperator jerks or minor collisions.
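These corruptions can be approximated with a few lines of code. The sketch below is illustrative rather than our exact procedure; the darkening factor, blur kernel, and spike magnitude are assumptions.

```python
import cv2
import numpy as np

def corrupt_frame(frame_bgr, rng):
    """Darken a frame and, half the time, blur it to mimic poor lighting and motion blur."""
    frame = (frame_bgr * 0.3).astype(np.uint8)      # darken
    if rng.random() < 0.5:
        frame = cv2.GaussianBlur(frame, (9, 9), 0)  # stand-in for motion/autofocus blur
    return frame

def corrupt_motion(q, fps, rng, idle_seconds=(0.5, 1.0), spike_deg=30.0):
    """Insert a short idle period and a single-step joint spike into a (T, J) trajectory."""
    q = q.copy()
    idle_len = int(rng.uniform(*idle_seconds) * fps)
    start = rng.integers(0, max(1, len(q) - idle_len))
    q[start:start + idle_len] = q[start]            # freeze joints (zero velocity)
    t_spike = rng.integers(1, len(q) - 1)
    q[t_spike] += rng.uniform(-spike_deg, spike_deg, size=q.shape[1])  # outlier command
    return q
```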

The following is a side-by-side comparison between the original video of an episode and the corrupted version.

Original (left), corrupted (right)

Note on evaluations

In a typical machine learning experiment, the most direct approach would be to compare the held-out validation loss of a model trained on the filtered and unfiltered versions of a real-world dataset.

However, in the context of robotics, validation loss isn't representative of real-world performance, and the training loss is not directly comparable since the filtered set has fewer examples.

For simplicity's sake, we opt to motivate the importance of data understanding by using an artificial corruption experiment to illustrate the impact of low quality data on robot learning.

We invite the community to share real world or simulation evaluation results and plan to share more of our own results in a future post.

Retrieval Precision-Recall

Just for illustrative purposes, we can design the following precision/recall experiment, where we use our scoring algorithm to detect corrupted episodes within a dataset that has been partially corrupted. This amounts to a binary classification problem: we set a threshold and classify episodes whose score falls below it as corrupt.

For example, we have the following results for detecting corrupted videos using the visual scoring function on the Stanford HYDRA dataset. This is a robotic manipulation dataset containing 570 episodes of Franka robot demonstrations across 3 tasks, with synchronized wrist and external camera views and 7-DOF action sequences.

For this experiment, we corrupted 50% of these episodes and obtained the following precision and recall results for various thresholds.
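A sketch of the threshold sweep used to produce precision, recall, and F1 curves; `scores` and `is_corrupted` are assumed to be per-episode NumPy arrays.

```python
import numpy as np

def precision_recall_sweep(scores, is_corrupted, thresholds):
    """Treat 'score below threshold' as a positive (corrupted) prediction."""
    results = []
    for thr in thresholds:
        pred = scores < thr
        tp = np.sum(pred & is_corrupted)
        fp = np.sum(pred & ~is_corrupted)
        fn = np.sum(~pred & is_corrupted)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results.append((thr, precision, recall, f1))
    return results
```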

Precision, recall, and F1 curves on the Stanford HYDRA dataset (50% corrupted).

Effects on Training

We also compared the training losses between the original and corrupted datasets, using ACT as our policy.

The figure below compares policy training on a corrupted dataset—where 20% of the episodes were intentionally degraded—with training on the original, clean dataset.

We see that models trained on corrupted data require significantly more optimization steps to reach the same loss threshold as those trained on clean data, highlighting the sensitivity of policy learning to data quality.

Stanford HYDRA training loss curve.

Berkeley Autolab UR5 training loss curve.

In conclusion, these results quantify the large impact of incorporating low quality data into an otherwise useful dataset, resulting in poor performance and wasted GPU hours.


We'd love to hear from you - do send us any comments or feedback you have!

https://github.com/RoboticsData/score_lerobot_episodes
