
1 hour to learn to walk, 10 minutes to learn to roll over: a world model lets robots quickly master multiple skills

How useful can a world model be on a physical robot?

Teaching robots to solve complex tasks in the real world has always been a fundamental problem in robotics research. Deep reinforcement learning provides a popular method of robot learning that allows robots to improve their behavior through trial and error. However, current algorithms require too much interaction with the environment to learn successfully, which makes them unsuitable for some real-world tasks.

Learning an accurate world model of the real world is a major open challenge. In a recent study, researchers at UC Berkeley leveraged recent advances in the Dreamer world model to train a variety of robots in the most straightforward and fundamental problem setting: online reinforcement learning in the real world, without simulators or demonstrations.

Paper link: https://arxiv.org/pdf/2206.14176.pdf

The Dreamer world model was proposed in 2021 by researchers at Google, the University of Toronto, and other institutions. As shown in Figure 2 below, Dreamer learns a world model from a replay buffer of past experience, learns behaviors from imagined rollouts in the latent space of the world model, and continuously interacts with the environment to explore and improve its behavior. The researchers aim to push the limits of robot learning in the real world and provide a powerful platform to support future work.
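To make that cycle concrete, here is a minimal Python sketch of the loop described above. All function names are hypothetical, the environment is assumed to follow a classic gym-style interface, and the bodies are placeholders; this is an illustration of the training cycle, not the authors' code.

```python
import collections
import random

def train_world_model(buffer, batch_size=16):
    """Placeholder: fit the encoder/dynamics/reward/decoder networks on
    sequences sampled from the replay buffer."""
    batch = random.sample(list(buffer), min(batch_size, len(buffer)))
    return batch  # a real implementation returns the updated world model

def train_behavior_in_imagination(world_model, horizon=15):
    """Placeholder: roll the learned dynamics forward in latent space and
    update the actor and critic on the imagined trajectories."""
    pass

def dreamer_loop(env, episodes=10):
    buffer = collections.deque(maxlen=10_000)            # replay buffer of past experience
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            action = env.action_space.sample()           # stand-in for the learned actor
            obs, reward, done, _ = env.step(action)      # classic gym-style step
            buffer.append((obs, action, reward))
        world_model = train_world_model(buffer)          # 1) learn the world model from replay
        train_behavior_in_imagination(world_model)       # 2) learn behavior in imagination
```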

Overall, the contributions of this study are:

1. Dreamer on Robots. The researchers applied Dreamer to four robots and demonstrated successful learning in the real world without introducing new algorithms. These tasks cover a range of challenges, including different action spaces, sensory modalities, and reward structures.

2. Learn to walk within 1 hour. The researchers taught a quadruped robot, from scratch in the real world, to roll over, stand up, and walk in under an hour.

In addition, they found that within 10 minutes the robot could learn to withstand pushes or to quickly roll over and get back on its feet.

3. Visual pick and place. The researchers trained a robotic arm to pick and place objects from sparse rewards, which requires localizing objects from pixels and fusing image and proprioceptive inputs. The learned behavior outperforms model-free agents and approaches human performance.

4. Open source. The researchers publicly released the software infrastructure for all experiments, which supports different action spaces and sensory modalities, providing a flexible platform for future research into world models learned by robots in the real world.

 

This study uses the Dreamer algorithm (Hafner et al., 2019; 2020) to perform online learning on physical robots without simulators; the overall architecture is shown in Figure 2 above. Dreamer learns a world model from a replay buffer of past experience, uses an actor-critic algorithm to learn behaviors from trajectories predicted by the learned model, and deploys its behavior in the environment to continuously grow the replay buffer.

This study decouples learning updates from data collection to meet latency requirements and enable fast training without waiting for the environment. In this implementation, a learner thread continuously trains the world model and actor-critic behavior, while an actor thread computes actions for environment interaction in parallel.
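A hedged sketch of that decoupling using Python's threading module is shown below. The structure and names are assumptions rather than the released infrastructure: the learner consumes incoming experience and publishes fresh parameters, while the actor keeps stepping the robot with the latest policy snapshot.

```python
import queue
import threading

experience_queue = queue.Queue()      # actor thread -> learner thread: new transitions
policy_lock = threading.Lock()        # protects the shared policy snapshot
shared_policy = {"version": 0}        # stand-in for the latest network parameters

def learner_thread():
    while True:
        transition = experience_queue.get()   # blocks until the actor sends data
        # ... update the world model and actor-critic from the replay buffer ...
        with policy_lock:
            shared_policy["version"] += 1     # publish refreshed parameters

def actor_thread(env):
    obs = env.reset()
    while True:
        with policy_lock:
            policy = dict(shared_policy)              # grab the latest snapshot
        action = env.action_space.sample()            # stand-in for policy(obs)
        obs, reward, done, _ = env.step(action)
        experience_queue.put((obs, action, reward))   # hand experience to the learner
        if done:
            obs = env.reset()

# threading.Thread(target=learner_thread, daemon=True).start()
# threading.Thread(target=actor_thread, args=(env,), daemon=True).start()
```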

The world model is a deep neural network that learns to predict the dynamics of the environment, as shown in Figure 3(a) below.

A world model can be thought of as a fast simulator of the environment that the robot learns autonomously and keeps improving as it explores the real world. The world model is based on the recurrent state-space model (RSSM; Hafner et al., 2018), which consists of four components: an encoder network that maps sensory inputs to latent states, a dynamics network that predicts future latent states, a reward network that predicts task rewards from latent states, and a decoder network that reconstructs observations to provide a rich learning signal.
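The following PyTorch sketch shows how those four components could fit together in a toy recurrent state-space model. Layer types, sizes, and the deterministic latent are simplified illustrations, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class TinyRSSM(nn.Module):
    def __init__(self, obs_dim=64, act_dim=12, state_dim=32):
        super().__init__()
        # 1) Encoder: maps sensory inputs (plus the prior) to the latent state.
        self.encoder = nn.Linear(obs_dim + state_dim, state_dim)
        # 2) Dynamics network: predicts the next latent state from state + action.
        self.dynamics = nn.GRUCell(act_dim, state_dim)
        # 3) Reward network: predicts the task reward from the latent state.
        self.reward = nn.Linear(state_dim, 1)
        # 4) Decoder: reconstructs the observation as a learning signal.
        self.decoder = nn.Linear(state_dim, obs_dim)

    def step(self, state, action, obs=None):
        prior = self.dynamics(action, state)                  # imagined next state
        if obs is None:                                       # pure imagination
            return prior
        return self.encoder(torch.cat([obs, prior], -1))      # corrected by data
```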

The world model represents task-independent dynamics knowledge, while the actor-critic algorithm is responsible for learning behavior specific to the current task, as shown in Figure 3(b) above. The study learns behaviors from rollouts predicted in the latent space of the world model, without decoding observations. This enables massively parallel behavior learning with batch sizes of 16K on a single GPU, similar to specialized modern simulators (Makoviychuk et al., 2021). The actor-critic algorithm consists of two neural networks:

The actor network learns a distribution over successful actions for each latent model state s_t that maximizes the sum of predicted future task rewards. The critic network learns to predict the sum of future task rewards via temporal-difference learning (Sutton and Barto, 2018), which allows the algorithm to learn long-horizon policies.
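The sketch below illustrates this imagination-based actor-critic update in simplified form. It assumes a world model with `step` and `reward` methods like the RSSM sketch above (hypothetical names), and it uses a plain discounted return with a bootstrapped terminal value rather than the λ-returns used in Dreamer.

```python
def imagine_rollout(world_model, actor, start_states, horizon=15):
    """Roll the learned dynamics forward in latent space, with the actor
    choosing actions and the world model predicting rewards."""
    states, rewards = [start_states], []
    s = start_states                               # shape: (batch, state_dim)
    for _ in range(horizon):
        a = actor(s)                               # action for each latent state
        s = world_model.step(s, a)                 # predicted next latent state
        rewards.append(world_model.reward(s))
        states.append(s)
    return states, rewards

def actor_critic_losses(states, rewards, critic, gamma=0.99):
    # Discounted sum of predicted rewards, bootstrapped with the critic's
    # value estimate at the end of the imagination horizon.
    returns = critic(states[-1])
    for r in reversed(rewards):
        returns = r + gamma * returns
    actor_loss = -returns.mean()                   # actor maximizes imagined returns
    critic_loss = ((critic(states[0]) - returns.detach()) ** 2).mean()
    return actor_loss, critic_loss
```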

In contrast to Hafner et al. (2020), this Dreamer setup has no training-frequency hyperparameter, because optimization of the neural networks by the learner runs in parallel with data collection and is not rate-limited.

The researchers evaluated Dreamer on four robots, assigning each a different task, and compared its performance to algorithmic and human baselines. The goal was to assess whether recent successes in learning world models enable sample-efficient robot learning directly in the real world.

These experiments cover common robotic tasks such as locomotion, manipulation, and navigation, and present a variety of challenges, including continuous and discrete actions, dense and sparse rewards, proprioceptive and image observations, and sensor fusion.
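As an illustration of the sensor-fusion point, here is a small PyTorch sketch that combines a camera image with a proprioceptive vector into a single feature vector for the encoder. All layer types and sizes are hypothetical, not the architecture used in the experiments.

```python
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    def __init__(self, proprio_dim=14, feature_dim=128):
        super().__init__()
        # Convolutional branch for the RGB image observation.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # Small MLP branch for joint angles and gripper state.
        self.mlp = nn.Sequential(nn.Linear(proprio_dim, 64), nn.ReLU())
        self.head = nn.LazyLinear(feature_dim)   # fuses both branches

    def forward(self, image, proprio):
        feats = torch.cat([self.cnn(image), self.mlp(proprio)], dim=-1)
        return self.head(feats)

# Example: a 64x64 camera image plus a 14-dimensional proprioceptive vector.
enc = FusionEncoder()
z = enc(torch.zeros(1, 3, 64, 64), torch.zeros(1, 14))
```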

A1 Quadruped Walking

As shown in Figure 4, after an hour of training, Dreamer had taught the robot to consistently roll over from its back, stand up, and walk forward. During the first 5 minutes of training, the robot managed to roll over from its back and land on its feet. After 20 minutes, it had learned how to stand up. After about an hour, the robot had learned a walking gait and walked forward at the desired speed.

After the robot had learned this task, the researchers repeatedly pushed it with a stick to test the robustness of the learned behavior, as shown in Figure 8. With an additional 10 minutes of learning, the robot adapted to withstand pushes or to quickly roll over and get back on its feet. In contrast, SAC also quickly learned to roll over, but was unable to stand or walk within the same small data budget.

UR5 Multi-Object Visual Pick and Place

Pick-and-place tasks are common in warehouse and logistics environments, requiring a robotic arm to move items from one bin to another. Figure 5 shows a successful pick-and-place cycle. The task is challenging due to sparse rewards, the need to infer object locations from pixels, and the difficult dynamics of multiple moving objects.

XArm Visual Pick and Place

The UR5 mentioned above is a high-performance industrial robot, whereas the XArm is an accessible, low-cost 7-DOF robotic arm. The task is similar: locate and grasp a soft object and move it from one container to another and back, as shown in Figure 6.

Sphero Navigation

In addition, the researchers evaluated Dreamer on a visual navigation task that requires steering a wheeled robot to a fixed goal location given only RGB images as input. They used the Sphero Ollie, a cylindrical robot with two controllable motors, which they command with continuous torques at 2 Hz. Because the robot is symmetric and receives only image observations, it must infer its heading from the history of observations.

Within 2 hours, Dreamer learned to navigate quickly and consistently to the goal and stay near it. As shown in Figure 7, Dreamer reached an average distance of 0.15 from the goal (measured in units of the arena size and averaged over time steps).
