Reinforcement learning in robotics

Reinforcement Learning (RL) is a subfield of Machine Learning where an agent learns by interacting with its environment, observing the results of these interactions and receiving a reward (positive or negative) accordingly. This way of learning mimics the fundamental way in which we humans (and animals alike) learn.

We humans, have a direct sensori-motor connection to our environment that can perform actions and witness the results of these actions on the environment itself. The idea is commonly known as “cause and effect”, and this undoubtedly is the key to building up knowledge of our environment throughout our lifetime. The content below will get into the following topics:

How’s Reinforcement Learning being used today?

A variety of different problems can be solved using Reinforcement Learning. Because RL agents can learn without expert supervision, the type of problems that are best suited to RL are complex problems where there appears to be no obvious or easily programmable solution. Two of the most popular ones are:

  • Game playing - determining the best move to make in a game often depends on a number of different factors, hence the number of possible states that can exist in a particular game is usually very large. To cover this many states using a standard rule based approach would mean specifying an also large number of hard coded rules. RL cuts out the need to manually specify rules, agents learn simply by playing the game. For two player games such as backgammon, agents can be trained by playing against other human players or even other RL agents.

  • Control problems - such as elevator scheduling. Again, it is not obvious what strategies would provide the best, most timely elevator service. For control problems such as this, RL agents can be left to learn in a simulated environment and eventually they will come up with good controlling policies. Some advantages of using RL for control problems is that an agent can be retrained easily to adapt to environment changes, and trained continuously while the system is online, improving performance all the time.

A nice and relatively recent (at the time of writing) example of reinforcement learning is presented at DeepMind’s paper Human-level control through deep reinforcement learning.

Why is Reinforcement Learning relevant for robots?

As J. Kober, J. Andrew (Drew) Bagnell, and J. Peters point out in Reinforcement Learning in Robotics: A Survey:

Reinforcement learning offers to robotics a framework and set of tools for the design of sophisticated and hard-to-engineer behaviors.

which tend to be most of the existing ones in the real world. As a rule of thumb, someone can probably think that those tasks that are complex for a human could probably be easily done by a robot while things that we humans do easily and with little effort, tend to be highly complex for a robot. This leads many to think that robots excel at solving complexity but perform poorly on trivial tasks (from a human perspective). In other words:

What’s complex for a human, robots can do easily and viceversa - Víctor Mayoral Vilches

Let’s take a simple example to illustrate this claim. Image we have a robot manipulator with three joints on top of a table repeatedly performing some task. Traditionally, robotic engineers in charge of the specific task will either design the whole application or use existing tools (produced by the manufacturer) to program such application. Regardless of the tools and complexity level, at some point someone should’ve derived its inverse kinematics accounting for possible errors within each motor/joint, included sensing to create a closed loop that enhances the accuracy of the model, designed the overall control scheme, programmed calibration routines, … all to get a robot that produces deterministic movements in a controlled environment.

But the fact is that (real) environments are not controllable and robots nowadays pay the bill (someone that has ever produced robot demonstrations should know what i mean).

Problems in robotics are often best represented with high-dimensional, continuous states and actions (note that the 10-30 dimensional continuous actions common in robot reinforcement learning are considered large (Powell, 2012)). In robotics, it is often unrealistic to assume that the true state is completely observable and noise-free.

Coming back to J. Kober et al.’s paper:

Reinforcement learning (RL) enables a robot to autonomously discover an optimal behavior through trial-and-error interactions with its environment. Instead of explicitly detailing the solution to a problem, in reinforcement learning the designer of a control task provides feedback in terms of a scalar objective function that measures the one-step performance of the robot.

Which actually makes a lot of sense!. Think of how we learn about speficic tasks. Let’s take me shooting a basketball:

  1. I get myself behind the 3 point line and get ready for a shot (it’s relevant to note it should be behind the 3 point line, otherwise i would totally make it every time)

  2. At this point, my consciousness has no whatsoever information about the exact distance to the basket, neither the strength I should use to use to make the shot so my brain produces an estimate based on the model that I have (built upon years of trial an error shots).

  3. With this estimate, I produce a shot. Let’s assume I miss the shot, which I notice through my eyes (sensors). Again, the information perceived through my eyes is not accurate thereby what I pass to my brain is not: “I missed the shot by 5.45 cm to the right” but more like “The shot was slightly too much to the right and i missed”. This information updates the model in my brain which receives a negative reward. We could get ourselves discussing about why did my estimate failed. Was it because the model is wrong regardless of the fact that I’ve made hundreds of 3-pointers before with that exact model? Was it because of the wind (playing outdoors generally)? or was it because i didn’t eat properly that morning?. It could easily be all of those or none, but the fact is that many of those aspects can’t really be controlled be me. So i proceed iterating:

  4. With the updated model, i make another shot which in case it fails drives me to step 2) but if I make it, I proceed to step 5).

  5. Making a shot means that my model did a good job so my brain strengthens those links that produced a proper shot by giving them a positive reward.

For robotics to grow exponentially, we need to be able to tackle complex problems and the trial-error approach that RL represents seems the right way.

Robot Reinforcement Learning, an introduction

The goal of reinforcement learning is to find a mapping from states x to actions, called policy \( \pi \), that picks actions a in given states s maximizing the cumulative expected reward r.

To do so, reinforcement learning discovers an optimal policy \( \pi* \) that maps states (or observations) to actions so as to maximize the expected return J, which corresponds to:

where \( p_\pi (\tau) \) is the distribution over the trajectory \( \tau = (x_0, a_0, x_1, a_1, …) \) and \( R(\tau) \) is the accumulated reward in the trajectory defined as:

being \( \gamma_t \in [0, 1) \) the discount factor that discounts rewards further in the future. With this general mathematical setup, many tasks in robotics can be naturally formulated as reinforcement learning (RL) problems.

Traditional methods in RL, typically try to estimate the expected long-term reward of a policy for each state x and time step t, also called the value function \( V_t^\pi (x) \). Value function methods are sometimes called called critic-only methods. The idea of a critic is to first observe and estimate the performance of choosing controls on the system (i.e., the value function), then derive a policy based on the gained knowledge. In contrast, policy search methods which directly try to deduce the optimal policy \( \pi* \) are sometimes called actor-only methods.

Function Approximation

Function approximation is a family of mathematical and statistical techniques used to represent a function of interest when it is computationally or information theoretically intractable to represent the function exactly or explicitly (e.g. in tabular form).

According to J. Kober et al.:

Typically, in reinforcement learning the function approximation is based on sample data collected during interaction with the environment. Function approximation is critical in nearly every RL problem, and becomes inevitable in continuous state ones. In large discrete spaces it is also often impractical to visit or even represent all states and actions, and function approximation in this setting can be used as a means to generalize to neighboring states and actions. Function approximation can be employed to represent policies, value functions, and forward models.

Challenges of Robot Reinforcement Learning

Curse of dimensionality

Bellman coined the term “Curse of Dimensionality” in 1957 when he explored optimal control in discrete high-dimensional spaces and faced an exponential explosion of states and actions. As the number of dimensions grows, exponentially more data and computation are needed to cover the complete state-action space.

For example, let’s take a 7 degree-of-freedom robot arm, a representation of the robot’s state would consist of its joint angles and velocities for each of its seven degrees of freedom as well as the Cartesian position and velocity of end efector. That accounts for \(2 × (7 + 3) = 20 \) states and 7-dimensional continuous actions.

If we assume that each dimension of the state-space is discretized into ten levels, we have 10 states for a one-dimensional state-space. In our robot arm example we’ll have \( 10^{20} \) unique states.

Curse of real-world samples

As pointed by J. Kober et al.: >Robot reinforcement learning suffers from most of the resulting real-world problems. For example, robot hardware is usually expensive, suffers from wear and tear, and requires careful maintenance. Repairing a robot system is a non-negligible effort associated with cost, physical labor and long waiting periods.

The curse of real-world samples is covered in Kober’s paper. A summary of some relevant aspects is presented below:

  • Applying reinforcement learning in robotics demands safe exploration which becomes a key issue of the learning process, a problem often neglected in the general reinforcement learning community (due to the use of simulated environments).
  • While learning, the dynamics of a robot can change due to many external factors ranging from temperature to wear thereby the learning process may never fully converge (i.e. how light conditions affect the performance of the vision system and, as a result, the task’s performance). This problem makes comparing algorithms particularly hard.
  • Reinforcement learning algorithms are implemented on a digital computer where the discretization of time is unavoidable despite that physical systems are inherently continuous time systems. Time discretization of the actuation can generate undesirable artifacts (e.g., the distortion of distance between states) even for idealized physical systems, which cannot be avoided.

Curse of under-modeling and model uncertainty

Simulation with accurate models could potentially be used to offset the cost of real-world interaction. Such seems to be the case with robots nowadays using the Gazebo simulator. Supported by the Open Source Robotics Foundation, Gazebo is compatible with different physics engines and has proved to be a relevant tool in every roboticists toolbox.

In an ideal setting, this approach would allow to learn the behavior in simulation and subsequently transfer it to the real robot. Unfortunately, creating a sufficiently accurate model of the robot and its environment is challenging and often requires very many data samples. As small model errors due to this under-modeling accumulate, the simulated robot can quickly diverge from the real-world system.

Simulation transfer to the real robots is generally classified in two big scenarios:

  • Tasks where the system is self-stabilizing (that is, where the robot does not require active control to remain in a safe state or return to it), transferring policies often works well.
  • Unstable tasks where small variations have drastic consequences. In such scenarios transferred policies often perform poorly.

Principles of Robot Reinforcement Learning

Taking into account the aforementioned challenges for Robot Reinforcement Learning, one can easily conclude that a naive application of reinforcement learning techniques in robotics is likely to be doomed to failure. In order for robot reinforcement learning to leverage good results the following principles should be taken into account:

  • Effective representations
  • Approximate models
  • Prior knowledge or information

The following sections will summarize each one of these principles which are covered in more detail at J. Kober et al.’s paper:

Effective representations

Much of the success of reinforcement learning methods has been due to the clever use of approximate representations. The need of such approximations is particularly pronounced in robotics, where table-based representations are rarely scalable.

  • Smart State-Action discretization: Reducing the dimensionality of states or actions by smart state-action discretization is a representational simplification that may enhance both policy search and value function-based methods.
  • Value Function Approximation: A value function-based approach requires an accurate and robust but general function approximator that can capture the value function with sufficient precision while maintaining stability during learning (e.g. ANNs).
  • Pre-structured policies: Policy search methods require a choice of policy representation that controls the complexity of representable policies to enhance learning speed.

Approximate models

Experience collected in the real world can be used to learn a forward model (Åström and Wittenmark, 1989) from data. Reduced learning on the real robot is highly desirable as simulations are frequently faster than real-time while safer for both the robot and its environment. In robot reinforcement learning, the learning step on the simulated system is often called mental rehearsal.

The core issues of mental rehearsal are: simulation biases, stochasticity of the real world, and efficient optimization when sampling from a simulator. These issues are addresses using techniques such as Iterative Learning Control, Value Function Methods with Learned Models or Locally Linear Quadratic Regulators.

Simulation biases Given how hard is to obtain a forward model that is accurate enough to simulate a complex real-world robot system, many robot RL policies learned on simulation perform poorly on the real robot. This is known as simulation bias. It is analogous to over-fitting in supervised learning – that is, the algorithm is doing its job well on the model and the training data, respectively, but does not generalize well to the real system or novel data. It has been proved that simulation biases can be addressed by introducing stochastic models or distributions over models even if the system is very close to deterministic.

Prior knowledge or information

Prior knowledge can dramatically help guide the learning process. These approaches significantly reduce the search space and, thus, speed up the learning process.

  • Prior Knowledge Through Demonstration: Providing a (partially) successful initial policy allows a reinforcement learning method to focus on promising regions in the value function or in policy space.
  • Prior Knowledge Through Task Structuring: Pre-structuring a complex task such that it can be broken down into several more tractable ones can significantly reduce the complexity of the learning task.


  • Chris Watkins, Learning from Delayed Rewards, Cambridge, 1989 (thesis)
  • Awesome Reinforcement Learning repository,
  • J. Kober, J. Andrew (Drew) Bagnell, and J. Peters, “Reinforcement Learning in Robotics: A Survey,” International Journal of Robotics Research, July, 2013. (pdf)
  • Marc Peter Deisenroth, Gerhard Neumann and Jan Peters, A Survey on Policy Search for Robotics (pdf)