Reinforcement Learning of Little Br0 – “First attempt”

Taking a bit of a break from the comms board prototype, I spent some time working on the IsaacLab reinforcement learning setup.

Up until now, the Isaac Lab setup imports the robot asset but doesn’t actually do anything of value, so it’s just falling over.

While it has not got to a stage to be able to say that it’s made any progress with walking or standing, progress has been made towards it. Understanding the skeleton of the `*_env.py` file has greatly improved, and having some at least arbitrary rewards and done criteria with logs from runs will help me shape the future iterations of it.

An amusing example of it after training:

Existing setups

I spent some time trying to understand how others had framed the different sections in their environments with the main focus being on commands, observations, and rewards. The locomotion/humanoid and Anymal C have been the primary influences shaping the current state. I do want to better understand the BD-X and Olaf robot training setups as they seem to use pre-defined gaits as some of the commands to train against (at least in some of their policies). The Berkeley Humanoid also seems like a good point of reference. At the various sections I’ve added a non-exhaustive list for comparing (hopefully accurate)

Creating `biped_robot_b1_env.py`

A lot of what I have in the environment has come from the locomotion (humanoid) and Anymal C environment set ups. Some of the approaches are relevant, some less so. I wanted this first pass to be built on those as a stepping stone to better my understanding of the different aspects and different trade-offs.

The setup

The environment is setup quite basic. Flat floor with the environments spaced a small distance apart.

The robot is brought in from the urdf turned usd file created from my engineering CAD.
To attempt to simplify and shrink the model I made a URDF assembly and created volumes of each of the links to use as approximations instead of a dozen or so parts for each. This helps keep it looking good in the sim but also means I can get a bit more out contact between parts and the ground.
Importing the urdf to usd does merge links with fixed joints together.

The commands being passed to the robot for the training are linear velocities in x, y, and z. x, and y values are generated from a normal distribution and z being set to zero. I’d like to control the robot from a controller, so mapping thumb stick movement to x and y velocities makes sense. These commands are normalised to unit length, which I want to better understand if this is hindering the training, that is should the range be a bit more than a single unit.

From a visualisation perspective, markers of the commanded velocity vector (red) and actual vector (blue) are created and repeatedly updated above each env/robot instance. While things are rough it lets me have an idea of how different the two values are with a quick glance when the UI is running, generally post training.

Robot/System	Commands
Cartpole	No direct command
Locomotion (Humanoid)	Target of (1000, 0, 0) from the env
Anymal	– X/Y linear velocity – Yaw angular velocity
BD-X	Time-varying control input – one perpetual policy for standing while controlling the head and torso – one periodic policy for walking with separate head control – episodic policies to imitate a single animation – torso position – torso orientation – torso linear velocity – torso angular velocity

Actions

Actions in the sim are currently positions relative to the default joint position. However, I think modifying this to be torques makes more sense, primarily from better reflection of my approach on the real hardware. The humanoid/locomotion env uses gear ratios to scale this up. I want to better understand the reasoning behind this.
Is it due to:

truncated action value range
don’t have to wait for actions to increase to torque magnitudes of 20 to 100+ so more efficient
how energy use is modelled in rewards, i.e. decouple the gearbox and motor
increased model flexibility

Observations

Linear velocities in torso frame
Angular velocities in torso frame
Projection of up, i.e. how upright it is
Joint positions, normalised within the movement limits
Joint velocities
Actions
Root linear velocity in world frame
Root angular velocity in world frame
Commands
Joint positions relative to default joint positions
Torso position

There’s some overlap, largely due to not having gone through with sufficient detail on which aspects makes sense.
For example, how well do the normalised joint positions work when the default joint position is not halfway between the range limits?

Robot/System	Observations
Cartpole	– Pole position – Pole velocity – Cart position – Cart velocity
Locomotion (Humanoid)	– Root/”Torso” Z-position (n, 1) – Velocities in torso/local frame – Angular velocities (scaled) in torso/local frame – Yaw – Roll – Angle to target – Up Projection (how upright it is) – Heading projection (how on target it is) – Joint position scaled – Joint velocity scaled – Potential – Actions
Anymal	– Root (COM) linear velocity of base frame – Root (COM) angular velocity of base frame – Gravity proj on base frame – Commands: x & y lin vel, and yaw ang vel – Joint positions – Default joint positions – Joint velocities – Height scanner, if `AnymalCRough` – Actions
BD-X	TBD

Rewards and Penalties

The rewards are:

Still alive
How upright it is
Tracking of commanded linear velocity
Tracking of commanded yaw rate

The penalties are:

If joints are effectively at the end of the allowable range, i.e. within 1% of either end
If the torso is not level. I am thinking that this may be less relevant than on a quadruped
High action rate, want smooth actions
High joint acceleration rate, want to avoid jerky/high frequency movements
Large torques
Pitch and Roll velocities, i.e. non-zero
Z-velocity, non-zero
Energy cost, a product of actions (effectively torque pre-gearbox) and joint velocity which should reflect instantaneous power

The energy cost seems like one to remove for initial trials and brought in mid-way to begin to move towards something real. Similar to the inclusion of the torso’s level metric.

Robot/System	Penalties and Rewards
Cartpole	+ Length alive – Pole position – Cart velocity – Pole velocity – Terminated
Locomotion (Humanoid)	+ Alive + Upwards + Heading – Death – Action cost – Energy cost – DoF near limits
Anymal	+ Foot air time + yaw rate error small + velocity x, y error small – velocity z error – pitch and roll error – torque – acceleration – action rate – thigh contact with ground – body tilt
BD-X	– Torso position xy – Torso orientation – Linear velocity xy – Linear velocity z – Angular velocity xy – Angular velocity z – Leg joint positions – Neck joint positions – Leg joint velocities – Neck joint velocities – Contact – Joint torques – Joint accelerations – Leg action rate – Neck action rate – Leg action accel. – Neck action accel. – Survival

An example of how the different metrics contribute to the overall score over time:

Env Termination

Currently there are three ways for the environment to end and be reset:

Time out
Contact with the base/torso
Falling over

The current time out window is 15s. I don’t think this is currently terminating many environments but could be extended to 20s (like Anymal C) if it made sense.
Contact with the base is triggered with any collision, I’d love to spend a bit more time exploring to ensure there’s no false triggers or avoidable triggers (like the head just dropping onto it) which could possible be programmed to be excluded. When incorporating the contact sensor it did lead to some missed collisions in the CAD, both in the approximation (an easy fix with some cuts to the volumes) and also in the main GA (a bit more annoying to fix).
The falling over metric is great how it’s implemented in the locomotion/humanoid env but with the root position for the base starting at z=0 leads to a slightly different interpretation needed. In locomotion, when the body’s root is below a value it’s fallen over. With the biped USD, when the robot falls (pitches or rolls) the body’s root can’t really go down but will rotate up. Moving the position of the root seems like the logical move, while time intensive it does avoid unintuitive logic entering the simulation (a hindrance when returning to the code after a few months focused elsewhere).

Robot/System	Done
Cartpole	– Cart position out of bounds – Pole position out of bounds – Timeout
Locomotion (Humanoid)	– Time out – Fallen over (torso height below threshold)
Anymal	– Time out – Base contacts the ground
BD-X	TBD

Creating `biped_robot_v1_env_cfg.py`

The config file largely holds all the constants used, along with some specific configuration details for things like the contact sensors.
Some of these constants are:

Penalty scalers
Reward scalers
Joint gear ratios
Death cost
Termination height for falling over
Physics friction values

One open question I have had since the initial robot design was the presence of an articulable foot, passive or active. Going through the rounds of changing rewards and weightings to see what they do did make me wonder if it would all be magically solved with having feet. Though I do expect it’s not quite so binary, I want to explore an experiment of creating a modified version of the biped with some rough feet and actuators to see how it performs to the same environment set up.

I also want and need to better understand how values are passed around within the simulation, and if/where values are normalised, truncated, or hold no value. Spending some time reviewing the logs through tensor board versus the rewards and done criteria to better understand how I can tune and shape these.

Matter Evolved

Reinforcement Learning of Little Br0 – “First attempt”

Existing setups