Taking a bit of a break from the comms board prototype, I spent some time working on the IsaacLab reinforcement learning setup.
Up until now, the Isaac Lab setup imports the robot asset but doesn’t actually do anything of value, so it’s just falling over.
While it has not got to a stage to be able to say that it’s made any progress with walking or standing, progress has been made towards it. Understanding the skeleton of the `*_env.py` file has greatly improved, and having some at least arbitrary rewards and done criteria with logs from runs will help me shape the future iterations of it.
An amusing example of it after training:
Existing setups
I spent some time trying to understand how others had framed the different sections in their environments with the main focus being on commands, observations, and rewards. The locomotion/humanoid and Anymal C have been the primary influences shaping the current state. I do want to better understand the BD-X and Olaf robot training setups as they seem to use pre-defined gaits as some of the commands to train against (at least in some of their policies). The Berkeley Humanoid also seems like a good point of reference. At the various sections I’ve added a non-exhaustive list for comparing (hopefully accurate)
Creating biped_robot_b1_env.py
A lot of what I have in the environment has come from the locomotion (humanoid) and Anymal C environment set ups. Some of the approaches are relevant, some less so. I wanted this first pass to be built on those as a stepping stone to better my understanding of the different aspects and different trade-offs.
The setup
The environment is setup quite basic. Flat floor with the environments spaced a small distance apart.
The robot is brought in from the urdf turned usd file created from my engineering CAD.
To attempt to simplify and shrink the model I made a URDF assembly and created volumes of each of the links to use as approximations instead of a dozen or so parts for each. This helps keep it looking good in the sim but also means I can get a bit more out contact between parts and the ground.
Importing the urdf to usd does merge links with fixed joints together.
The commands being passed to the robot for the training are linear velocities in x, y, and z. x, and y values are generated from a normal distribution and z being set to zero. I’d like to control the robot from a controller, so mapping thumb stick movement to x and y velocities makes sense. These commands are normalised to unit length, which I want to better understand if this is hindering the training, that is should the range be a bit more than a single unit.
From a visualisation perspective, markers of the commanded velocity vector (red) and actual vector (blue) are created and repeatedly updated above each env/robot instance. While things are rough it lets me have an idea of how different the two values are with a quick glance when the UI is running, generally post training.
| Robot/System | Commands |
|---|---|
| Cartpole | No direct command |
| Locomotion (Humanoid) | Target of (1000, 0, 0) from the env |
| Anymal | – X/Y linear velocity – Yaw angular velocity |
| BD-X | Time-varying control input – one perpetual policy for standing while controlling the head and torso – one periodic policy for walking with separate head control – episodic policies to imitate a single animation – torso position – torso orientation – torso linear velocity – torso angular velocity |


Actions
Actions in the sim are currently positions relative to the default joint position. However, I think modifying this to be torques makes more sense, primarily from better reflection of my approach on the real hardware. The humanoid/locomotion env uses gear ratios to scale this up. I want to better understand the reasoning behind this.
Is it due to:
- truncated action value range
- don’t have to wait for actions to increase to torque magnitudes of 20 to 100+ so more efficient
- how energy use is modelled in rewards, i.e. decouple the gearbox and motor
- increased model flexibility
Observations
- Linear velocities in torso frame
- Angular velocities in torso frame
- Projection of up, i.e. how upright it is
- Joint positions, normalised within the movement limits
- Joint velocities
- Actions
- Root linear velocity in world frame
- Root angular velocity in world frame
- Commands
- Joint positions relative to default joint positions
- Torso position
There’s some overlap, largely due to not having gone through with sufficient detail on which aspects makes sense.
For example, how well do the normalised joint positions work when the default joint position is not halfway between the range limits?
| Robot/System | Observations |
|---|---|
| Cartpole | – Pole position – Pole velocity – Cart position – Cart velocity |
| Locomotion (Humanoid) | – Root/”Torso” Z-position (n, 1) – Velocities in torso/local frame – Angular velocities (scaled) in torso/local frame – Yaw – Roll – Angle to target – Up Projection (how upright it is) – Heading projection (how on target it is) – Joint position scaled – Joint velocity scaled – Potential – Actions |
| Anymal | – Root (COM) linear velocity of base frame – Root (COM) angular velocity of base frame – Gravity proj on base frame – Commands: x & y lin vel, and yaw ang vel – Joint positions – Default joint positions – Joint velocities – Height scanner, if AnymalCRough– Actions |
| BD-X | TBD |
Rewards and Penalties
The rewards are:
- Still alive
- How upright it is
- Tracking of commanded linear velocity
- Tracking of commanded yaw rate
The penalties are:
- If joints are effectively at the end of the allowable range, i.e. within 1% of either end
- If the torso is not level. I am thinking that this may be less relevant than on a quadruped
- High action rate, want smooth actions
- High joint acceleration rate, want to avoid jerky/high frequency movements
- Large torques
- Pitch and Roll velocities, i.e. non-zero
- Z-velocity, non-zero
- Energy cost, a product of actions (effectively torque pre-gearbox) and joint velocity which should reflect instantaneous power
The energy cost seems like one to remove for initial trials and brought in mid-way to begin to move towards something real. Similar to the inclusion of the torso’s level metric.
| Robot/System | Penalties and Rewards |
|---|---|
| Cartpole | + Length alive – Pole position – Cart velocity – Pole velocity – Terminated |
| Locomotion (Humanoid) | + Alive + Upwards + Heading – Death – Action cost – Energy cost – DoF near limits |
| Anymal | + Foot air time + yaw rate error small + velocity x, y error small – velocity z error – pitch and roll error – torque – acceleration – action rate – thigh contact with ground – body tilt |
| BD-X | – Torso position xy – Torso orientation – Linear velocity xy – Linear velocity z – Angular velocity xy – Angular velocity z – Leg joint positions – Neck joint positions – Leg joint velocities – Neck joint velocities – Contact – Joint torques – Joint accelerations – Leg action rate – Neck action rate – Leg action accel. – Neck action accel. – Survival |
An example of how the different metrics contribute to the overall score over time:

Env Termination
Currently there are three ways for the environment to end and be reset:
- Time out
- Contact with the base/torso
- Falling over
The current time out window is 15s. I don’t think this is currently terminating many environments but could be extended to 20s (like Anymal C) if it made sense.
Contact with the base is triggered with any collision, I’d love to spend a bit more time exploring to ensure there’s no false triggers or avoidable triggers (like the head just dropping onto it) which could possible be programmed to be excluded. When incorporating the contact sensor it did lead to some missed collisions in the CAD, both in the approximation (an easy fix with some cuts to the volumes) and also in the main GA (a bit more annoying to fix).
The falling over metric is great how it’s implemented in the locomotion/humanoid env but with the root position for the base starting at z=0 leads to a slightly different interpretation needed. In locomotion, when the body’s root is below a value it’s fallen over. With the biped USD, when the robot falls (pitches or rolls) the body’s root can’t really go down but will rotate up. Moving the position of the root seems like the logical move, while time intensive it does avoid unintuitive logic entering the simulation (a hindrance when returning to the code after a few months focused elsewhere).
| Robot/System | Done |
|---|---|
| Cartpole | – Cart position out of bounds – Pole position out of bounds – Timeout |
| Locomotion (Humanoid) | – Time out – Fallen over (torso height below threshold) |
| Anymal | – Time out – Base contacts the ground |
| BD-X | TBD |
Creating biped_robot_v1_env_cfg.py
The config file largely holds all the constants used, along with some specific configuration details for things like the contact sensors.
Some of these constants are:
- Penalty scalers
- Reward scalers
- Joint gear ratios
- Death cost
- Termination height for falling over
- Physics friction values
Next
One open question I have had since the initial robot design was the presence of an articulable foot, passive or active. Going through the rounds of changing rewards and weightings to see what they do did make me wonder if it would all be magically solved with having feet. Though I do expect it’s not quite so binary, I want to explore an experiment of creating a modified version of the biped with some rough feet and actuators to see how it performs to the same environment set up.
I also want and need to better understand how values are passed around within the simulation, and if/where values are normalised, truncated, or hold no value. Spending some time reviewing the logs through tensor board versus the rewards and done criteria to better understand how I can tune and shape these.



