January 2025
2MR Report
Author: Eymeric Chauchat
Version: V1.0 (7 Jan 2025)
Previous report - No next report yet!
Goal and motivation of the report
This report is a direct follow-up to the December 2024 monthly report. It aims at providing insight into the results produced over the past month. Like the other reports of this kind, it is structured as follows, after a short introduction:
- A theoretical part showcasing the goal of the report
- A practical part highlighting some valuable knowledge about the code structure for a future student on this project
- A results part presenting the results obtained recently, with some explanation.
Introduction
This report is the first one to showcase results on the Reinforcement Learning part of the project Reinforcement Learning with Lagrangian sparse identification of non-linear dynamics. In order to keep track of every new addition to this framework, a detailed look at the code used will be presented. Finally, the preliminary results will be shown.
For the IROS conference, real-robot demonstrations are highly valuable. The goal is therefore to include a real-robot experiment in the paper, so work towards convergence on the double pendulum experiment is still ongoing.
It can be noted that this report does not yet detail the complete XL-SINDy framework; it focuses on setting up PPO on environments generated from a Lagrangian.
Theoretical advancement
In the past month, the goal has been to switch from pure Lagrangian-SINDy experiments to a Reinforcement Learning framework. It was decided to build the reinforcement learning framework from the ground up in order to understand the nature of Reinforcement Learning deeply. @huangCleanRLHighqualitySinglefile2021 was chosen as the knowledge base for implementing the core of the PPO algorithm; in this sense, much of the code has been taken directly from the CleanRL repository after being understood.
For the first part of the Reinforcement Learning framework, an environment based on RK4 integration of a Lagrangian formula has been designed: a Lagrangian formula is supplied to the algorithm and a complete environment is generated; the actions on the different joints are the input of the step function, which returns the positions and velocities of the different parts of the system. This first generic Reinforcement Learning building block has proven successful on the double pendulum swing-up movement when provided with an adequate reward function. Details about the reward function are given in the practical part of the report.
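As an illustration of this integration scheme, a minimal sketch of one RK4 step is given below; the function and argument names (`rk4_step`, `dynamics`, the stacked state layout) are assumptions made for illustration and not the exact code of the repository.

```python
def rk4_step(dynamics, state, action, dt):
    """Classical fourth-order Runge-Kutta step.

    `dynamics(state, action)` is assumed to return the time derivative of the
    state (joint velocities and accelerations derived from the Lagrangian);
    `state` stacks joint positions and velocities, `action` the joint torques.
    States are expected to be numpy arrays so that the arithmetic is vectorized.
    """
    k1 = dynamics(state, action)
    k2 = dynamics(state + 0.5 * dt * k1, action)
    k3 = dynamics(state + 0.5 * dt * k2, action)
    k4 = dynamics(state + dt * k3, action)
    return state + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```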
Practical advancement
The objects introduced in the first part will be broken down in detail in order to give an overview of the code enabling these new features. Most of the programming effort has shifted to the reinforcement-learning-XLSindy repository. A more research- and experimentation-oriented coding style has been chosen in order to enable fast development of new features.
Concerning the Reinforcement Learning features, multiple scripts have been developed.
Environment design
Even if usual environments are still used to act on the complete framework (for example, a MuJoCo environment is generated in order to simulate the full-order system), a new class of environment has been developed.
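As a rough idea of what this new class of environment could look like, the sketch below wraps the Lagrangian dynamics and the RK4 step from the previous sketch into a step/reset interface; the class name, constructor arguments and reward hook are hypothetical and only meant to illustrate the design described above.

```python
import numpy as np


class LagrangianRK4Env:
    """Hypothetical sketch of an environment generated from a Lagrangian model.

    `acceleration_fn` is assumed to be produced beforehand from the symbolic
    Lagrangian (e.g. with sympy.lambdify) and maps (q, q_dot, action) to q_ddot.
    """

    def __init__(self, acceleration_fn, n_joints, dt=0.01, reward_fn=None):
        self.acceleration_fn = acceleration_fn
        self.n_joints = n_joints
        self.dt = dt
        self.reward_fn = reward_fn
        self.state = np.zeros(2 * n_joints)  # [positions, velocities]

    def reset(self, initial_state=None):
        self.state = np.zeros(2 * self.n_joints) if initial_state is None else np.asarray(initial_state, dtype=float)
        return self.state.copy()

    def _dynamics(self, state, action):
        q, q_dot = np.split(state, 2)
        q_ddot = self.acceleration_fn(q, q_dot, action)
        return np.concatenate([q_dot, q_ddot])

    def step(self, action):
        # One RK4 integration step of the Lagrangian dynamics
        # (rk4_step is the function sketched in the theoretical part).
        self.state = rk4_step(self._dynamics, self.state, np.asarray(action, dtype=float), self.dt)
        q, q_dot = np.split(self.state, 2)
        reward = self.reward_fn(q, q_dot, action) if self.reward_fn is not None else 0.0
        return self.state.copy(), reward
```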
Agent design
The agent code is entirely taken from CleanRL (this type of agent can also be found in most other Reinforcement Learning frameworks); it can be found in rl_util/agent.py. For the kind of systems dealt with in this project, there is no clear need for another type of neural network architecture.
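For reference, the CleanRL continuous-action PPO agent is essentially the actor-critic pair of small MLPs below (shown here with explicit `obs_dim`/`action_dim` arguments instead of CleanRL's vectorized-environment argument; the version in rl_util/agent.py may differ in such details).

```python
import numpy as np
import torch
import torch.nn as nn
from torch.distributions.normal import Normal


def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    # Orthogonal weight initialisation, as used throughout CleanRL.
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer


class Agent(nn.Module):
    def __init__(self, obs_dim, action_dim):
        super().__init__()
        # Critic: state-value estimate used in the PPO advantage computation.
        self.critic = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 1), std=1.0),
        )
        # Actor: mean of a diagonal Gaussian policy with a state-independent log-std.
        self.actor_mean = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, action_dim), std=0.01),
        )
        self.actor_logstd = nn.Parameter(torch.zeros(1, action_dim))

    def get_value(self, x):
        return self.critic(x)

    def get_action_and_value(self, x, action=None):
        action_mean = self.actor_mean(x)
        action_std = torch.exp(self.actor_logstd.expand_as(action_mean))
        dist = Normal(action_mean, action_std)
        if action is None:
            action = dist.sample()
        return action, dist.log_prob(action).sum(1), dist.entropy().sum(1), self.critic(x)
```

The diagonal Gaussian policy with a state-independent log standard deviation is the standard choice for continuous control and is sufficient for the low-dimensional systems considered here.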
Reinforcement learning scripts
The base PPO, mainly taken from CleanRL, lives in test/ppo_continuous_fixed_rk4_env.py; any agent created by this script can be tested and visualised in test/environment_test_agent.py. The whole setup uses TensorBoard to visualise experiments during training and to analyse results afterwards.
These scripts are launched in a parametrized manner from the command line.
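The exact command is not reproduced in this version of the report. As a purely illustrative placeholder, assuming CleanRL-style argument names (an assumption, not the actual flags of the script), an invocation could look like:

```bash
# Hypothetical example: flag names follow CleanRL conventions and may not
# match the actual arguments of ppo_continuous_fixed_rk4_env.py.
python test/ppo_continuous_fixed_rk4_env.py \
    --exp-name double_pendulum_swing_up \
    --seed 1 \
    --total-timesteps 1000000
```

During and after training, the TensorBoard logs can then be inspected with `tensorboard --logdir runs` (assuming the CleanRL default log directory).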
Reward function engineering
The attention for this report has been focused on the double pendulum experiment, since the robot is available in the lab. Other systems are under development.
Double pendulum experiment
A PPO agent and critic have been trained to fulfil multiple control tasks:
- double action, inverted pendulum control (recovery from disturbance)
- double action, pendulum swing up
- single action, pendulum swing up
For these tasks, a complex reward function has been designed. Since PPO is not the most exploratory algorithm in Reinforcement Learning, this reward function has been made to guide the agent.
A detailed explanation of this reward function is given below:
An upward reward term: rewards the upright position of the pendulum
$$\text{upward\_reward} = - \left|1 - \frac{\theta_1}{\pi}\right| - \left|1 - \frac{\theta_2}{\pi}\right| + 2$$
For complex experiments involving balancing a non-actuated pole, prioritizing one term over the other is still being tested. This reward is also called the goal_state in the experiments tracking the performance of the agent for a human reviewer.
An action penalty: penalizes the agent for taking large actions
$$\text{action\_penalty} = - a_1^2 - a_2^2$$
A velocity penalty: penalizes the agent for letting the system acquire high velocities
$$\text{velocity\_penalty} = - v_1^2 - v_2^2$$
An energy reward: encourages the agent to accumulate energy (potential and kinetic) in order to swing up. It predominates in the early stages in order to drive the learning of the swing movement.
$$E_{\text{pot}} = l_{1}\,m_{1}\,g\,\bigl(1 - \cos(\theta_{1})\bigr) + l_{2}\,m_{2}\,g\,\bigl(1 - \cos(\theta_{2})\bigr)$$
$$E_{\text{kin}} = \frac{m_{1}\,l_{1}\,v_{1}^{2}}{2} + \frac{m_{2}\,l_{2}\,v_{2}^{2}}{2}$$
$$ E_{\text{total}} = E_{\text{pot}} + E_{\text{kin}} $$
$$E_{\text{ratio}} = \frac{E_{\text{total}}}{E_{\text{pot,max}}}$$
$$E_{\text{reward}} = -\bigl|E_{\text{ratio}} - E_{\text{max}}\bigr| + E_{\text{max}}$$
The different constants can be discarded. $E_{\text{pot,max}}$ is the potential energy of the pendulum in the upright position, and $E_{\text{max}}$ (the max_energy hyperparameter) needs to be chosen taking the velocity friction forces into account, since velocity friction dissipates some of the total energy, which requires aiming above the theoretical limit.
The whole energy reward $E_{\text{reward}}$ is mostly discarded when $\text{upward\_reward} > 0.7$, in order to enter a control phase instead of a movement-determination phase.
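Putting the terms above together, the reward computation could be sketched as follows; the equal weighting of the terms and the hard switch on the energy reward are simplifying assumptions for illustration, and the actual implementation may combine them differently.

```python
import numpy as np


def double_pendulum_reward(theta, theta_dot, action, params, e_max):
    """Sketch of the shaped reward described above.

    theta, theta_dot, action: arrays of size 2 (angles, angular velocities, torques).
    params: (l1, m1, l2, m2, g) of the pendulum model.
    e_max: hyperparameter bounding the targeted energy ratio.
    """
    l1, m1, l2, m2, g = params

    # Upward reward: maximal (= 2) when both links point upward (theta = pi).
    upward_reward = -abs(1 - theta[0] / np.pi) - abs(1 - theta[1] / np.pi) + 2

    # Penalties on large torques and large velocities.
    action_penalty = -np.sum(np.asarray(action) ** 2)
    velocity_penalty = -np.sum(np.asarray(theta_dot) ** 2)

    # Energy reward: encourage accumulating energy up to e_max times the
    # potential energy of the upright position.
    e_pot = l1 * m1 * g * (1 - np.cos(theta[0])) + l2 * m2 * g * (1 - np.cos(theta[1]))
    e_kin = 0.5 * m1 * l1 * theta_dot[0] ** 2 + 0.5 * m2 * l2 * theta_dot[1] ** 2
    e_pot_max = 2 * (l1 * m1 * g + l2 * m2 * g)  # potential energy when fully upright
    e_ratio = (e_pot + e_kin) / e_pot_max
    energy_reward = -abs(e_ratio - e_max) + e_max

    # Once the pendulum is close to upright, switch from energy pumping to control.
    if upward_reward > 0.7:
        energy_reward = 0.0

    return upward_reward + action_penalty + velocity_penalty + energy_reward
```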
Result
The results are divided into two clear parts: the first features solely results on Reinforcement Learning tasks concerning the double pendulum robot experiment; the second features results on other, classical Reinforcement Learning environments.
These two parts serve two clear purposes:
- The goal of the first one is to run the complete framework on the real double pendulum robot and to achieve different control tasks with the fewest possible full-order interactions.
- The goal of the second one is to demonstrate the broad adaptability of the XL-SINDy Reinforcement Learning method.
In the analysis of these results, it should be kept in mind that PPO is mainly an algorithm made to converge on highly complex problems; even if it converges slowly on simple problems, it also converges on far more complex ones, which is deeply impressive.
Auxiliary result and PPO observation
Before diving into the results, a list of observations about PPO on the different systems studied has been written here: PPO hyperparameter observation
Double Pendulum
Concerning this system, multiple tasks and sub-tasks have been defined:
- Swing-up task:
  - Double action: the double pendulum swings up using both of its motors
  - Single action: the double pendulum swings up using only one of its motors
- Disturbance control:
  - Double action: the double pendulum stabilises itself in the upward position using both of its motors
  - Single action: the double pendulum stabilises itself in the upward position using only one of its motors
From this hierarchy of tasks, it can be observed that even though disturbance control can be solved by a classical control scheme (in both the single- and double-actuated scenarios), classical control usually fails at the swing-up task, due to the distance between the goal and the initial state.
Disturbance control - Double action task
Usually, double-action tasks are solved in less than one million timesteps and offer impressive accuracy while not resorting to infeasible actions (the action penalty refines the action trajectory once the initial goal has been rewarded).
Disturbance control - Single action task
Even though this task is theoretically more complex, it has been solved without much more trouble than the double-action one.
Swing-up Double action task
Double-action swing-up tasks are much more difficult to complete. Over multiple experiments it can be seen that, after a first stalling phase, the return spikes a first time, but at the price of a strongly reduced episode length. This means that the pendulum finds a way to swing up successfully but without stabilising itself, which terminates the simulation early. Usually, after this first phase, the episodic return decreases while the episode length gradually increases until reaching a first "maximum". From then on, the algorithm mainly reduces the action and velocity penalties until the end of training.
Usually the task is solved in around 1 million steps.
Swing-up Single action task
This task has not yet been solved with the RL-XL-SINDy framework.
The main reason why this task remains unsolved for the moment, even though it can be solved by PPO, is that the RK4-based environment is not yet parallelised. On average, 150 steps per second are rolled out during an experiment. In comparison, systems that converge on this task need hundreds of millions of steps, which can only be achieved through the parallel rollout of hundreds or thousands of environments (meaning at least 150,000 steps per second).
Note: according to a YouTube video about AI, the double pendulum is on average solved with PPO in 8.5 days of training, while 1 million steps usually corresponds to 0.25 days of training.
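If the environment were exposed through the Gymnasium interface (an assumption, since the current environment class is custom), the standard way to obtain such parallel rollouts would be a vectorized environment, sketched below with a stand-in environment id:

```python
import gymnasium as gym


def make_env(env_id, seed):
    def thunk():
        env = gym.make(env_id)
        env.reset(seed=seed)
        return env
    return thunk


if __name__ == "__main__":
    # Each sub-environment runs in its own process, so N workers multiply the
    # simulated steps per second roughly by N ("Pendulum-v1" is only a stand-in
    # for the custom Lagrangian environment).
    envs = gym.vector.AsyncVectorEnv([make_env("Pendulum-v1", seed=i) for i in range(16)])
    obs, infos = envs.reset()
    print(obs.shape)  # (16, observation_dim)
```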
Difficulties undertaken
As demonstrated by the last result, even though the whole framework is well suited to solving the intended problem, some optimisation issues have been raised. These purely implementation-related issues should not be the bottleneck of the framework and will be solved as soon as possible.
Secondly, it can be noted that XL-SINDy has not yet been integrated into the framework, but this should not raise much concern, since the reinforcement learning part is developed as a continuation of the XL-SINDy library.
Conclusion
In conclusion, even though the time constraints are really tight (a paper draft containing as many results as possible should be written for the first of February), nothing is lost yet, and hopefully the XL-SINDy reinforcement learning framework will be ready in time.
The main difficulties encountered in the last few days fall into my main domain of expertise, namely parallel computing and high-performance programming, and these issues should be solved soon.
If no other "big issues" are encountered while linking XL-SINDy and reinforcement learning, full framework results will be available soon.
References
- Huang, S., Dossa, R. F. J., Ye, C. & Braga, J. CleanRL: High-Quality Single-File Implementations of Deep Reinforcement Learning Algorithms. (arXiv, 2021). doi:10.48550/arXiv.2111.08819.