PhD Thesis: Learning HJB Solutions for Continuous-Time Reinforcement Learning (M/F) - Doctorat.Gouv.Fr
- Fixed-term contract (CDD)
Missions of the position
Institution: Université Paris-Saclay, GS Informatique et sciences du numérique
Doctoral school: Sciences et Technologies de l'Information et de la Communication
Research laboratory: Laboratoire Interdisciplinaire des Sciences du Numérique
Thesis supervisor: Matthieu KOWALSKI
Thesis start date: 2026-10-01
Application deadline: 2026-05-12, 23:59:59

Despite the remarkable advances of artificial intelligence in domains such as games, natural language processing, and computer vision, applying it to continuous-time dynamical systems - the foundation of robotics, autonomous navigation, and energy management - remains a major challenge. These systems, governed by complex, often nonlinear dynamics evolving in continuous time, require control policies that are both optimal and robust to uncertainty, noise, and irregular interactions. Reinforcement learning (RL) methods, although successful in discrete-time environments (such as turn-based games), struggle to adapt to continuous time. The reasons are manifold: the algorithmic complexity that arises as the time step tends to zero, the difficulty of correctly attributing rewards, and the sensitivity to the choice of temporal discretization.
The Hamilton-Jacobi-Bellman (HJB) equation offers a solid theoretical framework for optimal control. It generalizes the Bellman equation, a pillar of RL, to continuous time, expressing the optimal value function as the solution of a partial differential equation (PDE) from which optimal control policies can be derived. However, solving the HJB equation for high-dimensional state spaces remains out of reach of classical numerical methods.
Physics-Informed Neural Networks (PINNs) have recently opened new perspectives by approximating the solutions of PDEs through deep learning. Yet their application to the HJB equation raises specific challenges: (i) the absence of smooth solutions may require viscosity solutions to guarantee uniqueness; (ii) training is extremely sensitive to the choice of optimizer and of sampling strategies; (iii) model dependence limits their applicability to systems whose dynamics are known. On top of this, the non-uniqueness of solutions and the curse of dimensionality further complicate the design of scalable and robust control policies.
This thesis project addresses these challenges by proposing a unified framework for continuous-time reinforcement learning, building on the HJB equation as its mathematical foundation while overcoming the limitations of existing PINN-based approaches. The goal is to design scalable, model-based control policies that operate efficiently in continuous time and handle high-dimensional state spaces.

The Hamilton-Jacobi-Bellman (HJB) equation is a cornerstone for addressing optimal control problems in continuous time, bridging the gap between classical control theory and modern reinforcement learning. In the context of continuous-time dynamical systems, the HJB equation provides a mathematical formulation for the optimal value function
V(s), which quantifies the cumulative reward achievable from any given state s, integrated over an infinite horizon with discounting. This formulation is particularly suited to systems where actions and states evolve continuously, such as in robotics, autonomous systems, and energy management, where discrete-time assumptions often fail to capture the underlying dynamics accurately. The HJB equation is derived from the principle of optimality, stating that the optimal policy at any state is the one that maximizes the sum of the immediate reward and the discounted future rewards, weighted by the system's dynamics. Solving the HJB equation yields not only the optimal value function but also the corresponding optimal control policy, which is critical for designing high-performance control systems.
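For reference, one common way to write these objects (the notation, including the discount rate rho and the dynamics f(s, a), is an assumption here and is not fixed by the announcement) is:

```latex
% Value function: discounted reward accumulated from state s under dynamics
% \dot{s} = f(s, a) with instantaneous reward r(s, a) and discount rate \rho > 0.
V(s) \;=\; \max_{a(\cdot)} \int_{0}^{\infty} e^{-\rho t}\, r\bigl(s(t), a(t)\bigr)\, dt,
\qquad s(0) = s, \quad \dot{s}(t) = f\bigl(s(t), a(t)\bigr).

% HJB equation obtained from the principle of optimality, and the induced policy:
\rho\, V(s) \;=\; \max_{a} \Bigl[ r(s, a) + \nabla V(s) \cdot f(s, a) \Bigr],
\qquad \pi^{*}(s) \;\in\; \arg\max_{a} \Bigl[ r(s, a) + \nabla V(s) \cdot f(s, a) \Bigr].
```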
Despite its theoretical elegance, solving the HJB equation in practical settings presents significant challenges. Classical numerical methods, such as finite difference or finite element discretization, struggle with high-dimensional state spaces, rendering them computationally intractable for most real-world applications. To overcome this limitation, recent work has turned to deep learning techniques, particularly Physics-Informed Neural Networks (PINNs) [Raissi et al., 2019], which approximate solutions to PDEs by minimizing the residual of the governing equations. PINNs have demonstrated the ability to solve a wide range of PDEs, including the HJB equation, by leveraging neural networks to represent the value function and automatically differentiating through the PDE constraints.
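As a concrete illustration of this residual-minimization idea, here is a minimal sketch (not the method to be developed in this thesis): a PINN-style loss for the infinite-horizon HJB equation of a toy one-dimensional system with dynamics ds/dt = a and reward r(s, a) = -s^2 - a^2, for which the inner maximization over actions is available in closed form. The network architecture, discount rate, sampling domain, and names such as ValueNet are illustrative assumptions.

```python
# Minimal, illustrative PINN-style sketch for a 1-D HJB problem (assumptions:
# toy dynamics ds/dt = a, reward r = -s^2 - a^2, discount rate rho).
import torch
import torch.nn as nn

rho = 0.1            # discount rate (assumed)
n_collocation = 256  # collocation states sampled per optimization step

class ValueNet(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )
    def forward(self, s):
        return self.net(s)

V = ValueNet()
opt = torch.optim.Adam(V.parameters(), lr=1e-3)

for step in range(2000):
    # Collocation points sampled uniformly in the state domain [-2, 2].
    s = (4 * torch.rand(n_collocation, 1) - 2).requires_grad_(True)
    v = V(s)
    dV = torch.autograd.grad(v.sum(), s, create_graph=True)[0]
    # For r = -s^2 - a^2 and ds/dt = a, the inner maximization over a is
    # closed form: a* = dV/2, so max_a [r + dV * a] = -s^2 + dV^2 / 4.
    residual = rho * v - (-s**2 + dV**2 / 4)
    loss = (residual**2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```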
However, applying PINNs to the HJB equation introduces several challenges. In particular, the non-uniqueness of solutions is a significant obstacle: the HJB equation can admit infinitely many generalized solutions. To address this, the concept of viscosity solutions [Crandall & Lions, 1983] was introduced, ensuring that the value function corresponds to a unique viscosity solution. In practice this calls for techniques such as vanishing viscosity, where a small viscosity term proportional to the Laplacian of the value function is added to the HJB equation and the solution is recovered in the limit as this term goes to zero. This approach has been successfully applied to various control tasks, such as the inverted pendulum, cartpole, and acrobot systems, where PINN-based methods like HJBPINNs [Shilova et al., 2024] have shown promise in learning effective control policies. While vanishing viscosity stabilizes training, it introduces additional hyperparameters and computational overhead, complicating the optimization process. Moreover, PINNs in general are highly sensitive to hyperparameters, including the choice of optimizer and of training points, which can significantly affect the quality and generalization of the learned solution. Finally, the training data for PINNs are typically either predefined or adaptively sampled, in contrast with the rollout-based data collection of reinforcement learning, which further complicates integration with RL paradigms.
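Schematically, the vanishing-viscosity regularization mentioned above replaces the first-order HJB equation by a second-order PDE (notation assumed, with a small regularization weight epsilon > 0):

```latex
\rho\, V_{\varepsilon}(s) \;=\; \max_{a} \Bigl[ r(s, a) + \nabla V_{\varepsilon}(s) \cdot f(s, a) \Bigr]
\;+\; \varepsilon\, \Delta V_{\varepsilon}(s),
\qquad V \;=\; \lim_{\varepsilon \to 0} V_{\varepsilon},
```

where the limit recovers the viscosity solution of the original equation.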
Recent work has explored policy-dependent formulations of the HJB equation, such as PINN Policy Iteration (PINN PI) [Meng et al., 2024], which alternates between policy evaluation and policy improvement steps, drawing parallels with classical policy iteration in discrete-time RL. This approach can also be seen as solving a sequence of parametrized PDEs in which the policy functions serve as parameters, so it likewise relies on a curriculum-learning optimization process.
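Schematically, this alternation can be written as follows (a generic rendering of continuous-time policy iteration under the notation assumed above, not a verbatim reproduction of the cited paper's formulation):

```latex
% Policy evaluation: for a fixed policy \pi_k, solve the (linear) PDE
\rho\, V^{\pi_k}(s) \;=\; r\bigl(s, \pi_k(s)\bigr) + \nabla V^{\pi_k}(s) \cdot f\bigl(s, \pi_k(s)\bigr).

% Policy improvement: update the policy greedily with respect to V^{\pi_k}
\pi_{k+1}(s) \;\in\; \arg\max_{a} \Bigl[ r(s, a) + \nabla V^{\pi_k}(s) \cdot f(s, a) \Bigr].
```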
Another critical limitation lies in the model dependency of PINN-based HJB solvers. These methods require prior knowledge of the system dynamics and the reward function, restricting their applicability to environments where such models are available. A few papers, such as [Yildiz et al., 2021] and [Treven et al., 2023], have considered model learning in a continuous-time context, providing a proof of concept for model-based continuous-time reinforcement learning without investigating the best policy-learning algorithm. The integration of model learning with HJB-based policies remains under-studied.
In summary, while the HJB equation provides a powerful framework for continuous-time optimal control and reinforcement learning, its practical application is hindered by the lack of scalable algorithms that can solve this equation for general nonlinear dynamical systems. PINNs offer a promising avenue for approximating HJB solutions, but their effectiveness is constrained by the need for known dynamics, the non-uniqueness of generalized solutions of the HJB equation, and the complexities of training and generalization. Addressing these limitations is therefore crucial for advancing the field of continuous-time reinforcement learning and unlocking its potential in real-world applications.

This project pursues three main objectives. Objective 1 aims to improve existing PINN-based methods for solving the HJB equation when the dynamics and rewards of the environment are known. Existing work has outlined several challenges to be addressed, in particular how to design scalable and efficient curriculum-learning optimization schemes for the parametrized PDEs that arise in value-based and policy-based HJB formulations. A promising approach is to adopt a functional perspective on PINN optimization, akin to natural-gradient methods. For example, implicit curriculum learning performs neural-network updates that are aware of shifts in the PDE and can thus benefit the HJB setting as well. To further scale PINN optimization, we will also study how to apply adaptive sampling techniques without degrading the curriculum-learning process. Careful optimization and adaptive sampling should yield efficient and scalable training schemes that can then be used within continuous-time RL algorithms.
Objective 2 studies how to integrate the PINN-based strategies developed in Objective 1 into a model-based RL framework. To learn the model, previous work has shown the potential of neural ordinary differential equation (NODE) models for approximating continuous-time dynamics. Since a NODE explicitly learns the dynamics function f(s, a), it can be plugged directly into the HJB equation, which in turn yields a policy. This serves as a basis for model-based continuous-time reinforcement learning. However, it raises additional challenges: how to perform efficient exploration (data collection) in continuous time, and how to account for model errors when solving the HJB equation.
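As a minimal sketch of this model-learning step (assumptions: a toy 1-D system, fixed-step Euler integration, and hypothetical names such as DynamicsNet and integrate; a full NODE treatment would typically use an adaptive ODE solver), one could fit a dynamics network f_theta(s, a) to irregularly sampled transitions and later substitute it for the true dynamics inside an HJB residual such as the one sketched earlier:

```python
# Minimal sketch of continuous-time dynamics-model learning on a toy 1-D system.
import torch
import torch.nn as nn

class DynamicsNet(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

f_theta = DynamicsNet()
opt = torch.optim.Adam(f_theta.parameters(), lr=1e-3)

def integrate(s, a, dt, n_steps=4):
    # Fixed-step Euler integration of ds/dt = f_theta(s, a) over horizon dt,
    # holding the action constant (zero-order hold).
    h = dt / n_steps
    for _ in range(n_steps):
        s = s + h * f_theta(s, a)
    return s

for step in range(1000):
    # Synthetic transitions from the true system ds/dt = a (a stand-in for
    # rollouts collected from a real environment), with irregular step sizes.
    s0 = torch.randn(128, 1)
    a = torch.randn(128, 1)
    dt = 0.05 + 0.1 * torch.rand(128, 1)
    s1 = s0 + dt * a  # exact next state for ds/dt = a
    loss = ((integrate(s0, a, dt) - s1) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```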
Finally, while continuous-time policies are the most suitable for high-frequency control systems, physical systems are often subject to irregular observations and controls, so a good policy should be able to adapt to variable control frequencies. Objective 3 will therefore explore how to adapt continuous-time RL strategies to arbitrary time discretizations.
Desired profile
- Machine learning (including deep learning)
- Applied mathematics skills: probability and statistics, optimization
- (Optional but appreciated): optimal control, reinforcement learning, partial differential equations, functional analysis, physics-informed neural networks