
PhD Thesis: Aligning Vision-Language-Action Model Representations with Humans for Hierarchical Tasks (M/F) - 75

Job description

Institution: Institut Polytechnique de Paris, École nationale supérieure de techniques avancées
Doctoral school: École Doctorale de l'Institut Polytechnique de Paris
Research laboratory: U2IS - Unité d'Informatique et d'Ingénierie des Systèmes
Thesis supervisor: Sao Mai NGUYEN
Thesis start date: 2026-10-01

While advances in vision-language models are impacting robotics, where they are exploited for planning compositional tasks, they run up against the lack of embodiment of physical actions in LLMs and their poor long-horizon planning abilities for accomplishing compositional tasks. Another limitation of foundation models in robotics is the lack of massive datasets for learning embodied, multi-task action. Moreover, as task complexity increases, the required dataset size grows exponentially. Indeed, in open-ended learning, the set of tasks and the changes in the environment make it, by definition, impossible to learn from a predefined dataset, however large.

In this theoretical thesis, adopting the perspective of continual learning, we propose to tackle the limitation of predefined datasets with bio-inspired learning mechanisms:

- intrinsically motivated reinforcement learning to collect data efficiently (a sketch follows this list)

- hierarchical learning to leverage transfer from simple tasks in order to build increasingly complex tasks

- active imitation learning to exploit human expertise, in particular high-level compositions of tasks.
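To make the first mechanism concrete, here is a minimal, self-contained Python sketch of one common instantiation of intrinsic motivation (the dimensions and names are hypothetical, and this is an illustration, not the method proposed in the thesis): the prediction error of a learned forward model is used as an intrinsic reward, steering data collection toward poorly modelled transitions.

```python
import torch
import torch.nn as nn

STATE_DIM, ACT_DIM = 16, 4   # hypothetical dimensions

# Forward dynamics model: predicts the next state from (state, action).
# Its prediction error serves as an intrinsic "curiosity" reward that
# pushes data collection toward poorly modelled, information-rich
# transitions.
fwd = nn.Sequential(nn.Linear(STATE_DIM + ACT_DIM, 128), nn.ReLU(),
                    nn.Linear(128, STATE_DIM))
opt = torch.optim.Adam(fwd.parameters(), lr=1e-3)

def intrinsic_reward(state, action, next_state, beta=0.1):
    """Per-sample prediction error of the forward model; the mean error
    is also used as the model's training loss."""
    pred = fwd(torch.cat([state, action], dim=-1))
    err = ((pred - next_state) ** 2).mean(dim=-1)
    opt.zero_grad(); err.mean().backward(); opt.step()
    return beta * err.detach()   # add this to the task reward during RL
```

In the line of work this thesis builds on, intrinsic motivation is typically measured instead as empirical learning progress; a sketch of that variant, applied to choosing teachers and strategies, follows the SGIM paragraph further down.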

This thesis aims to lay the theoretical foundations for aligning task-adaptive, multimodal robotic foundation models with human representations, incorporating proprioception, vision, language, and self-supervised learning, so as to enable robots to generalize from primitive tasks and improve on compositional tasks, for open-ended learning in an embodied environment.

Foundation Models (FMs) [1] in Natural Language Processing (NLP) and Computer Vision (CV) refer to large-scale, pre-trained models that serve as a base for a wide range of downstream tasks. These models leverage massive datasets and self-supervised learning to develop general-purpose representations, which can then be fine-tuned for specific applications.

In NLP, Large Language Models (LLMs), such as GPT, PaLM, and LLaMA, have revolutionized the field by leveraging the transformer architecture, self-supervised pre-training paradigms and web-scale data. These models, often containing tens or hundreds of billions of parameters, demonstrate remarkable zero-shot and few-shot generalization capabilities across a diverse range of tasks, including dialogue systems, step-by-step reasoning, mathematical problem-solving, and code generation.
Although impressive, these foundation models lack sensorimotor skills, preventing them from directly perceiving, manipulating, or interacting with the physical world in an embodied manner. A natural and important question is whether these data-absorbent FMs can be extended to robotics and endowed with sensorimotor skills to interact with the real world. Such an extension could be a transformative leap in embodied AI, integrating high-level semantic understanding with low-level robotic control.

The objective is to propose a theoretical methodology for a Vision-Language-Action model that aligns with human representations of hierarchical tasks. This thesis will lay the theoretical groundwork to explore human representations of hierarchical tasks, and how to represent that hierarchy in Vision-Language-Action models.

As such, a rapidly accelerating line of research in AI and robotics concerns Foundation Models for Robotics (FMRs), with increasingly intensive efforts to integrate sensorimotor capabilities and multimodal learning to enhance robotic autonomy and real-world generalization. The latest FMRs, such as OpenVLA, π0, and Octo, represent a significant advance in generalist robotic capabilities by integrating vision-language-action (VLA) models with large-scale real-world robotic data. OpenVLA [5], a 7B-parameter model, is trained on 970,000 real robot demonstrations and outperforms larger models such as RT-2-X (55B) while remaining adaptable across multiple robotic platforms. π0, developed by Physical Intelligence, employs a flow-matching architecture built on vision-language models (VLMs) to enable smooth, real-time 50 Hz control across varied tasks, including laundry folding and grocery bagging. Octo, a diffusion-based policy trained on 800,000 trajectories, provides flexible task execution by accepting natural-language or goal-image inputs, and demonstrates high adaptability across nine robotic embodiments.
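To illustrate the flow-matching idea behind π0, here is a minimal sketch (the dimensions, network, and names below are hypothetical stand-ins, not π0's actual architecture): a velocity field is regressed on straight-line interpolations between noise and expert actions, then integrated at inference time to produce an action.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 32, 7   # hypothetical observation / action dimensions

# Velocity field v(a_t, t, obs): a small MLP standing in for the
# vision-language backbone that a real VLA model would use.
net = nn.Sequential(nn.Linear(ACT_DIM + 1 + OBS_DIM, 256), nn.ReLU(),
                    nn.Linear(256, ACT_DIM))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def train_step(obs, expert_action):
    """One flow-matching step: regress the velocity of the straight path
    from a noise sample a0 to the expert action a1."""
    a0 = torch.randn_like(expert_action)          # noise endpoint
    t = torch.rand(expert_action.shape[0], 1)     # random time in [0, 1]
    a_t = (1 - t) * a0 + t * expert_action        # point on the path
    target_v = expert_action - a0                 # constant path velocity
    pred_v = net(torch.cat([a_t, t, obs], dim=-1))
    loss = ((pred_v - target_v) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def sample_action(obs, steps=10):
    """Euler-integrate the learned velocity field from noise to an action."""
    a = torch.randn(obs.shape[0], ACT_DIM)
    for k in range(steps):
        t = torch.full((obs.shape[0], 1), k / steps)
        a = a + net(torch.cat([a, t, obs], dim=-1)) / steps
    return a

# usage (random stand-ins for a real dataset):
# loss = train_step(torch.randn(64, OBS_DIM), torch.randn(64, ACT_DIM))
# action = sample_action(torch.randn(1, OBS_DIM))
```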

While FMRs are mostly black-box models, the question of alignment with human representations of actions remains unaddressed, despite its importance for communicating with humans about actions or tasks, and for enhancing human-robot collaboration through coordination and turn-taking between humans and robots. Seurin et al. have explored how natural language can convey multiple sub-tasks by describing what the agent must accomplish, and showed that the efficiency of instructions in natural language can be increased by repeated interaction between the robot and the tutor. Interaction between the robot and the tutor thus seems key to aligning representations. However, FMRs that need massive data to train must be complemented with self-exploration. As part of the human-in-the-loop approach, the field of interactive learning assumes that a human can assist the robot by providing feedback and guidance and/or by demonstrating optimal actions. Whereas reinforcement learning and supervised (or imitation) learning have traditionally been opposed, we argue that the two worlds are on the contrary complementary, and we highlight the merits of merging the two fields. While an increasing stream of machine learning works proposes to combine the two paradigms, including reinforcement learning from human feedback, most often the agent in these works undergoes the interaction passively. We will refer to approaches in which the robot optimizes its interaction with tutors in an active way as active imitation learning: an imitation learning paradigm covering cases where the observer can influence the frequency and the value of the demonstrations it is shown.

For robot learning, Nguyen [2024] proposes to frame the interaction with human teachers as a reinforcement learning problem, enabling learning agents to learn multiple parametrised tasks by devising their own learning strategy: they actively choose what to learn and when to learn, and thus their own curriculum, as well as what, when, and whom to imitate. Reinforcement learning is therefore not only about an agent interacting with a physical environment but also with a social environment. Using intrinsic motivation as an active learning criterion, SGIM-ACTS learns several parametrised tasks by choosing its teachers and the timing of its requests, while SGIM-PB transfers knowledge between tasks by learning the hierarchical relationship between the parametrised tasks, showing an alignment with the human representation of task hierarchy.
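As a minimal sketch of this active choice (with hypothetical names and a toy competence measure; the actual SGIM-ACTS algorithm is considerably richer), the agent credits each learning strategy, whether autonomous exploration or a request to a given teacher, with the competence gain it recently produced, and samples strategies accordingly:

```python
import random
from collections import deque

# The agent arbitrates between learning strategies: exploring on its own
# or requesting a demonstration from one of several teachers. Each
# strategy is credited with the competence gain it recently produced, so
# a teacher is solicited only while imitating them remains more useful
# than autonomous exploration.
STRATEGIES = ["self_exploration", "ask_teacher_A", "ask_teacher_B"]
progress = {s: deque([1.0], maxlen=10) for s in STRATEGIES}  # optimistic init

def run_episode(strategy):
    """Stand-in for a learning episode: execute the strategy on the
    current task and return the measured competence gain."""
    return random.random()   # replace with before/after task performance

for episode in range(50):
    # sample a strategy in proportion to its mean recent progress
    weights = [sum(progress[s]) / len(progress[s]) for s in STRATEGIES]
    strategy = random.choices(STRATEGIES, weights=weights)[0]
    progress[strategy].append(run_episode(strategy))
```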

This thesis will bring a theoretical framework to lay out the formal theory for human alignment.
