PhD Position F/M Physically-Grounded Video Generation - 75

Job description

Inria
Paris - 75
Fixed-term contract (CDD)
Published on 16 October 2025

About Inria

Inria is the French national research institute for digital science and technology. It employs 2,600 people. Its 215 agile project teams, generally run jointly with academic partners, involve more than 3,900 scientists in meeting the challenges of digital technology, often at the interface with other disciplines. The institute draws on many talents in more than forty different professions. 900 research and innovation support staff help scientific and entrepreneurial projects with worldwide impact to emerge and grow. Inria works with many companies and has supported the creation of more than 200 start-ups. The institute thus strives to meet the challenges of the digital transformation of science, society and the economy.
PhD Position F/M Physically-Grounded Video Generation
The job description below is in English
Contract type: Fixed-term contract (CDD)

Required degree level: Master's degree (Bac +5) or equivalent

Role: PhD student

Desired experience level: Recent graduate

Context and advantages of the position

The PhD will be carried out at Inria in the Willow research team.

Assignment

Short Overview of the PhD Project:
This PhD thesis aims to enhance the physical consistency of current video generation
models by exploring various techniques to inject physics awareness into them.
PhD Project Description:
The motivation for this PhD thesis is to address a critical limitation in current video
generation models: their lack of consistency with the laws of physics. Although these models
are increasingly adept at generating high-quality content that can almost perfectly match
real-world scenes, their capabilities to effectively model the underlying laws governing
dynamic interactions remain limited [1,2,3,4,6]. Simple scenarios, such as object freefall, are
sufficient to demonstrate these limitations [3]. Improving these capabilities is a fundamental
step towards building more robust models that can function as true world simulators.
Proposed Research Directions:
Different approaches have been explored to overcome the aforementioned limitations. Some
works integrate 3D geometry and dynamics awareness as critical elements for generating
physically plausible videos [7]. Another interesting approach is model-based simulation
guidance, where physics engine simulations are used as an intermediate step to guide the
video generation process [4]. Furthermore, we consider post-training techniques to be
particularly promising. In [3], the authors present a two-stage post-training pipeline
consisting of self-supervised fine-tuning on high-quality data and an Object Reward
Optimization (ORO) phase. In [5], a novel framework called VideoREPA is proposed, which
distills physics understanding from video foundation models into text-to-video generation
models by aligning token-level relations.
Building on this, a primary direction for our research is the use of reasoning-capable models,
such as Large Language Models (LLMs) or Vision-Language Models (VLMs), to create
physically grounded scene descriptions that can guide the video generation process. We
hypothesize that this could be a direct way to transfer the reasoning capabilities of
understanding models to generative ones. Different settings and formats for this guidance,
from free-form text to more structured inputs, will be explored.
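To make the "more structured inputs" end of this spectrum concrete, here is a minimal Python sketch of what a structured, physically grounded scene description could look like once serialized for a generator. It is only an illustration: the class names, fields, and units (ObjectState, SceneDescription, mass_kg, expected_events, and so on) are hypothetical and are not taken from the project or from any cited work. In practice an LLM or VLM would be asked to fill in such a structure from a free-form caption.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Tuple


@dataclass
class ObjectState:
    """Hypothetical per-object physical attributes (SI units assumed)."""
    name: str
    material: str
    mass_kg: float
    position_m: Tuple[float, float, float]
    velocity_m_s: Tuple[float, float, float]


@dataclass
class SceneDescription:
    """Hypothetical structured guidance derived from a free-form caption."""
    caption: str
    gravity_m_s2: float = 9.81
    objects: List[ObjectState] = field(default_factory=list)
    expected_events: List[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        # Serialize the whole scene to JSON so it can be passed as conditioning text.
        return json.dumps(asdict(self), indent=2)


scene = SceneDescription(
    caption="A rubber ball is dropped onto a wooden table.",
    objects=[
        ObjectState("ball", "rubber", 0.05, (0.0, 1.0, 0.0), (0.0, 0.0, 0.0)),
    ],
    expected_events=[
        "ball accelerates downward",
        "ball bounces with decreasing height",
    ],
)
structured_prompt = scene.to_prompt()
print(structured_prompt)
```

A free-form variant would simply pass a rewritten caption instead of the JSON string; comparing the two formats is the kind of setting the paragraph above proposes to explore.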
Moreover, we aim to investigate post-training techniques based on physics-informed reward
methods, such as those presented in [3]. Given that this work focuses on the specific case of
object freefall, a logical first step is to extend this approach to more complex and diverse
physical scenarios.
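As an illustration of what a physics-informed reward can look like in the freefall case, the sketch below scores a tracked object trajectory against the analytic law h(t) = h0 + v0*t - g*t^2/2. This is not the ORO objective from [3]; it is a simplified stand-in under loud assumptions: object heights have already been extracted from the generated video by some tracker and converted to metres, the object is released at rest in the first frame, and the function names and the scale parameter are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def free_fall_heights(h0, v0, times, g=G):
    """Analytic heights h(t) = h0 + v0*t - 0.5*g*t^2 (upward is positive)."""
    return [h0 + v0 * t - 0.5 * g * t * t for t in times]


def free_fall_reward(tracked_heights, fps, g=G, scale=0.1):
    """Return a reward in (0, 1]; 1.0 means the track matches ideal free fall.

    Assumes the object starts at rest at the first tracked height.
    `scale` (metres) sets how quickly the reward decays with the RMS error.
    """
    if not tracked_heights:
        raise ValueError("tracked_heights must contain at least one frame")
    times = [i / fps for i in range(len(tracked_heights))]
    expected = free_fall_heights(tracked_heights[0], 0.0, times, g)
    mse = sum((a - b) ** 2 for a, b in zip(tracked_heights, expected)) / len(times)
    return math.exp(-math.sqrt(mse) / scale)


# Usage: a physically consistent drop versus an object sinking at constant speed.
fps = 24
times = [i / fps for i in range(12)]
plausible = free_fall_heights(2.0, 0.0, times)
floaty = [2.0 - 0.5 * t for t in times]  # no acceleration: implausible
print(free_fall_reward(plausible, fps), free_fall_reward(floaty, fps))
```

Extending such a reward beyond freefall would require a different reference model for each scenario (collisions, rolling, fluids), which is one reason the generalization mentioned above is non-trivial.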
During the PhD thesis, the initial research directions will be adapted based on the evolution
of the field and the insights obtained during experimentation.
Evaluation and Benchmarking:
Recent benchmarks such as VideoPhy-2 [1], Phy-World [2], and PISA [3] are valuable
resources for measuring our contributions. However, a key part of this project will also
involve identifying the limitations of current benchmarks. Consequently, designing novel
tasks and evaluation strategies to better assess physical plausibility presents an additional
opportunity for contribution within this PhD project.
References:
[1] VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video
Generation
H. Bansal, C. Peng, Y. Bitton, R. Goldenberg, A. Grover, K. W. Chang
[2] How Far is Video Generation from World Model: A Physical Law Perspective
B. Kang, Y. Yue, R. Lu, Z. Lin, Y. Zhao, K. Wang, G. Huang, J. Feng
[3] PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by
Watching Stuff Drop
C. Li, O. Michel, X. Pan, S. Liu, M. Roberts, S. Xie
[4] PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation
S. Liu, Z. Ren, S. Gupta, S. Wang
[5] VideoREPA: Learning Physics for Video Generation through Relational Alignment with
Foundation Models
X. Zhang, J. Liao, S. Zhang, F. Meng, X. Wan, J. Yan, Y. Cheng
[6] MotionCraft: Physics-based Zero-Shot Video Generation
L. S. Aira, A. Montanaro, E. Aiello, D. Valsesia, E. Magli
[7] Towards Physical Understanding in Video Generation: A 3D Point Regularization
Approach
Y. Chen, J. Cao, A. Kag, V. Goel, S. Korolev, C. Jiang, S. Tulyakov, J. Ren

Main activities

Analyse and implement related work.
Design novel solutions.
Write progress reports and papers.
Present work at conferences.

Skills

Technical skills and level required: programming skills.

Languages: English and possibly French.

Relational skills: good communication skills.

Benefits

- Subsidized meals
- Partial reimbursement of public transport costs
- Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
- Possibility of teleworking and flexible organization of working hours
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports events and activities
- Access to vocational training
- Social security coverage