PhD Position F/M: Toward Grounded, Consistent, and Temporally Faithful Video Reasoning - INRIA
- Fixed-term contract (CDD)
- INRIA
Job description
About Inria
Inria is the French national research institute for digital science and technology. It employs 2,600 people. Its 215 agile project teams, generally run jointly with academic partners, involve more than 3,900 scientists in tackling digital challenges, often at the interface with other disciplines. The institute draws on a wide range of talent across more than forty different professions. 900 research and innovation support staff help scientific and entrepreneurial projects with real-world impact to emerge and grow. Inria works with many companies and has supported the creation of more than 200 start-ups. The institute thus strives to meet the challenges of the digital transformation of science, society, and the economy.
PhD Position F/M Toward Grounded, Consistent, and Temporally Faithful Video Reasoning
The job description below is in English
Contract type: Fixed-term contract (CDD)
Required degree level: Master's degree (Bac +5) or equivalent
Role: PhD student
Desired level of experience: Recent graduate
Assignment
Recent video multimodal large language models (Video-MLLMs) have achieved strong results on standard benchmarks, yet remain systematically unreliable on tasks requiring temporally consistent, spatially grounded reasoning. Video-LLMs achieve near-chance consistency (50%) in temporal grounding even after task-specific fine-tuning; they hallucinate actions, temporal sequences, and scene transitions at high rates; and they perform close to random on 4D spatiotemporal tasks (GPT-4o: 57.5% vs. 98.8% human) and multi-object dynamic spatial reasoning.
These failures are structural: current systems compress all perceptual history into a flat token sequence and ask the language model to simultaneously act as the archive of what happened and the reasoner about what it means. These are architecturally distinct operations, and conflating them in a single attention pass makes temporal inconsistency, hallucination, and spatial failure modes unavoidable by design. This PhD addresses the design of an explicit memory and state space to improve long-video reasoning.
Main activities:
Analyse and implement related work.
Design novel solutions.
Write progress reports and papers.
Present work at conferences.
Skills
Technical skills and level required: programming skills.
Languages: English and possibly French.
Relational skills: good communication skills.
Benefits
- Subsidized meals
- Partial reimbursement of public transport costs
- Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
- Possibility of teleworking and flexible organization of working hours
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports events and activities
- Access to vocational training
- Social security coverage
Required skills
- English
- Reporting
- French