PhD Thesis: Inverse Molecular Design with Chemo-LLMs (M/F) - Paris (75)
Job description
- Université Paris-Saclay GS Informatique et sciences du numérique
- Paris - 75
- Fixed-term contract (CDD)
- Published on 17 March 2026
Institution: Université Paris-Saclay GS Informatique et sciences du numérique
Doctoral school: Sciences et Technologies de l'Information et de la Communication
Research laboratory: Données Algorithmes pour une ville intelligente et durable
Thesis supervision: Jérémie CABESSA (ORCID 0000-0002-5394-5249)
Thesis start date: 1 October 2026
Application deadline: 1 May 2026, 23:59
Inverse design of molecules and materials is a field of major importance, with implications ranging from drug discovery to the development of new materials. This project proposes an approach to inverse molecular design based on vibrational spectroscopy, building on chemical large language models (chemo-LLMs). Specifically, we aim to develop efficient strategies for predicting molecular structures from spectroscopic data by exploiting the knowledge of pre-trained chemo-LLMs such as ChemBERTa, MolBERT, and ChemGPT. The project focuses on three complementary axes: (i) lightweight training of chemo-LLMs via soft prompting techniques, including prompt tuning and prefix tuning; (ii) incorporating spectral data as continuous vectors to improve the fidelity of spectrum-to-structure predictions; and (iii) inverting structure-to-spectrum chemo-LLMs through a chemical adaptation of methods such as Vec2Text. These approaches would make it possible to efficiently generate valid molecular structures from spectral signatures. Overall, this project contributes to both chemistry and machine learning, proposing new methods for inverse molecular design while addressing the broader challenge of inverse problems in the context of large language models.
In machine learning, inverse problems involve reconstructing the latent states of a physical system from indirect, incomplete, or noisy observations. Numerous strategies have been proposed to address these challenges. Early approaches include regularization-based neural networks [1], which embed prior knowledge directly into the reconstruction process. Invertible neural networks (INNs), hybrid and flow-based models [2-4] enable exact or approximate posterior inference by modeling bijective mappings between data and latent spaces. Variational autoencoders (VAEs) and GAN-based frameworks [5] provide tractable approximations of complex posterior distributions. More recently, diffusion models [6, 7] and large language models (LLMs) [8] have emerged as powerful models, capable of guiding reconstructions toward plausible solutions across a broad spectrum of inverse problems.
In chemistry and materials science, inverse molecular and materials design is a topic of central importance with broad technological and societal implications, from drug discovery to the development of sustainable energy sources [9, 10]. The task involves predicting or generating stable, synthesizable compounds that exhibit desired properties and functionalities. Infrared (IR), Raman, ultraviolet-visible (UV-Vis), nuclear magnetic resonance (NMR), and mass spectrometry (MS) are among the most widely used spectroscopic techniques for characterizing molecular and material systems, each providing distinct and complementary structural or compositional information [11-13]. Owing to this discriminative power, vibrational spectroscopy emerges as a particularly promising foundation for inverse design, as it encodes detailed information about functional groups, bonding environments, and molecular structure.
Early approaches to inverse design relied on exhaustive exploration of the potential energy surface (PES) to identify stable conformations, a process that is computationally prohibitive. Machine learning has drastically accelerated this search [10], enabling molecular generation strategies based on recurrent neural networks (RNNs), variational autoencoders (VAEs), generative adversarial networks (GANs), reinforcement learning (RL), transformers, diffusion models, and hybrid methods [9, 14-21]. In these frameworks, compounds are generated with the aid of generative models, where target properties and functionalities are encoded as inputs to guide the design process. More recently, owing to their ability to represent molecules as textual sequences such as SMILES and SELFIES [22, 23], large language models (LLMs) have emerged as powerful tools for molecular science, supporting tasks ranging from property prediction to structure generation. Numerous domain-specific LLMs - including ChemBERTa, MolBERT, SMILES-BERT, ChemBERTa-2, Taiga, SELFormer, and ChemGPT - have been pre-trained in a self-supervised manner on large molecular datasets and successfully used for important chemical tasks [24-28].
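To make the text-based molecular representations concrete, the sketch below tokenizes a SMILES string into the kind of token sequence a chemo-LLM would consume. The regex is an illustrative simplification in the spirit of the patterns used in the Molecular Transformer literature, not the tokenizer of any specific model; real chemo-LLMs ship their own vocabularies, and SELFIES uses bracketed tokens throughout.

```python
import re

# Illustrative only: a minimal SMILES tokenizer. Bracket atoms and
# two-letter elements must be matched before single-letter atoms.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|@@|[BCNOSPFI]|[bcnosp]|[=#\-\+\(\)\\/%@\.]|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into the token sequence an LLM would consume."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return tokens

# Caffeine, written as SMILES.
print(tokenize_smiles("CN1C=NC2=C1C(=O)N(C)C(=O)C2"))
```

Because the representation is just a token sequence, the same self-supervised pre-training machinery used for natural language transfers directly to molecules.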
This project pursues a spectral-based approach to inverse molecular design using chemo-LLMs. The forward prediction of spectral signatures from molecular structures has been successfully addressed with graph neural networks (GNNs) [29-32]. The more challenging inverse problem of molecular structure elucidation from spectral data has evolved from early neural network approaches [33, 34] to transformer-based architectures [35] and diffusion models [36]. Recent advances include the Molecular Transformer for IR-based structure elucidation [37-41], NMR-driven transformer frameworks [42, 43], multimodal models integrating IR, UV, and NMR spectra [44], and sequence-to-sequence translation of spectra to structures using recurrent or transformer-based architectures [45]. Complementary strategies encompass contrastive learning with joint spectral-molecular embeddings [46], as well as multimodal LLM-based frameworks that prompt models in natural language [47-49]. All these approaches generate molecular structures as SMILES or SELFIES representations. In addition, diffusion transformer models [50] have been trained to generate molecular graphs conditioned on physicochemical properties [21, 51] and mass spectra [52].
This project focuses on a spectral-based approach to inverse molecular design using chemo-LLMs (see Figure 1). Specifically, we aim to leverage the extensive chemical knowledge embedded in pre-trained language models such as ChemBERTa, MolBERT, SMILES-BERT, ChemBERTa-2, Taiga, SELFormer, and ChemGPT. Building on these powerful representations, our goal is to develop lightweight training strategies that enable efficient adaptation to spectral inversion tasks without the high computational cost of full model fine-tuning.
Our first research direction is to evaluate lightweight training strategies based on soft prompting [53], a general paradigm in which learnable, non-readable tensors (soft prompts) are concatenated with input embeddings and optimized for a specific task (see Figure 2). Within this framework, we will focus on two prominent variants: prompt tuning [54] and prefix tuning [53]. These approaches encode task descriptions as trainable embeddings, enabling the model to adapt its behavior without altering its core parameters. By framing spectroscopic inversion tasks as soft prompts, we aim to guide molecular structure generation while substantially reducing training costs compared to full fine-tuning.
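The mechanics of prompt tuning can be sketched in a few lines: the only trainable parameters are a handful of continuous vectors prepended to the (frozen) input embeddings. All sizes below are placeholders, and the frozen transformer itself is elided.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes only (a real chemo-LLM fixes d_model and the vocab).
d_model = 16          # embedding dimension
n_prompt = 8          # number of learnable soft-prompt vectors
seq_len = 12          # length of the tokenized molecular input

# Prompt tuning: the ONLY trainable parameters are these n_prompt vectors;
# the model's embedding table and transformer weights stay frozen.
soft_prompt = rng.normal(size=(n_prompt, d_model))    # trainable
input_embeds = rng.normal(size=(seq_len, d_model))    # frozen embedding lookup

# The soft prompt is concatenated in front of the input embeddings, and the
# extended sequence is fed to the frozen transformer.
extended = np.concatenate([soft_prompt, input_embeds], axis=0)
print(extended.shape)  # (n_prompt + seq_len, d_model)
```

Prefix tuning differs in that the trainable vectors are prepended to the key/value states of every attention layer rather than only to the input embeddings; in both cases the number of optimized parameters is a tiny fraction of the full model.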
As a second research direction, we aim to extend our prompt tuning and prefix tuning strategies beyond task specification, to also incorporate spectral data directly in the form of continuous vector embeddings. This approach departs from purely text-based encodings of spectra, which may introduce discretization artifacts and information loss. By injecting spectral signatures as numerical representations, we expect the transformer to exploit their native structure more effectively, thereby enhancing the fidelity of spectrum-to-structure inversion.
Our third research direction addresses the spectrum-to-structure prediction task by leveraging its easier counterpart, the structure-to-spectrum problem (see Figure 3). To this end, we build on recent advances in NLP that explore the inversion of large language models (LLMs). Specifically, we aim to invert a fine-tuned structure-to-spectrum chemo-LLM, by generalizing the Vec2Text technique [55], originally developed for inverting text embeddings, to the chemical domain. In this framework, spectral signatures are treated as molecular embeddings, and a chemical variant of Vec2Text is developed to iteratively generate molecular structures from spectral inputs. The resulting inverse model accepts spectra as input and outputs valid molecular structures in the form of SELFIES. Compared to transformer- or diffusion-based approaches trained from scratch, this invertibility framework is expected to require substantially fewer computational resources while providing a principled pathway for spectral-based inverse molecular design.
Overall, this project has the potential to make significant contributions to both the chemical and machine learning communities. From a chemistry perspective, it advances the fields of inverse molecular design and molecular structure elucidation by providing novel strategies to generate valid molecules directly from spectral data. From a machine learning standpoint, it addresses the challenging subfield of inverse problems within the context of large language models, exploring lightweight adaptation, continuous embeddings, and model invertibility.
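One plausible way to inject a spectrum as continuous input, sketched below under assumed placeholder dimensions, is a small trainable projector that maps the discretized intensities to a few vectors in the frozen chemo-LLM's embedding space, which are then prepended exactly like textual soft prompts.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: an IR spectrum discretized into n_bins intensities is
# mapped by a small trainable projector to k continuous prompt vectors in
# the frozen chemo-LLM's embedding space (all sizes are placeholders).
n_bins, k, d_model = 400, 4, 16

spectrum = rng.random(n_bins)                        # continuous, not text
W = rng.normal(size=(n_bins, k * d_model)) * 0.01    # trainable projector

spectral_prompts = (spectrum @ W).reshape(k, d_model)

# The k spectral vectors are prepended to the molecular token embeddings,
# carrying the raw numerical spectrum and avoiding the discretization loss
# of writing peak lists out as text.
token_embeds = rng.normal(size=(10, d_model))
model_input = np.concatenate([spectral_prompts, token_embeds], axis=0)
print(model_input.shape)  # (k + 10, d_model)
```

Only `W` would be optimized during adaptation; the transformer weights remain frozen, keeping the training cost in line with the soft-prompting regime.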
The project is organized into four main work packages:
- WP1: Literature Review and Knowledge Acquisition. Conduct an extensive review of spectral-based inverse molecular design, chemo-LLMs, and recent advances in lightweight adaptation strategies such as soft prompting, prompt tuning, and prefix tuning. Acquire detailed knowledge of relevant spectral datasets, molecular representations (SMILES, SELFIES), and state-of-the-art NLP inversion techniques, including Vec2Text.
- WP2: Soft Prompting Strategies for Spectral Inversion. Develop and evaluate lightweight training strategies based on soft prompting [53], focusing on prompt tuning [54] and prefix tuning [53]. Encode task specifications as learnable prompt embeddings and assess their effectiveness in guiding chemo-LLMs for molecular structure generation from spectral inputs while minimizing computational cost.
- WP3: Continuous Spectral Embeddings. Extend the soft prompting approaches to incorporate spectral data directly as continuous vector embeddings, rather than textual encodings. Investigate how the transformer can leverage the native numerical structure of spectra to improve the fidelity of spectrum-to-structure inversion and reduce information loss.
- WP4: Invertible Chemo-LLMs. Adapt the Vec2Text technique [55] to the chemical domain by inverting a fine-tuned structure-to-spectrum chemo-LLM. Treat spectral signatures as molecular embeddings and develop a chemical variant of Vec2Text to iteratively generate valid molecular structures from spectral inputs. Benchmark the resulting inverse model against existing transformer- or diffusion-based approaches, evaluating validity, fidelity, and computational efficiency.
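The iterative correction loop at the heart of Vec2Text-style inversion can be illustrated with a deliberately toy setup: a mock structure-to-spectrum "forward model" and a corrector that, at each round, revises the candidate token sequence to bring its predicted spectrum closer to the target. Everything below (vocabulary, embeddings, greedy single-token corrector) is a stand-in for illustration, not the actual Vec2Text components, which use a learned conditional generator.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-ins: SELFIES-like tokens with random embeddings, and a mock
# forward model mapping a token sequence to a spectrum-like vector.
VOCAB = ["[C]", "[N]", "[O]", "[=C]", "[Ring1]"]
D = 8
EMB = {tok: rng.normal(size=D) for tok in VOCAB}

def forward(tokens):
    """Mock structure-to-spectrum model: mean of token embeddings."""
    return np.mean([EMB[t] for t in tokens], axis=0)

def correct_step(candidate, target):
    """One correction round in the spirit of Vec2Text: revise the candidate
    using the gap between its predicted spectrum and the target spectrum.
    Here: greedily apply the single-token swap that most reduces the gap."""
    best, best_dist = candidate, np.linalg.norm(forward(candidate) - target)
    for i in range(len(candidate)):
        for tok in VOCAB:
            trial = candidate[:i] + [tok] + candidate[i + 1:]
            d = np.linalg.norm(forward(trial) - target)
            if d < best_dist:
                best, best_dist = trial, d
    return best

true_structure = ["[C]", "[=C]", "[N]", "[Ring1]"]
target_spectrum = forward(true_structure)   # the spectrum we must invert

candidate = ["[C]"] * 4                     # initial hypothesis
for _ in range(10):                         # iterative refinement loop
    candidate = correct_step(candidate, target_spectrum)

print(candidate, np.linalg.norm(forward(candidate) - target_spectrum))
```

The structural point carries over to the real method: inversion is cast as repeated conditional generation that shrinks the embedding-space (here, spectrum-space) discrepancy, rather than as a single one-shot decoding.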